Written Japanese Corpus
Outline
This website uses Japanese Wikipedia article data as its corpus.
The corpus contains 13,828,652 sentences, 384,648,362 words, and 1,502,987 unique words.
With this corpus you can research word collocations and also learn written Japanese.
Display Format
There are two display formats:
- Display "whole sentences"
- Display "N-gram"
Available N-gram options are 3g, 5g, 7g, 9g, 11g and 13g.
Whole Sentence | Displays the sentences that contain the search query. |
3g | The search query plus 1 word before and 1 word after it, 3 words in total. |
5g | The search query plus 2 words before and 2 words after it, 5 words in total. |
7g | The search query plus 3 words before and 3 words after it, 7 words in total. |
9g | The search query plus 4 words before and 4 words after it, 9 words in total. |
11g | The search query plus 5 words before and 5 words after it, 11 words in total. |
13g | The search query plus 6 words before and 6 words after it, 13 words in total. |
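The N-gram options above can be sketched in Python as follows. This is only an illustration, not the website's actual implementation: it assumes the sentence has already been split into words (the tokenizer used by the site is not stated here), that the query is a single word, and that a window is simply cut short when the query sits near the start or end of a sentence. The function name `ngram_windows` is hypothetical.

```python
def ngram_windows(tokens, query, n):
    """Return one window of up to n words around each occurrence of query."""
    if n < 3 or n % 2 == 0:
        raise ValueError("n must be an odd number of at least 3")
    k = (n - 1) // 2  # number of context words on each side of the query
    windows = []
    for i, token in enumerate(tokens):
        if token == query:
            # Assumption: truncate the window at sentence boundaries.
            start = max(0, i - k)
            windows.append(tokens[start:i + k + 1])
    return windows


# A pre-tokenized example sentence (tokenization is assumed, not computed).
tokens = ["吾輩", "は", "猫", "で", "ある"]
print(ngram_windows(tokens, "猫", 3))  # 3g: one word on each side
print(ngram_windows(tokens, "猫", 5))  # 5g: two words on each side
```

For the 3g option the window holds the query and one word on each side, so `n = 2k + 1` with `k` context words per side matches every row of the table.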
Searching for Two or More Words
Separate the words with a space.
Example: 猫 と
Search Results
The parts of a sentence that match the query are highlighted in red.
With the whole sentence option, the number of matches is the number of sentences that match the search query.
With an N-gram option, the number of matches is also the number of sentences that match the search query. However, if a sentence contains more than one part that matches the query, each part is displayed separately, so the number of displayed results may exceed the number of matched sentences.
ASCII text and languages other than Japanese are not covered.
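The difference between the number of matches and the number of displayed N-gram results can be illustrated with a short Python sketch. It is written under assumptions and is not the website's code: sentences are taken to be pre-tokenized word lists, the query is a single word, and every occurrence of the query produces one displayed result. The function name `count_matches` is hypothetical.

```python
def count_matches(sentences, query, n):
    """Return (number of matching sentences, list of displayed N-gram results)."""
    k = (n - 1) // 2  # context words on each side of the query
    matched_sentences = 0
    displayed = []
    for tokens in sentences:
        positions = [i for i, token in enumerate(tokens) if token == query]
        if positions:
            matched_sentences += 1  # a sentence is counted once, however many hits
        for i in positions:
            # Assumption: each occurrence yields its own displayed result.
            displayed.append(tokens[max(0, i - k):i + k + 1])
    return matched_sentences, displayed


sentences = [
    ["猫", "と", "犬", "と", "鳥"],  # contains the query twice
    ["猫", "が", "鳴く"],            # does not contain the query
]
matched, results = count_matches(sentences, "と", 3)
print(matched)       # number of matching sentences
print(len(results))  # number of displayed results
```

Here one sentence matches, but it contains the query twice, so two 3g results are displayed for a match count of one.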
About Data
A summary of how the corpus data was prepared, from obtaining the Japanese Wikipedia article data to processing it, is available here. → Link: Wikipediaの記事データからコーパスを作成する方法 (How to create a corpus from Wikipedia article data)
As of June 1, 2015, the corpus contains 13,828,652 sentences, 384,648,362 words, and 1,502,987 unique words (excluding ASCII characters).
Notice
This website uses Wikipedia's data and is licensed under the Creative Commons Attribution-Share-Alike License 3.0 or later.
NEITHER THIS WEBSITE NOR I SHALL BE LIABLE TO YOU OR TO ANY OTHER PARTY FOR ANY DAMAGES, COSTS, OR LOSSES. This website and I are not affiliated with Wikipedia.