Corpora

Obviously, you need Japanese corpora in order to do research in Japanese computational linguistics. Here are the corpora I’ve discovered so far:

  • Japanese FrameNet – The Japanese version of the FrameNet corpus. The corpus is being developed by Kyoko Ohara and her team at Keio University (慶應義塾大学). This corpus is the main focus of my research.
  • Balanced Corpus of Contemporary Written Japanese (BCCWJ) – A balanced corpus of one hundred million words of contemporary written Japanese. It is a nationwide project partially funded by MEXT (文部科学省 – the Japanese ministry of education). The Japanese FrameNet project is also part of this project. Currently, ten million words from the BCCWJ are publicly available on the web for full-text retrieval.
  • The Tanaka Corpus – A parallel English/Japanese corpus compiled by Professor Yasuhito Tanaka at Hyogo University and his students. The corpus has been edited and revised by Jim Breen and Paul Blay. Paul Blay now maintains the corpus. The corpus is in the public domain and therefore freely available.
  • Japanese-English News Article Alignment Data – A parallel English/Japanese corpus compiled by Masao Utiyama and Hitoshi Isahara at the National Institute of Information and Communications Technology in Kyoto. The corpus has a license that makes it free for research and educational uses. A copy of the license must be signed and mailed in before access to the corpus is granted.
  • The Taiyo Corpus – A text corpus in XML format of the periodical Taiyo, which was read by a wide range of readers from the end of the 19th century to the beginning of the 20th century.
  • The Kyoto University Text Corpus – A text corpus compiled from the Mainichi Shinbun newspaper. It contains all articles (~20,000 sentences) from January 1st, 1995 through January 17th, 1995, and all editorial articles (~20,000 sentences) from January through December of the same year. The corpus contains annotations for morphological and syntactic structure. (Requires a copy of the 1995 Mainichi Shinbun data.)
  • NAIST Text Corpus – A text corpus that uses the same section of the Mainichi Shinbun newspaper used in the Kyoto University Text Corpus. The NAIST corpus contains annotations for predicate-argument relations (surface cases: nominative, accusative, and dative), event nouns and their relations (surface cases: nominative, accusative, and dative), and coreference information. (Requires a copy of the 1995 Mainichi Shinbun data.)
  • Lexeed and The Hinoki Treebank – Lexeed is a thesaurus and The Hinoki Treebank is a text corpus that is annotated with syntactic information as well as word sense tags. The syntactic analysis used is a type of head-driven phrase structure grammar (HPSG).
  • GoiTaikei – A Japanese lexicon containing 300,000 Japanese words marked with part-of-speech and semantic classes, originally developed for the ALT-J/E Japanese-to-English machine translation system by NTT. It also has a valency dictionary containing detailed information on predicate structure usage for 6,000 Japanese predicates, including the number and type of arguments (valency) and selectional restrictions on the arguments. A total of 14,000 Japanese patterns, including ordinary sentence structures and idiomatic usage, are paired with the equivalent English patterns. An index of Japanese nouns and an English index enable users to search for pairs of sentence patterns, using Japanese or English vocabulary.
  • PolyU Business Corpora – The PolyU Business Corpora are made up of three comparable corpora of Chinese, English and Japanese business texts, the majority of which originate from the business and finance sections of newspapers written in those languages, covering news and reports from auditing and accounting to insurance and investments. The Japanese version contains around 1.32 million words.

LDC Corpora:

  • Japanese Web N-gram Version 1 – This is the Google N-Gram corpus for Japanese. It contains all unigrams through seven-grams that appear at least 20 times in the processed sentences.
  • Japanese Business News Text – A corpus of Japanese business news of at least 30 million words.
  • Japanese Business News Text Supplement – A supplement to the above corpus that contains text from after the first corpus was created.
  • ECI Multilingual Text – A multilingual corpus composed mostly of European languages. One of its subcorpora is a small corpus of about 200,000 words of Japanese text.
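
To make the frequency cutoff described for the Japanese Web N-gram corpus concrete, here is a minimal sketch of counting unigrams through seven-grams and keeping only those that appear at least 20 times. It assumes the sentences have already been split into tokens (for example by a morphological analyzer); the function name count_ngrams and its parameters are hypothetical and chosen for illustration only. This is not the pipeline that was actually used to build the corpus.

```python
from collections import Counter


def count_ngrams(sentences, max_n=7, min_count=20):
    """Count all 1..max_n-grams and keep those seen at least min_count times.

    `sentences` is an iterable of token lists. Tokenization is assumed to
    have been done beforehand; this sketch does not segment Japanese text.
    """
    counts = Counter()
    for tokens in sentences:
        for n in range(1, max_n + 1):
            # Slide a window of width n across the sentence.
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    # Apply the frequency threshold.
    return {gram: c for gram, c in counts.items() if c >= min_count}


# Toy data with a lowered threshold so the effect is visible.
toy_sentences = [["私", "は", "学生", "です"]] * 3 + [["私", "は", "先生", "です"]]
kept = count_ngrams(toy_sentences, max_n=2, min_count=3)
```

With the toy data, n-grams containing 先生 occur only once and are dropped, while the bigram (私, は) occurs in all four sentences and is kept. On a web-scale collection the counts would not fit in a single in-memory Counter, so a real build would need a distributed or on-disk approach.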
