What is it about?
In this paper, we present NWJC2Vec, a word embedding dataset constructed from the NINJAL Web Japanese Corpus (NWJC), a web-crawled text corpus containing 25.8 billion tokens. We build two versions of the dataset: one based on surface forms, and the other based on the complete morpheme information provided by UniDic, a lexicon for the Japanese morphological analyser MeCab. We evaluate the dataset by comparing it with the Word List by Semantic Principles (Bunrui Goihyo).
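As a rough illustration of how a word embedding dataset of this kind is typically queried, the sketch below ranks words by cosine similarity to a query word. The three-dimensional vectors, the dictionary name surface_vectors, and the helper functions are all invented for this example; they are not values or APIs from NWJC2Vec, and the paper's own evaluation procedure may differ.

```python
import math

# Toy vectors keyed by surface form. These numbers are made up for
# illustration and are NOT taken from NWJC2Vec.
surface_vectors = {
    "犬": [0.9, 0.1, 0.0],
    "猫": [0.8, 0.2, 0.1],
    "走る": [0.0, 0.9, 0.4],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors):
    """Return (other_word, similarity) pairs, most similar first."""
    query = vectors[word]
    scores = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for other, score in most_similar("犬", surface_vectors):
        print(other, round(score, 3))
```

A morpheme-based version of the dataset could be queried the same way, with the dictionary keyed by whatever morpheme representation the dataset uses instead of the surface string.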
Read the Original
This page is a summary of: NWJC2Vec, Terminology. International Journal of Theoretical and Applied Issues in Specialized Communication, May 2018, John Benjamins, DOI: 10.1075/term.00011.asa.