What is it about?

In this paper, we present NWJC2Vec, a word embedding dataset constructed from the ‘NINJAL Web Japanese Corpus (NWJC)’. NWJC is a Web-crawled text corpus containing 25.8 billion tokens. We construct two versions of the dataset: one based on surface forms, and the other based on the complete morpheme information provided by UniDic, a lexicon for the Japanese morphological analyser MeCab. We evaluate the dataset by comparing it with the ‘Word List by Semantic Principles (Bunrui Goihyo)’.
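
To make the idea of checking embeddings against a thesaurus concrete, here is a minimal sketch in Python. It is not the procedure used in the paper, which this summary does not describe. The vectors, the category labels, and the function names (`cosine`, `most_similar`, `category_agreement`) are all invented for illustration, and the keys are surface forms, although they could equally be morpheme-level entries.

```python
import math

# Toy 3-dimensional vectors keyed by surface form (invented values,
# not taken from NWJC2Vec).
toy_vectors = {
    "犬": [0.9, 0.1, 0.0],
    "猫": [0.8, 0.2, 0.1],
    "走る": [0.0, 0.9, 0.4],
}

# Toy thesaurus categories (hypothetical labels, not Bunrui Goihyo codes).
toy_categories = {"犬": "animal", "猫": "animal", "走る": "motion"}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors):
    """Return (neighbour, similarity) for the closest other word."""
    query = vectors[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return max(scored, key=lambda pair: pair[1])


def category_agreement(vectors, categories):
    """Fraction of words whose nearest neighbour shares their category."""
    hits = 0
    for word in vectors:
        neighbour, _ = most_similar(word, vectors)
        if categories[word] == categories[neighbour]:
            hits += 1
    return hits / len(vectors)


if __name__ == "__main__":
    print(most_similar("犬", toy_vectors))
    print(category_agreement(toy_vectors, toy_categories))
```

A higher agreement score would suggest that words close together in the embedding space also tend to be grouped together in the thesaurus. This is only one of several ways such a comparison could be set up.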

Read the Original

This page is a summary of: NWJC2Vec, Terminology International Journal of Theoretical and Applied Issues in Specialized Communication, May 2018, John Benjamins,
DOI: 10.1075/term.00011.asa.
