Centro de Documentacion de Fundación MAPFRE - Morphological skip-gram : replacing fast text characters n-gram with morphological knowledge

MAP20210011719

Morphological skip-gram : replacing fast text characters n-gram with morphological knowledge / Flavio Arthur O. Santos

Summary: Natural language processing systems have attracted much interest from industry. This field includes applications such as machine translation, sentiment analysis, named entity recognition, question answering, and others. Word embeddings (i.e., continuous word representations) are an essential module for those applications, generally serving as the word representation fed to machine learning models. Two popular methods for training word embeddings are GloVe and Word2Vec. They achieve good word representations despite two limitations: both ignore the morphological information of words and assign only one representation vector to each word. As a result, the word embeddings do not properly account for different word contexts and are unaware of each word's inner structure. To mitigate this problem, another word embedding method, FastText, represents each word as a bag of character n-grams. A continuous vector describes each n-gram, and the final word representation is the sum of its character n-gram vectors. Nevertheless, using all character n-grams of a word is a poor approach, since some n-grams have no semantic relation to their words and increase the amount of potentially useless information. This approach also increases training time. In this work, we propose a new method for training word embeddings whose goal is to replace the FastText bag of character n-grams with a bag of word morphemes obtained through morphological analysis of the word. Thus, words with similar contexts and morphemes are represented by vectors close to each other. To evaluate our new approach, we performed intrinsic evaluations on 15 different tasks, and the results show competitive performance compared to FastText. Moreover, the proposed model is 40% faster than FastText in the training phase. We also outperform the baseline approaches in extrinsic evaluations on hate speech detection and NER tasks under different scenarios.
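The contrast the summary draws between a bag of character n-grams and a bag of morphemes can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the n-gram range of 3 to 6, the random initial vectors, and the hand-written segmentation in TOY_MORPHEMES are all assumptions made for illustration, and a real system would obtain morphemes from a morphological analyzer and learn the vectors during training.

```python
import random


def char_ngrams(word, n_min=3, n_max=6):
    """Return character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(wrapped) - n + 1)
    ]


def bag_vector(units, table, dim, rng):
    """Sum the vectors of the given subword units.

    Units not yet in the table get a random vector (a stand-in for
    trained parameters, assumed here for illustration only).
    """
    total = [0.0] * dim
    for unit in units:
        if unit not in table:
            table[unit] = [rng.gauss(0.0, 0.1) for _ in range(dim)]
        total = [t + v for t, v in zip(total, table[unit])]
    return total


# Hypothetical hand-written segmentation; a real system would call a
# morphological analyzer instead of using a lookup table.
TOY_MORPHEMES = {"unhappiness": ["un", "happi", "ness"]}

rng = random.Random(0)
dim = 8

# Bag of character n-grams: many units per word.
ngrams = char_ngrams("unhappiness")
ngram_vec = bag_vector(ngrams, {}, dim, rng)

# Bag of morphemes: far fewer units per word.
morphemes = TOY_MORPHEMES["unhappiness"]
morph_vec = bag_vector(morphemes, {}, dim, rng)

print(len(ngrams), "character n-grams vs", len(morphemes), "morphemes")
```

In this sketch both representations are sums over subword units; the morpheme bag simply has fewer units to look up and update per word, which is consistent with the reduction in training time the summary reports.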

In: Revista Iberoamericana de Inteligencia Artificial. - IBERAMIA, Sociedad Iberoamericana de Inteligencia Artificial, 2018- = ISSN 1988-3064. - 15/02/2021 Volume 24, Number 67 - February 2021, p. 1-17

1. Artificial intelligence. 2. Programming languages. 3. Knowledge. I. Title.