<?xml version="1.0" encoding="UTF-8"?><collection xmlns="http://www.loc.gov/MARC21/slim" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.loc.gov/MARC21/slim http://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd">
<record>
<leader>00000cab a2200000 4500</leader>
<controlfield tag="001">MAP20210011719</controlfield>
<controlfield tag="003">MAP</controlfield>
<controlfield tag="005">20220911190026.0</controlfield>
<controlfield tag="008">210413e20210215esp|||p |0|||b|spa d</controlfield>
<datafield tag="040" ind1=" " ind2=" ">
<subfield code="a">MAP</subfield>
<subfield code="b">spa</subfield>
<subfield code="d">MAP</subfield>
</datafield>
<datafield tag="084" ind1=" " ind2=" ">
<subfield code="a">922.134</subfield>
</datafield>
<datafield tag="100" ind1="1" ind2=" ">
<subfield code="0">MAPA20210005589</subfield>
<subfield code="a">Santos, Flavio Arthur O.</subfield>
</datafield>
<datafield tag="245" ind1="1" ind2="0">
<subfield code="a">Morphological skip-gram</subfield>
<subfield code="b">: replacing FastText characters n-gram with morphological knowledge</subfield>
<subfield code="c">Flavio Arthur O. Santos</subfield>
</datafield>
<datafield tag="520" ind1=" " ind2=" ">
<subfield code="a">Natural language processing systems have attracted much interest from industry. The field includes applications such as machine translation, sentiment analysis, named entity recognition, and question answering, among others. Word embeddings (i.e., continuous word representations) are an essential component of those applications, generally used as the word representation fed to machine learning models. Some popular methods to train word embeddings are GloVe and Word2Vec. They achieve good word representations despite two limitations: both ignore the morphological information of words and use only one representation vector for each word. As a result, the embeddings do not properly account for different word contexts and are unaware of each word's inner structure. To mitigate this problem, another word embedding method, FastText, represents each word as a bag of character n-grams. Hence, a continuous vector describes each n-gram, and the final word representation is the sum of its character n-gram vectors. Nevertheless, using all character n-grams of a word is a poor approach, since some n-grams have no semantic relation to their words and increase the amount of potentially useless information. This approach also increases training time. In this work, we propose a new method for training word embeddings whose goal is to replace the FastText bag of character n-grams with a bag of word morphemes obtained through morphological analysis of the word. Thus, words with similar contexts and morphemes are represented by vectors close to each other. To evaluate our new approach, we performed intrinsic evaluations considering 15 different tasks, and the results show competitive performance compared to FastText. Moreover, the proposed model is 40% faster than FastText in the training phase. We also outperform the baseline approaches in extrinsic evaluations on hate speech detection and NER tasks using different scenarios.</subfield>
</datafield>
<datafield tag="650" ind1=" " ind2="4">
<subfield code="0">MAPA20080611200</subfield>
<subfield code="a">Inteligencia artificial</subfield>
</datafield>
<datafield tag="650" ind1=" " ind2="4">
<subfield code="0">MAPA20080617479</subfield>
<subfield code="a">Lenguajes de programación</subfield>
</datafield>
<datafield tag="650" ind1=" " ind2="4">
<subfield code="0">MAPA20080561871</subfield>
<subfield code="a">Conocimiento</subfield>
</datafield>
<datafield tag="773" ind1="0" ind2=" ">
<subfield code="w">MAP20200034445</subfield>
<subfield code="t">Revista Iberoamericana de Inteligencia Artificial</subfield>
<subfield code="d">IBERAMIA, Sociedad Iberoamericana de Inteligencia Artificial , 2018-</subfield>
<subfield code="x">1988-3064</subfield>
<subfield code="g">15/02/2021 Volumen 24 Número 67 - febrero 2021 , p. 1-17</subfield>
</datafield>
<datafield tag="856" ind1=" " ind2=" ">
<subfield code="q">application/pdf</subfield>
<subfield code="w">1110628</subfield>
<subfield code="y">Recurso electrónico / Electronic resource</subfield>
</datafield>
</record>
</collection>