Emoji space

This text model is inspired by DeepMoji. The idea is to create a function \(m_{\text{emo}}: \text{text} \rightarrow \mathbb{R}^{64}\) that predicts which emoji is the most probable given a text. To do so, we propose a composition of two functions, i.e., \(g \circ m_b\), where \(m_b\) is created using the procedure described in Arabic, English, and Spanish for Arabic, English, and Spanish, respectively. The second part, i.e., \(g\), is a linear SVM trained with 3.2 million examples of the 64 most frequent emojis per language. As a result, the set of emojis differs for each language; the emojis used can be seen in Figure 2 of this manuscript.
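The composition \(g \circ m_b\) can be illustrated with a toy sketch. It is not EvoMSA's implementation: the vocabulary, the three emojis (instead of 64), and the weights below are made up for illustration, and it assumes that each output component is the decision value of a one-vs-rest linear classifier for one emoji.

```python
from collections import Counter

# Hypothetical vocabulary and emoji labels (the real model uses 64 emojis).
VOCAB = ["buenos", "dias", "encanta", "cancion"]
LABELS = ["😄", "💓", "♫"]


def m_b(text):
    """Toy stand-in for m_b: map a text to a bag-of-words vector over VOCAB."""
    counts = Counter(text.lower().split())
    return [float(counts[w]) for w in VOCAB]


# One weight vector and one bias per emoji, as in a one-vs-rest linear model.
# These numbers are invented for the example; they are not learned.
W = [
    [1.0, 1.0, -0.5, -0.5],
    [-0.5, -0.5, 1.0, 0.5],
    [-0.5, -0.5, 0.5, 1.0],
]
B = [0.0, -0.1, -0.2]


def g(x):
    """Toy stand-in for g: one linear decision value (w . x + b) per emoji."""
    return [sum(wi * xi for wi, xi in zip(w, x)) + b for w, b in zip(W, B)]


def m_emo(text):
    """The composition g(m_b(text)): one score per emoji."""
    return g(m_b(text))


scores = m_emo("buenos dias")
# The emoji with the largest decision value is the predicted one.
best = LABELS[max(range(len(scores)), key=scores.__getitem__)]
```

In this sketch the output has one component per label, mirroring how the real model produces one component per emoji.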

The Emoji Space is created for Arabic, English, and Spanish. A model is selected with the parameters Emo and lang, e.g., EvoMSA.base.EvoMSA(Emo=True, lang="en"), where lang specifies the language and can be ar, en, or es.

For example, let us read a dataset to train EvoMSA.

>>> from EvoMSA import base
>>> from microtc.utils import tweet_iterator
>>> import os
>>> tweets = os.path.join(os.path.dirname(base.__file__), 'tests', 'tweets.json')
>>> D = list(tweet_iterator(tweets))
>>> X = [x['text'] for x in D]
>>> y = [x['klass'] for x in D]

Once the dataset is loaded, EvoMSA using the Emoji Space in Spanish is trained as follows:

>>> from EvoMSA.base import EvoMSA
>>> evo = EvoMSA(Emo=True, lang='es').fit(X, y)
>>> evo.predict(['buenos dias'])

As mentioned previously, the model represents a given text in a 64-dimensional space. One can see this representation as follows:

>>> emo = evo.textModels[1]
>>> emo['buenos dias']

It can be observed that the output is a vector \(\in \mathbb{R}^{64}\) where each component corresponds to an emoji; the emojis are stored in the following list:

>>> emo._labels

The three best-ranked emojis for "good morning" (buenos dias) and "I love that song" (me encanta esa canción) are:

>>> import numpy as np
>>> [emo._labels[x] for x in np.argsort(emo['buenos dias'])[::-1][:3]]
['😄', '😴', '☺']
>>> [emo._labels[x] for x in np.argsort(emo['me encanta esa canción'])[::-1][:3]]
['💓', '♫', '💞']