EvoMSA first version

EvoMSA is a Sentiment Analysis System based on B4MSA and EvoDAG. EvoMSA is a stacked generalization algorithm specialized in text classification problems. It works by combining the output of different text models to produce the final prediction.

EvoMSA is a two-stage procedure: the first stage transforms the text into a vector space whose dimensions are related to the number of classes, and the second stage trains a supervised learning algorithm on that representation.

The first stage is a composition of two functions, \(g \circ m\), where \(m\) is a text model that transforms a text into a vector (i.e., \(m: \text{text} \rightarrow \mathbb R^d\)) and \(g\) is a classifier or regressor (i.e., \(g: \mathbb R^d \rightarrow \mathbb R^c\)). Here, \(d\) depends on \(m\), and \(c\) is the number of classes or labels.
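
The composition can be illustrated with a toy example in plain Python. This is only a sketch under stated assumptions: the text model built by make_m is a bag of words over a tiny vocabulary, and the function returned by fit_g scores a vector against one centroid per class. Neither is EvoMSA's implementation, and the names build_vocabulary, make_m, and fit_g are hypothetical.

```python
def build_vocabulary(texts):
    """Map every token seen in the training texts to a column index."""
    tokens = sorted({tok for text in texts for tok in text.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}


def make_m(vocabulary):
    """Toy text model m: text -> R^d, where d is the vocabulary size."""
    def m(text):
        vec = [0.0] * len(vocabulary)
        for tok in text.lower().split():
            if tok in vocabulary:
                vec[vocabulary[tok]] += 1.0
        return vec
    return m


def fit_g(vectors, labels):
    """Toy g: R^d -> R^c, the dot product with one centroid per class."""
    classes = sorted(set(labels))
    centroids = []
    for k in classes:
        rows = [v for v, label in zip(vectors, labels) if label == k]
        centroids.append([sum(col) / len(rows) for col in zip(*rows)])

    def g(vec):
        return [sum(c * x for c, x in zip(centroid, vec))
                for centroid in centroids]
    return g, classes


texts = ["good great movie", "great good day",
         "bad awful movie", "awful bad day"]
labels = ["P", "P", "N", "N"]

m = make_m(build_vocabulary(texts))
g, classes = fit_g([m(t) for t in texts], labels)

# (g o m)(text): one score per class, in the order given by `classes`.
scores = g(m("good day"))
```

In this sketch \(d = 6\) (the vocabulary size) and \(c = 2\), so every text ends up as a two-dimensional vector regardless of how large the text model's own space is.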

EvoMSA contains different text models (i.e., \(m\)), which can be selected using flags in the class constructor. The text models implemented are:

In these models, lang specifies the language and can be ar, en, or es, which correspond to Arabic, English, and Spanish, respectively. On the other hand, \(g\) is a classifier or regressor, and by default, it uses sklearn.svm.LinearSVC.

The second stage is the stacking method, which is a classifier or regressor. By default, EvoMSA uses EvoDAG (i.e., EvoDAG.model.EvoDAGE); however, this method can be changed with the parameter stacked_method, e.g., EvoMSA.base.EvoMSA(stacked_method="sklearn.naive_bayes.GaussianNB").
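
The idea of stacking can be sketched in plain Python as follows. Everything here is an assumption made for illustration: the two first-stage models are hypothetical keyword counters, and a nearest-centroid rule stands in for the stacked method (EvoDAG or any other classifier); it is not how EvoMSA implements either stage.

```python
def keyword_scorer(positive, negative):
    """Hypothetical first-stage model: returns [negative_score, positive_score]."""
    def score(text):
        tokens = text.lower().split()
        return [float(sum(t in negative for t in tokens)),
                float(sum(t in positive for t in tokens))]
    return score


# Two toy first-stage models that look at different vocabularies.
first_stage = [
    keyword_scorer({"good", "great"}, {"bad", "awful"}),
    keyword_scorer({"love", "happy"}, {"hate", "sad"}),
]


def stack_features(text):
    """Concatenate the outputs of every first-stage model into one vector."""
    features = []
    for model in first_stage:
        features.extend(model(text))
    return features


def fit_centroids(vectors, labels):
    """Toy stacked method: store the mean vector of each class."""
    centroids = {}
    for label in sorted(set(labels)):
        rows = [v for v, l in zip(vectors, labels) if l == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict(centroids, vector):
    """Return the label whose centroid is closest (squared Euclidean)."""
    def dist(label):
        return sum((a - b) ** 2 for a, b in zip(centroids[label], vector))
    return min(centroids, key=dist)


texts = ["good happy day", "great love it", "bad sad day", "awful hate it"]
labels = ["P", "P", "N", "N"]

centroids = fit_centroids([stack_features(t) for t in texts], labels)
prediction = predict(centroids, stack_features("happy and great"))
```

The stacked method only sees the concatenated first-stage outputs (four numbers per text in this sketch), never the raw text.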

EvoMSA is described in EvoMSA: A Multilingual Evolutionary Approach for Sentiment Analysis, Mario Graff, Sabino Miranda-Jimenez, Eric Sadit Tellez, Daniela Moctezuma. Computational Intelligence Magazine, vol 15 no. 1, pp. 76-88, Feb. 2020. In this document, we follow the notation used in the CIM paper as closely as possible; we believe this makes EvoMSA’s goals easier to grasp.

Quickstart Guide

We have decided to make a live quickstart guide. It covers the installation and the use of EvoMSA with different text models, and it ends by explaining how the text models can be used on their own. The notebook can be found in the docs directory on GitHub.

Citing

If you find EvoMSA useful for any academic/scientific purpose, we would appreciate citations to the following reference:

@article{DBLP:journals/corr/abs-1812-02307,
author = {Mario Graff and Sabino Miranda{-}Jim{\'{e}}nez
               and Eric Sadit Tellez and Daniela Moctezuma},
title     = {EvoMSA: {A} Multilingual Evolutionary Approach for Sentiment Analysis},
journal   = {Computational Intelligence Magazine},
volume    = {15},
issue     = {1},
year      = {2020},
pages     = {76 -- 88},
url       = {https://ieeexplore.ieee.org/document/8956106},
month     = {Feb.}
}

Installing EvoMSA

EvoMSA can be easily installed using anaconda:

conda install -c conda-forge EvoMSA

Alternatively, it can be installed using pip; it depends on numpy, scipy, scikit-learn and b4msa.

pip install cython
pip install sparsearray
pip install evodag
pip install EvoMSA

Usage

EvoMSA can be used with the following commands.

Read the dataset

>>> from EvoMSA import base
>>> from microtc.utils import tweet_iterator
>>> import os
>>> tweets = os.path.join(os.path.dirname(base.__file__), 'tests', 'tweets.json')
>>> D = list(tweet_iterator(tweets))
>>> X = [x['text'] for x in D]
>>> y = [x['klass'] for x in D]
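
To try EvoMSA on your own data, a file with the same layout is needed. The sketch below assumes that layout is one JSON object per line with the keys text and klass used above; the file name my_tweets.json and the two examples are hypothetical, and the file is read back with the standard library rather than with tweet_iterator so that the sketch has no dependencies.

```python
import json
import os
import tempfile

# Hypothetical two-example dataset using the keys read above ('text', 'klass').
examples = [
    {"text": "me gusta", "klass": "P"},
    {"text": "no me gusta", "klass": "N"},
]

# Write one JSON object per line (assumed layout of the dataset file).
path = os.path.join(tempfile.mkdtemp(), "my_tweets.json")
with open(path, "w", encoding="utf-8") as fh:
    for example in examples:
        fh.write(json.dumps(example, ensure_ascii=False) + "\n")

# Read the file back and split it into texts and labels.
with open(path, encoding="utf-8") as fh:
    D = [json.loads(line) for line in fh if line.strip()]
X = [x["text"] for x in D]
y = [x["klass"] for x in D]
```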

Once the dataset is loaded, it is time to create an EvoMSA model. Let us create an EvoMSA model enhanced with the Emoji space.

>>> from EvoMSA.base import EvoMSA
>>> evo = EvoMSA(Emo=True, lang='es').fit(X, y)

Predict a sentence in Spanish

>>> evo.predict(['EvoMSA esta funcionando'])

By default, EvoMSA uses EvoDAG.model.EvoDAGE as the stacked classifier; however, this is a parameter that can be modified. Let us, for example, use sklearn.naive_bayes.GaussianNB in the previous example.

>>> evo = EvoMSA(Emo=True, lang='es',
...              stacked_method='sklearn.naive_bayes.GaussianNB').fit(X, y)
>>> evo.predict(['EvoMSA esta funcionando'])
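
Note that stacked_method is given as a dotted string rather than as a class. A common way to turn such a string into a class is importlib; the sketch below shows that general technique with a hypothetical helper named resolve and a standard-library class. It is an assumption for illustration and not necessarily how EvoMSA resolves the parameter internally.

```python
import importlib


def resolve(dotted_name):
    """Split 'package.module.Name' and return the object called Name."""
    module_name, _, attribute = dotted_name.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, attribute)


# Standard-library class used only so the sketch runs without extra packages.
cls = resolve("collections.OrderedDict")
instance = cls(a=1)
```

With scikit-learn installed, the same helper called with "sklearn.naive_bayes.GaussianNB" would return the GaussianNB class.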

Text Models

Besides the default text model (i.e., b4msa.textmodel.TextModel), EvoMSA has four text models for the Arabic, English, and Spanish languages that can be selected with a flag in the constructor (EvoMSA’s CIM paper presents only the first three). These are:

Nonetheless, more text models can be included in EvoMSA. EvoMSA’s core idea is to facilitate the inclusion of diverse text models. We have been using EvoMSA (as the INGEOTEC team) in different competitions run at the Workshop on Semantic Evaluation, as well as in other sentiment-analysis tasks and traditional text classification problems.

During this time, we have created different text models in different languages, some of them using the datasets provided by the competition’s organizers and others inspired by our previous work. We have decided to make these text models public, organized by language.

EvoMSA’s classes