EvoMSA 2.0

EvoMSA is a stack generalization algorithm specialized in text classification problems. A text classifier \(c\) can be seen as a composition of two functions, i.e., \(c \equiv g \circ m\), where \(m\) transforms the text into a vector space, i.e., \(m: \text{text} \rightarrow \mathbb R^d\), and \(g\) is the classifier (\(g: \mathbb R^d \rightarrow \mathbb N\)) or regressor (\(g: \mathbb R^d \rightarrow \mathbb R\)). Stack generalization is a technique that combines classifiers (regressors) to produce another classifier (regressor), which is responsible for making the final prediction.
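The composition \(c \equiv g \circ m\) and the stacking idea can be illustrated with a small, dependency-free sketch. This is not EvoMSA's implementation: the helper names (word_bow, fit_centroids, compose), the nearest-centroid classifier, and the toy data are all assumptions made for illustration. Each base model maps a text to a vector with its own \(m\) and returns one score per class; the scores of all base models are concatenated, and a second classifier is trained on them to make the final prediction. In practice the stacked features are usually obtained with cross-validation; the sketch omits that step for brevity.

```python
def word_bow(vocabulary):
    """Build m: text -> R^d using word counts over a fixed vocabulary."""
    index = {word: i for i, word in enumerate(vocabulary)}

    def m(text):
        vector = [0.0] * len(index)
        for token in text.lower().split():
            if token in index:
                vector[index[token]] += 1.0
        return vector

    return m


def fit_centroids(X, y):
    """Fit a nearest-centroid model (a stand-in for g).

    Returns the sorted labels and a function giving one score per
    class: the negative squared distance to that class centroid."""
    labels = sorted(set(y))
    centroids = []
    for label in labels:
        rows = [x for x, target in zip(X, y) if target == label]
        centroids.append([sum(col) / len(rows) for col in zip(*rows)])

    def scores(x):
        return [-sum((a - b) ** 2 for a, b in zip(x, c)) for c in centroids]

    return labels, scores


def compose(m, labels, scores):
    """Return c = g o m, where g picks the class with the highest score."""
    def c(text):
        s = scores(m(text))
        return labels[s.index(max(s))]

    return c


# Toy training set (hypothetical): 1 = positive, 0 = negative.
texts = ["good great movie", "great fun", "bad awful movie", "awful boring"]
y = [1, 1, 0, 0]

# Two different text representations m, each with its own base model.
representations = [word_bow(["good", "great", "fun"]),
                   word_bow(["bad", "awful", "boring"])]
base = []
for m in representations:
    labels, scores = fit_centroids([m(t) for t in texts], y)
    base.append((m, scores))


def stacked_features(text):
    """Concatenate the per-class scores of every base model."""
    features = []
    for m, scores in base:
        features.extend(scores(m(text)))
    return features


# The stacked classifier is itself a composition: g is trained on the
# base models' scores instead of on the raw bag-of-words vectors.
meta_labels, meta_scores = fit_centroids(
    [stacked_features(t) for t in texts], y)
stack = compose(stacked_features, meta_labels, meta_scores)

print(stack("great movie"), stack("boring and awful"))
```

Here the first base model only sees positive words and the second only negative ones, so neither is a good classifier alone; the stacked classifier combines their scores into a single decision.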

EvoMSA 2.0 removes two text representations (i.e., functions \(m\)) from EvoMSA: the sentiment lexicon-based model and the aggressiveness model. They were removed because they require the most work to implement in another language and contribute the least to the performance of the algorithm. On the other hand, EvoMSA 2.0 increases the number of human-annotated models and emoji models, and introduces a new kind of model, namely keyword models.

EvoMSA 2.0 supports more languages than the previous version. It currently supports Arabic (ar), Catalan (ca), German (de), English (en), Spanish (es), French (fr), Hindi (hi), Indonesian (in), Italian (it), Japanese (ja), Korean (ko), Dutch (nl), Polish (pl), Portuguese (pt), Russian (ru), Tagalog (tl), Turkish (tr), and Chinese (zh). It also provides pre-trained models, which include the bag-of-words text representations, the emoji models, and the keyword models. These models were trained on Twitter data.

The other enhancement is in the implementation. There are three main classes: BoW, DenseBoW, and StackGeneralization. BoW and DenseBoW are text classifiers, and BoW is the parent class of DenseBoW. The stack generalization technique is implemented in StackGeneralization.


If you find EvoMSA useful for any academic/scientific purpose, we would appreciate citations to the following reference:

author = {Mario Graff and Sabino Miranda{-}Jim{\'{e}}nez
               and Eric Sadit Tellez and Daniela Moctezuma},
title     = {EvoMSA: {A} Multilingual Evolutionary Approach for Sentiment Analysis},
journal   = {Computational Intelligence Magazine},
volume    = {15},
issue     = {1},
year      = {2020},
pages     = {76 -- 88},
url       = {https://ieeexplore.ieee.org/document/8956106},
month     = {Feb.}

Installing EvoMSA

EvoMSA can be easily installed using Anaconda:

conda install -c conda-forge EvoMSA

Alternatively, EvoMSA can be installed using pip. It depends on numpy, scipy, scikit-learn, and b4msa:

pip install cython
pip install sparsearray
pip install evodag
pip install EvoMSA
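After installing, a quick way to confirm that the package and the dependencies listed above are visible to the interpreter is to look them up by import name. The following is only a convenience sketch, not part of EvoMSA; the helper missing_modules is hypothetical, and the import names are assumed to be numpy, scipy, sklearn (for scikit-learn), b4msa, and EvoMSA.

```python
from importlib.util import find_spec

# Import names of the packages; scikit-learn is imported as "sklearn".
REQUIRED = ["numpy", "scipy", "sklearn", "b4msa", "EvoMSA"]


def missing_modules(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    print("missing:", missing if missing else "none")
```

An empty result means every listed module can be located; it does not check versions or that the pre-trained models can be downloaded.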

Text Classifier Competitions

EvoMSA 2.0 has been tested in many text classifier competitions without modifications. The aim is to offer a better understanding of how these algorithms perform in a new situation and of how large the difference in performance is with respect to an algorithm tailored to the new problem. The specifics of each configuration are described in the following link.


EvoMSA first version

The documentation of the first version of EvoMSA can be found in the following sections.