DenseBoW API

class DenseBoW[source]

DenseBoW is a text classifier; in fact, it is a subclass of BoW, the difference being the process used to represent the text in a vector space. This process is described in “EvoMSA: A Multilingual Evolutionary Approach for Sentiment Analysis, Mario Graff, Sabino Miranda-Jimenez, Eric Sadit Tellez, Daniela Moctezuma. Computational Intelligence Magazine, vol. 15, no. 1, pp. 76-88, Feb. 2020,” particularly in the section where the Emoji Space is described.

Parameters:
  • emoji (bool) – Whether to use emoji text representations. default=True

  • dataset (bool) – Whether to use labeled dataset text representations (only available in ‘ar’, ‘en’, ‘es’, and ‘zh’). default=True

  • keyword (bool) – Whether to use keyword text representations. default=True

  • skip_dataset (set) – Set of discarded datasets.

  • unit_vector (bool) – Normalize vectors to have length 1. default=True

  • distance_hyperplane (bool) – Whether to compute the distance to the hyperplane in transform. default=False

>>> from EvoMSA import DenseBoW
>>> from microtc.utils import tweet_iterator
>>> from EvoMSA.tests.test_base import TWEETS
>>> D = list(tweet_iterator(TWEETS))
>>> dense = DenseBoW(lang='es')
>>> dense.fit(D)
>>> dense.predict(['Buenos dias']).tolist()
['P']    
__init__(emoji: bool = True, dataset: bool = True, keyword: bool = True, skip_dataset: Set[str] = {}, estimator_kwargs={'dual': False}, unit_vector=True, distance_hyperplane=False, **kwargs) None[source]
fit(*args, **kwargs) DenseBoW[source]

Estimate the parameters of the classifier or regressor (DenseBoW.estimator_class; the fitted instance is accessible at DenseBoW.estimator_instance) using the dataset (D, y).

>>> from EvoMSA import DenseBoW
>>> from microtc.utils import tweet_iterator
>>> from EvoMSA.tests.test_base import TWEETS
>>> D = list(tweet_iterator(TWEETS))
>>> dense = DenseBoW(lang='es').fit(D)
__new__(**kwargs)
transform(D: List[List | dict], y=None) ndarray[source]

Represent the texts in D in the vector space.

Parameters:

D (List of texts or dictionaries.) – Texts to be transformed. If it is a list of dictionaries, the text is stored under the key BoW.key.

>>> from EvoMSA import DenseBoW
>>> X = DenseBoW(lang='en').transform([dict(text='Hi'),
                                       dict(text='Good morning')])
>>> X.shape
(2, 1287)
predict(*args, **kwargs) ndarray[source]

Predict the response variable on the dataset D.

>>> from EvoMSA import DenseBoW
>>> from microtc.utils import tweet_iterator
>>> from EvoMSA.tests.test_base import TWEETS
>>> D = list(tweet_iterator(TWEETS))
>>> dense = DenseBoW(lang='es').fit(D)
>>> dense.predict(['Buenos dias']).tolist()
['P']                
property text_representations

Classifiers that define the text representation.

select(subset: list | None = None, D: ~typing.List[dict | list | None] = None, y: ~numpy.ndarray | None = None, feature_selection: ~typing.Callable = <class 'EvoMSA.model_selection.KruskalFS'>, feature_selection_kwargs: dict = {}) DenseBoW[source]

Procedure to perform feature selection, or to select the representations whose indices are given in subset.

Parameters:
  • subset (List of indices.) – Representations to be selected.

  • D (List of texts or dictionaries.) – Texts; if it is a list of dictionaries, the text is stored under the key BoW.key. default=None

  • y (Array or None) – Response variable. The response variable can also be in D on the key BoW.label_key. default=None

>>> from EvoMSA import DenseBoW
>>> from microtc.utils import tweet_iterator
>>> from EvoMSA.tests.test_base import TWEETS
>>> T = list(tweet_iterator(TWEETS))
>>> text_repr = DenseBoW(lang='es').select(D=T)
>>> text_repr.weights.shape
(2672, 131072)        
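The default feature_selection, KruskalFS, presumably scores each representation with the Kruskal-Wallis test. A simplified, hypothetical sketch of that idea using SciPy (not EvoMSA's implementation; all names and data below are illustrative):

```python
import numpy as np
from scipy.stats import kruskal

# Hypothetical sketch of score-based selection in the spirit of KruskalFS:
# keep the features whose values differ significantly across classes.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 5))   # 40 texts, 5 dense features
y = np.repeat([0, 1], 20)      # two classes
X[y == 1, 0] += 3.0            # make feature 0 informative

keep = [j for j in range(X.shape[1])
        if kruskal(X[y == 0, j], X[y == 1, j]).pvalue < 0.05]
```

The selected indices in keep would then play the role of the subset argument of select.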
property feature_selection

Feature selection used in select

text_representations_extend(value: List | str)[source]

Add dense BoW representations.

Parameters:

value (List of models or string) – List of models or name

property names

Vector space components

property norm_weights

Euclidean norm of the weights

property weights

Weights of the vector space. It is a matrix, i.e., \(\mathbf W \in \mathbb R^{M \times d}\), where \(M\) is the dimension of the vector space (see DenseBoW.names) and \(d\) is the vocabulary size.

>>> from EvoMSA import DenseBoW
>>> text_repr = DenseBoW(lang='es')
>>> text_repr.weights.shape
(2672, 131072)        
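The role of the weights can be illustrated with a small NumPy sketch (toy dimensions and random values, not the actual models): each component of the dense representation is the decision value of a linear classifier, \(w_m \cdot x + b_m\), and with unit_vector=True the resulting vector is normalized to length 1.

```python
import numpy as np

# Toy dimensions (hypothetical): M=3 dense components, d=5 vocabulary terms.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))              # one hyperplane per component
b = rng.normal(size=3)                   # bias of each component
x = np.array([0.0, 1.0, 0.0, 2.0, 0.0])  # sparse BoW vector of one text

dense = W @ x + b                        # decision value of each classifier
dense /= np.linalg.norm(dense)           # unit_vector=True: length-1 output
```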
property bias

Bias.

property dataset

Dense Representation based on human-annotated datasets

property emoji

Dense Representation based on emojis

property keyword

Dense Representation based on keywords

property unit_vector

Normalize the representation to have length one.

fromjson(filename: str) DenseBoW[source]

Load the text representations from a json file.

Parameters:

filename (str) – Path

property skip_dataset

Datasets discarded from the text representations

class DenseBoWBP[source]

DenseBoWBP is a DenseBoW with the difference that the parameters are fine-tuned using JAX.

>>> from microtc.utils import tweet_iterator
>>> from EvoMSA.tests.test_base import TWEETS
>>> from EvoMSA.back_prop import DenseBoWBP
>>> D = list(tweet_iterator(TWEETS))
>>> dense = DenseBoWBP(lang='es').fit(D)
>>> dense.predict(['Buenos dias']).tolist()
['P']
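A conceptual sketch of the fine-tuning idea (not EvoMSA's code; the toy data and learning rate are illustrative): JAX computes the gradient of a logistic loss with respect to a linear layer (W, b), which is then updated by gradient descent.

```python
import jax
import jax.numpy as jnp

# Toy linearly separable problem standing in for the dense representations.
X = jnp.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]])
y = jnp.array([1.0, 0.0, 1.0, 0.0])

def loss(params):
    W, b = params
    p = jax.nn.sigmoid(X @ W + b)       # linear layer + sigmoid
    return -jnp.mean(y * jnp.log(p) + (1 - y) * jnp.log(1 - p))

params = (jnp.zeros(2), jnp.array(0.0))
start = loss(params)
grad_fn = jax.grad(loss)                # backpropagation via autodiff
for _ in range(200):
    g = grad_fn(params)
    params = jax.tree_util.tree_map(lambda p, gp: p - 0.5 * gp, params, g)
```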
__init__(emoji: bool = True, dataset: bool = False, keyword: bool = True, estimator_kwargs={'class_weight': 'balanced', 'dual': 'auto'}, **kwargs)[source]
__new__(**kwargs)