DenseBoW API

class DenseBoW[source]

DenseBoW is a text classifier; in fact, it is a subclass of BoW, the difference being the process used to represent the text in a vector space. This process is described in “EvoMSA: A Multilingual Evolutionary Approach for Sentiment Analysis, Mario Graff, Sabino Miranda-Jimenez, Eric Sadit Tellez, Daniela Moctezuma. Computational Intelligence Magazine, vol. 15, no. 1, pp. 76-88, Feb. 2020,” particularly in the section where the Emoji Space is described.

Parameters:
  • emoji (bool) – Whether to use emoji text representations. default=True

  • dataset (bool) – Whether to use labeled dataset text representations (only available in ‘ar’, ‘en’, ‘es’, and ‘zh’). default=True

  • keyword (bool) – Whether to use keyword text representations. default=True

  • skip_dataset (set) – Set of discarded datasets.

  • unit_vector (bool) – Normalize vectors to have length 1. default=True

  • distance_hyperplane (bool) – Whether to compute the distance to the hyperplane in transform. default=False

>>> from EvoMSA import DenseBoW
>>> from microtc.utils import tweet_iterator
>>> from EvoMSA.tests.test_base import TWEETS
>>> D = list(tweet_iterator(TWEETS))
>>> dense = DenseBoW(lang='es')
>>> dense.fit(D)
>>> dense.predict(['Buenos dias']).tolist()
['P']    
__init__(emoji: bool = True, dataset: bool = True, keyword: bool = True, skip_dataset: Set[str] = {}, estimator_kwargs={'dual': False}, unit_vector=True, distance_hyperplane=False, **kwargs) None[source]
fit(*args, **kwargs) DenseBoW[source]

Estimate the parameters of the classifier or regressor (DenseBoW.estimator_class; the fitted instance is accessible at DenseBoW.estimator_instance) using the dataset (D, y).

>>> from EvoMSA import DenseBoW
>>> from microtc.utils import tweet_iterator
>>> from EvoMSA.tests.test_base import TWEETS
>>> D = list(tweet_iterator(TWEETS))
>>> dense = DenseBoW(lang='es').fit(D)
__new__(**kwargs)
transform(D: List[List | dict], y=None) ndarray[source]

Represent the texts in D in the vector space.

Parameters:

D (List of texts or dictionaries.) – Texts to be transformed. If it is a list of dictionaries, the text is stored under the key BoW.key.

>>> from EvoMSA import DenseBoW
>>> X = DenseBoW(lang='en').transform([dict(text='Hi'),
                                       dict(text='Good morning')])
>>> X.shape
(2, 1287)
predict(*args, **kwargs) ndarray[source]

Predict the response variable on the dataset D.

>>> from EvoMSA import DenseBoW
>>> from microtc.utils import tweet_iterator
>>> from EvoMSA.tests.test_base import TWEETS
>>> D = list(tweet_iterator(TWEETS))
>>> dense = DenseBoW(lang='es').fit(D)
>>> dense.predict(['Buenos dias']).tolist()
['P']                
property text_representations

Classifiers that define the text representation.

select(subset: list | None = None, D: ~typing.List[dict | list | None] = None, y: ~numpy.ndarray | None = None, feature_selection: ~typing.Callable = <class 'EvoMSA.model_selection.KruskalFS'>, feature_selection_kwargs: dict = {}) DenseBoW[source]

Procedure to perform feature selection, or to select the representations whose indices are given in subset.

Parameters:
  • subset (List of indices.) – Representations to be selected.

  • D (List of texts or dictionaries.) – Texts; if it is a list of dictionaries, the text is stored under the key BoW.key. default=None

  • y (Array or None) – Response variable. The response variable can also be in D on the key BoW.label_key. default=None

>>> from EvoMSA import DenseBoW
>>> from microtc.utils import tweet_iterator
>>> from EvoMSA.tests.test_base import TWEETS
>>> T = list(tweet_iterator(TWEETS))
>>> text_repr = DenseBoW(lang='es').select(D=T)
>>> text_repr.weights.shape
(2672, 131072)        
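The default feature_selection, KruskalFS, presumably scores each representation with the Kruskal-Wallis test. A simplified, hypothetical sketch of that idea using SciPy (not EvoMSA's implementation; all names and data below are illustrative):

```python
import numpy as np
from scipy.stats import kruskal

# Hypothetical sketch of score-based selection in the spirit of KruskalFS:
# keep the features whose values differ significantly across classes.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 5))   # 40 texts, 5 dense features
y = np.repeat([0, 1], 20)      # two classes
X[y == 1, 0] += 3.0            # make feature 0 informative

keep = [j for j in range(X.shape[1])
        if kruskal(X[y == 0, j], X[y == 1, j]).pvalue < 0.05]
```

The selected indices in keep would then play the role of the subset argument of select.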
property feature_selection

Feature selection used in select

text_representations_extend(value: List | str)[source]

Add dense BoW representations.

Parameters:

value (List of models or string) – List of models or name

property names

Vector space components

property norm_weights

Euclidean norm of the weights

property weights

Weights of the vector space. It is a matrix, i.e., \(\mathbf W \in \mathbb R^{M \times d}\), where \(M\) is the dimension of the vector space (see DenseBoW.names) and \(d\) is the vocabulary size.

>>> from EvoMSA import DenseBoW
>>> text_repr = DenseBoW(lang='es')
>>> text_repr.weights.shape
(2672, 131072)        
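The role of the weights can be illustrated with a small NumPy sketch (toy dimensions and random values, not the actual models): each component of the dense representation is the decision value of a linear classifier, \(w_m \cdot x + b_m\), and with unit_vector=True the resulting vector is normalized to length 1.

```python
import numpy as np

# Toy dimensions (hypothetical): M=3 dense components, d=5 vocabulary terms.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))              # one hyperplane per component
b = rng.normal(size=3)                   # bias of each component
x = np.array([0.0, 1.0, 0.0, 2.0, 0.0])  # sparse BoW vector of one text

dense = W @ x + b                        # decision value of each classifier
dense /= np.linalg.norm(dense)           # unit_vector=True: length-1 output
```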
property bias

Bias.

property dataset

Dense Representation based on human-annotated datasets

property emoji

Dense Representation based on emojis

property keyword

Dense Representation based on keywords

property unit_vector

Normalize the representation to have length one.

fromjson(filename: str) DenseBoW[source]

Load the text representations from a json file.

Parameters:

filename (str) – Path

property skip_dataset

Datasets discarded from the text representations

class DenseBoWBP[source]

DenseBoWBP is a DenseBoW with the difference that the parameters are fine-tuned using JAX.

>>> from microtc.utils import tweet_iterator
>>> from EvoMSA.tests.test_base import TWEETS
>>> from EvoMSA.back_prop import DenseBoWBP
>>> D = list(tweet_iterator(TWEETS))
>>> dense = DenseBoWBP(lang='es').fit(D)
>>> dense.predict(['Buenos dias']).tolist()
['P']
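A conceptual sketch of the fine-tuning idea (not EvoMSA's code; the toy data and learning rate are illustrative): JAX computes the gradient of a logistic loss with respect to a linear layer (W, b), which is then updated by gradient descent.

```python
import jax
import jax.numpy as jnp

# Toy linearly separable problem standing in for the dense representations.
X = jnp.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]])
y = jnp.array([1.0, 0.0, 1.0, 0.0])

def loss(params):
    W, b = params
    p = jax.nn.sigmoid(X @ W + b)       # linear layer + sigmoid
    return -jnp.mean(y * jnp.log(p) + (1 - y) * jnp.log(1 - p))

params = (jnp.zeros(2), jnp.array(0.0))
start = loss(params)
grad_fn = jax.grad(loss)                # backpropagation via autodiff
for _ in range(200):
    g = grad_fn(params)
    params = jax.tree_util.tree_map(lambda p, gp: p - 0.5 * gp, params, g)
```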
__init__(emoji: bool = True, dataset: bool = False, keyword: bool = True, estimator_kwargs={'class_weight': 'balanced', 'dual': 'auto'}, **kwargs)[source]
__new__(**kwargs)