BoW API

class BoW

BoW is a bag-of-words text classifier. It is described in “A Simple Approach to Multilingual Polarity Classification in Twitter” (Eric S. Tellez, Sabino Miranda-Jiménez, Mario Graff, Daniela Moctezuma, Ranyart R. Suárez, Oscar S. Siordia. Pattern Recognition Letters) and “An Automated Text Categorization Framework based on Hyperparameter Optimization” (Eric S. Tellez, Daniela Moctezuma, Sabino Miranda-Jiménez, Mario Graff. Knowledge-Based Systems, Volume 149, 1 June 2018).

By default, BoW uses a pre-trained bag-of-words representation. The representation was trained on 4,194,304 (\(2^{22}\)) randomly selected tweets. The pre-trained representations are used when the parameters lang and pretrain are set; pretrain is True by default, and the default language is Spanish (es). The available languages are: Arabic (ar), Catalan (ca), German (de), English (en), Spanish (es), French (fr), Hindi (hi), Indonesian (in), Italian (it), Japanese (ja), Korean (ko), Dutch (nl), Polish (pl), Portuguese (pt), Russian (ru), Tagalog (tl), Turkish (tr), and Chinese (zh).

Parameters:
  • lang (str) – Language. (ar | ca | de | en | es | fr | hi | in | it | ja | ko | nl | pl | pt | ru | tl | tr | zh), default=’es’.

  • voc_size_exponent (int) – Vocabulary size. default=17, i.e., \(2^{17}\)

  • voc_selection (str) – Vocabulary (most_common_by_type | most_common). default=most_common_by_type

  • key (Union[str, List[str]]) – Key where the text is in the dictionary. (default=’text’)

  • label_key (str) – Key where the response is in the dictionary. (default=’klass’)

  • mixer_func (Callable[[List], csr_matrix]) – Function to combine the output in case of multiple texts

  • decision_function_name (str) – Name of the decision function (default=’decision_function’)

  • estimator_class (class) – Classifier or Regressor

  • estimator_kwargs (dict) – Keyword parameters for the estimator

  • pretrain (bool) – Whether to use a pre-trained representation. default=True.

  • b4msa_kwargs (dict) – b4msa.textmodel.TextModel keyword arguments used to train a bag-of-words representation. default=dict().

  • kfold_class (class) – Class of the KFold procedure (default=StratifiedKFold)

  • kfold_kwargs (dict) – Keyword parameters for the KFold class

  • v1 (bool) – Whether to use version 1 of the pre-trained representations. default=False

  • n_jobs (int) – Number of jobs. default=1

>>> from microtc.utils import tweet_iterator
>>> from EvoMSA.tests.test_base import TWEETS
>>> from EvoMSA import BoW
>>> bow = BoW(lang='es').fit(list(tweet_iterator(TWEETS)))
>>> bow.predict(['Buenos dias']).tolist()
['P']
__init__(lang: str = 'es', voc_size_exponent: int = 17, voc_selection: str = 'most_common_by_type', key: str | List[str] = 'text', label_key: str = 'klass', mixer_func: Callable[[List], csr_matrix] = <built-in function sum>, decision_function_name: str = 'decision_function', estimator_class=<class 'sklearn.svm._classes.LinearSVC'>, estimator_kwargs={'dual': True}, pretrain=True, b4msa_kwargs={}, kfold_class=<class 'sklearn.model_selection._split.StratifiedKFold'>, kfold_kwargs: dict = {'random_state': 0, 'shuffle': True}, v1: bool = False, n_jobs: int = 1) -> None
fit(D: List[dict | list], y: ndarray | None = None) -> BoW

Estimate the parameters of the BoW (BoW.b4msa_fit) and of the classifier or regressor (BoW.estimator_class; the fitted instance is accessible at BoW.estimator_instance) using the dataset (D, y).

Parameters:
  • D (List of texts or dictionaries.) – Texts. If D is a list of dictionaries, the text is under the key BoW.key.

  • y (Array or None) – Response variable. The response variable can also be in D on the key BoW.label_key.

>>> from microtc.utils import tweet_iterator
>>> from EvoMSA.tests.test_base import TWEETS
>>> from EvoMSA import BoW
>>> import numpy as np
>>> D = list(tweet_iterator(TWEETS))
>>> bow = BoW(lang='es').fit(D)                
transform(D: List[List | dict], y=None) -> csr_matrix

Represent the texts in D in the vector space.

Parameters:

D (List of texts or dictionaries.) – Texts to be transformed. If D is a list of dictionaries, the text is under the key BoW.key.

>>> from EvoMSA import BoW
>>> X = BoW(lang='en').transform(['Hi', 'Good morning'])
>>> X = BoW(lang='en').transform([dict(text='Hi'), dict(text='Good morning')])
>>> X.shape
(2, 131072)
predict(D: List[dict | list]) -> ndarray

Predict the response variable on the dataset D.

Parameters:

D (List of texts or dictionaries.) – Texts. If D is a list of dictionaries, the text is under the key BoW.key.

>>> from microtc.utils import tweet_iterator
>>> from EvoMSA.tests.test_base import TWEETS
>>> from EvoMSA import BoW
>>> bow = BoW(lang='es').fit(list(tweet_iterator(TWEETS)))
>>> bow.predict(['Buenos dias']).tolist()
['P']                
decision_function(D: List[dict | list]) -> list | ndarray

Decision-function values of the estimated response variable for the texts in D.

Parameters:

D (List of texts or dictionaries.) – Texts to be transformed. If D is a list of dictionaries, the text is under the key BoW.key.

>>> from microtc.utils import tweet_iterator
>>> from EvoMSA.tests.test_base import TWEETS
>>> from EvoMSA import BoW
>>> bow = BoW(lang='es').fit(list(tweet_iterator(TWEETS)))
>>> bow.decision_function(['Buenos dias'])
array([[-1.40547754, -1.01340503, -0.57912244,  0.90450322]])      
property bow

Bag-of-words text representation.

The following example tokenizes the text hi. In the output, the first token, ‘hi’, is the word itself. The remaining tokens are q-grams of characters; every q-gram starts with the prefix ‘q:’, so the token ‘q:hi’ represents the q-gram hi. Finally, the character ~ represents a space.

>>> bow = BoW(lang='en')
>>> bow.bow.tokenize(['hi'])
['hi', 'q:~h', 'q:hi', 'q:i~', 'q:~hi', 'q:hi~', 'q:~hi~']        
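
The notation above can be reproduced with a few lines of plain Python. The following is only a sketch, not the tokenizer used by b4msa: the function name qgram_tokens is hypothetical, and the q-gram sizes (2, 3, and 4) are an assumption inferred from the output shown above for a single word.

```python
def qgram_tokens(word, q_sizes=(2, 3, 4)):
    """Return the word followed by its character q-grams, each prefixed with 'q:'.

    Hypothetical helper; the sizes in q_sizes are inferred from the example
    output, not taken from the library's configuration.
    """
    padded = f"~{word}~"  # '~' stands for the space on each side of the word
    tokens = [word]       # the first token is the word itself
    for q in q_sizes:
        # Slide a window of q characters over the padded text
        for start in range(len(padded) - q + 1):
            tokens.append("q:" + padded[start:start + q])
    return tokens


tokens = qgram_tokens("hi")
print(tokens)
# ['hi', 'q:~h', 'q:hi', 'q:i~', 'q:~hi', 'q:hi~', 'q:~hi~']
```
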
property names

Vector space components

property weights

Vector space weights

property estimator_instance

Estimator - Classifier or Regressor fitted (BoW.fit) on the dataset

property pretrain

Whether to use pre-trained text representations.

The parameters of the BoW text representation are BoW.lang, BoW.voc_selection, and BoW.voc_size_exponent. These parameters are not available in Version 1.0 (BoW.v1).

property lang

Language of the pre-trained text representations

property voc_selection

Method used to select the vocabulary

property voc_size_exponent

Vocabulary size \(2^v\); where \(v\) is voc_size_exponent
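
As a quick arithmetic check, the sketch below computes the vocabulary size for the two default exponents that appear in this page (17 for BoW and 15 for BoWBP). The helper name vocabulary_size is hypothetical and is not part of EvoMSA.

```python
def vocabulary_size(voc_size_exponent):
    """Number of vector-space components for a given exponent (2 ** v)."""
    return 2 ** voc_size_exponent


bow_size = vocabulary_size(17)    # default exponent of BoW
bowbp_size = vocabulary_size(15)  # default exponent of BoWBP
print(bow_size, bowbp_size)
# 131072 32768
```

The first value equals the number of columns, 131072, reported by X.shape in the transform example.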

property v1

Whether to use the Version 1.0 text representations. This version is only available for Arabic (ar), English (en), and Spanish (es).

b4msa_fit(D: List[List | dict])

Estimate the parameters of the BoW (BoW.bow) in case it is not pretrained (BoW.pretrain)

Parameters:

D (List of texts or dictionaries.) – Dataset

>>> from microtc.utils import tweet_iterator
>>> from EvoMSA.tests.test_base import TWEETS
>>> from EvoMSA import BoW
>>> bow = BoW(pretrain=False)
>>> bow.b4msa_fit(list(tweet_iterator(TWEETS)))
>>> X = bow.transform(['Hola'])
>>> X.shape
(1, 84802)
train_predict_decision_function(D: List[dict | list], y: ndarray | None = None) -> ndarray

Compute the k-fold predictions on the dataset D with response y.

Parameters:
  • D (List of texts or dictionaries.) – Texts to be transformed. If D is a list of dictionaries, the text is under the key BoW.key.

  • y (Array or None) – Response variable

For example, the following code computes the accuracy using k-fold cross-validation on the dataset found in TWEETS.

>>> from microtc.utils import tweet_iterator
>>> from EvoMSA.tests.test_base import TWEETS
>>> from EvoMSA import BoW
>>> import numpy as np
>>> D = list(tweet_iterator(TWEETS))
>>> bow = BoW(lang='es')
>>> df = bow.train_predict_decision_function(D)
>>> df.shape
(1000, 4)
>>> hy = df.argmax(axis=1)
>>> y = np.array([x['klass'] for x in D])
>>> labels = np.unique(y)
>>> accuracy = (y == labels[hy]).mean()
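
The last four lines of the example map each row of decision values to a label and compare it with the true label. The sketch below repeats that step in plain Python on made-up numbers, so it runs without EvoMSA. The decision values and labels are toy data, and the code assumes, as the example does through np.unique, that column i corresponds to the i-th label in sorted order.

```python
def argmax(row):
    """Index of the largest value in a row of decision values."""
    return max(range(len(row)), key=row.__getitem__)


# Toy data (made up for illustration): true labels and one row of
# decision values per text; columns follow the sorted label order.
y = ["P", "N", "NONE", "NEU", "P"]
df = [
    [-1.4, -1.0, -0.6, 0.9],
    [0.7, -0.2, -1.1, -0.5],
    [-0.8, -0.1, 0.4, -0.9],
    [-0.3, -0.6, 0.2, -1.0],
    [-1.2, 0.1, -0.4, 0.6],
]

labels = sorted(set(y))                     # plays the role of np.unique(y)
hy = [labels[argmax(row)] for row in df]    # predicted label per text
accuracy = sum(t == p for t, p in zip(y, hy)) / len(y)
print(labels, hy, accuracy)
```
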
dependent_variable(D: List[dict | list], y: ndarray | None = None) -> ndarray

Obtain the response variable

Parameters:
  • D (List of texts or dictionaries) – Dataset

  • y (Array or None) – Response variable
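
The fit documentation states that the response variable can be passed as y or stored in D under the key BoW.label_key. A minimal sketch of that rule follows; the function dependent_variable_sketch is hypothetical, returns a plain list rather than an ndarray, and assumes that an explicit y takes precedence over the labels stored in D.

```python
def dependent_variable_sketch(D, y=None, label_key="klass"):
    """Return the response variable for the dataset D.

    Assumption: an explicit y wins; otherwise the labels are read from
    each dictionary in D under label_key.
    """
    if y is not None:
        return list(y)
    return [record[label_key] for record in D]


D = [dict(text="Buenos dias", klass="P"), dict(text="Mal dia", klass="N")]
from_D = dependent_variable_sketch(D)
explicit = dependent_variable_sketch(["Buenos dias", "Mal dia"], y=["P", "N"])
print(from_D, explicit)
```
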

property cache

If the cache is set, it is returned when calling BoW.transform; afterward, it is unset.
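
The one-shot behavior described above can be illustrated with a toy class. This is a sketch of the pattern only; the class CachedTransformer and its stand-in representation (text lengths) are hypothetical and do not reflect how BoW is implemented.

```python
class CachedTransformer:
    """Toy transformer whose cache is consumed by the next transform call."""

    def __init__(self):
        self.cache = None
        self.computed = 0  # how many times the representation was computed

    def _compute(self, D):
        self.computed += 1
        return [len(text) for text in D]  # stand-in for the real representation

    def transform(self, D):
        if self.cache is not None:
            # Return the cached value once, then unset it
            X, self.cache = self.cache, None
            return X
        return self._compute(D)


t = CachedTransformer()
t.cache = [99]
first = t.transform(["hi"])   # served from the cache
second = t.transform(["hi"])  # cache is unset, so the value is computed
print(first, second, t.computed)
```
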

property label_key

Key where the response is in the dictionary.

property key

Key where the text(s) is(are) in the dictionary.

property decision_function_name

Name of the estimator’s decision function

property kfold_class

Class to produce the kfolds

property kfold_kwargs

Keyword arguments of the kfold class

property estimator_class

Class of the classifier or regressor

property estimator_kwargs

Keyword arguments of the estimator BoW.estimator_class

property b4msa_kwargs

Keyword arguments of B4MSA

property mixer_func

Function used to combine (mix) the text representations when there are multiple texts, i.e., when BoW.key is a list of keys.
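
To illustrate the idea of combining the representations of several texts, the sketch below uses collections.Counter objects as stand-ins for sparse vectors. Everything here is hypothetical: BoW works with csr_matrix objects and its default mixer is the built-in sum, whereas the helper add_all passes an explicit start value because Counter objects cannot be added to the integer 0.

```python
from collections import Counter


def represent(text):
    """Toy bag-of-words vector: word counts (stand-in for a sparse row)."""
    return Counter(text.lower().split())


def add_all(vectors):
    """Toy mixer: element-wise sum of the vectors in the list."""
    return sum(vectors, Counter())


record = dict(title="good morning", text="good day")
keys = ["title", "text"]  # several keys, so there is one vector per key

# One representation per key, then a single combined representation
mixed = add_all([represent(record[k]) for k in keys])
print(mixed)
```
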

property n_jobs

Number of jobs used in multiprocessing.

__new__(**kwargs)
class BoWBP

BoWBP is a BoW whose parameters are fine-tuned using JAX.

>>> from microtc.utils import tweet_iterator
>>> from EvoMSA.tests.test_base import TWEETS
>>> from EvoMSA.back_prop import BoWBP
>>> D = list(tweet_iterator(TWEETS))
>>> bow = BoWBP(lang='es').fit(D)
>>> bow.predict(['Buenos dias']).tolist()
['NONE']
__init__(voc_size_exponent: int = 15, estimator_kwargs={'class_weight': 'balanced', 'dual': True}, deviation=None, fraction_initial_parameters=1, optimizer_kwargs: dict = None, **kwargs)
property evolution

Evolution of the objective-function value

property deviation

Function to measure the deviation between the true observations and the predictions.

property parameters

Parameters to optimize

__new__(**kwargs)