`BoW` API

BoW is a bag-of-words text classifier. It is described in “A Simple Approach to Multilingual Polarity Classification in Twitter. Eric S. Tellez, Sabino Miranda-Jiménez, Mario Graff, Daniela Moctezuma, Ranyart R. Suárez, Oscar S. Siordia. Pattern Recognition Letters” and “An Automated Text Categorization Framework based on Hyperparameter Optimization. Eric S. Tellez, Daniela Moctezuma, Sabino Miranda-Jiménez, Mario Graff. Knowledge-Based Systems Volume 149, 1 June 2018.”

BoW uses, by default, a pre-trained bag-of-words representation. The representation was trained on 4,194,304 (\(2^{22}\)) randomly selected tweets. The pre-trained representations are used when the parameters lang and pretrain are set; pretrain is True by default, and the default language is Spanish (es). The available languages are: Arabic (ar), Catalan (ca), German (de), English (en), Spanish (es), French (fr), Hindi (hi), Indonesian (in), Italian (it), Japanese (ja), Korean (ko), Dutch (nl), Polish (pl), Portuguese (pt), Russian (ru), Tagalog (tl), Turkish (tr), and Chinese (zh).

Parameters:

lang (str) – Language. (ar | ca | de | en | es | fr | hi | in | it | ja | ko | nl | pl | pt | ru | tl | tr | zh), default=’es’.
voc_size_exponent (int) – Vocabulary size. default=17, i.e., \(2^{17}\)
voc_selection (str) – Vocabulary (most_common_by_type | most_common). default=most_common_by_type
key (Union[str, List[str]]) – Key where the text is in the dictionary. (default=’text’)
label_key (str) – Key where the response is in the dictionary. (default=’klass’)
mixer_func (Callable[[List], csr_matrix]) – Function to combine the output in case of multiple texts
decision_function_name (str) – Name of the decision function (default=’decision_function’)
estimator_class (class) – Classifier or Regressor
estimator_kwargs (dict) – Keyword parameters for the estimator
pretrain (bool) – Whether to use a pre-trained representation. default=True.
b4msa_kwargs (dict) – b4msa.textmodel.TextModel keyword arguments used to train a bag-of-words representation. default=dict().
kfold_class (class) – Class of the KFold procedure (default=StratifiedKFold)
kfold_kwargs (dict) – Keyword parameters for the KFold class
v1 (bool) – Whether to use the Version 1.0 text representations. default=False
n_jobs (int) – Number of jobs. default=1
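
As a quick check of the sizes quoted above, the vocabulary size follows from voc_size_exponent as a power of two. The snippet below is plain arithmetic and does not call EvoMSA; the helper name vocabulary_size is hypothetical.

```python
def vocabulary_size(voc_size_exponent: int = 17) -> int:
    """Number of vocabulary entries for a given exponent (2 ** exponent)."""
    return 2 ** voc_size_exponent


default_size = vocabulary_size()        # default exponent, 17
training_tweets = 2 ** 22               # tweets used to train the representation
print(default_size, training_tweets)    # 131072 4194304
```

The default value, 131072, is the number of columns reported by the transform example further down this page.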

>>> from microtc.utils import tweet_iterator
>>> from EvoMSA.tests.test_base import TWEETS
>>> from EvoMSA import BoW
>>> bow = BoW(lang='es').fit(list(tweet_iterator(TWEETS)))
>>> bow.predict(['Buenos dias']).tolist()
['P']

__init__(lang: str = 'es', voc_size_exponent: int = 17, voc_selection: str = 'most_common_by_type', key: str | List[str] = 'text', label_key: str = 'klass', mixer_func: Callable[[List], csr_matrix] = sum, decision_function_name: str = 'decision_function', estimator_class=sklearn.svm.LinearSVC, estimator_kwargs={'dual': True}, pretrain=True, b4msa_kwargs={}, kfold_class=sklearn.model_selection.StratifiedKFold, kfold_kwargs: dict = {'random_state': 0, 'shuffle': True}, v1: bool = False, n_jobs: int = 1) → None

fit(D: List[dict | list], y: ndarray | None = None) → BoW

Estimate the parameters of the BoW (BoW.b4msa_fit) and of the classifier or regressor (BoW.estimator_class; the fitted instance is accessible at BoW.estimator_instance) using the dataset (D, y).

Parameters:

D (List of texts or dictionaries.) – Texts; if D is a list of dictionaries, the text is read from the key BoW.key.
y (Array or None) – Response variable. The response variable can also be in D on the key BoW.label_key.

>>> from microtc.utils import tweet_iterator
>>> from EvoMSA.tests.test_base import TWEETS
>>> from EvoMSA import BoW
>>> import numpy as np
>>> D = list(tweet_iterator(TWEETS))
>>> bow = BoW(lang='es').fit(D)                

transform(D: List[List | dict], y=None) → csr_matrix

Represent the texts in D in the vector space.

Parameters: D (List of texts or dictionaries.) – Texts to be transformed; if D is a list of dictionaries, the text is read from the key BoW.key.

>>> from EvoMSA import BoW
>>> X = BoW(lang='en').transform(['Hi', 'Good morning'])
>>> X = BoW(lang='en').transform([dict(text='Hi'), dict(text='Good morning')])
>>> X.shape
(2, 131072)
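
Both calls in the example pass the same two texts, once as plain strings and once as dictionaries whose text sits on the key BoW.key. A minimal sketch of that lookup, written without EvoMSA, might look as follows; the helper get_text is hypothetical and is not the library's code.

```python
from typing import List, Union


def get_text(item: Union[str, dict], key: Union[str, List[str]] = 'text'):
    """Return the text of one item (hypothetical helper, not EvoMSA's code).

    A plain string is returned as is; for a dictionary the text is read
    from `key`, and a list of keys yields one text per key.
    """
    if isinstance(item, str):
        return item
    if isinstance(key, str):
        return item[key]
    return [item[k] for k in key]


as_strings = ['Hi', 'Good morning']
as_dicts = [dict(text='Hi'), dict(text='Good morning')]
texts_a = [get_text(x) for x in as_strings]
texts_b = [get_text(x) for x in as_dicts]
print(texts_a == texts_b)  # True
```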

predict(D: List[dict | list]) → ndarray

Predict the response variable on the dataset D.

Parameters: D (List of texts or dictionaries.) – Texts; if D is a list of dictionaries, the text is read from the key BoW.key.

>>> from microtc.utils import tweet_iterator
>>> from EvoMSA.tests.test_base import TWEETS
>>> from EvoMSA import BoW
>>> bow = BoW(lang='es').fit(list(tweet_iterator(TWEETS)))
>>> bow.predict(['Buenos dias']).tolist()
['P']                

decision_function(D: List[dict | list]) → list | ndarray

Decision-function values of the estimated response variable for the texts in D.

Parameters: D (List of texts or dictionaries.) – Texts to be transformed; if D is a list of dictionaries, the text is read from the key BoW.key.

>>> from microtc.utils import tweet_iterator
>>> from EvoMSA.tests.test_base import TWEETS
>>> from EvoMSA import BoW
>>> bow = BoW(lang='es').fit(list(tweet_iterator(TWEETS)))
>>> bow.decision_function(['Buenos dias'])
array([[-1.40547754, -1.01340503, -0.57912244,  0.90450322]])      
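
For a classifier with one decision value per class, the predicted class is commonly the one with the largest value. The sketch below applies that rule to the row shown above, assuming the columns follow the sorted label order (the convention the cross-validation example on this page relies on through np.unique). The label names are an assumption; only 'P' and 'NONE' appear on this page.

```python
# Decision values copied from the example above (one row per text).
decision_values = [[-1.40547754, -1.01340503, -0.57912244, 0.90450322]]
# Assumed class labels in sorted order; only 'P' and 'NONE' appear in this page.
labels = ['N', 'NEU', 'NONE', 'P']

predictions = []
for row in decision_values:
    best = max(range(len(row)), key=row.__getitem__)  # index of the largest value
    predictions.append(labels[best])
print(predictions)  # ['P']
```

Under these assumptions the result agrees with the output of the predict example for the same text.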

property bow

Bag-of-words text representation.

The following example tokenizes the text hi. In the output, the first token, ‘hi’, is the word itself. The remaining tokens are character q-grams, each starting with the prefix ‘q:’; for example, ‘q:hi’ is the q-gram hi. The character ~ represents a space.

>>> bow = BoW(lang='en')
>>> bow.bow.tokenize(['hi'])
['hi', 'q:~h', 'q:hi', 'q:i~', 'q:~hi', 'q:hi~', 'q:~hi~']        
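
To make the notation concrete, here is a minimal pure-Python sketch that reproduces the output above for this input. It is not the tokenizer used by the library, which may apply further text normalization; the function name sketch_tokenize and the q-gram sizes (2, 3, 4) are assumptions inferred from the tokens shown.

```python
from typing import List, Sequence


def sketch_tokenize(text: str, qsizes: Sequence[int] = (2, 3, 4)) -> List[str]:
    """Word tokens followed by character q-grams prefixed with 'q:'."""
    words = text.split()
    # Spaces become '~', and the text is padded with one '~' on each side.
    padded = '~' + '~'.join(words) + '~'
    tokens = list(words)
    for q in qsizes:
        for i in range(len(padded) - q + 1):
            tokens.append('q:' + padded[i:i + q])
    return tokens


print(sketch_tokenize('hi'))
# ['hi', 'q:~h', 'q:hi', 'q:i~', 'q:~hi', 'q:hi~', 'q:~hi~']
```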

property names: Vector space components

property weights: Vector space weights

property estimator_instance: Estimator (classifier or regressor) fitted with BoW.fit on the dataset

property pretrain

Whether to use pre-trained text representations.

The parameters of the BoW text representation are BoW.lang, BoW.voc_selection, and BoW.voc_size_exponent. These parameters are not available in Version 1.0 (BoW.v1).

property lang: Language of the pre-trained text representations

property voc_selection: Method used to select the vocabulary

property voc_size_exponent: Vocabulary size \(2^v\), where \(v\) is voc_size_exponent

property v1: Whether to use the Version 1.0 text representations. This version is only available for Arabic (ar), English (en), and Spanish (es).

b4msa_fit(D: List[List | dict])

Estimate the parameters of the BoW (BoW.bow) when it is not pre-trained (BoW.pretrain is False).

Parameters: D (List of texts or dictionaries.) – Dataset

>>> from microtc.utils import tweet_iterator
>>> from EvoMSA.tests.test_base import TWEETS
>>> from EvoMSA import BoW
>>> bow = BoW(pretrain=False)
>>> bow.b4msa_fit(list(tweet_iterator(TWEETS)))
>>> X = bow.transform(['Hola'])
>>> X.shape
(1, 84802)

train_predict_decision_function(D: List[dict | list], y: ndarray | None = None, X=None) → ndarray

Compute the k-fold predictions on the dataset D with response y.

Parameters:

D (List of texts or dictionaries.) – Texts to be transformed; if D is a list of dictionaries, the text is read from the key BoW.key.
y (Array or None) – Response variable
X – Transformed dataset
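
The idea behind k-fold predictions is that every sample is scored by a model that did not see it during training. The sketch below illustrates this with a deliberately trivial stand-in model and plain, unshuffled folds; it does not use EvoMSA or scikit-learn, the function out_of_fold_scores is hypothetical, and the default class listed on this page (StratifiedKFold with shuffling) splits the data differently.

```python
from collections import Counter
from typing import List


def out_of_fold_scores(y: List[str], n_splits: int = 3) -> List[List[float]]:
    """Score every sample with a model that never saw it during training.

    The "model" is a stand-in: it scores each class by its relative
    frequency in the training folds. Folds are taken by striding, which is
    simpler than stratified, shuffled folds.
    """
    labels = sorted(set(y))
    scores: List[List[float]] = [[] for _ in y]
    for fold in range(n_splits):
        test_idx = [i for i in range(len(y)) if i % n_splits == fold]
        train = [y[i] for i in range(len(y)) if i % n_splits != fold]
        freq = Counter(train)
        row = [freq[label] / len(train) for label in labels]
        for i in test_idx:
            scores[i] = list(row)
    return scores


y = ['P', 'N', 'P', 'N', 'P', 'P']
df = out_of_fold_scores(y)
print(len(df), len(df[0]))  # 6 2
```

The result has one row per sample and one column per class, which is the same layout as the (1000, 4) array in the example that follows.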

For example, the following code computes the accuracy using k-fold cross-validation on the dataset found in TWEETS:

>>> from microtc.utils import tweet_iterator
>>> from EvoMSA.tests.test_base import TWEETS
>>> from EvoMSA import BoW
>>> import numpy as np
>>> D = list(tweet_iterator(TWEETS))
>>> bow = BoW(lang='es')
>>> df = bow.train_predict_decision_function(D)
>>> df.shape
(1000, 4)
>>> hy = df.argmax(axis=1)
>>> y = np.array([x['klass'] for x in D])
>>> labels = np.unique(y)
>>> accuracy = (y == labels[hy]).mean()

dependent_variable(D: List[dict | list], y: ndarray | None = None) → ndarray

Obtain the response variable

Parameters:

D (List of texts or dictionaries) – Dataset
y (Array or None) – Response variable
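
Since the response can be passed explicitly or stored in D on the key BoW.label_key, the lookup can be pictured as below. This is a sketch under that reading, not the library's implementation: sketch_dependent_variable is a hypothetical name, the sample texts are made up, and the real method returns an ndarray rather than a list.

```python
from typing import List, Optional, Union


def sketch_dependent_variable(D: List[Union[dict, str]],
                              y: Optional[list] = None,
                              label_key: str = 'klass') -> list:
    """Return y when it is given; otherwise read it from each dictionary."""
    if y is not None:
        return list(y)
    return [item[label_key] for item in D]


D = [dict(text='Buenos dias', klass='P'), dict(text='Mal dia', klass='N')]
from_dicts = sketch_dependent_variable(D)
explicit = sketch_dependent_variable(['Buenos dias', 'Mal dia'], y=['P', 'N'])
print(from_dicts == explicit)  # True
```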

property cache: If the cache is set, it is returned when calling BoW.transform; afterward, it is unset.
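
The behaviour described for cache resembles a one-shot cache: the stored value is served once and then cleared. The toy class below illustrates the pattern only; OneShotCache is hypothetical, it is not EvoMSA's implementation, and its transform returns text lengths as a stand-in for the real representation.

```python
class OneShotCache:
    """Toy class showing a cache that is consumed by the next transform call."""

    def __init__(self):
        self._cache = None
        self.computed = 0  # how many times the stand-in computation ran

    @property
    def cache(self):
        return self._cache

    @cache.setter
    def cache(self, value):
        self._cache = value

    def transform(self, D):
        if self._cache is not None:
            value, self._cache = self._cache, None  # return once, then unset
            return value
        self.computed += 1
        return [len(text) for text in D]  # stand-in for the real representation


model = OneShotCache()
model.cache = [2, 12]
first = model.transform(['Hi', 'Good morning'])   # served from the cache
second = model.transform(['Hi', 'Good morning'])  # computed; the cache is unset
```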

property label_key: Key where the response is in the dictionary.

property key: Key(s) where the text is in the dictionary.

property decision_function_name: Name of the estimator’s decision function

property kfold_class: Class to produce the k-folds

property kfold_kwargs: Keyword arguments of the k-fold class

property estimator_class: Class of the classifier or regressor

property estimator_kwargs: Keyword arguments of the estimator BoW.estimator_class

property b4msa_kwargs: Keyword arguments of B4MSA

property mixer_func: Function used to combine the text representations when there are multiple texts.
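
When an item carries several texts (key given as a list), each text yields its own representation and the mixer combines them into one. The default listed in the constructor signature is the built-in sum. The sketch below mimics a component-wise sum on plain Python lists; elementwise_sum and the two vectors are hypothetical, and the library operates on sparse matrices instead.

```python
from typing import List


def elementwise_sum(representations: List[List[float]]) -> List[float]:
    """Combine several equal-length vectors by adding them component-wise."""
    return [sum(values) for values in zip(*representations)]


# Hypothetical vectors for two text fields of the same item.
title_vec = [0.0, 0.5, 0.0, 0.5]
body_vec = [0.25, 0.25, 0.0, 0.0]
mixed = elementwise_sum([title_vec, body_vec])
print(mixed)  # [0.25, 0.75, 0.0, 0.5]
```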

property n_jobs: Number of jobs used in multiprocessing.

__new__(**kwargs)

class BoWBP

BoWBP is a BoW whose parameters are fine-tuned using jax.

>>> from microtc.utils import tweet_iterator
>>> from EvoMSA.tests.test_base import TWEETS
>>> from EvoMSA.back_prop import BoWBP
>>> D = list(tweet_iterator(TWEETS))
>>> bow = BoWBP(lang='es').fit(D)
>>> bow.predict(['Buenos dias']).tolist()
['NONE']

__init__(voc_size_exponent: int = 15, estimator_kwargs={'class_weight': 'balanced', 'dual': True}, deviation=None, fraction_initial_parameters=1, optimizer_kwargs: dict = None, **kwargs)

property evolution: Evolution of the objective-function value

property deviation: Function to measure the deviation between the true observations and the predictions.

property parameters: Parameters to optimize
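
The three properties above relate to an iterative optimization: parameters are updated to reduce a deviation, and the objective value at each step forms the evolution. The toy loop below illustrates that relationship with plain gradient descent on a one-parameter model. It uses neither jax nor EvoMSA, and squared_deviation, fit_slope, the data, and the learning rate are all hypothetical.

```python
from typing import List, Tuple


def squared_deviation(y: List[float], hy: List[float]) -> float:
    """Mean squared difference between observations and predictions."""
    return sum((a - b) ** 2 for a, b in zip(y, hy)) / len(y)


def fit_slope(x: List[float], y: List[float],
              steps: int = 50, lr: float = 0.05) -> Tuple[float, List[float]]:
    """Fit hy = w * x by gradient descent and record the objective values."""
    w = 0.0
    evolution = []
    for _ in range(steps):
        hy = [w * xi for xi in x]
        evolution.append(squared_deviation(y, hy))
        # Analytic gradient of the mean squared deviation with respect to w.
        grad = sum(2 * (h - yi) * xi for h, yi, xi in zip(hy, y, x)) / len(x)
        w -= lr * grad
    return w, evolution


w, evolution = fit_slope([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(round(w, 6), evolution[0] > evolution[-1])  # 2.0 True
```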

__new__(**kwargs)
