BoW
API
- class BoW
BoW is a bag-of-words text classifier. It is described in “A Simple Approach to Multilingual Polarity Classification in Twitter. Eric S. Tellez, Sabino Miranda-Jiménez, Mario Graff, Daniela Moctezuma, Ranyart R. Suárez, Oscar S. Siordia. Pattern Recognition Letters” and “An Automated Text Categorization Framework based on Hyperparameter Optimization. Eric S. Tellez, Daniela Moctezuma, Sabino Miranda-Jiménez, Mario Graff. Knowledge-Based Systems Volume 149, 1 June 2018.”
BoW uses, by default, a pre-trained bag-of-words representation. The representation was trained on 4,194,304 (\(2^{22}\)) randomly selected tweets. The pre-trained representations are used when the parameters lang and pretrain are set; pretrain is set to True by default, and the default language is Spanish (es). The available languages are: Arabic (ar), Catalan (ca), German (de), English (en), Spanish (es), French (fr), Hindi (hi), Indonesian (in), Italian (it), Japanese (ja), Korean (ko), Dutch (nl), Polish (pl), Portuguese (pt), Russian (ru), Tagalog (tl), Turkish (tr), and Chinese (zh).
- Parameters:
lang (str) – Language. (ar | ca | de | en | es | fr | hi | in | it | ja | ko | nl | pl | pt | ru | tl | tr | zh), default=’es’.
voc_size_exponent (int) – Vocabulary size. default=17, i.e., \(2^{17}\)
voc_selection (str) – Vocabulary (most_common_by_type | most_common). default=most_common_by_type
key (Union[str, List[str]]) – Key where the text is in the dictionary. (default=’text’)
label_key (str) – Key where the response is in the dictionary. (default=’klass’)
mixer_func (Callable[[List], csr_matrix]) – Function to combine the output in case of multiple texts
decision_function_name (str) – Name of the decision function (default=’decision_function’)
estimator_class (class) – Classifier or Regressor
estimator_kwargs (dict) – Keyword parameters for the estimator
pretrain (bool) – Whether to use a pre-trained representation. default=True.
b4msa_kwargs (dict) – b4msa.textmodel.TextModel keyword arguments used to train a bag-of-words representation. default=dict().
kfold_class (class) – Class of the KFold procedure (default=StratifiedKFold)
kfold_kwargs (dict) – Keyword parameters for the KFold class
v1 (bool) – Whether to use Version 1.0 of the pre-trained representations. default=False
n_jobs (int) – Number of jobs. default=1
>>> from microtc.utils import tweet_iterator
>>> from EvoMSA.tests.test_base import TWEETS
>>> from EvoMSA import BoW
>>> bow = BoW(lang='es').fit(list(tweet_iterator(TWEETS)))
>>> bow.predict(['Buenos dias']).tolist()
['P']
- __init__(lang: str = 'es', voc_size_exponent: int = 17, voc_selection: str = 'most_common_by_type', key: str | ~typing.List[str] = 'text', label_key: str = 'klass', mixer_func: ~typing.Callable[[~typing.List], ~scipy.sparse._csr.csr_matrix] = <built-in function sum>, decision_function_name: str = 'decision_function', estimator_class=<class 'sklearn.svm._classes.LinearSVC'>, estimator_kwargs={'dual': True}, pretrain=True, b4msa_kwargs={}, kfold_class=<class 'sklearn.model_selection._split.StratifiedKFold'>, kfold_kwargs: dict = {'random_state': 0, 'shuffle': True}, v1: bool = False, n_jobs: int = 1) None
- fit(D: List[dict | list], y: ndarray | None = None) BoW
Estimate the parameters of the BoW (BoW.b4msa_fit) and of the classifier or regressor (BoW.estimator_class; the fitted instance is accessible at BoW.estimator_instance) using the dataset (D, y).
- Parameters:
D (List of texts or dictionaries.) – Texts. When D is a list of dictionaries, the text is under the key BoW.key.
y (Array or None) – Response variable. The response variable can also be in D under the key BoW.label_key.
>>> from microtc.utils import tweet_iterator
>>> from EvoMSA.tests.test_base import TWEETS
>>> from EvoMSA import BoW
>>> import numpy as np
>>> D = list(tweet_iterator(TWEETS))
>>> bow = BoW(lang='es').fit(D)
- transform(D: List[List | dict], y=None) csr_matrix
Represent the texts in D in the vector space.
- Parameters:
D (List of texts or dictionaries.) – Texts to be transformed. When D is a list of dictionaries, the text is under the key BoW.key.
>>> from EvoMSA import BoW
>>> X = BoW(lang='en').transform(['Hi', 'Good morning'])
>>> X = BoW(lang='en').transform([dict(text='Hi'), dict(text='Good morning')])
>>> X.shape
(2, 131072)
- predict(D: List[dict | list]) ndarray
Predict the response variable on the dataset D.
- Parameters:
D (List of texts or dictionaries.) – Texts. When D is a list of dictionaries, the text is under the key BoW.key.
>>> from microtc.utils import tweet_iterator
>>> from EvoMSA.tests.test_base import TWEETS
>>> from EvoMSA import BoW
>>> bow = BoW(lang='es').fit(list(tweet_iterator(TWEETS)))
>>> bow.predict(['Buenos dias']).tolist()
['P']
- decision_function(D: List[dict | list]) list | ndarray
Decision-function values of the estimated response variable for the texts in D.
- Parameters:
D (List of texts or dictionaries.) – Texts to be transformed. When D is a list of dictionaries, the text is under the key BoW.key.
>>> from microtc.utils import tweet_iterator
>>> from EvoMSA.tests.test_base import TWEETS
>>> from EvoMSA import BoW
>>> bow = BoW(lang='es').fit(list(tweet_iterator(TWEETS)))
>>> bow.decision_function(['Buenos dias'])
array([[-1.40547754, -1.01340503, -0.57912244, 0.90450322]])
- property bow
Bag-of-words text representation.
The following example tokenizes the text hi. The notation is as follows: the first token, ‘hi’, is the word hi. Next come the character q-grams; for example, the token ‘q:hi’ represents the q-gram hi. All q-grams start with the prefix ‘q:’. Finally, the character ~ represents a space.
>>> from EvoMSA import BoW
>>> bow = BoW(lang='en')
>>> bow.bow.tokenize(['hi'])
['hi', 'q:~h', 'q:hi', 'q:i~', 'q:~hi', 'q:hi~', 'q:~hi~']
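The token notation can be reproduced with a short, library-free sketch. This is not the b4msa tokenizer: the function name sketch_tokenize is hypothetical, the q-gram sizes (2, 3, and 4) are an assumption inferred from the output shown in the example, and any text normalization the real tokenizer performs is omitted.

```python
from typing import List, Sequence


def sketch_tokenize(text: str, qsizes: Sequence[int] = (2, 3, 4)) -> List[str]:
    """Illustrative tokenizer: word tokens first, then character q-grams.

    Assumption: q-gram sizes 2, 3 and 4, inferred from the documented output.
    """
    words = text.split()
    # Spaces, including one leading and one trailing space, are written as '~'.
    padded = '~' + '~'.join(words) + '~'
    tokens = list(words)
    for q in qsizes:
        # Slide a window of q characters over the padded text.
        for i in range(len(padded) - q + 1):
            tokens.append('q:' + padded[i:i + q])
    return tokens


print(sketch_tokenize('hi'))
# ['hi', 'q:~h', 'q:hi', 'q:i~', 'q:~hi', 'q:hi~', 'q:~hi~']
```

For the input hi, the sketch yields the same seven tokens as the example; for other inputs it may differ from the library, which is why it should be read only as an explanation of the notation.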
- property names
Vector space components
- property weights
Vector space weights
- property pretrain
Whether to use pre-trained text representations.
The parameters of the BoW text representation are BoW.lang, BoW.voc_selection, and BoW.voc_size_exponent. These parameters are not available in Version 1.0 (BoW.v1).
- property lang
Language of the pre-trained text representations
- property voc_selection
Method used to select the vocabulary
- property voc_size_exponent
Vocabulary size \(2^v\), where \(v\) is voc_size_exponent
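As a quick check of the arithmetic, the snippet below evaluates \(2^v\) for the exponents that appear in this page: the BoW default (17), the BoWBP default (15), and the exponent quoted for the number of training tweets (22). The helper name vocabulary_size is hypothetical and is not part of the library.

```python
def vocabulary_size(voc_size_exponent: int) -> int:
    """Number of vector-space components, 2 ** voc_size_exponent."""
    return 2 ** voc_size_exponent


print(vocabulary_size(17))  # 131072, the column count in the transform example
print(vocabulary_size(15))  # 32768, the BoWBP default exponent
print(vocabulary_size(22))  # 4194304, the number of training tweets
```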
- property v1
Whether to use the Version 1.0 text representations. This version is only available for Arabic (ar), English (en), and Spanish (es).
- b4msa_fit(D: List[List | dict])
Estimate the parameters of the BoW (BoW.bow) in case it is not pre-trained (BoW.pretrain).
- Parameters:
D (List of texts or dictionaries.) – Dataset
>>> from microtc.utils import tweet_iterator
>>> from EvoMSA.tests.test_base import TWEETS
>>> from EvoMSA import BoW
>>> bow = BoW(pretrain=False)
>>> bow.b4msa_fit(list(tweet_iterator(TWEETS)))
>>> X = bow.transform(['Hola'])
>>> X.shape
(1, 84802)
- train_predict_decision_function(D: List[dict | list], y: ndarray | None = None) ndarray
Compute the k-fold predictions on the dataset D with response y.
- Parameters:
D (List of texts or dictionaries.) – Texts to be transformed. When D is a list of dictionaries, the text is under the key BoW.key.
y (Array or None) – Response variable
For example, the following code computes the accuracy using k-fold cross-validation on the dataset found in TWEETS:
>>> from microtc.utils import tweet_iterator
>>> from EvoMSA.tests.test_base import TWEETS
>>> from EvoMSA import BoW
>>> import numpy as np
>>> D = list(tweet_iterator(TWEETS))
>>> bow = BoW(lang='es')
>>> df = bow.train_predict_decision_function(D)
>>> df.shape
(1000, 4)
>>> hy = df.argmax(axis=1)
>>> y = np.array([x['klass'] for x in D])
>>> labels = np.unique(y)
>>> accuracy = (y == labels[hy]).mean()
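The last steps of the example, taking the arg-max of each row and mapping it back to a label, can be sketched without the library. The decision values and labels below are made-up toy data, the function name is hypothetical, and the sketch assumes (as the example does) that the columns follow the sorted unique labels.

```python
from typing import Sequence


def accuracy_from_decision_values(df: Sequence[Sequence[float]],
                                  y: Sequence[str]) -> float:
    """Accuracy obtained by predicting the label with the largest decision value."""
    # Assumption: column j corresponds to the j-th label in sorted order.
    labels = sorted(set(y))
    # Arg-max of each row, mapped back to its label.
    hy = [labels[max(range(len(row)), key=row.__getitem__)] for row in df]
    hits = sum(pred == true for pred, true in zip(hy, y))
    return hits / len(y)


# Toy data: four texts, two classes ('N' and 'P').
y = ['N', 'P', 'P', 'N']
df = [[0.7, -0.7], [-0.2, 0.2], [0.1, -0.1], [0.9, -0.9]]
print(accuracy_from_decision_values(df, y))  # 0.75
```

Only the third row is misclassified (its largest value is in the 'N' column while the true label is 'P'), which gives three hits out of four.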
- dependent_variable(D: List[dict | list], y: ndarray | None = None) ndarray
Obtain the response variable
- Parameters:
D (List of texts or dictionaries) – Dataset
y (Array or None) – Response variable
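The two ways of supplying the response variable (explicitly in y, or inside D under BoW.label_key) can be illustrated with a minimal sketch. It is an assumption about the behavior described in this page, not the library's implementation, and the function name sketch_dependent_variable is hypothetical.

```python
from typing import List, Optional, Sequence, Union


def sketch_dependent_variable(D: Sequence[Union[dict, str]],
                              y: Optional[Sequence] = None,
                              label_key: str = 'klass') -> List:
    """Return y when it is given; otherwise read it from D[i][label_key]."""
    if y is not None:
        return list(y)
    # Assumption: without y, every element of D is a dictionary with label_key.
    return [x[label_key] for x in D]


D = [dict(text='Buenos dias', klass='P'), dict(text='Mal dia', klass='N')]
print(sketch_dependent_variable(D))                        # ['P', 'N']
print(sketch_dependent_variable(['a', 'b'], y=['N', 'P']))  # ['N', 'P']
```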
- property cache
If the cache is set, it is returned when calling BoW.transform; afterward, it is unset.
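The one-shot behavior described here can be pictured with a small stand-alone class. The class OneShotCache and its placeholder transform are hypothetical; they only mimic the documented behavior (return the cache once, then unset it) and say nothing about how the library implements it.

```python
class OneShotCache:
    """Hypothetical class mimicking the documented cache behavior."""

    def __init__(self):
        self._cache = None

    @property
    def cache(self):
        return self._cache

    @cache.setter
    def cache(self, value):
        self._cache = value

    def transform(self, D):
        if self._cache is not None:
            # Return the cached value once and unset it.
            value, self._cache = self._cache, None
            return value
        # Placeholder computation standing in for the real representation.
        return [len(text) for text in D]


model = OneShotCache()
model.cache = 'precomputed'
print(model.transform(['hi']))  # precomputed
print(model.cache)              # None
print(model.transform(['hi']))  # [2]
```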
- property label_key
Key where the response is in the dictionary.
- property key
Key where the text(s) is(are) in the dictionary.
- property decision_function_name
Name of the estimator’s decision function
- property kfold_class
Class used to produce the k-folds
- property kfold_kwargs
Keyword arguments of the k-fold class
- property estimator_class
Class of the classifier or regressor
- property estimator_kwargs
Keyword arguments of the estimator (BoW.estimator_class)
- property b4msa_kwargs
Keyword arguments of B4MSA
- property mixer_func
Function used to combine the text representations when there are multiple texts.
- property n_jobs
Number of jobs used in multiprocessing.
- __new__(**kwargs)
- class BoWBP
BoWBP is a BoW with the difference that the parameters are fine-tuned using jax.
>>> from microtc.utils import tweet_iterator
>>> from EvoMSA.tests.test_base import TWEETS
>>> from EvoMSA.back_prop import BoWBP
>>> D = list(tweet_iterator(TWEETS))
>>> bow = BoWBP(lang='es').fit(D)
>>> bow.predict(['Buenos dias']).tolist()
['NONE']
- __init__(voc_size_exponent: int = 15, estimator_kwargs={'class_weight': 'balanced', 'dual': True}, deviation=None, fraction_initial_parameters=1, optimizer_kwargs: dict = None, **kwargs)
- property evolution
Evolution of the objective-function value
- property deviation
Function to measure the deviation between the true observations and the predictions.
- property parameters
Parameters to optimize
- __new__(**kwargs)