DenseBoW
API
- class DenseBoW
DenseBoW is a text classifier; in fact, it is a subclass of
BoW
, the difference being the process used to represent the text in a vector space. This process is described in “EvoMSA: A Multilingual Evolutionary Approach for Sentiment Analysis, Mario Graff, Sabino Miranda-Jimenez, Eric Sadit Tellez, Daniela Moctezuma. Computational Intelligence Magazine, vol 15, no. 1, pp. 76-88, Feb. 2020”, particularly in the section where the Emoji Space is described.
- Parameters:
emoji (bool) – Whether to use emoji text representations. default=True
dataset (bool) – Whether to use labeled dataset text representations (only available in ‘ar’, ‘en’, ‘es’, and ‘zh’). default=True
keyword (bool) – Whether to use keyword text representations. default=True
skip_dataset (set) – Set of datasets to discard.
unit_vector (bool) – Normalize vectors to have length 1. default=True
distance_hyperplane (bool) – Compute the distance to the hyperplane in
transform
. default=False
>>> from EvoMSA import DenseBoW
>>> from microtc.utils import tweet_iterator
>>> from EvoMSA.tests.test_base import TWEETS
>>> D = list(tweet_iterator(TWEETS))
>>> dense = DenseBoW(lang='es')
>>> dense.fit(D)
>>> dense.predict(['Buenos dias']).tolist()
['P']
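Conceptually, each dense component is a linear model (a hyperplane) trained on an auxiliary task such as predicting an emoji or a keyword. The following numpy sketch, with hypothetical weights and a hypothetical BoW vector (not the library's actual models), illustrates what distance_hyperplane=True computes:

```python
import numpy as np

# Hypothetical setup: M = 3 dense components over a d = 5 term vocabulary.
W = np.array([[1.0, 0.0, 2.0, 0.0, 0.0],   # one hyperplane per component
              [0.0, 1.0, 0.0, 1.0, 0.0],
              [2.0, 0.0, 0.0, 0.0, 1.0]])
b = np.array([0.1, -0.2, 0.0])             # one bias per component

x = np.array([0.5, 0.0, 0.5, 0.0, 0.0])    # BoW vector of a single text

# Decision function of each component
scores = W @ x + b
# distance_hyperplane=True scales by the norm of each hyperplane,
# yielding the signed distance of the text to every hyperplane.
dense = scores / np.linalg.norm(W, axis=1)
print(dense.shape)  # one value per dense component: (3,)
```

With distance_hyperplane=False the raw decision values (scores) are used instead.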
- __init__(emoji: bool = True, dataset: bool = True, keyword: bool = True, skip_dataset: Set[str] = {}, estimator_kwargs={'dual': False}, unit_vector=True, distance_hyperplane=False, **kwargs) → None
- fit(*args, **kwargs) → DenseBoW
Estimate the parameters of the classifier or regressor (
DenseBoW.estimator_class
; the fitted instance is accessible at
DenseBoW.estimator_instance
) using the dataset (D, y).

>>> from EvoMSA import DenseBoW
>>> from microtc.utils import tweet_iterator
>>> from EvoMSA.tests.test_base import TWEETS
>>> D = list(tweet_iterator(TWEETS))
>>> dense = DenseBoW(lang='es').fit(D)
- __new__(**kwargs)
- transform(D: List[List | dict], y=None) → ndarray
Represent the texts in D in the vector space.
- Parameters:
D (List of texts or dictionaries.) – Texts to be transformed. If it is a list of dictionaries, the text is stored under the key
BoW.key
.

>>> from EvoMSA import DenseBoW
>>> X = DenseBoW(lang='en').transform([dict(text='Hi'), dict(text='Good morning')])
>>> X.shape
(2, 1287)
- predict(*args, **kwargs) → ndarray
Predict the response variable on the dataset D.

>>> from EvoMSA import DenseBoW
>>> from microtc.utils import tweet_iterator
>>> from EvoMSA.tests.test_base import TWEETS
>>> D = list(tweet_iterator(TWEETS))
>>> dense = DenseBoW(lang='es').fit(D)
>>> dense.predict(['Buenos dias']).tolist()
['P']
- property text_representations
Classifiers that define the text representation.
- select(subset: list | None = None, D: List[dict | list] | None = None, y: ndarray | None = None, feature_selection: Callable = KruskalFS, feature_selection_kwargs: dict = {}) → DenseBoW
Procedure to perform feature selection, or to select the representations whose indices are given in subset.
- Parameters:
subset (List of indices.) – Representations to be selected.
D (List of texts or dictionaries.) – Texts; if it is a list of dictionaries, the text is stored under the key
BoW.key
. default=None
y (Array or None) – Response variable. The response variable can also be in D under the key
BoW.label_key
. default=None

>>> from EvoMSA import DenseBoW
>>> from microtc.utils import tweet_iterator
>>> from EvoMSA.tests.test_base import TWEETS
>>> T = list(tweet_iterator(TWEETS))
>>> text_repr = DenseBoW(lang='es').select(D=T)
>>> text_repr.weights.shape
(2672, 131072)
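The default feature_selection, KruskalFS, ranks every dense feature by a Kruskal-Wallis statistic computed between the classes, so features whose values differ most across classes are kept. A self-contained numpy sketch of that ranking idea, with hypothetical toy data and no tie correction (a simplification, not the library's internals):

```python
import numpy as np

def kruskal_h(values, labels):
    """Kruskal-Wallis H statistic of one feature grouped by class label
    (no tie correction, which suffices for a ranking sketch)."""
    ranks = values.argsort().argsort() + 1.0   # ranks 1..N
    n = len(values)
    h = 0.0
    for c in np.unique(labels):
        r = ranks[labels == c]
        h += len(r) * (r.mean() - (n + 1) / 2) ** 2
    return 12.0 / (n * (n + 1)) * h

# Toy data: feature 0 separates the classes, feature 1 does not.
X = np.array([[1.0, 5.0], [2.0, 1.0], [3.0, 3.0],
              [10.0, 4.0], [11.0, 2.0], [12.0, 6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

scores = np.array([kruskal_h(X[:, j], y) for j in range(X.shape[1])])
keep = np.argsort(scores)[::-1][:1]  # indices of the top-ranked features
print(keep)  # the discriminative feature 0 is kept: [0]
```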
- text_representations_extend(value: List | str)
Add dense BoW representations.
- Parameters:
value (List of models or str) – List of models, or the name of a model.
- property names
Vector space components.
- property norm_weights
Euclidean norm of the weights.
- property weights
Weights of the vector space. It is a matrix, i.e., \(\mathbf W \in \mathbb R^{M \times d}\), where \(M\) is the dimension of the vector space (see
DenseBoW.names
) and \(d\) is the vocabulary size.

>>> from EvoMSA import DenseBoW
>>> text_repr = DenseBoW(lang='es')
>>> text_repr.weights.shape
(2672, 131072)
- property bias
Bias.
- property dataset
Dense representation based on human-annotated datasets.
- property emoji
Dense representation based on emojis.
- property keyword
Dense representation based on keywords.
- property unit_vector
Normalize the representation to have unit length.
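The effect of unit_vector=True can be reproduced with numpy: every row of the dense representation is divided by its Euclidean norm (hypothetical values below):

```python
import numpy as np

# Hypothetical dense representation of two texts (three components each).
X = np.array([[3.0, 4.0, 0.0],
              [1.0, 1.0, 1.0]])

# unit_vector=True: scale every row to Euclidean length 1.
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)
print(np.linalg.norm(X_unit, axis=1))  # [1. 1.]
```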
- fromjson(filename: str) → DenseBoW
Load the text representations from a JSON file.
- Parameters:
filename (str) – Path to the JSON file.
- property skip_dataset
Datasets discarded from the text representations.
- class DenseBoWBP
DenseBoWBP is a
DenseBoW
with the difference that the parameters are fine-tuned using jax.

>>> from microtc.utils import tweet_iterator
>>> from EvoMSA.tests.test_base import TWEETS
>>> from EvoMSA.back_prop import DenseBoWBP
>>> D = list(tweet_iterator(TWEETS))
>>> dense = DenseBoWBP(lang='es').fit(D)
>>> dense.predict(['Buenos dias']).tolist()
['P']
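The library performs the fine-tuning with jax; the underlying idea can be sketched with plain numpy as gradient descent on a logistic loss over a small linear layer (hypothetical data and learning rate, not the actual training loop):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical dense representations (4 texts, 3 components) and binary labels.
X = np.array([[1.0, 0.2, -0.5],
              [0.8, 0.1, -0.4],
              [-0.9, -0.3, 0.6],
              [-1.0, -0.2, 0.5]])
y = np.array([1.0, 1.0, 0.0, 0.0])

w = np.zeros(3)   # parameters to fine-tune
b = 0.0
lr = 0.5

for _ in range(100):                   # gradient descent on the logistic loss
    p = sigmoid(X @ w + b)             # forward pass
    grad_w = X.T @ (p - y) / len(y)    # back-propagated gradient w.r.t. w
    grad_b = np.mean(p - y)            # gradient w.r.t. the bias
    w -= lr * grad_w
    b -= lr * grad_b

pred = (sigmoid(X @ w + b) > 0.5).astype(int)
print(pred)  # [1 1 0 0]
```

With jax the gradients would be obtained by automatic differentiation instead of the hand-derived expressions above.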
- __init__(emoji: bool = True, dataset: bool = False, keyword: bool = True, estimator_kwargs={'class_weight': 'balanced', 'dual': 'auto'}, **kwargs)
- __new__(**kwargs)