Competition Systems

We test 13 different combinations of BoW and DenseBoW models. These models include the two procedures to select the vocabulary (parameter voc_selection), the use of a pre-trained BoW, and the creation of the BoW representation from the given training set. Additionally, we create text representations tailored to the problem at hand; that is, the words with the most discriminant power in a BoW classifier, trained on the training set, are selected as the labels of self-supervised problems.

class Comp2023[source]

Configurations tested on the 2023 Competitions.

Parameters:
  • lang (str) – language, see BoW

  • voc_size_exponent (int) – vocabulary size expressed as an exponent of 2; default=17, i.e., \(2^{17}\).

  • tailored (str) – DenseBoW created with keywords selected from the problem at hand.

  • feature_selection (bool) – perform feature selection on DenseBoW; default=True.
__init__(lang='es', voc_size_exponent=17, tailored=None, feature_selection=True) → None[source]
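A minimal usage sketch is shown next; it assumes each of the following configuration methods returns the corresponding classifier, which provides the usual fit and predict API.

>>> from microtc.utils import tweet_iterator
>>> from EvoMSA.tests.test_base import TWEETS
>>> from EvoMSA.competitions import Comp2023
>>> D = list(tweet_iterator(TWEETS))
>>> comp = Comp2023(lang='es')
>>> bow = comp.bow()
>>> hy = bow.fit(D).predict(['Buenos días'])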
bow(D=None, y=None)[source]

Pre-trained BoW where the tokens are selected based on a normalized frequency with respect to their type, i.e., bigrams, words, and character q-grams.

>>> from microtc.utils import tweet_iterator
>>> from EvoMSA.tests.test_base import TWEETS
>>> from EvoMSA import BoW
>>> D = list(tweet_iterator(TWEETS))
>>> bow = BoW(lang='es').fit(D)
>>> hy = bow.predict(['Buenos días'])
bow_voc_selection(D=None, y=None)[source]

Pre-trained BoW where the tokens correspond to the most frequent ones.

>>> from microtc.utils import tweet_iterator
>>> from EvoMSA.tests.test_base import TWEETS
>>> from EvoMSA import BoW
>>> D = list(tweet_iterator(TWEETS))
>>> bow = BoW(lang='es',
              voc_selection='most_common').fit(D)
>>> hy = bow.predict(['Buenos días'])
bow_training_set(D=None, y=None)[source]

BoW trained on the training set; the vocabulary comprises all the tokens found in the set.

>>> from microtc.utils import tweet_iterator
>>> from EvoMSA.tests.test_base import TWEETS
>>> from EvoMSA import BoW
>>> from EvoMSA.utils import b4msa_params
>>> D = list(tweet_iterator(TWEETS))
>>> params = b4msa_params(lang='es')
>>> del params['token_max_filter']
>>> del params['max_dimension']                
>>> bow = BoW(lang='es',
              pretrain=False, 
              b4msa_kwargs=params).fit(D)
>>> hy = bow.predict(['Buenos días'])
stack_bow_keywords_emojis(D, y=None)[source]

Stack generalization (StackGeneralization) approach where the base classifiers are the BoW and the emoji and keyword models of DenseBoW.

>>> from microtc.utils import tweet_iterator
>>> from EvoMSA.tests.test_base import TWEETS
>>> from EvoMSA import BoW, DenseBoW, StackGeneralization
>>> D = list(tweet_iterator(TWEETS))
>>> bow = BoW(lang='es')
>>> keywords = DenseBoW(lang='es',
                        emoji=False,
                        dataset=False).select(D=D)
>>> emojis = DenseBoW(lang='es',
                      keyword=False,
                      dataset=False).select(D=D)
>>> st = StackGeneralization(decision_function_models=[bow,
                                                       keywords,
                                                       emojis]).fit(D)
>>> hy = st.predict(['Buenos días'])
stack_bow_keywords_emojis_voc_selection(D, y=None)[source]

Stack generalization (StackGeneralization) approach where the base classifiers are the BoW and the emoji and keyword models of DenseBoW. The tokens in these models were selected based on a normalized frequency with respect to their type, i.e., bigrams, words, and character q-grams.

>>> from microtc.utils import tweet_iterator
>>> from EvoMSA.tests.test_base import TWEETS
>>> from EvoMSA import BoW, DenseBoW, StackGeneralization
>>> D = list(tweet_iterator(TWEETS))
>>> bow = BoW(lang='es',
              voc_selection='most_common')
>>> keywords = DenseBoW(lang='es',
                        voc_selection='most_common',
                        emoji=False,
                        dataset=False).select(D=D)
>>> emojis = DenseBoW(lang='es',
                      voc_selection='most_common',
                      keyword=False,
                      dataset=False).select(D=D)
>>> st = StackGeneralization(decision_function_models=[bow,
                                                       keywords,
                                                       emojis]).fit(D)
>>> hy = st.predict(['Buenos días'])        
stack_bows(D=None, y=None)[source]

Stack generalization approach where the base classifiers are two BoW, one for each of the token selection procedures set with the parameter voc_selection.

>>> from microtc.utils import tweet_iterator
>>> from EvoMSA.tests.test_base import TWEETS
>>> from EvoMSA import BoW, StackGeneralization
>>> D = list(tweet_iterator(TWEETS))
>>> bow = BoW(lang='es')
>>> bow_voc = BoW(lang='es',
                  voc_selection='most_common')
>>> st = StackGeneralization(decision_function_models=[bow,
                                                       bow_voc]).fit(D)
>>> hy = st.predict(['Buenos días'])
stack_2_bow_keywords(D, y=None)[source]

Stack generalization approach with four base classifiers. These correspond to two BoW and two DenseBoW (emojis and keywords), where the members of each pair differ in the procedure used to select the tokens (parameter voc_selection), i.e., the most frequent tokens or the normalized frequency.

>>> from microtc.utils import tweet_iterator
>>> from EvoMSA.tests.test_base import TWEETS
>>> from EvoMSA import BoW, DenseBoW, StackGeneralization
>>> D = list(tweet_iterator(TWEETS))
>>> bow = BoW(lang='es')
>>> bow_voc = BoW(lang='es',
                  voc_selection='most_common')
>>> keywords = DenseBoW(lang='es',
                        dataset=False).select(D=D)
>>> keywords_voc = DenseBoW(lang='es',
                            voc_selection='most_common',
                            dataset=False).select(D=D)
>>> st = StackGeneralization(decision_function_models=[bow, bow_voc,
                                                       keywords,
                                                       keywords_voc]).fit(D)
>>> hy = st.predict(['Buenos días'])
stack_2_bow_tailored_keywords(D, y=None)[source]

Stack generalization approach with four base classifiers. These correspond to two BoW and two DenseBoW (emojis and keywords), where the members of each pair differ in the procedure used to select the tokens, i.e., the most frequent tokens or the normalized frequency. The second difference is that the dense representation with normalized frequency also includes models for the most discriminant words selected by a BoW classifier on the training set. We refer to these latter representations as tailored keywords.

>>> from microtc.utils import tweet_iterator
>>> from EvoMSA.tests.test_base import TWEETS
>>> from EvoMSA import BoW, DenseBoW, StackGeneralization
>>> D = list(tweet_iterator(TWEETS))
>>> bow = BoW(lang='es')
>>> keywords = DenseBoW(lang='es',
                        dataset=False).select(D=D)
>>> tailored = 'IberLEF2023_DAVINCIS_task1'
>>> keywords.text_representations_extend(tailored)
>>> bow_voc = BoW(lang='es',
                  voc_selection='most_common')
>>> keywords_voc = DenseBoW(lang='es',
                            voc_selection='most_common',
                            dataset=False).select(D=D)
>>> st = StackGeneralization(decision_function_models=[bow, bow_voc,
                                                       keywords,
                                                       keywords_voc]).fit(D)
>>> hy = st.predict(['Buenos días'])        
stack_2_bow_all_keywords(D, y=None)[source]

Stack generalization approach with four base classifiers, equivalent to EvoMSA.competitions.Comp2023.stack_2_bow_keywords, i.e., BoW and DenseBoW with and without voc_selection; the difference is that the dense representations also include the models created from the human-annotated datasets.

>>> from microtc.utils import tweet_iterator
>>> from EvoMSA.tests.test_base import TWEETS
>>> from EvoMSA import BoW, DenseBoW, StackGeneralization
>>> D = list(tweet_iterator(TWEETS))        
>>> bow = BoW(lang='es')
>>> keywords = DenseBoW(lang='es')
>>> sel = [k for k, v in enumerate(keywords.names)
           if not(v in ['davincis2022_1'] or 'semeval2023' in v)]
>>> keywords.select(sel).select(D=D)
>>> bow_voc = BoW(lang='es', voc_selection='most_common')
>>> keywords_voc = DenseBoW(lang='es',
                            voc_selection='most_common').select(sel).select(D=D)
>>> st = StackGeneralization(decision_function_models=[bow,
                                                       bow_voc,
                                                       keywords,
                                                       keywords_voc]).fit(D)
>>> hy = st.predict(['Buenos días'])
stack_2_bow_tailored_all_keywords(D, y=None)[source]

Stack generalization approach with four base classifiers, equivalent to EvoMSA.competitions.Comp2023.stack_2_bow_all_keywords, i.e., BoW and DenseBoW with and without voc_selection; the difference is that the dense representation with normalized frequency also includes the tailored keywords.

>>> from microtc.utils import tweet_iterator
>>> from EvoMSA.tests.test_base import TWEETS
>>> from EvoMSA import BoW, DenseBoW, StackGeneralization
>>> D = list(tweet_iterator(TWEETS))        
>>> bow = BoW(lang='es')
>>> keywords = DenseBoW(lang='es')
>>> tailored = 'IberLEF2023_DAVINCIS_task1'
>>> sel = [k for k, v in enumerate(keywords.names)
           if not(v in ['davincis2022_1'] or 'semeval2023' in v)]
>>> keywords.select(sel)                           
>>> keywords.text_representations_extend(tailored)
>>> keywords.select(D=D)
>>> bow_voc = BoW(lang='es', voc_selection='most_common')
>>> keywords_voc = DenseBoW(lang='es',
                            voc_selection='most_common').select(sel).select(D=D)
>>> st = StackGeneralization(decision_function_models=[bow,
                                                       bow_voc,
                                                       keywords,
                                                       keywords_voc]).fit(D)
>>> hy = st.predict(['Buenos días'])        
stack_3_bows(D=None, y=None)[source]

Stack generalization approach with three base classifiers. All of them are BoW; the first two correspond to pre-trained BoW with the two token selection procedures described previously (i.e., BoW with default parameters and BoW using voc_selection), and the last is a BoW trained on the training set.

>>> from microtc.utils import tweet_iterator
>>> from EvoMSA.tests.test_base import TWEETS
>>> from EvoMSA import BoW, StackGeneralization
>>> from EvoMSA.utils import b4msa_params        
>>> D = list(tweet_iterator(TWEETS))
>>> bow = BoW(lang='es')
>>> bow_voc = BoW(lang='es',
                  voc_selection='most_common')
>>> params = b4msa_params(lang='es')
>>> del params['token_max_filter']
>>> del params['max_dimension']                
>>> bow_train = BoW(lang='es',
                    pretrain=False, 
                    b4msa_kwargs=params).fit(D)                          
>>> st = StackGeneralization(decision_function_models=[bow,
                                                       bow_voc,
                                                       bow_train]).fit(D)
>>> hy = st.predict(['Buenos días'])                
stack_3_bows_tailored_keywords(D, y=None)[source]

Stack generalization approach with five base classifiers. The first corresponds to a BoW trained on the training set, and the rest are the ones used in EvoMSA.competitions.Comp2023.stack_2_bow_tailored_keywords.

>>> from microtc.utils import tweet_iterator
>>> from EvoMSA.tests.test_base import TWEETS
>>> from EvoMSA import BoW, DenseBoW, StackGeneralization
>>> from EvoMSA.utils import b4msa_params        
>>> D = list(tweet_iterator(TWEETS))
>>> bow = BoW(lang='es')
>>> keywords = DenseBoW(lang='es',
                        dataset=False).select(D=D)
>>> tailored = 'IberLEF2023_DAVINCIS_task1'
>>> keywords.text_representations_extend(tailored)
>>> bow_voc = BoW(lang='es',
                  voc_selection='most_common')
>>> keywords_voc = DenseBoW(lang='es',
                            voc_selection='most_common',
                            dataset=False).select(D=D)
>>> params = b4msa_params(lang='es')
>>> del params['token_max_filter']
>>> del params['max_dimension']                
>>> bow_train = BoW(lang='es',
                    pretrain=False, 
                    b4msa_kwargs=params)                                    
>>> st = StackGeneralization(decision_function_models=[bow, bow_voc,
                                                       bow_train,
                                                       keywords,
                                                       keywords_voc]).fit(D)
>>> hy = st.predict(['Buenos días'])                
stack_3_bow_tailored_all_keywords(D, y=None)[source]

Stack generalization approach with five base classifiers. The first corresponds to a BoW trained on the training set, and the rest are the ones used in EvoMSA.competitions.Comp2023.stack_2_bow_tailored_all_keywords.

>>> from microtc.utils import tweet_iterator
>>> from EvoMSA.tests.test_base import TWEETS
>>> from EvoMSA import BoW, DenseBoW, StackGeneralization
>>> from EvoMSA.utils import b4msa_params 
>>> D = list(tweet_iterator(TWEETS))        
>>> bow = BoW(lang='es')
>>> keywords = DenseBoW(lang='es')
>>> tailored = 'IberLEF2023_DAVINCIS_task1'
>>> keywords.text_representations_extend(tailored)        
>>> sel = [k for k, v in enumerate(keywords.names)
           if not(v in ['davincis2022_1'] or 'semeval2023' in v)]
>>> keywords.select(sel).select(D=D)
>>> bow_voc = BoW(lang='es', voc_selection='most_common')
>>> keywords_voc = DenseBoW(lang='es',
                            voc_selection='most_common').select(sel).select(D=D)
>>> params = b4msa_params(lang='es')
>>> del params['token_max_filter']
>>> del params['max_dimension']                
>>> bow_train = BoW(lang='es',
                    pretrain=False, 
                    b4msa_kwargs=params)                                    
>>> st = StackGeneralization(decision_function_models=[bow,
                                                       bow_voc,
                                                       bow_train,
                                                       keywords,
                                                       keywords_voc]).fit(D)
>>> hy = st.predict(['Buenos días'])       

Tailored Keywords
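As described above, the tailored keywords are the most discriminant words of a BoW classifier trained on the problem's training set, which are then used as the labels of self-supervised problems built from an unlabeled corpus. The snippet below sketches this procedure; D (the training set), LANG, MODEL, PATH_DATASET, and SelfSupervisedDataset are placeholders assumed to be defined for the problem at hand.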

import numpy as np
from EvoMSA import BoW, DenseBoW

# Assumptions: D (training set), LANG, MODEL, PATH_DATASET, and
# SelfSupervisedDataset are defined/imported elsewhere.
bow = BoW(lang=LANG, pretrain=False).fit(D)
keywords = DenseBoW(lang=LANG, emoji=False, dataset=False).names
# Rank each word by its median absolute contribution to the decision function,
# skipping character q-grams ('q:'), bigrams ('~'), and words already in DenseBoW.
tokens = [(name, np.median(np.fabs(w * v)))
          for name, w, v in zip(bow.names, bow.weights, bow.estimator_instance.coef_.T)
          if name[:2] != 'q:' and '~' not in name and name not in keywords]
tokens.sort(key=lambda x: x[1], reverse=True)
# The 2048 most discriminant words become the labels of the self-supervised problems.
semi = SelfSupervisedDataset([k for k, _ in tokens[:2048]],
                             tempfile=f'{MODEL}.gz',
                             bow=BoW(lang=LANG), capacity=1, n_jobs=63)
semi.process(PATH_DATASET, output=MODEL)
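Once processed, the resulting representations can be incorporated into a DenseBoW with text_representations_extend, as done with the tailored model 'IberLEF2023_DAVINCIS_task1' in stack_2_bow_tailored_keywords.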