Multi-objective evolutionary optimization for dimensionality reduction of texts represented by synsets

  1. Vélez de Mendizabal, Iñaki 1, 2
  2. Basto-Fernandes, Vitor 2
  3. Ezpeleta, Enaitz 1
  4. Méndez, José R. 3, 4, 5
  5. Gómez-Meire, Silvana 5
  6. Zurutuza, Urko 1

  1. Electronics and Computing Department, Mondragon Unibertsitatea, Arrasate-Mondragón, Gipuzkoa, Spain
  2. University Institute of Lisbon ISTAR-IUL, Instituto Universitário de Lisboa (ISCTE-IUL), Lisboa, Portugal
  3. Galicia Sur Health Research Institute (IIS Galicia Sur), Hospital Álvaro Cunqueiro, Bloque técnico, SING Research Group, Vigo, Pontevedra, Spain
  4. CINBIO-Biomedical Research Centre, Lagoas-Marcosende, Vigo, Pontevedra, Spain
  5. Department of Computer Science, Universidade de Vigo, Ourense, Spain

Journal:
PeerJ Computer Science

ISSN: 2376-5992

Year of publication: 2023

Volume: 9

Pages: e1240

Type: Article

DOI: 10.7717/PEERJ-CS.1240 (open access, publisher version)


Abstract

Despite new developments in machine learning classification techniques, improving the accuracy of spam filtering remains difficult because of linguistic phenomena that limit its effectiveness. In particular, we highlight polysemy, synonymy, the use of hypernyms/hyponyms, and the presence of irrelevant or confusing words. These problems should be solved at the pre-processing stage to avoid using inconsistent information when building classification models. Previous studies have suggested that synset-based representation strategies can successfully address synonymy and polysemy. In addition, hyponymy/hypernymy relationships can be exploited to implement dimensionality reduction strategies. These strategies can unify textual terms to model the intentions of the document without losing any information (e.g., bringing together the synsets “viagra”, “cialis”, “levitra” and others representing similar drugs under “virility drug”, which is a hypernym of all of them). These feature reduction schemes are known as lossless strategies because the information is not removed but only generalised. However, in some types of text classification problems (such as spam filtering) it may not be worthwhile to keep all the information, and it can be preferable to let dimensionality reduction algorithms discard information that may be irrelevant or confusing. In this work, we formulate feature reduction as a multi-objective optimisation problem to be solved using a Multi-Objective Evolutionary Algorithm (MOEA). With minor modifications, our algorithm can implement lossless (using only semantic-based synset grouping), low-loss (discarding irrelevant information and using semantic-based synset grouping) or lossy (discarding only irrelevant information) strategies. The contribution of this study is two-fold: (i) to introduce different dimensionality reduction methods (lossless, low-loss and lossy) as an optimisation problem that can be solved using a MOEA and (ii) to provide an experimental comparison of lossless and low-loss schemes for text representation. The results obtained support the usefulness of the low-loss method to improve the efficiency of classifiers.
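
The distinction between the three strategies can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy taxonomy `HYPERNYM`, the action codes `KEEP`/`GENERALISE`/`REMOVE`, and the helpers `apply_actions` and `dominates` are hypothetical names introduced here, and the per-synset action encoding is only one plausible way to represent a candidate solution. The sketch assumes each strategy differs solely in which actions it permits, and that a MOEA would compare candidates by Pareto dominance over objectives to be minimised.

```python
from collections import Counter

# Hypothetical toy taxonomy: synset -> direct hypernym (not real WordNet/BabelNet data).
HYPERNYM = {
    "viagra": "virility drug",
    "cialis": "virility drug",
    "levitra": "virility drug",
}

# Hypothetical action codes for one gene of a candidate solution.
KEEP, GENERALISE, REMOVE = 0, 1, 2

# Actions permitted by each strategy, following the definitions in the abstract.
ALLOWED = {
    "lossless": {KEEP, GENERALISE},          # semantic grouping only
    "low-loss": {KEEP, GENERALISE, REMOVE},  # grouping plus discarding
    "lossy": {KEEP, REMOVE},                 # discarding only
}


def apply_actions(doc, actions, strategy):
    """Rewrite a bag of synsets according to per-synset actions."""
    allowed = ALLOWED[strategy]
    reduced = Counter()
    for synset, count in doc.items():
        action = actions.get(synset, KEEP)
        if action not in allowed:
            raise ValueError(f"action {action} not allowed in {strategy} mode")
        if action == REMOVE:
            continue  # the feature is discarded (information is lost)
        if action == GENERALISE:
            # Replace the synset by its hypernym; counts are merged, not lost.
            synset = HYPERNYM.get(synset, synset)
        reduced[synset] += count
    return reduced


def dominates(a, b):
    """True if objective vector a Pareto-dominates b (all objectives minimised)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


doc = Counter({"viagra": 2, "cialis": 1, "levitra": 1, "meeting": 1})
grouping = {"viagra": GENERALISE, "cialis": GENERALISE, "levitra": GENERALISE}

lossless = apply_actions(doc, grouping, "lossless")
low_loss = apply_actions(doc, {**grouping, "meeting": REMOVE}, "low-loss")

print(len(doc), len(lossless), len(low_loss))  # number of features in each version
```

In this toy document the lossless variant merges the three drug synsets into one feature while keeping their total count, and the low-loss variant additionally drops a synset marked as irrelevant. Which synsets to generalise or remove is exactly what the evolutionary search would have to decide.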

Funding information

Funders

  • SMEIC, SRA and ERDF
    • TIN2017-84658-C2-1-R and TIN2017-84658-C2-2-R
  • Conselleria de Cultura, Educación e Universidade of Xunta de Galicia
    • ED431C 2022/03-GRC
  • Universities and Research of the Basque Country
    • IT1676-22
  • FCT
    • UIDB/04466/2020 and UIDP/04466/2020
