Multiword Expression Processing in Galician Using Deep Learning

  1. Doval, Yerai
  2. Kuriyozov, Elmurod
  3. Darriba Bilbao, Víctor Manuel
Journal:
Procesamiento del lenguaje natural

ISSN: 1135-5948

Year of publication: 2021

Issue: 67

Pages: 45-57

Type: Article


Abstract

The treatment of Multiword Expressions is still an open problem in Natural Language Processing. In this work we aim to determine experimentally how useful Machine Learning models are for processing Multiword Expressions in Galician. To do so we use CORGA, a corpus of 40 million words, to train Deep Learning transformer models, and we compare their performance with that of more traditional conditional random field models.
