Adaptation and application of hidden Markov models in automatic text classification

  1. Seara Vieira, Adrián
Directed by:
  1. Eva Lorenzo Iglesias (Director)
  2. María Lourdes Borrajo Diz (Co-director)

University of defense: Universidade de Vigo

Date of defense: 21 December 2016

Committee:
  1. Manuel de Buenaga Rodríguez (Chair)
  2. Alma María Gómez Rodríguez (Secretary)
  3. Manuel Jesús Maña López (Member)
Department:
  1. Informática

Type: Thesis

Teseo: 430393 (Dialnet)

Abstract

With the rapid growth of corporate and digital databases, text classification has become one of the key techniques for handling and organizing digital text data. The high volume of items makes it impossible for any expert to handle them manually, so there is great interest in automating the classification process in order to retrieve the documents relevant to a topic. Text classification is the task of automatically assigning documents to a predefined set of classes or topics. Specifically, in a supervised text classification problem, a machine learning algorithm is trained with examples in order to assign classes to unknown documents automatically. A large number of techniques have been used and tested for text classification, including Naive Bayes, k-Nearest Neighbours (kNN), Neural Networks, and Support Vector Machines (SVMs). Among them, the SVM has been recognized as one of the most powerful and promising machine learning methods.

In other studies, Hidden Markov Models (HMMs) have been used to describe the statistical properties of a sequential random process. They are best known for their application to language problems such as speech recognition, but their use has been extended to text processing fields such as information extraction, information retrieval, text categorization, and text classification. A Hidden Markov Model is a statistical tool used to model generative sequences that can be characterized by an underlying hidden process. It can be seen as a state diagram consisting of a set of states and the transitions between them, where each state can emit an output observation with a certain probability. Thus, two processes take place when an HMM generates sequences. The first describes the unobservable state sequence, i.e. the hidden layer, represented by the state transition probabilities. The second links each state to the observations, creating the observable layer, represented by the output observation probabilities.

One common task that can be addressed by deploying an HMM in text analysis is inferring the internal structure of documents. This task is related to the decoding problem for HMMs, where the hidden state transitions must be deduced given an HMM and an observation sequence. Another text processing task solved with HMMs is text sequence classification, in which the HMM is commonly treated as a word generator and receives as input a sequence of units belonging to a document. This task is related to the evaluation problem for HMMs, where the probability of an observation sequence must be calculated given a trained HMM. Both problems are sketched in the short example below.

In general, HMMs cannot be used directly in a text classification process. Previous HMM-related techniques provide ways of applying these models to text processing tasks; however, they are specifically suited to information retrieval problems or need prior knowledge to achieve significant results. One of the goals of this thesis is to create a simpler and more effective text classifier based on Hidden Markov Models that can categorize documents according to their content.
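The classifier developed in the thesis is not reproduced here, but the two classical HMM problems mentioned above can be sketched in a few lines of Python. The following toy example (all model parameters are hypothetical values chosen for illustration) implements the forward algorithm for the evaluation problem and the Viterbi algorithm for the decoding problem.

    import numpy as np

    # Toy HMM: 2 hidden states, 3 possible output observations.
    # All parameters below are hypothetical illustration values.
    pi = np.array([0.6, 0.4])            # initial state probabilities
    A  = np.array([[0.7, 0.3],           # state transition probabilities
                   [0.4, 0.6]])
    B  = np.array([[0.5, 0.4, 0.1],      # output observation probabilities
                   [0.1, 0.3, 0.6]])

    obs = [0, 2, 1]  # an observation sequence (e.g., encoded word units)

    def forward(pi, A, B, obs):
        """Evaluation problem: P(observation sequence | model)."""
        alpha = pi * B[:, obs[0]]
        for o in obs[1:]:
            alpha = (alpha @ A) * B[:, o]
        return alpha.sum()

    def viterbi(pi, A, B, obs):
        """Decoding problem: most likely hidden state sequence."""
        delta = np.log(pi) + np.log(B[:, obs[0]])
        backptr = []
        for o in obs[1:]:
            scores = delta[:, None] + np.log(A)   # delta[i] + log A[i, j]
            backptr.append(scores.argmax(axis=0))
            delta = scores.max(axis=0) + np.log(B[:, o])
        # Backtrack from the best final state.
        state = int(delta.argmax())
        path = [state]
        for bp in reversed(backptr):
            state = int(bp[state])
            path.append(state)
        return path[::-1]

    print(forward(pi, A, B, obs))   # likelihood usable for classification
    print(viterbi(pi, A, B, obs))   # inferred hidden state sequence

A content-based classifier along these lines would train one HMM per class and assign an unseen document to the class whose model yields the highest evaluation probability.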
In addition to the classification problem itself, document classifiers must handle text preprocessing issues that arise from the document corpora. Among these, there are two major difficulties that classifiers must deal with: the representation of the documents and the amount of training data.

The accuracy of text learning systems is clearly affected by how the documents are represented. Documents have to be transformed into a format that automatic classifiers can handle. At its simplest, document representation is done through term-document frequency (bag-of-words) representations, which for efficiency are implemented over sparse matrices. In this case, every document is represented by a vector of its most relevant terms (feature words). This classical representation is hindered by the practical limitations of big text corpora: the time and space efficiency of text classification largely depends on the size of these vectors. Generally, not all words contribute significantly to discriminating between texts, so filtering algorithms can be applied to the initial feature word set in order to reduce its dimensionality. Some of these techniques remove redundant or irrelevant features using statistical filtering methods such as Information Gain. Another type of feature reduction method, such as PCA (Principal Component Analysis), creates new attributes as combinations of the original ones. Generally, these algorithms increase the classification accuracy and reduce the computational cost of the process, although they are very costly for high-dimensional datasets (a sketch of the filtering approach is given below).

On the other hand, the amount of available training data affects the classification performance in multiple ways. The class imbalance problem occurs when only a very small number of cases represents one class compared to the others. Classification techniques usually assume that data are uniformly distributed among the classes: a classifier performs well when applied to a dataset evenly distributed among classes, but its performance often decreases when it encounters unbalanced data. In this case, the classifier tends to ignore rare classes and predict the larger ones. Sampling strategies such as over- and undersampling are popular in tackling the class imbalance problem. Undersampling algorithms artificially decrease the number of samples that belong to the larger classes, while oversampling algorithms increase the number of instances of the rare classes by taking the larger classes into account. In general, these techniques can improve the classification performance when the corpus is not balanced; however, it is possible to lose valuable information for the classifier (undersampling) or to increase the time needed for the training phase and overfit the model (oversampling). A minimal sketch of both strategies is also given below.

Aside from building an HMM-based classifier, this thesis also explores how the previously mentioned problems affect its classification performance and how the model can be applied to other text preprocessing steps such as feature reduction and class balancing.
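As an illustrative sketch of the representation and filtering steps described above, the following Python fragment (using scikit-learn; the toy corpus, labels, and the value of k are hypothetical) builds a sparse bag-of-words matrix and keeps only the highest-scoring feature words, here scored with mutual information, a filter closely related to Information Gain. This is not the reduction method evaluated in the thesis.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectKBest, mutual_info_classif

    # Hypothetical toy corpus with two classes (0 = sports, 1 = finance).
    docs = [
        "the team won the match",
        "stocks fell as markets closed",
        "the coach praised the players",
        "investors sold shares after the report",
    ]
    labels = [0, 1, 0, 1]

    # Bag-of-words: sparse term-document frequency matrix.
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(docs)          # scipy sparse matrix

    # Statistical filtering: keep the k feature words with the
    # highest mutual information with the class labels.
    selector = SelectKBest(mutual_info_classif, k=5)
    X_reduced = selector.fit_transform(X, labels)

    kept = selector.get_support(indices=True)
    print([vectorizer.get_feature_names_out()[i] for i in kept])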
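The over- and undersampling strategies discussed above can likewise be sketched with plain NumPy. The random duplication and discarding below are minimal illustrations under hypothetical data, not the balancing methods studied in the thesis.

    import numpy as np

    rng = np.random.default_rng(0)

    def random_oversample(X, y):
        """Duplicate random minority-class samples up to the majority size."""
        classes, counts = np.unique(y, return_counts=True)
        target = counts.max()
        idx = []
        for c in classes:
            members = np.flatnonzero(y == c)
            extra = rng.choice(members, size=target - len(members), replace=True)
            idx.extend(members)
            idx.extend(extra)
        idx = np.array(idx)
        return X[idx], y[idx]

    def random_undersample(X, y):
        """Discard random majority-class samples down to the minority size."""
        classes, counts = np.unique(y, return_counts=True)
        target = counts.min()
        idx = np.concatenate([
            rng.choice(np.flatnonzero(y == c), size=target, replace=False)
            for c in classes
        ])
        return X[idx], y[idx]

    # Hypothetical imbalanced dataset: 10 samples of class 0, 2 of class 1.
    X = rng.normal(size=(12, 4))
    y = np.array([0] * 10 + [1] * 2)

    Xo, yo = random_oversample(X, y)
    print(np.unique(yo, return_counts=True))   # both classes at 10
    Xu, yu = random_undersample(X, y)
    print(np.unique(yu, return_counts=True))   # both classes at 2

As the abstract notes, the undersampling variant may discard informative samples, while the oversampling variant enlarges the training set and can encourage overfitting.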