Multi-relational learning for full text scientific document classification

  1. Oliveira Gonçalves, Carlos Adriano
Supervised by:
  1. Rui Carlos Camacho de Sousa Ferreira da Silva (Supervisor)
  2. Eva Lorenzo Iglesias (Supervisor)

Defending university: Universidade de Vigo

Date of defense: 21 July 2022

Examination committee:
  1. Sérgio Guilherme Aleixo de Matos (Chair)
  2. Alma María Gómez Rodríguez (Secretary)
  3. Carla Teixeira Lopes (Member)
Department:
  1. Informática

Type: Thesis

Abstract

Text mining is an area of Artificial Intelligence that has grown in importance in recent years due to the large amount of textual data that must be handled during decision-making and strategy development in many sectors of society. Document classification is one of its most important techniques, with applications such as web page and document categorization, sentiment analysis of social network users, spam email detection, and information sharing and recommendation systems, among others. In the field of medicine in particular, the number of documents handled by medical literature repositories such as the National Center for Biotechnology Information (NCBI) or MEDLINE has grown exponentially in recent years. This has created the need for more efficient computational techniques and methods for searching and classifying documents in order to extract relevant knowledge that drives new findings in scientific research.

Approaches to extracting information from scientific literature databases usually rely exclusively on the titles and abstracts of scientific articles to classify documents. However, users who search full texts are more likely to find relevant articles than those who search only titles and abstracts. This finding emphasizes the relevance of full text collections for text retrieval and motivates research on algorithms that take advantage of the rapidly growing digital archives. On the other hand, using full text documents generates a much larger number of terms, which must be analyzed to verify whether they actually improve the classification process. This raises the need to discard terms that do not help discriminate the class a document belongs to and to reduce the huge number of terms handed to the classifier. This step is called preprocessing, and several techniques may be applied. The best preprocessing techniques must be selected so as to reduce the overwhelming number of terms introduced by the full text and by semantic enrichment (another technique typically used to enrich the data set with domain-specific content) without degrading accuracy.

When working with full text, the terms from the different sections become available (all MEDLINE scientific documents share the same section structure: Title, Abstract, Introduction, Methods and Materials, Results, and Conclusions). Determining the impact of each document section and which section combinations produce better classifications is also part of this research. Furthermore, it is essential to determine whether handling full text documents is better than simply searching the Title and Abstract, and whether searching full text documents is better than searching a specific combination of document sections. This leads to another interesting research topic addressed in this thesis: the representation of features (terms) for text mining when using full text. To study it, full text documents extracted from the MEDLINE corpus were divided into their sections, making it possible to analyze the individual impact of each section on the classification process, as well as the impact of combining the different sections with different weights. With this division it is also possible to apply a different learning algorithm to each individual section.
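Since full text documents are handled section by section throughout this work, the following minimal sketch illustrates how a full text article could be split into the six section groups. It assumes PMC/JATS-style XML and an illustrative heading-to-group mapping; it is not the thesis's actual implementation.

```python
# Minimal sketch: split a PMC full-text article (JATS-style XML) into the six
# section groups, so each group can later feed its own learner.
# The tag paths and the keyword-to-group mapping are illustrative assumptions.
import re
import xml.etree.ElementTree as ET

GROUPS = {
    "introduction": "Introduction",
    "background": "Introduction",
    "method": "Methods",
    "material": "Methods",
    "result": "Results",
    "discussion": "Conclusions",   # assumption: discussion grouped with conclusions
    "conclusion": "Conclusions",
}

def text_of(elem):
    """Concatenate all text below an XML element."""
    return " ".join(t.strip() for t in elem.itertext() if t.strip())

def split_into_sections(xml_path):
    root = ET.parse(xml_path).getroot()
    sections = {g: [] for g in
                ["Title", "Abstract", "Introduction", "Methods", "Results", "Conclusions"]}
    title = root.find(".//article-title")
    if title is not None:
        sections["Title"].append(text_of(title))
    abstract = root.find(".//abstract")
    if abstract is not None:
        sections["Abstract"].append(text_of(abstract))
    for sec in root.findall(".//body/sec"):
        heading = sec.find("title")
        heading_text = (text_of(heading) if heading is not None else "").lower()
        for keyword, group in GROUPS.items():
            if keyword in heading_text:
                sections[group].append(text_of(sec))
                break
    return {g: re.sub(r"\s+", " ", " ".join(parts)) for g, parts in sections.items()}
```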
It also becomes feasible to study the impact of multi-relational algorithms and to find out whether a multi-relational approach can contribute positively to the classification process, specifically to model building. In summary, throughout the months of research work for this thesis, answers to the following questions were sought:

• Can the analysis of full text documents contribute to improving the classification process?
• Which preprocessing technique best reduces the overwhelming number of terms introduced by the full text and by semantic enrichment?
• Which technique best identifies the most discriminative terms for the classification process?
• What is the impact of each document section, and which section combinations are best for the classification process? Is full text better than classification based on title and abstract only? Is a particular combination of sections better than full text?
• Can ILP in conjunction with propositional learners achieve better results for document classification?
• Can ILP embedded as a base learner in meta-learning improve document classification results?

To answer these questions, a set of experiments was designed to support the decisions and the path followed in the elaboration of this thesis. To build a sufficiently large corpus of full text documents, it was necessary to assemble a corpus based on OHSUMED (a well-known collection of documents used in numerous text mining works in biomedicine). As the OHSUMED data set contains only the Title and Abstract of the documents, the research could not be carried out on that data set alone. The corpus of full text documents was therefore created from the original OHSUMED corpus by incorporating other components, namely the MeSH (Medical Subject Headings) terms from MEDLINE and the full text documents available in the PubMed Central repository. The XML documents with MeSH terms available in the MEDLINE repository were used to link to the full text documents in the PubMed Central repository, allowing the construction of an OHSUMED-like data set with full text documents. The data set used in this research is divided into 26 categories representing different types of diseases, for example: neoplasms (C04), digestive system diseases (C06), cardiovascular diseases (C14), immunologic diseases (C20), and pathological conditions, signs, and symptoms (C23), to mention just a few.

With the created full text corpus, the first step of the research focused on analysing the effects of preprocessing techniques on the classification process, as well as the impact of the different document sections, assessed by assigning weights to each section. A detailed study was carried out, identifying the sections and grouping them into the following six groups: Title, Abstract, Introduction, Methods, Results, and Conclusions. The novel term frequency-inverse section frequency (TF×ISF) weighting, based on the traditional term frequency-inverse document frequency (TF×IDF), was presented and applied in a study covering 43 combinations of differently weighted sections. The study compared the classification results obtained using only titles and abstracts against those obtained using full text documents with sections, applying the section weighting and evaluating the importance of having terms drawn from several sections.
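For orientation, TF×ISF can be read as the section-level analogue of TF×IDF, with sections playing the role that documents play in TF×IDF. The formulation below is a sketch of that analogy; the exact definition used in the thesis may differ.

```latex
% TF-IDF over documents and its section-level analogue TF-ISF (sketch by analogy).
% tf(t,d): frequency of term t in document d;  tf(t,s): frequency of term t in section s.
% N: number of documents;  S: number of sections considered.
\mathrm{tfidf}(t,d) = \mathrm{tf}(t,d)\cdot\log\frac{N}{\lvert\{d' : t \in d'\}\rvert}
\qquad
\mathrm{tfisf}(t,s) = \mathrm{tf}(t,s)\cdot\log\frac{S}{\lvert\{s' : t \in s'\}\rvert}
```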
In summary, two main objectives were pursued. The first was to evaluate how the impact of the preprocessing techniques on the different sections affects the classification process. The second was to determine the best combination of section weights for the learning process. The results show that the Title and Abstract data set reaches values close to the best Kappa value in only three of the twelve data sets of the analyzed corpora, whereas, when the full text is processed, good Kappa values are reached in ten of the analyzed corpora. It is also clearly demonstrated that adding document sections to the classification process substantially improves Kappa results in the vast majority of data sets compared to processing Title and Abstract only.

The next step of the research focused on determining whether the extra work of full text analysis has a significant impact on the performance of text mining tasks, or whether the impact depends on the scientific domain or the specific corpus under analysis. The goal was to provide a full text classification framework, called LearnSec, which incorporates domain-specific knowledge to improve the classification process with propositional and multi-relational learning, and to find out which subset of text components (sections) achieves the best results. The framework provides functionalities to facilitate the generation of database repositories from texts originally stored in XML format, and it integrates a set of database operations that help in the generation of data sets. Another functionality is the generation of attribute/value data sets (in WEKA format) for propositional learning and of first-order logic representations (in Inductive Logic Programming, ILP, format) for relational learning. Apart from the traditional bag-of-words approach to text classification, the framework supports domain-specific background knowledge that may be advantageous in classification tasks for both propositional and relational learners. The proposed architecture can incorporate additional components into the framework, and it is the first implementation to embed a multi-relational algorithm into a data mining tool like WEKA. The main objectives were to demonstrate the usefulness of the framework and to evaluate the effectiveness of the preprocessing techniques. The results obtained are very promising, showing that, using full text documents, propositional and multi-relational algorithms achieved better results in terms of F-measure and Kappa than the results achieved so far.

Inductive Logic Programming (a multi-relational learning approach) uses logic programming (first-order logic) to represent examples, background knowledge, and hypotheses, and can therefore express concepts that cannot be captured by the feature vectors used in propositional algorithms. This makes Inductive Logic Programming suitable for analyzing structured text documents and building highly complex models, which is very useful in bioinformatics and natural language processing. On the other hand, the number of features can be overwhelming for the algorithm to produce results in a timely manner. This led to the study of existing approaches and to the development of a novel algorithm to identify the k-Best-Discriminative-Terms (k-BDT) to be used in ILP systems, addressing the problem of the very high number of features in full text data sets.
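For reference, the sketch below shows the two baseline feature-selection methods against which k-BDT is compared (Information Gain, approximated here with mutual information, and Correlation), exposed through a common "keep the top k terms" interface. The k-BDT scoring itself is not specified in this abstract, so it appears only as a placeholder; all names are illustrative.

```python
# Minimal sketch of the feature-selection baselines used for comparison.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def top_k_by_information_gain(X, y, k):
    """X: (n_docs, n_terms) term matrix, y: class labels."""
    scores = mutual_info_classif(X, y, discrete_features=True)
    return np.argsort(scores)[::-1][:k]

def top_k_by_correlation(X, y, k):
    """Rank terms by absolute Pearson correlation with a binary class label."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12
    scores = np.abs(Xc.T @ yc) / denom
    return np.argsort(scores)[::-1][:k]

def top_k_best_discriminative_terms(X, y, k):
    """Placeholder for the thesis's k-BDT scoring (not specified in the abstract)."""
    raise NotImplementedError("substitute the k-BDT scoring function here")
```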
The experiments carried out allowed the new k-BDT method to be implemented and evaluated against existing feature selection techniques such as Information Gain and Correlation. The experiments had the following phases: (1) identify the best values of k; (2) apply and compare the k-Best-Discriminative-Terms method with the Information Gain and Correlation methods on a data set containing only Title and Abstract; (3) apply and compare the k-Best-Discriminative-Terms method with the Information Gain and Correlation methods on a full text data set. The results support the following two conclusions:

• The new k-BDT method is not the best approach for Title and Abstract data sets; it obtains lower results than the Information Gain and Correlation methods.
• When applied to the full text data set, the new k-BDT method achieves better results than the Information Gain and Correlation methods.

All of these studies showed that it is essential to use all sections of the documents rather than only the titles and abstracts. Following this, a new approach to the classification of full text documents using a multi-view representation based on sections was proposed. This approach, together with ensemble learning algorithms, allows propositional as well as multi-relational algorithms to be applied to each view. These algorithms act as base learning models and their results are combined by means of a propositional learning algorithm. In this way, different machine learning models can operate on different data samples (the views of the document) and combine their results to improve the classification process. The experiments carried out evaluated the new multi-view ensemble learning architecture, comparing the results obtained when classifying documents containing only Title and Abstract against those obtained when classifying full text documents structured in sections. The proposed multi-view method statistically outperforms standard text classification proposals in 15 of the 24 data sets tested. This suggests that using the different sections as input views to different classifiers can improve the accuracy of the final classification. The study also analyzed the results obtained by the base classifiers to check which views (document sections) provided the best results: the Results section obtains the highest Kappa values for all analyzed data sets, and the Conclusions section obtains the lowest values in almost all experiments. The inclusion of an Inductive Logic Programming algorithm in the proposed multi-view architecture, within WEKA, makes it easier to use different types of algorithms efficiently. To address the automatic hypothesis creation process in Inductive Logic Programming, which typically includes a lot of redundancy, a methodology that significantly reduces the redundancy of the hypothesis space was developed and implemented; the proposed method considerably reduces the execution time needed to build ILP models. The next milestone of this research work was to demonstrate the importance of enriching the biomedical corpus with semantic information to improve the classification of full texts.
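As an illustration of what such enrichment can look like, the sketch below appends tokens derived from subject-predicate-object predications (such as those produced by SemRep) to a document's term list. The predication format and token scheme are assumptions made for illustration only, not the thesis's exact pipeline.

```python
# Minimal sketch of semantic enrichment: one synthetic token per predication.
def enrich_with_predications(tokens, predications):
    """
    tokens:       list of terms from the document (one section or the full text)
    predications: list of (subject, predicate, object) triples for the document
    returns:      enriched token list
    """
    enriched = list(tokens)
    for subj, pred, obj in predications:
        # Collapse each triple into a single feature, e.g. "aspirin_treats_headache".
        enriched.append(f"{subj}_{pred}_{obj}".replace(" ", "_").lower())
    return enriched

# Hypothetical usage:
# enrich_with_predications(["aspirin", "patients", "headache"],
#                          [("Aspirin", "TREATS", "Headache")])
```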
The results obtained show that applying enrichment techniques to full text documents significantly improves the results of text classification tasks. The experiments addressed three main objectives. The first was to verify the possibility of using an existing data mining framework to implement the pipeline defined in the multi-view architecture and to add a multi-relational learning algorithm to it. The second was to verify whether Inductive Logic Programming algorithms can be used together with propositional algorithms and how they behave when used together. The third was to reduce the long run time required to build models with ILP. To fulfill the first objective, the WEKA tool was extended with an ILP algorithm, showing the viability of the multi-view architecture with a multi-relational learning algorithm; using ILP as a base learner yields better results. Regarding the second objective, the flexibility to combine propositional and multi-relational algorithms in the multi-view architecture, the results show that ILP can perform better when applied to sections such as Title, Abstract, and Conclusions. Regarding the third objective, this study presents an approach that greatly reduces the redundancy of the hypothesis space search procedure, with a positive effect on the time taken to create ILP models.

All the research was supported by the experiments carried out, and the results obtained lead to the following conclusions, which can be enumerated as the main contributions of this thesis:

1. The experiments carried out to evaluate the impact that the different sections of scientific articles have on the classification process showed that some sections contain little relevant content and could be discarded during document preprocessing, while other sections contain terms whose importance can be highlighted by increasing their weights. The results show that adding document sections to the classification process substantially improves the Kappa results in the vast majority of data sets.
2. The semantic enrichment of the OHSUMED corpus was performed with SemRep, which extracts semantic predications from biomedical texts. The enrichment led to an expected growth in the number of terms, which required the use of feature selection techniques. The results lead to the conclusion that applying semantic enrichment techniques to full text documents significantly improves text classification.
3. All of the previous studies led to the conclusion that using all sections of the documents, rather than just the title and abstract, is crucial. The study and research carried out throughout this thesis resulted in a new approach to full text classification using a multi-view representation based on sections (sketched below). The experimental comparison with traditional classification indicates that the proposed architecture is better when used for full text classification. The analysis of the base classifiers' results by view shows that the Results section has the best Kappa results for all data sets examined, while the Conclusions section has the lowest values in almost all tests.
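As referenced in contribution 3, the following is a minimal sketch of the multi-view ensemble: one base classifier per section (view), whose outputs are combined by a propositional meta-learner. The thesis embeds an ILP/Aleph learner as a base model inside WEKA; here scikit-learn classifiers are used purely as stand-ins, all names are illustrative, and a binary task is assumed for brevity.

```python
# Minimal sketch of per-section (multi-view) stacking with a propositional meta-learner.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

SECTIONS = ["Title", "Abstract", "Introduction", "Methods", "Results", "Conclusions"]

def train_multiview(train_docs, y_train):
    """train_docs: list of dicts mapping section name -> section text; y_train: 0/1 labels."""
    views = {}
    base_outputs = []
    for sec in SECTIONS:
        vec = TfidfVectorizer()
        X = vec.fit_transform([d.get(sec, "") for d in train_docs])
        # A different base learner could be plugged in per view (the thesis also uses ILP here).
        clf = LogisticRegression(max_iter=1000).fit(X, y_train)
        views[sec] = (vec, clf)
        base_outputs.append(clf.predict_proba(X)[:, 1])
    # Propositional meta-learner combines the per-view base predictions.
    # For simplicity it is fit on in-sample predictions; cross-validated
    # predictions would normally be used to avoid optimistic stacking.
    meta = LogisticRegression(max_iter=1000).fit(np.column_stack(base_outputs), y_train)
    return views, meta

def predict_multiview(views, meta, docs):
    cols = []
    for sec in SECTIONS:
        vec, clf = views[sec]
        cols.append(clf.predict_proba(vec.transform([d.get(sec, "") for d in docs]))[:, 1])
    return meta.predict(np.column_stack(cols))
```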
The fundamental hypothesis guiding this thesis was to evaluate whether the use of ILP could improve the results of the classification process. ILP offers some advantages, namely its expressive representation language and its ability to use background knowledge. Embedding an Inductive Logic Programming algorithm in the well-known WEKA data mining tool supports the proposed multi-view representation and makes it easier to use different types of algorithms efficiently. Thus, it became possible to apply different algorithms (base-learners) to each section, with the results later being ensembled (by meta-learners), and to use multi-relational learning (ILP) together with propositional algorithms as base-learners. This approach has contributed to speeding up the construction of ensemble learning solutions that use different types of algorithms. The analysis of the problem of ILP dealing with a high number of attributes led to the development of a new feature selection method, the aforementioned k-BDT. Another aspect analyzed was the search space, whose size makes the execution time of the algorithm quite long; the development of refinement operators was a further contribution of this thesis, helping to speed up ILP execution.

The work presented can be extended with new lines of research, which would also overcome some limitations detected during the thesis work. As future work, seven lines of improvement are suggested:

1. To embed the complete framework pipeline within the WEKA platform;
2. To integrate the k-Best-Discriminative-Terms method into the WEKA platform;
3. To make the ILP/Aleph configuration parameters available in the WEKA platform, so that parameter tuning is easier to set up and test;
4. To improve the background knowledge in order to enhance results while preserving the reduction in execution time;
5. To apply graph mining for feature selection enrichment and validate whether this approach can improve the ILP results and serve as an effective source of background knowledge;
6. To automate the weighting calculation process for each section, which would undoubtedly enhance the system by allowing a better fit to each specific corpus;
7. To automatically determine which models are best suited for the multi-view architecture, treating each corpus and each section with the most appropriate algorithm (base-learner) and then selecting a meta-classifier that outperforms the rest.