Minería de datos en la misión Gaia: visualización del catálogo, optimización del procesado y parametrización de estrellas

  1. Álvarez González, Marco Antonio
Supervised by:
  1. Carlos Dafonte (Director)
  2. Minia Manteiga (Co-director)

Defence university: Universidade da Coruña

Date of defence: 16 September 2019

Committee:
  1. Juan R. Rabuñal (Chair)
  2. Ana María Ulla Miguel (Secretary)
  3. Enrique Solano Márquez (Committee member)

Type: Thesis

Teseo: 599153 · DIALNET · RUC (open access)

Abstract

This Thesis has been developed in the context of the Gaia mission, the cornerstone mission of the European Space Agency (ESA), which is conducting a survey of a billion stars in the Milky Way to generate the largest star catalogue known to date. Such a catalogue poses a great challenge to the computational-astrophysics community. It is estimated that the total data archive will surpass 1 Petabyte and, in order to analyse such a huge amount of data, the Data Processing and Analysis Consortium (DPAC), formed by more than four hundred scientists and engineers, has been organized. The research group in which I developed this Thesis is part of DPAC. Our work is mainly based on the application of Artificial Intelligence techniques to the data gathered by Gaia, and we also develop tools that allow the scientific community to perform its own analyses using these techniques. The main goals of this Thesis are the following:

• Estimate, by means of supervised learning techniques, the main astrophysical parameters of the stars observed by the RVS instrument of Gaia with a sufficient signal-to-noise ratio: effective temperature, logarithm of surface gravity, iron abundance relative to hydrogen (metallicity), and abundances of α-elements relative to iron. We demonstrate the effectiveness of this technique applied to the Gaia data (an illustrative sketch of this kind of regression follows the abstract).

• Provide the scientific community with a useful tool for analysing homogeneous datasets by applying an unsupervised learning technique. Because of the enormous amounts of data that this tool must handle, the optimization of the algorithm used is an essential factor. This work details the techniques that allow the tool to process millions of records while minimizing time consumption (a second sketch after the abstract illustrates clustering at this scale).

• Develop a tool that facilitates the analysis of the results obtained by the classification technique on millions of stellar objects. The tool presents the results through different visualizations, allowing their characteristics to be explored. An optimized data treatment is indispensable because this tool is developed in a Big Data environment. We show that the tool is very useful for analysing data, and we detail the strategies used to visualize sets of millions of astronomical objects in an agile and fluid way.

In all cases, the large amount of data to be processed makes the application of distributed processing techniques mandatory in order to avoid excessive resource consumption (execution time and memory usage) that could prevent a satisfactory execution of the proposed methods. Processing all this information within the framework of the Gaia project requires significant computing capacity, so we develop different optimizations using distributed computing techniques, such as Apache Spark, and graphics processing methods, such as CUDA (a third sketch after the abstract shows the shape of such a distributed job). Another important aspect is that the resulting software must be integrated into the existing execution chains in DPAC and deployed in the associated Data Processing Centre (DPC), which requires adapting the original software to the destination platform. Finally, we demonstrate the usefulness of the unsupervised learning technique in other disciplines: it can improve intrusion detection in network communications traffic and the generation of user profiles to improve social network marketing.
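To make the first goal concrete, the following is a minimal sketch of supervised, multi-output regression of atmospheric parameters from spectra. The abstract does not name the model used in the thesis, so an MLP regressor from scikit-learn stands in for it; the spectra, labels and array sizes below are synthetic placeholders rather than Gaia RVS data.

    # Hedged sketch: multi-output regression of stellar parameters from spectra.
    # All data are synthetic placeholders; a real pipeline would train on labelled
    # reference spectra and apply the model to observed RVS spectra.
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPRegressor
    from sklearn.metrics import mean_absolute_error

    rng = np.random.default_rng(0)
    n_stars, n_pixels = 2000, 800                        # placeholder sizes, not the RVS sampling
    X = rng.normal(1.0, 0.05, size=(n_stars, n_pixels))  # fake normalized fluxes
    y = np.column_stack([
        rng.uniform(4000, 8000, n_stars),   # Teff [K]
        rng.uniform(1.0, 5.0, n_stars),     # log g
        rng.uniform(-2.5, 0.5, n_stars),    # [Fe/H]
        rng.uniform(-0.2, 0.6, n_stars),    # [alpha/Fe]
    ])

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    model = MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=300, random_state=0)
    model.fit(X_tr, y_tr)

    pred = model.predict(X_te)
    for i, name in enumerate(["Teff", "log g", "[Fe/H]", "[alpha/Fe]"]):
        print(f"{name}: MAE = {mean_absolute_error(y_te[:, i], pred[:, i]):.3f}")

On random data the reported errors are meaningless; the sketch only illustrates the assumed workflow: labelled reference spectra in, a multi-output regressor, predicted Teff, log g, [Fe/H] and [α/Fe] out.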
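The abstract does not identify the unsupervised algorithm either. Purely as an illustration of clustering a large, homogeneous dataset with bounded time and memory, this sketch uses scikit-learn's MiniBatchKMeans; the feature table, its dimensions and the number of clusters are assumptions, not the thesis' actual method or data.

    # Hedged sketch: unsupervised clustering of a large, homogeneous dataset.
    # MiniBatchKMeans is used only as an example of a scalable algorithm; it is
    # not necessarily the technique developed in the thesis.
    import numpy as np
    from sklearn.cluster import MiniBatchKMeans
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(1)
    # Placeholder feature table: e.g. a few colours/magnitudes per source.
    features = rng.normal(size=(1_000_000, 8)).astype(np.float32)

    X = StandardScaler().fit_transform(features)
    km = MiniBatchKMeans(n_clusters=50, batch_size=10_000, random_state=1)
    labels = km.fit_predict(X)          # one cluster id per source

    # Cluster sizes give a first view of the structure found in the data.
    ids, counts = np.unique(labels, return_counts=True)
    print(dict(zip(ids.tolist(), counts.tolist())))

Mini-batch updates keep the cost per iteration roughly constant as the number of rows grows, which is the property the abstract emphasises when it speaks of processing millions of records while minimizing time consumption.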
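Finally, as an illustration of the kind of distributed processing that Apache Spark enables in such a pipeline, the sketch below aggregates a hypothetical per-source table in parallel. The input and output paths and the choice of columns are assumptions made for the example, not the thesis' actual DPAC data products (phot_g_mean_mag and radial_velocity are, however, genuine Gaia archive column names).

    # Hedged sketch: distributing a per-source aggregation with Apache Spark.
    # Paths and the table layout are hypothetical placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("gaia-aggregation-sketch").getOrCreate()

    sources = spark.read.parquet("/data/gaia/sources.parquet")  # hypothetical input

    # Group sources into one-magnitude-wide bins and compute simple statistics;
    # Spark executes this in parallel over the partitions of the input table.
    stats = (sources
             .withColumn("g_bin", F.floor(F.col("phot_g_mean_mag")))
             .groupBy("g_bin")
             .agg(F.count("*").alias("n_sources"),
                  F.avg("radial_velocity").alias("mean_rv")))

    stats.write.mode("overwrite").parquet("/data/gaia/stats_by_g_bin.parquet")  # hypothetical output
    spark.stop()

The same job definition runs unchanged on a laptop or on a cluster; only the Spark deployment configuration changes, which is what makes this style of processing attractive when the input grows toward the Petabyte scale mentioned above.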