Development of efficient De Bruijn graph-based algorithms for genome assembly

  1. Freire Castro, Borja
Supervised by:
  1. José Ramón Paramá Gabia Co-director
  2. Leena Salmela Co-director

Defence university: Universidade da Coruña

Fecha de defensa: 10 January 2023

Committee:
  1. Christian Boucher Chair
  2. Antonio Fariña Secretary
  3. Anália Lourenço Committee member

Type: Thesis

Teseo: 780733 DIALNET lock_openRUC editor

Abstract

During the last two decades, thanks to the development of new sequencing techniques, the study of the genome has become very popular in order to discover the genetic variation present in both humans and other organisms. The predominant mode of genome analysis is through the assembly of reads in one or multiple chains for as long as possible. The most traditional way of assembly is the one that involves reads from a single genome. In this field, in the last decade, third-generation readings have emerged with new challenges for which there are no efficient solutions. The first contribution that has been made in this thesis is Compact-Flye, a tool for the efficient assembly of third-generation reads on the Flye algorithm. This tool is based on the ingenious use of compact data structures to improve typical assembly steps such as counting and indexing k-mers. Apart from the assembly of a genome, there are techniques that seek to assemble all the genomes contained in a given sample. This assembly is known as multiple sequence assembly or haplotype reconstruction, a subject also treated in this thesis. Our first approach to solving this has been viaDBG, which is the first solution based on de Bruijn graphs that offers results comparable to current techniques in viral genome assembly while maintaining the efficiency of these graphs. Our second contribution is ViQUF, which is a natural improvement on its predecessor. ViQUF completely changes the algorithm of viaDBG but continues to be based on the same structures, although with some variations that allow it not only to improve results in terms of time and quality, but also to provide additionalinformation such as an estimate of the relative presence of each species in the sample.