Thesis of Tatiana Rocher

Compressing and Indexing labeled sequences

This thesis in text algorithmics studies the compression, indexing and querying of labeled texts. A labeled text is a text annotated with additional information. For example, in a V(D)J recombination, a marker for lymphocytes, the text is a DNA sequence and the labels are the names of the genes. A person's immune system can be represented as a set of V(D)J recombinations. High-throughput sequencing gives access to millions of V(D)J recombinations, which must be stored and then retrieved and compared quickly. The first contribution of this thesis is a compression method for labeled texts based on storage by references: the text is divided into sections, each of which points to a pre-established labeled sequence. The second contribution consists of two indexes for labeled texts. Both use a Burrows-Wheeler transform to index the text and a Wavelet Tree to index the labels. These indexes allow efficient queries on the text, on the labels, or on both. We would like to use one of these indexes on V(D)J recombinations obtained by hematology departments during the diagnosis or follow-up of patients suffering from leukemia.
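To make the idea of querying both the text and its labels concrete, here is a minimal sketch, not the data structures of the thesis. It builds a Burrows-Wheeler transform naively from a sorted suffix array, runs a backward search for a pattern, and reports the label at the start of each occurrence. It makes several simplifying assumptions: the labels are kept in a plain per-position list rather than a Wavelet Tree, rank queries are done by linear counting rather than with a succinct structure, and all function names (`suffix_array`, `backward_search`, `find_labeled`) and the example data are hypothetical.

```python
def suffix_array(text):
    # Naive construction: sort suffix start positions by the suffix itself.
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_sa(text, sa):
    # The BWT character of a row is the character preceding that suffix
    # (index -1 wraps around to the sentinel for the suffix starting at 0).
    return "".join(text[i - 1] for i in sa)


def backward_search(bwt, pattern):
    # Returns the half-open range [lo, hi) of suffix-array rows whose
    # suffix starts with `pattern`, or (0, 0) if there is no occurrence.
    first_col = sorted(bwt)
    # C[c] = number of characters in the text strictly smaller than c.
    C = {}
    for idx, c in enumerate(first_col):
        C.setdefault(c, idx)
    lo, hi = 0, len(bwt)
    for c in reversed(pattern):
        if c not in C:
            return 0, 0
        # Linear-time rank; a real index would use a succinct rank structure.
        lo = C[c] + bwt[:lo].count(c)
        hi = C[c] + bwt[:hi].count(c)
        if lo >= hi:
            return 0, 0
    return lo, hi


def find_labeled(text, labels, pattern):
    # Assumption: labels[i] is the label of text position i (plain list,
    # standing in for the Wavelet Tree mentioned in the abstract).
    if not pattern:
        return []
    t = text + "$"  # sentinel, assumed smaller than every text character
    sa = suffix_array(t)
    bwt = bwt_from_sa(t, sa)
    lo, hi = backward_search(bwt, pattern)
    return sorted(
        (sa[r], labels[sa[r]]) for r in range(lo, hi) if sa[r] < len(text)
    )


# Hypothetical toy data: a short DNA string whose first half is labeled
# "V" and second half "J".
text = "ACGTACGA"
labels = ["V"] * 4 + ["J"] * 4
hits = find_labeled(text, labels, "ACG")
```

In this toy example `hits` lists each start position of `ACG` together with the label at that position, so a query on the text can be filtered by label afterwards. A Wavelet Tree over the labels, as described in the abstract, would support such combined queries without storing the labels as a plain list.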

Jury

Thesis supervisors: Mathieu Giraud, Mikaël Salson
Reviewers: Guillaume Blin, Lynda Tamine-Techani
Examiners: Arnaud Lefebvre, Laëtitia Jourdan

Thesis of the Bonsai team, defended on 12/02/2018