CRIStAL - Centre de Recherche en Informatique et Automatique de Lille

Camille Marchet's HDR defence

September 4, 2025

HDR defence of Camille MARCHET, Thursday 04 September 2025 from 13:30 in the Atrium, ESPRIT building, CRIStAL laboratory.

Title of the habilitation: A feeling for the index: k-mer data-structures for data reuse in large scale genomics and transcriptomics

Composition of the jury:

Reporter: Sarah Djebali, Research Associate, IRSD, INSERM U1220, Toulouse
Reporter: Elodie Laine, Professor, CQSB CNRS UMR 7238; INSERM 1284, IBPS, Sorbonne University, Paris
Reporter: Sven Rahmann, Professor, Chair of Algorithmic Bioinformatics at Saarland University
Examiner: Christina Boucher, Professor, University of Florida
Examiner: Guy Perrière, Director of Research, LBBE CNRS UMR 5558, Claude Bernard University Lyon 1
Guarantor: Rémi Bardenet, Director of Research, CRIStAL UMR 9189 CNRS, University of Lille

Abstract:

This habilitation thesis explores algorithmic strategies for indexing large-scale biological sequence datasets comprising billions of objects and terabytes to petabytes of raw data. The work focuses on DNA and RNA sequences treated as text, and draws on several years of personal research at the CRIStAL laboratory on the challenges of designing structures that organize k-mer sets, that is, sets of short, fixed-length substrings of sequences. By extracting k-mers at every possible position, DNA and RNA sequences are tokenized into sets that conserve relevant biological information, enabling scalable and efficient analysis.
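As an illustration of the tokenization described above, here is a minimal Python sketch that extracts a k-mer at every position of a sequence and collects the results into a set. The function name `kmer_set` and the example sequence are hypothetical and chosen for illustration only; they are not taken from the thesis or from any of its tools.

```python
def kmer_set(sequence: str, k: int) -> set[str]:
    """Return the set of all length-k substrings (k-mers) of `sequence`.

    Illustrative helper only; the name and signature are assumptions.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    # Slide a window of width k over every start position 0 .. len - k.
    # A sequence shorter than k yields an empty set.
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


# The 3-mer "ACG" occurs twice in this sequence but is stored once in the set.
example = kmer_set("ACGTACGA", 3)
print(sorted(example))
```

Because the result is a set, repeated k-mers collapse to a single entry. Real tools typically replace such a plain set of strings with compact encodings and dedicated data structures.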

As sequencing technologies produce exponentially growing volumes of RNA and DNA data, the need for efficient, scalable, and interpretable data structures becomes central to enabling meaningful analysis. This thesis presents a structured overview of existing k-mer representation families, from De Bruijn graphs to Burrows-Wheeler-transform-inspired methods, emphasizing their computational properties and trade-offs. It introduces several original contributions, including a static, fast, and memory-efficient dictionary, as well as a dynamic structure that leverages textual regularities to support optimized set operations.

Additionally, I detail methods for handling multi-sample k-mer sets (sets of k-mer sets), leading to REINDEER, a tool specifically optimized for RNA abundance indexing across thousands of datasets. Practical applications in clinical research contexts, such as leukemia studies, illustrate the real-world impact of these innovations.
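To make the idea of a "set of k-mer sets" with abundances concrete, the following naive Python sketch maps each k-mer to its count in every sample where it occurs. It is a plain hash-map illustration under simplifying assumptions, not REINDEER's actual data structure or interface; the function name `build_abundance_index` and the sample names are hypothetical.

```python
from collections import Counter, defaultdict


def build_abundance_index(samples: dict[str, list[str]], k: int) -> dict[str, dict[str, int]]:
    """Map each k-mer to {sample name: number of occurrences in that sample}.

    Naive illustration only; `samples` maps a sample name to its list of reads.
    """
    index: dict[str, dict[str, int]] = defaultdict(dict)
    for name, reads in samples.items():
        counts: Counter[str] = Counter()
        for read in reads:
            # Count every k-mer at every position of the read.
            counts.update(read[i:i + k] for i in range(len(read) - k + 1))
        for kmer, count in counts.items():
            index[kmer][name] = count
    return dict(index)


# Two toy samples (hypothetical data).
toy_samples = {
    "sample_A": ["ACGTAC"],
    "sample_B": ["ACGTACGT"],
}
toy_index = build_abundance_index(toy_samples, k=4)
print(toy_index["ACGT"])
```

Querying a k-mer then returns its abundance per sample. A dictionary of dictionaries like this one would not scale to thousands of datasets, which is the kind of limitation that motivates specialized structures.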

The discussion concludes with challenges related to integrating such structures into existing and future international genomic repositories. I advocate for a broader perspective on data structure research, designing tools that remain accessible to a wide user community through smart queries, which in turn push the boundaries of current data structure design.

Atrium room, ESPRIT building, CRIStAL laboratory, Villeneuve d'Ascq


