Camille Marchet's HDR defence

on September 4, 2025

HDR defence of Camille MARCHET, Thursday 04 September 2025 from 13:30 in the Atrium, ESPRIT building, CRIStAL laboratory.

Title of the habilitation: A feeling for the index: k-mer data-structures for data reuse in large scale genomics and transcriptomics

Composition of the jury:

  • Reporter: Sarah Djebali, Research Associate, IRSD, INSERM U1220, Toulouse

  • Reporter: Elodie Laine, Professor, CQSB CNRS UMR 7238; INSERM 1284, IBPS, Sorbonne University, Paris

  • Reporter: Sven Rahmann, Professor, Chair of Algorithmic Bioinformatics at Saarland University

  • Examiner: Christina Boucher, Professor, University of Florida

  • Examiner: Guy Perrière, Director of Research, LBBE CNRS UMR 5558, Claude Bernard University Lyon 1

  • Guarantor: Rémi Bardenet, Director of Research, CRIStAL UMR 9189 CNRS, University of Lille

Abstract:

This habilitation thesis explores algorithmic strategies for indexing large-scale biological sequence datasets comprising billions of objects and terabytes to petabytes of raw data. The work treats DNA and RNA sequences as textual data and draws on several years of personal research, centered at the CRIStAL laboratory, on the challenges of designing structures that organize k-mer sets, that is, sets of short, fixed-length substrings of sequences. By extracting a k-mer at every possible position, DNA and RNA sequences are tokenized into sets that conserve relevant biological information, enabling scalable and efficient analysis.
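As a rough illustration of the tokenization described above, here is a minimal Python sketch. It is not code from the thesis: the function name `kmers` and the example read are hypothetical, and a plain Python set stands in for the specialized index structures the work is actually about.

```python
def kmers(sequence: str, k: int) -> set:
    """Return the set of all length-k substrings (k-mers) of `sequence`.

    Hypothetical helper for illustration only; real k-mer indexes use
    far more compact representations than a Python set of strings.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    # Slide a window of width k over every valid start position.
    # If the sequence is shorter than k, the range is empty.
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


read = "ACGTACG"  # made-up example read
print(sorted(kmers(read, 3)))  # ['ACG', 'CGT', 'GTA', 'TAC']
```

The read has five windows of length 3, but "ACG" occurs twice, so the resulting set holds four distinct k-mers: positions and multiplicities are dropped, only membership is kept.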

As sequencing technologies produce exponentially growing volumes of RNA and DNA data, the need for efficient, scalable, and interpretable data structures becomes central to enabling meaningful analysis. This thesis presents a structured overview of existing k-mer representation families, from De Bruijn graphs to Burrows-Wheeler-transform-inspired methods, emphasizing their computational properties and trade-offs. It introduces several original contributions, including a static, fast, and memory-efficient dictionary, as well as a dynamic structure that leverages textual regularities to support optimized set operations.

Additionally, I detail methods for handling multi-sample k-mer sets (sets of k-mer sets), leading to REINDEER, a tool specifically optimized for RNA abundance indexing across thousands of datasets. Practical applications in clinical research contexts, such as leukemia studies, illustrate the real-world impact of these innovations.

The discussion concludes with challenges related to integrating such structures into existing and future international genomic repositories. I advocate for a broader perspective on data structure research, designing tools that remain accessible to a wide user community through smart queries, which in turn push the boundaries of current data structure design.

Atrium room, ESPRIT building, CRIStAL laboratory, Villeneuve d'Ascq
