ASTER - Algorithms and software for third generation RNA sequencing

Project leader: Hélène Touzet, Bonsai team

Fundamental research
Collaborative research project

We propose to develop algorithms and software for analyzing third generation sequencing data. Third generation sequencing is an emerging technology that promises to give a better picture of the genomes, transcriptomes, metagenomes and metatranscriptomes of all living organisms. It will be key to discovering new fundamental mechanisms in cell biology, with broad implications for environmental research, health and agriculture. Compared to second generation sequencing, third generation sequencing is able to produce fragments that cover significantly
larger regions of the molecule, up to several thousand bases. This important
feature makes it possible to overcome the main limitations of second generation sequencers and offers a real potential for disruption. Remarkably, this transition does not significantly affect the difficulty or cost of obtaining sequence data. One can even expect that third generation sequencing will further promote easy access to sequencing technologies with the advent of low-cost and highly portable instruments, such as the MinION commercialized by Oxford Nanopore Technologies. In this project, we focus on transcriptome sequencing by nanopore technology. Transcriptome sequencing is the sequencing of expressed RNA in a population of cells. It is of great interest for understanding what fraction of the genome is expressed and for characterizing it, and it serves as a basis for multiple downstream analyses, including gene prediction, gene expression regulation, variant calling and species identification. However, analyzing these data is computationally challenging due to a very high rate of sequencing errors on the one hand and the intrinsic complexity of transcriptomes on the other hand. There is therefore a pressing need
for models and algorithms that can accommodate this new kind of data and that are also scalable. To this end, we will develop innovative computational analysis methods for transcriptomes (RNA from a single organism), 16S ribosomal RNA and metatranscriptomes (RNA sampled from a community). We will consider several settings, depending on whether a reference genome and/or supporting second generation data are available. This will give rise to a number of specialized algorithms for several complementary primary analysis steps: alignment, error correction, identification of gene structures, identification of variants. To achieve these goals, we will make use of state-of-the-art techniques in text algorithms and invent new ones: new models for seeds, alignment-free heuristics, compression, graph structures, text indexes. The project unites two expert groups in bioinformatics algorithms (Bonsai, CRIStAL in Lille and Erable, LBBE in Lyon), and two sequencing and analysis platforms that have been very active in the MinION Access Program (Genoscope and Institut Pasteur de Lille). Bonsai and Erable both have long-standing experience in the design of algorithms and software for high-throughput sequencing data analysis (KisSplice, CRAC, and SortMeRNA). Genoscope and Institut Pasteur de Lille will give all partners of the project early access to the latest data from the MinION and the upcoming PromethION, as well as an expert view on these data. For example, Genoscope has recently developed NaS, a comprehensive
bioinformatics pipeline for error correction of nanopore data. All algorithms proposed
within the project will be made available to a broader community through the development of open-source, user-friendly bioinformatics software, which will benefit from fast dissemination through the national network France Génomique and high-level publications. In addition, the underlying components will be added to the GATB library, which will further increase the audience of this work. The generated sequencing data will also be made publicly available and deposited in open archives, in order to serve as benchmarks for other research groups.

Keywords:
algorithmics, sequencing, bioinformatics