Thesis of Adrien Boiret

Normalization and Learning of Transducers on Trees and Words

Since the arrival of the Web, various kinds of semi-structured data formats were introduced in the areas of computer science and technology relevant for the Web, such as document processing, database management, knowledge representation, and information exchange. In this thesis, we study the conversion of semi-structured data from one schema to another. For document processing, the most powerful solutions to this problem were proposed by the XML technology. In the XML format, semi-structured data is restricted to data trees, so that schemas can be defined by tree automata, possibly enhanced by constraints on data values. Document transformations can be defined in XSLT, a purely functional programming language with logical XPath queries. The core of XSLT are macros tree transducers with navigation by XPath queries. We contribute new learning algorithms on tree transducers, that are based on methods from grammatical inference. We address three limitiations of previous approaches: schema restrictions, lookaheads, and concatenation in the output. 1. For deterministic top-down tree transducers with regular domain inspection, we show a normal form and a Gold style learning algorithm in with limited resources. 2. We show how to learn rational functions, described by deterministic transducers of words with lookahead. We propose a new normal form for such transducers which provides a compromise between lookahead and state minimization, that leads to a learning algorithm in Gold’s learning model with polynomial resources. 3. For linear tree-to-word transducers with concatenation in the output, we present a normal form and show how to decide equivalence in polynomial time.

Jury

Directeurs de thèse : Joachim Niehren, Aurélien Lemay (co-encadrant) Rapporteurs : Olivier Carton, Helmut Seidl Examinateurs : Sebastian Maneth, Sophie Tison

Thesis of the team LINKS defended on 07/11/2016