< Back to previous page

Publication

Foundations of XML: Regular Expressions Revisited

Book - Dissertation

The regular languages constitute a highly robust class of languages. They can be represented by many different formalisms such as finite automata [92], regular expressions [67], finite semigroups [97], monadic second-order logic [18], right-linear grammars [21], quantifier-free first-order updates [39], ... Regular languages are closed under a wide range of operations and, even more importantly, almost any interesting problem concerning regular languages is decidable. All kinds of aspects of the regular languages have been studied over the past 50 years. From a practical perspective, the most widespread way to specify regular languages is by using regular expressions (REs). They are used in applications in many different areas of computer science, including bioinformatics [84], programming languages [108], model checking [105], and XML schema language [100]. XML is the lingua franca and the de facto standard for data exchange on the Web. When two parties exchange data in XML, the XML documents usually adhere to a certain format. Thereto, XML schema languages are used to describe the structure an XML document can have. The most common XML schema languages are DTD, XML Schema [100], both W3C standards, and Relax NG [22]. From a formal language theory point of view each of these is a grammar-based formalism with regular expressions at their right-hand sides. These expression, however, differ from the standard regular expressions in that they are extended by additional operators but are also restricted by the requirement to be deterministic. Although these requirements are recorded in W3C and ISO standards, it is not clear what their impact on the different schema languages is. The goal of this thesis therefore is a study of these consequences; in particular, we study the complexity of optimization in the presence of additional operators, illustrate the difficulties of migrating from one schema language to another, and study the implications of the determinism constraint (also in the presence of additional operators) and try to make it accessible in practice. Although the questions we ask are mainly inspired by questions about XML, we believe that the answers are also interesting for the general theoretical computer science community as they answer fundamental questions concerning regular expressions. Therefore, this thesis consists of two parts. In the first part, we study fundamental aspects of regular expression. In the second, we apply the former results to applications concerning XML schema languages. Although most of the work is of a foundational nature, we also developed software to bring these theoretical insights to practice.
Number of pages: 2006
Publication year:2009
Accessibility:Open