The Theory and Practice of Data Description (thesis)
Massive amounts of useful data are stored and processed in ad hoc formats for which critical tools like parsers and formatters do not exist. Ad hoc data formats are often poorly documented, and the data itself can be very large scale with a significant number of errors, like missing or malformed data and out-of-range values. Traditional databases and XML systems provide rich infrastructure for processing well-behaved data, but are of little help when dealing with data in ad hoc formats.
In this thesis, we discuss our attempts to address the challenges of ad hoc data. We explain the design and implementation of PADS/ML, a new language and system that facilitates generation of data processing tools for ad hoc formats. The PADS/ML design includes features such as dependent, polymorphic and recursive datatypes, which allow programmers to describe the syntax and semantics of ad hoc data in a concise, easy-to-read notation. The PADS/ML implementation compiles these descriptions into ML structures and functors that include types for parsed data, functions for parsing and printing, and auxiliary support for user-specified, format-dependent and format-independent tool generation.
In addition, we present a general theory of data description languages like PADS/C, PADS/ML, DataScript, and PacketTypes. In the spirit of Landin, we present a calculus of dependent types to serve as the semantic foundation for this family of languages. In the calculus, each type describes the physical layout and semantic properties of a data source. In the semantics, we interpret types simultaneously as the in-memory representation of the data described and as parsers for the data source. The parsing functions are robust, automatically detecting and recording errors in the data stream without halting parsing. We show the parsers are type-correct, returning data whose type matches the simple-type interpretation of the specification. We also prove the parsers are ``error-correct,'' accurately reporting the number of physical and semantic errors that occur in the returned data. We use the calculus to describe the features of various data description languages. Finally, we discuss how the semantics has impacted the PADS/C and PADS/ML implementations.