Language Support for Processing Ad Hoc Data
The term ad hoc data refers to the billions of bytes of non-standard and continuously evolving data spread across all computer systems.
Such data includes server logs, distributed system performance and debugging data, telephone call records, financial data and online repositories of scientific data.
This paper presents PADS/D, a system that generates monitoring, analysis and transformation tools for distributed ad hoc data from declarative specifications. The generated tools include an archiver, a database loading system, a statistical analyzer, an alert system, an RSS feed generator, and debugging tools. In addition, the system generates libraries for application developers, including modules for parsing, printing, error management, data traversal and transformation which developers can use to create their own application-specific tools. Advanced users can build new generic tools applicable to any collection of data sources.
The PADS/D data description language allows data analysts to specify where their ad hoc data is located, how to obtain it, when to get it (or give up trying), and what preprocessing the system should do when it arrives. As its name suggests, PADS/D is layered on top of the PADS sublanguage, developed in previous research efforts, for specifying the format of the data sources. We illustrate the expressiveness of PADS/D by giving descriptions for several different distributed systems including CoMon, the monitoring system for PlanetLab, and a monitoring system for a web hosting service provided by AT&T.We define a formal semantics for the language, describe our implementation, and evaluate its performance. We show our system is capable of scaling to distributed systems the size of CoMon, the current monitor for Planetlab's 800+ nodes.