Distributed Queries and Incremental Updates in Information Retrieval Systems (Thesis)
Abstract:
The proliferation of the world's ``information highways'' has renewed
interest in efficient document indexing techniques. This thesis
explores the architecture of information retrieval systems for
querying and indexing documents. Distributed queries are studied with
analytical and trace-driven simulations. We focus on physical index
design, inverted index caching, and database scaling in a distributed
system. All three issues influence response time and throughput.
Incremental updates of inverted lists are studied using a new
dual-structure index data structure. This index structure separates
long and short inverted lists dynamically and optimizes the retrieval,
update, and storage of each type of list. To study the behavior of
the index, engineering trade-offs are described that favor either
update time or query performance. We explore these trade-offs
quantitatively by using actual data and hardware and simulation to
determine the best algorithm under a variety of criteria. Finally,
implementation of our incremental update algorithms is compared to an
existing information retrieval system.