10-14
The Online Index-Merging Problem

We discuss the online index-merging problem: documents arrive at a system continuously, and must be immediately indexed for fulltext search. Meanwhile, the system is continuously answering queries on its fulltext index. The system is I/O-limited, so a single online indexing data structure (e.g. Btree) cannot be used: this approach would require a random I/O for every word in every document that arrives. Instead, multiple off-line index structures are used. The off-line indexes are precomputed in memory and written out using only sequential I/O; at any time two or more of the indexes can be merged into a single new index using only sequential I/O. Every query must consult every off-line index, so the I/O cost of a query is proportional to the number of indexes. Thus, there is a tension between query cost and merging cost: by spending I/O on merging indexes, we can reduce the I/O required for future queries. The online index-merging problem is to find a merging schedule that minimizes the total I/O cost.

The problem is related to several well-studied problems in network design, but possesses some unique characteristics. For a restricted class of inputs, we describe an index-merging schedule that is optimal up to a small constant factor. For general inputs, we propose (by analogy with the ski rental problem) an algorithm with apparently good empirical properties, but it has so far resisted our attempts to prove nontrivial performance bounds.

Joint work with Frank McSherry.

Date and Time

Friday October 14, 2005 4:00pm - 5:30pm

Location

Computer Science Small Auditorium (Room 105)

Event Type

CS Colloquium Series

Speaker

John MacCormick, from MSR

Host

Kai Li

Contributions to and/or sponsorship of any event does not constitute departmental or institutional endorsement of the specific program, speakers or views presented.

CS Talks Mailing List

10-14 The Online Index-Merging Problem

10-14
The Online Index-Merging Problem