We discuss the online index-merging problem: documents arrive at a system continuously, and must be immediately indexed for fulltext search. Meanwhile, the system is continuously answering queries on its fulltext index. The system is I/O-limited, so a single online indexing data structure (e.g. Btree) cannot be used: this approach would require a random I/O for every word in every document that arrives. Instead, multiple off-line index structures are used. The off-line indexes are precomputed in memory and written out using only sequential I/O; at any time two or more of the indexes can be merged into a single new index using only sequential I/O. Every query must consult every off-line index, so the I/O cost of a query is proportional to the number of indexes. Thus, there is a tension between query cost and merging cost: by spending I/O on merging indexes, we can reduce the I/O required for future queries. The online index-merging problem is to find a merging schedule that minimizes the total I/O cost.
The problem is related to several well-studied problems in network design, but possesses some unique characteristics. For a restricted class of inputs, we describe an index-merging schedule that is optimal up to a small constant factor. For general inputs, we propose (by analogy with the ski rental problem) an algorithm with apparently good empirical properties, but it has so far resisted our attempts to prove nontrivial performance bounds.
Joint work with Frank McSherry.