Query-Independent Ranking for Large-Scale Persistent Search Systems
Existing search services rely heavily on citation-based authority (e.g.
PageRank) to assess the quality of publications. The quality and
relevance of results are particularly important in persistent search, but
current rank computations are strongly biased against new pages. We
propose SiteRank, a new ranking mechanism that handles new publications
well and also dramatically reduces the computation costs.
This performance improvement is especially valuable when authority is
computed in a persistent search service. Current systems, whether
small-scale notifiers (e.g. CNN Alerts) or persistent queries on
traditional search engines (e.g. Google Alerts), suffer from limited
coverage, low refresh rates, or both. We propose Distributed Persistent
Search (DPS), a new architecture based on a publish-subscribe framework
in which publication processing and notification routing speed up
linearly with the number of servers used.
To fully utilize the distributed architecture of DPS and eliminate the
rank server as a single point of failure, we also
propose Distributed SiteRank, a fully distributed citation-based rank
computation which scales well with the number of documents and can be
used in both traditional and persistent search systems.
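
The bias against new pages mentioned above can be illustrated with a minimal sketch of standard PageRank power iteration. This is not SiteRank, whose details are not given here; the toy graph, the page names, and the `pagerank` helper are all hypothetical and chosen only for illustration. A page that has not yet been cited receives only the baseline score (1 - d)/n, regardless of its content.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Standard power-iteration PageRank; links maps page -> list of cited pages."""
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages with no outgoing links is spread uniformly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for src, targets in links.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for t in targets:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: three established pages cite each other;
# "new" cites others but has not yet been cited by anyone.
links = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a"],
    "new": ["a", "b"],
}
ranks = pagerank(links)
lowest = min(ranks, key=ranks.get)
print(lowest, ranks[lowest])
```

In this toy graph the uncited page ends up with the lowest score, which is the behavior a ranking mechanism for persistent search would need to correct for freshly published documents.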