Robotized Meta Search Engine -- Final Project for COS 598b, Information Access: Issues for Web and Digital Libraries

There are numerous search engines for the web, and a number of meta search engines that try to combine the advantages of multiple search engines. The key point for a meta engine is to know which engines are best for any given query. Existing meta engines collect such information by remembering and comparing the results that various engines returned for the queries clients submitted.

This project presents a more aggressive approach to collecting such information, inspired by the use of spiders (robots) in regular search engines. Given a document fetched from a web server, a robotized search engine recursively follows the links in the document and fetches more documents, so that it can collect information about the relevance of documents to keywords. My robotized meta search engine collects information about the relevance of search engines to keywords in a similar way: given the documents returned by various search engines for a client query, my meta engine recursively finds more keywords in the documents and sends the search engines new queries with these keywords. The hit count returned by a search engine for a given keyword is stored in a local database as the relevance value of the engine to the keyword.

When the meta engine receives a query from a client, it searches the local database for the keywords in the query, selects a certain number of engines with the highest relevance values for those keywords, sends the query to the selected engines, sorts the results from the engines, and returns them to the client.

The main design issues in this project are 1) how to expand the information database from the given one and 2) how to represent the relevance of an engine to a keyword.
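The query-time selection step described above (look up the query keywords in the local database, then pick the engines with the highest relevance values) can be illustrated with a minimal Python sketch. It is not the project's Perl code. The in-memory table, the placeholder engine names, the function name `select_engines`, and the choice to sum hit counts across multiple query keywords are all assumptions made for illustration; the report does not say how relevance values for several keywords are combined.

```python
from collections import defaultdict

# Hypothetical relevance table: (engine, keyword) -> stored hit count.
# In the project this information lives in a database; a dict stands in here.
RELEVANCE = {
    ("engine_a", "perl"): 12000,
    ("engine_b", "perl"): 300,
    ("engine_c", "perl"): 4500,
    ("engine_a", "spider"): 50,
    ("engine_b", "spider"): 9000,
}


def select_engines(query, relevance, top_n=2):
    """Return up to top_n engine names, ranked by relevance to the query.

    Assumption: an engine's score is the sum of its stored hit counts
    over all keywords that appear in the query.
    """
    keywords = set(query.lower().split())
    scores = defaultdict(int)
    for (engine, keyword), hits in relevance.items():
        if keyword in keywords:
            scores[engine] += hits
    # Highest score first; ties are broken by engine name for determinism.
    ranked = sorted(scores, key=lambda engine: (-scores[engine], engine))
    return ranked[:top_n]


if __name__ == "__main__":
    print(select_engines("perl spider", RELEVANCE))
```

Engines with no stored relevance for any query keyword never receive a score, so they are simply not selected; a real meta engine would probably need a fallback for queries whose keywords are all unknown.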
To expand the information database from the given one, which consists of the results from the engines for clients' queries, a natural and straightforward approach is to recursively send more queries to the engines based on the previous results. We are more interested in the keywords an engine is relevant to than in those it is not so relevant to; therefore we want to send new queries with keywords that will likely get high hit counts from the engine. For this purpose I choose the words that appear most frequently in the returned documents but not in the query, because engines, especially spider-based engines, have semantic locality.

The hit count of an engine for a given keyword represents the amount of information on this keyword in the engine's local database and is therefore a good candidate for the relevance value. However, different engines attach different meanings to the numbers they return in a query result. For example, AltaVista says "14326 documents match your query...", Yahoo says "Found 5 categories and 82 sites for ...", Excite says "598387 hits...", etc. We need engine-specific adjustments for those numbers so that they can be fairly compared with each other. This has not been implemented yet.

There are some other issues common to meta engines or spiders, such as translating operators, filtering and merging results, load control on web servers (or on search engines, in the case of my spider), etc. Unfortunately I do not have time to address all of them in this project.

The meta engine and spider are implemented in Perl 5.004 with database driver support. The meta engine is a CGI program, and the spider is a daemon process that spawns child processes upon requests from the meta engine. Both talk to search engines via lynx and access the database via a mysql daemon. They talk to each other via internet sockets.
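The keyword-expansion rule described above (take the words that appear most frequently in the returned documents but not in the query) can be sketched in Python as follows. This is only an illustration under stated assumptions, not the project's Perl implementation: the function name `next_keywords`, the tiny `STOP_WORDS` set, the `min_length` cutoff, and the plain-text tokenization are all hypothetical choices.

```python
import re
from collections import Counter

# Illustrative stop-word list (an assumption; a real list would be much longer).
STOP_WORDS = {"this", "that", "with", "from", "yours"}


def next_keywords(query, documents, count=3, min_length=4):
    """Pick candidate keywords for the spider's next round of queries.

    Returns the `count` most frequent words found in `documents` that are
    not already in `query`, skipping stop words and very short words.
    """
    query_words = set(re.findall(r"[a-z]+", query.lower()))
    frequency = Counter()
    for text in documents:
        for word in re.findall(r"[a-z]+", text.lower()):
            if len(word) < min_length:
                continue
            if word in STOP_WORDS or word in query_words:
                continue
            frequency[word] += 1
    # Most frequent first; ties are broken alphabetically for determinism.
    ranked = sorted(frequency.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:count]]


if __name__ == "__main__":
    docs = [
        "Perl spider fetches pages; the spider follows links.",
        "A spider stores links and pages from Perl scripts.",
    ]
    print(next_keywords("perl", docs))
```

Each returned word would then be sent to the engines as a new query, and the hit counts that come back would be recorded as relevance values, which is what makes the process recursive.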
The main implementation issue is to extract information from the query results of the various engines, such as the hit counts and the links to the documents containing the keywords, as opposed to the links to the engines' advertising clients. I define search patterns in Perl regular expression format for each supported engine and keep redundant code to a minimum. Another issue is to exclude words like "the", "yours", etc. from keywords. I did not find a source for the whole set of such words, so I manually typed in all such words I could think of and excluded words shorter than four letters.

Future work on this project would include recognition of "next 10 pages" pointers in returned results, operator translation, invitation to real-world use of the meta engine, measurement of improvements in meta search results, etc.

This project was a good lesson for me in writing applications that talk to commercial search engines and in exploring various aspects of the web. My belief is that a computer scientist will fail when he/she tries to replace human intelligence with computers. Many web issues lie on the edge between such risk and valuable research, and it is non-trivial to determine where to advance and where to stop.

Minwen Ji
Department of Computer Science
Princeton, NJ 08544
mji@cs.princeton.edu
http://www.cs.princeton.edu/~mji