
Philip Shilane & Victor Shnayder
Summary:
Our goal for a memex browser is to archive all of the articles a user views with a web browser and enable efficient searching and retrieval of those articles. Users will be able to quickly find articles they previously read through queries of their personal archive.
The Limits of Memory
As more content has moved to the web, browsing documents through a web browser has become the dominant mechanism for users to view news and scientific documents. With the proliferation of readily accessible information, users can read dozens of articles a day on a wide range of topics. While the amount of content a user reads has likely increased because information is so readily available, a human’s ability to remember that information is unchanged.

(Price per storage capacity is decreasing at an exponential rate.)
How often have we been in the situation where we want to find a useful article we read weeks earlier, but we cannot remember the title or authors? The article might have been a useful reference for a research paper, the itinerary for an upcoming trip, or an interesting movie review we would like to forward to friends. In the best case, the article is still available on the web and a search engine will be able to help us find the article again. In the worst case, the article has been removed from the web and there is no way to retrieve information that we previously read. The most common scenario is that we are able to find the article again, though it may take an extended effort trying the suggestions of a search engine or attempting to retrace the route to the document through related sites.
Even the best web search engines are of limited use for retrieving a previously read article because they were designed to solve a different problem. General web search engines attempt to use a small number of keywords provided by the user to retrieve a few documents from the billions of pages on the web. The vast majority of those billions of pages were never viewed by the user and should be eliminated from consideration. A general search engine is also unable to take advantage of information specific to the user, such as the dates on which the user read the article.
Features of a Memex Browser
Prototype
We have built prototype applications that implement some of the above features. The first component is a local Internet proxy that records HTTP requests in a file. A script periodically checks the file for newly viewed pages and uses the wget program to download each page and its referenced images. The page is then rewritten to reference the local copies of the images. Standard translators convert PDF and PostScript files into text, which is added to the index.
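The polling step described above can be sketched roughly as follows. This is a minimal illustration, not the prototype's actual script: the log format (one URL per line), the function names, and the archive directory are all assumptions. The wget options `--page-requisites` and `--convert-links` are real flags that fetch referenced images and rewrite links to the local copies.

```python
import subprocess


def read_new_urls(log_path, offset):
    """Return (urls, new_offset) for lines appended to the log since `offset`.

    Assumes a hypothetical log format of one requested URL per line.
    """
    with open(log_path, "rb") as log:
        log.seek(offset)
        data = log.read()
    # Consume only complete lines; a partially written line waits for the next poll.
    end = data.rfind(b"\n") + 1
    lines = data[:end].decode("utf-8", errors="replace").splitlines()
    urls = [line.strip() for line in lines if line.strip()]
    return urls, offset + end


def wget_command(url, archive_dir):
    """Build a wget invocation that saves a page together with its images."""
    return [
        "wget",
        "--page-requisites",                    # also download images, stylesheets, etc.
        "--convert-links",                      # rewrite references to the local copies
        f"--directory-prefix={archive_dir}",    # where the archive is kept (assumed layout)
        url,
    ]


def archive_new_pages(log_path, offset, archive_dir, run=subprocess.run):
    """Download every newly logged URL and return the updated log offset."""
    urls, offset = read_new_urls(log_path, offset)
    for url in urls:
        # check=False: one failed download should not stop the rest of the batch.
        run(wget_command(url, archive_dir), check=False)
    return offset
```

A periodic job could call `archive_new_pages` with the offset returned by the previous call, so that each URL is downloaded once.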
We used MySQL to store the web page information, including date, URL, text, and page type, as well as meta information added by the user, such as notes and links to other interesting sites.
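One possible layout for this data is sketched below. The prototype used MySQL; this sketch uses Python's built-in sqlite3 module as a stand-in so that it runs without a server. The table and column names are assumptions made for illustration, not the prototype's actual schema.

```python
import sqlite3

# Hypothetical schema: one row per archived page, plus user-entered notes and links.
SCHEMA = """
CREATE TABLE pages (
    id           INTEGER PRIMARY KEY,
    url          TEXT NOT NULL,
    first_viewed TEXT NOT NULL,   -- ISO date, e.g. 2002-11-15
    page_type    TEXT NOT NULL,   -- e.g. 'html', 'pdf', 'postscript', 'word'
    body_text    TEXT NOT NULL    -- extracted text used for searching
);
CREATE TABLE notes (
    page_id INTEGER NOT NULL REFERENCES pages(id),
    note    TEXT NOT NULL
);
CREATE TABLE links (
    page_id    INTEGER NOT NULL REFERENCES pages(id),
    target_url TEXT NOT NULL
);
"""


def open_archive(path=":memory:"):
    """Open (and initialize) an archive database; in-memory by default."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_page(conn, url, first_viewed, page_type, body_text):
    """Insert one archived page and return its row id."""
    cur = conn.execute(
        "INSERT INTO pages (url, first_viewed, page_type, body_text) "
        "VALUES (?, ?, ?, ?)",
        (url, first_viewed, page_type, body_text),
    )
    conn.commit()
    return cur.lastrowid
```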
Search Features
My prototype, shown below, allows the user to specify a variety of search criteria. The user can enter a set of search terms, all of which must appear in a matching page. Page types can be restricted to HTML, PDF, PostScript, and Word documents. The base URL can be specified either by selecting from a list of the most frequently visited URLs or by typing it in. The date when the page was first viewed can also be constrained, either by entering the number of months or days in the past to search or by entering a range of dates.

(User is searching for html pages containing the term “computer” within www.google.com between

(Searching for all html pages with the term Linux within the site www.linux.org viewed within the last five months.)
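The search criteria described in this section could be combined into a single database query along the following lines. This is a sketch under assumptions: the `pages` table and its column names are hypothetical, substring matching with `LIKE` stands in for whatever text index the prototype used, and dates are assumed to be stored as ISO strings so that they compare correctly as text.

```python
from datetime import date, timedelta


def build_search(terms=(), page_types=(), base_url=None,
                 viewed_from=None, viewed_to=None):
    """Return (sql, params) selecting archived pages that match all criteria."""
    clauses, params = [], []
    for term in terms:                      # every term must appear in the text
        clauses.append("body_text LIKE ?")
        params.append(f"%{term}%")
    if page_types:                          # restrict to the selected page types
        marks = ", ".join("?" for _ in page_types)
        clauses.append(f"page_type IN ({marks})")
        params.extend(page_types)
    if base_url:                            # restrict to one site
        clauses.append("url LIKE ?")
        params.append(f"%{base_url}%")
    if viewed_from:                         # earliest first-viewed date
        clauses.append("first_viewed >= ?")
        params.append(viewed_from.isoformat())
    if viewed_to:                           # latest first-viewed date
        clauses.append("first_viewed <= ?")
        params.append(viewed_to.isoformat())
    where = " AND ".join(clauses) if clauses else "1 = 1"
    sql = (f"SELECT url, first_viewed FROM pages WHERE {where} "
           "ORDER BY first_viewed DESC")
    return sql, params


def days_back(n, today=None):
    """Start date for a 'viewed within the last n days' search."""
    return (today or date.today()) - timedelta(days=n)
```

For instance, a search like the one in the second screenshot might be expressed as `build_search(terms=["Linux"], page_types=["html"], base_url="www.linux.org", viewed_from=days_back(150))`, treating five months as roughly 150 days.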
Adding comments
Beyond storing the documents a user views, the memex in Vannevar Bush’s original description would also store meta-data that the user enters about those pages. This meta-data could link together a series of interrelated documents or show the user’s progression of thought. I implemented features to store comments and links with a document. When a user views a document in the memex browser, a window appears that allows the user to enter notes and links to other documents, which are then stored in the database. When the user views that document in the future, the previously entered meta-data appears again as a reminder of their earlier work.

(A window for storing meta-data about the document appears, allowing the user to store notes and links to other documents.)

(The user is viewing a document about the situation with
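The store-and-recall behavior for comments and links might look roughly like the standalone sketch below. It keys annotations by URL in a single table for brevity and uses sqlite3 in place of MySQL; the table name, column names, and function names are all assumptions, not the prototype's code.

```python
import sqlite3


def open_annotations(path=":memory:"):
    """Open a database holding user-entered notes and links (hypothetical layout)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS annotations ("
        " url   TEXT NOT NULL,"
        " kind  TEXT NOT NULL CHECK (kind IN ('note', 'link')),"
        " value TEXT NOT NULL)"
    )
    return conn


def annotate(conn, url, kind, value):
    """Attach a note or a link to the document at `url`."""
    conn.execute(
        "INSERT INTO annotations (url, kind, value) VALUES (?, ?, ?)",
        (url, kind, value),
    )
    conn.commit()


def annotations_for(conn, url):
    """Return the notes and links previously stored for `url`, oldest first."""
    rows = conn.execute(
        "SELECT kind, value FROM annotations WHERE url = ? ORDER BY rowid",
        (url,),
    ).fetchall()
    result = {"note": [], "link": []}
    for kind, value in rows:
        result[kind].append(value)
    return result
```

When a page is opened again, a browser integration could call `annotations_for` with the page's URL and show the result in the annotation window.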
Memex Statistics
From November 2002 through January 2003, Victor stored the pages he viewed through his web browser in the memex database. Over those three months, the database accumulated 5,784 documents totaling 720 MB of storage space. Continuing at this pace, the memex database would index approximately 23,000 documents and 2.8 GB of data each year.
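The yearly estimate is a linear extrapolation from the three-month sample, which can be checked with a few lines of code. The function name is illustrative, and the conversion assumes 1 GB = 1024 MB.

```python
def yearly_projection(documents, megabytes, months_observed):
    """Scale observed totals linearly to a 12-month period."""
    factor = 12 / months_observed
    return documents * factor, megabytes * factor


# November 2002 through January 2003 is three months of browsing.
docs_per_year, mb_per_year = yearly_projection(5784, 720, 3)
print(f"{docs_per_year:,.0f} documents, {mb_per_year / 1024:.1f} GB per year")
```

This gives 23,136 documents and about 2.8 GB per year, consistent with the rounded figures above.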