Information Extraction from User Workstations
Date and Time
Thursday, March 16, 2006 - 4:00pm to 5:30pm
Computer Science Large Auditorium (Room 104)
Tom Mitchell, from Carnegie Mellon University
Automatically extracting structured facts from unstructured text is a key step toward natural language understanding. Many researchers study this problem, typically in the context of text collections such as newsfeeds or the web. This talk will explore information extraction from user workstations. While many of the subproblems are the same as for extraction from other corpora, workstations have characteristics that suggest very different approaches from "traditional" information extraction. For example, suppose the facts we wish to extract from the workstation consist of assertions about the key activities of the workstation user (e.g., which courses they are taking, which committees they serve on) and relations among the people, meetings, topics, emails, files, etc. associated with each such activity. Interestingly, workstations contain many redundant clues regarding these facts (e.g., evidence that Bob and Sue are both involved in the hiring committee exists in email, the calendar, individual files, ...). This redundancy suggests treating information extraction as a problem of integrating diverse clues from multiple sources rather than a problem of examining a single sentence in great detail. This talk will explore this formulation of the information extraction problem and present our recent work on automatically extracting facts using workstation-wide information obtained by calling Google Desktop Search as a subroutine.