03-11
Unsupervised Learning of Natural Language Structure

There is precisely one complete language processing system to date: the human brain. Though there is debate on how much built-in bias human learners might have, we definitely acquire language in a primarily unsupervised fashion. On the other hand, computational approaches to language processing are almost exclusively supervised, relying on hand-labeled corpora for training. This reliance is largely due to repeated failures of unsupervised approaches. In particular, the problem of learning syntax (grammar) from completely unannotated text has received a great deal of attention for well over a decade, with little in the way of positive results.

We argue that previous methods for this task have generally failed because of the representations they used. Overly complex models are easily distracted by non-syntactic correlations (such as topical associations), while overly simple models aren't rich enough to capture important first-order properties of language (such as directionality, adjanency, and valence). We describe several syntactic representations which are designed to capture the basic character of natural language syntax as directly as possible.

With these representations, high-quality parses can be learned from surprisingly little text, with no labeled examples and no language-specific biases. Our results are the first to show above-baseline performance in unsupervised parsing, and far exceed the baseline (in multiple languages).These specific grammar learning methods are useful since parsed corpora exist for only a small number of languages. More generally, most high-level NLP tasks, such as machine translation and question-answering, lack richly annotated corpora, making unsupervised methods extremely appealing even for common languages like English.

Date and Time

Thursday March 11, 2004 4:30pm - 6:00pm

Location

Computer Science Small Auditorium (Room 105)

Event Type

CS Colloquium Series

Speaker

Dan Klein, from Stanford University

Host

Moses Charikar

Contributions to and/or sponsorship of any event does not constitute departmental or institutional endorsement of the specific program, speakers or views presented.

CS Talks Mailing List

03-11 Unsupervised Learning of Natural Language Structure

03-11
Unsupervised Learning of Natural Language Structure