Unsupervised Learning of Natural Language Structure

Date and Time
Thursday, March 11, 2004 - 4:30pm to 6:00pm
Location
Computer Science Small Auditorium (Room 105)
Speaker
Dan Klein, Stanford University
Host
Moses Charikar
There is precisely one complete language processing system to date: the human brain. Though there is debate on how much built-in bias human learners might have, we definitely acquire language in a primarily unsupervised fashion. On the other hand, computational approaches to language processing are almost exclusively supervised, relying on hand-labeled corpora for training. This reliance is largely due to repeated failures of unsupervised approaches. In particular, the problem of learning syntax (grammar) from completely unannotated text has received a great deal of attention for well over a decade, with little in the way of positive results.

We argue that previous methods for this task have generally failed because of the representations they used. Overly complex models are easily distracted by non-syntactic correlations (such as topical associations), while overly simple models aren't rich enough to capture important first-order properties of language (such as directionality, adjacency, and valence). We describe several syntactic representations designed to capture the basic character of natural language syntax as directly as possible.
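To make the "directionality" property above concrete, here is a minimal toy sketch (not the talk's actual model): an EM loop that learns, for each part-of-speech tag, whether it prefers to attach a dependent to its left or its right, using only unannotated tag sequences. The corpus, tags, and probability table are invented for this illustration.

```python
from collections import defaultdict

# Invented toy corpus of POS-tag sequences (no parses, no labels).
corpus = [
    ["DT", "NN", "VB"],              # e.g. "the dog barks"
    ["DT", "NN", "VB", "DT", "NN"],
    ["NN", "VB", "DT", "NN"],
]

# p_right[tag]: current estimate that `tag` attaches its dependent rightward.
p_right = defaultdict(lambda: 0.5)

for _ in range(20):  # EM iterations
    counts = defaultdict(lambda: [1e-3, 1e-3])  # smoothed [left, right] soft counts
    for sent in corpus:
        for i, tag in enumerate(sent):
            has_left, has_right = i > 0, i < len(sent) - 1
            if has_left and has_right:
                pr = p_right[tag]        # E-step: split the count fractionally
                counts[tag][0] += 1 - pr
                counts[tag][1] += pr
            elif has_left:               # sentence-final: only a left neighbor
                counts[tag][0] += 1
            elif has_right:              # sentence-initial: only a right neighbor
                counts[tag][1] += 1
    for tag, (left, right) in counts.items():  # M-step: renormalise
        p_right[tag] = right / (left + right)

# Determiners only ever have material to their right in this toy corpus,
# so EM drives their rightward-attachment probability toward 1.
print(round(p_right["DT"], 2))
```

Even this trivial model shows the flavor of the approach: a representation that directly encodes a first-order syntactic property (here, head direction) can be fit to raw text with EM, with no labeled examples.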

With these representations, high-quality parses can be learned from surprisingly little text, with no labeled examples and no language-specific biases. Our results are the first to show above-baseline performance in unsupervised parsing, and they exceed that baseline by a wide margin in multiple languages. These grammar learning methods are useful in their own right, since parsed corpora exist for only a small number of languages. More generally, most high-level NLP tasks, such as machine translation and question answering, lack richly annotated corpora, making unsupervised methods extremely appealing even for common languages like English.
