Alexander Wettig FPO

Alexander Wettig will present his FPO "Building Better Language Models with Data Curation" on Tuesday, April 28, 2026 at 2:00 PM in FC 110.

Zoom Link: https://princeton.zoom.us/j/98043325388?pwd=IIFV4labMYR0YyBXNC5ocNaaZ2i8aQ.1&jst=2

The members of Alex’s committee are as follows:
Examiners: Danqi Chen (Adviser), Sanjeev Arora, Karthik Narasimhan
Readers: Peter Henderson, Ludwig Schmidt (Stanford)

A copy of his thesis is available upon request; please email gradinfo@cs.princeton.edu if you would like one.

Everyone is invited to attend his talk.

Abstract follows below:
Training data is among the most consequential choices when developing a language model, and the key ingredients of successful data curation have become closely guarded secrets of industrial research labs. This thesis works toward making data curation an open and reproducible science, developing tools and methods for shaping the training data distribution of language models and studying how quality filtering, domain mixing, and data composition affect downstream capabilities. Our methodology is grounded in data ablations: controlled experiments that vary the training data while holding all other variables fixed.

The first and major part of this thesis focuses on data curation for pre-training, where we introduce three complementary techniques. (1) QuRating selects documents by quantifying human notions of text qualities, such as writing style, educational value, and required expertise. With the right balance of quality and diversity, we curate training datasets that produce language models that perform on par with models trained with 50% more compute. (2) WebOrganizer operates at the corpus level, introducing topic and format taxonomies that classify web pages into meaningful categories and enable explicit optimization of the data mixture. This yields well-calibrated performance across tasks, with the best results obtained when domain mixing is combined with quality selection. (3) Finally, we study the relationship between quality filtering and deduplication as part of the Olmo-3 project and develop quality-aware upsampling strategies used to curate trillions of tokens of high-quality pre-training data.

In the second part, we extend our methodology to the setting of continued pre-training for long-context understanding. Through systematic ablations guided by reliable downstream evaluation, we identify good sources of long-context data and uncover details such as the importance of mixing both long and short documents. Our final recipe produces the ProLong model, which achieves state-of-the-art long-context performance at a fraction of the data budget of comparable efforts.

We conclude by discussing how data curation will co-evolve with future progress in AI.

Date and Time
Tuesday, April 28, 2026, 2:00 PM - 4:00 PM
Location
Friend Center 110
