Haoyu Zhao will present his FPO "Data-Centric Foundations of Compositional Generalization in Large Language Models" on Friday, May 29, 2026 at 1:00 PM in CS 402.
The members of Haoyu’s committee are as follows:
Examiners: Sanjeev Arora (Adviser), Elad Hazan, Chi Jin
Readers: Karthik Narasimhan, Danqi Chen
A copy of his thesis is available upon request; please email gradinfo@cs.princeton.edu.
Everyone is invited to attend the talk.
Abstract follows below:
Large language models (LLMs) exhibit remarkable capabilities in language understanding, reasoning, and code generation, yet they often remain brittle when tasks require systematic composition of learned components. This dissertation argues that such strengths and weaknesses are shaped not only by model architecture or scale, but fundamentally by the structure of the data distributions through which models are trained and evaluated.
We develop a data-centric perspective on compositional generalization through two complementary settings. First, we study how compositional structure can emerge in language models from appropriately structured data. In the setting of masked language modeling, we show that transformers trained on synthetically generated data can approximate classical parsing procedures such as the Inside–Outside algorithm, providing both theoretical and empirical evidence that latent hierarchical structure can arise without explicit supervision. We then extend this perspective to higher-level behavior through controlled skill composition tasks, showing that the ability to combine multiple skills is not automatic, but depends critically on the diversity and structure of compositional examples seen during training.
Second, we study compositional reasoning under formal constraints, where correctness must be exact rather than approximate. We introduce Ineq-Comp, a benchmark for inequality theorem proving that systematically varies compositional structure through simple, human-intuitive transformations. Despite strong performance on standard problems, state-of-the-art provers degrade substantially on these variants, revealing a persistent gap between solving familiar patterns and composing known reasoning steps in formally verified settings. We then present a scalable theorem-proving pipeline that integrates synthetic data generation, scaffolded decomposition, and verifier-guided self-correction, showing that carefully designed data pipelines can substantially improve formal reasoning performance.
Taken together, these results support a central thesis: compositional generalization in large language models is fundamentally shaped by the structure of their effective data distribution. By viewing pretraining corpora, synthetic supervision, benchmarks, and verifier feedback as different ways of organizing that distribution, this dissertation provides both a conceptual framework and practical guidance for designing language models with stronger compositional reasoning abilities.
Haoyu Zhao FPO
Date and Time
Thursday, May 28, 2026, 1:00pm - 3:00pm
Location
Computer Science 402
Contributions to and/or sponsorship of any event does not constitute departmental or institutional endorsement of the specific program, speakers or views presented.