Tianyu Gao will present his FPO "Enabling Language Models to Process Information at Scale" on Thursday, November 20, 2025, at 1:30 PM in CS 402.
The members of Tianyu’s committee are as follows:
Examiners: Danqi Chen (Adviser), Sanjeev Arora, Tri Dao
Readers: Karthik Narasimhan, Yoon Kim (MIT)
A copy of his thesis is available upon request; please email gradinfo@cs.princeton.edu if you would like one.
Everyone is invited to attend his talk.
The abstract follows:
Language models (LMs) can effectively internalize knowledge from vast amounts of pretraining data, enabling them to achieve remarkable performance on exam-style benchmarks. Expanding their ability to compile, synthesize, and reason over large volumes of information on the fly will further unlock transformative applications, ranging from AI literature assistants to generative search engines. In this thesis, I will present my research on advancing LMs for processing information at scale. (1) I will first discuss my contributions to creating efficient pretraining and customization methods for LMs, which enable scalable deployment of LM-powered applications across diverse settings. (2) I will then introduce my foundational work on using contrastive learning to produce high-performing text embeddings, which form the cornerstone of effective and scalable search. (3) Finally, I will present my evaluation framework for LM-based information-seeking systems, emphasizing the importance of providing citations for verifying model-generated answers. Our evaluation highlights shortcomings in LMs' abilities to reliably process long-form texts (e.g., dozens of webpages), which I address by developing state-of-the-art long-context LMs that outperform leading industry efforts while using a small fraction of the computational budget. I will conclude the thesis by sharing my vision for the next generation of autonomous information processing systems and outlining the foundational challenges that must be addressed to realize this vision.