Dan Friedman FPO

Dan Friedman will present his FPO, "Algorithmic Interpretability for Language Models," on Tuesday, November 11, 2025, at 1:00 PM in AI Lab Room 274 and on Zoom.

Zoom link: https://princeton.zoom.us/j/8636741030

The members of Dan’s committee are as follows:
Examiners: Danqi Chen (Adviser), Tom Griffiths, Adji Bousso Dieng
Readers: Sanjeev Arora, Alexander Rush (Cornell University)

A copy of his thesis is available upon request; please email gradinfo@cs.princeton.edu to request one.

Everyone is invited to attend his talk. 

Abstract follows below:
Modern natural language processing systems are black boxes. As a result, these models are fundamentally difficult to audit, debug, or trust. Standard post-hoc interpretability methods offer partial insight into model behavior but do not provide complete or faithful descriptions of the underlying computation. The aim of my research is to achieve an “algorithm-level” understanding of language processing systems, by aligning neural models with human-readable, rule-based programs. This thesis is organized into two parts. The first part involves developing intrinsically interpretable methods—machine learning models that are designed from the outset to be easily converted into symbolic form. I will present a method for finding interpretable features in text datasets using probabilistic grammars, which can recover “shortcuts” in NLP classification datasets—shallow patterns in the training data that are correlated with the target labels but do not generalize reliably. Next, I will introduce my work on developing transformers that are mechanistically interpretable by design, which can be trained using gradient-based optimization and then automatically converted into discrete, human-readable programs. These Transformer Programs are capable of learning effective solutions for a variety of tasks, and can be analyzed using standard code-analysis tools. The second part of the thesis involves aligning existing black-box models with symbolic rules, either by theoretical analysis or by extracting rule-based descriptions of their behavior. This includes a theoretical and empirical analysis of how transformers can represent classic, rule-based chatbot algorithms like ELIZA, and a method for extracting rule-based explanations of attention features using sparse autoencoders. I will conclude by outlining a roadmap for future research toward the goal of algorithm-level understanding of neural language models.

Date and Time
Tuesday, November 11, 2025, 1:00 PM - 3:00 PM
AI Lab Room 274 (off campus)

Contributions to and/or sponsorship of any event do not constitute departmental or institutional endorsement of the specific program, speakers, or views presented.
