| Instructor | Ellen Zhong |
| Time | Tuesdays 1:20-4:10p, Friend Center 007 |
| Office hours | Mondays 4:00-5:00p, CS 314, or by appointment |
| Slack | Link |
| Syllabus | Link |
Recent breakthroughs in machine learning algorithms have transformed the study of the 3D structure of proteins and other biomolecules. This seminar class will survey recent papers on ML applied to tasks in protein structure prediction, structure determination, computational protein design, physics-based modeling, and more. We will take a holistic approach when discussing papers, including discussing their historical context, algorithmic contributions, and potential impact on scientific discovery and applications such as drug discovery.
For more information on the discussion format, expectations, and grading, see the course syllabus.
A non-exhaustive list of topics we will cover includes:
Selected papers will cover a broad range of algorithmic concepts and machine learning techniques including:
In addition to the assigned papers, optional primers or reviews on relevant topics will be made available for background reading.
Final project guidelines: link
Tuesday, March 3rd, 1:20pm ET
Mark Goldstein (Flatiron Institute)
Title: Diffusion Generative Models
Abstract: Generative models have become central tools in computational biology and beyond, enabling the design of proteins, antibodies, small molecules, and more. But what exactly is a generative model? In this lecture, we survey the landscape of deep generative modeling — or more plainly, ways to fit and sample from complex probability distributions using large neural networks.
We start with "early" approaches like normalizing flows and VAEs, make our way through energy-based models, and arrive at diffusion- and flow-based models. We discuss the intuition behind each paradigm, which limitations motivated the community to move on, and why diffusion/flow models have stuck around for some time. We also cover the basic design choices one faces when training or using a diffusion model in practice.
We then turn to applications in protein design, examining how these frameworks have been adapted to handle the geometric and biochemical structure of biomolecules. We will see a few successful applications, illustrating how the principles from the first half of the lecture come to life in state-of-the-art systems for controllable structure and sequence design.
Bio: Mark Goldstein is a Research Fellow in the Center for Computational Mathematics at the Flatiron Institute. Previously, he completed his PhD at the NYU Courant Institute of Mathematical Sciences, CILVR group, advised by Rajesh Ranganath and Thomas Wies. He works on deep generative models and machine learning in the sciences.
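The forward-noising process and denoising objective at the heart of the diffusion paradigm discussed above can be sketched in a few lines. This is a toy illustration for intuition only, not any particular system's implementation: the linear noise schedule and the zero-predicting stand-in "denoiser" are assumptions made so the example runs self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)

# A simple linear noise schedule: beta_t in [1e-4, 0.02] over T = 1000 steps.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)  # \bar{alpha}_t = prod_{s<=t} alpha_s

def q_sample(x0, t, eps):
    """Sample x_t ~ q(x_t | x_0) in closed form (the DDPM forward process)."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

def eps_theta(x_t, t):
    """Stand-in for a neural network trained to predict the added noise."""
    return np.zeros_like(x_t)

# One toy training example: noise a data point, then score the noise prediction.
x0 = rng.normal(size=8)       # stand-in for one data point
t = 500                       # a randomly chosen timestep
eps = rng.normal(size=8)      # the noise actually added
x_t = q_sample(x0, t, eps)

# The simple DDPM training objective: ||eps - eps_theta(x_t, t)||^2.
loss = np.mean((eps - eps_theta(x_t, t)) ** 2)
```

In practice `eps_theta` is a large neural network and the loss is averaged over random timesteps and data; sampling then runs the learned denoiser backwards from pure noise.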
Tuesday, March 17th, 1:20pm ET
Zeming Lin (Biohub // Evolutionary Scale)
Title: Protein Language Models
Abstract: In this talk I’ll introduce protein language modeling, focusing on the ESM family of models that I’ve helped develop over my career. We’ll see how self-supervised training on millions of natural sequences uncovers the “grammar” of evolution, enabling models to infer structure, understand function, and guide design.
Bio: Zeming is a cofounder of EvolutionaryScale, which recently joined forces with Biohub, an institute focused on combining frontier biology research with frontier AI. Previously, he completed a PhD at NYU and worked at Meta as a research engineer, helping develop PyTorch.
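One concrete way masked protein language models are used in practice (see the Rives et al. reading in week 8) is zero-shot mutation scoring: mask a position, then compare the model's log-probability of the mutant residue against the wild type. The sketch below shows only that scoring arithmetic; the `dummy_masked_probs` function is a hypothetical stand-in for a real PLM's output distribution, which would condition on the full masked sequence.

```python
import numpy as np

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")

def dummy_masked_probs(sequence, pos, rng):
    """Stand-in for a PLM's predicted distribution at a masked position.
    A real model (e.g. an ESM-style transformer) would condition on the
    whole sequence with `pos` masked; here we return an arbitrary fixed
    categorical distribution purely for illustration."""
    p = rng.random(len(AMINO_ACIDS))
    return p / p.sum()

def mutation_score(sequence, pos, mut_aa, rng):
    """Zero-shot effect score: log p(mutant) - log p(wild type) at the
    masked position. Positive values mean the model prefers the mutant."""
    probs = dummy_masked_probs(sequence, pos, rng)
    wt_aa = sequence[pos]
    return np.log(probs[AMINO_ACIDS.index(mut_aa)]) - np.log(probs[AMINO_ACIDS.index(wt_aa)])

rng = np.random.default_rng(0)
seq = "MKTAYIAKQR"              # a toy sequence, not a real protein
score = mutation_score(seq, 2, "A", rng)
```

Summing this score over all mutated positions gives the pseudo-log-likelihood-style ranking used for variant effect prediction.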
Tuesday, March 24th, 1:20pm ET
Elana Simon (Stanford University)
Title: InterPLM: Discovering Interpretable Features in Protein Language Models via Sparse Autoencoders
Abstract: Protein language models (PLMs) have demonstrated remarkable success in protein modeling and design, yet their internal mechanisms for predicting structure and function remain poorly understood. We present a systematic approach to extract and analyze interpretable features from PLMs using sparse autoencoders (SAEs). By training SAEs on embeddings from the PLM ESM-2, we identify thousands of human-interpretable features that correlate with biological concepts like binding sites, structural motifs, and functional domains. In contrast, examining individual neurons in ESM-2 reveals significantly less conceptual alignment, suggesting that PLMs represent most concepts in superposition. Beyond capturing known annotations, we show that ESM-2 learns coherent concepts that do not map onto existing annotations and propose a pipeline using language models to automatically interpret novel latent features learned by the SAEs. As practical applications, we demonstrate how these latent features can fill in missing annotations in protein databases and enable targeted steering of protein sequence generation. Our results demonstrate that PLMs encode rich, interpretable representations of protein biology and we propose a systematic framework to extract and analyze these latent features. In the process, we recover both known biology and potentially new protein motifs.
Bio: Elana Simon is a PhD student at Stanford advised by James Zou, working on understanding what machine learning models learn from biological sequences and structures. Previously, she was an ML engineer at Reverie Labs designing small-molecule cancer drugs and studied computer science at Harvard, where she worked with Debora Marks on protein language models. She also writes in-depth ML-biology analyses on her blog matmols and has been actively involved in research and advocacy for Fibrolamellar Hepatocellular Carcinoma.
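The core mechanic in the abstract above, an overcomplete autoencoder with an L1 sparsity penalty applied to PLM embeddings, can be sketched as follows. This is a minimal illustration with assumed dimensions and random untrained weights, not InterPLM's actual architecture or training setup.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_hidden = 32, 256  # feature dictionary is overcomplete vs. the embedding

# Randomly initialized weights; a real SAE learns these from PLM embeddings.
W_enc = rng.normal(scale=0.1, size=(d_model, d_hidden))
b_enc = np.zeros(d_hidden)
W_dec = rng.normal(scale=0.1, size=(d_hidden, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode an embedding into nonnegative feature activations, then reconstruct."""
    f = np.maximum(x @ W_enc + b_enc, 0.0)  # ReLU keeps activations nonnegative
    x_hat = f @ W_dec + b_dec
    return f, x_hat

def sae_loss(x, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that pushes most features to zero."""
    f, x_hat = sae_forward(x)
    return np.mean((x - x_hat) ** 2) + l1_coeff * np.abs(f).sum()

x = rng.normal(size=d_model)  # stand-in for one per-residue PLM embedding
f, x_hat = sae_forward(x)
loss = sae_loss(x)
```

After training, individual coordinates of `f` are the candidate interpretable features: one inspects which residues and proteins activate each one and checks for alignment with annotations like binding sites or motifs.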
Tuesday, April 28th, 1:20pm ET (tentative)
Sam Rodriques (FutureHouse, Edison Scientific)
Title: Building AI scientists
Bio: Sam Rodriques is an inventor and entrepreneur and the founder of FutureHouse, a research lab focused on building AI scientists, and Edison Scientific, which commercializes AI agents for scientific discovery. He was previously head of the Applied Biotechnology Lab at the Francis Crick Institute and earned his PhD at MIT. Named one of Time Magazine’s 100 most influential people in AI in 2025, his work spans accelerating biomedical discovery, engineering human biology, and developing new institutional models for scientific research.
Please fill out this form and contact Ellen if you are interested in signing up for this class. See a previous year's course website for a sample of topics and papers we will cover.
Post-lecture feedback: Please fill out this form if you are assigned to give feedback on a lecture.
| Week | Date | Topic | Readings | Presenters | Questions and Feedback |
|---|---|---|---|---|---|
| 1 | January 27 | Course overview; Introduction to machine learning in structural biology |
Additional Resources:
1. Dill et al. The Protein Folding Problem. Annual Review of Biophysics 2008. |
Ellen Zhong [Slides] | N/A |
| 2 | February 3 | Protein structure prediction; CASP; Supervised learning; Protein-specific metrics |
1. Senior, A.W., Evans, R., Jumper, J. et al. Improved protein structure prediction using potentials from deep learning. Nature 2020.
2. Ingraham, J. et al. Learning Protein Structure with a Differentiable Simulator. ICLR 2019 Oral. [Talk] Additional Resources: 3. AlphaFold1 CASP13 slides 4. https://moalquraishi.wordpress.com/2018/12/09/alphafold-casp13-what-just-happened/ 5. trRosetta: Yang et al. Improved protein structure prediction using predicted interresidue orientations. PNAS 2020. |
Ellen Zhong [Slides], Yufan Xia [Slides] | Pre-lecture questions Feedback: Jack McMahon, Ziyu Xiong |
| 3 | February 10 | Breakthroughs in protein structure prediction |
1. Jumper, J., Evans, R., Pritzel, A. et al. Highly accurate protein structure prediction with AlphaFold. Nature 2021.
2. Abramson, J., Adler, J., Dunger, J. et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature 2024. Additional Resources: 3. Tunyasuvunakool, K., Adler, J., Wu, Z. et al. Highly accurate protein structure prediction for the human proteome. Nature 2021. 4. AlphaFold2 slides. [CASP14 talk] [Michael Figurnov slides] 5. https://moalquraishi.wordpress.com/2020/12/08/alphafold2-casp14-it-feels-like-ones-child-has-left-home/. 6. Primer on transformers: [1] [2] 7. The Illustrated AlphaFold(3) |
Jack Shaw, Maxwell Soh [Slides-AF2] [Slides-AFDB] [Slides-AF3] | Pre-lecture questions Feedback: Robert Heeter, Yagiz Devre |
| 4 | February 17 | Protein design I |
1. Ingraham et al. Generative models for graph-based protein design. NeurIPS 2019.
2. ESM-IF1: Hsu et al. Learning inverse folding from millions of predicted structures. ICML 2022. 3. Pacesa et al. One-shot design of functional protein binders with BindCraft. Nature 2025. Additional Resources: 4. Dauparas et al. Robust deep learning–based protein sequence design using ProteinMPNN. Science 2022. | Jack McMahon, Joseph Clark, Md Toki Tahmid [Slides-Structure-Transformers] [Slides-ESM-IF1] [Slides-BindCraft] | Pre-lecture questions Feedback: Tony Chen, Khai Evdaev |
| 5 | February 24 | Protein structure determination I: Cryo-EM reconstruction |
1. Zhong et al. Reconstructing continuous distributions of protein structure from cryo-EM images. ICLR 2020 Spotlight.
2. Zhong et al. CryoDRGN: reconstruction of heterogeneous cryo-EM structures using neural networks. Nature Methods 2021. [pdf] 3. Levy et al. CryoDRGN-AI: neural ab initio reconstruction of challenging cryo-EM and cryo-ET datasets. Nature Methods 2025. Additional Resources: 4. Computer vision related works:
i. Mildenhall, Srinivasan, Tancik et al. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. ECCV 2020 Oral. [project page]
ii. Tancik et al. Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains. NeurIPS 2020 Spotlight.
iii. Xie et al. Neural Fields in Visual Computing and Beyond. Computer Graphics Forum 2022.
5. Cryo-EM background: Singer & Sigworth. Computational Methods for Single-Particle Cryo-EM. Annual Review of Biomedical Data Science, 2020.
6. Primer on Variational Autoencoders: [1] [2] [3] [4]
|
Guest lecture by Rish Raghu, Robert Heeter [Slides-Cryo-Background] [Slides-CryoDRGN] | Pre-lecture questions Feedback: Xingjian Hou, Sterling Hall |
| 6 | March 3 | Diffusion and flow matching generative models of proteins |
1. Ho et al. Denoising Diffusion Probabilistic Models. NeurIPS 2020.
2. Jing et al. AlphaFold Meets Flow Matching for Generating Protein Ensembles. ICML 2024. | Guest lecture by Mark Goldstein. Seminar by Bowen Jing |
Pre-lecture questions
Final Project Part 1 Due March 7th (Project proposal) |
| 7 | March 10 | No class -- Spring Recess | |||
| 8 | March 17 | Protein + Biological Language Modeling I |
1. Rives et al. Language models enable zero-shot prediction of the effects of mutations on protein function. PNAS 2021.
2. Lin et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 2023. 3. Hie et al. Learning the language of viral evolution and escape. Science 2021. |
Guest lecture by Zeming Lin. Conor Warren [Slides] | Pre-lecture questions. Feedback: Joseph Clark, Jack Shaw |
| 9 | March 24 | Protein + Biological Language Modeling II |
1. Simon & Zou. InterPLM: discovering interpretable features in protein language models via sparse autoencoders. Nature Methods 2025.
2. Brixi et al. Genome modelling and design across all domains of life with Evo 2. Nature 2026. |
Guest lecture by Elana Simon. Tony Chen | Pre-lecture questions. Feedback: Yufan Xia, Jack McMahon |
| 10 | March 31 | Physics-based modeling | |||
| 11 | April 7 | Protein Structure Determination II | |||
| 12 | April 14 | Other molecules: RNA, small molecules, etc. | |||
| 13 | April 21 | Skip | |||
| 14 | April 28 | AI Scientists | |||
| 15 | Tuesday, May 5 or May 12 (TBD), 1:20-4:10pm | Final project presentations |