Instructor Ellen Zhong
Time Tuesdays 1:20-4:10p, Friend Center 007
Office hours Mondays 4:00-5:00p, CS 314, or by appointment
Slack Link
Syllabus Link

Recent breakthroughs in machine learning algorithms have transformed the study of the 3D structure of proteins and other biomolecules. This seminar class will survey recent papers on ML applied to tasks in protein structure prediction, structure determination, computational protein design, physics-based modeling, and more. We will take a holistic approach when discussing papers, including discussing their historical context, algorithmic contributions, and potential impact on scientific discovery and applications such as drug discovery.

For more information on the discussion format, expectations, and grading, see the course syllabus.


Goals

  • Learn about machine learning methods applied to problems in structural biology
  • Learn how to critically read and evaluate papers
  • Learn how to pose research problems and practice oral and written scientific communication skills
  • Bonus: Exposure to relevant basic and applied ML research in industry from guest speakers


Topics

A non-exhaustive list of the topics we will cover includes:

  • An introduction to structural biology
  • Protein structure prediction before and after AlphaFold2
  • Computer vision and cryo-electron microscopy (cryo-EM)
  • Computational protein design, in particular antibody and vaccine design
  • Physics-based modeling and statistical mechanics
  • Small molecule drug discovery

Selected papers will cover a broad range of algorithmic concepts and machine learning techniques including:

  • Supervised learning and designing appropriate benchmarks and metrics
  • Language modeling and transformers
  • Generative modeling techniques including VAEs, GANs, normalizing flows, and diffusion models
  • Geometric deep learning
  • Neural rendering and multi-view 3D reconstruction

In addition to the assigned papers, optional primers or reviews on relevant topics will be made available for background reading.


Assignments

Final project guidelines: link


Guest Speakers

Tuesday, March 3rd, 1:20pm ET
Mark Goldstein (Flatiron Institute)

Title: Diffusion Generative Models

Abstract: Generative models have become central tools in computational biology and beyond, enabling the design of proteins, antibodies, small molecules, and more. But what exactly is a generative model? In this lecture, we survey the landscape of deep generative modeling — or more plainly, ways to fit and sample from complex probability distributions using large neural networks.

We start with "early" approaches like normalizing flows and VAEs, make our way through energy-based models, and arrive at diffusion- and flow-based models. We discuss the intuition behind each paradigm, which limitations motivated the community to move on, and why diffusion/flow models have stuck around for some time. We also cover the basic design choices one faces when training or using a diffusion model in practice.
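As a concrete illustration of one of the design choices mentioned above, the sketch below (illustrative only, not taken from the lecture) builds a standard linear beta schedule and samples x_t from the closed-form DDPM forward process; in training, a network would be asked to predict the noise eps from (x_t, t).

```python
import numpy as np

# Toy sketch of the DDPM forward process:
#   q(x_t | x_0) = N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I)
# under a linear beta schedule -- one of the basic design choices when
# training a diffusion model.

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # noise schedule (a key design choice)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def noise_sample(x0, t, rng):
    """Sample x_t ~ q(x_t | x_0) in closed form."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps  # the denoising network is trained to predict eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal(3)          # a toy "data point"
xt, eps = noise_sample(x0, t=T - 1, rng=rng)
# By t = T-1, alpha_bar is near 0, so x_t is close to pure Gaussian noise.
```

Sampling then runs this process in reverse, denoising step by step from pure noise back to data.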

We then turn to applications in protein design, examining how these frameworks have been adapted to handle the geometric and biochemical structure of biomolecules. We will see a few successful applications, illustrating how the principles from the first half of the lecture come to life in state-of-the-art systems for controllable structure and sequence design.

Bio: Mark Goldstein is a Research Fellow in the Center for Computational Mathematics at the Flatiron Institute. Previously, he completed his PhD at the NYU Courant Institute of Mathematical Sciences, CILVR group, advised by Rajesh Ranganath and Thomas Wies. He works on deep generative models and machine learning in the sciences.


Tuesday, March 17th, 1:20pm ET
Zeming Lin (Biohub // Evolutionary Scale)

Title: Protein Language Models

Abstract: In this talk I’ll introduce protein language modeling, focusing on the ESM family of models that I’ve helped develop over my career. We’ll see how self-supervised training on millions of natural sequences uncovers the “grammar” of evolution, enabling models to infer structure, understand function, and guide design.
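For background, the masked-language-modeling objective behind ESM-style models can be sketched as follows (a toy illustration, not the actual ESM training code): mask a fraction of residues and train a model to recover them from the surrounding context.

```python
import random

# Illustrative masked-language-modeling setup for protein sequences:
# hide ~15% of residues; the model's job is to predict each hidden
# residue from the unmasked context.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def mask_sequence(seq, mask_rate=0.15, mask_token="<mask>", seed=0):
    rng = random.Random(seed)
    tokens, targets = [], {}
    for i, aa in enumerate(seq):
        if rng.random() < mask_rate:
            tokens.append(mask_token)
            targets[i] = aa          # prediction target at position i
        else:
            tokens.append(aa)
    return tokens, targets

tokens, targets = mask_sequence("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
# A model p(residue | context) is then trained to maximize the likelihood
# of each held-out residue in `targets` given the partially masked `tokens`.
```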

Bio: Zeming is a cofounder of EvolutionaryScale, which recently joined forces with Biohub, an institute focused on combining frontier biology research with frontier AI. Previously, he completed a PhD at NYU and worked at Meta as a research engineer, helping develop PyTorch.


Tuesday, March 24th, 1:20pm ET
Elana Simon (Stanford University)

Title: InterPLM: Discovering Interpretable Features in Protein Language Models via Sparse Autoencoders

Abstract: Protein language models (PLMs) have demonstrated remarkable success in protein modeling and design, yet their internal mechanisms for predicting structure and function remain poorly understood. We present a systematic approach to extract and analyze interpretable features from PLMs using sparse autoencoders (SAEs). By training SAEs on embeddings from the PLM ESM-2, we identify thousands of human-interpretable features that correlate with biological concepts like binding sites, structural motifs, and functional domains. In contrast, examining individual neurons in ESM-2 reveals significantly less conceptual alignment, suggesting that PLMs represent most concepts in superposition. Beyond capturing known annotations, we show that ESM-2 learns coherent concepts that do not map onto existing annotations and propose a pipeline using language models to automatically interpret novel latent features learned by the SAEs. As practical applications, we demonstrate how these latent features can fill in missing annotations in protein databases and enable targeted steering of protein sequence generation. Our results demonstrate that PLMs encode rich, interpretable representations of protein biology and we propose a systematic framework to extract and analyze these latent features. In the process, we recover both known biology and potentially new protein motifs.
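To make the SAE idea concrete, here is a toy numpy sketch (dimensions and penalty weight are illustrative, not the paper's settings): an overcomplete ReLU autoencoder whose L1 penalty encourages each embedding to activate only a few latent features.

```python
import numpy as np

# Toy sparse autoencoder of the kind trained on PLM embeddings: an
# overcomplete dictionary with a ReLU bottleneck; the L1 term pushes most
# latent features to zero, so each embedding is explained by a few
# (hopefully interpretable) features.

rng = np.random.default_rng(0)
d_embed, d_latent = 64, 512          # latent dim >> embedding dim (overcomplete)
W_enc = rng.standard_normal((d_embed, d_latent)) * 0.05
W_dec = rng.standard_normal((d_latent, d_embed)) * 0.05
b_enc = np.zeros(d_latent)

def sae_forward(x, l1_coeff=1e-3):
    z = np.maximum(x @ W_enc + b_enc, 0.0)   # sparse non-negative features
    x_hat = z @ W_dec                        # reconstruction of the embedding
    loss = np.mean((x - x_hat) ** 2) + l1_coeff * np.abs(z).sum()
    return z, x_hat, loss

x = rng.standard_normal((1, d_embed))        # one token embedding
z, x_hat, loss = sae_forward(x)
# Interpretability then comes from inspecting which proteins or residues
# activate each latent unit of z.
```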

Bio: Elana Simon is a PhD student at Stanford advised by James Zou, working on understanding what machine learning models learn from biological sequences and structures. Previously, she was an ML engineer at Reverie Labs designing small-molecule cancer drugs and studied computer science at Harvard, where she worked with Debora Marks on protein language models. She also writes in-depth ML-biology analyses on her blog matmols and has been actively involved in research and advocacy for Fibrolamellar Hepatocellular Carcinoma.


Tuesday, April 28th, 1:20pm ET (tentative)
Sam Rodriques (FutureHouse, Edison Scientific)

Title: Building AI scientists

Bio: Sam Rodriques is an inventor and entrepreneur, the founder of FutureHouse, a research lab focused on building AI scientists, and of Edison Scientific, which commercializes AI agents for scientific discovery. He was previously head of the Applied Biotechnology Lab at the Francis Crick Institute and earned his PhD at MIT. He was named one of Time Magazine’s 100 most influential people in AI in 2025, and his work spans accelerating biomedical discovery, engineering human biology, and developing new institutional models for scientific research.


Schedule

Please fill out this form and contact Ellen if you are interested in signing up for this class. See a previous year's course website for a sample of topics and papers we will cover.

Post-lecture feedback: Please fill out this form if you are assigned to give feedback on a lecture.

Week Date Topic Readings Presenters Questions and Feedback
1 January 27 Course overview; Introduction to machine learning in structural biology Additional Resources:
1. Dill et al. The Protein Folding Problem. Annual Review of Biophysics 2008.
Ellen Zhong [Slides] N/A
2 February 3 Protein structure prediction; CASP; Supervised learning; Protein-specific metrics 1. Senior, A.W., Evans, R., Jumper, J. et al. Improved protein structure prediction using potentials from deep learning. Nature 2020.
2. Ingraham, J. et al. Learning Protein Structure with a Differentiable Simulator. ICLR 2019 Oral. [Talk]

Additional Resources:
3. AlphaFold1 CASP13 slides
4. https://moalquraishi.wordpress.com/2018/12/09/alphafold-casp13-what-just-happened/
5. trRosetta: Yang et al. Improved protein structure prediction using predicted interresidue orientations. PNAS 2020.
Ellen Zhong [Slides], Yufan Xia [Slides] Pre-lecture questions Feedback: Jack McMahon, Ziyu Xiong
3 February 10 Breakthroughs in protein structure prediction 1. Jumper, J., Evans, R., Pritzel, A. et al. Highly accurate protein structure prediction with AlphaFold. Nature 2021.
2. Abramson, J., Adler, J., Dunger, J. et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature 2024.

Additional Resources:
3. Tunyasuvunakool, K., Adler, J., Wu, Z. et al. Highly accurate protein structure prediction for the human proteome. Nature 2021.
4. AlphaFold2 slides. [CASP14 talk] [Michael Figurnov slides]
5. https://moalquraishi.wordpress.com/2020/12/08/alphafold2-casp14-it-feels-like-ones-child-has-left-home/
6. Primer on transformers: [1] [2]
7. The Illustrated AlphaFold(3)
Jack Shaw, Maxwell Soh [Slides-AF2] [Slides-AFDB] [Slides-AF3] Pre-lecture questions Feedback: Robert Heeter, Yagiz Devre
4 February 17 Protein design I 1. Ingraham et al. Generative models for graph-based protein design. NeurIPS 2019.
2. ESM-IF1: Hsu et al. Learning inverse folding from millions of predicted structures. ICML 2022.
3. Pacesa et al. One-shot design of functional protein binders with BindCraft. Nature 2025.

Additional Resources:
4. Dauparas et al. Robust deep learning–based protein sequence design using ProteinMPNN. Science 2022.
Jack McMahon, Joseph Clark, Md Toki Tahmid [Slides-Structure-Transformers] [Slides-ESM-IF1] [Slides-BindCraft] Pre-lecture questions Feedback: Tony Chen, Khai Evdaev
5 February 24 Protein structure determination I: Cryo-EM reconstruction 1. Zhong et al. Reconstructing continuous distributions of protein structure from cryo-EM images. ICLR 2020 Spotlight.
2. Zhong et al. CryoDRGN: reconstruction of heterogeneous cryo-EM structures using neural networks. Nature Methods 2021. [pdf]
3. Levy et al. CryoDRGN-AI: neural ab initio reconstruction of challenging cryo-EM and cryo-ET datasets. Nature Methods 2025.

Additional Resources:
4. Computer vision related works:
Xie et al. Neural Fields in Visual Computing and Beyond. Computer Graphics Forum 2022.
5. Cryo-EM background:
Singer & Sigworth. Computational Methods for Single-Particle Cryo-EM. Annual Review of Biomedical Data Science, 2020.
6. Primer on Variational Autoencoders: [1] [2] [3] [4]
Guest lecture by Rish Raghu, Robert Heeter [Slides-Cryo-Background] [Slides-CryoDRGN] Pre-lecture questions Feedback: Xingjian Hou, Sterling Hall
6 March 3 Diffusion and flow matching generative models of proteins 1. Ho et al. Denoising Diffusion Probabilistic Models. NeurIPS 2020.
2. Jing et al. AlphaFold Meets Flow Matching for Generating Protein Ensembles. ICML 2024.
Guest lecture by Mark Goldstein. Seminar by Bowen Jing Pre-lecture questions

Final Project Part 1 Due March 7th (Project proposal)
7 March 10 No class -- Spring Recess
8 March 17 Protein + Biological Language Modeling I 1. Rives et al. Language models enable zero-shot prediction of the effects of mutations on protein function. PNAS 2021.
2. Lin et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 2023.
3. Hie et al. Learning the language of viral evolution and escape. Science 2021.
Guest lecture by Zeming Lin. Conor Warren [Slides] Pre-lecture questions. Feedback: Joseph Clark, Jack Shaw
9 March 24 Protein + Biological Language Modeling II 1. Simon & Zou. InterPLM: discovering interpretable features in protein language models via sparse autoencoders. Nature Methods 2025.
2. Brixi et al. Genome modelling and design across all domains of life with Evo 2. Nature 2026.
Guest lecture by Elana Simon. Tony Chen Pre-lecture questions. Feedback: Yufan Xia, Jack McMahon
10 March 31 Physics-based modeling
11 April 7 Protein Structure Determination II
12 April 14 Other molecules: RNA, small molecules, etc.
13 April 21 Skip
14 April 28 AI Scientists
15 Tuesday, May 5 or May 12 (TBD), 1:20-4:10pm Final project presentations