Instructor Ellen Zhong
Time Tuesdays 1:20-4:10p, Friend Center 007
Office hours Mondays 4:00-5:00p, CS 314, or by appointment
Slack Link
Syllabus Link

Recent breakthroughs in machine learning algorithms have transformed the study of the 3D structure of proteins and other biomolecules. This seminar class will survey recent papers on ML applied to tasks in protein structure prediction, structure determination, computational protein design, physics-based modeling, and more. We will take a holistic approach to each paper, discussing its historical context, algorithmic contributions, and potential impact on scientific discovery and on applications such as drug discovery.

For more information on the discussion format, expectations, and grading, see the course syllabus.


Goals

  • Learn about machine learning methods applied to problems in structural biology
  • Learn how to critically read and evaluate papers
  • Learn how to pose research problems and practice oral and written scientific communication skills
  • Bonus: Exposure to relevant basic and applied ML research in industry from guest speakers


Topics

A non-exhaustive list of topics we will cover includes:

  • An introduction to structural biology
  • Protein structure prediction before and after AlphaFold2
  • Computer vision and cryo-electron microscopy (cryo-EM)
  • Computational protein design, in particular antibody and vaccine design
  • Physics-based modeling and statistical mechanics
  • Small molecule drug discovery

Selected papers will cover a broad range of algorithmic concepts and machine learning techniques including:

  • Supervised learning and designing appropriate benchmarks and metrics
  • Language modeling and transformers
  • Generative modeling techniques including VAEs, GANs, normalizing flows, and diffusion models
  • Geometric deep learning
  • Neural rendering and multi-view 3D reconstruction

In addition to the assigned papers, optional primers or reviews on relevant topics will be made available for background reading.


Assignments

Final project guidelines: link


Guest Speakers

Tuesday, March 3rd, 1:20pm ET
Mark Goldstein (Flatiron Institute)

Title: Diffusion models and flow matching for molecule generation and design

Bio: Mark Goldstein is a Research Fellow in the Center for Computational Mathematics at the Flatiron Institute. Previously, he completed his PhD at the NYU Courant Institute of Mathematical Sciences, CILVR group, advised by Rajesh Ranganath and Thomas Wies. He works on deep generative models and machine learning in the sciences.


Tuesday, March 17th, 1:20pm ET
Zeming Lin (CZI Biohub, EvolutionaryScale)

Title: Evolutionary scale protein language modeling

Bio: TBD


Tuesday, March 24th, 1:20pm ET
Elana Simon (Stanford University)

Title: Discovering interpretable features in protein language models

Bio: Elana Simon is a PhD student at Stanford advised by James Zou, working on understanding what machine learning models learn from biological sequences and structures. Previously, she was an ML engineer at Reverie Labs designing small-molecule cancer drugs and studied computer science at Harvard, where she worked with Debora Marks on protein language models. She also writes in-depth ML-biology analyses on her blog matmols and has been actively involved in research and advocacy for Fibrolamellar Hepatocellular Carcinoma.


Tuesday, April 28th, 1:20pm ET (tentative)
Sam Rodriques (FutureHouse, Edison Scientific)

Title: Building AI scientists

Bio: Sam Rodriques is an inventor and entrepreneur and the founder of FutureHouse, a research lab focused on building AI scientists, and Edison Scientific, which commercializes AI agents for scientific discovery. He was previously head of the Applied Biotechnology Lab at the Francis Crick Institute and earned his PhD at MIT. Named one of Time Magazine’s 100 most influential people in AI in 2025, his work spans accelerating biomedical discovery, engineering human biology, and developing new institutional models for scientific research.


Schedule

Please fill out this form and contact Ellen if you are interested in signing up for this class. See a previous year's course website for a sample of topics and papers we will cover.

Post-lecture feedback: Please fill out this form if you are assigned to give feedback on a lecture.

Week Date Topic Readings Presenters Questions and Feedback
1 January 27 Course overview; Introduction to machine learning in structural biology Additional Resources:
1. Dill et al. The Protein Folding Problem. Annual Review of Biophysics 2008.
Ellen Zhong [Slides] N/A
2 February 3 Protein structure prediction; CASP; Supervised learning; Protein-specific metrics 1. Senior, A.W., Evans, R., Jumper, J. et al. Improved protein structure prediction using potentials from deep learning. Nature 2020.
2. Ingraham, J. et al. Learning Protein Structure with a Differentiable Simulator. ICLR 2019 Oral. [Talk]

Additional Resources:
3. AlphaFold1 CASP13 slides
4. https://moalquraishi.wordpress.com/2018/12/09/alphafold-casp13-what-just-happened/
5. trRosetta: Yang et al. Improved protein structure prediction using predicted interresidue orientations. PNAS 2020.
Ellen Zhong [Slides], Yufan Xia [Slides] Pre-lecture questions Feedback: Jack McMahon, Ziyu Xiong
3 February 10 Breakthroughs in protein structure prediction 1. Jumper, J., Evans, R., Pritzel, A. et al. Highly accurate protein structure prediction with AlphaFold. Nature 2021.
2. Abramson, J., Adler, J., Dunger, J. et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature 2024.

Additional Resources:
3. Tunyasuvunakool, K., Adler, J., Wu, Z. et al. Highly accurate protein structure prediction for the human proteome. Nature 2021.
4. AlphaFold2 slides. [CASP14 talk] [Michael Figurnov slides]
5. https://moalquraishi.wordpress.com/2020/12/08/alphafold2-casp14-it-feels-like-ones-child-has-left-home/
6. Primer on transformers: [1] [2]
7. The Illustrated AlphaFold(3)
Jack Shaw, Maxwell Soh [Slides-AF2] [Slides-AFDB] [Slides-AF3] Pre-lecture questions Feedback: Robert Heeter, Yagiz Devre
4 February 17 Protein design I 1. Ingraham et al. Generative models for graph-based protein design. NeurIPS 2019.
2. ESM-IF1: Hsu et al. Learning inverse folding from millions of predicted structures. ICML 2022.
3. Pacesa et al. One-shot design of functional protein binders with BindCraft. Nature 2025.

Additional Resources:
4. Dauparas et al. Robust deep learning–based protein sequence design using ProteinMPNN. Science 2022.
Jack McMahon, Joseph Clark, Md Toki Tahmid Pre-lecture questions Feedback: Tony Chen, Khai Evdaev
5 February 24 Protein structure determination I: Cryo-EM reconstruction 1. Zhong et al. Reconstructing continuous distributions of protein structure from cryo-EM images. ICLR 2020 Spotlight.
2. Zhong et al. CryoDRGN: reconstruction of heterogeneous cryo-EM structures using neural networks. Nature Methods 2021. [pdf]
3. Levy et al. CryoDRGN-AI: neural ab initio reconstruction of challenging cryo-EM and cryo-ET datasets. Nature Methods 2025.

Additional Resources:
4. Computer vision related works:
iii. Xie et al. Neural Fields in Visual Computing and Beyond. Computer Graphics Forum 2022.
5. Cryo-EM background:
Singer & Sigworth. Computational Methods for Single-Particle Cryo-EM. Annual Review of Biomedical Data Science, 2020.
6. Primer on Variational Autoencoders: [1] [2] [3] [4]
Guest lecture by Rish Raghu, Robert Heeter Pre-lecture questions Feedback: Xingjian Hou, Sterling Hall
6 March 3 Protein design II: Diffusion and flow matching models of sequence and structure
7 March 10 No class -- Spring Recess
8 March 17 Protein + Biological Language Modeling I
9 March 24 Protein + Biological Language Modeling II
10 March 31 Physics-based modeling
11 April 7 Protein Structure Determination II
12 April 14 Other molecules: RNA, small molecules, etc.
13 April 21 Skip
14 April 28 AI Scientists
15 Tuesday, May 5 or May 12 (TBD), 1:20-4:10pm Final project presentations