| Instructor | Ellen Zhong |
| Time | Tuesdays 1:20-4:10p, Friend Center 007 |
| Office hours | Mondays 4:00-5:00p, CS 314, or by appointment |
| Slack | Link |
| Syllabus | Link |
Recent breakthroughs in machine learning algorithms have transformed the study of the 3D structure of proteins and other biomolecules. This seminar class will survey recent papers on ML applied to tasks in protein structure prediction, structure determination, computational protein design, physics-based modeling, and more. We will take a holistic approach when discussing papers, including discussing their historical context, algorithmic contributions, and potential impact on scientific discovery and applications such as drug discovery.
For more information on the discussion format, expectations, and grading, see the course syllabus.
A non-exhaustive list of topics we will cover includes:
Selected papers will cover a broad range of algorithmic concepts and machine learning techniques including:
In addition to the assigned papers, optional primers or reviews on relevant topics will be made available for background reading.
Final project guidelines: link
Tuesday, March 3rd, 1:20pm ET
Mark Goldstein (Flatiron Institute)
Title: Diffusion Generative Models
Abstract: Generative models have become central tools in computational biology and beyond, enabling the design of proteins, antibodies, small molecules, and more. But what exactly is a generative model? In this lecture, we survey the landscape of deep generative modeling — or more plainly, ways to fit and sample from complex probability distributions using large neural networks.
We start with "early" approaches like normalizing flows and VAEs, make our way through energy-based models, and arrive at diffusion- and flow-based models. We discuss the intuition behind each paradigm, which limitations motivated the community to move on, and why diffusion/flow models have stuck around for some time. We also cover the basic design choices one faces when training or using a diffusion model in practice.
We then turn to applications in protein design, examining how these frameworks have been adapted to handle the geometric and biochemical structure of biomolecules. We will see a few successful applications, illustrating how the principles from the first half of the lecture come to life in state-of-the-art systems for controllable structure and sequence design.
Bio: Mark Goldstein is a Research Fellow in the Center for Computational Mathematics at the Flatiron Institute. Previously, he completed his PhD at the NYU Courant Institute of Mathematical Sciences, CILVR group, advised by Rajesh Ranganath and Thomas Wies. He works on deep generative models and machine learning in the sciences.
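The forward-noising process and denoising objective at the heart of the diffusion paradigm discussed above can be sketched in a few lines. This is a toy illustration for intuition only, not any particular system's implementation: the linear noise schedule and the zero-predicting stand-in "denoiser" are assumptions made so the example runs self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)

# A simple linear noise schedule: beta_t in [1e-4, 0.02] over T = 1000 steps.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)  # \bar{alpha}_t = prod_{s<=t} alpha_s

def q_sample(x0, t, eps):
    """Sample x_t ~ q(x_t | x_0) in closed form (the DDPM forward process)."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

def eps_theta(x_t, t):
    """Stand-in for a neural network trained to predict the added noise."""
    return np.zeros_like(x_t)

# One toy training example: noise a data point, then score the noise prediction.
x0 = rng.normal(size=8)       # stand-in for one data point
t = 500                       # a randomly chosen timestep
eps = rng.normal(size=8)      # the noise actually added
x_t = q_sample(x0, t, eps)

# The simple DDPM training objective: ||eps - eps_theta(x_t, t)||^2.
loss = np.mean((eps - eps_theta(x_t, t)) ** 2)
```

In practice `eps_theta` is a large neural network and the loss is averaged over random timesteps and data; sampling then runs the learned denoiser backwards from pure noise.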
Tuesday, March 17th, 1:20pm ET
Zeming Lin (Biohub // Evolutionary Scale)
Title: Protein Language Models
Abstract: In this talk I’ll introduce protein language modeling, focusing on the ESM family of models that I’ve helped develop over my career. We’ll see how self-supervised training on millions of natural sequences uncovers the “grammar” of evolution, enabling models to infer structure, understand function, and guide design.
Bio: Zeming is a cofounder of EvolutionaryScale, which recently joined forces with Biohub, an institute focused on combining frontier biology research with frontier AI. Previously, he completed a PhD at NYU and worked at Meta as a research engineer, helping develop PyTorch.
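One concrete way masked protein language models are used in practice (see the Rives et al. reading in week 8) is zero-shot mutation scoring: mask a position, then compare the model's log-probability of the mutant residue against the wild type. The sketch below shows only that scoring arithmetic; the `dummy_masked_probs` function is a hypothetical stand-in for a real PLM's output distribution, which would condition on the full masked sequence.

```python
import numpy as np

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")

def dummy_masked_probs(sequence, pos, rng):
    """Stand-in for a PLM's predicted distribution at a masked position.
    A real model (e.g. an ESM-style transformer) would condition on the
    whole sequence with `pos` masked; here we return an arbitrary fixed
    categorical distribution purely for illustration."""
    p = rng.random(len(AMINO_ACIDS))
    return p / p.sum()

def mutation_score(sequence, pos, mut_aa, rng):
    """Zero-shot effect score: log p(mutant) - log p(wild type) at the
    masked position. Positive values mean the model prefers the mutant."""
    probs = dummy_masked_probs(sequence, pos, rng)
    wt_aa = sequence[pos]
    return np.log(probs[AMINO_ACIDS.index(mut_aa)]) - np.log(probs[AMINO_ACIDS.index(wt_aa)])

rng = np.random.default_rng(0)
seq = "MKTAYIAKQR"              # a toy sequence, not a real protein
score = mutation_score(seq, 2, "A", rng)
```

Summing this score over all mutated positions gives the pseudo-log-likelihood-style ranking used for variant effect prediction.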
Tuesday, March 24th, 1:20pm ET
Elana Simon (Stanford University)
Title: InterPLM: Discovering Interpretable Features in Protein Language Models via Sparse Autoencoders
Abstract: Protein language models (PLMs) have demonstrated remarkable success in protein modeling and design, yet their internal mechanisms for predicting structure and function remain poorly understood. We present a systematic approach to extract and analyze interpretable features from PLMs using sparse autoencoders (SAEs). By training SAEs on embeddings from the PLM ESM-2, we identify thousands of human-interpretable features that correlate with biological concepts like binding sites, structural motifs, and functional domains. In contrast, examining individual neurons in ESM-2 reveals significantly less conceptual alignment, suggesting that PLMs represent most concepts in superposition. Beyond capturing known annotations, we show that ESM-2 learns coherent concepts that do not map onto existing annotations and propose a pipeline using language models to automatically interpret novel latent features learned by the SAEs. As practical applications, we demonstrate how these latent features can fill in missing annotations in protein databases and enable targeted steering of protein sequence generation. Our results demonstrate that PLMs encode rich, interpretable representations of protein biology and we propose a systematic framework to extract and analyze these latent features. In the process, we recover both known biology and potentially new protein motifs.
Bio: Elana Simon is a PhD student at Stanford advised by James Zou, working on understanding what machine learning models learn from biological sequences and structures. Previously, she was an ML engineer at Reverie Labs designing small-molecule cancer drugs and studied computer science at Harvard, where she worked with Debora Marks on protein language models. She also writes in-depth ML-biology analyses on her blog matmols and has been actively involved in research and advocacy for Fibrolamellar Hepatocellular Carcinoma.
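The core mechanic in the abstract above, an overcomplete autoencoder with an L1 sparsity penalty applied to PLM embeddings, can be sketched as follows. This is a minimal illustration with assumed dimensions and random untrained weights, not InterPLM's actual architecture or training setup.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_hidden = 32, 256  # feature dictionary is overcomplete vs. the embedding

# Randomly initialized weights; a real SAE learns these from PLM embeddings.
W_enc = rng.normal(scale=0.1, size=(d_model, d_hidden))
b_enc = np.zeros(d_hidden)
W_dec = rng.normal(scale=0.1, size=(d_hidden, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode an embedding into nonnegative feature activations, then reconstruct."""
    f = np.maximum(x @ W_enc + b_enc, 0.0)  # ReLU keeps activations nonnegative
    x_hat = f @ W_dec + b_dec
    return f, x_hat

def sae_loss(x, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that pushes most features to zero."""
    f, x_hat = sae_forward(x)
    return np.mean((x - x_hat) ** 2) + l1_coeff * np.abs(f).sum()

x = rng.normal(size=d_model)  # stand-in for one per-residue PLM embedding
f, x_hat = sae_forward(x)
loss = sae_loss(x)
```

After training, individual coordinates of `f` are the candidate interpretable features: one inspects which residues and proteins activate each one and checks for alignment with annotations like binding sites or motifs.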
Tuesday, April 28th, 1:20pm ET (tentative)
Sam Rodriques (FutureHouse, Edison Scientific)
Title: Building AI scientists
Bio: Sam Rodriques is an inventor and entrepreneur and the founder of FutureHouse, a research lab focused on building AI scientists, and Edison Scientific, which commercializes AI agents for scientific discovery. He was previously head of the Applied Biotechnology Lab at the Francis Crick Institute and earned his PhD at MIT. Named one of Time Magazine’s 100 most influential people in AI in 2025, his work spans accelerating biomedical discovery, engineering human biology, and developing new institutional models for scientific research.
Please fill out this form and contact Ellen if you are interested in signing up for this class. See a previous year's course website for a sample of topics and papers we will cover.
Post-lecture feedback: Please fill out this form if you are assigned to give feedback on a lecture.
| Week | Date | Topic | Readings | Presenters | Questions and Feedback |
|---|---|---|---|---|---|
| 1 | January 27 | Course overview; Introduction to machine learning in structural biology |
Additional Resources:
1. Dill et al. The Protein Folding Problem. Annual Review of Biophysics 2008. |
Ellen Zhong [Slides] | N/A |
| 2 | February 3 | Protein structure prediction; CASP; Supervised learning; Protein-specific metrics |
1. Senior, A.W., Evans, R., Jumper, J. et al. Improved protein structure prediction using potentials from deep learning. Nature 2020.
2. Ingraham, J. et al. Learning Protein Structure with a Differentiable Simulator. ICLR 2019 Oral. [Talk] Additional Resources: 3. AlphaFold1 CASP13 slides 4. https://moalquraishi.wordpress.com/2018/12/09/alphafold-casp13-what-just-happened/ 5. trRosetta: Yang et al. Improved protein structure prediction using predicted interresidue orientations. PNAS 2020. |
Ellen Zhong [Slides], Yufan Xia [Slides] | Pre-lecture questions Feedback: Jack McMahon, Ziyu Xiong |
| 3 | February 10 | Breakthroughs in protein structure prediction |
1. Jumper, J., Evans, R., Pritzel, A. et al. Highly accurate protein structure prediction with AlphaFold. Nature 2021.
2. Abramson, J., Adler, J., Dunger, J. et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature 2024. Additional Resources: 3. Tunyasuvunakool, K., Adler, J., Wu, Z. et al. Highly accurate protein structure prediction for the human proteome. Nature 2021. 4. AlphaFold2 slides. [CASP14 talk] [Michael Figurnov slides] 5. https://moalquraishi.wordpress.com/2020/12/08/alphafold2-casp14-it-feels-like-ones-child-has-left-home/. 6. Primer on transformers: [1] [2] 7. The Illustrated AlphaFold(3) |
Jack Shaw, Maxwell Soh [Slides-AF2] [Slides-AFDB] [Slides-AF3] | Pre-lecture questions Feedback: Robert Heeter, Yagiz Devre |
| 4 | February 17 | Protein design I |
1. Ingraham et al. Generative models for graph-based protein design. NeurIPS 2019.
2. ESM-IF1: Hsu et al. Learning inverse folding from millions of predicted structures. ICML 2022. 3. Pacesa et al. One-shot design of functional protein binders with BindCraft. Nature 2025. Additional Resources: 4. Dauparas et al. Robust deep learning–based protein sequence design using ProteinMPNN. Science 2022. | Jack McMahon, Joseph Clark, Md Toki Tahmid [Slides-Structure-Transformers] [Slides-ESM-IF1] [Slides-BindCraft] | Pre-lecture questions Feedback: Tony Chen, Khai Evdaev |
| 5 | February 24 | Protein structure determination I: Cryo-EM reconstruction |
1. Zhong et al. Reconstructing continuous distributions of protein structure from cryo-EM images. ICLR 2020 Spotlight.
2. Zhong et al. CryoDRGN: reconstruction of heterogeneous cryo-EM structures using neural networks. Nature Methods 2021. [pdf] 3. Levy et al. CryoDRGN-AI: neural ab initio reconstruction of challenging cryo-EM and cryo-ET datasets. Nature Methods 2025. Additional Resources: 4. Computer vision related works:
i. Mildenhall, Srinivasan, Tancik et al. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. ECCV 2020 Oral. [project page]
ii. Tancik et al. Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains. NeurIPS 2020 Spotlight.
iii. Xie et al. Neural Fields in Visual Computing and Beyond. Computer Graphics Forum 2022.
5. Cryo-EM background: Singer & Sigworth. Computational Methods for Single-Particle Cryo-EM. Annual Review of Biomedical Data Science, 2020.
6. Primer on Variational Autoencoders: [1] [2] [3] [4]
|
Guest lecture by Rish Raghu, Robert Heeter [Slides-Cryo-Background] [Slides-CryoDRGN] | Pre-lecture questions Feedback: Xingjian Hou, Sterling Hall |
| 6 | March 3 | Diffusion and flow matching generative models of proteins |
1. Ho et al. Denoising Diffusion Probabilistic Models. NeurIPS 2020.
2. Jing et al. AlphaFold Meets Flow Matching for Generating Protein Ensembles. ICML 2024. | Guest lecture by Mark Goldstein. Seminar by Bowen Jing |
Pre-lecture questions
Final Project Part 1 Due March 7th (Project proposal) |
| 7 | March 10 | No class -- Spring Recess | |||
| 8 | March 17 | Protein + Biological Language Modeling I |
1. Rives et al. Language models enable zero-shot prediction of the effects of mutations on protein function. PNAS 2021.
2. Lin et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 2023. 3. Hie et al. Learning the language of viral evolution and escape. Science 2021. |
Guest lecture by Zeming Lin. Conor Warren [Slides] | Pre-lecture questions. Feedback: Joseph Clark, Jack Shaw |
| 9 | March 24 | Protein + Biological Language Modeling II |
1. Simon & Zou. InterPLM: discovering interpretable features in protein language models via sparse autoencoders. Nature Methods 2025.
2. Brixi et al. Genome modelling and design across all domains of life with Evo 2. Nature 2026. |
Guest lecture by Elana Simon. Tony Chen | Pre-lecture questions. Feedback: Yufan Xia, Jack McMahon |
| 10 | March 31 | Physics-based modeling | |||
| 11 | April 7 | Protein Structure Determination II | |||
| 12 | April 14 | Other molecules: RNA, small molecules, etc. | |||
| 13 | April 21 | Skip | |||
| 14 | April 28 | AI Scientists | |||
| 15 | Tuesday, May 5 or May 12 (TBD), 1:20-4:10pm | Final project presentations |