Instructor: Prof. Ellen Zhong
Time: Thursdays 3:00-5:00p
"Precept" / student-only discussion: Wednesdays 1:00-2:00p, CS 301
Office hours: Mondays 4:00-5:00p, CS 314
Slack Link
Syllabus Link

Recent breakthroughs in machine learning algorithms have transformed the study of the 3D structure of proteins and other biomolecules. This seminar class will survey recent papers on ML applied to tasks in protein structure prediction, structure determination, computational protein design, physics-based modeling, and more. We will take a holistic approach to discussing papers, covering their historical context, algorithmic contributions, and potential impact on scientific discovery and on applications such as drug discovery.

For more information on the discussion format, expectations, and grading, see the course syllabus.


Goals

  • Learn about machine learning methods applied to problems in structural biology
  • Learn how to critically read and evaluate papers
  • Learn how to pose research problems and practice written scientific communication skills
  • Bonus: Exposure to relevant basic and applied ML research in industry from guest speakers


Topics

A non-exhaustive list of topics we will cover includes:

  • An introduction to structural biology
  • Protein structure prediction before and after AlphaFold2
  • Computer vision and cryo-electron microscopy (cryo-EM)
  • Computational protein design, in particular, antibody and vaccine design
  • Physics-based modeling and statistical mechanics
  • Small molecule drug discovery

Selected papers will cover a broad range of algorithmic concepts and machine learning techniques including:

  • Supervised learning and designing appropriate benchmarks and metrics
  • Language modeling and transformers
  • Generative modeling techniques including VAEs, GANs, normalizing flows, and diffusion models
  • Geometric deep learning
  • Neural fields and multi-view 3D reconstruction

In addition to the assigned papers, optional primers or reviews on relevant topics will be made available for background reading.


Assignments

Assignment 1. Due 11am, Friday, September 30th via Canvas

Assignment 2. Due 11am, Friday, October 14th via Canvas

Assignment 3. Due 3pm, Thursday, November 3rd in class and via Canvas

Assignment 4. Due 3pm, Thursday, December 8th via Canvas

Assignment / Quiz 5. 3pm, Thursday, December 15th in class


Guest Speakers

Thursday September 22nd, 3pm ET
Dr. Michael Figurnov (DeepMind)

Title: Highly accurate protein structure prediction with AlphaFold

Abstract: Predicting a protein’s structure from its primary sequence has been a grand challenge in biology for the past 50 years, holding the promise to bridge the gap between the pace of genomics discovery and resulting structural characterization. In this talk, we will describe work at DeepMind to develop AlphaFold, a new deep learning-based system for structure prediction that achieves high accuracy across a wide range of targets. We demonstrated our system in the 14th biennial Critical Assessment of Protein Structure Prediction (CASP14) across a wide range of difficult targets, where the assessors judged our predictions to be at an accuracy “competitive with experiment” for approximately 2/3rds of proteins. The talk will focus on the underlying machine learning ideas, while also touching on the implications for biological research.

Bio: Michael Figurnov is a Staff Research Scientist at DeepMind. He has been working with the AlphaFold team for the past four years. Before joining DeepMind, he did his Ph.D. in Computer Science at the Bayesian Methods Research Group under the supervision of Dmitry Vetrov. His research interests include deep learning, Bayesian methods, and machine learning for biology.


Thursday November 10th, 12:30p ET (CS 105)
Dr. John Ingraham (Generate Biomedicines)

Title: Illuminating protein space with a programmable generative model

Abstract: Three billion years of evolution have produced a tremendous diversity of protein molecules, but it is yet unknown how thoroughly evolution has sampled the space of possible protein folds and functions. Here, by introducing a new, scalable generative prior for proteins and protein complexes, we provide further evidence that earth's extant molecular biodiversity represents only a small fraction of what is possible for polypeptides. To enable this, we introduce customized neural networks that enable long-range reasoning, that respect the statistical structures of polymer ensembles, and that can efficiently realize 3D structures of proteins from predicted geometries. We show how this framework broadly enables protein design under auxiliary constraints, which can be any composition of semantics, substructure, symmetries, shape, and even natural language prompts.

Bio: John Ingraham is the Head of Machine Learning at Generate Biomedicines, Inc., where he leads a team of scientists and engineers developing new kinds of machine learning systems for protein design. He has spent most of his career developing structured statistical models of the rich diversity found in protein sequences and structures, including as a postdoc at MIT CSAIL with Tommi Jaakkola and Regina Barzilay, working on some of the first generative models for structure-based sequence design, and before that in his PhD with Debora Marks at Harvard Medical School, developing deep learning and statistical-physics-inspired models of deep evolutionary sequence variation and protein folding.


Schedule

Week 1 (September 8)
Topic: Course overview; Introduction to machine learning in structural biology
Readings:
Optional reading:
1. Dill et al. The Protein Folding Problem. Annual Review of Biophysics 2008.
Format: E.Z. lecture
Assignment: N/A

Week 2 (September 15)
Topic: Protein structure prediction; CASP; Supervised learning; Protein-specific metrics
Readings:
1. Senior, A.W., Evans, R., Jumper, J. et al. Improved protein structure prediction using potentials from deep learning. Nature 2020.
2. Ingraham, J. et al. Learning Protein Structure with a Differentiable Simulator. ICLR 2019 Oral. [Talk]
Optional further reading:
3. https://moalquraishi.wordpress.com/2018/12/09/alphafold-casp13-what-just-happened/
4. trRosetta: Yang et al. Improved protein structure prediction using predicted interresidue orientations. PNAS 2020.
Format: Paper discussion
Assignment: N/A

Week 3 (September 22)
Topic: Breakthroughs in protein structure prediction
Readings:
1. Jumper, J., Evans, R., Pritzel, A. et al. Highly accurate protein structure prediction with AlphaFold. Nature 2021.
2. Tunyasuvunakool, K., Adler, J., Wu, Z. et al. Highly accurate protein structure prediction for the human proteome. Nature 2021.
Optional further reading:
3. AlphaFold2 slides. [CASP14 talk] [Michael Figurnov slides]
4. https://moalquraishi.wordpress.com/2020/12/08/alphafold2-casp14-it-feels-like-ones-child-has-left-home/
5. Baek et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science 2021. [paper]
6. Primer on transformers: [1] [2]
Format: Guest Seminar (Michael Figurnov) + Paper discussion
Assignment: N/A

Week 4 (September 29)
Topic: Complexes, integrative modeling, and limits of structure prediction
Readings:
1. Evans et al. Protein complex prediction with AlphaFold-Multimer. bioRxiv.
2. Terwilliger et al. Improved AlphaFold modeling with implicit experimental information. bioRxiv. (Now published in Nature Methods)
Optional further reading:
3. Nuclear pore complexes: https://www.science.org/doi/full/10.1126/science.abq4792?intcmp=trendmd-sci
4. Cluspro: https://www.nature.com/articles/nprot.2016.169
Format: Paper discussion
Assignment: Assignment 1 due at 11am Fri, Sept 30th

Week 5 (October 6)
Topic: Cryo-EM and computer vision
Readings:
1. Zhong et al. Reconstructing continuous distributions of protein structure from cryo-EM images. ICLR 2020 Spotlight.
2. Zhong et al. CryoDRGN: reconstruction of heterogeneous cryo-EM structures using neural networks. Nature Methods 2021. [pdf]
3. Mildenhall, Srinivasan, Tancik et al. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. ECCV 2020 Oral. [project page]
Optional further reading:
4. Computer vision related works:
ii. Xie et al. Neural Fields in Visual Computing and Beyond. Computer Graphics Forum 2022.
5. Cryo-EM background:
Singer & Sigworth. Computational Methods for Single-Particle Cryo-EM. Annual Review of Biomedical Data Science, 2020.
6. Primer on Variational Autoencoders: [1] [2] [3] [4]
Format: E.Z. lecture + Paper discussion
Assignment: N/A

Week 6 (October 13)
Topic: Cryo-EM and atomic modeling
Readings:
1. Zhong et al. Exploring generative atomic models in cryo-EM reconstruction. NeurIPS 2020 workshop on Machine Learning for Structural Biology.
2. Rosenbaum et al. Inferring a continuous distribution of atom coordinates from cryo-EM images using VAEs. NeurIPS 2021 workshop on Machine Learning for Structural Biology.
3. Jamali et al. ModelAngelo: Automated Model Building in Cryo-EM Maps. arXiv.
Format: Paper discussion
Assignment: Assignment 2 due at 11am Fri, Oct 14

Week 7 (October 20)
Topic: No class -- Fall Recess
Format: N/A
Assignment: N/A

Week 8 (October 27)
Topic: Physics-based modeling
Readings:
1. Lindorff-Larsen et al. How fast-folding proteins fold. Science 2011. [Perspective]
2. Noé et al. Boltzmann generators: Sampling equilibrium states of many-body systems with deep learning. Science 2019. [talk]
Optional further reading:
3. Shaw et al. Atomic-Level Characterization of the Structural Dynamics of Proteins. Science 2010.
4. CVPR 2021 tutorial on normalizing flows.
5. Grathwohl, Chen, et al. FFJORD: Free-form Continuous Dynamics for Scalable Reversible Generative Models. ICLR 2019 Oral.
Format: E.Z. lecture + Paper discussion
Assignment: N/A

Week 9 (November 3)
Topic: Protein language modeling
Readings:
1. Rives et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. PNAS 2020.
2. Hie et al. Learning Mutational Semantics. NeurIPS 2020.
3. Hie et al. Learning the language of viral evolution and escape. Science 2021.
Optional further reading:
4. ESM-2/ESMAtlas: Lin et al. Evolutionary-scale prediction of atomic level protein structure with a language model. bioRxiv 2022.
5. ESM-MSA-1b: Rao et al. MSA Transformer. ICML 2021.
6. ESM-1v: Meier et al. Language models enable zero-shot prediction of the effects of mutations on protein function. NeurIPS 2021.
7. ESM-IF1: Hsu et al. Learning inverse folding from millions of predicted structures. ICML 2022.
8. Riesselman et al. Deep generative models of genetic variation capture the effects of mutations. Nature Methods 2018.
9. Bepler & Berger. Learning protein sequence embeddings using information from structure. ICLR 2019.
10. Hie et al. Evolutionary velocity with protein language models predicts evolutionary dynamics of diverse proteins. Cell Systems 2022.
11. Heinzinger et al. Modeling aspects of the language of life through transfer-learning protein sequences. BMC Bioinformatics 2019.
12. Alley et al. Unified rational protein engineering with sequence-based deep representation learning. Nature Methods 2019.
Format: Guest Instructor (Adam Lerer) + Paper discussion
Assignment: Assignment 3 due 3pm Thu, Nov 3 in class

Week 10 (November 10)
Topic: Computational protein design
Readings:
1. Ingraham et al. Generative models for graph-based protein design. NeurIPS 2019.
2. Dauparas et al. Robust deep learning–based protein sequence design using ProteinMPNN. Science 2022.
Optional further reading:
3. Alford et al. The Rosetta all-atom energy function for macromolecular modeling and design. J. Chem. Theory Comput. 2017.
4. Battaglia et al. Relational inductive biases, deep learning, and graph networks. arXiv 2018.
5. ESM-IF1: Hsu et al. Learning inverse folding from millions of predicted structures. ICML 2022.
Format: Guest Seminar (John Ingraham) + Paper discussion
Assignment: N/A

Week 11 (November 17)
Topic: Geometric deep learning and drug discovery
Readings:
1. Gainza et al. Deciphering interaction fingerprints from protein molecular surfaces using geometric deep learning. Nature Methods 2019.
2. Stärk, Ganea, et al. EquiBind: Geometric Deep Learning for Drug Binding Structure Prediction. ICML 2022.
Optional further reading:
3. Ganea et al. Independent SE(3)-Equivariant Models for End-to-End Rigid Protein Docking. ICLR 2022.
4. Overviews of machine learning in drug discovery and development: [Review] [Talk]
Format: Paper discussion
Assignment: N/A

Week 12 (November 24)
Topic: No class -- Thanksgiving
Format: N/A
Assignment: N/A

Week 13 (December 1)
Topic: No class -- NeurIPS
Format: N/A
Assignment: N/A

Week 14 (December 8)
Topic: Generative modeling of sequence and structure
Readings:
1. Ingraham et al. Illuminating protein space with a programmable generative model. bioRxiv 2022.
Optional further reading:
2. Anand and Achim. Protein Structure and Sequence Generation with Equivariant Denoising Diffusion Probabilistic Models. arXiv 2022.
3. Trippe, Yim et al. Diffusion probabilistic modeling of protein backbones in 3D for the motif-scaffolding problem. arXiv 2022.
4. Wu et al. Protein structure generation via folding diffusion. arXiv 2022.
5. Sohl-Dickstein et al. Deep Unsupervised Learning using Nonequilibrium Thermodynamics. ICML 2015.
6. Song et al. Score-Based Generative Modeling through Stochastic Differential Equations. ICLR 2021.
Format: E.Z. lecture + Paper discussion
Assignment: Assignment 4 due 3pm Thu, Dec 8 via Canvas

Week 15 (December 15)
Topic: Structural bioinformatics
Readings:
1. Mackenzie et al. Tertiary alphabet for the observable protein structural universe. PNAS 2016.
2. van Kempen, Kim et al. Foldseek: fast and accurate protein structure search. bioRxiv 2022.
Format: Paper discussion
Assignment: Assignment / Quiz 5, 3pm Thu, Dec 15 in class