Instructor: Ellen Zhong
Time: Thursdays 3:00-5:00pm, Friend Center 007
"Precept" / student-only discussion: Tuesdays 4:30-5:30pm, CS 401
Office hours: Wednesdays 4:00-5:00pm, CS 314, or by appointment
Slack: link
Syllabus: link

Recent breakthroughs in machine learning algorithms have transformed the study of the 3D structure of proteins and other biomolecules. This seminar class will survey recent papers on ML applied to tasks in protein structure prediction, structure determination, computational protein design, physics-based modeling, and more. We will take a holistic approach when discussing papers, covering their historical context, algorithmic contributions, and potential impact on scientific discovery and on applications such as drug discovery.

For more information on the discussion format, expectations, and grading, see the course syllabus.


Goals

  • Learn about machine learning methods applied to problems in structural biology
  • Learn how to critically read and evaluate papers
  • Learn how to pose research problems and practice oral and written scientific communication skills
  • Bonus: Exposure to relevant basic and applied ML research in industry from guest speakers


Topics

A non-exhaustive list of topics we will cover:

  • An introduction to structural biology
  • Protein structure prediction before and after AlphaFold2
  • Computer vision and cryo-electron microscopy (cryo-EM)
  • Computational protein design, in particular, antibody and vaccine design
  • Physics-based modeling and statistical mechanics
  • Small molecule drug discovery

Selected papers will cover a broad range of algorithmic concepts and machine learning techniques including:

  • Supervised learning and designing appropriate benchmarks and metrics
  • Language modeling and transformers
  • Generative modeling techniques including VAEs, GANs, normalizing flows, and diffusion models (see the brief sketch after this list)
  • Geometric deep learning
  • Neural fields and multi-view 3D reconstruction
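
To give a concrete flavor of one of the techniques above, the sketch below shows a single denoising-diffusion training step on toy 3D coordinates. It is a minimal illustration only, assuming PyTorch; the placeholder network, linear noise schedule, and random data are ours for illustration and do not come from any of the assigned papers.

    # Minimal sketch of one denoising-diffusion training step (illustrative only;
    # the network, noise schedule, and data are placeholders, not any paper's method).
    import torch
    import torch.nn as nn

    T = 1000                                        # number of diffusion steps
    betas = torch.linspace(1e-4, 0.02, T)           # linear noise schedule
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative signal fraction

    # Toy denoiser: predicts the noise added to flattened 3D coordinates (64 points).
    denoiser = nn.Sequential(nn.Linear(3 * 64 + 1, 256), nn.ReLU(), nn.Linear(256, 3 * 64))
    optimizer = torch.optim.Adam(denoiser.parameters(), lr=1e-4)

    def training_step(x0):
        """x0: (batch, 64, 3) clean coordinates. Runs one gradient step, returns the loss."""
        b = x0.shape[0]
        x0 = x0.reshape(b, -1)                      # flatten to (batch, 192)
        t = torch.randint(0, T, (b,))               # random timestep per example
        a_bar = alphas_bar[t].unsqueeze(-1)         # (batch, 1)
        eps = torch.randn_like(x0)                  # Gaussian noise
        xt = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps   # noised sample
        t_feat = (t.float() / T).unsqueeze(-1)      # crude timestep feature
        eps_hat = denoiser(torch.cat([xt, t_feat], dim=-1))
        loss = ((eps_hat - eps) ** 2).mean()        # train the network to predict the noise
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

    # Example: one step on random "coordinates".
    print(training_step(torch.randn(8, 64, 3)))

Methods covered in the readings replace the toy multilayer perceptron with equivariant or frame-based architectures and operate on richer representations than flattened coordinates.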

In addition to the assigned papers, optional primers or reviews on relevant topics will be made available for background reading.


Assignments

Flash talk assignment info: link

Final project guidelines: link


Guest Speakers

Monday, September 25th, 4:30pm ET
John Jumper (DeepMind)

Title: Highly accurate protein structure prediction with deep learning

Abstract: Our work on deep learning for biology, specifically the AlphaFold system, has demonstrated that neural networks are capable of highly accurate modeling of both protein structure and protein-protein interactions. In particular, the system shows a remarkable ability to extract chemical and evolutionary principles from experimental structural data. This computational tool has repeatedly shown the ability not only to predict accurate structures for novel sequences and novel folds but also to perform unexpected tasks such as selecting stable protein designs or detecting protein disorder. In this lecture, I will discuss the context of this breakthrough: the machine learning principles and the diverse data and rigorous evaluation environment that enabled it to occur, and the many innovative ways in which the community is using these tools to do new types of science. I will also reflect on some surprising limitations -- insensitivity to mutations and the lack of context about the chemical environment of the proteins -- and how these may be traced back to essential features of the training process. Finally, I will conclude with a discussion of some ideas on the future of machine learning in structural biology and how the experimental and computational communities can think about organizing their research and data to enable many more such breakthroughs in the future.

Bio: John Jumper received his PhD in Chemistry from the University of Chicago, where he developed machine learning methods to simulate protein dynamics. Prior to that, he worked at D.E. Shaw Research on molecular dynamics simulations of protein dynamics and supercooled liquids. He also holds an MPhil in Physics from the University of Cambridge and a B.S. in Physics and Mathematics from Vanderbilt University. At DeepMind, John is leading the development of new methods to apply machine learning to protein biology.


Thursday, November 16th, 3:00pm ET
Jason Yim (MIT)

Title: Diffusion models for protein structure and de novo design

Abstract: Generative machine learning is revolutionizing protein design. In this talk, I will discuss recent advances in using diffusion models to generate protein structures and to perform conditional generation toward protein design desiderata. First, I will go over FrameDiff, including an overview of the mathematical foundation of SE(3) diffusion and a practical algorithm for training a frame-based generative model over protein backbones. Next, I will describe how SE(3) diffusion is used in RFdiffusion, a state-of-the-art protein design method that is pre-trained on protein structure prediction. We show that a single method, RFdiffusion, enables binder design, motif scaffolding, and symmetric protein generation. Finally, I will discuss current limitations and the technical challenges on the horizon for de novo protein design.
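
As optional background for the frame-based models mentioned above (not part of the talk), the sketch below shows one standard way to build a per-residue rigid frame, a rotation plus a translation, from backbone N, CA, and C coordinates via Gram-Schmidt orthogonalization, assuming NumPy. This is a generic construction for illustration, not the FrameDiff or RFdiffusion implementation.

    # Background sketch: a per-residue rigid frame (rotation R, translation t)
    # built from backbone N, CA, C coordinates -- the kind of SE(3) element
    # that frame-based generative models operate on. Illustrative only.
    import numpy as np

    def backbone_frame(n, ca, c):
        """Return (R, t): a 3x3 rotation matrix and the CA position as translation."""
        e1 = c - ca
        e1 = e1 / np.linalg.norm(e1)            # x-axis points from CA toward C
        u2 = n - ca
        e2 = u2 - np.dot(u2, e1) * e1           # remove the component along e1
        e2 = e2 / np.linalg.norm(e2)            # y-axis lies in the N-CA-C plane
        e3 = np.cross(e1, e2)                   # z-axis completes a right-handed frame
        R = np.stack([e1, e2, e3], axis=-1)     # columns of R are the frame axes
        return R, ca

    # Example with arbitrary coordinates (in angstroms).
    R, t = backbone_frame(np.array([1.46, 0.0, 0.0]),
                          np.array([0.0, 0.0, 0.0]),
                          np.array([-0.55, 1.42, 0.0]))
    print(np.round(R, 3), t)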

Bio: Jason Yim is a PhD candidate at the Massachusetts Institute of Technology (MIT) Computer Science and Artificial Intelligence Laboratory, advised by Tommi Jaakkola and Regina Barzilay. His research focuses on developing generative models for scientific applications as well as on experimental design in biology. He has previously worked as a research engineer at DeepMind and interned at Microsoft AI4Science.


Thursday, December 7th, 3:00pm ET
Stephan Eismann (Atomic AI)

Title: Enabling structure-based drug discovery for RNA using AI

Abstract: RNA molecules adopt three-dimensional structures that are critical to their function and of interest in drug discovery. Few RNA structures are known, however, and predicting them computationally has proven challenging. I will talk about ARES, a machine learning approach that enables identification of accurate structural models without assumptions about their defining characteristics, despite being trained on only 18 known RNA structures. ARES outperforms previous methods and has consistently produced the best results in community-wide blind RNA structure prediction challenges. In addition to ARES, I will talk about recent advances in tertiary RNA structure prediction at Atomic AI.

Bio: Stephan leads the ML team at Atomic AI. Originally from Germany, he did his PhD at Stanford University, where his research focused on the development of novel ML algorithms for problems in structural biology.


Schedule

Please fill out this form and contact Ellen if you are interested in signing up for this class. See last year's course website for a sample of topics and papers we will cover.

Post-lecture feedback: Please fill out this form if you are assigned to give feedback on a lecture.

Each entry below lists the week, date, topic, readings, presenters, and assigned pre-lecture questions and feedback.
Week 1 (September 7): Course overview; Introduction to machine learning in structural biology
Additional Resources:
1. Dill et al. The Protein Folding Problem. Annual Review of Biophysics 2008.
Presenters: Ellen Zhong [Slides]
Questions and feedback: N/A
Week 2 (September 14): Protein structure prediction; CASP; Supervised learning; Protein-specific metrics
Readings:
1. Senior, A.W., Evans, R., Jumper, J. et al. Improved protein structure prediction using potentials from deep learning. Nature 2020.
2. Ingraham, J. et al. Learning Protein Structure with a Differentiable Simulator. ICLR 2019 Oral. [Talk]
Additional Resources:
3. AlphaFold1 CASP13 slides
4. https://moalquraishi.wordpress.com/2018/12/09/alphafold-casp13-what-just-happened/
5. trRosetta: Yang et al. Improved protein structure prediction using predicted interresidue orientations. PNAS 2020.
Presenters: Ellen Zhong, David Shustin [Slides-1] [Slides-2]
Questions and feedback: Pre-lecture questions; Feedback: Yihao Liang, Ambri Ma
Week 3 (September 21): Breakthroughs in protein structure prediction
Readings:
1. Jumper, J., Evans, R., Pritzel, A. et al. Highly accurate protein structure prediction with AlphaFold. Nature 2021.
2. Tunyasuvunakool, K., Adler, J., Wu, Z. et al. Highly accurate protein structure prediction for the human proteome. Nature 2021.
Additional Resources:
3. AlphaFold2 slides. [CASP14 talk] [Michael Figurnov slides]
4. https://moalquraishi.wordpress.com/2020/12/08/alphafold2-casp14-it-feels-like-ones-child-has-left-home/
5. Baek et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science 2021. [paper]
6. Primer on transformers: [1] [2]
Presenters: Viola Chen, Xiaxin Shen, Ellen Zhong [Slides-1] [Slides-2]
Questions and feedback: Pre-lecture questions; Feedback: Andy Zhang, Brendan Wang
Week 4 (September 28): Protein structure determination I: Cryo-EM reconstruction
Readings:
1. Zhong et al. Reconstructing continuous distributions of protein structure from cryo-EM images. ICLR 2020 Spotlight.
2. Zhong et al. CryoDRGN: reconstruction of heterogeneous cryo-EM structures using neural networks. Nature Methods 2021. [pdf]
Additional Resources:
3. Computer vision related works: Xie et al. Neural Fields in Visual Computing and Beyond. Computer Graphics Forum 2022.
4. Cryo-EM background: Singer & Sigworth. Computational Methods for Single-Particle Cryo-EM. Annual Review of Biomedical Data Science 2020.
5. Primer on Variational Autoencoders: [1] [2] [3] [4]
Presenters: Ellen Zhong [Slides]
Questions and feedback: Pre-lecture questions
Week 5 (October 5): Protein language modeling
Readings (a sample of):
1. Rives et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. PNAS 2020.
2. Hie et al. Learning Mutational Semantics. NeurIPS 2020.
3. Hie et al. Learning the language of viral evolution and escape. Science 2021.
4. ESM-2/ESMAtlas: Lin et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. bioRxiv 2022.
5. ESM-MSA-1b: Rao et al. MSA Transformer. ICML 2021.
6. Riesselman et al. Deep generative models of genetic variation capture the effects of mutations. Nature Methods 2018.
7. Bepler & Berger. Learning protein sequence embeddings using information from structure. ICLR 2019.
8. Nijkamp et al. ProGen2: Exploring the Boundaries of Protein Language Models. arXiv 2022.
9. Ferruz et al. ProtGPT2 is a deep unsupervised language model for protein design. Nature Communications 2022.
10. Chen et al. xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Proteins. bioRxiv 2023.
11. Rao, Bhattacharya, Thomas et al. Evaluating Protein Transfer Learning with TAPE. NeurIPS 2019 Spotlight.
12. Zheng et al. Structure-informed Language Models Are Protein Designers. ICML 2023 Oral.
Presenters: Paper discussion + short presentations (flash talk info and sign-up spreadsheet)
Written summary due before class on Canvas.
Presentation upload form: here
Week 6 (October 12): Protein design I: Inverse folding
Readings:
1. Ingraham et al. Generative models for graph-based protein design. NeurIPS 2019.
2. Dauparas et al. Robust deep learning–based protein sequence design using ProteinMPNN. Science 2022.
3. ESM-IF1: Hsu et al. Learning inverse folding from millions of predicted structures. ICML 2022.
Presenters: Brendan Wang, Yukang Yang, Justin Wang [Slides-1] [Slides-2] [Slides-3]
Questions and feedback: Pre-lecture questions; Feedback: Kaiqu Liang, Minkyu Jeon, Xiaxin Shen
Week 7 (October 19): No class -- Fall Recess. Final Project Part 1 due (project proposal).
Week 8 (October 26): Structural bioinformatics
Readings:
1. Mackenzie et al. Tertiary alphabet for the observable protein structural universe. PNAS 2016.
2. van Kempen, Kim et al. Fast and accurate protein structure search with Foldseek. Nature Biotechnology 2023.
Presenters: Eugene Choi, Snigdha Sushil Mishra [Slides-1] [Slides-2]
Questions and feedback: Pre-lecture questions; Feedback: Jiahao Qiu, Viola Chen
Week 9 (November 2): Physics-based modeling
Readings:
1. Lindorff-Larsen et al. How fast-folding proteins fold. Science 2011. [Perspective]
2. Noe et al. Boltzmann generators: Sampling equilibrium states of many-body systems with deep learning. Science 2019. [talk]
Optional further reading:
3. Shaw et al. Atomic-Level Characterization of the Structural Dynamics of Proteins. Science 2010.
4. CVPR 2021 tutorial on normalizing flows.
5. Grathwohl, Chen, et al. FFJORD: Free-form Continuous Dynamics for Scalable Reversible Generative Models. ICLR 2019 Oral.
Presenters: Ellen Zhong, Yihao Liang, Jiahao Qiu [Slides-1] [Slides-2]
Questions and feedback: Pre-lecture questions; Feedback: Justin Wang, Eugene Choi
Week 10 (November 9): Protein structure determination II
Readings:
1. Punjani and Fleet. 3DFlex: determining structure and motion of flexible proteins from cryo-EM. Nature Methods 2023.
2. Jamali et al. Automated model building and protein identification in cryo-EM maps. bioRxiv 2023.
Presenters: Minkyu Jeon, Ambri Ma [Slides]
Questions and feedback: Pre-lecture questions; Feedback: Alkin Kaz, Victor Chu
Week 11 (November 16): Protein design II
Readings:
1. Yim, Trippe, Bortoli, Mathieu et al. SE(3) diffusion model with application to protein backbone generation. ICML 2023.
2. Watson, Juergens, Bennett et al. De novo design of protein structure and function with RFdiffusion. Nature 2023.
3. Ingraham et al. Illuminating protein space with a programmable generative model. bioRxiv 2022.
Presenters: Jason Yim (guest speaker), Alkin Kaz [Slides-1] [Slides-2]
Questions and feedback: Pre-lecture questions; Feedback: David Shustin, Howard Yen
Week 12 (November 23): No class -- Thanksgiving
Week 13 (November 30): Small molecule drug discovery
Readings:
1. Corso, Stark, Jing et al. DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking. ICLR 2023.
2. Krishna, Wang, Ahern et al. Generalized Biomolecular Modeling and Design with RoseTTAFold All-Atom. bioRxiv 2023.
Presenters: Victor Chu, Howard Yen [Slides]
Questions and feedback: Pre-lecture questions; Feedback: Yukang Yang, Snigdha Sushil Mishra
Week 14 (December 7): RNA structure prediction
Readings:
1. Townshend, Eismann, Watkins et al. Geometric deep learning of RNA structure. Science 2021.
Additional Resources:
2. Zhang et al. Advances and opportunities in RNA structure experimental determination and computational modeling. Nature Methods 2022.
Presenters: Stephan Eismann (guest speaker)
Week 15 (Tuesday, December 12, 3:00-5:00pm): Final project presentations