Instructor Danqi Chen (danqic AT
Teaching assistant Alexander Wettig (awettig AT
Lectures Monday/Wednesday 10:30-11:50am
Location Sherrerd Hall 101
Pre-lecture feedback meetings Monday 3:30-4pm for Wednesday lectures, Friday 4:45pm-5:15pm for Monday lectures, COS 412
Office hours Danqi's office hour: Monday 2:30-3:30pm, COS 412 (by appointment)
Alex's office hour: Wednesday 3-4pm, Friend Center (student space lobby)
Feedback form

We will use a Slack team for most communications this semester (no Ed!). We will add you to the Slack team after the first lecture; if you join the class late, just email us and we will add you. Once you are on Slack, we prefer Slack messages over emails for all logistical questions. We also encourage students to use Slack to discuss lecture content and projects.

Large language models (LLMs) have utterly transformed the field of natural language processing (NLP) in the last 3-4 years. They form the basis of state-of-the-art systems and have become ubiquitous in solving a wide range of natural language understanding and generation tasks. Along with their unprecedented potential and capabilities, these models also give rise to new ethical and scalability challenges. This course aims to cover cutting-edge research topics centering around pre-trained language models. We will discuss their technical foundations (BERT, GPT, T5 models, mixture-of-experts models, retrieval-based models), emerging capabilities (knowledge, reasoning, few-shot learning, in-context learning), fine-tuning and adaptation, system design, as well as security and ethics. We will cover each topic and discuss important papers in depth. Students will be expected to routinely read and present research papers and complete a research project at the end.

This is an advanced graduate course. All students are expected to have taken machine learning and NLP courses and to be familiar with deep learning models such as Transformers.

Learning goals:

  • This course is intended to prepare you for performing cutting-edge research in natural language processing, especially topics related to pre-trained language models. We will discuss state-of-the-art models and their capabilities and limitations.
  • Practice your research skills, including reading research papers, conducting literature surveys, giving oral presentations, and providing constructive feedback.
  • Gain hands-on experience through the final project, from brainstorming ideas to implementation, empirical evaluation, and writing the final paper.

Course structure

  • Class participation (25%): In each class, we will cover 1-2 papers. You are required to read these papers in depth and answer around 3 pre-lecture questions (see "pre-lecture questions" in the schedule table) by 11:59pm the day before the lecture. These questions are designed to test your understanding and stimulate your thinking on the topic, and they count towards class participation (we will not grade correctness; as long as you do your best to answer these questions, you will be fine). In the last 20 minutes of the class, we will review and discuss these questions in small groups.
  • Presentations (30%): For each lecture, we will ask two students to work together and deliver a 60-minute lecture. The goal is to educate the others in the class about the topic, so do think about how to best cover the material, do a good job with slides, and be prepared for lots of questions. The topics and scheduling will be decided at the beginning of the semester. All the students are expected to come to the class regularly and participate in discussion.
    • 1-2 papers have already been chosen for each topic. We also encourage you to include background or other useful material from "recommended reading" where you see a fit.
    • You are also required to meet with the instructor before the lecture (Monday 3:30-4pm for Wednesday lectures and Friday 4:45-5:15pm for Monday lectures). Please send your draft slides on Slack before 11:59pm the day prior to the meeting and we will go over your slides during the meeting.
    • You are expected to present 1-2 times and you will receive feedback on your presentation from 3-4 classmates.
  • Lecture feedback (5%): In addition to giving lectures, you are also required to provide written feedback to the presenter(s) on their lecture, 1+ pages in length, commenting on the content, delivery, clarity, completeness, etc. Complete sentences are not needed and bullet points are fine, but the feedback should be thorough and constructive. These notes should be sent to the instructor/TA on Slack within a day of the lecture (a Google Doc link is preferred). You are expected to do this 2-3 times throughout the semester.
  • Final project (40%): At the end of the class, everyone is required to do a class project related to LLMs and submit a final paper. You should work in a team of 2 or 3. Two example types of projects:
    • Train or fine-tune a medium-sized language model (e.g., BERT/RoBERTa, T5, GPT-2) yourself for a task of your interest. You will probably need to access pre-trained models on HuggingFace's hub. If you don't have decent compute resources, we will provide a compute budget for you to execute your project.
    • Prompt and evaluate a very large language model (e.g., GPT-3, Codex) to understand its capabilities, limitations, or risks. We will provide a budget for you to access these large models if needed.
  • Everyone is required to submit a proposal by Oct 14th at 11:59pm, and the final paper is due on Dec 16th (dean's date). We will schedule in-class project presentations at the end of the semester, on Dec 5th. More detailed guidelines can be found here.
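
As a quick sanity check on the weighting above, the four components add up to 100%. The following is only a minimal sketch: it assumes each component is scored on a 0-100 scale, which the course does not specify, and the names `WEIGHTS` and `weighted_total` are hypothetical, chosen for illustration.

```python
# Component weights in percent, taken from the course structure above.
WEIGHTS = {
    "participation": 25,
    "presentations": 30,
    "lecture_feedback": 5,
    "final_project": 40,
}


def weighted_total(scores):
    """Combine per-component scores (assumed to be 0-100) into one total.

    This is an illustrative assumption, not the official grading formula.
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# The stated percentages cover the whole grade.
assert sum(WEIGHTS.values()) == 100
```

For example, hypothetical scores of 80, 90, 100, and 70 for participation, presentations, lecture feedback, and the final project would combine as (25·80 + 30·90 + 5·100 + 40·70) / 100.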

Useful materials:


Date | Topic/papers | Recommended reading | Pre-lecture questions | Presenters | Feedback providers
Sep 7 (Wed) Introduction
1. Human Language Understanding & Reasoning
2. Attention Is All You Need (Transformers)
3. Blog Post: The Illustrated Transformer
4. HuggingFace's course on Transformers
- Danqi Chen
What are LLMs?
Sep 12 (Mon) BERT (encoder-only models)
1. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
1. Deep contextualized word representations (ELMo)
2. Improving Language Understanding by Generative Pre-Training (OpenAI GPT)
3. RoBERTa: A Robustly Optimized BERT Pretraining Approach
4. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
lec 2 questions Danqi Chen
Sep 14 (Wed) T5 (encoder-decoder models)
1. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (T5)
1. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
2. mT5: A massively multilingual pre-trained text-to-text transformer
3. AlexaTM 20B: Few-Shot Learning Using a Large-Scale Multilingual Seq2Seq Model
lec 3 questions Abhishek Panigrahi, Victoria Graf
Edward Tian, Zihan Ding, Jiatong Yu, Anirudh Ajith
Sep 19 (Mon) GPT-3 (decoder-only models)
1. Language Models are Few-Shot Learners (GPT-3)
1. Language Models are Unsupervised Multitask Learners (GPT-2)
2. PaLM: Scaling Language Modeling with Pathways
3. OPT: Open Pre-trained Transformer Language Models
lec 4 questions Sabhya Chhabria, Michael Tang
Anika Maskara, Tianle Cai, Richard Zhu, Andrea Wynn
How to Use and Adapt LLMs?
Sep 21 (Wed) Prompting for few-shot learning
1. Making Pre-trained Language Models Better Few-shot Learners (blog post)
2. How Many Data Points is a Prompt Worth?
1. Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference
2. True Few-Shot Learning with Language Models
3. Cutting Down on Prompts and Parameters: Simple Few-Shot Learning with Language Models
4. Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing
lec 5 questions Kaixuan Huang, Edward Tian
Sam Liang, Mengzhou Xia, Victoria Graf, Tianle Cai
Sep 26 (Mon) Prompting as parameter-efficient fine-tuning
1. Prefix-Tuning: Optimizing Continuous Prompts for Generation
2. The Power of Scale for Parameter-Efficient Prompt Tuning
1. Factual Probing Is [MASK]: Learning vs. Learning to Recall
2. P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks
3. LoRA: Low-Rank Adaptation of Large Language Models
4. Towards a Unified View of Parameter-Efficient Transfer Learning
lec 6 questions Chris Pan, Hongjie Wang
Sabhya Chhabria, Andrea Wynn, Sam Liang, Wenhan Xia
Sep 28 (Wed) In-context learning
1. Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?
2. An Explanation of In-context Learning as Implicit Bayesian Inference (we don't expect you to read this paper in depth; you can check out this blog post instead)
1. What Makes Good In-Context Examples for GPT-3?
2. Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity
3. Data Distributional Properties Drive Emergent In-Context Learning in Transformers
4. What Can Transformers Learn In-Context? A Case Study of Simple Function Classes
lec 7 questions Sam Liang, Kexin Jin
Anika Maskara, Zixu Zhang, Tong Wu, Victoria Graf
Oct 3 (Mon) Calibration of prompting LLMs
1. Calibrate Before Use: Improving Few-Shot Performance of Language Models
2. Surface Form Competition: Why the Highest Probability Answer Isn’t Always Right
1. Noisy Channel Language Model Prompting for Few-Shot Text Classification
2. How Can We Know When Language Models Know? On the Calibration of Language Models for Question Answering
3. Language Models (Mostly) Know What They Know
lec 8 questions Vishvak Murahari, Howard Yen
Jiatong Yu, Howard Chen, Chris Pan, Andre Niyongabo Rubungo, Devon Wood-Thomas
Oct 5 (Wed) Reasoning
1. Chain of Thought Prompting Elicits Reasoning in Large Language Models
2. Large Language Models are Zero-Shot Reasoners
1. Explaining Answers with Entailment Trees
2. Self-Consistency Improves Chain of Thought Reasoning in Language Models
3. Faithful Reasoning Using Large Language Models
lec 9 questions Zihan Ding, Zixu Zhang
Vishvak Murahari, Beiqi Zou, Chris Pan, Xiangyu Qi
Oct 10 (Mon) Knowledge
1. Language Models as Knowledge Bases?
2. How Much Knowledge Can You Pack Into the Parameters of a Language Model?
1. Knowledge Neurons in Pretrained Transformers
2. Fast Model Editing at Scale
3. Question and Answer Test-Train Overlap in Open-Domain Question Answering Datasets
lec 10 questions Jane Pan, Mengzhou Xia
Andre Niyongabo Rubungo, Devon Wood-Thomas, Xiangyu Qi, Howard Chen
Dissecting LLMs: Data, Model Scaling and Risks
Oct 12 (Wed) Data
1. Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus
1. The Pile: An 800GB Dataset of Diverse Text for Language Modeling
2. Deduplicating Training Data Makes Language Models Better
lec 11 questions Andre Niyongabo Rubungo, Tanushree Banerjee
Arseniy Andreyev, Wenhan Xia, Xindi Wu, Richard Zhu
Oct 14 (Fri) Final project proposal due at 11:59pm
Submit here.
Oct 17 (Mon) Fall recess (no class)
Oct 19 (Wed) Fall recess (no class)
Oct 24 (Mon) Scaling
1. Training Compute-Optimal Large Language Models
1. Scaling Laws for Neural Language Models
2. Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers
3. Scaling Laws for Autoregressive Generative Modeling
lec 12 questions Anika Maskara, Simon Park
Hongjie Wang, Sabhya Chhabria, Edward Tian, Kaixuan Huang
Oct 26 (Wed) Privacy
1. Extracting Training Data from Large Language Models
1. Quantifying Memorization Across Neural Language Models
2. Deduplicating Training Data Mitigates Privacy Risks in Language Models
3. Large Language Models Can Be Strong Differentially Private Learners
4. Recovering Private Text in Federated Learning of Language Models
lec 13 questions Xiangyu Qi, Tong Wu
Anirudh Ajith, Austin Wang, Tanushree Banerjee, Arseniy Andreyev
Oct 31 (Mon) Bias & Toxicity I: evaluation
1. RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models
2. OPT paper, Section 4
1. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?
2. Red Teaming Language Models with Language Models
3. Whose Language Counts as High Quality? Measuring Language Ideologies in Text Data Selection
lec 14 questions Maxine Perroni-Scharf, Richard Zhu
Tong Wu, Hongjie Wang, Howard Yen, Mengzhou Xia
Nov 2 (Wed) Bias & Toxicity II: mitigation
1. Self-Diagnosis and Self-Debiasing: A Proposal for Reducing Corpus-Based Bias in NLP
1. Challenges in Detoxifying Language Models
2. Detoxifying Language Models Risks Marginalizing Minority Voices
3. Plug and Play Language Models: A Simple Approach to Controlled Text Generation
4. GeDi: Generative discriminator guided sequence generation
lec 15 questions Anirudh Ajith, Arnab Bhattacharjee
Maxine Perroni-Scharf, Xindi Wu, Jane Pan, Howard Chen
Beyond Current LLMs: Models and Applications
Nov 7 (Mon) Sparse models
1. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
1. Efficient Large Scale Language Modeling with Mixtures of Experts
2. Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models
3. A Review of Sparse Expert Models in Deep Learning
lec 16 questions Zhou Lu, Wenhan Xia [slides]
Michael Tang, Arnab Bhattacharjee, Kexin Jin, Beiqi Zou
Nov 9 (Wed) Retrieval-based LMs
1. Improving language models by retrieving from trillions of tokens
1. Generalization through Memorization: Nearest Neighbor Language Models
2. Training Language Models with Memory Augmentation
3. Few-shot Learning with Retrieval Augmented Language Models
lec 17 questions Tianle Cai, Beiqi Zou
Simon Park, Jane Pan, Maxine Perroni-Scharf, Abhishek Panigrahi
Nov 14 (Mon) Training LMs with human feedback
1. Training language models to follow instructions with human feedback
1. Learning to summarize from human feedback
2. Fine-Tuning Language Models from Human Preferences
3. MemPrompt: Memory-assisted Prompt Editing with User Feedback
4. LaMDA: Language Models for Dialog Applications
lec 18 questions Howard Chen, Austin Wang
Abhishek Panigrahi, Simon Park, Kaixuan Huang, Arseniy Andreyev
Nov 16 (Wed) Code LMs
1. Evaluating Large Language Models Trained on Code
1. A Conversational Paradigm for Program Synthesis
2. InCoder: A Generative Model for Code Infilling and Synthesis
3. A Systematic Evaluation of Large Language Models of Code
4. Language Models of Code are Few-Shot Commonsense Learners
5. Competition-Level Code Generation with AlphaCode
lec 19 questions Arseniy Andreyev, Jiatong Yu
Howard Yen, Michael Tang, Tanushree Banerjee, Kexin Jin
Nov 21 (Mon) Multimodal LMs
1. Flamingo: a Visual Language Model for Few-Shot Learning
1. Blog post: Generalized Visual Language Models
2. Learning Transferable Visual Models From Natural Language Supervision (CLIP)
3. Multimodal Few-Shot Learning with Frozen Language Models
4. CM3: A Causal Masked Multimodal Model of the Internet
lec 20 questions Andrea Wynn, Xindi Wu
Arnab Bhattacharjee, Vishvak Murahari, Austin Wang, Zihan Ding
Nov 23 (Wed) Thanksgiving recess (no class)
Nov 28 (Mon) Guest lecture: Alexander Rush (Cornell/Hugging Face)
Multitask Prompted Training for Zero-Shot Models

1. Multitask Prompted Training Enables Zero-Shot Task Generalization
2. PromptSource: An Integrated Development Environment and Repository for Natural Language Prompts
3. Scaling Instruction-Finetuned Language Models
4. Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks
Nov 30 (Wed) AI Alignment + open discussion
1. A General Language Assistant as a Laboratory for Alignment
2. Alignment of Language Agents
3. Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Devon Wood-Thomas (half of the lecture)
Richard Zhu, Sabhya Chhabria, Andrea Wynn, Anirudh Ajith
Dec 5 (Mon) In-class project presentations (extended class)
Dec 7 (Wed) No class
Dec 16 (Fri) Final project due at 11:59pm (dean's date)