Grounding Language by Seeing, Hearing, and Interacting
In my talk, I will discuss three lines of work to bridge the gap between machines and humans in grounded language understanding. I will first discuss how we might measure grounded understanding, introducing a suite of approaches for constructing benchmarks that use machines in the loop to filter out spurious biases. Next, I will introduce PIGLeT: a model that learns physical commonsense by interacting with the world through simulation, and uses this knowledge to ground language. From an English-language description of an event, PIGLeT can anticipate how the world state might change, outperforming text-only models that are orders of magnitude larger. Finally, I will introduce MERLOT, which learns about situations in the world by watching millions of YouTube videos with transcribed speech. Through training objectives inspired by the developmental psychology idea of multimodal reentry, MERLOT learns to jointly reason over language, vision, and sound.
Together, these directions suggest a path forward for building machines that learn language rooted in the world.
Bio: Rowan Zellers is a final-year PhD candidate in Computer Science & Engineering at the University of Washington, advised by Yejin Choi and Ali Farhadi. His research focuses on enabling machines to understand language, vision, sound, and the world beyond these modalities. He has been recognized with an NSF Graduate Fellowship and a NeurIPS 2021 Outstanding Paper Award, and his work has been covered by several media outlets, including Wired, the Washington Post, and the New York Times. He holds a B.S. in Computer Science & Mathematics from Harvey Mudd College and has previously interned at the Allen Institute for AI.
This talk will be recorded and live-streamed at https://mediacentrallive.princeton.edu/