CS Department Colloquium Series

Programming and Proving in Homotopy Type Theory

Date and Time
Monday, March 25, 2013 - 4:30pm
Location
Computer Science Small Auditorium (Room 105)
Type
CS Department Colloquium Series
Host
David Walker
There is a strong correspondence between mathematics and programming, under which mathematical proofs correspond to programs. A _proof assistant_ is a tool that a mathematician/programmer can use to represent proofs/programs, in such a way that a computer can verify whether or not they are correct. The use of proof assistants in math and computer science is becoming ever more important for managing the increasing complexity of proofs and programs. Proof assistants have been used to check significant mathematical theorems, such as the Four Color Theorem and the Feit-Thompson Odd-Order Theorem, as well as large pieces of software, such as a C compiler and the definition of the Standard ML programming language.

In this talk, I will describe my work on Homotopy Type Theory, which is a new foundation for proof assistants. Homotopy Type Theory extends type theory, the basis of several successful proof assistants, with new principles that bring it closer to informal mathematical practice. Moreover, it is based on an exciting new correspondence between type theory and the mathematical disciplines of homotopy theory and category theory, and consequently can be used to directly describe structures in these areas of math. I will describe computer-checked proofs of some basic results in algebraic topology, including calculations of homotopy groups of spheres and Eilenberg-MacLane spaces. Homotopy Type Theory is also a programming language, in which the new mathematical principles correspond to new programming techniques, which make certain programs easier to write. I will also describe my work on using Homotopy Type Theory as a programming language and its applications in computer science.
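The proofs-as-programs correspondence mentioned above can be made concrete in a tiny example (a sketch in Lean 4 syntax; this is ordinary type theory for illustration, not a result from the speaker's Homotopy Type Theory development):

```lean
-- Under the Curry-Howard correspondence, a proof of
-- "A and B implies B and A" is literally a program that
-- swaps the two components of a pair.
example {A B : Prop} : A ∧ B → B ∧ A :=
  fun h => ⟨h.2, h.1⟩
```

A proof assistant accepts this term by type-checking it, which is exactly what it means for a computer to verify the proof.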

Dan Licata received his PhD from Carnegie Mellon University in 2011, advised by Robert Harper. His dissertation, Dependently Typed Programming with Domain-Specific Logics, won the FoLLI E.W. Beth Dissertation Award in 2012. After graduating, he spent three semesters as a teaching fellow at Carnegie Mellon, designing and delivering a new introductory course on functional programming, parallelism, and verification. This year, he is a post-doctoral member at the Institute for Advanced Study, which is hosting a year-long program on an exciting new connection between programming languages and mathematics. His research focuses on creating better tools for computer-checked mathematics and software verification.

Fast learning algorithms for discovering the hidden structure in data

Date and Time
Thursday, March 28, 2013 - 4:30pm to 5:30pm
Location
Computer Science Small Auditorium (Room 105)
Type
CS Department Colloquium Series
Host
Robert Schapire
A major challenge in machine learning is to reliably and automatically discover hidden structure in data with minimal human intervention. For instance, one may be interested in understanding the stratification of a population into subgroups, the thematic make-up of a collection of documents, or the dynamical process governing a complex time series. Many of the core statistical estimation problems for these applications are, in general, provably intractable for both computational and statistical reasons, so progress is made by shifting the focus to realistic instances that rule out the intractable cases. In this talk, I'll describe a general computational approach for correctly estimating a wide class of statistical models, including Gaussian mixture models, hidden Markov models, Latent Dirichlet Allocation, probabilistic context-free grammars, and several more. The key idea is to exploit the structure of low-order correlations that is present in high-dimensional data. The scope of the new approach extends beyond the purview of previous algorithms, and it leads both to new theoretical guarantees for unsupervised machine learning and to fast, practical algorithms for large-scale data analysis.
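The idea of estimating parameters from low-order correlations can be illustrated in its simplest possible instance: recovering the two means of an equal-weight mixture of unit-variance Gaussians from the first two population moments (a toy method-of-moments sketch, not the spectral algorithms from the talk):

```python
import math

def two_gaussian_means(m1, m2):
    """Recover the means of an equal-weight mixture of two unit-variance
    Gaussians from the population moments E[x] and E[x^2].
    With weights 1/2: E[x] = (mu1 + mu2)/2 and
    E[x^2] = (mu1^2 + mu2^2)/2 + 1, so the means are the roots of a
    quadratic determined by the first two moments."""
    s = 2.0 * m1            # mu1 + mu2
    q = 2.0 * (m2 - 1.0)    # mu1^2 + mu2^2
    p = (s * s - q) / 2.0   # mu1 * mu2
    disc = math.sqrt(max(s * s - 4.0 * p, 0.0))
    return sorted([(s - disc) / 2.0, (s + disc) / 2.0])
```

In practice the moments are estimated from samples, and the higher-dimensional, higher-order analogues of this calculation are where the real algorithmic work lies.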

Daniel Hsu is a postdoc at Microsoft Research New England. From 2010 to 2011, he was a postdoc with the statistics departments at Rutgers University and the University of Pennsylvania, supervised by Tong Zhang and Sham M. Kakade. He received his Ph.D. in Computer Science in 2010 from UC San Diego, where he was advised by Sanjoy Dasgupta, and his B.S. in Computer Science and Engineering in 2004 from UC Berkeley. His research interests are in algorithmic statistics and machine learning.

Unraveling the heterogeneity and dynamics of regulatory elements in the human genome

Date and Time
Tuesday, March 12, 2013 - 4:30pm to 5:30pm
Location
Computer Science Small Auditorium (Room 105)
Type
CS Department Colloquium Series
In 2003, the Human Genome Project marked a major scientific milestone by releasing the first consensus DNA sequence of the human genome. The ENCODE Project (Encyclopedia of DNA Elements) was launched to pick up where the Human Genome Project left off, with the ambitious goal of systematically deciphering the potential function of every base (letter) in the genome. ENCODE has generated the largest collection of functional genomic data in humans to date, measuring the activity of thousands of cellular components in a variety of normal and diseased cellular contexts. In this talk, I will describe novel computational and machine learning approaches that I developed for integrative analysis of massive compendia of diverse biological data, such as ENCODE, to unravel the functional heterogeneity and variation of regulatory elements in the human genome and their implications in human disease.

I will begin with a gentle introduction to the diversity and scale of ENCODE data and a brief overview of the robust statistical methods that we developed for automated detection of DNA binding sites of hundreds of regulatory proteins from noisy experimental data. Regulatory proteins can perform multiple functions by interacting with and co-binding DNA with different combinations of other regulatory proteins. I developed a novel discriminative machine learning formulation based on regularized rule-based ensembles that was able to sort through the combinatorial complexity of possible regulatory interactions and learn statistically significant item-sets of co-binding events at an unprecedented level of detail. I found extensive evidence that regulatory proteins can switch partners at different sets of genomic domains within a single cell type and across different cell types, affecting structural and chemical properties of DNA and regulating different functional categories of target genes. Using regulatory elements discovered from ENCODE data, we were also able to provide putative functional interpretations for up to 81% of all publicly available sequence variants (mutations) identified in large-scale disease studies and to generate new hypotheses by integrating multiple sources of data.
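The notion of an item-set of co-binding events can be illustrated with a toy frequency count over binding sites (the protein names here are merely illustrative, and this simple support threshold stands in for the talk's regularized rule-ensemble learner, which additionally tests statistical significance):

```python
from itertools import combinations
from collections import Counter

def frequent_cobinding(sites, min_support):
    """Count how often each pair of regulatory proteins co-binds the
    same genomic site, and keep the pairs whose count meets a support
    threshold -- the simplest form of item-set mining."""
    counts = Counter()
    for bound in sites:  # bound: set of proteins observed at one site
        for pair in combinations(sorted(bound), 2):
            counts[pair] += 1
    return {p: c for p, c in counts.items() if c >= min_support}
```

Real analyses must also control for the background binding frequency of each protein, which is where the statistical machinery comes in.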

Finally, I will present a brief overview of my recent efforts on using multivariate Hidden Markov models to analyze the dynamics of various chemical modifications to DNA across three key axes of variation - across multiple species, across different cell-types in a single species (human), and across multiple human individuals for the same cell-type. Our results indicate a remarkable universality of chemical modifications defining hidden regulatory states across the animal kingdom with dramatic differences in the variation and functional impact of these regulatory elements between cell-types and individuals.

Together, these efforts take us one step closer to learning comprehensive models of gene regulation in humans in order to improve our system-level understanding of cellular processes and complex diseases.

A 2D + 3D Rich Data Approach to Scene Understanding

Date and Time
Wednesday, March 13, 2013 - 4:30pm to 5:30pm
Location
Computer Science Small Auditorium (Room 105)
Type
CS Department Colloquium Series
Host
Szymon Rusinkiewicz
On your one-minute walk from the coffee machine to your desk each morning, you pass by dozens of scenes -- a kitchen, an elevator, your office -- and you effortlessly recognize them and perceive their 3D structure. But this one-minute scene-understanding problem has been an open challenge in computer vision for decades. Recently, researchers have come to realize that big data is critical for building scene-understanding systems that can recognize the semantics and reconstruct the 3D structure. In this talk, I will share my experience in leveraging big data for scene understanding, shifting the paradigm from 2D view-based categorization to 3D place-centric representations.

To push the traditional 2D representation to the limit, we built the Scene Understanding (SUN) Database, a large collection of images that exhaustively spans all scene categories. However, the lack of a "rich" representation still significantly limits the traditional recognition pipeline: an image is a 2D array, but the world is 3D and our eyes see it from a viewpoint, and this is not traditionally modeled. This paradigm shift toward rich representation also opens up new challenges that require a new kind of big data -- data with extra descriptions, namely rich data. Specifically, we focus on a highly valuable kind of rich data -- multiple viewpoints in 3D -- and we build the SUN3D database to obtain an integrated "place-centric" representation of scenes. This novel representation with rich data opens up exciting new opportunities for integrating scene recognition over space and for obtaining a scene-level reconstruction of large environments. It also has many applications, such as organizing big visual data to provide photo-realistic indoor 3D maps. Finally, I will discuss some open challenges and my future plans for rich data and representation.

Jianxiong Xiao is a Ph.D. candidate in the Computer Science and Artificial Intelligence Laboratory (CSAIL) at the Massachusetts Institute of Technology (MIT). Before that, he received a B.Eng. and an M.Phil. from the Hong Kong University of Science and Technology. His research interests are in computer vision, with a focus on scene understanding. His work received the Best Student Paper Award at the European Conference on Computer Vision (ECCV) in 2012 and has appeared in the popular press. Jianxiong was awarded the Google U.S./Canada Ph.D. Fellowship in Computer Vision in 2012 and the MIT CSW Best Research Award in 2011. More information can be found on his website: http://mit.edu/jxiao.

Procedural Instructions for People and Machines

Date and Time
Monday, March 4, 2013 - 4:30pm to 5:30pm
Location
Computer Science Small Auditorium (Room 105)
Type
CS Department Colloquium Series
Host
Adam Finkelstein
Procedural tasks such as following a recipe or editing an image are very common. They require a person to execute a sequence of operations (e.g. chop the onions, or sharpen the image) and people commonly use step-by-step tutorials to learn these tasks. We focus on the domain of photo manipulations and have developed tools and techniques to help people learn, compare and automate photo manipulation procedures. We present a demonstration-based system for automatically generating succinct step-by-step visual tutorials of photo manipulations. Our tutorials are quick to generate. Moreover, users are faster and make fewer errors with our tutorials compared to book and video tutorials. We also demonstrate a new interface that allows learners to browse, analyze and compare large collections (i.e. thousands) of photo manipulation tutorials based on their command-level structure. Finally, we present a framework for generating content-adaptive macros (programs) that can transfer complex photo manipulation procedures to new target images. Together these tools allow people to learn, compare and automate procedural knowledge.

Floraine Berthouzoz is a Ph.D. candidate in the Computer Science Department at the University of California, Berkeley. Her research lies at the intersection of computer graphics and human-computer interactions, and aims at building tools that ultimately make it faster and easier for people to create high quality visual content. She has also developed a number of techniques that improve the way people learn procedural instructions. She is the recipient of a 2009 ETH Medal of Honor and a 2010 Google Anita Borg Memorial Scholarship. During her graduate studies, Floraine worked as a visiting researcher at the University of Tokyo, the Hebrew University, Columbia University and Adobe Systems. Floraine also co-founded CS KickStart, an outreach program for incoming undergraduate women. Since 2011, this program has significantly increased the number of undergraduate women pursuing computer science degrees at Berkeley.

New Algorithms for Nonnegative Matrix Factorization and Beyond

Date and Time
Wednesday, February 27, 2013 - 4:30pm to 5:30pm
Location
Computer Science Small Auditorium (Room 105)
Type
CS Department Colloquium Series
Host
Moses Charikar
Machine learning is a vibrant field with many rich techniques. However, these approaches are often heuristic: we cannot prove good bounds on either their performance or their running time except in limited settings. This talk will focus on designing algorithms whose performance can be analyzed rigorously. As an example of this project, I will describe my work on the nonnegative matrix factorization problem, which has important applications throughout machine learning (and theory). As is often the case, this problem is NP-hard when considered in full generality. However, we introduce a sub-case called separable nonnegative matrix factorization that we believe is the right notion in various contexts. We give a polynomial time algorithm for this problem, and leverage this algorithm to efficiently learn the topics in a Latent Dirichlet Allocation model (and beyond). This is an encouraging example of how theory can lead to inherently new algorithms with highly practical performance on real data sets.
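Under separability, every row of the data matrix is a nonnegative combination of a few "anchor" rows, and those anchors can be found greedily. The sketch below uses the successive projection heuristic, a standard approach for separable NMF; it is an illustrative sketch under that assumption, not necessarily the algorithm from the talk:

```python
import numpy as np

def successive_projection(X, r):
    """Find r candidate anchor rows of X under the separability
    assumption. Greedy step: pick the row with the largest norm
    (an extreme point of the row set), project every row onto the
    orthogonal complement of that direction, and repeat."""
    R = X.astype(float).copy()
    anchors = []
    for _ in range(r):
        i = int(np.argmax(np.linalg.norm(R, axis=1)))
        anchors.append(i)
        u = R[i] / np.linalg.norm(R[i])
        R = R - np.outer(R @ u, u)  # remove the chosen direction
    return anchors
```

Once the anchors are known, the remaining factor can be recovered by nonnegative least squares, which is what makes the separable case tractable in polynomial time.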

I will also briefly describe some of my other work on learning, including mixtures of Gaussians and robust linear regression, as well as promising directions for future work.

Ankur Moitra is an NSF CI Fellow at the Institute for Advanced Study, and also a senior postdoc in the computer science department at Princeton University. He completed his PhD and MS at MIT in 2011 and 2009 respectively, where he was advised by Tom Leighton. He received a George M. Sprowls Award (best thesis) and a William A. Martin Award (best thesis) for his doctoral and master's dissertations. He has worked in numerous areas of algorithms, including approximation algorithms, metric embeddings, combinatorics and smoothed analysis, but lately has been working at the intersection of algorithms and learning.

Human-Powered Data Management

Date and Time
Monday, March 11, 2013 - 4:30pm to 5:30pm
Location
Computer Science Small Auditorium (Room 105)
Type
CS Department Colloquium Series
Host
Kai Li
Fully automated algorithms are inadequate for a number of data analysis tasks, especially those involving images, video, or text. Thus, there is a need to combine “human computation” (or crowdsourcing), together with traditional computation, in order to improve the process of understanding and analyzing data. My thesis addresses several topics in the general area of human-powered data management. I design algorithms and systems for combining human and traditional computation for: (a) data processing, e.g., using humans to help sort, cluster, or clean data; (b) data extraction, e.g., having humans help create structured data from information in unstructured web pages; and (c) data gathering, i.e., asking humans to provide data that they know about or can locate, but that would be difficult to gather automatically. My focus in all of these areas is to find solutions that expend as few resources as possible (e.g., time waiting, human effort, or money spent), while still providing high quality results.

In this talk, I will first present a broad perspective of our research on human-powered data management, and I will describe some systems and applications that have motivated our research. I will then present details of one of the problems we have addressed: filtering large data sets with the aid of humans. Finally, I will argue that human-powered data management is an area in its infancy, by describing a number of open problems I intend to address in my future research program.
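A toy version of filtering with the aid of humans might ask workers for yes/no labels until one side leads by a margin (a hypothetical stopping rule for illustration only; the thesis work optimizes such cost/accuracy trade-offs far more carefully):

```python
def filter_item(get_vote, max_votes=5):
    """Decide whether an item passes a filter using crowd votes.
    Ask one worker at a time (get_vote() returns True/False) and stop
    early once one label leads by two votes, saving money on easy
    items; otherwise use all max_votes and take the majority."""
    yes = no = 0
    for _ in range(max_votes):
        if get_vote():
            yes += 1
        else:
            no += 1
        if abs(yes - no) >= 2:
            break
    return yes > no
```

Even this crude rule shows the core tension: each extra vote improves accuracy but costs money and latency, and the interesting algorithms choose when to stop adaptively.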

Aditya Parameswaran is a Ph.D. student in the InfoLab at Stanford University, advised by Prof. Hector Garcia-Molina. He is broadly interested in data management, with research results in human computation, information extraction, and recommendation systems. Aditya is a recipient of the Key Scientific Challenges Award from Yahoo! Research (2010), two best-of-conference citations (VLDB 2010 and KDD 2012), and the Terry Groswith graduate fellowship at Stanford University.

Bayesian Nonparametric Models and "Big Data"

Date and Time
Monday, February 25, 2013 - 4:30pm to 5:30pm
Location
Computer Science Small Auditorium (Room 105)
Type
CS Department Colloquium Series
Host
David Blei
Bayesian nonparametrics is an area of machine learning in which models grow in size and complexity as data accrue. As such, these models are particularly relevant to the world of "Big Data", where it may be difficult or even counterproductive to fix the number of parameters a priori. A stumbling block for Bayesian nonparametrics has been that its algorithms for posterior inference generally scale poorly. In this talk, we tackle this issue in the domain of large-scale text collections. Our model is a novel tree-structured model in which documents are represented by collections of paths in an infinite-dimensional tree. We develop a general and efficient variational inference strategy for learning such models based on stochastic optimization, and show that with this combination of model and inference strategy, we are able to learn high-quality models using millions of documents.
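The stochastic-optimization flavor of such inference can be sketched generically: each step computes a noisy estimate of the global variational parameter from one sampled document and blends it in with a decaying step size (a skeleton of stochastic variational inference; the function names and parameter values here are illustrative, not from the talk's system):

```python
def svi(init, noisy_estimate, num_steps, tau=1.0, kappa=0.6):
    """Skeleton of stochastic variational inference. noisy_estimate
    stands in for the intermediate global-parameter estimate computed
    from a single sampled document; the step sizes rho_t = (t + tau)^-kappa
    satisfy the Robbins-Monro conditions when kappa is in (0.5, 1],
    which is what guarantees convergence despite the noise."""
    lam = init
    for t in range(num_steps):
        rho = (t + tau) ** (-kappa)
        lam = (1.0 - rho) * lam + rho * noisy_estimate(lam, t)
    return lam
```

Because each update touches only one sampled document, the cost per step is independent of the corpus size, which is what makes training on millions of documents feasible.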

John Paisley received the B.S.E. (2004), M.S. (2007) and Ph.D. (2010) in Electrical & Computer Engineering from Duke University, where his advisor was Lawrence Carin. He was a postdoctoral researcher with David Blei in the Computer Science Department at Princeton University, and is currently a postdoctoral researcher with Michael Jordan in the Department of EECS at UC Berkeley. He works on developing Bayesian models for machine learning applications, particularly for dictionary learning and topic modeling.

CANCELED: Information Extraction from Music Audio

Date and Time
Friday, November 30, 2012 - 3:30pm to 4:30pm
Location
Computer Science Small Auditorium (Room 105)
Type
CS Department Colloquium Series
Host
Rebecca Fiebrink
This talk discusses a mix of concepts, problems and techniques at the crossroads of signal processing, machine learning and music. I will start by motivating the use of content-based methods for the analysis and retrieval of music. Then, I will introduce work in three projects being investigated in my lab: automatic chord recognition using hidden Markov models, music structure analysis using probabilistic latent component analysis, and feature learning using convolutional neural networks. In the process of doing so, I hope to illustrate some of the challenges and opportunities in the field of music informatics.
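In HMM-based chord recognition, the step that recovers the most likely chord sequence from per-frame audio features is typically Viterbi decoding. Below is a minimal, generic decoder (a sketch of the standard algorithm, where states would be chords and observations would be audio features; it is not the lab's actual system):

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely hidden-state path for an observation sequence.
    V[t][s] holds the probability of the best path ending in state s
    at time t; back[t][s] records the predecessor on that path."""
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][p] * trans_p[p][s] * emit_p[s][obs[t]], p)
                for p in states)
            V[t][s] = prob
            back[t][s] = prev
    # Trace back from the best final state.
    last = max(V[-1], key=V[-1].get)
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))
```

In practice the products of probabilities are replaced by sums of log-probabilities to avoid numerical underflow on long sequences.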

Juan Pablo Bello is an Associate Professor of Music Technology at New York University. In 1998 he received his BEng in Electronics from the Universidad Simón Bolívar in Caracas, Venezuela, and in 2003 he earned a doctorate in Electronic Engineering at Queen Mary, University of London. Juan teaches and conducts research on the computer-based analysis of musical signals and its applications to music information retrieval, digital audio effects and interactive music systems. He is the principal investigator of the music informatics group of the Music and Audio Research Lab (MARL) at NYU. For more information please visit:

https://files.nyu.edu/jb2843/public/
http://marl.smusic.nyu.edu/

What makes Big Visual Data hard?

Date and Time
Monday, December 10, 2012 - 4:30pm to 5:30pm
Location
Computer Science Small Auditorium (Room 105)
Type
CS Department Colloquium Series
Host
Thomas Funkhouser
There are an estimated 3.5 trillion photographs in the world, of which 10% have been taken in the past 12 months. Facebook alone reports 6 billion photo uploads per month. Every minute, 72 hours of video are uploaded to YouTube. Cisco estimates that in the next few years, visual data (photos and video) will account for over 85% of total internet traffic. Yet we currently lack effective computational methods for making sense of this mass of visual data. Unlike easily indexed content, such as text, visual content is not routinely searched or mined; it's not even hyperlinked. Visual data is the Internet's "digital dark matter" [Perona, 2010] -- it's just sitting there!

In this talk, I will first discuss some of the unique challenges that make Big Visual Data difficult compared to other types of content. In particular, I will argue that the central problem is the lack of a good measure of similarity for visual data. I will then present some of our recent work that aims to address this challenge in the context of visual matching, image retrieval and visual data mining. As an application of the latter, we used Google Street View data for an entire city in an attempt to answer that age-old question which has been vexing poets (and poets-turned-geeks): "What makes Paris look like Paris?"
