CS Department Colloquium Series | Page 28 | Computer Science Department at Princeton University

Unraveling the heterogeneity and dynamics of regulatory elements in the human genome

Date and Time

Tuesday, March 12, 2013 - 4:30pm to 5:30pm

Location

Computer Science Small Auditorium (Room 105)

Type

CS Department Colloquium Series

Speaker

Anshul Kundaje, from Massachusetts Institute of Technology

In 2003, the Human Genome Project marked a major scientific milestone by releasing the first consensus DNA sequence of the human genome. The ENCODE Project (Encyclopedia of DNA elements) was launched to pick up where the Human Genome Project left off, with the ambitious goal of systematically deciphering the potential function of every base (letter) in the genome. ENCODE has generated the largest collection of functional genomic data in humans to date, measuring the cellular activity of thousands of cellular moieties in a variety of normal and diseased cellular contexts. In this talk, I will describe novel computational and machine learning approaches that I developed for integrative analysis of massive compendia of diverse biological data such as ENCODE to unravel the functional heterogeneity and variation of regulatory elements in the human genome and their implications in human disease.

I will begin with a gentle introduction to the diversity and scale of ENCODE data and a brief overview of robust, statistical methods that we developed for automated detection of DNA binding sites of hundreds of regulatory proteins from noisy, experimental data. Regulatory proteins can perform multiple functions by interacting with and co-binding DNA with different combinations of other regulatory proteins. I developed a novel discriminative machine learning formulation based on regularized Rule-based ensembles that was able to sort through the combinatorial complexity of possible regulatory interactions and learn statistically significant item-sets of co-binding events at an unprecedented level of detail. I found extensive evidence that regulatory proteins could switch partners at different sets of genomic domains within a single cell-type and across different cell-types affecting structural and chemical properties of DNA and regulating different functional categories of target genes. Using regulatory elements discovered from ENCODE data, we were also able to provide putative functional interpretations for up to 81% of all publicly available sequence variants (mutations) identified in large-scale disease studies and generate new hypotheses by integrating multiple sources of data.

Finally, I will present a brief overview of my recent efforts on using multivariate Hidden Markov models to analyze the dynamics of various chemical modifications to DNA across three key axes of variation - across multiple species, across different cell-types in a single species (human), and across multiple human individuals for the same cell-type. Our results indicate a remarkable universality of chemical modifications defining hidden regulatory states across the animal kingdom with dramatic differences in the variation and functional impact of these regulatory elements between cell-types and individuals.

Together, these efforts take us one step closer to learning comprehensive models of gene regulation in humans in order to improve our system-level understanding of cellular processes and complex diseases.

A 2D + 3D Rich Data Approach to Scene Understanding

Date and Time

Wednesday, March 13, 2013 - 4:30pm to 5:30pm

Location

Computer Science Small Auditorium (Room 105)

Type

CS Department Colloquium Series

Speaker

Jianxiong Xiao, from Massachusetts Institute of Technology

Host

Szymon Rusinkiewicz

On your one-minute walk from the coffee machine to your desk each morning, you pass by dozens of scenes -- a kitchen, an elevator, your office -- and you effortlessly recognize them and perceive their 3D structure. But this one-minute scene-understanding problem has been an open challenge in computer vision for decades. Recently, researchers have come to realize that big data is critical for building scene-understanding systems that can recognize the semantics and reconstruct the 3D structure. In this talk, I will share my experience in leveraging big data for scene understanding, shifting the paradigm from 2D view-based categorization to 3D place-centric representations.

To push the traditional 2D representation to the limit, we built the Scene Understanding (SUN) Database, a large collection of images that exhaustively spans all scene categories. However, the lack of a "rich" representation still significantly limits the traditional recognition pipeline. While an image is a 2D array, the world is 3D and our eyes see it from a viewpoint, but this is not traditionally modeled. This paradigm shift toward rich representation also opens up new challenges that require a new kind of big data -- data with extra descriptions, namely rich data. Specifically, we focus on a highly valuable kind of rich data -- multiple viewpoints in 3D -- and we build the SUN3D database to obtain an integrated "place-centric" representation of scenes. This novel representation with rich data opens up exciting new opportunities for integrating scene recognition over space and for obtaining a scene-level reconstruction of large environments. It also has many applications such as organizing big visual data to provide photo-realistic indoor 3D maps. Finally, I will discuss some open challenges and my future plans for rich data and representation.

Jianxiong Xiao is a Ph.D. candidate in the Computer Science and Artificial Intelligence Laboratory (CSAIL) at Massachusetts Institute of Technology (MIT). Before that, he received a B.Eng. and a M.Phil. from the Hong Kong University of Science and Technology. His research interests are in computer vision, with a focus on scene understanding. His work has received the Best Student Paper Award at the European Conference on Computer Vision (ECCV) in 2012, and has appeared in popular press. Jianxiong was awarded the Google U.S./Canada Ph.D. Fellowship in Computer Vision in 2012 and MIT CSW Best Research Award in 2011. More information can be found on his website: http://mit.edu/jxiao.

Procedural Instructions for People and Machines

Date and Time

Monday, March 4, 2013 - 4:30pm to 5:30pm

Location

Computer Science Small Auditorium (Room 105)

Type

CS Department Colloquium Series

Speaker

Floraine Berthouzoz, from University of California, Berkeley

Host

Adam Finkelstein

Procedural tasks such as following a recipe or editing an image are very common. They require a person to execute a sequence of operations (e.g. chop the onions, or sharpen the image) and people commonly use step-by-step tutorials to learn these tasks. We focus on the domain of photo manipulations and have developed tools and techniques to help people learn, compare and automate photo manipulation procedures. We present a demonstration-based system for automatically generating succinct step-by-step visual tutorials of photo manipulations. Our tutorials are quick to generate. Moreover, users are faster and make fewer errors with our tutorials compared to book and video tutorials. We also demonstrate a new interface that allows learners to browse, analyze and compare large collections (i.e. thousands) of photo manipulation tutorials based on their command-level structure. Finally, we present a framework for generating content-adaptive macros (programs) that can transfer complex photo manipulation procedures to new target images. Together these tools allow people to learn, compare and automate procedural knowledge.

Floraine Berthouzoz is a Ph.D. candidate in the Computer Science Department at the University of California, Berkeley. Her research lies at the intersection of computer graphics and human-computer interactions, and aims at building tools that ultimately make it faster and easier for people to create high quality visual content. She has also developed a number of techniques that improve the way people learn procedural instructions. She is the recipient of a 2009 ETH Medal of Honor and a 2010 Google Anita Borg Memorial Scholarship. During her graduate studies, Floraine worked as a visiting researcher at the University of Tokyo, the Hebrew University, Columbia University and Adobe Systems. Floraine also co-founded CS KickStart, an outreach program for incoming undergraduate women. Since 2011, this program has significantly increased the number of undergraduate women pursuing computer science degrees at Berkeley.

New Algorithms for Nonnegative Matrix Factorization and Beyond

Date and Time

Wednesday, February 27, 2013 - 4:30pm to 5:30pm

Location

Computer Science Small Auditorium (Room 105)

Type

CS Department Colloquium Series

Speaker

Ankur Moitra, from Institute for Advanced Study

Host

Moses Charikar

Machine learning is a vibrant field with many rich techniques. However, these approaches are often heuristic: we cannot prove good bounds on either their performance or their running time except in limited settings. This talk will focus on designing algorithms whose performance can be analyzed rigorously. As an example of this project, I will describe my work on the nonnegative matrix factorization problem, which has important applications throughout machine learning (and theory). As is often the case, this problem is NP-hard when considered in full generality. However, we introduce a sub-case called separable nonnegative matrix factorization that we believe is the right notion in various contexts. We give a polynomial time algorithm for this problem, and leverage this algorithm to efficiently learn the topics in a Latent Dirichlet Allocation model (and beyond). This is an auspicious example where theory can lead to inherently new algorithms that have highly-practical performance on real data sets.

I will also briefly describe some of my other work on learning, including mixtures of Gaussians and robust linear regression, as well as promising directions for future work.

Ankur Moitra is an NSF CI Fellow at the Institute for Advanced Study, and also a senior postdoc in the computer science department at Princeton University. He completed his PhD and MS at MIT in 2011 and 2009 respectively, where he was advised by Tom Leighton. He received a George M. Sprowls Award (best thesis) and a William A. Martin Award (best thesis) for his doctoral and master's dissertations. He has worked in numerous areas of algorithms, including approximation algorithms, metric embeddings, combinatorics and smoothed analysis, but lately has been working at the intersection of algorithms and learning.

Human-Powered Data Management

Date and Time

Monday, March 11, 2013 - 4:30pm to 5:30pm

Location

Computer Science Small Auditorium (Room 105)

Type

CS Department Colloquium Series

Speaker

Aditya Parameswaran, from Stanford University

Host

Kai Li

Fully automated algorithms are inadequate for a number of data analysis tasks, especially those involving images, video, or text. Thus, there is a need to combine “human computation” (or crowdsourcing), together with traditional computation, in order to improve the process of understanding and analyzing data. My thesis addresses several topics in the general area of human-powered data management. I design algorithms and systems for combining human and traditional computation for: (a) data processing, e.g., using humans to help sort, cluster, or clean data; (b) data extraction, e.g., having humans help create structured data from information in unstructured web pages; and (c) data gathering, i.e., asking humans to provide data that they know about or can locate, but that would be difficult to gather automatically. My focus in all of these areas is to find solutions that expend as few resources as possible (e.g., time waiting, human effort, or money spent), while still providing high quality results.

In this talk, I will first present a broad perspective of our research on human-powered data management, and I will describe some systems and applications that have motivated our research. I will then present details of one of the problems we have addressed: filtering large data sets with the aid of humans. Finally I will argue that human-powered data management is an area in its infancy, by describing a number of open problems I intend to address in my future research program.

Aditya Parameswaran is a Ph.D. student in the InfoLab at Stanford University, advised by Prof. Hector Garcia-Molina. He is broadly interested in data management, with research results in human computation, information extraction, and recommendation systems. Aditya is a recipient of the Key Scientific Challenges Award from Yahoo! Research (2010), two best-of-conference citations (VLDB 2010 and KDD 2012), and the Terry Groswith graduate fellowship at Stanford University.

Bayesian Nonparametric Models and "Big Data"

Date and Time

Monday, February 25, 2013 - 4:30pm to 5:30pm

Location

Computer Science Small Auditorium (Room 105)

Type

CS Department Colloquium Series

Speaker

John Paisley, from University of California, Berkeley

Host

David Blei

Bayesian nonparametrics is an area in machine learning in which models grow in size and complexity as data accrue. As such, they they are particularly relevant to the world of "Big Data", where it may be difficult or even counterproductive to fix the number of parameters a priori. A stumbling block for Bayesian nonparametrics has been that their algorithms for posterior inference generally show poor scalability. In this talk, we tackle this issue in the domain of large-scale text collections. Our model is a novel tree-structured model in which documents are represented by collections of paths in an infinite-dimensional tree. We develop a general and efficient variational inference strategy for learning such models based on stochastic optimization, and show that with this combination of modeling and inference approach, we are able to learn high-quality models using millions of documents.

John Paisley received the B.S.E. (2004), M.S. (2007) and Ph.D. (2010) in Electrical & Computer Engineering from Duke University, where his advisor was Lawrence Carin. He was a postdoctoral researcher with David Blei in the Computer Science Department at Princeton University, and currently with Michael Jordan in the Department of EECS at UC Berkeley. He works on developing Bayesian models for machine learning applications, particularly for dictionary learning and topic modeling.

CANCLED: Information Extraction from Music Audio

Date and Time

Friday, November 30, 2012 - 3:30pm to 4:30pm

Location

Computer Science Small Auditorium (Room 105)

Type

CS Department Colloquium Series

Speaker

Juan Bello, from New York University

Host

Rebecca Fiebrink

This talk discusses a mix of concepts, problems and techniques at the crossroads of signal processing, machine learning and music. I will start by motivating the use of content-based methods for the analysis and retrieval of music. Then, I will introduce work in three projects being investigated in my lab: automatic chord recognition using hidden Markov models, music structure analysis using probabilistic latent component analysis, and feature learning using convolutional neural networks. In the process of doing so, I hope to illustrate some of the challenges and opportunities in the field of music informatics.

Juan Pablo Bello is an Associate Professor of Music Technology at New York University. In 1998 he received his BEng in Electronics from the Universidad Simón Bolívar in Caracas, Venezuela, and in 2003 he earned a doctorate in Electronic Engineering at Queen Mary, University of London. Juan teaches and researches on the computer-based analysis of musical signals and its applications to music information retrieval, digital audio effects and interactive music systems. He is the principal investigator of the music informatics group of the Music and Audio Research Lab (MARL) at NYU. For more information please visit:

https://files.nyu.edu/jb2843/public/
http://marl.smusic.nyu.edu/

What makes Big Visual Data hard?

Date and Time

Monday, December 10, 2012 - 4:30pm to 5:30pm

Location

Computer Science Small Auditorium (Room 105)

Type

CS Department Colloquium Series

Speaker

Alexei (Alyosha) Efros, from Carnegie Mellon University

Host

Thomas Funkhouser

There are an estimated 3.5 trillion photographs in the world, of which 10% have been taken in the past 12 months. Facebook alone reports 6 billion photo uploads per month. Every minute, 72 hours of video are uploaded to YouTube. Cisco estimates that in the next few years, visual data (photos and video) will account for over 85% of total internet traffic. Yet, we currently lack effective computational methods for making sense of all this mass of visual data. Unlike easily indexed content, such as text, visual content is not routinely searched or mined; it's not even hyperlinked. Visual data is Internet's "digital dark matter" [Perona,2010] -- it's just sitting there!

In this talk, I will first discuss some of the unique challenges that make Big Visual Data difficult compared to other types of content. In particular, I will argue that the central problem is the lack a good measure of similarity for visual data. I will then present some of our recent work that aims to address this challenge in the context of visual matching, image retrieval and visual data mining. As an application of the latter, we used Google Street View data for an entire city in an attempt to answer that age-old question which has been vexing poets (and poets-turned-geeks): "What makes Paris look like Paris?"

"To the first machine that can appreciate the gesture:" Nicholas Negroponte and the MIT Architecture Machine Group

Date and Time

Wednesday, November 7, 2012 - 4:30pm to 5:30pm

Location

Computer Science Small Auditorium (Room 105)

Type

CS Department Colloquium Series

Speaker

Molly Steenson, from Assistant Professor, School of Journalism & Mass Communication, University of Wisconsin-Madison PhD

Host

Rebecca Fiebrink

Nicholas Negroponte today is best known for founding the MIT Media Lab in 1985. But before the Media Lab, he founded the Architecture Machine Group at MIT in 1968 (Arch Mach for short). As an architect captivated by computers, Negroponte created the lab as a test bed for interactive computers, sensors and programs that sought to change the manner in which computers and humans interacted with each other. He envisioned a world in which everybody would live in immersive, intelligent environments, even dedicating his first book The Architecture Machine “To the first machine that can appreciate the gesture." But what is the nature of that gesture? Arch Mach's structure and funding benefited from the same relationships as MIT’s major engineering and computing labs, collaborating especially closely with the Artificial Intelligence Lab, and its projects echoed the dynamics of Department of Defense funding that occurred in AI. In this talk, I will introduce Arch Mach's projects, projections—and gestures.

Molly Wright Steenson is a Ph.D. candidate in Princeton University’s School of Architecture where she researches the history of architecture and information. In January 2013, she will join the faculty of the University of Wisconsin-Madison in the School of Journalism & Mass Communication, where she will focus on digital media studies. Steenson was also a professor at the Interaction Design Institute Ivrea in Italy. As a design researcher, she examines the effect of personal technology on its users, with recent projects in India and China for Microsoft Research and Intel Research and ReD Asscoiates. Prior to her scholarly career, Molly started working with the Web in 1994, producing sites and leading design strategy for Reuters, Netscape, and Razorfish. She holds a Master’s in Environmental Design from the Yale School of Architecture and a B.A. in German from the University of Wisconsin-Madison.

Making proof-based verified computation almost practical

Date and Time

Friday, October 26, 2012 - 3:30pm to 4:30pm

Location

Computer Science Small Auditorium (Room 105)

Type

CS Department Colloquium Series

Speaker

Michael Walfish, from The University of Texas at Austin

Host

Michael Freedman

How can a machine specify a computation to another one and then, without executing the computation, check that the other machine carried it out correctly? And how can this be done without assumptions about the performer (replication, trusted hardware, etc.) or restrictions on the computation? This is the problem of _verified computation_, and it is motivated by the cloud and other third-party computing models. It has long been known that (1) this problem can be solved in theory using probabilistically checkable proofs (PCPs) coupled with cryptographic tools, and (2) the performance of these solutions is wildly impractical (trillions of CPU-years or more to verify simple computations).

I will describe a project at UT Austin that challenges the second piece of this folklore. We have developed an interactive protocol that begins with an efficient argument [IKO CCC '07] and incorporates new theoretical work to improve performance by 20+ orders of magnitude. In addition, we have broadened the computational model from Boolean circuits to a general-purpose model. We have fully implemented the system, accelerated it with GPUs, and developed a compiler that transforms computations expressed in a high-level language to executables that implement the protocol entities.

The resulting system, while not quite ready for the big leagues, is close enough to practicality to suggest that in the near future, PCPs could be a real tool for building actual systems.

Talk will be based on these papers and ongoing work:

Michael Walfish is an assistant professor in the Computer Science Department at The University of Texas at Austin. His research interests are in systems, security, and networking. His honors include an Air Force Young Investigator Award, an NSF CAREER Award, a Sloan Research Fellowship, a Teaching Excellence Award from the UT College of Natural Sciences, the Intel Early Career Faculty Honor Program, and the UT Society for Teaching Excellence. He received his B.A. from Harvard and his Ph.D. from MIT, both in Computer Science.