Quick links

CS Department Colloquium Series

Statistical models to study how a genetic variant impacts an organism

Date and Time
Monday, March 3, 2014 - 4:30pm
Location
Computer Science Small Auditorium (Room 105)
Type
CS Department Colloquium Series
Host
Olga Troyanskaya

Barbara Engelhardt

Barbara Engelhardt

Consider sequencing the genome of a newborn, and selecting targeted therapeutics early in life to reduce her lifetime risk of addiction, obesity, type II diabetes, or pancreatic cancer. While genome-wide association studies (GWAS) have unquestionably been successful in identifying reproducible genomic risk factors for complex human diseases, the promise of developing therapeutics to reduce the heritable portion of disease risk is far from fulfillment. The essential technological developments to fulfill this promise, however, are mainly in statistics and computation rather than in genomic experimental methods. I describe three genomic studies from my recent work. First, I identified a genetic variant that behaves differently depending on whether or not an individual takes cholesterol-reducing drugs. Second, I found that genetic variants that are associated with different cell traits are co-localized with a large variety of different regulatory mechanisms more often than expected. Third, I developed a model to uncover genetic variants that affect many traits simultaneously, where the trait measurements have substantial technical noise. Throughout, I emphasize statistical and computational challenges, and innovations necessary to fulfill this promise of genomic studies.

 

Transport Architectures for an Evolving Internet

Date and Time
Thursday, March 27, 2014 - 4:30pm to 5:30pm
Location
Computer Science Small Auditorium (Room 105)
Type
CS Department Colloquium Series
Host
Jennifer Rexford

Keith Winstein

Keith Winstein

The technologies that make up the Internet are changing every year, but some transport protocols continue to act as though the Internet behaved as it did 20 years ago. This can cause poor performance on newer networks -- cellular networks, datacenters -- and makes it more challenging to roll out networking technologies that break markedly with the past. How do we make applications and protocols keep up with an evolving network? I will describe the Sprout algorithm, a transport protocol designed for videoconferencing over cellular networks, that uses probabilistic inference to forecast network congestion in advance. On  commercial cellular networks, Sprout gives 2-to-4 times the throughput and 7-to-9 times less delay than Skype, Apple Facetime, and Google+ Hangouts.

This work led to Remy, a computer program that generates transport protocols automatically, as a function of a protocol designer's assumptions about the network and statement of an objective function. Remy's computer-generated algorithms can achieve higher performance and greater fairness than some sophisticated human-designed schemes. I will discuss our work on using Remy to probe open questions of Internet congestion control -- what's the cost of maintaining backwards compatibility with existing algorithms, including the Transmission Control Protocol as it exists today? Is there a tradeoff between a  protocol's performance today and its ability to adapt to networks of the future?

This talk includes joint work with Anirudh Sivaraman, Pratiksha Thaker, and Hari Balakrishnan.

URL: http://mit.edu/keithw

Keith Winstein is a doctoral candidate at MIT's Computer Science and Artificial Intelligence Laboratory. His work applies statistical and predictive approaches to teach computers to design better network protocols and applications. He created the Mosh (mobile shell) tool for remote access to Unix-like systems and the Sprout algorithm for cellular networks, which was awarded a 2014 Applied Networking Research Prize. From 2007 to 2010, Keith worked as a staff reporter at The Wall Street Journal, covering science and medicine.
 

Verified Approximate Computing

Date and Time
Monday, March 24, 2014 - 4:30pm to 5:30pm
Location
Computer Science Small Auditorium (Room 105)
Type
CS Department Colloquium Series
Host
Brian Kernighan

Michael Carbin

Michael Carbin

Many modern applications implement large-scale computations (e.g., machine learning, big data analytics, and financial analysis) in which there is a natural trade-off between the quality of the results that the computation produces and the performance and cost of executing the computation.  Exploiting this fact, researchers have recently developed a variety of new mechanisms that automatically manipulate an application to enable it to execute at a variety of points in its underlying trade-off space. The resulting approximate application can navigate this trade-off space to meet its performance and cost requirements.

I present a program verification and analysis system, Rely, for answering fundamental questions that arise when manipulating an application that implements an approximate computation. Examples of the questions that Rely is designed to answer include: What is the probability that the approximate application produces the same result as the original application? How much do the approximate application’s results differ from those of the original application? And is the approximate application safe and secure?

Rely answers these questions with a novel analysis and verification method for reasoning about the safety and accuracy of approximate applications. Rely also provides a novel language and program analysis for verifying quantitative reliability: the probability that the new approximate application produces the same result as the original computation.

Michael Carbin is a Ph.D. Candidate in Electrical Engineering and Computer Science at MIT. Michael started his research career as undergraduate student, working on BDD-based program analysis at Stanford University and on type-safe compile-time metaprogramming at Microsoft Research.  His work at Stanford received an award for Best Computer Science Undergraduate Honors Thesis.  As a graduate student, Michael has worked as a MIT Lemelson Presidential Fellow and a Microsoft Research Graduate Fellow on both the theory and practice of verified approximate computing and software self-healing. His recent research on verifying the reliability of programs that execute on unreliable hardware won a best paper award at OOPSLA 2013: Object-Oriented Programming, Systems, Languages & Analysis. 

Feature allocations, paintboxes, and probability functions

Date and Time
Wednesday, February 12, 2014 - 4:30pm to 5:30pm
Location
Computer Science Small Auditorium (Room 105)
Type
CS Department Colloquium Series
Host
Robert Schapire

Tamara Broderick

Tamara Broderick

Clustering involves placing entities into mutually exclusive categories. We wish to relax the requirement of mutual exclusivity, allowing objects to belong simultaneously to multiple classes, a formulation that we refer to as "feature allocation." The first step is a theoretical one. In the case of clustering the class of probability distributions over exchangeable partitions of a dataset has been characterized (via exchangeable partition probability functions and the Kingman paintbox). These characterizations support an elegant nonparametric Bayesian framework for clustering in which the number of clusters is not assumed to be known a priori. We establish an analogous characterization for feature allocation; we define notions of "exchangeable feature probability functions" and "feature paintboxes" that lead to a Bayesian framework that does not require the number of features to be fixed a priori. The second step is a computational one. Rather than appealing to Markov chain Monte Carlo for Bayesian inference, we develop a method to transform Bayesian methods for feature allocation (and other latent structure problems) into optimization problems with objective functions analogous to K-means in the clustering setting. These yield approximations to Bayesian inference that are scalable to large inference problems.

Tamara Broderick is a PhD candidate in the Department of Statistics at the University of California, Berkeley. Her research in machine
learning focuses on the design and study of Bayesian nonparametric models, with particular emphasis on feature allocation as a
generalization of clustering that relaxes the mutual exclusivity and exhaustivity assumptions of clustering. While at Berkeley, she has
been a National Science Foundation Graduate Student Fellow and a Berkeley Fellowship recipient. She graduated with an AB in Mathematics from Princeton University in 2007---with the Phi Beta Kappa Prize for highest average GPA in her graduating class and with Highest Honors in Mathematics. She spent the next two years on a Marshall Scholarship at the University of Cambridge, where she received a Masters of Advanced Study in Mathematics for completion of Part III of the Mathematical Tripos (with Distinction) in 2008 and an MPhil by Research in Physics in 2009. She received a Masters in Computer Science from UC Berkeley in 2013.

 

 

Computation, Communication, and Privacy Constraints on Learning

Date and Time
Wednesday, February 19, 2014 - 4:30pm to 5:30pm
Location
Computer Science Small Auditorium (Room 105)
Type
CS Department Colloquium Series
Host
David Blei

John Duchi

John Duchi

How can we maximally leverage available resources--such as computation, communication, multi-processors, or even privacy--when performing machine learning? In this talk, I will suggest statistical risk (a rigorous notion of the accuracy of learning procedures) as a way to incorporate such criteria in a framework for development of algorithms. In particular, we follow a two-pronged approach, where we (1) study the fundamental difficulties of problems, bringing in tools from optimization, information theory, and statistical minimax theory, and (2) develop algorithms that optimally trade among multiple criteria for improved performance. The resulting algorithms are widely applied in industrial and academic settings, giving up to order of magnitude improvements in speed and accuracy for several problems. To illustrate the practical benefits that a focus on the tradeoffs of statistical learning procedures brings, we explore examples from computer vision, speech recognition, document classification, and web search.

John is currently a PhD candidate in computer science at Berkeley, where he started in the fall of 2008. His research interests include optimization, statistics, machine learning, and computation. He works in the Statistical Artificial Intelligence Lab (SAIL) under the joint supervision of Michael Jordan and Martin Wainwright. He obtained his MA in statistics in Fall 2012, and received a BS and MS from Stanford University in computer science under the supervision of Daphne Koller. John has won several awards and fellowships, including a best student paper award at the International Conference on Machine Learning (ICML) and the NDSEG and Facebook graduate fellowships. John has also worked as a software engineer and researcher at Google Research under the supervision of Yoram Singer and Fernando Pereira.

 

CANCELED- Practical Foundations for Scalable Concurrency

Date and Time
Thursday, February 20, 2014 - 4:30pm to 5:30pm
Location
Computer Science Small Auditorium (Room 105)
Type
CS Department Colloquium Series
Host
David Walker

Aaron Turon

Aaron Turon

The traditional way to tame concurrency is to disallow it: "critical sections" of code can be protected by locks, preventing interference from concurrent threads, but also cutting down opportunities for parallelism.  The past two decades have produced a wealth of data structures that instead tolerate (or even embrace) concurrency for the sake of scaling with available parallelism.  But despite the importance of these "scalable" (e.g., lock-free) data structures, many basic questions about them have remained unanswered.  My work has focused on several such questions; this talk will cover two of them.

First, what are the fundamental abstractions for scalable concurrency?  I will present Reagents, the first high-level programming model for building lock-free data structures. Reagents solve a key problem for concurrency libraries: they allow library clients to combine data structure operations while retaining strong scalability and atomicity properties, and without requiring the clients to understand the underlying algorithms.

Second, how can we reason about scalable concurrency under realistic assumptions about shared memory?  The verification literature almost universally assumes that threads interact through a "strong" (sequentially consistent) shared memory.  But modern compilers and CPUs provide much weaker guarantees that cause many previously verified algorithms to fail.  I will present GPS, the first program logic to provide a full suite of modular verification techniques under a weak memory model.

Aaron Turon is a postdoctoral researcher in the Foundations of Programming group at the Max Planck Institute for Software Systems (MPI-SWS), supervised by Derek Dreyer.  He received his Ph.D. from Northeastern University in May 2013 under the supervision of Mitchell Wand, and his BA with Honors at the University of Chicago in 2007. Aaron's interests span programming languages and software verification, but his primary focus for the last several years has been the design and verification of high-performance concurrency libraries.
 

Protein phenotypes from genomic information: 3D, mutation, and design

Date and Time
Thursday, February 6, 2014 - 4:30pm to 5:30pm
Location
Computer Science Small Auditorium (Room 105)
Type
CS Department Colloquium Series
Host
Olga Troyanskaya

Abstract: Attributes of living systems are constrained in evolution. An alternative to the analysis of conserved attributes ('characters') is analysis of functional interactions ('couplings') that cause conservation. For proteins, the evolutionary sequence record can be exploited to provide exquisitely accurate information about 3D structures and functional sites. Recent progress is based on high throughput sequencing as an experimental technology and global probability models under the maximum entropy principle as a key theoretical tool. I will describe how these advances are used in accurate prediction of 3D interactions, complexes, protein plasticity, designing proteins for synthetic biology and therapeutics - and extrapolate to the study of the effects of human genetic variation.

There is a now major opportunity to link genomic information to phenotype and apply this to concrete engineering and health problems, such as disease likelihood, the emergence of drug resistance. My lab will concentrate on four interrelated areas at different scales of biology that address the challenge to infer causality in biological information.

Debora has a PhD in computational biology and a track record of developing novel algorithms and statistics to successfully address unsolved biological problems. She has a passion for statistics and is driven by a desire to understand and interpret human genetic variation.

Do not blame users for misconfigurations

Date and Time
Monday, December 9, 2013 - 4:30pm to 5:30pm
Location
Computer Science Small Auditorium (Room 105)
Type
CS Department Colloquium Series
Host
Kai Li
Today's data centers usually employ high redundancy to tolerate hardware and software errors. But unfortunately another type of errors, namely configuration errors (i.e., misconfigurations) can still take down the whole data centers. It has contributed to more than 30% high severity issues and caused some of the recent major downtime in cloud service companies including Amazon, Microsoft Azure, Google, Facebook, etc. Unfortunately, many software developers and system designers put most of blames on users for configuration errors, and do not pay enough attention to handling misconfiguration in a more active way. Comparing to software bugs, configuration issues have much less tooling support for error detections, issue tracking, tolerance testing, as well as design reviews.

In this talk, I will present our recent work on characterizing misconfigurations in commercial and open source systems (as well as a major commercial cloud service provider), and also some of our solutions in addressing this configuration problem. We approach it from the perspective that configurations are a part of user interface, and thereby need to consider from user perspectives. More specifically, as a practical first step, we need to avoid error-prone requirements and also react gracefully to user mistakes. Our solutions have been used two commercial companies and have influenced some popular open source systems such as "Squid" redesigning their configuration. In this talk, I will also some of the negative interactions (and "challenges") with some open source developers.

Yuanyuan Zhou is currently a Qualcomm Chair Professor at UC-San Diego. Before UCSD, she was a tenured associate professor at University of Illinois at Urbana Champaign. Her research interests span the areas of operating systems, software engineering, system reliability and maintainability. She has 3 papers selected into the IEEE Micro Special Issue on Top Picks from Architecture Conferences, best paper at SOSP'05, and 2011 ACM SIGSOFT FSE (Foundation of Software Engineering) Distinguished Paper. She has co-founded two startups. Her recent startup, PatternInsight, has successfully deployed software quality assurance tools in many companies including Cisco, Juniper, Qualcomm, Motorola, Intel, EMC, Lucent, Tellabs, etc. Recently Pattern Insight has sold its Log Insight business to VmWare. In addition, Intel has licensed some of her solutions on detecting concurrency bugs. She has graduated 14 Ph.D. students so far, many of whom have joined top industrial companies and academia such as University of Wisconsin at Madison, University of Toronto, University of Waterloo, etc as tenure-track faculty.

Pedagogy behind CS50

Date and Time
Tuesday, December 3, 2013 - 4:30pm to 5:30pm
Location
Computer Science Small Auditorium (Room 105)
Type
CS Department Colloquium Series
Host
Brian Kernighan
CS50 is Harvard University's introductory course for majors and non-majors alike, a one-semester amalgam of courses generally known as CS1 and CS2. In 2007, we set out to alter the course's style and tone to resonate with those "less comfortable" and "more comfortable" alike, albeit without sacrificing the course's historical rigor. We maintained the course's underlying syllabus but revamped every problem set, providing students not only with more direction but context as well. And we augmented the course's support structure.

As of 2013, CS50 is Harvard's second-largest course with nearly 700 students, up from 132 in 2006, and those "less comfortable" now compose the course's largest demographic. We present in this talk what we have done and why we have done it. And we offer a glimpse of CS50x, Harvard College's first course to be offered on an even larger scale via edX.

BIO David J. Malan is a Senior Lecturer on Computer Science at Harvard University for the School of Engineering and Applied Sciences. He received his A.B., S.M., and Ph.D. in Computer Science from the same in 1999, 2004, and 2007, respectively. His research in graduate school focused primarily on cybersecurity and computer forensics. His more recent publications focus on instructional technologies and pedagogy. He now teaches Harvard College's introductory computer science course, Computer Science 50, otherwise known as CS50, which is available as OpenCourseWare at cs50.tv as well as via edx.org.

Reclaiming Security for Web Programmers

Date and Time
Thursday, November 21, 2013 - 4:30pm to 5:30pm
Location
Computer Science Small Auditorium (Room 105)
Type
CS Department Colloquium Series
Host
David Walker
The Web enables new classes of programs that pose new security risks. For example, because Web programs freely mix data and code from untrusted sources, major websites have been compromised by third-party components, such as malicious ads. In addition, users cannot fully control which programs run; Web programs are visited, not installed. Therefore, Web security is entirely in the hands of programmers, not users.

I address the problem of securing existing Web programs, which are universally written in JavaScript. Unfortunately, JavaScript has several warts that make it difficult to reason about even simple snippets of code. Several companies have developed "Web sandboxes" to make JavaScript programming safe. However, these Web sandboxes have no security guarantees.

In this talk, I show how programing languages research can be used to verify properties of Web applications. I present a type-based verification method for JavaScript that we use to find bugs in and produce the first verification of an existing, third-party Web sandbox.

Programming language techniques can give us mathematical proofs of security, but attackers attack implementations, not theorems. I discuss our approach to doing principled, real-world Web security research, which combines semantics with systems. I also review additional projects that use our tools and techniques.

Arjun Guha is an assistant professor of Computer Science at UMass Amherst. He enjoys tackling practical problems, while adhering to the mathematical foundations of programming languages. For example, his dissertation on JavaScript semantics and type-checking, from Brown University, is used by several other researchers as a foundation for their own work. As a postdoc at Cornell University, he developed a model of software-defined networking (SDN) in the Coq Proof Assistant, which is the foundation for a verified runtime and other SDN tools. For his research, he has developed and contributed to several software systems, such as LambdaJS, Frenetic, and Flapjax that are in active use.

Follow us: Facebook Twitter Linkedin