Small Summaries for Big Data

Date and Time
Wednesday, January 30, 2013 - 12:00pm to 1:30pm
Location
Computer Science 302
Type
Talk
In dealing with big data, we often need to look at a small summary to get the big picture. Over recent years, many new techniques have been developed which allow important properties of large distributions to be extracted from compact and easy-to-build summaries. In this talk, I'll highlight some examples of different types of summarization: sampling, sketching, and special-purpose. Then I'll outline the road ahead for further development of such summaries.
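As a concrete illustration of the sketching idea (my own minimal sketch, not code from the talk), a Count-Min sketch summarizes item frequencies in a stream using space far smaller than the stream itself, at the cost of possible overestimation:

```python
import hashlib

class CountMinSketch:
    """Approximate frequency counts in O(width * depth) space."""
    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        # One hashed bucket per row; seeding with the row index
        # simulates independent hash functions.
        for row in range(self.depth):
            h = hashlib.sha256(f"{row}:{item}".encode()).digest()
            yield row, int.from_bytes(h[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._buckets(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Never undercounts; hash collisions can only inflate the estimate.
        return min(self.table[row][col] for row, col in self._buckets(item))

cms = CountMinSketch()
for word in ["big", "data", "big", "summary", "big"]:
    cms.add(word)
print(cms.estimate("big"))  # >= 3; equal to 3 unless collisions occur
```

The summary is also mergeable: two sketches built over different parts of a stream can be combined by adding their tables cell-wise, which is what makes this family of summaries attractive for distributed data.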

Graham Cormode is a Principal Member of Technical Staff in the Database Management Group at AT&T Shannon Laboratories in New Jersey. Previously, he was a researcher at Bell Labs, after postdoctoral study at the DIMACS center in Rutgers University from 2002-2004. His PhD was granted by the University of Warwick in 2002. His work considers aspects of managing and working with large amounts of data, with particular emphasis on privacy and anonymization, and large scale analytics.

The Why and How of an All-Flash Enterprise Storage Array

Date and Time
Wednesday, January 9, 2013 - 4:30pm to 5:30pm
Location
Computer Science 402
Type
Talk
Speaker
Neil Vachharajani, from Pure Storage
Host
Vivek Pai
Enterprise storage is a $30-billion-a-year industry dominated by spinning disks. Flash storage, having grown significantly in capacity and production volume driven by consumer electronics, is poised to take a large chunk of that market. Flash's technical advantages over disk promise storage arrays that are faster and easier to use while consuming less power and costing less.

The downsides of flash (including large erase blocks, limited overwrite cycles, and a higher price per gigabyte) mean that using it as a drop-in replacement for disk leads to increased price, volatile performance, and decreased reliability.

In this talk, we describe the design of the Pure FlashArray, an enterprise storage array built around consumer flash storage. The array and its software, Purity, play to the advantages of flash while minimizing its downsides. Purity writes to flash in multiples of the erase block size and stores its metadata in a key-value store that minimizes overwrites and stores approximate answers, trading extra reads for fewer writes. Finally, Purity reduces the data stored on flash through a range of techniques, including compression, deduplication, and thin provisioning.
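The deduplication idea can be illustrated with a toy content-addressed block store (a hypothetical sketch; Purity's actual implementation is not described in this abstract): identical blocks are stored once and referenced by their hash, so repeated data costs almost nothing.

```python
import hashlib

class DedupStore:
    """Toy content-addressed block store: duplicate blocks are stored once."""
    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.blocks = {}        # sha256 digest -> block payload
        self.volumes = {}       # volume name -> ordered list of digests

    def write(self, volume, data):
        refs = []
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)  # payload stored only once
            refs.append(digest)
        self.volumes[volume] = refs

    def read(self, volume):
        return b"".join(self.blocks[d] for d in self.volumes[volume])

store = DedupStore()
payload = b"A" * 8192          # two identical 4 KB blocks
store.write("vol1", payload)
store.write("vol2", payload)   # second volume adds no new blocks
print(len(store.blocks))       # 1 unique block backs 16 KB of logical data
```

On flash, this trade is especially attractive: each deduplicated write avoided is an erase/program cycle saved, which helps both endurance and effective cost per gigabyte.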

The net result is a flash array that delivers a sustained read-write workload of over 100,000 4 KB I/O requests per second while maintaining sub-millisecond latency. With many customers seeing 4x or greater data reduction, the Pure FlashArray ends up being cheaper than disk, too.

Making the Mobile Web Fast

Date and Time
Friday, December 7, 2012 - 12:00pm to 1:30pm
Location
Computer Science 302
Type
Talk
Speaker
Matt Welsh, from Google
Host
Michael Freedman
Web usage on mobile devices is exploding, and in many countries, mobile devices are the predominant means of accessing the Internet. But the mobile web is still incredibly slow; clearly the web was not designed for mobile networks. Our team at Google is working to solve this problem by optimizing all layers of the stack: device operating systems, browsers, network protocols, servers, and web content. In this talk, I'll discuss why the mobile web is slow, provide some insight into the inner workings of 3G and 4G networks, and give an overview of some of our efforts to make the mobile web fast. I'll also talk a bit about what it was like to make the leap from academia to industry.

Matt Welsh is the head of mobile web performance at Google. Prior to joining Google, he was a professor of Computer Science at Harvard, where he worked on projects ranging from deploying wireless sensor networks on volcanoes in Ecuador to building a swarm of robotic bees. He did his PhD at UC Berkeley.

ASPEN: Seamless Declarative Programming across Sensors, Streams, and the Cloud

Date and Time
Wednesday, December 5, 2012 - 12:30pm to 1:30pm
Location
Computer Science 302
Type
Talk
Due to advances in semiconductor and communications technology, we are steadily moving towards a world filled with a plethora of embedded sensing devices as well as "virtual" sensors within devices -- with the potential to monitor the environment and to react to or even anticipate changes. The environment being monitored might be a smart hospital, an entire campus, or even an office or data center. Often the behaviors we wish to produce involve combining constantly updating (streaming) data from multiple heterogeneous sensor and Internet sources. A major challenge is how to manage the resulting complexity, and to build large-scale, rich monitoring, learning, and control applications.

In the ASPEN project we have been exploring an approach that seeks to separate logical dataflow from most of the algorithmic logic -- using a declarative, SQL-like (but iterative and incremental) programming model to capture the dataflow, data transformation, and state management needed by an application, combined with small bits of procedural code to handle complex logic. Our platform provides distributed query optimization that takes runtime conditions into account, while also supporting a range of learning, prediction, and connection-finding algorithms. In this talk I will describe our basic ASPEN prototype including cluster and sensor subsystems, and provide an overview of how we address issues of query optimization, distributed query execution, and incremental recomputation.
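ASPEN's programming model is SQL-like, but the incremental-computation idea it relies on can be sketched in a few lines (an illustrative example, not ASPEN code): instead of rescanning a window of sensor readings on every update, the system maintains the aggregate by applying the new reading and retracting the expired one.

```python
from collections import deque

class SlidingAverage:
    """Incrementally maintained windowed mean: O(1) work per reading,
    rather than rescanning the whole window on every update."""
    def __init__(self, window):
        self.window = window
        self.buf = deque()
        self.total = 0.0

    def update(self, reading):
        self.buf.append(reading)
        self.total += reading
        if len(self.buf) > self.window:
            self.total -= self.buf.popleft()  # retract the expired reading
        return self.total / len(self.buf)

avg = SlidingAverage(window=3)
for temp in [20.0, 22.0, 24.0, 30.0]:
    current = avg.update(temp)
print(current)  # (22 + 24 + 30) / 3
```

A declarative engine generalizes this pattern: the programmer states the query (a windowed mean over a stream), and the optimizer chooses the incremental maintenance strategy and where to execute it, on the sensor or in the cluster.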

Work done jointly with Mengmeng Liu, Svilen Mihaylov, Boon Thau Loo, and Sudipto Guha.

Zachary Ives is an Associate Professor and the Markowitz Faculty Fellow at the University of Pennsylvania. His research interests include data integration and sharing, "big data", sensor networks, and data provenance and authoritativeness. He is a recipient of the NSF CAREER award, and an alumnus of the DARPA Computer Science Study Panel and Information Science and Technology advisory panel. He has also been awarded the Christian R. and Mary F. Lindback Foundation Award for Distinguished Teaching. He serves as the undergraduate curriculum chair for Penn's Singh Program in Market and Social Systems Engineering. He is a co-author of the textbook Principles of Data Integration.

Software-based Services for Cloud Networks

Date and Time
Monday, November 26, 2012 - 4:30pm to 5:30pm
Location
Computer Science 402
Type
Talk
Speaker
Anees Shaikh, from IBM TJ Watson Research Center
Software-defined networking (SDN) aims to provide a well-defined programming and automation interface to network devices. In moving network control and functionality to software running on standard server platforms, SDNs depart from the traditional, vertically integrated model of network hardware and software. We view SDNs as a new opportunity for supporting more seamless integration of networks with IT processes, and for providing higher-value network services.

In this talk, after describing the SDN model and our approach for building SDNs, we will focus on the problem of providing networking services for cloud computing platforms. As more enterprises look to leverage the cost and flexibility advantages of cloud computing, the lack of rich networking support remains a challenge. We will discuss the requirements of enterprise line-of-business applications for additional network functions in the cloud, and describe our research efforts to develop networking services for multi-tenant enterprise clouds using software-defined networking techniques.

Anees Shaikh is a Research Staff Member and Manager with the IBM TJ Watson Research Center in New York. He currently leads the Systems Networking Research group at Watson, focusing on problems related to data center and cloud networks. www: http://researcher.ibm.com/researcher/view.php?person=us-aashaikh

Broadband Internet Performance: A View From the Gateway

Date and Time
Monday, November 12, 2012 - 10:00am to 11:00am
Location
Computer Science 402
Type
Talk
Host
Jennifer Rexford
Policymakers, ISPs, and users are increasingly interested in studying the performance of Internet access links. This talk will first present our findings on the accuracy and overhead of state-of-the-art available bandwidth estimation tools to measure residential broadband throughput. Because of many confounding factors in a home network or on end hosts, however, thoroughly understanding access network performance requires deploying measurement infrastructure in users’ homes as gateway devices. This talk will then present the first study of network access link performance measured directly from home gateway devices. In conjunction with the Federal Communications Commission’s study of broadband Internet access in the United States, we study the throughput and latency of network access links using longitudinal measurements from nearly 4,000 gateway devices across 8 ISPs, drawn from a deployment of over 4,200 devices. We study the performance users achieve and how various factors ranging from the user’s choice of modem to the ISP’s traffic shaping policies can affect performance. Our study yields many important findings about the characteristics of existing access networks. Our findings also provide insights into the ways that access network performance should be measured and presented to users, which can help inform ongoing broader efforts to benchmark the performance of access networks.

The work presented in this talk appeared at:

  • O. Goga, R. Teixeira, "Speed Measurements of Residential Internet Access", in Proc. of Passive and Active Measurement Conference, March 2012.
  • S. Sundaresan, W. de Donato, N. Feamster, R. Teixeira, S. Crawford, and A. Pescape, "Broadband Internet Performance: A View From the Gateway", in Proc. of ACM SIGCOMM, August 2011.
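Studies like this typically report latency with tail percentiles rather than averages, since a single shaped or queued sample can dominate the mean. A generic sketch of that style of summary (a hypothetical helper, not the study's actual tooling):

```python
def summarize(samples_ms):
    """Summarize latency samples with the average plus tail percentiles,
    using the nearest-rank percentile definition."""
    ordered = sorted(samples_ms)
    def pct(p):
        idx = max(0, int(round(p / 100 * len(ordered))) - 1)
        return ordered[idx]
    return {
        "avg": sum(ordered) / len(ordered),
        "p50": pct(50),
        "p95": pct(95),
    }

pings = [12, 15, 11, 14, 240, 13, 12, 16, 15, 14]  # one queued outlier
stats = summarize(pings)
print(stats["p50"], stats["p95"])  # 14 240
```

Here the median (14 ms) describes the typical experience while the 95th percentile (240 ms) exposes the outlier; the mean (36.2 ms) describes neither, which is why per-percentile reporting matters when presenting access-link performance to users.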

Renata Teixeira received the B.Sc. degree in computer science and the M.Sc. degree in electrical engineering from Universidade Federal do Rio de Janeiro, Brazil, in 1997 and 1999, respectively, and the Ph.D. degree in computer science from the University of California, San Diego, in 2005. During her Ph.D. studies, she worked on Internet routing at AT&T Research. She is currently a researcher with the Centre National de la Recherche Scientifique (CNRS) at LIP6, UPMC Sorbonne Universites, Paris, France. She was a visiting scholar at UC Berkeley/ICSI. Her research interests are in measurement, analysis, and management of data networks. She has authored more than 40 papers in this area. Renata serves on the editorial boards of the IEEE/ACM Transactions on Networking and the ACM SIGCOMM Computer Communication Review. She is also a member of the steering committee of the ACM Internet Measurement Conference and has been active in the program committees of ACM SIGCOMM, ACM IMC, PAM, and IEEE INFOCOM, among others.

Tractable market making in combinatorial prediction markets

Date and Time
Wednesday, October 10, 2012 - 12:30pm to 1:30pm
Location
Computer Science 402
Type
Talk
Prediction markets are emerging as a powerful and accurate method of aggregating information from populations of experts (and non-experts). Traders in prediction markets are incentivized to reveal their information by buying and selling "securities" for events such as "Romney to win Florida". The prices of securities reflect the aggregate belief about the events, and the key challenge is to price the securities correctly.

We present a new automated market maker for providing liquidity across multiple logically interrelated securities. Our approach lies somewhere between the industry standard---treating related securities as independent and thus not transmitting any information from one security to another---and a full combinatorial market maker for which pricing is computationally intractable. Our market maker, based on convex optimization and constraint generation, is tractable like independent securities yet propagates some information among related securities like a combinatorial market maker, resulting in more complete information aggregation. Our techniques borrow heavily from variational inference in exponential families. We prove several favorable properties of our scheme and evaluate its information aggregation performance on survey data involving hundreds of thousands of complex predictions about the 2008 U.S. presidential election.
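For background, the classical automated market maker that work in this area builds on is Hanson's logarithmic market scoring rule (LMSR); the talk's contribution sits between running one such market maker per security and an intractable fully combinatorial one. A minimal sketch of plain LMSR pricing (illustrative background, not the talk's new scheme):

```python
import math

def lmsr_cost(quantities, b=100.0):
    """Hanson's LMSR cost function C(q) = b * log(sum_i exp(q_i / b))."""
    return b * math.log(sum(math.exp(q / b) for q in quantities))

def lmsr_price(quantities, i, b=100.0):
    """Instantaneous price of security i: a softmax over outstanding shares.
    Prices are in (0, 1) and sum to 1, so they read as probabilities."""
    exps = [math.exp(q / b) for q in quantities]
    return exps[i] / sum(exps)

def buy(quantities, i, shares, b=100.0):
    """A trader pays C(q') - C(q) to move the outstanding quantities."""
    before = lmsr_cost(quantities, b)
    quantities = list(quantities)
    quantities[i] += shares
    return lmsr_cost(quantities, b) - before, quantities

q = [0.0, 0.0]                    # two mutually exclusive outcomes
print(lmsr_price(q, 0))           # 0.5 before any trades
cost, q = buy(q, 0, 50)           # trader buys 50 shares of outcome 0
print(lmsr_price(q, 0))           # price rises above 0.5
```

Running one such market per security keeps pricing tractable but shares no information across logically related securities; pricing the full joint outcome space is intractable. The convex-optimization approach described above aims for the middle ground.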

Joint work with Sebastien Lahaie and David Pennock.

Bio: Miroslav Dudik joined MSR-NYC in May 2012. His interests are in combining theoretical and applied aspects of machine learning, statistics, convex optimization and algorithms. He received his PhD from Princeton in 2007.

Virtualization Management and Data Center Monitoring for Energy-efficient Clouds

Date and Time
Tuesday, September 25, 2012 - 12:00pm to 1:30pm
Location
Computer Science 302
Type
Talk
Speaker
Canturk Isci, from IBM TJ Watson Research Center
Host
Margaret Martonosi
Energy and thermal management in the cloud has become a primary concern as the global footprint of large-scale compute systems reaches significant levels and continues to grow at an impressive pace, driven by ever-increasing compute demand and the data explosion. Various new control knobs across the systems, middleware, virtualization, and facilities levels open up many interesting opportunities for improving energy efficiency. However, these are coupled with significant management and monitoring challenges that must be addressed to exploit these opportunities effectively.

In this talk we first look at some of the potential opportunities and emerging challenges for improving energy efficiency in clouds, and highlight our research at the virtualization, distributed systems, and data center levels targeting these opportunities and challenges. We describe an effective distributed resource management approach based on a novel, black-box virtual machine (VM) demand estimation technique, which serves as a key enabler for improving the energy efficiency of virtualized systems via consolidation-driven, dynamic VM placement. We demonstrate that this demand estimation technique is highly general across virtualization technologies and can very accurately track actual VM demands even under extreme consolidation and resource constraints. With multiple real-system prototypes, we present the dramatic improvements in the performance and agility of energy-aware virtualization management when guided by demand estimation. Next, we briefly look at the increasing cost and complexity of monitoring in the cloud and describe a new approach based on mobile, autonomous robots for reliable, fast, and cost-effective monitoring and management in data centers. We demonstrate the operating principles of actual robot prototypes deployed in various live data centers, and discuss their applications to predictive and diagnostic analytics for energy, thermal, and asset management.
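Once per-VM demands have been estimated, consolidation reduces to a packing problem: place the demands onto as few hosts as possible so the remaining hosts can be powered down. A first-fit-decreasing heuristic illustrates the idea (a generic sketch, not the talk's actual placement algorithm):

```python
def consolidate(vm_demands, host_capacity):
    """First-fit-decreasing placement of estimated VM demands onto hosts.
    Returns a vm -> host index mapping and the number of hosts used."""
    hosts = []       # remaining capacity of each opened host
    placement = {}
    for vm, demand in sorted(vm_demands.items(), key=lambda kv: -kv[1]):
        for idx, free in enumerate(hosts):
            if demand <= free:       # fits on an already-open host
                hosts[idx] -= demand
                placement[vm] = idx
                break
        else:                        # no open host fits: power on a new one
            hosts.append(host_capacity - demand)
            placement[vm] = len(hosts) - 1
    return placement, len(hosts)

demands = {"vm1": 0.6, "vm2": 0.3, "vm3": 0.5, "vm4": 0.4, "vm5": 0.2}
placement, hosts_used = consolidate(demands, host_capacity=1.0)
print(hosts_used)  # 2 hosts suffice for 2.0 units of total demand
```

The quality of such packing depends entirely on how accurate the demand estimates are, which is why the black-box estimation technique described above is the key enabler: overestimates waste hosts, while underestimates cause contention after consolidation.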

Bio:
Canturk Isci is a Research Staff Member at the IBM TJ Watson Research Center. His research interests include virtualization, data center energy and thermal management, and microarchitectural and system-level techniques for energy-efficient and adaptive computing. Prior to joining IBM Research, Canturk was a Senior Member of Technical Staff at VMware, where he worked on distributed resource and power management and the performance and scalability of virtualized systems. He is the recipient of a best paper award at ICAC 2011, a best research poster award at VMworld 2008, and academic fellowships from the British Council, Princeton, and Bilkent University. He serves as the industry chair of the IEEE Computer Society Special Technical Community on Sustainable Computing. Canturk has a B.S. in Electrical Engineering from Bilkent University, an M.Sc. with Distinction in VLSI System Design from the University of Westminster, and a Ph.D. in Electrical Engineering from Princeton University.

Jane Street Tech Talk: NILE

Date and Time
Thursday, September 27, 2012 - 5:30pm to 6:30pm
Location
Computer Science Tea Room
Type
Talk
Jane Street will be hosting a discussion about the design and architecture of Nile, a persistent message queue intended to serve as a core component in a heterogeneous environment of distributed applications. Nile is designed for scalability, redundancy, reliable failover, low latency, and high throughput.
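The core contract of a persistent queue, that acknowledged messages survive a process crash, can be sketched in miniature (an illustrative toy, not Nile's design, which adds redundancy, failover, and much more): every enqueue is appended and fsynced to a log before it is acknowledged, and consumers track their read offset.

```python
import os
import tempfile

class PersistentQueue:
    """Toy append-only persistent queue: messages survive restarts because
    each enqueue is appended and flushed to a log file before returning."""
    def __init__(self, path):
        self.path = path
        self.read_offset = 0

    def enqueue(self, message: str):
        with open(self.path, "a", encoding="utf-8") as log:
            log.write(message + "\n")
            log.flush()
            os.fsync(log.fileno())  # durable before we acknowledge

    def dequeue(self):
        with open(self.path, "r", encoding="utf-8") as log:
            log.seek(self.read_offset)
            line = log.readline()
            if not line:
                return None         # queue drained
            self.read_offset = log.tell()
            return line.rstrip("\n")

path = os.path.join(tempfile.mkdtemp(), "queue.log")
q = PersistentQueue(path)
q.enqueue("order-1")
q.enqueue("order-2")
print(q.dequeue())  # order-1
```

Because the consumer's position is just an offset into the log, a fresh process pointed at the same file replays from the beginning; a production system would persist consumer offsets too, and replicate the log for redundancy.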

A Library-based Network Operating System for the Cloud

Date and Time
Monday, September 24, 2012 - 12:00pm to 1:30pm
Location
Computer Science 302
Type
Talk
Host
Jennifer Rexford
The cloud has so far mostly been used to consolidate existing operating systems and manage them more conveniently. The interfaces between guest kernels, applications, and VMs are all managed very differently, leading to serious inefficiency, unreliability, and insecurity.

Our new Mirage OS revisits the library OS, and finds it a practical approach to improve this situation. Applications are written in a high-level functional programming language (OCaml) and compiled directly into microkernels that run on the Xen hypervisor. By treating the hypervisor as a stable hardware platform, we can focus on high-performance protocol implementations without worrying about having to support the thousands of device drivers found in a traditional OS.

Mirage includes clean-slate functional implementations of protocols including TCP/IP, DNS, SSH, OpenFlow (switch and controller), HTTP, XMPP, and Xen inter-VM transports. In this talk, I will describe the basics of Mirage and demonstrate the performance and resource tradeoffs of our approach. Some of the new applications we are building with Mirage include the next-generation Xen Cloud Platform (a widely deployed open-source Xen distribution), OpenFlow-integrated applications, and a novel approach to managing distributed devices on the Internet.
