Profiling Latency in Deployed Distributed Systems | Computer Science Department at Princeton University

Date and Time

Wednesday, May 1, 2013 - 3:30pm to 4:30pm

Location

Computer Science Small Auditorium (Room 105)

Type

Talk

Speaker

Gideon Mann, from Google

Understanding the sources of latency within a deployed distributed system is complicated. Asynchronous control flow, variable workloads, pushes of new backend servers, and unreliable hardware can all contribute significantly to a job's performance. In this talk, I'll present the work of the Weatherman effort to build a profiling tool for deployed distributed systems. The method uses distributed traces to estimate the code control flow and to predict and explain observed performance. I'll then illustrate how this method has been applied to understand and tune large distributed systems at Google and how it has been used for differential profiling to understand the sources of latency changes.

To provide another view of latency, I'll briefly discuss our recent work on distributed convex optimization, with an emphasis on the interface between the algorithm and the computing substrate that runs it. In particular, I'll show that data center architecture, especially network architecture, should have a significant impact on machine learning algorithm design.

Gideon is a Staff Research Scientist at Google NY. He attended Brown University as an undergraduate, where he hung out in the AI lab and drank too much Mountain Dew. He then attended graduate school at Johns Hopkins University, worked in CLSP, and graduated in 2006 with a Ph.D. He still misses Charm City. He then did a postdoc at UMass Amherst with Andrew McCallum, working on weakly supervised learning. In 2007, he joined Google.

At Google, his team works on applied machine learning. The Weatherman effort applies statistical methods to data center management. The team is also responsible for the Prediction API (https://developers.google.com/prediction/). Publicly released in 2010, Prediction was an early machine-learning-as-a-service offering and remains an ongoing research project.