Profiling Latency in Deployed Distributed Systems
Gideon Mann, Google
Understanding the sources of latency within a deployed distributed
system is complicated. Asynchronous control flow, variable workloads,
pushes of new backend servers, and unreliable hardware can all
significantly affect a job's performance. In this talk, I'll
present the work of the Weatherman effort to build a profiling tool
for deployed distributed systems. The method uses distributed traces
to estimate the code control flow and predict/explain observed
performance. I'll then illustrate how this method has been applied to
understand and tune large distributed systems at Google and how it has
been used in a differential profiling fashion to understand the
sources of latency changes.
To provide another view of latency, I'll quickly discuss our recent
work on distributed convex optimization with an emphasis on the
interface between the algorithm and the computing substrate performing
the computation. In particular, I'll show that data center
architecture, especially network architecture, should have a
significant impact on machine learning algorithm design.
Gideon is a Staff Research Scientist at Google NY. He attended Brown
University as an undergraduate where he hung out in the AI lab and
drank too much Mountain Dew. He then attended graduate school at Johns
Hopkins University, worked in CLSP, and graduated in 2006 with a Ph.D.
He still misses Charm City. He then did a postdoc at
UMass Amherst with Andrew McCallum, working on weakly supervised
learning. In 2007, he joined Google.
At Google, his team works on applied machine learning. The Weatherman
effort applies statistical methods to data center management. The
team is also responsible for the Prediction API
(https://developers.google.com/prediction/). Publicly released in
2010, Prediction was an early machine-learning-as-a-service offering
and remains an ongoing research project.