RINC: Real-Time Inference-based Network Diagnosis in the Cloud

January 5, 2015
Cloud tenants experience performance problems due to issues
within their virtual machines (VMs) or within the cloud
infrastructure. To offer good and predictable performance,
cloud providers must be able to detect and diagnose performance
problems in real time. However, existing cloud
diagnosis techniques either cannot detect problems
in the tenant’s VMs or are too costly. We argue that rather
than collecting all statistics, cloud diagnosis should proceed
in phases, with each phase selectively collecting
heavier-weight measurements. To this end, we introduce a set of
novel techniques for inferring the internal state of a VM’s
midstream network connections, allowing us to accurately
collect measurements at any point during a connection.
Our framework, RINC, runs within the hypervisor,
using these techniques to selectively monitor a tenant’s connections.
RINC provides a simple query interface to its cloud-wide
platform that allows cloud operators to easily write diagnosis
applications. We evaluate RINC on a testbed and with
a simulator using a combination of real data center traces
and synthetic workloads. Our evaluations validate RINC’s
accuracy and show that, by being selective, RINC is able
to scale to a cloud with 100K physical servers or 1 million
VMs. Moreover, we demonstrate RINC’s flexibility and expressiveness
by implementing five diagnosis applications.

