Today's data centers usually employ high redundancy to tolerate hardware and software errors. Unfortunately, another type of error, namely configuration errors (i.e., misconfigurations), can still take down an entire data center. Misconfigurations have contributed to more than 30% of high-severity issues and have caused some of the recent major downtime at cloud service companies including Amazon, Microsoft Azure, Google, and Facebook. Yet many software developers and system designers place most of the blame for configuration errors on users, and do not pay enough attention to handling misconfigurations in a more active way. Compared with software bugs, configuration issues have much less tooling support for error detection, issue tracking, tolerance testing, and design review.
In this talk, I will present our recent work on characterizing misconfigurations in commercial and open source systems (as well as at a major commercial cloud service provider), along with some of our solutions to this configuration problem. We approach it from the perspective that configuration is part of the user interface, and therefore needs to be considered from the user's perspective. More specifically, as a practical first step, systems need to avoid error-prone configuration requirements and react gracefully to user mistakes. Our solutions have been used by two commercial companies and have influenced some popular open source systems, such as Squid, to redesign their configuration. I will also discuss some of the negative interactions (and "challenges") we had with some open source developers.
Yuanyuan Zhou is currently a Qualcomm Chair Professor at UC San Diego. Before UCSD, she was a tenured associate professor at the University of Illinois at Urbana-Champaign. Her research interests span operating systems, software engineering, and system reliability and maintainability. Three of her papers were selected for the IEEE Micro Special Issue on Top Picks from Architecture Conferences, and her work has received the best paper award at SOSP'05 and a 2011 ACM SIGSOFT FSE (Foundations of Software Engineering) Distinguished Paper award. She has co-founded two startups. Her most recent startup, Pattern Insight, has successfully deployed software quality assurance tools at many companies, including Cisco, Juniper, Qualcomm, Motorola, Intel, EMC, Lucent, and Tellabs. Pattern Insight recently sold its Log Insight business to VMware. In addition, Intel has licensed some of her solutions for detecting concurrency bugs. She has graduated 14 Ph.D. students so far, many of whom have joined top industrial companies or have become tenure-track faculty at universities such as the University of Wisconsin at Madison, the University of Toronto, and the University of Waterloo.