This thesis presents network server performance analysis and improvement at the operating system (OS), application, and processor levels. At the kernel level, we develop a profiling tool that provides rich OS transparency at low cost by exposing system call performance as a first-class result via in-band channels. Using this tool on the Flash Web server running the standard SPECweb99 benchmark reveals a series of negative interactions between the server application and the OS. Some of the solutions to these issues have led to a set of kernel patches that improve networked file transfer, and others contribute to server application design.
At the application level, we redesign the Flash server and the widely used Apache server, improving Flash's SPECweb99 score by a factor of four and reducing response time by one to two orders of magnitude on both servers. Using these servers, we then examine server latency under load and trace the root cause of server-induced latency to head-of-line blocking within filesystem-related kernel queues. This behavior, in turn, causes batching and burstiness, and gives rise to a phenomenon we call service inversion, where requests are served unfairly, with long responses served ahead of short ones. Removing blocking not only reduces response time under load and improves latency profiles, but also mitigates burstiness and improves request handling fairness. The resulting servers show better latency scalability with processor speed, making them better candidates for future improvements.
Finally, we investigate the architectural aspects of server performance, conducting a detailed analysis of delivered simultaneous multithreading (SMT) systems. Using five different software packages and three hardware platforms, we show experimentally that the benefits of the current SMT implementation on Intel Xeon processors are modest for network servers, and that short memory latency or extra L3 cache helps SMT yield better speedups. By performing microarchitectural evaluation using processor performance counters, we also provide insight into the instruction-level resource bottlenecks that affect performance on these platforms. We then compare the measured results with similar studies performed using simulation, and discuss the feasibility of these simulation models, both in the context of current hardware and with respect to future trends.