Bugs With Long Tails Can Be Costly Pests

In the world of servers and HPC, the smallest of inefficiencies can build into big problems.

I don’t think Van Gogh was considering high-performance computing or server architecture, but he made a lot of sense when he said “great things are done by a series of small things brought together.” A series of very small things can, and does, create big things. That’s the fundamental premise of long-tail marketing: Amazon, for one, has built a strong business from selling millions of niche items. But another interpretation of the ‘long tail’ brings a potentially nasty sting for high-performance computing platforms, where it can point to painful and often costly problems that are not always apparent.

A good example of this was a problem that affected Google’s servers for three years: a bug affected 25% of the system, which equates to millions of nodes spread across Google’s extensive global network. More details are available on Dan Luu’s blog. In the world of e-commerce, even slowdowns of microseconds can have incalculable knock-on financial impacts. It is well established that latency in the user experience increases the chances of a transaction never being completed and reduces the likelihood of ad clicks.

Users expect instant (millisecond) responses when they’re using Facebook, unaware that seemingly simple actions (for example, updating their timeline or posting an update) involve a highly complex, geographically dispersed network of nodes that rank, filter, and format updates, and that fetch the necessary media files plus relevant advertisements and recommendations. A single query is often broken into sub-processes and handled in different parts of the network. A simple web search, for example, can involve up to 100 web server blades which ‘farm out’ processing tasks across a global network.

The problem is that the smallest of bugs or inefficiencies in the system can create ‘outlier’ events which have the potential to cripple the entire system. Even if only one in a hundred processes is affected, a request that fans out across many nodes is statistically likely to pass through an affected one at some point. At the overall system level, users then ‘see’ the 99th-percentile worst-case performance rather than the typical case. These are the problematic ‘long tail’ issues, and controlling them is a valuable exercise.
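
To put rough numbers on that argument, here is a minimal Python sketch. It assumes each sub-request independently hits the slow path with probability 1% (the "one in a hundred" above) and that a request is slow if any of its sub-requests is slow. Real systems are unlikely to be this tidy, and the function names are purely illustrative.

```python
import random


def prob_request_hits_tail(p_slow: float, fan_out: int) -> float:
    """Chance that at least one of `fan_out` independent sub-requests is slow.

    Assumption: sub-requests are independent, each slow with probability p_slow.
    """
    return 1.0 - (1.0 - p_slow) ** fan_out


def simulate(p_slow: float, fan_out: int, trials: int, seed: int = 0) -> float:
    """Monte Carlo cross-check of the formula above (illustrative only)."""
    rng = random.Random(seed)
    slow_requests = 0
    for _ in range(trials):
        # The request waits for its slowest sub-request, so one slow
        # sub-request is enough to make the whole request slow.
        if any(rng.random() < p_slow for _ in range(fan_out)):
            slow_requests += 1
    return slow_requests / trials


if __name__ == "__main__":
    for fan_out in (1, 10, 100):
        p = prob_request_hits_tail(0.01, fan_out)
        print(f"fan-out {fan_out:>3}: {p:.1%} of requests include a slow sub-request")
```

Under these assumptions, a request that touches a single node is slow 1% of the time, but one that fans out to 100 nodes is slow roughly 63% of the time. The rare case at the node level becomes the common case at the user level.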

Servers rely on complex, often heterogeneous, multicore processor architectures and it’s this technology that enables the services to be delivered to users around the world in an instant. But, with the capability comes a significant increase in complexity and this has to be managed and optimized to ensure performance is maintained.

Embedding hardware-based, non-intrusive, wire-speed smart and conditional monitoring capabilities within the processor SoC allows the collection of granular data on real-world behaviour, not just of the chip, but also of the wider system. This level of insight makes it much easier to focus on performance issues than is possible with traditional solutions such as sampling profilers or application- and system-level instrumentation, and it has the added benefit of being entirely non-intrusive. The hardware-based approach can detect hard-to-identify issues that impact performance – for example, problems with affinity management policies, contention and cache coherence.

For SoC manufacturers, this approach provides a major point of differentiation: products that allow customers, and even end users, to refine and optimize the performance of the server infrastructure into which the chips are built, while that infrastructure is running.

We published a white paper covering the topic of Tail End Latency and Server Debug in some depth; you can download it here. I also spoke about this at the recent Linley Spring Processor Conference. If you’d like to receive the slides or to find out more, drop an email to [email protected].


