
SPONSOR BLOG

Bugs With Long Tails Can Be Costly Pests

In the world of servers and HPC, the smallest of inefficiencies can build into big problems.

April 26th, 2018 - By: Gajinder Panesar

I don’t think Van Gogh was considering high-performance computing or server architecture, but he made a lot of sense when he said “great things are done by a series of small things brought together.” A series of very small things can, and does, create big things. That’s the fundamental premise of long-tail marketing: Amazon, for one, has built a strong business from selling millions of niche items. But another interpretation of the ‘long tail’ brings a potentially nasty sting for high-performance computing platforms, where it can point to painful and often costly problems that are not always apparent.

A good example of this was a problem affecting Google’s servers for three years: a bug affected 25% of the system, which equates to millions of nodes spread across Google’s extensive global network. More details are available on Dan Luu’s blog. In the world of e-commerce, even slowdowns of microseconds can have incalculable knock-on financial impacts. It is well established that latency in the user experience increases the chances of a transaction never being completed and reduces the likelihood of ad clicks.

Users expect instant (millisecond) responses when they’re using Facebook – unaware that seemingly simple actions (for example, updating their timeline or posting an update) involve a highly complex, geographically dispersed network of nodes that rank, filter, and format updates, and that fetch the necessary media files along with relevant advertisements and recommendations. A single query is often broken into sub-processes that are handled in different parts of the network. A simple web search, for example, can involve up to 100 web server blades that ‘farm out’ processing tasks across a global network.

The problem is that the smallest of bugs or inefficiencies in the system can create ‘outlier’ events which have the potential to cripple the entire system. Even if only one in a hundred processes is affected, a request that fans out across many processes will, statistically, very likely pass through an affected one at some point. The effect at the overall system level is that users ‘see’ the 99th-percentile worst-case performance. These are the problematic ‘long tail’ issues, and controlling them is a valuable exercise.
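The arithmetic behind that statement can be sketched in a few lines. This is a simplified model rather than a description of any real system: it assumes each sub-process is slow independently with the same probability, and the function name and parameters are illustrative. Under those assumptions, a request that touches 100 sub-processes, each slow 1% of the time, hits at least one slow path roughly 63% of the time.

```python
def prob_request_hits_slow_path(p_slow: float, fan_out: int) -> float:
    """Probability that at least one of `fan_out` sub-processes is slow.

    Assumes sub-processes are slow independently, each with probability
    `p_slow` (a simplification; real systems show correlated slowdowns).
    """
    if not 0.0 <= p_slow <= 1.0:
        raise ValueError("p_slow must be between 0 and 1")
    if fan_out < 0:
        raise ValueError("fan_out must be non-negative")
    # P(at least one slow) = 1 - P(all fast) = 1 - (1 - p)^N
    return 1.0 - (1.0 - p_slow) ** fan_out


if __name__ == "__main__":
    # One in a hundred sub-processes is slow (the article's example rate).
    for fan_out in (1, 10, 100, 1000):
        p = prob_request_hits_slow_path(0.01, fan_out)
        print(f"fan-out {fan_out:>4}: {p:.1%} of requests see the slow path")
```

The wider the fan-out, the more the rare slow case becomes the common user experience, which is why the tail, and not the average, tends to dominate perceived latency.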

Servers rely on complex, often heterogeneous, multicore processor architectures, and it’s this technology that enables services to be delivered to users around the world in an instant. But with that capability comes a significant increase in complexity, which has to be managed and optimized to ensure performance is maintained.

Embedding hardware-based, non-intrusive, wire-speed smart and conditional monitoring capabilities within the processor SoC allows the collection of granular data on real-world behaviour, not just of the chip, but also of the wider system. This level of insight makes it much easier to focus on performance issues than is possible with traditional solutions such as sampling profilers or application- and system-level instrumentation, and it does so without perturbing the system being observed. The hardware-based approach can detect hard-to-identify issues that impact performance – for example, problems with affinity management policies, contention and cache coherence.
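A rough software analogy shows why conditional monitoring matters for rare events. The sketch below is purely illustrative and does not model any particular monitoring hardware or profiler: the synthetic trace, the outlier rate, the sampling interval and the threshold are all assumed values. It compares a monitor that looks at every hundredth event with one that evaluates a condition on every event and records only those that exceed a latency threshold.

```python
import random


def make_trace(n_events=100_000, outlier_rate=0.001, seed=0):
    """Synthetic latencies in microseconds: mostly ~100 us, rare 10 ms outliers."""
    rng = random.Random(seed)
    return [
        10_000.0 if rng.random() < outlier_rate else rng.gauss(100.0, 5.0)
        for _ in range(n_events)
    ]


def periodic_sample(trace, every=100):
    """Sampling-style monitor: observes only every `every`-th event."""
    return trace[::every]


def conditional_capture(trace, threshold_us=1_000.0):
    """Conditional monitor: checks every event, keeps only those over threshold."""
    return [latency for latency in trace if latency > threshold_us]


THRESHOLD_US = 1_000.0
trace = make_trace()
total_outliers = sum(1 for latency in trace if latency > THRESHOLD_US)
seen_by_sampling = sum(
    1 for latency in periodic_sample(trace) if latency > THRESHOLD_US
)
seen_by_trigger = len(conditional_capture(trace, THRESHOLD_US))

print(f"outliers in trace:            {total_outliers}")
print(f"outliers seen by sampling:    {seen_by_sampling}")
print(f"outliers seen by conditional: {seen_by_trigger}")
```

In software, checking every event has a cost of its own, which is the motivation for evaluating the condition in hardware at wire speed and leaving the running workload undisturbed.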

For SoC manufacturers, this provides a major point of differentiation: products that allow customers and even end users to refine and optimize the performance of the server infrastructure into which the chips are built, while that infrastructure is running.

We published a white paper covering the topic of Tail End Latency and Server Debug in some depth; you can download it here. I also spoke about this at the recent Linley Spring Processor Conference. If you’d like to receive the slides or to find out more, drop an email to [email protected].

Gajinder Panesar

Gajinder Panesar is a fellow at Siemens EDA. He holds more than 20 patents and is the author of more than 20 published works. Prior to joining UltraSoC, Panesar served at NVIDIA. As chief architect at Picochip, he created the architecture of the company’s market-defining small-cell SoCs, and continued in this capacity after the company’s acquisition by Mindspeed Inc. His previous experience includes roles at STMicroelectronics, INMOS, and Acorn Computers. He is a former Research Fellow at the UK’s Southampton University, and a former Visiting Fellow at the University of Amsterdam.
