The Increasing Challenge Of Reducing Latency

Growing complexity coupled with a need to reduce power is making it harder to eliminate or even to hide latency.


By Ed Sperling
When the first mainframe computers were introduced, the big challenge was to improve performance by decreasing the latency between spinning reels of tape and the processor—while also increasing the speed at which the processor could crunch ones and zeroes. Fast forward more than six decades and the two issues are now blurred and often confused.

Latency is still a drag on performance, but it now is determined by a multitude of factors ranging from processor and memory configurations to connectivity on-chip, across the board and outside a device. In fact, someone trying to define latency as recently as five years ago would have trouble comprehending all of the elements that affect it today.

Call your landline with your cell phone and you can easily discern the delay, but what’s actually causing that latency? It could be anything from cache coherency to on-chip memory and routing, I/O, and power island wake-up and shut-down, among other things.

“There are a couple of factors that you need to look at with latency,” said Chris Rowen, chief technology officer at Tensilica. “One is that people are building SoCs that are dramatically more complex. There’s more wireless functionality, more broadcast receiver issues, audio, storage control and the entire device is more data intensive. The second thing is that in addition to all of this growing complexity you have to drive down power. Mobile platforms have changed the game. Processors and DSPs need to run using an order of magnitude less energy.”

Processor latency
This has led to several distinctly different architectural approaches to dealing with power and performance—frequently within the same chip. One is to add more cores to a device for specific functions, but not to make the data coherent. Rowen said that cache coherency can cost an order of magnitude in power—or more.

“Cache coherency plays a role in certain slices of the market, but it’s really of interest only to about 10% to 15% of the designers in the dataplane,” he said. Keeping cache coherent also takes time, which increases latency. The way to compensate for that is to turn up the clock frequency and offset coherency checking with faster processing time, but that burns more power.

A second approach is to use more cores in a CPU or a GPU, but to keep them turned off most of the time. "Dark silicon," a term coined by ARM CTO Mike Muller, has been gaining far more attention lately as a way of powering up only what is absolutely necessary. While this certainly makes sense for improving battery life and even reducing the amount of electricity used by a plug-in device, there is a latency impact. Turning cores on and off takes time, and turning them on and off faster requires more energy than doing it slowly.

The way around this is to anticipate what a core will actually be doing and determine whether to leave it on or turn it off. “There is a ramp up and ramp down cost,” said Mark Throndson, director of product marketing at MIPS Technologies. “You need to look at analysis of activity and a prediction of the expected workload far enough ahead to have an impact on latency.”

That’s related closely to an advanced scheduling scheme, which includes prioritization and sequencing.
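This kind of prediction can be reduced to a break-even rule: power-gate a core only if the predicted idle period saves more energy than the ramp down and ramp up cost, and only if the wake-up delay fits the workload's latency budget. The sketch below is a minimal illustration of that rule; all the numbers and function names are hypothetical, not drawn from any vendor's power model.

```python
# Hypothetical break-even model for core power gating.
# All constants are illustrative, not from a real SoC datasheet.

IDLE_POWER_MW = 50.0      # leakage burned if the core stays on while idle
GATE_OVERHEAD_UJ = 120.0  # energy cost of ramping the core down and back up
WAKE_LATENCY_US = 40.0    # delay before a gated core can run again

def break_even_idle_us():
    """Idle time above which gating saves net energy.
    mW * us = nJ, so convert the overhead from uJ to nJ first."""
    return GATE_OVERHEAD_UJ * 1000.0 / IDLE_POWER_MW  # microseconds

def should_gate(predicted_idle_us, latency_budget_us):
    """Gate the core only if the predicted idle period pays back the
    ramp cost AND the wake-up latency fits the workload's budget."""
    return (predicted_idle_us > break_even_idle_us()
            and WAKE_LATENCY_US <= latency_budget_us)
```

With these made-up constants the break-even point works out to 2,400 microseconds of idle time, which is why the workload prediction has to look far enough ahead: a core that will be needed again in a few hundred microseconds is cheaper to leave on.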

A third approach is to simply use more specialized processors that are right-sized for a particular function. This rationalization of resources—and rationalized use of those resources—has been talked about for years but only sporadically implemented. The reason? It’s difficult to architect, requiring lots of up-front tradeoffs in performance, power and use models that span both hardware and software. And companies trying to win a socket with an SoC design have sometimes built in such features only to find they aren’t used by customers.

Still, not everything needs to run at gigahertz speeds, and in some applications a certain amount of latency is acceptable. Users waiting for Internet access, for example, are far less forgiving than those waiting on a background security check or even a mail download. Even for those applications where latency is not acceptable, multithreading can help. MIPS' Throndson said the advantages of multithreading become really apparent when latency to the L1 and L2 caches is higher.

“What you’re doing is helping to keep the processor busy under normal circumstances,” he said. “When one thread stalls, the CPU has one or more additional threads available. That can be used to hide a lot of latency.”
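The arithmetic behind that latency hiding can be sketched with a toy model (this is an idealized illustration, not MIPS' actual microarchitecture): a single thread pays the full stall penalty for every cache miss, while a two-thread core fills one thread's stall cycles with the other thread's instructions.

```python
# Toy model of latency hiding via hardware multithreading.
# Numbers are illustrative only; real pipelines overlap work differently.

def single_thread_cycles(n_instr, miss_rate, miss_penalty):
    """One thread: one cycle per instruction, and every cache miss
    fully stalls the pipeline for the miss penalty."""
    stalls = int(n_instr * miss_rate) * miss_penalty
    return n_instr + stalls

def two_thread_cycles(n_instr, miss_rate, miss_penalty):
    """Idealized two-thread core: when one thread stalls on a miss,
    the other thread issues instructions, so stalls overlap with work.
    Best case, total time is bounded below by the total instruction
    count and by one thread's full (compute + stall) timeline."""
    per_thread = n_instr // 2
    stalls_per_thread = int(per_thread * miss_rate) * miss_penalty
    return max(n_instr, per_thread + stalls_per_thread)
```

With 10,000 instructions, a 2% miss rate and a 100-cycle penalty, the single thread takes 30,000 cycles while the idealized two-thread core takes 15,000: two-thirds of the stall time has been hidden behind useful work.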

Network and I/O latency
The processor is only one piece of this puzzle. Within an SoC there are many components—blocks with their own processors or logic, for example, that need to connect to the outside world or other blocks within the SoC.

"There are too many pieces that need to be connected," said Jack Browne, senior vice president at Sonics. "With a smart phone, the benchmark is how long it takes to open a browser. But that may not represent the best practice for how to design a chip even though it may sell the socket. This isn't always intuitive."

A case in point is downloading a large image or a video. It may seem logical to move that through the available network as fast as possible, but that also could block other traffic on a chip.

“You need to sort out tasks in a way that makes sense to the user,” said Browne. “That means solving first-order problems in ways that aren’t always apparent. There may be other scenarios to consider. But the flip side of this is, how many scenarios are enough to consider?”

The whole network-on-chip model of Sonics and Arteris is geared toward reducing latency and adding flexibility into a design.

“The big bottleneck right now is bandwidth to memory,” said Kurt Shuler, vice president of marketing at Arteris. “This is causing latency issues.”

One way to deal with this is to widen the data pipes, and the entire Wide I/O and stacked-die effort is geared toward increasing bandwidth on an SoC. Another approach is simply to change storage from spinning disks to solid state drives, or SSDs.

Storage latency
For anyone who has booted up a notebook computer that uses SSD rather than spinning media, it’s obvious where one source of latency has been. Spinning disks take time to spin up and to find and pull down data, and no matter how fast they spin they cannot compete with solid-state drives, which have no moving parts. That makes SSDs inherently more energy-efficient as well as significantly faster.

“What’s changed is that there are now standardized drivers in Windows 8 and Linux,” said Neil Hand, group product marketing director for Cadence’s SoC Realization Group. “All of the early implementations were proprietary. Now you can take any processor with PCIe and firmware and standard NVMe and connect it together.”

While the volume market is on the portable device side, the really big opportunity for SSD is in data centers, where the explosion of data consumes huge amounts of electricity. Latency is absolutely critical in data centers. Think about a financial trading floor, for example. Storage also has to be efficient, though, because costs there are measured in many millions of dollars per year. Information technology operating expenses show up as a line item on the corporate budget because it costs so much money to power servers and storage, as well as to cool row upon row of these devices.

While there has been significant movement on the server side to improve efficiency, first through virtualization and more recently with significantly more efficient servers, equivalent steps on the storage side have been much slower. And all of this has an impact on latency.

One problem is that solid-state drives don’t hold as much data as spinning disks, as anyone comparing the SSD and spinning-media options on notebooks has noticed. On the standards front, the data-center counterpart to NVMe is SCSI over PCIe, or SOP, from the Solid State Storage Initiative.

“What’s changing is that you’re adding an intelligent scheduler,” said Craig Stoops, group director for verification IP at Synopsys. “You’re able to queue multiple commands at the same time and potentially execute them out of order. This is where you’re seeing companies like Fusion-IO and Violin Memory showing up with high data rates or transactions.”
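The queueing idea Stoops describes can be sketched as a submission/completion queue pair, loosely modeled on how NVMe-style interfaces work. The class and method names below are illustrative, not taken from the NVMe specification: the point is only that commands go in submission order but may complete in any order the device finds efficient.

```python
from collections import deque

class CommandQueue:
    """Toy NVMe-style queue pair. Commands are submitted in order, but
    the device may post completions out of order. Illustrative only."""

    def __init__(self):
        self.in_flight = {}         # command id -> (op, logical block address)
        self.completions = deque()  # completion entries, in device order
        self._next_id = 0

    def submit(self, op, lba):
        """Host enqueues a command and gets back its command id."""
        cid = self._next_id
        self._next_id += 1
        self.in_flight[cid] = (op, lba)
        return cid

    def device_complete(self, cid, status=0):
        """Device finishes a command, possibly out of submission order."""
        self.in_flight.pop(cid)
        self.completions.append((cid, status))

    def reap(self):
        """Host drains the completion queue in the order the device
        posted entries, which need not match submission order."""
        done = list(self.completions)
        self.completions.clear()
        return done
```

A device servicing three queued reads might, for example, complete the one closest to data already in its buffer first, which is exactly the reordering freedom that lets flash controllers sustain high transaction rates.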

Trying to nail down a single problem or even multiple problems with latency remains a moving target. Where companies have direct control over a piece of the latency mosaic, they can make a dent. Where they don’t have direct access to the technology, they can mask it—as long as it doesn’t upset the power-performance balance too much.

“Latency shows up in lots of different places, from system design through the network, and from the base station to the terminal and back to the device,” said Tensilica’s Rowen. “In an SoC it’s dominated by going off chip from the processor to the bus to the DRAM controller to DRAM and back, which can eat up 100 to 200 processor cycles. Cache is one way of dealing with that. Another is context switching of processors.”
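The textbook way to quantify how much a cache blunts that off-chip round trip is average memory access time (AMAT). The numbers below are illustrative, reusing the midpoint of the 100-to-200-cycle figure Rowen cites.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time in cycles: every access pays the
    cache hit time; misses additionally pay the DRAM round trip."""
    return hit_time + miss_rate * miss_penalty

# With a 1-cycle cache hit, a 150-cycle DRAM round trip (midpoint of
# the 100-200 range quoted above) and a 5% miss rate, the average
# access lands at roughly 8.5 cycles rather than 150.
avg = amat(hit_time=1, miss_rate=0.05, miss_penalty=150)
```

The same formula shows why the miss rate dominates: halving the miss rate to 2.5% saves far more cycles than shaving a cycle off the hit time, which is why so much SoC architecture effort goes into cache sizing and prefetching rather than raw cache speed.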

The reality, though, is that there will always be latency to battle somewhere on a chip, whether it’s a planar design or a stacked die, and regardless of how big the data pipes get or how much intelligence is applied to scheduling. The history of semiconductor and system engineering has been about solving bottlenecks, and the future of advanced designs almost certainly will include new wrinkles on existing problems as well as new challenges in areas such as process variation and quantum effects. And while they may be familiar problems, they’re being viewed in a new context.