Experts At The Table: Latency

First of three parts: Memory access time; SoC complications; tradeoffs with energy efficiency; external causes of latency; the good and bad of software design; network impacts; dependencies and intrinsic issues.


By Ed Sperling
Low-Power/High-Performance engineering sat down to discuss latency with Chris Rowen, CTO at Tensilica; Andrew Caples, senior product manager for Nucleus in Mentor Graphics’ Embedded Software Division; Drew Wingard, CTO at Sonics; Larry Hudepohl, vice president of hardware engineering at MIPS; and Barry Pangrle, senior power methodology engineer at Nvidia. What follows are excerpts of that discussion.

LPHP: Latency can stem from a lot of different areas. Where do you see the problem?
Hudepohl: One of the traditional latency issues is memory access time. Most aspects of processor design over the past decade have focused on different ways to address that problem. That’s why we saw the advent of pipelining, caching, the use of more complex branch prediction techniques, and the use of more complex architectures such as out-of-order processing to try to find other instructions to execute while one set of instructions is waiting on a cache miss. Multithreading is also a technique being used to help minimize the effects of memory latency. You basically run instructions from another thread while one thread is waiting for a response. Many of the decisions we’ve made micro-architecturally and architecturally have been driven by latency repercussions.
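As a rough illustration of the multithreading point, the Python sketch below models a core whose threads each compute for a few cycles and then stall on a memory access. The function name, the zero-cost thread switch, and the cycle counts are assumptions chosen for illustration; none of them come from the discussion.

```python
def core_utilization(threads, compute_cycles, miss_latency):
    """Fraction of cycles spent on useful work in a simple model.

    Assumptions (hypothetical, for illustration only):
    - every thread alternates compute_cycles of work with one stall
      of miss_latency cycles,
    - switching between threads costs nothing,
    - stalls from different threads overlap perfectly.
    """
    if threads < 1 or compute_cycles <= 0 or miss_latency < 0:
        raise ValueError("invalid model parameters")
    # Each thread can keep the core busy for compute_cycles out of every
    # (compute_cycles + miss_latency) cycles; the core cannot exceed 100%.
    return min(1.0, threads * compute_cycles / (compute_cycles + miss_latency))


if __name__ == "__main__":
    # Illustrative numbers: 20 cycles of work per 80-cycle miss.
    for n in (1, 2, 4, 5, 8):
        print(n, "thread(s):", core_utilization(n, 20, 80))
```

In this model one thread keeps the core busy a fifth of the time, and utilization stops improving once enough threads exist to cover the stall. A real core also pays for thread switches and shared-cache pressure, which the sketch ignores.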
Wingard: My entire career has been in the era of hiding latency in microprocessor design in different ways. What has become interesting is that latency is becoming a bigger challenge to deal with in SoCs. In early-generation SoCs there were enough other things going on at low-enough data rates so that latency wasn’t as critical a problem. As we continue to evolve and have more and more things using external memory, as well as more and more things on the chip communicating through external memory, we have the intrinsic things dealing with latency as well as the contingent things. Lots of different processors, co-processors and I/O cores are trying to get access at the same time. There’s extra latency just based on contention. We have lots of techniques for hiding latency, but once we provide service, which ones do we choose to service? You have to deliver differential choices there. One core may be very sensitive to latency, while another may be much more tolerant and only concerned with worst-case latency so your video doesn’t flicker. What gets really complex is how you optimize and architect to spread around the latency.
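One way to picture the differential service described here is an arbiter that normally favors latency-sensitive requesters but lets a tolerant requester jump the queue when its worst-case deadline is at risk. The sketch below is a hypothetical policy written for illustration; the `Request` fields, the `pick_next` function, and the urgency rule are assumptions, not a description of any product mentioned in the discussion.

```python
from dataclasses import dataclass


@dataclass
class Request:
    core: str                # name of the requesting core (illustrative)
    latency_sensitive: bool  # True for cores that want low average latency
    deadline: int            # cycle by which service must have completed


def pick_next(pending, now, service_time):
    """Choose the next memory request to serve (hypothetical policy)."""
    if not pending:
        raise ValueError("no pending requests")
    # A request is urgent if serving one other request first would
    # push its own completion past its deadline.
    urgent = [r for r in pending if r.deadline - now < 2 * service_time]
    if urgent:
        return min(urgent, key=lambda r: r.deadline)
    # Otherwise favor cores that care about average latency.
    sensitive = [r for r in pending if r.latency_sensitive]
    if sensitive:
        return min(sensitive, key=lambda r: r.deadline)
    return min(pending, key=lambda r: r.deadline)


if __name__ == "__main__":
    queue = [Request("cpu", True, 1000), Request("display", False, 100)]
    print(pick_next(queue, now=0, service_time=10).core)   # display has slack
    print(pick_next(queue, now=85, service_time=10).core)  # display is at risk
```

With slack available the CPU request goes first; close to the display deadline the display request wins, which is the kind of behavior that keeps video from flickering without penalizing the CPU the rest of the time.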
Pangrle: A lot of the focus has been CPU-based. What we’ve been seeing for a number of years is that while the processor has been scaling and getting faster, the memory hasn’t scaled with that. We went to L1 cache, L2, and now you’re seeing limitations even with three levels of caching. All of this is about trying to hide latency between the memory and the compute speed. There are a number of tradeoffs that have to be taken into account from an energy perspective. There is hope that if you get spatial locality you improve that, but if you don’t end up using it you pay high overhead.
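The effect of stacking cache levels can be estimated with the standard average-memory-access-time recurrence, in which each level contributes its hit time plus its miss rate times the cost of going one level further out. The sketch below applies that recurrence; the hit times, miss rates, and memory latency are made-up illustrative values, not measurements from any of the panelists.

```python
def average_access_time(levels, memory_latency):
    """Average memory access time in cycles.

    levels: list of (hit_time, local_miss_rate) tuples, ordered from the
            cache closest to the core (L1) outward.
    memory_latency: cycles for an access that misses every cache level.
    """
    amat = memory_latency
    # Work from the outermost cache back toward the core.
    for hit_time, miss_rate in reversed(levels):
        amat = hit_time + miss_rate * amat
    return amat


if __name__ == "__main__":
    MEMORY = 200  # illustrative DRAM latency in cycles (assumption)
    l1, l2, l3 = (4, 0.10), (12, 0.40), (40, 0.50)  # illustrative values
    print("no cache:  ", average_access_time([], MEMORY))
    print("L1 only:   ", average_access_time([l1], MEMORY))
    print("L1+L2:     ", average_access_time([l1, l2], MEMORY))
    print("L1+L2+L3:  ", average_access_time([l1, l2, l3], MEMORY))
```

With these numbers each added level helps less than the one before it, and the model says nothing about the energy each level costs, which is the tradeoff raised above.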
Rowen: The talk about memory is representative of one of these latencies, but there are others. There are latencies that are closer to the computation. If you’re thinking about a complex computation, a 64-bit multiply may take several cycles of latency and you need to worry about that in the design of your processor. But there also are external latencies, like latency to go to flash memory or disk and latencies on the network that can contribute in a big way. The solutions all draw on similar underlying principles. When you think about latency, you have to do A and then B and then C and D. That chain is made up of incidental and fundamental dependencies. The fundamental dependencies are that you had to do A before B and then you had to do C. In cases of fundamental dependencies, latencies become harder to address. Many techniques to address latencies involve incidental dependencies. You have to do A and B, but it doesn’t matter if they’re done in sequence or in parallel, so a lot of the techniques to limit latency are really about exploiting parallelism by finding the incidental dependencies. Multithreading is a great example. Out-of-order execution is the same thing. You try to find operations that don’t impact each other and you throw hardware at it to do that. The interesting thing is that as you make a system more parallel, it consumes more system resources and more energy per step. Your energy per unit of work may or may not go up. One of the tradeoffs for latency is how much you’re willing to degrade the energy efficiency of your system to address latency. In ultra-low-power systems, you don’t worry about latency and you do get the best power. Out-of-order processing sometimes consumes orders of magnitude more power than latency-tolerant, sequential execution. It’s not latency in a vacuum. It’s latency vs. something else.
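The distinction between fundamental and incidental dependencies can be made concrete with a small task graph. Run strictly in order, the latency is the sum of all task times; with enough parallel hardware, only the fundamental dependencies matter and the latency is the longest dependent chain. The sketch below computes both; the task names, durations, and dependency edges are invented for illustration.

```python
from functools import lru_cache


def sequential_latency(durations):
    """Latency when every task runs one after another."""
    return sum(durations.values())


def critical_path_latency(durations, depends_on):
    """Latency with unlimited parallelism, limited only by true dependencies.

    durations:  {task: time}
    depends_on: {task: [tasks that must finish first]}; assumed acyclic.
    """
    @lru_cache(maxsize=None)
    def finish(task):
        # A task starts when its slowest prerequisite has finished.
        start = max((finish(d) for d in depends_on.get(task, ())), default=0)
        return start + durations[task]

    return max(finish(t) for t in durations)


if __name__ == "__main__":
    durations = {"A": 3, "B": 2, "C": 4, "D": 1}
    # Fundamental: B needs A, and D needs both B and C.
    # C running after B in program order is only incidental.
    depends_on = {"B": ["A"], "D": ["B", "C"]}
    print("sequential:   ", sequential_latency(durations))
    print("critical path:", critical_path_latency(durations, depends_on))
```

The gap between the two numbers is what multithreading and out-of-order execution try to recover. The sketch does not model the extra energy that the parallel hardware consumes.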
Caples: From a software perspective we try to take advantage of what’s available in the hardware. We look at it from the standpoint of what the operating system is putting into this in order to take advantage of the hardware. It all boils down to ISR (interrupt service routine) and context-switch-related latencies. Whether you’re designing with a unified-interrupt architecture or a segmented-interrupt architecture, there are things you can do to minimize the overhead and the impact of the operating system on these devices. When we talk about context-switch latency, that’s one of the longest operations an operating system can do. Writing code that reduces the number of context switches is more efficient. There are different use cases where maybe a unified-interrupt architecture would be preferable to a segmented-interrupt architecture, but the key from our perspective is low-overhead processing of interrupts.

LPHP: How much of an issue is software for latency? And can it be fixed or hidden?
Caples: With real-time operating systems the latency can be as little as 50 to 500 CPU cycles. That’s usually predicated on where the ISR occurs because you do get quite a bit of jitter with that. Depending on what you’re designing for, you can process interrupts very quickly. So if you have an ISR latency in the microsecond range (it can be an average of 2 or 2.5 microseconds), that means you can handle up to 450,000 interrupts per second. And if you have that many interrupts, something is wrong with your design. But those are the kinds of efficiencies you can achieve. What you can do with unified interrupt architectures is put that ISR aside, but that creates an inherent latency. So do you block an ISR and create a longer latency, or do you allow ISRs, process them, and put them aside until you can complete the task? Those are two tradeoffs from an RTOS perspective.
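The arithmetic behind those figures is easy to check. If the processor did nothing but service interrupts back to back, the upper bound on the interrupt rate would be the reciprocal of the per-interrupt latency. The sketch below works that out; the 200 MHz clock used to convert cycles to microseconds is an assumption for illustration and was not stated in the discussion.

```python
def max_interrupts_per_second(isr_latency_us):
    """Upper bound on interrupt rate if each one costs isr_latency_us."""
    if isr_latency_us <= 0:
        raise ValueError("latency must be positive")
    return 1_000_000 / isr_latency_us


def cycles_to_microseconds(cycles, clock_mhz):
    """Convert a cycle count to microseconds at the given clock rate."""
    return cycles / clock_mhz


if __name__ == "__main__":
    for latency in (2.0, 2.5):
        print(latency, "us ->", max_interrupts_per_second(latency), "per second")
    # Assumed 200 MHz clock: 500 cycles corresponds to 2.5 microseconds.
    print("500 cycles at 200 MHz:", cycles_to_microseconds(500, 200), "us")
```

A latency of 2 to 2.5 microseconds bounds the rate at between 400,000 and 500,000 interrupts per second, which brackets the figure quoted above. Any time spent in application code lowers the achievable rate further.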
Wingard: How much benefit do you get by ganging them? If you have 3D, the last thing you want to be doing is context switching just to take the next interrupt.
Caples: There are techniques for that. You may run into complications where there could be global variables that are being shared. But there also are tradeoffs.
Hudepohl: There are other techniques such as tail-chaining, where you can go from one interrupt to the next without bouncing back to the main process.
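The saving from tail-chaining can be estimated with a simple cost model: without it, every interrupt pays to save and restore the interrupted context; with it, a run of back-to-back interrupts pays that cost once, plus a smaller hand-off cost between handlers. The sketch below is such a model. The function name and all cycle counts are illustrative assumptions, not figures for any particular processor.

```python
def interrupt_burst_cycles(handler_cycles, save, restore, chain_cost, tail_chain):
    """Total cycles to service a burst of already-pending interrupts.

    handler_cycles: list with the run time of each handler, in cycles.
    save, restore:  cost of stacking and unstacking the interrupted context.
    chain_cost:     cost of moving directly from one handler to the next.
    tail_chain:     True to skip the intermediate restore/save pairs.
    """
    if not handler_cycles:
        return 0
    count = len(handler_cycles)
    if tail_chain:
        # One save on entry, one restore at the end, cheap hops in between.
        overhead = save + restore + (count - 1) * chain_cost
    else:
        # Every interrupt returns to the main process before the next one.
        overhead = count * (save + restore)
    return overhead + sum(handler_cycles)


if __name__ == "__main__":
    burst = [50, 30, 20]  # illustrative handler run times
    print("without:", interrupt_burst_cycles(burst, 12, 12, 6, tail_chain=False))
    print("with:   ", interrupt_burst_cycles(burst, 12, 12, 6, tail_chain=True))
```

The handlers themselves take the same time either way; only the bookkeeping between them shrinks, so the benefit grows with the number of interrupts that arrive back to back.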
Rowen: With the hardware guys, they say latency and think memory. The software guys think interrupt latency. This is just the tip of the iceberg for the Rorschach test of what we mean by latency.
Hudepohl: A guy that devises a multiply-divide unit has a different perspective on an iterative vs. a pipelined divider.
Rowen: And a network guy has a totally different view of latency. It’s one of the fundamental concepts of system design, no matter what kind of system it is. We have a hardware-software divide right now, but as people build these systems they would love to migrate as much functionality into software as possible. It’s easier to adapt.

LPHP: It’s also easier to fix, right?
Rowen: Yes, and some would say it’s more likely to need fixing. But it reflects the fact that what can be upgraded in minutes is different from a product that can only be fixed by fabricating a new part. The result of this net shift of functionality into software is that you buy into a net overhead of various kinds. To make software comprehensible you have to structure it and layer it and build libraries. Even the whole notion of an interrupt is, to some extent, a software structure to bring some sanity to it. To some extent, the issues of latency as you go further up in the abstraction are a consequence and unintended byproduct of the fact that we’re trying to manage complexity at the limit of human understanding. Worst-case latency is a scenario in which you make only some very basic guarantees about what will happen, and you have all kinds of unintended consequences. The methodology for understanding where the latency is likely to be is critical. There is some tooling there, and there is likely to be a lot more tooling to enable people to figure out how to bound this problem.
Wingard: We’re very interested in the level above the RTOS. We hear a lot about Android, which is already running in a virtual machine above a variant of Linux. That’s really divorced from the hardware, which makes it hard to think about managing latency. They’re intentionally trying to hide the hardware from you, so all the tricks we’re trying to build into our hardware to allow you to exploit parallelism and hide latency are intentionally being shielded. It’s tough to optimize. One of the things we see in the battle between the Android and iOS ecosystems is that Apple has control of the whole chain. They have an opportunity to put the hardware in only if they have the software to take advantage of it, and vice versa. They can do a more optimal job.
