Experts At The Table: Latency

Second of three parts: Hardware vs. software; energy efficiency issues for memory; the impact of use models; worst-case scenarios and what can go wrong; SMP approaches and issues.


By Ed Sperling
Low-Power/High-Performance engineering sat down to discuss latency with Chris Rowen, CTO at Tensilica; Andrew Caples, senior product manager for Nucleus in Mentor Graphics’ Embedded Software Division; Drew Wingard, CTO at Sonics; Larry Hudepohl, vice president of hardware engineering at MIPS; and Barry Pangrle, senior power methodology engineer at Nvidia. What follows are excerpts of that discussion.

LPHP: The rule of thumb is that things run faster in hardware than software. As more processes begin running in software, does that impact the latency of the overall system?
Rowen: Yes, if you define a problem that is simple enough that you would contemplate putting it in hardware and also simple enough that you can analyze the latency characteristics. It’s not that software is necessarily worse than hardware, but we choose to put into software the problems we don’t really understand. We’re kicking the can down the road a bit.

LPHP: In an ideal world, you’re probably doing both through co-design, right?
Rowen: Yes, but the key is that we’re making very distinct choices about what goes into hardware versus software. If you did a true apples-to-apples comparison and implemented the same problems in both hardware and software, it would take the hardware people years to solve them.
Pangrle: There were research efforts back in the 1980s where people tried to run languages natively.
Hudepohl: And most modern SoCs have a mix of computational resources. They have a general-purpose processor for certain kinds of problems, hardware and video codecs for other kinds of problems, and GPUs for still other problems. GPUs are a mix of hardware and firmware and other things. So even on a single SoC there are different choices.
Caples: Even with complicated hardware, there’s still complicated software required to glue all this stuff together.
Pangrle: You need to carve it up into layers and put some kind of hierarchy on it. The faster path around that is to know the specific thing you’re trying to do on hardware. That may allow you to design the hardware differently. But to trade off with something that’s more general-purpose requires a hierarchy.

LPHP: Is it better to have multiple memories or fewer?
Wingard: We’re pretty committed to having cache-based systems. It’s a sharing problem. As we add more processors with different instruction set architectures, we need to do careful analysis of how they interact with each other. The traditional approach is that they only interact through external memory. That memory is the bottleneck in many designs today, so people are looking at adding some kind of on-chip memory resource, or using Wide I/O or some other type of direct-attached memory, as a way of dealing with communications memory. There are a lot of things that can be done there, but right now the tools aren’t in place to do a good analysis. That’s partly because we have hardware people designing part of the system for which the software hasn’t been defined yet. We put some of this functionality into software because we’re trying to add flexibility to deal with characteristics of the end application we don’t know about yet, or standards we may want to be able to adapt to in the field. What seems like the right amount of memory may not be enough because someone decides it has to handle a 4K-resolution display. On top of that, do you think of extra memory as a software-managed buffer, a hardware-managed cache, or some hybrid of both? There are some interesting possibilities there, but putting the burden on the software guys to tag that, as opposed to the hardware guys probing the cache for data that’s not there, is an interesting tradeoff. So then you go back to the energy issue and ask, ‘How efficient are these hardware-managed L3 caches going to be?’

LPHP: Is it use-dependent?
Wingard: Yes. With a smartphone, it depends on what you’re doing at any particular minute. If you’re doing a two-way call it’s very different than reading your e-mail.
Rowen: We’ve put a lot of effort into this subject with respect to high-throughput wireless baseband subsystems. One of the most interesting contrasts is between running LTE Advanced 4G on the base station versus the handset. It’s a very similar protocol doing the same computation, but with rather different software assumptions. In the base station, you’ve got lots of users active simultaneously and the number of scenarios you have to cover is much, much larger. Design teams gravitate toward cache-based systems because caches are typically specified out of ignorance. If you don’t know, you specify a coherent cache-based subsystem because it always does something that has a chance of being correct. By contrast, with the handset you know there’s just one user, and the obsession over power is much higher. System engineers do orders of magnitude more analysis of the data traffic flow. No one dreams of using a cache. Using processors is already controversial enough for some tasks. And the flow of data is extremely well managed. The protocols that are emerging as most popular are write-only. You never read anyone else’s data; you only read your own. And because it’s easier to hide the latency of a write than a read, your latency problem goes away. You can map out, for every data structure and every variable, who the best owner is at any point in time in the L1 RAM, not cache. But it’s a lot more work at the engineering stage.
Wingard: From a computer architecture perspective, it looks much more like a message-path system. You’re writing a message into someone else’s memory, and then you’re telling them it’s there so that by the time they go look for it, it’s low-latency.
Hudepohl: Caches are great for the average case, tied into the RTOS. The problem is that sometimes you hit and it may be 1 or 2 or 10 cycles, and sometimes you miss and it may be 100 or 200 or 300 cycles. If you care about predicting the exact real-time response, you can’t tolerate that much variation in certain applications. Then you have to go to other techniques, like a scratchpad RAM or a message-path construct, to get more predictability. In an SMP-style operating system, where processes don’t know which core they’re executing on, the scheduling interrupt comes in, they get swapped out, and they run on another processor. Does the hardware manage the movement of that cache-based data or does the software do it? The software is perfectly capable of doing it, but there’s a ton of overhead to invalidate the caches in the old processor and move all the data over to the new one. It’s usually hard to get everything right when you have all these slices of memory, so most programmers prefer to have hardware manage the coherence. They don’t even want to think about it. In very specialized cases where you can think about that stuff up front it’s great, but it’s hard in the general system sense.
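Hudepohl’s hit/miss variance is easy to quantify: a high hit rate gives a comfortable average even though the worst case stays pinned at the full miss penalty. The cycle counts below are illustrative, not measurements from any specific core.

```c
/* Expected latency of a cached access: you pay the hit cost hit_rate of
   the time, and the full miss penalty otherwise. The average can look
   fine while the worst case (a miss) remains orders of magnitude larger. */
static double avg_latency(double hit_rate, double hit_cycles, double miss_cycles) {
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles;
}
/* e.g. a 95% hit rate with 2-cycle hits and 200-cycle misses averages
   11.9 cycles -- but a hard real-time budget must still assume 200. */
```

That gap between 11.9 and 200 is exactly why a scratchpad RAM, which always costs the same few cycles, can be preferable when the deadline, not the throughput, is what matters.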
Caples: With SMP, RTOSes are trying to solve some of those problems. You’ve got bound computation domains where you can assign specific tasks to cores so you don’t have core fluctuation within an SMP environment. You also have soft affinity, so when a task is ready on a particular core and it gets interrupted, it will be rescheduled on that same core if it’s available.
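The bound-domain idea Caples mentions can be illustrated with Linux pthread affinity — a stand-in sketch only, since Nucleus has its own APIs for this; the helper name here is invented.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to a single core, emulating an RTOS "bound
   computation domain": the task can no longer migrate, so its cached
   working set stays put. Linux/glibc stand-in, not the Nucleus API. */
static int bind_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    /* Returns 0 on success, an errno-style code on failure. */
    return pthread_setaffinity_np(pthread_self(), sizeof set, &set);
}
```

Hard binding like this avoids the invalidate-and-migrate overhead Hudepohl describes, at the cost of the load-balancing flexibility that soft affinity preserves.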
Pangrle: Average latency is one thing, but there may be cases where that latency gets very bad. There may have to be some upper limit on what you will tolerate, and that plays back into your overall latency and how you approach it from a system perspective.
Caples: Average latency is almost meaningless for us. Really, you’re looking at worst-case latency.
Wingard: There’s another big difference here when the people who build the hardware and the software are co-located or working together. If you build a processor in a baseband application, you may be thinking of that as an integrated design challenge because you’re doing both. To do that level of analysis on top of someone else’s hardware, where someone is selling you a chip and you’re supposed to figure out how it works, is tougher. There are examples of chip companies that were started to do some very clever things, but they never got off the ground because no one could figure out how to develop the software. It’s easy to build embarrassingly parallel hardware that is programmable, but building a software environment that people are willing to use is a different matter. What’s more interesting is which general things people are willing to put in place that others will take advantage of.
