Experts At The Table: Latency

Last of three parts: Cloud vs. local partitioning; pre-computing; quicker maps; low-power mode and other tradeoffs; the impact of smaller wires; architectural issues; Wide I/O benefits; virtualization.


By Ed Sperling
Low-Power/High-Performance engineering sat down to discuss latency with Chris Rowen, CTO at Tensilica; Andrew Caples, senior product manager for Nucleus in Mentor Graphics’ Embedded Software Division; Drew Wingard, CTO at Sonics; Larry Hudepohl, vice president of hardware engineering at MIPS; and Barry Pangrle, senior power methodology engineer at Nvidia. What follows are excerpts of that discussion.

LPHP: How do we partition devices to take advantage of local processing and the cloud, when necessary, so we minimize latency?
Wingard: That’s more about energy than latency. What’s the cost of computing locally and what’s the battery charge? But there’s this whole other side to it, which is what is it going to cost to communicate it? If you outsource it to the cloud, there’s a certain amount of communication, and that may determine the equation. There’s a fair amount of local processing that needs to be done to go out to the servers to do the giant search.
Rowen: It’s one of the problems we have to think very carefully about because there are hardware-software tradeoffs. More generally, when you have a computation you’re trying to do, you have to think in terms of infinite computing resources that are many milliseconds away, and local computing resources that are finite but very responsive. People try to find ways to use the cloud to pre-compute a lot of likely things, and then select among pre-computed things locally. You’re always trying to maintain the illusion of instant response, but for a set of resources that you’d be willing to use locally. If you think about caching in your browser, that’s a case where it’s trying to anticipate what you’re likely to do and has a local list of previously searched topics. Only if you don’t hit on one of those topics do you hit on a larger list from the cloud. We’re all just at the beginning of trying to figure out all the techniques that are likely to be used as the availability of compute in the cloud grows. But the latencies are long and they’re going to stay long compared to what you can do locally. One of the great areas of system design is pre-compute and cache, or tracking patterns and pre-computing along the likely choices that someone makes locally. If you’re playing a game, you need lots of computing to build the rest of the space. You don’t want to pre-compute the entire universe, but you may want to pre-compute the path that a player is likely to take. We saw the same thing with the evolution of Google Maps. You get a little bit of information quickly, and then depending on where you’re looking, you bring in more detail.
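Editor's sketch: the pre-compute-and-cache pattern Rowen describes can be illustrated with a small local cache that sits in front of a slow remote source and prefetches the items a user is likely to ask for next. Everything named here (`TileCache`, `fake_cloud`, `neighbors`, the `(x, y, zoom)` tile key) is a hypothetical stand-in chosen for illustration, not how any particular browser or mapping service works.

```python
class TileCache:
    """Small local cache in front of a slow remote fetch (hypothetical)."""

    def __init__(self, fetch_remote, capacity=64):
        self.fetch_remote = fetch_remote  # stands in for the round trip to the cloud
        self.capacity = capacity
        self.store = {}                   # insertion-ordered dict used as a simple LRU
        self.hits = 0
        self.misses = 0

    def _put(self, key, value):
        self.store.pop(key, None)
        self.store[key] = value
        while len(self.store) > self.capacity:
            del self.store[next(iter(self.store))]  # evict least recently used

    def get(self, key):
        if key in self.store:
            self.hits += 1
            value = self.store.pop(key)
            self.store[key] = value       # mark as most recently used
            return value
        self.misses += 1                  # only a miss pays the long latency
        value = self.fetch_remote(key)
        self._put(key, value)
        return value

    def prefetch(self, keys):
        """Fetch likely-next items ahead of time, off the user's critical path."""
        for key in keys:
            if key not in self.store:
                self._put(key, self.fetch_remote(key))


def fake_cloud(tile):
    # Placeholder for a remote call that would take many milliseconds.
    return f"tile{tile}"


def neighbors(tile):
    """The eight tiles around (x, y) at the same zoom: a guess at the next pan."""
    x, y, zoom = tile
    return [(x + dx, y + dy, zoom)
            for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0)]


cache = TileCache(fake_cloud)
cache.get((10, 10, 5))                  # first view: a miss, goes to the cloud
cache.prefetch(neighbors((10, 10, 5)))  # anticipate where the user pans next
cache.get((11, 10, 5))                  # pan right: served locally
print(cache.hits, cache.misses)         # prints: 1 1
```

The design choice is the one discussed above: spend some bandwidth and local storage on guesses so that the common case never waits on the remote round trip.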
Hudepohl: Yes, like the zoom level. The maps are pretty good now, so they can locally display different levels of zoom without having to go back further. The adaptive prediction comes into play there.
Pangrle: With a portable platform what you’re looking at is energy or heat dissipation on a small platform. But if you do a computation remotely, you can bring in results locally that you couldn’t otherwise get. There may be some energy tradeoff for that, but there’s also a tradeoff for the customer of what they’re willing to accept to get that experience.
Caples: I see similarities with an RTOS when we try to incorporate power management. Part of that is putting devices or systems in low-power mode. You want to go through this calculation. Is the amount of energy it takes to bring it out of low-power mode greater than the amount of energy you save while you’re in low-power mode?
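Editor's sketch: the calculation Caples describes is a break-even test. Under the simplifying assumption that entering and leaving the low-power state costs a fixed amount of energy, sleeping pays off only if the idle period is longer than that transition energy divided by the power saved while asleep. The function names and the numbers below are illustrative assumptions, not figures from any RTOS or device.

```python
def break_even_seconds(idle_power_w, sleep_power_w, transition_energy_j):
    """Idle time beyond which the low-power state saves net energy.

    Assumes a fixed energy cost to enter and exit the state and ignores
    the time the transition itself takes.
    """
    saved_per_second = idle_power_w - sleep_power_w
    if saved_per_second <= 0:
        raise ValueError("sleep state must draw less power than idle")
    return transition_energy_j / saved_per_second


def should_sleep(expected_idle_s, idle_power_w, sleep_power_w, transition_energy_j):
    """True if the expected idle period is long enough to repay the transition."""
    threshold = break_even_seconds(idle_power_w, sleep_power_w, transition_energy_j)
    return expected_idle_s > threshold


# Made-up example: 50 mW idle, 2 mW asleep, 2.4 mJ to go down and come back.
threshold = break_even_seconds(0.050, 0.002, 0.0024)
print(f"break-even is about {threshold * 1000:.0f} ms")
print(should_sleep(0.200, 0.050, 0.002, 0.0024))  # long idle period
print(should_sleep(0.010, 0.050, 0.002, 0.0024))  # short idle period
```

With these numbers the break-even point is roughly 50 ms, so a 200 ms idle period justifies sleeping and a 10 ms one does not. A real power manager would also have to account for wake-up latency, which is the tradeoff the panel is discussing.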
Wingard: There is a lot of work going on right now to try to reduce the drain. We know we’re heading in the direction of many, many more domains to manage. We can see designs with hundreds of on/off switches on them coming very soon.

LPHP: Isn’t that what’s behind the push into near-threshold computing? It doesn’t get full on or off.
Wingard: When certain chips display Web pages the processor gets throttled up to maximum voltage, and before the Web page is displayed they’re already over temperature and they need to throttle back. They’re basically kick-starting the processing, they hit the limit, and then they back off before the Web page is displayed. That’s how quickly these systems are reacting.

LPHP: How do use models affect latency?
Caples: From my perspective that’s whether you can put a system into a low-power state. That’s still a viable state where the user is performing some meaningful task. Part of that can be an operating point transition where you’re operating at a lower frequency. There’s going to be a correlation between latency and the frequency at which you’re operating.
Wingard: A guy at one of the big wireless companies was talking about something I didn’t consider to be a problem, which is the number of background applications on these phones. How do you get an alert that something is going on? That’s an incredibly inefficient operation. There’s no central funnel. Each one of those applications is pinging the network at their own pace and their own time. If you’re not in the middle of a phone call, you have to power up the radio, sync to the channel, and do all of this setup work so you can send one packet out and get a one-packet response that nothing is happening, and then you shut the whole thing down. You might have dozens of applications doing that. From the network provider perspective, they’re losing bandwidth because their base stations are setting up and tearing down all these dumb calls for useless things. You probably get a lot worse battery life because of that.
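Editor's sketch: the missing "central funnel" Wingard mentions can be illustrated by counting radio wakeups two ways. In the first, every background application powers up the radio on its own schedule. In the second, a hypothetical shared scheduler holds each check until the next common tick and services all pending checks in one wakeup. The intervals and tick length are invented for illustration, and the model ignores accidental overlaps between independent applications.

```python
import math


def independent_wakeups(intervals_s, horizon_s):
    """Each app wakes the radio itself every `interval` seconds."""
    return sum(horizon_s // interval for interval in intervals_s)


def coalesced_wakeups(intervals_s, horizon_s, tick_s):
    """All checks are deferred to the next shared tick and sent together."""
    slots = set()
    for interval in intervals_s:
        for k in range(1, horizon_s // interval + 1):
            due = k * interval
            slots.add(math.ceil(due / tick_s))  # tick that carries this check
    return len(slots)


# Four hypothetical apps checking every 1, 1.5, 2 and 5 minutes, over one hour.
intervals = [60, 90, 120, 300]
print(independent_wakeups(intervals, 3600))     # prints: 142
print(coalesced_wakeups(intervals, 3600, 300))  # prints: 12
```

The cost of funnelling is latency: in this model an alert can be delayed by up to one tick (five minutes here) in exchange for far fewer radio setups and teardowns.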
Hudepohl: This is where the software use case run on the same hardware can affect that. If it’s polling every few microseconds for some event or there’s a message pass where it says, ‘Don’t even bother checking because it will tell you when it has the data you’re looking for,’ in a lot of systems that can be much more efficient.
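Editor's sketch: the contrast Hudepohl draws between polling and being told when data is ready can be shown with Python's standard `threading.Event`. The polling version repeatedly checks a flag and sleeps between checks; the notification version blocks once and is woken by the producer. The helper names, the 1 ms polling interval and the 50 ms delay are assumptions made for the demonstration.

```python
import threading
import time


def wait_by_polling(is_ready, interval_s=0.001):
    """Check a condition over and over; return how many checks were needed."""
    checks = 0
    while True:
        checks += 1
        if is_ready():
            return checks
        time.sleep(interval_s)  # each loop is a wakeup that usually finds nothing


def wait_by_notification(event):
    """Block until the producer signals; the waiter wakes exactly once."""
    event.wait()
    return 1


def run_demo(delay_s=0.05):
    # Polling: a timer sets the flag after delay_s while we keep checking.
    ready = threading.Event()
    timer = threading.Timer(delay_s, ready.set)
    timer.start()
    checks = wait_by_polling(ready.is_set)
    timer.join()

    # Notification: same delay, but the waiter sleeps until it is told.
    ready = threading.Event()
    timer = threading.Timer(delay_s, ready.set)
    timer.start()
    wakeups = wait_by_notification(ready)
    timer.join()
    return checks, wakeups


polling_checks, notify_wakeups = run_demo()
print(f"polling needed {polling_checks} checks; notification needed {notify_wakeups} wakeup")
```

The exact number of polling checks depends on the scheduler and timer resolution, so it varies from run to run, but it is always more than the single wakeup of the notification path.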

LPHP: As we move into advanced process nodes we’re dealing with RC delay and issues of scaling wires. Does that affect latency?
Rowen: Yes. One of the most interesting ratios is the one between the speed of computing and communication. If we continue to see wires getting relatively slower than gates, you could be doing a lot of computing in a relatively small amount of space as long as you didn’t have to communicate. That changes the tradeoffs you have to make as you go searching for parallelism. If you can re-compute something rather than fetch it again from memory, you may be motivated to compute something twice rather than compute it once, store it and then read it. Architectures go through an evolution for that. It also lends itself to a more aggressive pipelining of design, with more stages. Some of that may have incidental dependencies, and some may have fundamental dependencies. But you will tend to see that the tradeoffs get tougher as the wires get slower. In theory you could be running a core at many gigahertz, but it is running in isolation. The communication issues get more difficult. How do you cache your local memory, how do you distribute memory, do you want 10 big cores running fast or 100 small cores running slowly or more locally—those are interesting tradeoffs.
Wingard: If you look at the internal array speed in DRAM, when we introduced SDRAM around 1995 the internal array operated at 100MHz. If you look at today’s stuff, the parts start off with an internal array speed of 100MHz. As they go through their life they go up to 200MHz, then we get the next version of DRAM that rolls back to 100 MHz. They just keep doubling the interface frequencies. That number works because of the economics of the DRAM business and the RC delay of the wires that form the word lines and the bit lines. They’ve got an economic model around the cheapest cost per bit at a performance level that has driven a constant latency internal to the memory arrays for 17 years. If you look at processor frequencies, they’re probably 10 times faster than they were 17 years ago.

LPHP: Some of this happens internally on a chip. How do you deal with that?
Hudepohl: Part of this involves the ultimate frequency a device is capable of reaching. Whether it’s full-custom design or synthesized from IP providers, some of these questions about wires vs. transistor speed are implicitly included in the final frequency our device reaches. But even within that, there are lots of constraints and targets. We’re providing synthesizable IP, which you can take and target at the maximum frequency with one set of tradeoffs. You also can take that same RTL and try to optimize the area and power a little better, and use that headroom to address some of these issues. And at the device-physics level, how fast does the transistor switch and what happens to the wires, and how do the wires scale through 10 or 15 or 30 process generations. Because the scaling is not really 3D anymore, the difference between these relative speeds is changing. The Intel processor speeds, for example, have not continued to scale. Part of that is device physics effects, but another part of it is because of power and energy reasons. That’s created a balance about how to manage the relative speed of wires and the transistors.

LPHP: This sounds like a big architectural challenge.
Pangrle: Yes. A lot of research in computer architecture involved multiple cores and clock speeds doubling every year or so. But by the time you could get that design implemented, the guys doing single-threaded designs had gone through a few iterations and they were eight times faster than when you started your design and thought you were going to blow them out of the water. Doing redundant computation is an interesting one. Along with the lines being slower, there are parasitics attached to it. It’s not only slower, but you’re also expending more energy to push information across it. Chris’ point is interesting. I had done some research a while back where we found we could use less interconnect and less area by incorporating redundant calculations. We could put results where we needed them with less wiring.

LPHP: Does Wide I/O help?
Wingard: In a lot of systems, Wide I/O is being considered a software-managed cache. But it doesn’t have enough capacity to replace main memory for a lot of applications. The place where we’re seeing it used is as a buffer. It’s a great low-energy way of storing a frame buffer and reducing the amount of energy to go externally. It’s much, much lower power and it has nice bandwidth characteristics. It looks like a good alternative to putting a bunch of SRAM on die for L3 cache.

LPHP: Does it decrease latency?
Wingard: For external DRAM, yes. There’s less contention. If you look at the contention part of latency, you have to wait for someone else before you can do it. Having a lot of extra bandwidth helps. Modern DRAM PHYs are real PHYs. They’re almost as complex as a SerDes. Wide I/O strips that off, both on the sender and the receiver side.

LPHP: Contention has been one of the reasons we haven’t seen a lot of virtualization in these devices, right?
Caples: We don’t do a lot with virtualization, but we do understand the problems. There is latency involved.
Rowen: Virtualization is one of those techniques, like cache, where you do it to provide a greater level of service to a community of application developers. You don’t do it as a means to optimize hardware efficiency. It is, to some degree, the enemy of hardware efficiency because it is one additional level of mystery about what is really going on. People who are really focusing in on power or latency are typically going to run from virtualization—or at least hold it off as long as possible—from entering into a domain where they have to figure it out. Now, instead of telling you where something is happening, you’ve got a pointer to a data structure that contains a map that changes continuously. Virtualization is a great thing for software development and system functionality as a whole, because it can do some things in terms of global balancing that are useful. But in the traditional model that says, ‘I need to know what I am doing to optimize it,’ it raises huge issues.
Hudepohl: It’s another layer that provides more generalization.
Wingard: There is virtual memory support, which lots of processors rely upon. And then there is virtualization, where you’re trying to run multiple operating systems or multiple operating environments. Today there isn’t even much virtual memory support outside of the processor complex, and that has been known to drive software guys crazy. There are interesting cases in cell phones where you can’t find contiguous DRAM to describe the frame buffer because the thing has been running so long without driving that TV display that you’re supposed to be able to plug into. When you do plug it in, you have to come up with the contiguous buffer. So you end up with much more complex hardware. Being able to use virtual memory is an attractive way around that. The support for full virtualization has a lot more to do with running multiple environments. It can be much more secure if you keep things on a separate platform. Or you can mix an RTOS with an OS, and if you’re careful about how you layer things you can still get good deterministic behaviors.
Caples: Sandboxing an operating system also helps with safety/security issues. There’s definitely a place for it.
Hudepohl: In the grand scheme of things, virtualization is an interesting technology. It’s already being used in a lot of places. In terms of latency, virtualization doesn’t really change things. The underlying issues with interrupt response time and physical latencies don’t fundamentally change. You can time-division multiplex some of those things, so for really long latencies it may give you another tool on the software side to use the underlying hardware that’s available.