Experts At The Table: Multi-Core And Many-Core

Second of three parts: Partitioning processors; more intelligent memory and memory controllers; changing the balance between processors and memory.


By Ed Sperling
Low-Power Engineering sat down with Naveed Sherwani, CEO of Open-Silicon; Amit Rohatgi, principal mobile architect at MIPS; Grant Martin, chief scientist at Tensilica; Bill Neifert, CTO at Carbon Design Systems; and Kevin McDermott, director of market development for ARM’s System Design Division. What follows are excerpts of that conversation.

LPE: Is software taking advantage of the hardware in a power-efficient way?
Rohatgi: Yes, and the ultimate example of that is the Android operating system. Even though it relies on Linux, there are on-demand and five levels built into Linux that control, at the software level, the CPU registers or SoC registers to shut down power. You’re already seeing that at the operating-system level.
Martin: It depends upon which software you’re talking about. At the OS level, where lots of apps are running, there may be commoditization happening. Down at the dataplane, where people use application-specific processors, you can argue that’s the infrastructure. People want extreme power efficiency and reliable continuously executing functionality. That’s the place where heterogeneous multiple processors really shine. It’s almost an infrastructure layer in a mobile device. So you see different solutions depending on what level of the device you’re talking about. We see a drive to more heterogeneity, too. Baseband wireless infrastructure works better with heterogeneous processors than trying to shove that onto a multicore device.
Neifert: That’s certainly what we’re seeing in our customer base. They want one processor to run the modem subsystem or the WiFi and partition that off. The last thing you want to do is wake the application processor all the time. The application processors are getting more complex so you can talk and play games at the same time and surf the Web. The application processor has to handle all of that. The application processor may be power efficient, but not as power efficient as one that just runs the radio or data transfer.
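Editor's sketch: Rohatgi's "on-demand" most likely refers to the Linux cpufreq governor named ondemand, although that reading is an assumption and the "five levels" he mentions are not identified here. Under that assumption, the short Python example below shows how software can check which frequency-scaling governor a core is using through the standard sysfs file. The function name read_governor and its parameters are illustrative only.

```python
from pathlib import Path
from typing import Optional

# Standard sysfs location for per-CPU information on Linux.
CPUFREQ_ROOT = Path("/sys/devices/system/cpu")


def read_governor(cpu: int = 0, root: Path = CPUFREQ_ROOT) -> Optional[str]:
    """Return the active cpufreq governor for one core, or None if unavailable.

    `root` is a parameter only so the function can be pointed at a test
    directory; on a real system the default is the path to use.
    """
    path = root / f"cpu{cpu}" / "cpufreq" / "scaling_governor"
    try:
        return path.read_text().strip()
    except OSError:
        # No cpufreq support, no such core, or not running on Linux.
        return None


if __name__ == "__main__":
    print(read_governor(0))
```

On a system without cpufreq support the function returns None rather than raising, which keeps the sketch usable on development machines.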

LPE: Is it better to actually design a device with multiple processors or a single multicore processor?
Sherwani: When I was at Intel we believed [x86] was the best processor ever developed. I never thought I would see ARM and x86 processors on the same device. We are not that far away right now, and I’m talking about having them on a single chip. Or it may be a MIPS or Tensilica core. Such processors will exist. We are very efficient these days at using power islands. We can put six or eight processors on a chip and we can put them to sleep when they’re not being used.

LPE: Is it more difficult to verify them?
Sherwani: The verification nightmare is growing exponentially, and it’s not clear to me how we will be doing verification five years from now. At the implementation level, verification is becoming a bigger and bigger piece. But it’s more of an architecture question than whether you’re using multicore or many cores.
Martin: This whole approach tends to lead to a more compositional design style where you’re composing well-understood systems. What you need to do is limit the interactions between them to a relatively high level of abstraction or control. You verify each subsystem significantly, and then you verify without having a great deal of interaction between the subsystems.
Sherwani: It’s amazing that on a big chip people don’t do flop-to-flop timing on a block. This is a situation that would never happen in software between subroutines, but it happens all the time in hardware. In hardware we have not reached a maturity level where I take care of my block and you take care of your block. We have timing paths going to two blocks and you cannot time it unless you do the timing and verification together.
Neifert: I’ve got customers that will spend months validating their processor, fabric, memory and data path, throwing out all the various options on there and running that. That could be a single-core processor reaching out to memory, and they’ll spend a lot of time optimizing that. Now throw in one other master accessing the same memory and everything goes out the window because of all the different permutations when these things talk to each other. It now blows up exponentially. The nice thing about a multicore approach is that you’ve handed off a lot of that task to the processor guys and hope that they’ve done it properly. It may not be the optimal use for your application, but pushing the problem off to an IP provider and a multicore solution is what a lot of our customers are doing.

LPE: What’s the best way to take advantage of cores? Do you do it with Wide I/O or through multicore and a standard bus?
Sherwani: If you look at where Micron is going with this, the whole interface has been changed. The memory becomes a lot more intelligent instead of dumb storage. You will be able to ask memory to do certain tasks. Processor people have tried to make memory as dumb as possible in order to commoditize it. All the value comes from the processor side. But balancing would be better so you can offload things. You can combine flash into the most cost-effective memory. Instead of saying, ‘Give me byte No. 7,’ you can say, ‘I need this piece of information.’ It’s a lot more power-efficient to do it that way.
McDermott: It’s quality of service. You’re not just making a data request. You’re saying, ‘I need high bandwidth or high efficiency or low latency.’ A processor may need only a small amount of data, but it may need it very efficiently and very fast. With video you need high bandwidth that is very predictable. Having graphics integrated is one way to go. Unless you have a view of the fabric, the quality of service and the end power engine it’s going to be very hard to engineer a one-point solution.
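Editor's sketch: the idea of a request that carries a service requirement, not just an address, can be illustrated with a toy arbiter in Python. Everything here (the QoS classes, the MemoryRequest fields, the Arbiter class and its priority order) is a hypothetical model for illustration, not any vendor's interconnect protocol.

```python
import heapq
from dataclasses import dataclass
from enum import IntEnum
from typing import List, Tuple


class QoS(IntEnum):
    """Hypothetical service classes; a lower value is served first."""
    LOW_LATENCY = 0
    HIGH_BANDWIDTH = 1
    BEST_EFFORT = 2


@dataclass(frozen=True)
class MemoryRequest:
    master: str   # who is asking, e.g. "cpu" or "video"
    address: int
    size: int     # bytes requested
    qos: QoS      # what kind of service the master needs


class Arbiter:
    """Serves requests by QoS class, then by arrival order within a class."""

    def __init__(self) -> None:
        self._queue: List[Tuple[int, int, MemoryRequest]] = []
        self._arrivals = 0

    def submit(self, request: MemoryRequest) -> None:
        # The arrival counter breaks ties so equal-priority requests stay FIFO.
        heapq.heappush(self._queue, (int(request.qos), self._arrivals, request))
        self._arrivals += 1

    def next_request(self) -> MemoryRequest:
        return heapq.heappop(self._queue)[2]


if __name__ == "__main__":
    arbiter = Arbiter()
    arbiter.submit(MemoryRequest("video", 0x8000, 4096, QoS.HIGH_BANDWIDTH))
    arbiter.submit(MemoryRequest("cpu", 0x1000, 64, QoS.LOW_LATENCY))
    arbiter.submit(MemoryRequest("dma", 0x4000, 1024, QoS.BEST_EFFORT))
    print([arbiter.next_request().master for _ in range(3)])
```

A strict priority order like this one can starve best-effort traffic; a real fabric would also have to bound latency and guarantee bandwidth, which this sketch does not attempt.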
Martin: With a compositional approach, you may have big memories and then a lot of small distributed memories to keep data close to the area where it is being processed. And maybe you need some intelligent abstractions on things like DMA (direct memory access). That would give programmers more assistance in managing the data flow and data interaction so things will move out of central memory into local memory before they’re needed. That’s a different programming style. We need more flexibility in how hardware and software developers can compose these memory systems together.
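Editor's sketch: the programming style Martin describes, moving data out of central memory into local memory before it is needed, is often implemented as double buffering. The Python below is a sequential simulation of that pattern under simplifying assumptions: dma_copy is a hypothetical stand-in for a hardware DMA transfer, and on real hardware the fetch of the next tile would overlap with computation on the current one.

```python
from typing import Callable, List, Sequence


def dma_copy(central: Sequence[int], start: int, size: int) -> List[int]:
    """Hypothetical stand-in for a DMA transfer from central to local memory."""
    return list(central[start:start + size])


def process_tiles(central: Sequence[int], tile_size: int,
                  compute: Callable[[List[int]], int]) -> List[int]:
    """Process `central` tile by tile, fetching each tile before it is needed."""
    if tile_size <= 0:
        raise ValueError("tile_size must be positive")
    results: List[int] = []
    # Prime the pipeline: bring the first tile into local memory.
    current = dma_copy(central, 0, tile_size)
    offset = tile_size
    while current:
        # Start the transfer of the next tile before computing. In this
        # simulation it completes immediately; real DMA would run in parallel.
        if offset < len(central):
            upcoming = dma_copy(central, offset, tile_size)
        else:
            upcoming = []
        results.append(compute(current))
        # Swap buffers: the prefetched tile becomes the working tile.
        current = upcoming
        offset += tile_size
    return results


if __name__ == "__main__":
    print(process_tiles(list(range(10)), 4, sum))
```

The compute function only ever touches the small local buffer, which is the point of the pattern: the central memory is read in large, predictable transfers.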
Sherwani: If memory is knowledgeable about what is stored inside, it can give you service of the highest level. Right now you can’t do that. The attitude has been, ‘I have a board and I have a DIMM and I want this DIMM to be as low cost as possible.’ That approach has led us down this path. If you’re designing a microprocessor of any kind, it puts a lot of burden on the microprocessor to do all these things with memory. Eventually you will see memory microprocessors—storage with a processor on it—that can gate what is being stored on it. That is a new area, though, and I don’t think much has been done so far.
Rohatgi: In some respects this is already happening. If you think about cache controllers over the last 30 years, this is where you’ve seen a massive improvement. It isn’t user-level aware. It’s bit-level aware. And if your memory isn’t fragmented it works. Or in a multicore design, a coherency module is also very well aware of what it needs to do to keep synchronization between processors. I like the visionary statement of making it user-focused.
Neifert: If you look at the various SoCs on the market, they may use processors from ARM, MIPS and Tensilica, but a large number of them are still doing their own memory controllers because that’s a place to differentiate their design. There are more memory controllers coming out of Synopsys and Cadence, but in large part the bleeding-edge SoCs are still designing their own.
Sherwani: But you can go a lot further.
McDermott: There’s a big difference if you can optimize a path for video and have some pre-fetch algorithm. That may not apply to every chip. But in a custom design, you can partition as needed. When you define your coherency space you need to make them aware of these choices. It’s not just an arbitrary memory spec. You need to make them aware of how to use it.
Martin: That should lead to some opportunities for much more sophisticated memory control, and the kinds of data flows and accesses that people really want to do. That can be reflected in configurable memory IP. I’m not sure how rapidly that’s happening, but there are moves in that direction.
Sherwani: For the work we are doing with the [Micron] Hybrid Memory Cube, there’s a lot of excitement around that space. A completely different level of system design is possible with that kind of hybrid model.