Experts At The Table: Coherency

Last of three parts: Wide I/O; figuring out which data should be coherent; alternatives to cache coherency; how many cores and which kinds are best; software vs. hardware; lots of evaluation but not much implementation.


System-Level Design sat down to discuss coherency with Mirit Fromovich, principal solutions engineer at Cadence; Drew Wingard, CTO of Sonics; Mike Gianfagna, vice president of marketing at Atrenta; and Marcello Coppola, technical director at STMicroelectronics. What follows are excerpts of that conversation.

SLD: We’ve been hearing a lot about Wide I/O. Why is it so important and what effect will it have on designs?
Wingard: To me it’s about scalability. The stated benefits of Wide I/O are greater bandwidth and lower power. That impacts scalability because the SoC architecture is normally scaled, from a frequency and data-width perspective, to match the bandwidth of DRAM. Every time we plug an ARM subsystem or cluster into a chip, the rest of the chip is running at a lower frequency than the processor. There’s no reason to run at higher bandwidth than memory. That’s the choke point. If we get more memory bandwidth, it makes sense to run the system interconnect and the rest of the system at higher bandwidth. If you run at higher bandwidth, then you put more pressure on the cache-coherent system. That network needs to be able to resolve the cache probes in time so that you don’t impact the memory system, or else you’ll lose performance. Technologies like Wide I/O offer the promise of greatly scaling memory bandwidth, and they will put a lot of pressure on the coherence mechanisms to keep up with the memory. As long as your coherence scheme can search caches at the same bandwidth the DRAM would have delivered, then you’re okay. As soon as you fall behind, the coherence manager is the bottleneck.
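Wingard’s choke-point argument can be put in back-of-the-envelope terms. The sketch below is illustrative only; the bandwidth figures and function are hypothetical assumptions, not numbers from the discussion:

```python
# Back-of-the-envelope check of the choke point described above: the
# coherence manager must resolve cache probes (snoops) at least as fast
# as the DRAM can deliver data, or it becomes the system bottleneck.
# All figures here are hypothetical, not measured values.

def effective_bandwidth(dram_gbps: float, snoop_gbps: float) -> float:
    """The system sees the slower of DRAM bandwidth and snoop-resolution bandwidth."""
    return min(dram_gbps, snoop_gbps)

dram = 12.8    # a Wide I/O-class DRAM interface, in GB/s (hypothetical)
snoops = 8.0   # rate at which the coherence manager resolves probes, GB/s

usable = effective_bandwidth(dram, snoops)
if usable < dram:
    print(f"coherence manager is the bottleneck: {usable} of {dram} GB/s usable")
else:
    print("coherence keeps up with DRAM")
```

The point is simply that once snoop resolution falls behind the DRAM, any extra memory bandwidth is wasted.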
Fromovich: It should be faster.
Wingard: But there’s a latency benefit. What the computer architect would say is that there’s a bandwidth choke point, not latency, where you can’t search enough of the other guys’ caches. You can’t decide whether that DRAM transaction should go or not. It would be great to go faster, but with AMBA you don’t get a lot of performance by doing that. The interesting question is whether coherence is a power benefit or a power cost.
Gianfagna: It’s a balance point.
Wingard: We’ve been arguing that it’s unclear and that we should adopt coherence slowly as a system. ARM’s slides make it clear they can save power in the GPU-plus-CPU case by introducing coherence. That’s plausible in the display-list case. But if you try to make the rest of the GPU coherent with the CPU you would lose big. The GPU is about processing pixels, and the CPU isn’t processing pixels, so there’s no good reason for that data to ever be in cache.
Coppola: You need to decide which kind of data to cache. That’s not a simple problem to solve.
Wingard: It’s exactly that tradeoff that needs to come back to the system designer. And it’s going to be the software guys who decide. We’re going to give them hardware they can use, but they’re still going to have to use it carefully. The burden on them goes up. There are still benefits, but coherence is not a panacea. It’s not like you flip some switch and everything is better. The chip will burn more power if they turn coherence on all the time.
Coppola: In Europe we have a project funded by the European Union to figure out which kind of information we can cache and which information we cannot cache in order to improve performance. There are a number of companies working together to analyze the problem. If we share anything with a GPU or an accelerator for some particular function, we have to understand what goes into the cache. This is a really tricky problem.
Wingard: And the problem is what goes in cacheable space and what goes in cacheable space that’s shareable. Today most people worry about the second problem.
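The two-level distinction Wingard draws here, cacheable versus cacheable-and-shareable, is the kind of attribute a system fixes per memory region or carries with each transaction. The following is a simplified illustrative model, not the actual AMBA encoding:

```python
from enum import Enum, auto

class MemAttr(Enum):
    """Simplified memory attributes; real protocols such as AMBA ACE encode more states."""
    NON_CACHEABLE = auto()        # always fetched from memory
    CACHEABLE = auto()            # may sit in one master's private cache
    CACHEABLE_SHAREABLE = auto()  # may be cached AND must be kept coherent

def needs_snoop(attr: MemAttr) -> bool:
    # Only cacheable, shareable lines burden the coherence manager;
    # everything else bypasses the snoop machinery entirely.
    return attr is MemAttr.CACHEABLE_SHAREABLE

# Hypothetical placement: a pixel buffer consumed only by the GPU can be
# CACHEABLE (or not cached at all), while a CPU/GPU work queue that both
# sides touch would be CACHEABLE_SHAREABLE.
```

This is why the partitioning decision falls to software: the hardware can honor either attribute, but only the programmer knows which data is actually shared.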

SLD: Isn’t this similar to the 50-year-old problem of symmetric multiprocessing in software? It comes down to partitioning.
Gianfagna: There’s a big interconnect component to massively parallel software. There’s not a big interconnect problem here, but there is a replication and management problem. The cycles involved in that can blow all the savings, and then some.
Wingard: It is directly analogous. We’ve heard in the SoC world about two-core, three-core and four-core clusters with unclear programming models for how we’re going to take advantage of all of those processors. But now we market ourselves on how many processor cores are inside of these machines. The frequency and number of cores are on the data sheet for the phone because Intel taught us that’s the way to do it. Effectively using those cores is a challenge. My view is we’re better off with heterogeneous cores because we get better battery life, better performance and a better user experience. Using coherence effectively has some different characteristics—it’s really about how the data is being used. We could end up with a message-passing type of system. There is some benefit to that, but you don’t have the single global view of memory. Coherence is the opposite. It’s about protecting the model that memory is shared among all processors.
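The contrast Wingard draws between coherent shared memory and message passing can be sketched in a few lines. This is an illustrative model only, using threads and a queue to stand in for cores and an interconnect; none of it comes from the discussion itself:

```python
# Two styles of core-to-core communication. In a coherent shared-memory
# system, every "core" touches the same location and something must keep
# their views consistent; in a message-passing system, data moves by
# explicit copies and there is no single global view of memory.
import threading
import queue

# --- shared-memory style: both workers update the same counter ---
counter = 0
lock = threading.Lock()          # software lock stands in for hardware coherence

def shared_worker():
    global counter
    for _ in range(1000):
        with lock:
            counter += 1

# --- message-passing style: data moves by explicit send/receive ---
mailbox: queue.Queue = queue.Queue()

def producer():
    for _ in range(1000):
        mailbox.put(1)           # explicit copy; no shared state
    mailbox.put(None)            # end-of-stream marker

def consumer(result: list):
    total = 0
    while (msg := mailbox.get()) is not None:
        total += msg
    result.append(total)

result: list = []
threads = [threading.Thread(target=shared_worker) for _ in range(2)]
threads += [threading.Thread(target=producer),
            threading.Thread(target=consumer, args=(result,))]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter, result[0])   # 2000 1000
```

In the first style the burden falls on whatever keeps the shared location consistent; in the second, the cost is the explicit copying, which is the tradeoff the panel is weighing.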

SLD: Do the tools we have understand all of this?
Fromovich: We’re just starting to see customers building this kind of design. We’re working with customers now and we understand the complexity—but you only really start to understand coherency when you work with customers and face the reality. We realize the difficulties and the challenges of verification and compliance with the spec. To verify coherency you must bring the interconnect into the system. That’s a new approach. How do you verify coherency across a hierarchical fabric? How do you connect another interconnect into the system, and what effect does that have on coherency? We’ve built this kind of solution. But the complexity of the spec is much larger than it used to be. We need to provide automation, not just tons of tests. Otherwise it would take years to complete this kind of task. Instead of one read and one write there may be 20 versions of each.

SLD: And you may have read and write proceeding at different speeds, right?
Fromovich: Yes. And then you have coherency, maintenance of the cache—it’s impossible for anyone to keep it all straight.
Wingard: ARM just opened a compute farm in Cambridge with 65,000 CPUs focused completely on verification for their next-generation cores and the cache-coherent system. It’s the biggest compute farm in Europe.
Gianfagna: That’s an indication we’re going to use a brute-force approach to solving this.

SLD: Can you ensure IP will be coherent with everything else?
Gianfagna: We’re still in the very early stages of that. You need coherence of models from the very highest levels all the way down. Today that piece is missing. The ability to move from a hardware model to a software model with sufficient speed and accuracy to model these interactions isn’t happening. It’s a function of how much software you can run and how long it takes. I don’t think we’re there. There’s a need for a level of abstraction that is sufficiently accurate to reflect the hardware details of the cache, but sufficiently fast to run enough scenarios. And by the way, when you find a problem, you have to change the description of that hardware and try it again. That’s missing. There’s an enabling technology there that will make a difference in the efficiency with which these schemes are implemented, but we’re not there yet.

SLD: Is coherency the issue upon which progress along Moore’s Law relies?
Wingard: We can’t continue to build chips that aren’t coherent. The costs will go through the roof. The reality for the expensive platforms is that the software costs are higher than the chip development costs. There are two technologies that are frequently seen as ways of containing those software costs. One is coherence; the other is virtualization. The two almost go hand in hand. That, plus platform models, so there is less software to write from one chip to the next. But coherence will help companies stay on Moore’s Law.
Gianfagna: It’s a more robust vector in the business model. Technology vectors will continue to march along.

SLD: But from a rational-use basis, can you progress if you don’t solve this coherency issue?
Fromovich: If you keep adding more and more cores, at some point you encounter coherency. If you want to scale beyond four cores, you have to have coherency.
Wingard: Software guys have had to work without it, and they’re beginning to rebel.
Gianfagna: The software costs are off the charts.
Fromovich: That’s partly because software comes too late and you want to fix things earlier.
Gianfagna: That’s true. You spend a lot of time hacking software after you’ve finished the hardware. But what if you could change the rules?
Coppola: Introducing cache coherency plus virtualization is a way of simplifying the software. We will pay in silicon, but we have a lot of silicon available now. What is important is to use that silicon in the best way. The big issue is how we are going to move from today’s platforms to these new platforms. We need to address time-to-market. This is not simple.

SLD: Does coherency become a competitive weapon, or does it get solved as an industry?
Gianfagna: It has to be done by the industry because there are too many pieces to the puzzle and they can’t all be solved by one vendor. The market will resist that.
Fromovich: We see a lot of interest across our customer base. I don’t know if everyone will decide to use it, but there is a lot of exploration going on.
Wingard: The ARM ecosystem is trying to chase what Intel learned about 20 years ago in a very accelerated fashion. They’re having to come up to speed very fast.