Using different processors in a system makes sense for power and performance, but it makes cache coherency much more difficult.
Cache coherency is becoming more pervasive—and more problematic—as the number of heterogeneous cores used in designs continues to rise.
Cache coherency is an extension of caching, which has been around since the 1970s as a way to speed up access to a computer’s main memory without adding expensive new components. Cache coherency’s introduction coincided roughly with the arrival of symmetric multiprocessing in servers, but it began finding its way into portable electronics after 90nm, when it was no longer possible simply to turn up the clock speed on a single core.
Cache coherence provides the notion of a single address space shared between multiple processors, such that all processors see a single unified view of memory. The problem comes when one processor wants to access a piece of memory that is being worked on by, or sits in the local cache of, another processor. This means the caches themselves have to keep track of where the latest copy of the memory contents is held. Complexity increases as additional layers of cache are added.
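To make that concrete, the single-threaded C sketch below models two cores with private copies of one memory location. It is a deliberate simplification for illustration, not any real protocol or vendor implementation: a write that does not invalidate the other core’s copy leaves that core reading stale data, which is exactly the situation coherence hardware has to prevent.

```c
#include <stdio.h>
#include <stdbool.h>

#define NUM_CORES 2

/* One memory location plus a private cached copy per core. */
static int main_memory = 0;
static int cached_value[NUM_CORES];
static bool cached_valid[NUM_CORES];

/* Read through the cache: use the local copy if one is present. */
static int core_read(int core) {
    if (!cached_valid[core]) {              /* miss: fill from memory */
        cached_value[core] = main_memory;
        cached_valid[core] = true;
    }
    return cached_value[core];
}

/* Write-through, optionally without telling the other caches. */
static void core_write(int core, int value, bool invalidate_others) {
    cached_value[core] = value;
    cached_valid[core] = true;
    main_memory = value;
    if (invalidate_others) {                 /* the "coherent" path */
        for (int c = 0; c < NUM_CORES; c++)
            if (c != core)
                cached_valid[c] = false;     /* force a re-fetch later */
    }
}

int main(void) {
    core_read(0);                 /* both cores pull the location in */
    core_read(1);

    core_write(0, 42, false);     /* incoherent write: no invalidation */
    printf("core 1 sees %d (stale)\n", core_read(1));   /* prints 0  */

    core_write(0, 43, true);      /* coherent write: invalidate sharers */
    printf("core 1 sees %d (fresh)\n", core_read(1));   /* prints 43 */
    return 0;
}
```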
The majority of the work in cache coherence involves homogeneous cores. Cache coherence using cores of different sizes running at different speeds was, at least until recently, mostly confined to supercomputers. It is now creeping into mainstream designs, where it is viewed as one more way to eke out extra performance from multiple processors on a single chip.
And here’s where the problems begin. Keeping caches coherent is generally straightforward when using identical cores, and it provides the simplest programming paradigm for software. But it’s certainly not the most efficient way to use memory. In fact, non-uniform memory architectures can be significantly more energy-efficient and simpler to implement.
“You really need to consider if you need coherency in the first place,” said Mike Thompson, senior product manager for ARC processors at Synopsys. “Shared low-latency memory may support what you need. The challenge with coherency is that it’s a slow path that can limit the speed in applications. Coherency tends to be deep. There are a number of layers that you end up with in the path through coherency.”
Still, there are plenty of cases where cache coherency is necessary, particularly when processing is being partitioned across multiple cores. In those cases, it’s much more energy-efficient to use different sizes of cores that can be targeted for specific functions. Rather than just turning off same-sized cores when they’re not being used, this approach allows smaller, less power-hungry cores to be kept on and used for some basic processing while larger cores can focus on computation-intensive tasks and then shut down. This is especially important for always-connected IoT devices, but it’s also much more complicated because not every function needs to run at the same speed or energy level.
“With heterogeneous cache coherency you’re dealing with multiple caching agents with different characteristics, protocols and cache sizes,” said Kurt Shuler, vice president of marketing at Arteris. “There are agents with smaller caches that work better with smaller directories and big clusters that require huge directories. The problem is you need to be able to size and configure multiple directories and snoop filters. There also are different caching policies, which are the cache states that a GPU or CPU recognizes, and problems managing all of the different cache states.”
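One way to picture the sizing problem Shuler describes is with a back-of-the-envelope calculation: a directory or snoop filter typically keeps a presence bit per caching agent for every tracked line, so its width grows with the number of agents and its depth with the cache capacity it has to cover. The figures below are hypothetical, chosen only for illustration.

```c
#include <stdio.h>

/* Hypothetical parameters for illustration -- not from any specific design. */
#define NUM_AGENTS     12           /* coherent caching agents (CPU clusters, GPU, ...) */
#define STATE_BITS     3            /* per-line coherence state */
#define LINE_SIZE      64           /* bytes per cache line */
#define TRACKED_BYTES  (4u << 20)   /* total cache capacity the directory must cover */

int main(void) {
    unsigned long entries        = TRACKED_BYTES / LINE_SIZE;   /* one entry per line */
    unsigned long bits_per_entry = NUM_AGENTS + STATE_BITS;     /* one sharer bit per agent */
    unsigned long total_kbits    = entries * bits_per_entry / 1024;

    printf("%lu entries x %lu bits = ~%lu Kbit of directory/snoop-filter state\n",
           entries, bits_per_entry, total_kbits);
    /* Adding caching agents widens every entry; covering more cache multiplies the entries. */
    return 0;
}
```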
Among the most common policies or protocols are MESI, MSI, MOSI, MESIF and MOESI. If they’re not implemented consistently across the caching agents, the entire design falls apart.
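As a rough illustration of what one of these policies involves, the sketch below models the four MESI states and a handful of transitions for a single line in a single cache. It is a simplified teaching model: it omits bus transactions, write-backs and the MOESI/MESIF refinements, and it is not tied to any particular core.

```c
#include <stdio.h>

/* The four MESI states a cache line can be in. */
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;

/* Events seen by one cache: its own core's accesses plus snooped remote traffic. */
typedef enum { LOCAL_READ, LOCAL_WRITE, REMOTE_READ, REMOTE_WRITE } event_t;

/* Simplified next-state function for a single line in a single cache. */
static mesi_t mesi_next(mesi_t state, event_t ev, int other_caches_have_copy) {
    switch (ev) {
    case LOCAL_READ:
        if (state == INVALID)            /* read miss */
            return other_caches_have_copy ? SHARED : EXCLUSIVE;
        return state;                    /* hits don't change the state */
    case LOCAL_WRITE:
        return MODIFIED;                 /* take ownership and dirty the line */
    case REMOTE_READ:
        /* Another cache reads the line: a MODIFIED copy is written back
           and every holder drops to SHARED. */
        return (state == INVALID) ? INVALID : SHARED;
    case REMOTE_WRITE:
        return INVALID;                  /* someone else took ownership */
    }
    return state;
}

int main(void) {
    mesi_t s = INVALID;
    s = mesi_next(s, LOCAL_READ, 0);    /* -> EXCLUSIVE */
    s = mesi_next(s, LOCAL_WRITE, 0);   /* -> MODIFIED  */
    s = mesi_next(s, REMOTE_READ, 1);   /* -> SHARED    */
    printf("final state: %d\n", s);     /* prints 1 (SHARED) */
    return 0;
}
```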
Parallel problems
Computer science has been wrestling with better ways to add parallelism into application programming for the past half-century. While parallelism works well with databases and applications such as graphics processing, getting software developers to program in parallel is like trying to hammer a square peg into a round hole. In addition, most mainstream programming languages offer no natural constructs to make its adoption any easier. It always makes sense in a PowerPoint presentation, but partitioning compute functions across multiple processors is contrary to the way the human brain works.
Cache coherency is a key component of parallelism. It allows applications that can be parallelized to run more efficiently by utilizing cache memory rather than main memory in much the same way a single-core processor uses cache. That can be done at any level of cache, but typically it happens at L1 or L2. It also is gaining popularity in new areas, such as automotive system engineering, where SoCs increasingly are replacing electronic control units and need to remain in sync across multiple functions.
“There are a lot of tricks to improving throughput,” said Mike Gianfagna, vice president of marketing at eSilicon. “Caching is just one of those. It’s basically predictive, and it’s being used more than before. But if there’s a failure with that, there is a problem with the design of multiple cores.”
Gianfagna said parallelism among cores is effective when it works, and cache coherency comes up in almost every discussion about physical implementation and architectural design these days. “But it’s a complex system to design.”
That’s particularly true with heterogeneous cache coherency. Getting coherency right from a design standpoint is fairly straightforward with two identical cores. It gets harder with four cores, and harder yet again with eight. But the difficulty level goes up significantly when those homogeneous cores are replaced by heterogeneous ones. And it all has to be baked into the chip architecture and properly verified prior to tapeout.
“Establishing hardware level coherency is a significant verification challenge that is increasing as the number of chip cores multiplies to enable greater device performance,” said Bill Neifert, director of models technology at ARM. “Coherency is not a block-level issue as it needs to be tackled at the system-level in a way that is sensitive to the individual IP configuration of the entire SoC. If it’s not done correctly, then there is a risk of invalidating whole sections of the cache.”
Neifert said one customer developing software for an eight-core chip found errors and assumed it was a virtual prototyping issue because the software was working fine on the actual silicon. It turned out to be an uninitialized register. In the silicon it defaulted to one state, and in the virtual prototype it was another. They used the visibility of the prototype to track it down. “If you do this wrong, you can have functional issues and performance issues.”
Nick Heaton, distinguished engineer at Cadence, likewise believes that cache coherency is a system-level challenge. “The big issue, and one that a lot of hardware engineers can’t easily grasp, is that you have to take a software approach, which is that you can never exhaustively cover all of the possible permutations,” he said. “For a long time, 2 clusters was the mantra. Now we’re seeing 12 clusters. You’ve got on and off states, which add potential for bugs. On top of that, different customers are going down different approaches. You can shut down the power but still keep the cache on. It may take more power to flush it than leave it powered on. Or you may want a factored response. These are architectural and philosophical choices. There is no right answer.”
What goes wrong
But plenty of experienced engineering teams get this wrong. Arvind Raghuraman, staff engineer in Mentor Graphics’ Embedded Systems Division, said that working with processors running at different speeds is still okay if they broadcast updates to other cores.
“If the hardware provides coherency, the software can leverage it,” Raghuraman said. “But one of our customers came to us trying to figure out why they were taking a performance hit. Their throughput was extremely bad. They thought they had enabled shared memory and it turned out that wasn’t the case. When it is set up properly there can be a substantial performance gain—as much as an order of magnitude. But coherency can make or break the performance of a device. When it’s not there, achieving performance goals is a problem.”
Processor vendors such as ARM and Synopsys do provide infrastructure and tools so that software vendors can take advantage of cache coherency. ARM’s big.LITTLE architecture is a classic example of a heterogeneous multicore implementation, and the technology is widely used in smartphones today. But coherency also can be customized for a specific implementation, and at that point it’s basically off the grid.
“At this point, there are two classes of design,” said Drew Wingard, CTO at Sonics. “There are those who accept the technology provided by the processor vendor, and those who have the ability to manage coherence in a better way. So you have ACE and CHI protocols from ARM. They all use the same protocol stack and they’re all interoperable. You also could move the coherence closer to the CPU, but that would require a lot more work in software. And the number one reason why people go to coherent processing is to reduce the work on software.”
Different data
While not all data needs to be kept current, some data goes out of sync faster than other data. Data that is in sync is clean, while out-of-sync data is dirty. How often data gets dirty largely determines how difficult it will be to keep it coherent with other data.
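In cache terms, a dirty line is one that has been modified locally but not yet written back, so main memory, and every other observer, still holds the old value. The generic write-back sketch below illustrates the idea with a per-line dirty bit; it is not modeled on any specific processor.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define LINE_SIZE 64

/* A generic write-back cache line: the dirty bit marks data that memory
   has not yet seen. */
typedef struct {
    uint64_t tag;
    bool     valid;
    bool     dirty;
    uint8_t  data[LINE_SIZE];
} cache_line_t;

static void line_write(cache_line_t *line, int offset, uint8_t value) {
    line->data[offset] = value;
    line->dirty = true;            /* now out of sync with main memory */
}

static void line_writeback(cache_line_t *line, uint8_t *memory) {
    if (line->dirty) {
        memcpy(memory, line->data, LINE_SIZE);
        line->dirty = false;       /* clean again: memory has the latest copy */
    }
}

int main(void) {
    uint8_t memory[LINE_SIZE] = {0};
    cache_line_t line = { .tag = 0x1000, .valid = true };

    line_write(&line, 0, 42);
    printf("dirty=%d, memory[0]=%d\n", line.dirty, memory[0]);  /* 1, 0  */
    line_writeback(&line, memory);
    printf("dirty=%d, memory[0]=%d\n", line.dirty, memory[0]);  /* 0, 42 */
    return 0;
}
```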
This is particularly challenging with different sizes of cores. “You’re passing architectural boundaries with different implementations of cores,” said Synopsys’ Thompson. “It could be 64-bit cores at the high end connected to 32-bit cores. When you run more and more cores, you have limitations of the maximum frequency for a system. So you may have a single or dual-core processor running at 1.5GHz, but with four to eight cores it drops to 1GHz. Coherency becomes a bottleneck. The more processors in an SoC, the more rope you have to hang yourself.”
In the first iterations of cache coherency, most chipmakers used the protocols provided by processor vendors. But as more chipmakers adopt cache coherency, they are looking for ways to differentiate themselves. Some are even developing their own caching protocols, said Cadence’s Heaton.
“You need to be able to take your IP and be confident that when you bolt it together with a coherent infrastructure that nothing breaks,” Heaton said. “So you can use formal techniques at the interface and do subsystem simulation with system interconnect VIP. But how do you know you’ve driven through all the permutations? This becomes a use-case-driven verification problem, and raises the potential for bugs.”
He said power up and power down cut across coherency, adding a whole other set of corner cases that need to be checked. “This is why it’s so daunting. If the system has a failure you haven’t found pre-tapeout it can be devastating.”
On top of that, not all cache is the same. Some L3 cache is off-chip, and there is a debate about whether designs even need L4 or last-level cache at all. Adding coherency at some levels while not at other levels, with some on-chip and some off-chip, makes the architecture even more complex, which is why there has been experimentation with a type of software and hardware pre-fetch. Sonics’ Wingard said it’s uncertain whether this will work, but he noted that it is less intrusive than full cache coherency.
On a positive note, cache coherency is getting renewed attention throughout the academic community and by the semiconductor industry. It’s also becoming something many more design teams will need to understand as new markets begin using different combinations of technology, so there will be much more experience in dealing with these issues. But at least for now, it’s raising the anxiety level among some industry veterans.
Based on our research, cache coherency does not scale beyond several hundred cores at most, due to the communication necessary to maintain it. The need for cache coherency also seems to come from parallelizing existing code designed for a single core or very few cores.
The end game appears to be refactoring the problems themselves, or in some cases recognizing that some problems just don’t parallelize well. One thousand cores to run MSWORD is probably not a good idea.
MapReduce/Hadoop and Apache Spark, on the other hand, sidestep cache coherency by redefining the data management problem in a manner that parallelizes nicely.
The cache-coherency problem arises from a bad software methodology for parallel processing (SMP), which took hold because it was an easy approach when CPUs were smaller and memory was expensive. Maintaining cache coherency across more than a few CPUs is difficult, and it doesn’t scale well beyond that.
If you do parallel processing on GP-GPUs you don’t run into that problem because the memory architecture is quite different and cache-coherency isn’t required. However, programming GP-GPUs isn’t that easy.
So you can look at the various NoC/cache-coherency tools as applying a “Band-Aid” to a fundamental methodology problem, and not really any kind of cure.
The cure is to move to a different software methodology (e.g. CSP – http://usingcsp.com) that doesn’t share memory in a way that needs cache coherency, and is better suited to a heterogeneous/distributed processing environment.
If you do GP-GPU computing the way AMD implemented it with its HSA, then coherency is a must.
You can probably do CSP on top of HSA, i.e. coherency is a requirement of a particular kind of software.