The fundamentals of improving performance.
Coherency is gaining traction across a wide spectrum of applications as systems vendors begin leveraging heterogeneous computing to improve performance, minimize power, and simplify software development.
Coherency is not a new concept, but making it easier to apply has always been a challenge. This is why it has largely been relegated to CPUs with identical processor cores. But the approach is now being applied in many more places, from high-end datacenters to mobile phones, and it is being applied across more cores in more devices.
“Today, in the networking and server spaces, we’re seeing heterogeneous processing there,” said Neil Parris, senior product manager in the Systems and Software Group at ARM. “It’s really a mixture of, for example, ARM CPUs with maybe different sizes of CPUs, but other processors such as DSP engines, as well. The reason they want the cache coherency comes down to efficiency and performance. You want to share data between the processors, and if you have hardware cache coherency, then the software doesn’t need to think about it. It can share the data really easily.”
Without hardware coherency, it has to be written in software. “So every time you need to pass some data from one CPU to the next, you have to clean it out of one CPU cache into main memory,” Parris said. “You have to tell the next CPU if you’ve got any old copies of this data in the cache. You have to invalidate that and clean it out. Then you can read the new data. You can imagine that takes CPU cycles to do that. It takes unnecessary memory accesses and DRAM power to do that. So really, the hardware coherency is fundamental to improving the performance of the system.”
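The sequence Parris describes (clean, invalidate, then read) can be sketched as a toy simulation. This is only an illustration under stated assumptions: the `Cache` class, its method names, and the address 0x100 are hypothetical, and real systems perform these steps with architecture-specific cache maintenance instructions rather than Python objects.

```python
"""Toy model of software-managed coherency between two CPUs with private caches.

All names here are hypothetical; this is not any vendor's API.
"""


class Cache:
    """A private write-back cache in front of a shared main memory (a dict)."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}      # address -> cached value
        self.dirty = set()   # addresses modified but not yet written back

    def read(self, addr):
        # A hit returns the cached copy, even if main memory has moved on.
        if addr not in self.lines:
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        # Write-back policy: main memory is not updated until clean().
        self.lines[addr] = value
        self.dirty.add(addr)

    def clean(self, addr):
        # Push a dirty line out to main memory.
        if addr in self.dirty:
            self.memory[addr] = self.lines[addr]
            self.dirty.discard(addr)

    def invalidate(self, addr):
        # Drop any local copy so the next read goes to main memory.
        self.lines.pop(addr, None)
        self.dirty.discard(addr)


def demo():
    memory = {0x100: 1}
    producer, consumer = Cache(memory), Cache(memory)

    consumer.read(0x100)           # consumer now holds an old copy (1)
    producer.write(0x100, 2)       # producer updates only its own cache

    stale = consumer.read(0x100)   # no maintenance yet: still sees 1

    producer.clean(0x100)          # step 1: clean producer's line to memory
    consumer.invalidate(0x100)     # step 2: invalidate consumer's old copy
    fresh = consumer.read(0x100)   # step 3: read the new data from memory
    return stale, fresh


if __name__ == "__main__":
    print(demo())
```

In this model, skipping either maintenance step leaves the consumer reading the old value, and each step costs extra work and a trip to main memory. That is the overhead hardware coherency is meant to remove.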
How to achieve that isn’t so simple, though. Research has been underway for years to improve coherency and to make it easier to implement.
“We’re in a world where, in a lot of design teams, the software team outnumbers the hardware team by integer multiples, so the cost for the software development is very expensive,” said Joe Rowlands, chief architect at NetSpeed Systems. “Coherency is the solution there, and more devices are becoming coherent.”
In the past the only real tweak to the coherency formula was a change in latency to memory access. All of the agents were still identical and symmetric. Increasingly, that is changing.
“Now what we are seeing are agents that are wildly different, and you have CPUs where you may have a high performance CPU and some very-low-power, low-cost CPUs on the same system, both of which need to be coherent,” Rowlands said. “You might have GPUs that are coherent. GPUs are interesting because unlike CPUs, they’re much more latency insensitive but they have substantially more coherent bandwidth.”
These are two examples of wildly different devices, and the trend is to figure out how to get these things to work together. Rowlands contends a hardware coherency scheme is needed that comprehends these differences in requirements, the difference in cache sizes, the difference in number of ports, and basically every address.
Anush Mohandass, vice president of business development at NetSpeed, stressed this is the very essence of heterogeneous computing. “When I started as a design engineer, the entire processor was a chip. There was no SoC, so the chip meant the CPU. Now, a chip means a bunch of different things. Some engineering teams may be implementing 16 or 20 different CPU cores for math processing, a GPU for parallel processing, a DSP core for a camera system doing image processing. All of this is happening concurrently, and all of these have different sets of requirements in terms of system-level requirements, power requirements, performance requirements, latency requirements, and all of this needs to be managed in a consistent way. That’s where coherency has exploded.”
Rowlands contends that a coherent solution must be built that is optimized for the specific system and for the specific components. “A system with just CPUs is going to be very different from a CPU plus GPU system, or a CPU plus DSP system, or a combination of them. Similarly, on the IP front, when you are building a single, fixed configuration coherence IP, a fixed configuration isn’t taking advantage of any of the asymmetries in the system.”
This is where the coherent hub interface (CHI) comes into play, most commonly discussed in the context of AMBA 5 CHI.
Kurt Shuler, vice president of marketing at Arteris, observes that just the term ‘coherent hub’ causes confusion. “Traditionally, these things have been monolithic, and have also been referred to as cache controllers. The problem is that when there is a hub, it’s basically a big crossbar with logic on the ends, tons of wires on the inside, so when it comes to place and route time, it becomes very difficult to place. As a result, the industry is demanding something that allows them to reach timing closure, and get through place and route more easily.”
Drew Wingard, CTO of Sonics, agreed. AMBA 5 CHI semantically looks a lot like AMBA 4 ACE, but physically it looks very different because it is a packet processing kind of interface. “What it really means is that by the time you get to a full AXI or ACE interface, believe it or not, you are up to eight interdependent channels worth of signals. Not eight signals, but eight sets of signals, which can have hundreds of signals within each set. The total number of wires can easily get to 1,000 for 128-bit kinds of interfaces.”
Understandably, that starts to become an integration barrier, he said. “With CHI they defined an interface where you could think of them as multiplexing those channels across a shared interface, but they went a bit further than that and actually defined it in terms of something that looks more like a packet-based format. The reason they did it was to enable the design of larger scale cache coherent chips, so this is pretty quickly more than just big.LITTLE. It’s going to things where there are six or eight clusters of coherent processors all being pushed together typically with a third-level shared cache of some kind, and maybe some heterogeneous accelerators such as GPUs, for example.”
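The idea of multiplexing several logical channels across one shared, packet-style interface can be illustrated with a small sketch. It is a toy model only and does not reflect the actual CHI packet format: the `Packet` and `SharedLink` names, the channel labels, and the payload strings are all hypothetical.

```python
"""Toy model: several logical channels multiplexed over one shared packet link.

Illustrative only; names and fields are assumptions, not the CHI specification.
"""
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    channel: str   # which logical channel the payload belongs to
    payload: str


class SharedLink:
    """One physical link carrying packets from many logical channels."""

    def __init__(self):
        self.queue = deque()

    def send(self, channel, payload):
        # Instead of dedicated wires per channel, tag the payload and enqueue it.
        self.queue.append(Packet(channel, payload))

    def drain(self):
        # The receiver demultiplexes by channel tag, preserving per-channel order.
        received = defaultdict(list)
        while self.queue:
            pkt = self.queue.popleft()
            received[pkt.channel].append(pkt.payload)
        return dict(received)


def demo():
    link = SharedLink()
    link.send("request", "read 0x100")
    link.send("snoop", "invalidate 0x200")
    link.send("request", "write 0x300")
    return link.drain()


if __name__ == "__main__":
    print(demo())
```

The trade-off in this model is that each transfer carries a tag identifying its channel, in exchange for no longer needing a separate set of wires per channel.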
Wingard said ARM did not expect AMBA 5 to be used in a very general way. In the middle of the coherent part of the design there are very detailed dependencies and interactions between the different communicating components. Consequently, just defining the interface (which is what CHI is) doesn’t guarantee true interoperability at the system level.
ARM’s Parris agreed that the number of applications for coherency is now well beyond what was originally expected. But because the principle of hardware coherency is to simplify the software and remove software complexity overhead, it is applicable in many places where there is a processor.
“There are enough system-level properties of how all the stuff works, and it’s really difficult to capture all of that in the interface specification,” Wingard said. “By and large, people are either working with CHI or doing so in a more closed system model. Either they are receiving their processors and the fabric components from a company like ARM or maybe they have an architecture license to the processor and they are building their own fabric. In both cases, it’s more of a closed system model.”
Building an interconnect fabric is no small task, and not the mainstream practice today. The industry is still in a phase where the engineering teams believe the lowest-risk path is to receive the processor and the fabric IP from the same people who have had the opportunity to co-validate it and deal with all of those system issues, Wingard said. Those are the people with detailed knowledge about what was talking to what in the design.
“Everyone who designs an interconnect is going to optimize it in different ways. ARM and other vendors will offer a lot of configuration options so the licensees can tune the performance and properties of the interconnect to meet their needs. ARM is very big on that configurability and offers design tools to help partners optimize their designs to meet their needs. That could mean making it the smallest possible area to meet their performance goals. It could be tuning it for frequency targets. There are lots of different areas where you might customize and configure the IP to meet the product needs,” Parris said.
In terms of customizing the interface protocol itself, Parris added there are a number of potential gotchas, challenges, and difficulties with that. “The IP interface standards work best when everyone follows the same standard. If everyone has the same interface standard, all of the IP and verification IP plugs and plays together. If someone goes away from that standard, now you need a special EDA tool or a special modeling solution. It becomes more expensive in a way. Can you really get a benefit from doing those modifications? Maybe. But would those benefits outweigh the costs of incompatibility or non-standard solutions and non-standard IP?”
Arteris’ Shuler said this works best when that IP fits perfectly with the target market. When that bundle doesn’t quite fit it can cause trouble.
Because of this, there is a lot of growth happening in this area in IP and system components that have separate processing engines for specific things, he noted. Think video processing units or image signal processors along with parallel processors for machine learning. “The companies in this space want to know if they can take advantage of cache coherency because they can have the CPU cluster (an industry standard CPU, custom, bespoke IP) with one common view of memory,” Shuler said. “That helps with bandwidth and latency, especially if you don’t have to go out to the DRAM because most of the stuff you’re dealing with is in cache. And it could help simplify the software too.”
On the smaller end of the scale, coherency is being incorporated in some really tiny devices such as Freescale/NXP’s 64-bit ARM-based networking processor, Parris went on. “It’s a tiny little chip with a single cluster ARM core, and a cache coherent interconnect which was designed to accommodate multiple CPU clusters in mobile applications but has also found use in networking designs, storage controller designs, set-top boxes, DTVs and automotive.”
Looking ahead, future versions of AMBA will work towards improving efficiency of the system, improving the efficiency of moving data and sharing data between the different processors and wires in the system, he added. “As partners build bigger and bigger systems with more and more processors, we work on new interconnect designs and new CPU designs, and we can see different ways to improve on the protocol.”
There is tremendous design complexity to be dealt with in next-generation processors, even for the more generic application processors in cell phones and similar devices. Arteris’ Shuler concluded that while only a few companies can justify going down to 10nm and 7nm, at least right away, the others are still going to need to sell a product that is as low power and as high performing as the devices at 10nm and 7nm. “How are you going to do it? You’re going to have to be more efficient in the processing of your hardware; you’re going to have to have more efficient hardware. That means you’re not going to be able to run all that software just on CPU cores. You’re going to have to do a lot more offloading and that’s where the heterogeneous stuff is really going to take off.”