The mundane aspects of a system can make or break a solution, and interfaces often define what is possible.
It takes a lot of technology to enable something like machine learning, and not all of it is as glamorous as neural network architectures and algorithms.
Several levels below that is the actual hardware on which these run, and that brings us into the even less sexy world of interfaces. One such interface, the Cache Coherent Interconnect for Accelerators (CCIX, pronounced "see six"), aims to make the interface between the CPU and accelerators implemented in FPGAs or GPUs almost disappear.
CCIX Fundamentals
Using the technology available before the release of CCIX, the general flow of data between a CPU and an accelerator was very batch-oriented. The CPU would transfer data, possibly using direct memory access (DMA), from main memory into the accelerator. The accelerator would crunch on the data and, when finished, send an interrupt back to the CPU. At that point, the resulting data would be transferred back into main memory. The whole process has long latency, which forces a very coarse-grained style of operation.
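A minimal sketch of that batch-oriented shape, with a static buffer standing in for accelerator-local memory and the real DMA engine and interrupt machinery elided, could look like this:

```c
/* Illustrative sketch of the pre-CCIX, batch-oriented flow. The "device" is
 * just a static buffer standing in for accelerator memory; the point is the
 * copy-in / compute / interrupt / copy-out shape, not a real driver API. */
#include <stdint.h>
#include <string.h>

#define DEV_MEM_SIZE 4096
static uint8_t dev_mem[DEV_MEM_SIZE];     /* stand-in for accelerator-local memory */

static void accel_compute(size_t len)     /* stand-in for the accelerator kernel */
{
    for (size_t i = 0; i < len; i++)
        dev_mem[i] ^= 0xFF;               /* trivial "processing" */
}

void run_batch(const uint8_t *input, uint8_t *output, size_t len)
{
    memcpy(dev_mem, input, len);          /* 1. DMA the input buffer into the accelerator */
    accel_compute(len);                   /* 2. accelerator crunches on its private copy */
                                          /* 3. completion interrupt back to the CPU */
    memcpy(output, dev_mem, len);         /* 4. DMA the results back into main memory */
}
```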
“Consider the slowdown of Moore’s Law,” says Gaurav Singh, vice president of architecture and verification at Xilinx and chair of the CCIX consortium. “What does that mean for system architecture? The growth in the amount of functionality that can be incorporated within a single piece of silicon is slowing. Now, in order to get the necessary functionality and performance improvements, you have to utilize heterogeneous components. So what is the correct way to interconnect them?”
And that is what CCIX hopes to provide. The effort was started by Xilinx in 2016, when it needed an efficient architecture to connect its devices, which were being used as accelerators within the data center. PCI express is the de facto standard today, but it suffers from the latency problem already described. Xilinx got together with ARM and several EDA and IP partners under a multi-party non-disclosure agreement to try to find a more suitable solution. In March 2017, they incorporated as a consortium.
Rather than creating a new standard from scratch, CCIX is built on top of PCI express. The PCIe physical and data link layers can be used unmodified. CCIX adds a modified transaction layer and a new link and protocol layer.
CCIX creates a system from connected components, all of which have native access to the same memory. If those accelerators had instead been designed onto a single piece of silicon, they would all have had access to all of the resources of the SoC. CCIX attempts to enable the same functionality at the system level.
By bringing cache coherence to the connection between the CPU and the accelerator, CCIX bypasses much of that overhead. A data structure can be created in memory and a pointer to it sent to the accelerator. The accelerator can crunch on the data immediately, possibly making local copies only when it actually needs them. The data can even be updated continuously while the accelerator is working on it.
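A contrasting sketch of that pointer-passing model, under the assumption of a coherent shared address space and with a hypothetical accel_submit() call, might be:

```c
/* With a coherent interconnect the CPU hands the accelerator a pointer instead
 * of a copy. accel_submit() is a hypothetical driver call that passes only the
 * virtual address of the shared structure to the device. */
#include <stddef.h>
#include <stdint.h>

struct work_item {
    uint8_t *data;          /* lives in ordinary, coherent system memory */
    size_t   len;
    volatile int done;      /* the accelerator flips this when it finishes */
};

void accel_submit(struct work_item *w);   /* hypothetical: send the pointer, nothing else */

void run_coherent(struct work_item *w)
{
    accel_submit(w);        /* no bulk copy; the device pulls in cache lines as it touches them */
    while (!w->done)        /* meanwhile the CPU can poll, or keep updating other parts of the data */
        ;                   /* coherency keeps both views of memory consistent */
}
```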
Why cache coherency is important
Technology adoption depends on low barriers to entry. “There are hundreds of thousands of developers out there who are used to programming for ARM devices,” points out Kent Orthner, systems architect for Achronix. “CCIX basically uses the ARM protocol under the hood to provide coherency across devices that are connected. This allows all of the software programmers to use the same programming model. And they can now talk to the accelerators in the same way. That becomes a lot simpler.”
“Essentially, you are taking that ARM bus and coherency protocol, packaging it up in the CCIX protocol, and now you are leveraging the PCI infrastructure to add that shared memory system across multiple chips,” adds Jeff Defilippi, senior product manager at ARM. “Even in the mobile space today we have a CPU and GPU, and a lot of times they are connected coherently just for that reason. It simplifies software, and not just for an efficiency reason.”
Coherency makes it easier to tie various components together without a significant performance hit. “What coherency allows you to do is bridge the gap between an FPGA and an SoC and the CCIX interconnect,” says Anush Mohandass, vice president of marketing and business development at NetSpeed Systems. “If you look at an FPGA accelerator, that’s often treated as a second-class citizen in multi-chip architectures. To get a performance boost, that has to be coherent. This happens in the data center, where there is off-chip acceleration that can be used for a specific processing task. The other place it happens is in a smart NIC at the top of a server rack.”
The way of working with accelerators also is changing. “Data sharing models are changing,” explains Xilinx’s Singh. “Earlier, it used to be that you had a chunk of data and you could fire and forget. But today, there are many applications where you have sparse data structures and the access to those data structures is really data-dependent. You get a packet and look at a packet, and that gives you a pointer into a really large table. That model does not work with PCI express because you would have to know, a priori, where that would be and pass that buffer over to the accelerator. There are lots of examples of those kinds of structures where you really want the accelerator to be doing a lot of lookup and the processor to be doing an update. So you want the processor and accelerator to be sharing the same table.”
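A rough illustration of that shared-table pattern, assuming coherent shared memory and using purely invented names, might be:

```c
/* Sketch of the shared-table pattern Singh describes: the CPU updates entries
 * in place while the accelerator follows data-dependent indexes into the same
 * table. With a copy-based model the whole table would have to be transferred
 * up front; with coherent shared memory only the touched cache lines move. */
#include <stdint.h>

#define TABLE_ENTRIES (1u << 20)          /* a large lookup table in shared memory */

struct route_entry {
    uint32_t next_hop;
    uint32_t flags;
};

static struct route_entry table[TABLE_ENTRIES];

/* CPU side: occasional updates, made directly in the shared table */
void cpu_update(uint32_t index, uint32_t next_hop)
{
    table[index].next_hop = next_hop;     /* visible to the accelerator via coherency */
}

/* Accelerator side (shown as C only for illustration): per-packet, data-dependent lookups */
uint32_t accel_lookup(uint32_t key_from_packet)
{
    return table[key_from_packet % TABLE_ENTRIES].next_hop;
}
```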
Enabling machine learning
This type of access may be very important for machine learning. Today, machine learning is based on tasks that have a very deep pipeline. “Everyone talks about the amount of compute required, and that is why GPUs are doing well,” says Singh. “They have a lot of compute engines, but the bigger problem is actually the data movement. You may want to enable a model where the GPU is doing the training and the inference is being done by the FPGA. Now you have a lot of data sharing for all of the weights being generated by the GPU, and those are being transferred over to the FPGA for inference. You also may have backward propagation and forward propagation. Forward propagation could be done by the FPGAs, backward by the GPU, but the key thing is still that data movement. They can all work efficiently together if they can share the same data.”
It is also important to support the programming languages being used for these applications. OpenCL, for example, defines a concept called shared virtual memory, which tries to make host and device memory look like a single shared address space. CCIX provides the hardware support to make that programming model much more efficient.
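As a concrete illustration, the OpenCL 2.0 shared virtual memory calls (clSVMAlloc, clEnqueueSVMMap/Unmap, clSetKernelArgSVMPointer) let the host and a kernel exchange a single pointer rather than copying buffers. The sketch below omits all context, queue, and kernel setup and all error checking:

```c
/* OpenCL 2.0 coarse-grained SVM sketch: host and device share one pointer.
 * Context, queue, and kernel creation are omitted; error handling is elided. */
#define CL_TARGET_OPENCL_VERSION 200
#include <CL/cl.h>

void run_with_svm(cl_context ctx, cl_command_queue q, cl_kernel kernel, size_t n)
{
    /* One allocation addressed by both host and device through the same pointer */
    float *data = clSVMAlloc(ctx, CL_MEM_READ_WRITE, n * sizeof(float), 0);

    clEnqueueSVMMap(q, CL_TRUE, CL_MAP_WRITE, data, n * sizeof(float), 0, NULL, NULL);
    for (size_t i = 0; i < n; i++)
        data[i] = (float)i;                      /* host writes in place, no copy-in */
    clEnqueueSVMUnmap(q, data, 0, NULL, NULL);

    clSetKernelArgSVMPointer(kernel, 0, data);   /* the kernel receives the pointer itself */
    clEnqueueNDRangeKernel(q, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);
    clFinish(q);                                 /* results readable via 'data' after re-mapping */

    clSVMFree(ctx, data);
}
```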
Extending the notion of the OS
Changes in the interface can have a ripple effect throughout the rest of the hardware architecture.
“Assume you have a function that is being accelerated, perhaps part of a machine learning algorithm that does vision processing,” explains Orthner. “You have the image in memory and you want to recognize a face. So the software asks for a face function. It could either run in software, which will be low performance, or it can run in the FPGA. Whether it can run in the FPGA right now depends on whether the bitstream has been loaded. Then it comes down to the driver. It can either fire it off to software, it can recognize that the hardware implementation is already there in the FPGA, or it can load the FPGA with the bitstream that supports it.”
The correct answer may change at runtime. If it is known that you will be doing the task 1,000 times, then it probably makes sense to spend the couple of milliseconds that it takes to program the FPGA. With an embedded FPGA it can be done even faster. Part of the software driver needs to keep track of which bitstream is in the FPGA. You also could have multiple accelerators in the same device. This type of functionality needs to migrate up into the OS, again in a way that makes it invisible to the programmer.
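A sketch of that decision logic, with entirely hypothetical driver and function names, might look like this:

```c
/* Sketch of the runtime dispatch Orthner describes: run in software, use an
 * already-loaded FPGA bitstream, or program the FPGA first. All names are
 * hypothetical; real drivers differ. */
#include <stdbool.h>
#include <stddef.h>

enum bitstream_id { BITSTREAM_NONE, BITSTREAM_FACE_DETECT /* , ... */ };

static enum bitstream_id loaded = BITSTREAM_NONE;   /* driver tracks what is in the FPGA */

bool fpga_load(enum bitstream_id id);               /* takes a couple of milliseconds */
int  fpga_face_detect(const void *img, size_t len); /* hardware implementation */
int  sw_face_detect(const void *img, size_t len);   /* slow fallback on the CPU */

int face_detect(const void *img, size_t len, int expected_calls)
{
    if (loaded == BITSTREAM_FACE_DETECT)            /* hardware is already configured */
        return fpga_face_detect(img, len);

    /* Reprogramming only pays off if the function will be invoked many times */
    if (expected_calls > 100 && fpga_load(BITSTREAM_FACE_DETECT)) {
        loaded = BITSTREAM_FACE_DETECT;
        return fpga_face_detect(img, len);
    }
    return sw_face_detect(img, len);                /* otherwise stay in software */
}
```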
Providing security
Any new architecture discussed today has to look at security and build in the necessary protection mechanisms from the very start. If an accelerator has access to shared memory, then it could provide a path into the rest of the system.
“Someone could attempt to hack your system through an accelerator that has access to the memory,” admits Singh. “That is perhaps the bigger risk, but there are built-in protection mechanisms that already exist. There are MMUs that do this. They give access control based on addresses. We are only interested in giving an accelerator access to the portion of memory that it is permitted to access.”
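In practice that means a system MMU checking every accelerator access against the address ranges the OS has granted. A toy version of such an address-range check, not tied to any real IOMMU interface, might be:

```c
/* Toy illustration of address-based access control for an accelerator:
 * requests outside the windows granted by the OS are rejected. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct region { uint64_t base, size; };

static const struct region allowed[] = {            /* windows granted to this accelerator */
    { 0x80000000ULL, 0x10000000ULL },                /* example shared-data window */
};

bool accel_access_ok(uint64_t addr, uint64_t len)
{
    for (size_t i = 0; i < sizeof(allowed) / sizeof(allowed[0]); i++)
        if (addr >= allowed[i].base &&
            addr + len <= allowed[i].base + allowed[i].size)
            return true;
    return false;                                    /* everything else is faulted */
}
```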
IP availability
The decision to base CCIX on the PCI express standard gives IP developers a head start. “What that does is multiple things,” says Sachin Dhingra, senior product marketing manager for the IP group of Cadence. “It reuses the existing infrastructure on the physical side and, at the system level, the same connectors and the same ecosystem. From an IP perspective, you can leverage a huge chunk of the PCI express solution. The only part I am really changing is the transaction layer and the interface to the system bus.”
Fig 1: CCIX Interface IP. Courtesy Cadence.
But CCIX does push things a little further. “CCIX is 25GT/s, and PCI express Gen 4 tops out at 16GT/s, so you have to redesign the physical layer to support 25GT/s,” adds John Koeter, vice president of marketing for the IP division of Synopsys. “Then there is the transport layer. What we find is that people want to use this interface for either PCI express or CCIX, depending on their application. So we have to provide some muxing in the transport layer, and because of the latency we have a native interface on that.”
That provides a range of performance solutions. “If I am running CCIX at 16GT/s, I don’t have to change anything below the PCI express transport layer,” says Cadence’s Dhingra. “If I am doing 20GT/s or even 25GT/s, I will change small portions of it, but it is not much, just a small portion to do with the PHY. The rest stays the same.”
And PCI express is not standing still either. “The PCI-SIG just announced Gen 5, which takes that up to 32GT/s,” adds Koeter. “That throws in an interesting dynamic.”
The goal, though, is to pick what is right for a particular market and application, and then build the most cost-effective solution based upon a number of choices.
“At ARM we have licensees who create their own SoC around the ARM architecture and create their own CCIX environment—potentially using that with a third-party PCIe controller that is CCIX enabled to add that connectivity outside of the box,” says Defilippi. “We also have IP products where we have high-performance CPUs, and now we have the interconnect, as well, that has all of the on-chip components for CCIX. But we are working with the ecosystem partners to get that outside of the chip and provide the external connectivity and PHY layers necessary for the complete solution. So between ARM IP and the third-party IP, it is easy to stitch these things together as they do today for DDR or PCI in general.”
Conclusion
While CCIX is not the only new standard competing in this space, it has a strong start because it is riding the coattails of PCI express. As a result, IP providers are already seeing demand for it.
“PCI express was successful because it preserved the software model,” says Singh. “That mechanism is built into every OS distribution. It is pervasive. CCIX leverages all of that.”
Reader comments

There isn’t a cache in the human brain; neural networks are peer-to-peer, 3-D systems that look more like analog circuits.
CCIX is just an evolution of support for good old von Neumann SMP computing, and has little to do with new computing models.
https://en.wikipedia.org/wiki/Von_Neumann_architecture#Von_Neumann_bottleneck
(it’s a particularly bad use of FPGA transistors)
I completely agree with you. When I asked about this, it just came down to this: we have to make it easy for the existing software developers to understand it. Any change in their way of thinking is seen as too big a barrier to acceptance. With Moore’s Law coming to an end, new, more efficient models of computation will become absolutely essential at some point.
There are other ways to run (old) code such that cache coherency is not required despite massive parallelism, I have a pending patent –
Patent US20150256484 – Processing resource allocation
You should really look at what AMD is doing with CCIX… Their whole infrastructure for future systems supports it… I thought they might go with OpenCAPI because of IBM, but ARM does move faster…
Following on from Kev’s comment, the CCIX / deep learning use cases sound like the old round peg/square hole routine. Training on a GPU whilst simultaneously and coherently sharing weights with an FPGA inference engine? How often do you train at the *same time* as you infer? I’m no data center expert, but that seems like an odd use case. I’d also question using an FPGA for high-performance inference… this is a classic ASIC problem.