Dealing With Performance Bottlenecks In SoCs

SoCs keep adding processing cores, but they are less likely to be fully utilized because the real bottlenecks are not being addressed.

A surge in the amount of data that SoCs need to process is bogging down performance, and while the processors themselves can handle that influx, memory and communication bandwidth are straining. The question now is what can be done about it.

The gap between memory and CPU bandwidth — the so-called memory wall — is well documented and definitely not a new problem. But it has not gone away. In fact, it continues to get worse.

Back in 2016, John McCalpin, a research scientist at the Texas Advanced Computing Center, gave a talk in which he looked at the balance between memory bandwidth and system resources for high-performance computing (HPC). He analyzed the then-current TOP500 machines and dissected their core performance, memory bandwidth, memory latency, interconnect bandwidth, and interconnect latency. His analysis showed that peak FLOPS per socket were increasing 50% to 60% per year, while memory bandwidth increased only about 23% per year. In addition, memory latency was degrading at about 4% per year, and interconnect bandwidth and latency increased about 20% per year. These trends point to a continued and widening imbalance with respect to data movement.

What this means is that when streaming data, every memory transfer takes as much time as roughly 100 floating-point arithmetic operations. When memory latency is fully exposed, because you cannot prefetch and you miss the cache, you have lost the chance to do more than 4,000 floating-point operations.
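As a rough illustration of how those rates compound, here is a minimal Python sketch using the approximate annual figures quoted above. The growth rates are the article's round numbers, not measured data, and the 10-year horizon is arbitrary.

```python
# Compounding the approximate growth rates quoted above (illustrative only).
flops_growth = 1.55   # peak FLOPS per socket: ~50%-60% per year
bw_growth = 1.23      # memory bandwidth: ~23% per year

gap = 1.0             # compute-to-bandwidth imbalance, normalized to year 0
for year in range(10):
    gap *= flops_growth / bw_growth
print(f"Compute/bandwidth gap grows ~{gap:.1f}x over 10 years")

# The "100 FLOPs per streamed word" and "4,000 FLOPs per cache miss" figures
# express the same imbalance as opportunity cost: the time for one memory
# access divided by the time for one floating-point operation.
```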

Fig. 1: Imbalance in elements of system performance. Source: John McCalpin, TACC, University of Texas at Austin

A well-designed system is balanced. “If there are two solutions, and one uses transistors more efficiently, it will get more throughput per dollar and throughput per watt, and that’s what most people are going to want,” says Geoff Tate, CEO of Flex Logix. “It’s hard to deliver architectures that get high utilization, but the more utilization you can get, the better. Transistors still aren’t free.”

Others agree. “When looking at system performance, things are either compute-bound, memory-bound, or I/O-bound,” says Bill Jenkins, director of product marketing for Achronix. “As you make the compute faster, you need to put more emphasis on faster memory to keep up with the compute, and you also need higher-bandwidth interfaces to get the data in and out of devices.”

But the industry has a fascination with processing performance. “The capabilities of the compute unit, whatever it is, are important, but they’re often not the limiting factor in actual system speeds,” says Rob Aitken, a Synopsys fellow. “System speed is workload-dependent, and it’s determined by how fast data comes from someplace, gets processed somehow, and gets sent to wherever it is that it’s being used, subject to all of the miscellaneous constraints of everything along the way.”

That means it is impossible to build a system that is optimal for all tasks. The key is to make sure it is well balanced and not over-provisioned in any area.

Moving data
The cost of moving data certainly impacts system performance, but it also affects power, because moving a piece of data consumes orders of magnitude more energy than performing a computation on it. Completing a task generally means moving data through external interfaces into memory, from memory to the CPU, bouncing intermediate results between memory and CPU, and finally pushing results back out through the external interfaces.

“No matter how fast your compute, or how big your memory array, ultimately what is going to determine the performance of your chip and your system is the bandwidth of the bus that connects the two,” says Pradeep Thiagarajan, principal product manager for analog and mixed-signal IC verification solutions at Siemens EDA. “That’s where you have the big bottleneck. And it’s not just a bus. It’s basically your transceiver, SerDes links, and it brings a whole different dimension to a problem that needs to be solved.”

One of the biggest advances in effective memory bandwidth has been the adoption of caches. A cache effectively brings memory closer to the processor and decreases latency, provided that most accesses hit the cache rather than going to main memory. However, cache performance has been going down, and this is one of the main contributors to increasing latency (as shown in Figure 1, above). Even the introduction of HBM has not managed to reverse the trend, because processor performance is increasing so rapidly, mainly through the rapidly growing number of cores. McCalpin says the degradation in latency is because caches are getting more complex, especially as more cores are kept coherent, and because lookups in multi-level caches are serialized to save power.
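A standard way to see why cache behavior dominates effective latency is the average memory access time (AMAT) relation. The sketch below is a generic two-level example with purely hypothetical hit rates and latencies, not figures drawn from McCalpin's data.

```python
# Average memory access time for a two-level cache hierarchy.
# Latencies (ns) and hit rates are hypothetical, for illustration only.
l1_latency, l2_latency, dram_latency = 1.0, 10.0, 100.0
l1_hit_rate, l2_hit_rate = 0.95, 0.80

amat = l1_latency + (1 - l1_hit_rate) * (
    l2_latency + (1 - l2_hit_rate) * dram_latency
)
print(f"AMAT = {amat:.2f} ns")  # 2.50 ns: the few misses that reach DRAM dominate
```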

The other alternative is to move the compute closer to the memory. “The era of in-memory computing is just beginning,” says Marc Greenberg, product marketing group director for Cadence‘s IP Group. “I see three ways that this can happen. Typically, we don’t see a lot of complex logic on DRAM dies because of the economics of DRAM manufacturing. What we may see is a small number of very specific functions getting added to those dies — for example, an accumulate or multiply-accumulate function, which is common in many DSP and AI algorithms. The second might be technologies like CXL.mem, where it is quite feasible to add compute functions to the logic die that controls the memory array. Technically this is processing near memory rather than processing in memory. The third is somewhere between the two. For certain stacked memories like HBM, there is typically a logic die co-packaged with the DRAM in the same stack, and that logic die is the interface between the bus that faces the CPU and the DRAM devices. That logic die gives scope for low-to-medium complexity processing elements on the logic die.”
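To make the appeal of a memory-side accumulate concrete, the sketch below compares the bus traffic of a simple reduction done on the host with the same reduction done next to the memory. It is a counting exercise under stated assumptions (word size, element count), not a model of any particular product.

```python
# Traffic for summing n_values 32-bit words, host-side vs. near-memory.
# The counts are illustrative; real systems add command and metadata traffic.
n_values = 1_000_000
word_bytes = 4

host_side_bytes = n_values * word_bytes   # every operand crosses the memory bus
near_memory_bytes = word_bytes            # only the final sum crosses the bus
print(f"Bus traffic reduced ~{host_side_bytes // near_memory_bytes:,}x "
      f"for a pure reduction")
```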

The success of HBM certainly has helped popularize the notion of chiplets, where chips that had become reticle-limited or yield-limited now can be split across multiple chiplets and integrated into a package. However, this requires die-to-die connectivity solutions that are likely to be slower than connections on a single die. “When companies are splitting a chip into multiple homogeneous dies, you want the same operation from a split chip, without any degradation in performance or accuracy,” says Sumit Vishwakarma, product manager at Siemens EDA. “You want to make sure there is pretty much zero latency between the two.”

In effect, these chiplets are being designed in the context of the system, and vice versa. “It’s not just the design of the memory or the controller,” said Niels Faché, vice president and general manager of PathWave Software Solutions at Keysight Technologies. “The IC design that goes into packages introduces its own parasitics. So you have to look at utilities and potential changes to impedance levels. And you really need to think of this as a system, and look at the eye diagram and see how you can optimize that based on the operating conditions in a system.”

To this end, design teams are looking at bringing into the package some of the functionality that previously existed outside of the package, substantially increasing bandwidth and reducing latency. “Depending on the source, and the receiving side, the purpose of those is what determines the interface and the protocol,” says Siemens’ Thiagarajan. “It will be one thing for compute-to-compute. That same interface will be very different for compute-to-memory. It can be very different again for compute-to-I/O. We now see HBM stacks that are within the same package, and they need interfacing, as well. You’ve got so many protocols – USB, SATA, PCIe, CXL, DDR, HMC, AXUI, MIPI — the list goes on and on. Newer protocols are being created, because of the requirements, and there is a need for new receivers for these die-to-die connects.”

One of the big advantages of multi-die systems is that the number of connections available becomes much larger. “From an I/O perspective, we used to have 1,024-bit buses and then we went to serial interfaces,” says Jenkins. “But what has happened recently is those serial interfaces have now become parallel, such as x32 PCIe, which consists of 32 lanes of very high-speed serial connections.”
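As a sense of scale for such a link, the back-of-the-envelope below assumes PCIe 5.0 signaling (32 GT/s per lane, 128b/130b encoding). The generation is an assumption, since the quote does not specify one, and protocol overhead beyond line coding is ignored.

```python
# Hypothetical raw bandwidth of a x32 PCIe link, assuming PCIe 5.0 signaling.
lanes = 32
transfers_per_sec = 32e9      # 32 GT/s per lane (assumption)
line_coding = 128 / 130       # 128b/130b encoding
bits_per_transfer = 1

bandwidth_GBps = lanes * transfers_per_sec * line_coding * bits_per_transfer / 8 / 1e9
print(f"~{bandwidth_GBps:.0f} GB/s per direction, before protocol overhead")
```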

Parallelization extends to multi-core systems, as well. “With a four-core system running something like an operating system, you’ve got some operations that can be parallelized and other ones that are inherently sequential,” said Roddy Urquhart, senior technical marketing director at Codasip. “This is where Amdahl’s Law applies. Then, there are other emerging challenges like AI/ML, where you can exploit data parallelism, and by using that data parallelism you can develop very specialized architectures for dealing with very specific problems. There are some opportunities within embedded devices, too. We’ve been doing some research using a fairly conventional three-stage pipeline, 32-bit RISC-V core using Google’s TensorFlow Lite for Microcontrollers to do the quantization, then creating custom RISC-V instructions to accelerate neural networks using very limited computing resources. Now, this would work well at the edge of IoT where perhaps you’ve got simple sensing or simple video processing to do. But for something like augmented reality or autonomous driving, you’re dealing with much greater quantities of video data. The way to exploit that is to exploit the inherent parallelism of the data.”
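Amdahl's Law, which Urquhart mentions, puts a number on how quickly the sequential fraction caps the benefit of extra cores. The parallel fractions below are hypothetical.

```python
# Amdahl's Law: speedup on n cores when a fraction p of the work parallelizes.
def amdahl_speedup(p: float, n: int) -> float:
    return 1.0 / ((1.0 - p) + p / n)

for p in (0.50, 0.90, 0.99):
    print(f"p={p:.2f}: 4 cores -> {amdahl_speedup(p, 4):.2f}x, "
          f"64 cores -> {amdahl_speedup(p, 64):.2f}x")
```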

That helps significantly on the processing side, but it’s only part of the solution. McCalpin says the focus has been on making DRAMs bigger, not on making them faster. DRAM cycle times have remained essentially unchanged over the past 20 years, and all of the improvement in performance has come from transferring larger chunks of data per burst. If more communication channels become available through HBM, memory cycle time may become the bottleneck.
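A deliberately simplified model of that last point: hold the core cycle time flat and grow only the burst (prefetch) length, which roughly mirrors how DDR generations have gained bandwidth. The cycle time and channel width below are illustrative, not datasheet values.

```python
# Bandwidth per bank if each core cycle delivers one burst (simplified model).
core_cycle_ns = 45.0        # assumed roughly constant across generations
bus_width_bytes = 8         # 64-bit channel

for gen, burst_length in [("DDR2 (BL4)", 4), ("DDR3/4 (BL8)", 8), ("DDR5 (BL16)", 16)]:
    gb_per_s = burst_length * bus_width_bytes / core_cycle_ns
    print(f"{gen}: ~{gb_per_s:.1f} GB/s per bank from the same core cycle time")
```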

Workloads
As previously mentioned, system performance is dependent upon workloads. It is not possible to optimize a general-purpose machine for everything. “Figuring out that balance is forcing a rethink in how people approach this problem,” says Aitken. “Depending on who you are and what you’re doing, the solution to the problem of, ‘I have a very specific workload that I understand, and I have enough control of my own compute universe,’ is that I can actually design something that’s customized to optimize my workload, or workloads like my workload, in ways that benefit whatever it is that I’m trying to do.”

Even a task like AI represents different workloads. “If you look at AI there are two aspects to it,” says Siemens’ Vishwakarma. “One is training, and with training you need to constantly access the memory because the weights reside there. And you’re constantly changing the weights, because you’re training the model. There, communication is the key. However, if you look at inference, the model is already trained and all you have to do is the MAC operation. You’re not changing the weights. The weights are fixed.”
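The contrast Vishwakarma draws shows up even in a toy NumPy example: inference only reads the weights, while each training step reads them and writes them back. The shapes, learning rate, and stand-in gradient below are arbitrary illustrations.

```python
import numpy as np

# Toy single dense layer to contrast memory traffic (illustrative shapes).
weights = np.random.randn(1024, 1024).astype(np.float32)
x = np.random.randn(1024).astype(np.float32)

# Inference: a fixed MAC pass; the weights are only read.
y = weights @ x

# Training: the same MAC pass, plus a weight update that writes every
# weight back to memory on every step.
grad = np.outer(y, x)            # stand-in for a real backward pass
weights -= 1e-3 * grad
```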

Creating the right balance requires a co-design approach, says Aitken. “Is the way that I’m solving this problem, and the way that I’ve distributed my algorithm into its various components, the ideal way to solve it? Once you’ve established this is the general algorithmic structure that I want, you can map it onto some objects that have predefined compute capability, predefined bandwidth, and so on. If I decide I need a custom processing object, I can put that together. Those are all elements of the problem. There’s a lot of opportunity in that space, and that will become evident as more people want to try this stuff.”

Even within the hardware domain, there is a lot of co-design that needs to happen. “There’s an architecture phase where you assess various scenarios for multi-dies,” says Thiagarajan. “The primary focus of the architect is really throughput and bandwidth within the chip, and going out of the chip. On the other hand, you’ve got the physical design team that has to figure out the optimal size of the die. It can’t be too big because of yield and power. It can’t be too small because then you have to deal with smaller amounts of compute within each die. They are looking at it from a power and area perspective. And then you’ve got the design teams who have to build the interfaces and the protocols for them. The architecture team, the physical design team, and the design team are in a constant three-way battle to find an optimal point that makes everyone happy.”

Compute paradigm
For some problems, the traditional software paradigm itself can lead to inefficient solutions. This happened during the transition from single-core to multi-core, and again with the adoption of GPGPUs. The industry is waiting for the same shift to happen with the new generation of AI hardware.

“There was a realization point for the GPU, that it was a massively parallel compute object and could do all kinds of things beyond just rendering shapes,” says Aitken. “There’s a lot of effort being expended, by a lot of people, on what those kinds of architectures could look like in the future. With AI, there is a tension between the adoption of TensorFlow, or what have you, versus, ‘Can I conjure up some new method, a new architecture that will process some alternate way of solving similar problems better?’ There’s a lot of speculation on that. There are a lot of people trying things, but I don’t know that anyone has hit the level where somebody could do a GPGPU with it and say this is the way forward from now on. It would be cool if they did.”

And there are some very specific hardware steps that could be applied to AI to overcome the memory transfer problem altogether. “Analog computing looks at the same problem from a different perspective,” says Vishwakarma. “If you want to do an adder in digital, an adder will take about seven or eight gates. Each gate would have maybe four or five transistors. Just to add two numbers you’re looking at about 50 transistors. But if you take an analog approach, you basically join two wires. It’s current. For inferencing, where you have this MAC operation, which is multiply and accumulate, you can use analog compute and store the weights in a flash memory. Here you are taking a different approach to compute overall.”
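Written out digitally, the MAC Vishwakarma refers to is just an explicit multiply and add per term, as in the sketch below; in an analog in-memory scheme the same accumulation happens implicitly as currents summing on a shared bit line. The code is purely illustrative.

```python
# Multiply-accumulate written out explicitly, as digital hardware does it.
def mac(weights, activations):
    acc = 0.0
    for w, a in zip(weights, activations):
        acc += w * a       # one multiplier and one adder per term in hardware
    return acc

# In an analog in-memory version, each weight is a stored conductance, the
# activation is a voltage, and the accumulate is currents summing on a wire.
print(mac([0.5, -1.0, 2.0], [1.0, 0.25, 0.5]))   # 1.25
```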

“I have been a big fan of this technology for decades,” says Cadence’s Greenberg. “And yet it always seems that when it’s about to take off, it gets superseded by advancements in the digital domain. Maybe one day we’ll see analog computing as a ‘More than Moore’ technology, but we’re just not quite at the point where analog is beating the digital domain.”

Several promising startups in the analog space have failed. It is tough to sell a completely new concept when it is difficult to compare. “Computing capability is often viewed through the lens of benchmarking that data center architects use as a method to assess vendor solutions,” says Maurice Steinman, vice president of engineering for Lightelligence. “Benchmark results are typically expressed as raw performance, or performance per ‘something else that matters,’ such as cost, area, or energy – basically how much work is getting done and at what cost.”

And the industry seems to have blinders on when it comes to raw processor performance. “The CPU itself has a certain level of raw compute power — basically, what’s the single-thread performance of this thing,” says Aitken. “Even though it’s not the entire metric, it’s still a useful metric for what the system is capable of doing. And then there are broader metrics for things like ops per watt. That’s a key metric for the overall efficiency of the system. It is challenging to move past the obsession with raw TOPS per watt on accelerators, but what is it operating on, and how did the necessary data get there? And how many watts did that take? That’s left as an exercise to the user.”
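Aitken's point about where the watts go can be made with a back-of-the-envelope calculation. The per-operation energies below are order-of-magnitude figures of the kind often quoted in the literature, not measurements of any particular device, and the reuse factor is hypothetical.

```python
# Effective TOPS/W once operand traffic is charged to the accelerator.
mac_energy_pj = 1.0       # one multiply-accumulate in recent CMOS (rough figure)
dram_access_pj = 640.0    # one 32-bit operand fetched from off-chip DRAM (rough figure)
reuse = 10                # MACs performed per operand fetched (hypothetical)

energy_per_mac_pj = mac_energy_pj + dram_access_pj / reuse
peak = 1.0 / (mac_energy_pj * 1e-12) / 1e12
effective = 1.0 / (energy_per_mac_pj * 1e-12) / 1e12
print(f"peak ~{peak:.1f} TOPS/W, effective ~{effective:.2f} TOPS/W "
      f"once DRAM traffic is counted")
```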

Conclusion
Adding more or faster processing cores is great, but unless you can keep them busy you are wasting time, money and power. The likelihood that you can keep them busy is decreasing. The number of algorithms that have the right memory transfer-to-compute ratios is diminishing.

Even as DRAM migrates into the package, where potential bandwidth can be expected to keep increasing, it is increasingly concerning that core DRAM performance has not improved over the past 20 years. If the DRAM manufacturers cannot solve this problem, the industry will have to architect its way around it.


