Processor Tradeoffs For AI Workloads

Gaps are widening between technology advances and demands, and closing them is becoming more difficult.

AI is forcing fundamental shifts in chips used in data centers and in the tools used to design them, but it also is creating gaps between the speed at which that technology advances and the demands from customers.

These shifts started gradually, but they have accelerated and multiplied over the past year with the rollout of ChatGPT and other large language models. There is suddenly much more data, and all of it needs to be processed faster, using less power, and with more functionality crammed into limited space.

To make matters worse, because this is new technology, it is also in a state of almost continuous change. That, in turn, is creating a number of difficult tradeoffs that are seemingly at odds with each other. For example, chips require built-in flexibility to account for these changes, but at the performance and energy efficiency normally associated with an ASIC. These designs also require tightly integrated software, but enough flexibility to measure and respond to power and performance impacts caused by changes to the algorithms. And all of this needs to happen earlier in the design cycle.

“If you go back a while on the hardware side in the data center, Intel may have seen this coming a decade ago when they bought Altera,” said Alexander Wakefield, applications engineering scientist at Synopsys. “The idea was that we can put an FPGA fabric right next to the x86 CPU and it will revolutionize the world. They paid a lot of money for Altera, absorbed them into the company, and then no major product appeared. Is FPGA really the right piece for it? Maybe not. It’s really great if you can take something, synthesize it into some sort of hardware logic, and put it in an FPGA. It’s like an AI, and it’s an accelerator. But is it the right accelerator? Maybe not. NVIDIA got it right, and the stock price has shown that. Customers want to take a workload that’s software-based and pull it onto a piece of hardware that has thousands of small processing units on the GPU, and they need to do very complex tasks that are GPU-ready.”

The generative AI revolution kicked off in 2017 with the publication of the seminal paper “Attention Is All You Need,” according to Arif Khan, senior product marketing group director for PCIe, CXL, and Interface IP at Cadence. “This paper described the transformer model that formed the basis of large language model (LLM) implementations that have driven applications such as ChatGPT and DALL-E, which have made AI a household term today. ChatGPT has been adopted more quickly than any other application to date, having reached 100 million users within two months of its launch. The training models use hundreds of billions of parameters to allow inferences to be made when users query these systems.”

AI/ML designs for training, inference, data mining, and graph analytics have differing needs. “For example,” Khan said, “the training SoCs require specialized architectures with TPUs/GPUs or custom designs that can perform the vector operations and share weights during training. Designs targeted for inference must respond to high volumes of queries and need higher-bandwidth networking interfaces.”

Chips in data centers already were pushing the limits in terms of physical size. Some of these chips exceeded the size of the reticle and had to be stitched together. Increasingly, that approach is being replaced by pushing upward into the Z dimension.

“Companies like AMD are very much into the phase of building 3D-IC designs, which are integrated in the vertical scale,” said Preeti Gupta, director of product management for semiconductor products at Ansys. “You’re putting semiconductor dies on top of each other, and not just next to each other like the two-dimensional layout we’ve done in the past. That’s really in order to meet PPA objectives while keeping the cost down.”

But this impacts how chips are designed, and it requires different tools, methodologies, and flows to automate the design process. Layouts need to take into account thermal effects and noise, as well as the behavior of different materials and structures over time. All of this increases the amount of data that needs to be processed, managed, and stored just in the design phase. How, for example, do design teams distribute all the data to be processed across various compute elements, and then ensure that when it is recombined the results are accurate? And how can more of this be done earlier in the flow, such as understanding the impacts of algorithm changes on hardware performance and power using real workloads?

“AI/ML designers want to optimize their algorithms early in the design flow,” Gupta said. “They also want to do this very rapidly — have multiple iterations within a day. Obviously, when you have designed your RTL and synthesized it to a netlist, and now you want to change the algorithm, that is a long loop. Design teams could gain at least 10x more productivity if they were to do these optimizations at RTL. Moreover, these AI/ML teams want to guide the design decisions using real application workloads. We are finding that these companies are now using very rapid early power profiling techniques to figure out, for a real application workload, how the peak power or a di/dt event changes if they change the AI algorithm. Imagine the power of being able to generate a per-cycle power waveform across billions of cycles multiple times a day as the AI algorithm is being optimized. They are using those rapid profiling approaches to optimize AI algorithms in the context of power.”
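
As a rough illustration of the kind of per-cycle analysis Gupta describes, the sketch below turns a switching-activity trace into a power waveform and flags the largest cycle-to-cycle current swing. The trace values, effective capacitance, voltage, and clock period are invented placeholders, not numbers from any particular tool or design.

```python
# Illustrative sketch: turn a per-cycle switching-activity trace into a power
# waveform and flag the worst di/dt (cycle-to-cycle current swing) event.
# The activity values, capacitance, voltage, and clock period are placeholders.

def power_waveform(toggles_per_cycle, c_eff_farads=1e-9, vdd=0.75, f_hz=2e9):
    """Dynamic power per cycle: P = C_eff * Vdd^2 * f * activity."""
    return [c_eff_farads * vdd**2 * f_hz * a for a in toggles_per_cycle]

def worst_di_dt(power_w, vdd=0.75, cycle_s=0.5e-9):
    """Largest cycle-to-cycle current swing, a proxy for di/dt stress."""
    currents = [p / vdd for p in power_w]
    deltas = [abs(b - a) / cycle_s for a, b in zip(currents, currents[1:])]
    worst = max(range(len(deltas)), key=deltas.__getitem__)
    return worst, deltas[worst]

# Example: compare two algorithm variants on the same (toy) workload trace.
trace_v1 = [0.20, 0.22, 0.80, 0.85, 0.30, 0.25]   # normalized activity per cycle
trace_v2 = [0.20, 0.35, 0.50, 0.55, 0.45, 0.30]   # smoother schedule of the same work

for name, trace in [("v1", trace_v1), ("v2", trace_v2)]:
    pw = power_waveform(trace)
    cycle, didt = worst_di_dt(pw)
    print(f"{name}: peak power {max(pw):.3f} W, worst di/dt {didt:.2e} A/s at cycle {cycle}")
```

Run on real workloads across billions of cycles, this is the sort of comparison that lets an algorithm change be judged by its peak-power and di/dt impact rather than by average power alone.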

On top of this, the timing of every facet of the design needs to be synchronized to achieve the performance and power goals, and to be functionally useful. “From the designer’s perspective, timing always has been very important in any kind of chip design,” said Mahmud Ullah, principal product applications engineer at Siemens Digital Industries Software. “But in recent times, we have seen that it’s not only just about timing. Power is also a concern. In that context, for a lot of designers who are designing data center chips — as well as chips for different areas, including CPUs, GPUs, IoT — power is a main concern. And specifically for engineering teams creating data center chips, they want to measure power at the beginning of their design cycles because power is one of the key constraints today.”

Put simply, much more data is being shifted left, and it’s creating a flood on the front end of the design cycle.

“The main thing is how accurately you can predict the power,” Ullah said. “At the SoC level, these are big designs, which can have almost a billion gates. The main purpose is to know how accurately you can predict that power. And in order to do that, the only way you can measure it is by running it on emulation tools, because that will let you see what end applications you will be running. There could be situations in which you design a new kind of software, and when you start running that software all of a sudden you see the chip is not working. In order to avoid this situation, it would be helpful to run real applications for your designs at a very early stage. In data center designs, SoC-level power estimation is used. From there, the engineering team runs their big designs with real applications and real stimulus coming in. Then they isolate the power-related issues at an early stage, fix those, and do further optimization.”

Cadence’s Khan also noted power consumption is a concern. “Training models are extremely power intensive, and maintaining these models for inference continues to consume power on an ongoing basis. Newer architectures for training are based on specialized architectures to optimize vector operations and locality of data movement, and there is an ever-growing number of startups working in this space. We see the impact in terms of design decisions like choice of memory: HBM versus GDDR versus LPDDR; the rise of chiplet-based partitions and dramatic demand for UCIe as a chiplet connectivity interface; and the increased deployment of CXL to support new memory architectures.”

That was much simpler, in retrospect, before the rollout of generative AI. The level of uncertainty and the amount of data that needs to be processed have exploded. There are many more options to consider, and all of this needs to be done reliably and quickly. But what’s the best approach, and how does that get architected and partitioned in a way where the power is manageable and the performance is sufficient, and where it can run full-bore without overheating?

Synopsys’ Wakefield questions whether a processor chip and an AI chip need to be integrated together on the same piece of silicon or on the same substrate. “Do they need to be placed on the same board? Definitely. It’s already happening. Does that increase the power needs and mean the size of these models will continue to grow? Yes. If you look at the cost that NVIDIA is able to get for one of their AI chips, it’s significant. Their list price is $30,000 to $40,000 for a single piece of silicon. Part of the problem is when you look at the power specs of these things, the latest NVIDIA GPUs use 450 to 500 watts of power. How do you keep this thing cool? How do you prevent the silicon from melting? How do you do it efficiently if you reduce the power needs for certain applications? That’s going to become a real problem later. Right now, it uses a lot of power, and people are prepared to eat the cost. But when AI gets more prevalent in lots of different things, you don’t want to spend 500 watts on that item plus the cooling cost. So maybe it’s a kilowatt for some particular task. In your vehicle, you don’t want a kilowatt of power going to the self-driving system. You want the kilowatt of power driving the wheels. The AI architectures will get better. They will get more refined, they’ll become more custom. Different companies are announcing different AI projects within them, and there are companies selling AI as IPs.”

The amount of compute horsepower that will be required for generative AI represents an inflection point in itself.

“Once you build an AI chip, and it’s got 1,000 AI cores inside it, customers want 2,000 or 4,000 cores in their next design,” Wakefield said. “The one after that is going to have even more. Then we’re going to 3D-ICs, and you’re going to be able to build these little pieces, stack them all together, and create stacks of these things that are all connected together. Intel’s Ponte Vecchio [now called Intel Data Center GPU Max Series] is as big as a credit card, with 30 tiles stitched together. To achieve the right yield and the right testing, each of those individual tiles may be a different technology node, and they may respin certain pieces of it, and then stitch them together to create a product. We’ll see more of that coming, as well.”

Moving and managing data
Driving many of these changes is AI, whether it’s machine learning, deep learning, generative AI, or some other variant. But the growing system demands are rapidly outpacing the ability to design those systems, creating gaps on every level and pushing faster adoption of new technologies than at any time in the past.

“If we look at all the technology scaling trends, taking memory for example, DDR memory’s performance doubles about every five years,” noted Steven Woo, fellow and distinguished inventor at Rambus. “But in the case of HBM, it’s faster. Every two to three years HBM doubles in speed. In general, the number of cores in processors is going to double every few years. While that may slow down, that’s been the historical trend. Then we look at AI — especially on the training side — and the demand is doubling every few months. So, we’re beginning to realize at this point, ‘Wow, there’s nothing I can do on the silicon alone that’s going to keep up with these trends.’ What everybody then says is, ‘Well, fine, if I can’t do it in one piece of silicon, I’m going to do it in lots of pieces of silicon. And then I’m going to just chain together more and more pieces of silicon.’”
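
The arithmetic behind Woo’s comparison can be made concrete with a back-of-envelope projection. The doubling periods below follow his comments, with “every few months” read here as roughly four months and “every few years” for core counts read as three years, both assumptions; the point is the relative growth, not the absolute factors.

```python
# Back-of-envelope projection of the gap Woo describes: growth factor after N
# years for quantities that double at different rates. Doubling periods follow
# the quote; "every few months" is read as ~4 months, which is an assumption.

def growth(years, doubling_period_years):
    return 2 ** (years / doubling_period_years)

horizon = 5  # years
trends = {
    "DDR bandwidth (doubles ~every 5 yrs)":      5.0,
    "HBM bandwidth (doubles ~every 2.5 yrs)":    2.5,
    "CPU core count (doubles ~every 3 yrs)":     3.0,
    "AI training demand (doubles ~every 4 mos)": 4 / 12,
}

for name, period in trends.items():
    print(f"{name}: {growth(horizon, period):,.0f}x over {horizon} years")
# The demand line grows by orders of magnitude more than any single-die trend,
# which is why the answer becomes many pieces of silicon chained together.
```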

That all makes sense in theory. The problem is partitioning tasks between all these processing engines and adding the results together at the end.

“The amount of work each one does goes down for each [engine] you add, and the amount of communication we have to do goes up because there are more [engines] to talk to,” Woo said. “It has always been the case that communication is very expensive. And today, if you look at the ratio of how fast compute is to communication, under some scenarios, the compute looks almost free. The communication is what your real bottleneck is. So there’s some limit on how far you can really go in terms of how much you’ll break down a problem, in part because you want the engines to have something to do. But you also don’t want to be doing so much communication that it becomes the bottleneck.”
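
A toy model of that partitioning tradeoff is sketched below, assuming a fixed amount of work split evenly across engines and a communication cost that grows with the number of engines exchanging results. All constants are illustrative; real systems depend heavily on the network topology and the algorithm being partitioned.

```python
# Toy strong-scaling model: total time = compute time (shrinks as work is split)
# plus communication time (grows with the number of engines exchanging results).
# All constants are illustrative, not measurements of any real system.

def run_time(num_engines, total_work=1e12, flops_per_engine=1e11,
             bytes_per_exchange=1e8, link_bytes_per_s=1e10):
    compute_s = (total_work / num_engines) / flops_per_engine
    # Simple all-to-all-like cost: each engine exchanges with every other one.
    comm_s = (num_engines - 1) * bytes_per_exchange / link_bytes_per_s
    return compute_s, comm_s

for n in [1, 4, 16, 64, 256]:
    c, m = run_time(n)
    print(f"{n:>4} engines: compute {c:8.3f}s  comm {m:8.3f}s  total {c + m:8.3f}s")
# Past a certain point, adding engines buys little compute time but keeps adding
# communication time, which is the bottleneck Woo describes.
```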

That’s just one facet of the problem, too. These problems are showing up everywhere.

“What happens is in certain kinds of markets, people are willing to say, ‘This is such an important problem that I’m going to design a special kind of communication network to solve all this,’” he noted. “We’re seeing this in the AI space where companies like NVIDIA have something called NVLink, which is their super-fast communication mechanism. Other companies have other methods. Google’s TPU has its own kind of network. There’s a lot of interest now in optics for that communication, because there’s a lot of interest in seeing silicon photonics technology mature. The feeling is once you lessen the impact of the communication problem, the compute engines become the big thing again. This is all about, ‘How do I look at this and make sure the communication isn’t the big bottleneck?’ One way to think about communication is it’s almost this necessary evil of what we have to do to break up the problems. But in and of itself, communicating data from one node to another doesn’t really advance the computation. It’s just a necessary evil to continue the computation.”

Trickier tradeoffs
There are lots of moving pieces involved in balancing all of the PPA requirements within a data center chip, and improvements or changes to any one of them often have an impact on at least one of the other two.

“On the software side, customers are building an AI accelerator, which is a combination of the hardware they sell, the silicon, and some sort of library or drivers or software layer that they sell with it,” Wakefield observed. “It’s the total performance of both of those things together that the end customer cares about. If your software is really bad, if your AI compiler is bad and utilizes the hardware badly, then you’re going to lose customers because you won’t stand up against the competition, which may have inferior hardware but a better software stack.”

This has put much more emphasis on up-front exploration. “You can go to one of the cloud providers and rent an NVIDIA GPU or A100 chip and run your workloads on it,” Wakefield noted. “They’ll charge you so much per minute to run it. Do they like buying these chips from NVIDIA at $30,000 or $40,000 apiece? Probably not. Will they build their own? Probably. We saw that with Amazon. At AWS, you can rent Graviton space, which is Amazon’s version of a core. It’s their own core, not Intel or AMD, so you’ll probably see the same sort of thing happening in the data center for various workloads, where maybe there’s custom silicon that’s a little more power/performance-wise optimal for a certain thing, or it’s some mix of regular processor and AI chip together in the same 3D-IC. Maybe that makes more sense. Then for certain custom applications, you’ll definitely see a custom ASIC that has the right combination of hardware that you need with the right power profile and performance profile for certain embedded-type applications, such as self-driving cars, security cameras, even your Ring doorbell that runs for two years off a battery.”

One of the biggest tradeoffs in this space is between memory bandwidth, capacity, and cost.

“It is a classic ‘pick two’ between bandwidth, capacity, and cost, and sometimes it’s a ‘pick one,’” noted Marc Greenberg, product marketing group director for DDR, HBM, flash/storage and MIPI IP at Cadence. “A low-capacity user might choose a single-rank DDR5 unbuffered DIMM (UDIMM) for the most cost-sensitive applications. To reach higher capacity, a dual-rank UDIMM could be used to double the memory capacity, at the expense of slightly higher loading that could slow down the DRAM bus, but with no other significant cost beyond the extra memory used.”

Data center users often choose registered DIMMs (RDIMMs), which allow for a further doubling of maximum capacity by supporting a larger number of DRAM dies per DIMM, but which come at the added cost of both the extra memory and a registering clock driver (RCD) chip introduced into each DIMM. “For even more capacity, a second DIMM socket on the channel can be added, which comes at the expense of further loading and degraded signal integrity on the DRAM bus, which can again affect bandwidth/speed,” Greenberg said.

Beyond this, higher-capacity DIMMs may further double or quadruple capacity by 3D stacking DRAM devices, which has little impact on loading but may add significant cost associated with the stacking. “And to add more capacity, a CPU manufacturer may add additional DIMM channels in parallel, which doubles bandwidth and capacity but also doubles the silicon area and package pins associated with DRAM on the CPU. This is an open area for innovation, and there are exciting developments to add both capacity and bandwidth to the DRAM bus under discussion,” he added.
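
The capacity ladder Greenberg walks through can be laid out as a simple enumeration. The capacity multipliers follow the text, where each step roughly doubles capacity; the base capacity, relative costs, and penalty notes are illustrative placeholders rather than vendor figures.

```python
# Rough enumeration of the DDR5 capacity ladder described above. Capacity
# multipliers follow the text (each step roughly doubles); the base capacity,
# relative costs, and penalty notes are illustrative placeholders.

BASE_GB = 16  # assumed capacity of a single-rank UDIMM channel, illustrative

ladder = [
    # (option, capacity multiplier, relative cost, main penalty)
    ("Single-rank UDIMM",          1,  1.0, "baseline"),
    ("Dual-rank UDIMM",            2,  2.0, "slightly higher bus loading"),
    ("RDIMM (adds RCD chip)",      4,  4.3, "extra memory plus RCD cost"),
    ("2 DIMMs per channel",        8,  8.6, "degraded signal integrity"),
    ("3DS-stacked RDIMM",         16, 18.0, "little loading impact, high cost"),
    ("Extra DIMM channel (x2)",   32, 36.0, "doubles CPU pins/area, 2x bandwidth"),
]

for option, mult, rel_cost, penalty in ladder:
    print(f"{option:<26} {BASE_GB * mult:>5} GB  ~{rel_cost:>4.1f}x cost  ({penalty})")
```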

Building chips in the future
Wakefield believes we will see more of this happening in the AI space. “It’s currently still a bit of the Gold Rush stage, where people need to get the chip out as fast as possible. If it uses some extra power today, it doesn’t matter that much. People are paying a large amount of money for an AI chip. The power that they consume is a factor, but it’s not that big of an issue. As the industry matures a bit, then you’ll see the power portion become much more of a factor. How do you stop these things melting? If you could make it twice as big, you would, but now it’s going to be one kilowatt, and one kilowatt melts the silicon.”

That adds cost. Managing the power and the overall thermal footprint is important. It’s also expensive to get it right, and worse to get it wrong.

“We have to worry about thermal runaway, and have the ability to look at real application workloads and be able to help designers make those architectural decisions,” Ansys’ Gupta added. “Let’s say you have an AMD GPU intended for the data center, and it has tens of different thermal sensors. You’re looking at one of the thermal sensors and observing its temperature. We know that GPU performance is limited by power, but what does that mean? It means that as the GPU is running, and maybe a child is running a gaming application on the GPU, for example, the temperature goes up for that die. As soon as the thermal sensor detects that threshold — let’s say it’s 100°C — it will trigger logic to reduce the frequency of that processor. And because it has to cool down that chip in order for it to function and not cause the thermal runaway problem, the moment the frequency goes down the user has a less than optimal experience. They’re able to run the game, just slower. So, all of these companies are very focused on understanding these real use cases early and being able to design the dynamic voltage and frequency scaling to cater to them, and to place the thermal sensors at the right locations. If you have a billion-instance data center chip, you can’t have a billion thermal sensors. So where do you place those thermal sensors? And which are the thermal or power hotspots within the design?”
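
A minimal sketch of the throttling behavior Gupta describes, assuming a single thermal sensor, a fixed 100°C trip point with a 90°C release threshold, and a toy thermal model; real DVFS controllers, sensor networks, and package thermals are far more involved than this.

```python
# Minimal sketch of thermal throttling: when a sensor crosses its trip point,
# drop to a lower DVFS state; when the die cools, step back up. The thermal
# model and all frequency/temperature numbers are toy values for illustration.

TRIP_C, RELEASE_C = 100.0, 90.0        # throttle above 100C, recover below 90C
DVFS_GHZ = [2.4, 1.8, 1.2]             # assumed frequency states, fastest first

def step_temperature(temp_c, freq_ghz, ambient_c=45.0, heat_per_ghz=35.0):
    """Toy thermal model: heating scales with frequency, cooling with the gap to ambient."""
    return temp_c + 0.1 * (heat_per_ghz * freq_ghz - (temp_c - ambient_c))

state, temp = 0, 60.0
for cycle in range(60):
    temp = step_temperature(temp, DVFS_GHZ[state])
    if temp >= TRIP_C and state < len(DVFS_GHZ) - 1:
        state += 1                      # sensor tripped: throttle down
    elif temp <= RELEASE_C and state > 0:
        state -= 1                      # cooled off: restore performance
    if cycle % 10 == 0:
        print(f"cycle {cycle:2d}: {temp:5.1f} C at {DVFS_GHZ[state]} GHz")
```

Even this toy loop shows why sensor placement matters: the controller can only react to temperatures it can actually observe, which is the motivation for identifying hotspots early in the design.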

Related Reading
Making Tradeoffs With AI/ML/DL
Optimizing tools and chips is opening up new possibilities and adding much more complexity.
AI Adoption Slow For Design Tools
While ML adoption is robust, full AI is slow to catch fire. But that could change in the future.


