Re-Imagining The GPU

How a basic processing element is being transformed by RISC-V, partitioning and inferencing.


John Rayfield, CTO at Imagination Technologies, sat down with Semiconductor Engineering to talk about RISC-V, AI, and computing architectures. What follows are excerpts of that conversation.

SE: What are your plans for RISC-V?

Rayfield: We’re actively finalizing the integration of RISC-V cores into future-generation GPUs. That work has been going on for several months. Moving forward, we’re leveraging RISC-V inside our IP blocks in the GPU, and very likely into other IPs such as connectivity. Traditionally we’ve used a mix of things. Imagination historically has had a proprietary core called META. But also, through a brief marriage with MIPS, some of the existing products shipped with MIPS cores. We’re in the process of moving all of these over to RISC-V as a part of our product strategy. We see RISC-V as a very big space that is gaining a lot of traction. We see it in our customer base alongside our other IPs. And, of course, our IPs will include it, as well. A number of entities are leveraging RISC-V and licensing RISC-V processor cores. We’re partnering with people in this space, such as SiFive. We’ve got our IP set up to work alongside SiFive cores in many cases, and we’ve got joint customer engagements going on.

SE: Is a lot of the business in China, or is it scattered around the rest of the world, too?

Rayfield: It’s pretty well distributed. Less than 50% of our business is in China, but China is becoming an increasingly large market for us.

SE: There has been a lot of talk about China setting up its own separate supply chain with RISC-V as one of the key components. In fact, the RISC-V Consortium has now moved to Switzerland to be free of any trade-war constraints. How does that impact RISC-V adoption?

Rayfield: That’s not entirely clear, but the fact that RISC-V is an open architecture is beneficial for all of us. It does provide more freedom. We’re a participant in the RISC-V Foundation, and we are participating more actively over time. Building a big software ecosystem depends on things being very stable, so it is important that there’s control over the architecture even though it’s flexible. If you look at the embedded processor landscape, over the years it has become very polarized around Arm. There is an appetite for having some competition in that space. We’re seeing people pick up on RISC-V because of that. And that’s in China, but also outside of China. We see it having a lot of traction in the West, particularly in what I describe as deeply embedded sockets rather than in open, customer-facing sockets — although that’s also changing over time.

SE: Is there a pressure to start producing hard IP for some of the advanced packaging? The power/performance benefits of scaling are diminishing, so many chipmakers are looking at architectural improvements to make up for that. Chiplets are the latest development.

Rayfield: Most of our business today is in the consumer and mobile space. We are seeing activity in the high-performance compute space, and in that area those types of approaches are very interesting to us. It’s certainly on our radar. In those very high-performance markets, where you almost put area to one side and focus on performance, and on how to get the dissipated power out of the package, those kinds of approaches certainly become important. We will work with partners to solve a bunch of those problems. We’re not going to get directly involved in packaging, though. It will be a partnership model.

SE: GPUs are everywhere. They’ve been used forever, but one of the big issues as we started getting into AI is the power consumption of GPUs. How is Imagination addressing that?

Rayfield: Let’s split this into two parts — inference and training. Training is where people like Nvidia dominate with high-performance GPUs and card systems, and their focus is on compute performance. Today we’re not a major player in that space. But on the inference side, we are a player. We ship neural network accelerators (NNAs) — 3NX is the current product line — and those are IP highly optimized around power efficiency. It’s an architectural play to address the power issues. If you narrow the design space and focus on what the algorithms and the networks look like and how they actually map to hardware, you can make some very smart decisions about the architecture and focus on power. But that same architecture is not suitable for training. It’s a very different problem. It’s a different set of optimizations. What we’ve tended to do is look at the market and lean toward the edge. At the edge there’s lots of inference, and it overlaps with some of the consumer sockets. That’s where we’ve focused rather than high-performance compute, although this does get a bit muddy. In automotive, which is pushing very hard on high-performance compute, there’s definitely a crossover, and we’re seeing increasing use cases for GPU compute acceleration in that space. That is typically people who are trying to migrate from a big rack of stuff containing off-the-shelf, very high-performance GPUs to SoCs that are highly integrated and power optimized for particular AI applications.

SE: To really improve the efficiency of an inferencing chip, you’ve got to integrate hardware and software much more tightly than has been done in the past. How does that affect what Imagination is doing here?

Rayfield: You’re spot on. It’s a balance between being aggressive on architectural things you can do — things like getting the power efficiency up — while not going so far afield that it no longer makes sense for the software, or becomes a completely intractable software problem. That’s a balance we’re constantly dealing with in our architecture teams. In inference, one of the biggest costs is data movement. As with most pieces of silicon, if you start moving data in and out of external memory with a lot of bandwidth, your power consumption goes through the roof. We’ve invested quite heavily on the software side in some quite clever compiler technology that takes neural networks and remaps them to our hardware in an optimal way, reducing memory traffic and hence power, which correlate almost directly in the system. That’s a great example of addressing some of the things you’re citing.
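The memory-traffic point can be made concrete with a toy model. The function below is a sketch, not Imagination's actual compiler: it estimates how many bytes a chain of network layers moves to and from external memory, and shows why keeping intermediate activations in on-chip buffers (layer fusion) slashes that traffic. All layer sizes are invented for illustration.

```python
# Toy model of why remapping/fusing network layers cuts external-memory
# traffic. Sizes and the fusion decision are illustrative only.

def dram_traffic_bytes(activation_sizes, fused):
    """Estimate bytes moved to/from external memory for a layer chain.

    activation_sizes: bytes of the input, each intermediate activation,
                      and the final output, in order.
    fused: if True, intermediates stay in on-chip buffers, so only the
           first input and final output touch DRAM.
    """
    input_bytes = activation_sizes[0]
    output_bytes = activation_sizes[-1]
    if fused:
        return input_bytes + output_bytes
    # Unfused: every intermediate is written out, then read back in.
    intermediates = activation_sizes[1:-1]
    return input_bytes + output_bytes + 2 * sum(intermediates)

acts = [1_000_000, 4_000_000, 4_000_000, 250_000]
print(dram_traffic_bytes(acts, fused=False))  # 17250000
print(dram_traffic_bytes(acts, fused=True))   # 1250000
```

Since external-memory bandwidth tracks power almost directly, the roughly 14x reduction in traffic in this toy case is also, to first order, a power reduction.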

SE: What happens when the algorithms change, because these things are in almost constant flux? How do you deal with that?

Rayfield: It’s still a very programmable engine. There’s a lot of flexibility in it. Depending on the end application, sometimes a host CPU is quite involved and sometimes not. It very much depends on the type of the network and the use case in the network as to whether a lot of interaction happens with the rest of the system, or if it is relatively standalone. We also have done enough work on the network side that we can steer customers toward more efficient solutions. So we often have people come to us with a particular type of network and we can say, ‘Have you considered this one?’ And we can show them the results they’ll get are equivalent, but it will run much more efficiently on the underlying hardware and deliver much better power and performance.

SE: Are you doing any slicing out of useless data upfront? There are two ways of approaching this, right? You can minimize the movement of data, but you also can get rid of some of the data up front.

Rayfield: Yes, and all these kinds of tricks come together. It’s the coalescing of them all into the end product that brings the overall performance and power benefits.

SE: Another piece involves neural network acceleration, and how do you do compute more per unit of energy, right?

Rayfield: Yes, and this is a high-level width-versus-utilization tradeoff. If you go very wide and do lots in parallel, at some point you reach diminishing utilization. So it’s really about how we can map things. We have granularity within individual cores, and we also support multiple instances of cores. We have software that helps you, within a single core, maximize your data reuse — you’re minimizing energy — but also partition across multiple cores, using multiple instances to bring up the parallelism, basically.
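The width-versus-utilization tradeoff can be sketched numerically. In this toy model (the numbers and the simple `min` utilization rule are assumptions, not any vendor's real hardware), a MAC array of width W can only be filled up to the parallelism a layer actually exposes, so effective throughput saturates and extra width is wasted.

```python
# Toy model: effective throughput of a MAC array saturates once the
# array is wider than the parallelism the workload exposes.

def effective_macs_per_cycle(width, available_parallelism):
    # Simplistic rule: utilization falls off linearly past the point
    # where the workload can no longer fill the array.
    utilization = min(1.0, available_parallelism / width)
    return int(width * utilization)

# A layer exposing 256-way parallelism sees no benefit past width 256.
for w in (64, 128, 256, 512, 1024):
    print(w, effective_macs_per_cycle(w, available_parallelism=256))
```

Partitioning across multiple narrower core instances, each fed an independent tile of work, is one way to recover parallelism that a single very wide array would waste.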

SE: Parallelization is great on one level. You can crank it up if you have parallelizable applications and data and computations, which is great for the MACs. The problem is you’re not always moving that kind of data through. Are you able to segment zones to turn ‘this’ on and turn ‘this’ off as needed?

Rayfield: Yes, we have both coarse-grain and fine-grain control. It’s power gating, basically, so you’re not paying for parts of the array you’re not utilizing. The best use of hardware is 100% utilization, because you’re always paying something, even if it’s just the last 10%. We keep pushing the software to get utilization as high as possible. And where parts of the architecture are idle, they’re in a low-power state and minimized as much as they can be.

SE: And this is really one step above where Imagination plays, right? So basically you sell this to your customers and your customers have to figure out what the likely use case will be for this.

Rayfield: Yes, but we actually get very involved in the dialogue, the back and forth, and making suggestions as to the topology. So yes, ultimately our customers are making those decisions, but given our experience we can help them along the path to the right set of decisions.

SE: Imagination has something called a Hyperlane in its GPUs. What is that?

Rayfield: The Hyperlane technology lets you partition the GPU into slices. So if you’ve got a big, fat instance of one of our GPUs, and at any particular time you want to use part of it for graphics and part of it for compute, you can separate those two worlds very cleanly. You can think of it as something like a hypervisor equivalent from the CPU world. You can separate the state and have a clean partition between the two. They’re not interfering with each other. From a software perspective it looks like two GPU cores, and this is a nice feature because it’s very generally applicable. Another example involves safety-critical things. In automotive you might have part of a display contain safety-critical information while another part of it is more general rendering, and you might want to partition that very safely in your system design. Part of the GPU works with very safe software running in the system.

SE: Partitioning has become a big issue in a lot of computing these days, and it’s both the hardware and the software. But where do you draw the lines for what gets partitioned and where?

Rayfield: There are a lot of use cases now, so this is becoming increasingly relevant, which is why we’ve driven down this path. In many systems the GPU is being multi-purposed, whether that’s multiple areas on a display, or being partitioned between compute and graphics. It’s becoming increasingly common as we move forward. This is a natural evolution of systems, in a way. The same thing happened with CPUs, where some systems have safe and not-so-safe operations. Safety-critical systems are basically where hypervisors come into play. You can even find CPUs with one operating system running hard real-time, safety-critical things and another operating system like Linux running the rest of the system, but not on the same core or core complex. These types of ideas are scaling through whole system architectures. One of the real advantages we have in the GPU space is that we embraced that very early, and it’s a very clean implementation.

SE: How does that work with neural network accelerators?

Rayfield: We architected the two so that they work nicely together in a system, and with other external IP in the system. We also have software that can target both the GPU and the NN for different types of acceleration. But that’s a hard problem. We have good examples of that, but it’s definitely just emerging at this point.

SE: Any changes in memory types or approaches?

Rayfield: We do things like trying to batch up transactions and so on. Ultimately, the memory controller piece is not an IP that we carry and pursue, but we’re very conscious of the type of accesses that are efficient.

SE: You do have to get the data out to and back from the memory, so you want to move the processing as close as possible, right? And you probably also want to widen the lanes for the data, too.

Rayfield: Absolutely, and if you look at our GPU architectures, it definitely is on the path of getting wider and wider. In some system designs we still scale it down to go narrow because that’s what customers want. But there are a lot of efficiencies to be had by going wide, and we leverage that wherever we can.

SE: One of the things about inferencing chips, in particular, AI in general, is that they adapt and they optimize as time goes on. How do you keep it within certain parameters, or even set those parameters?

Rayfield: Precision is absolutely important. One of the things everyone really plays up in inference is quantizing the network for efficiency, and we do, too. But we offer a very flexible approach, so people can pragmatically decide what width of data they want to pay for. Our engine can operate on words from four bits wide up to 32 bits. Even if you’re quantizing to 16-bit coefficients, for example, we have features that give you extended dynamic range — even with 16 bits. We have the tooling to let people go through some iterations and compare against the original, unquantized network, basically getting to equivalence in that sense. In general, people tend to develop with very wide floating point or double precision, but in reality almost all implementations end up with something less than that, whether it’s a special floating-point word or a quantized network.
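The width-of-data tradeoff can be illustrated with a generic symmetric linear quantization scheme. This sketch is not Imagination's tooling; it simply round-trips a set of weights through signed words of different bit widths and measures the reconstruction error, showing how error shrinks as width grows.

```python
# Generic symmetric linear quantization sketch: round-trip values
# through an n-bit signed representation and measure the error.
import random

def quantize_dequantize(values, bits):
    qmax = 2 ** (bits - 1) - 1                 # e.g. 7 for 4-bit signed
    scale = max(abs(v) for v in values) / qmax  # one scale per tensor
    return [max(-qmax, min(qmax, round(v / scale))) * scale
            for v in values]

random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(10_000)]

for bits in (4, 8, 16):
    deq = quantize_dequantize(weights, bits)
    mean_err = sum(abs(a - b) for a, b in zip(weights, deq)) / len(weights)
    print(bits, mean_err)  # mean error shrinks as the word widens
```

In practice the iteration described above is about finding the narrowest width (and scheme) at which task-level accuracy, not just numeric error, matches the unquantized baseline.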

SE: Is there anything new in security?

Rayfield: The Hyperlane is also beneficial for security. Whether you’re partitioning for safety or security, those features actually take you a long way down the path. And we have an initiative in the company around security to widen the feature set and ease integration into systems with more security. Security is becoming a baseline requirement, whatever the end application. You’ve got to get to a certain bar.

SE: What’s the next big focus area for Imagination? Where do you go from here?

Rayfield: You’ll see us continue to evolve the GPUs, aggressively moving forward with that roadmap. We have an established and strong position in the AI space and, again, we’re investing more there. We’re pretty active in automotive designs, so that’s an area where we’re investing considerably. And we’re seeing some good traction, because the level of compute in automotive is on quite a hard-charging vector right now.

[Editor’s Note: John Rayfield resigned from Imagination Technologies after this interview was conducted.]
