Nvidia’s Top Technologists Discuss The Future Of GPUs

Power, performance, architectures, and making AI pervasive.


Semiconductor Engineering sat down to discuss the role of the GPU in artificial intelligence, autonomous and assisted driving, advanced packaging and heterogeneous architectures with Bill Dally, Nvidia’s chief scientist, and Jonah Alben, senior vice president of Nvidia’s GPU engineering, at IEEE’s Hot Chips 2019 conference. What follows are excerpts of that conversation.

SE: There are some new technology trends underway, such as moving to chiplets and breaking chips up into smaller pieces. How does Nvidia view this?

Dally: Especially in our research organization, we constantly develop and evaluate technologies for building systems in different ways. [At Hot Chips, a team] demonstrated a technology for assembling a system-in-package by mounting chiplets on an organic substrate using a signaling technology called ground-referenced signaling (GRS), which we originally developed about five or six years ago. It has two really nice properties. One is that it’s very low energy per bit, about a picojoule per bit. Compare that to a typical SerDes, which may be around 6 or 7 picojoules per bit. And GRS is single-ended, so it’s very dense. It runs at 25 gigabits per second, but that’s like a 50-gigabit SerDes, because a 50G SerDes takes two tracks out from under the edge of the chip, and tracks out from the edge of the chip are an expensive resource. Pins are cheap, but getting route-outs is what tends to limit you. So GRS has very high edge density. We’ve also already demonstrated, with our Volta and Pascal modules with HBM, that we can assemble multiple chips on a silicon interposer, which is a technology that’s both denser and gives you more connections. We get about a terabyte per second per millimeter with GRS. You can get around 4 terabytes per second per millimeter on a silicon interposer, so the interposer is about four times denser in terms of bandwidth per unit length, and it’s much lower energy. You can get energies down to about a tenth of a picojoule per bit connecting things on a silicon interposer. This gives us a set of technologies on the shelf so that if, at some point, it becomes economically the right thing to assemble GPUs from multiple chiplets, we have basically de-risked the technology. Now it’s a tool in the toolbox for a GPU designer.
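As a back-of-the-envelope check, the energy-per-bit figures quoted here translate directly into link power at a given bandwidth. A minimal sketch, using only the numbers from the interview (the 6.5 pJ/bit SerDes value is an assumed midpoint of the quoted 6-7 pJ/bit range):

```python
# Link power from the energy-per-bit figures quoted above: a typical SerDes
# at ~6-7 pJ/bit, GRS at ~1 pJ/bit, and an on-interposer link at ~0.1 pJ/bit.

def link_power_watts(bandwidth_tbytes_per_s: float, energy_pj_per_bit: float) -> float:
    """Power drawn by a link moving `bandwidth_tbytes_per_s` terabytes/s."""
    bits_per_second = bandwidth_tbytes_per_s * 1e12 * 8  # bytes -> bits
    return bits_per_second * energy_pj_per_bit * 1e-12   # pJ/bit -> joules/s

# Power to move 1 TB/s (the per-millimeter GRS figure quoted above):
for name, pj_per_bit in [("SerDes", 6.5), ("GRS", 1.0), ("interposer", 0.1)]:
    print(f"{name:10s}: {link_power_watts(1.0, pj_per_bit):5.1f} W per TB/s")
```

At 1 TB/s this works out to roughly 52 W for the SerDes, 8 W for GRS, and under 1 W on an interposer, which is the gap Dally is describing.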

SE: Where is the crossover point? We’re now at 7nm heading down to 5nm. Where do you guys hit it in terms of chiplets?

Alben: We haven’t hit it yet.

SE: People are trying to throw faster SerDes at the throughput problem. The other option is, let’s get rid of the SerDes entirely and go in a completely different direction.

Dally: The GRS is SerDes-like, but it’s much lighter weight than a SerDes. So it takes less die area, less power. And it takes one ball instead of two.

SE: Your competitors say a GPU is less power-efficient than some of the alternatives and soon-to-be released chips. What’s your view?

Dally: I don’t believe that’s the case. If you look at all these deep learning accelerators, at their core they all have a matrix multiply unit. We have one as well. You can think of our Tensor Core as our specialized unit for doing deep learning, and we’ve extended it to do int operations as well as floating point. And when you’re doing that inner loop of deep learning, I don’t know that anybody is going to be substantially more efficient than us. Most of the energy is going into the math in the Tensor Core. There’s a small amount of overhead for fetching the MMA (matrix multiply and accumulate) instruction that you’re issuing, and for fetching the operands out of the register files, but it’s completely amortized out. On Turing, a single IMMA instruction does 1,024 arithmetic operations, which amortizes all that overhead. So in the core operations, building a more specialized chip doesn’t buy you more than maybe 10% or 20%, which is the cost of doing that fetch. And by the way, they have to fetch their data from somewhere, too.
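The amortization argument can be sketched numerically. The per-operation and overhead energies below are illustrative assumptions, not Nvidia figures; only the 1,024 operations per IMMA instruction comes from the interview:

```python
# Sketch of the amortization argument: one Turing IMMA instruction performs
# 1,024 multiply-accumulate operations, so fixed per-instruction costs
# (instruction fetch, operand reads) are spread over all of them.
# The energy numbers used below are illustrative assumptions.

OPS_PER_IMMA = 1024

def overhead_fraction(fixed_overhead_pj: float, energy_per_op_pj: float) -> float:
    """Fraction of total instruction energy spent on fetch/operand overhead."""
    math_energy = OPS_PER_IMMA * energy_per_op_pj
    return fixed_overhead_pj / (fixed_overhead_pj + math_energy)

# Assuming 30 pJ of fetch/operand overhead and 0.2 pJ per int8 MAC:
print(f"overhead share: {overhead_fraction(30.0, 0.2):.1%}")
```

With those assumed numbers the overhead comes out around 13% of total energy, in line with the 10% to 20% ceiling Dally describes for further specialization.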

Alben: At the end of the day, they’re all talking about parallel processors, right? They’ve got a processor.

Dally: The real differentiation here is in two things. Probably the most important difference is in software. Having been doing this for a while, we have very refined software. It allows us to run a lot of different networks and get very good fractions of what’s capable out of our hardware. The other differentiation is in the memory system that sits around these matrix multiply units, and there are tradeoffs to be made there. For example, [the Tesla T4] has GDDR6 memory, which burns more energy than LPDDR.

Alben: Most inference companies today talk about using LPDDR memory, which is certainly much lower power, but it’s also a lot slower.

Dally: Right. And this was a very conscious decision. If we wanted to totally optimize teraOPS per watt, we would have put LPDDR4 on there. But we looked at certain networks that people want to run, especially the BERT networks, which are very large and require a lot of memory bandwidth. And if you want to do BERT with low latency, you need the memory bandwidth. So it’s better to burn a little bit more power and have this capability.
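The latency argument follows from arithmetic: at batch size 1, every inference has to stream the full weight set from memory, so latency is bounded below by model bytes divided by memory bandwidth. A minimal sketch, where the model size and bandwidth values are illustrative assumptions (BERT-large is roughly 340M parameters, 2 bytes each in FP16):

```python
# Why low-latency BERT needs memory bandwidth: with batch size 1, the weights
# must stream from DRAM on every inference, putting a floor on latency of
# model_bytes / memory_bandwidth. Numbers below are illustrative assumptions.

def bandwidth_bound_latency_ms(model_gb: float, bandwidth_gb_per_s: float) -> float:
    """Lower bound on single-inference latency when weights stream from DRAM."""
    return model_gb / bandwidth_gb_per_s * 1e3

BERT_LARGE_FP16_GB = 0.68  # ~340M parameters * 2 bytes each

print(f"GDDR6  @ 320 GB/s: {bandwidth_bound_latency_ms(BERT_LARGE_FP16_GB, 320):.1f} ms floor")
print(f"LPDDR4 @  30 GB/s: {bandwidth_bound_latency_ms(BERT_LARGE_FP16_GB, 30):.1f} ms floor")
```

Under these assumptions the latency floor is roughly an order of magnitude lower with GDDR6-class bandwidth, which is the tradeoff Dally describes.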

SE: People are starting to say, ‘This may not be the ultimate architecture for everything, but it’s probably good enough for a lot of things.’ This is an 80/20 rule, or maybe 90/10 at that level, right?

Dally: Yes. Also, if you over-specialize it for the networks today, by the time it actually comes out you’ve missed the mark. So you have to make it general enough that you can track really rapid progress in the field.

Alben: Think about someone in a data center. Are they going to want to buy it if it’s only good at doing one thing? Once they put that chip in their data center, it is going to be there for at least 5 or 10 years, whether they’re using it or not.

SE: Typically, there’s a wholesale replacement of all those chips at some point.

Alben: Yes, but it’s going to be there whether you [use it or not]. If you don’t have anything autonomous to run, then it’s just going to sit there not doing anything. The utilization and breadth of capabilities are important, and in general, we try to make sure that our designs can cover that well, not just an over-specialized, one-of-a-kind sort of thing.

SE: There’s a shift toward heterogeneous architectures in many AI applications. How well do GPUs play with other processor types?

Dally: We are the pioneers of heterogeneous architectures. From day one we’ve said the way you build your system is this: for the things that are really latency-critical, you have a CPU. For the things that are less latency-critical, but where you need the absolute best throughput, you have a GPU. And for the things that really need to go fast, you have specialized accelerators, which are in the GPU. This started with graphics, where some of the graphics workload runs on the CPU, some of it runs on the streaming multiprocessors on the GPU, and some of it runs on hardwired blocks on the GPU, like rasterizers, texture filters, and compositors. We started doing that in graphics, but it turns out the GPU is an ideal platform into which to plug accelerators. It has this wonderful memory system and a really low-overhead way of dispatching instructions. We can plug in other specialized accelerators, like Tensor Cores to accelerate AI. We can plug in RT cores to accelerate the BVH (bounding volume hierarchy) traversal and ray/triangle intersection portions of ray tracing. And in the future, I imagine we will plug in other cores as we identify application areas that need them. So it’s very heterogeneous. You have the CPU for those critical serial pieces of code, where all that matters is latency. Once you have enough parallelism, you don’t need that, and you run it on the GPU. And if a task becomes demanding enough, and there are enough people who want it, you build an accelerator and it becomes a core.

SE: And you can expand that out just by adding in more GPUs?

Dally: Yes.

Alben: A GPU is not like x86. It isn’t locked into a fixed ISA, so we can always change the definition of what that processor does.

SE: How do you see edge computing playing into all of this?

Dally: It’s going to wind up being huge because there are very few things that couldn’t benefit from intelligence.

SE: Right now it seems like a very vague concept, where we have the cloud and everything else.

Dally: Right, but it turns out that for a lot of edge things, you’re actually better off simply using I/O [to the cloud]. Smart thermostats don’t need intelligence in the thermostat. They just measure the temperature and send stuff up to the cloud, and it comes down to turn on the air conditioner. But there are times when you would need things on the edge, and what defines that for every application is different. Within Nvidia we decided to look at a few edge cases. We have our autonomous vehicle operation. We have a big effort in robotics. And then we have a big push in medical. But that’s just the tip of the iceberg of what’s out there on the edge. And our approach is to enable everybody else on the edge. We took our deep learning accelerator design, the NVDLA, and we open-sourced it.

SE: How much of your performance and power is now coming out of better hardware/software co-design versus in the past?

Dally: It is very much a reality in the AI space. If we put Tensor Cores in and we didn’t know how to write code for them, they would be useless. So when you’re really trying to squeeze a lot out of this, the hardware and software teams have to work together very closely.

SE: Where does Nvidia play in automotive?

Dally: We want to be the brains of all the autonomous entities, which includes self-driving cars, robots, and the various other things people are going to build. A lot of this is enabled by deep learning, which allows us to build machines that have perception exceeding that of human beings. We’re working with a number of different auto manufacturers to offer them as much of the stack as they want. A lot of people will use our hardware, depending on how much horsepower they need, and we’ve done the work to basically make that ASIL D, so you can use it for things that human life depends on. Then we have a software stack, where we have created a huge data set with our own vehicles running around the roads, and we’ve augmented that with simulated data. We have a huge army of labelers to label all of this data so we can train networks. We have networks that detect other vehicles and estimate their distance and velocity. We have networks that fuse that with the radar data coming back to get better velocity estimates. We have networks that find free space. We have two independent detections of where it’s safe to drive: one that says where the things you don’t want to hit are, and another that says where the space free of those things is. We have another network that finds the lanes, and we have a network that’s an outgrowth of our original end-to-end approach, which we’ve since de-emphasized, but it feeds into our path planner by suggesting an appropriate path for the car to follow. So that’s the perception stack.

SE: How about anticipating different scenarios?

Dally: We have on top of that a prediction module that tries to predict what the other vehicles are going to do in the future, and a planning module that, given all that information, says where to drive the car. In doing that, we developed a bunch of infrastructure pieces. A lot of the automotive manufacturers we get involved with will buy DGXs or DGX pods, because they’re going to be training their own networks and they need that capability. We also have a product called Constellation, which allows you to do hardware-in-the-loop simulation. You can take a Drive Pegasus or Drive Xavier and put it in a rack alongside a set of GPUs that generate video that looks just like the video coming in from the cameras. It’s an identical electrical format, so it plugs into the same connectors on the computer, and the computer doesn’t realize it’s not in a real car. It’s sort of like ‘The Matrix,’ so there will be computers in the rack and they’re in a simulated world, and the simulated road can be replaying data that you’ve already taken. So you have this regression testing, and you can say, ‘Okay, here’s our drive down Highway 101 yesterday,’ and just make sure it does the same thing it’s supposed to do. We also can feed in simulated data and run scenarios so we can see how well the car computer does. This hardware-in-the-loop simulation is really important for verifying in a regression way that if you change anything, you haven’t broken anything, and also for verifying that it works in the first place under all conditions.

SE: Do you collect all the data from all the different startups and car companies using your chips?

Dally: No, I wish we could. That would be a huge advantage, to have all that data in addition to all the data we have collected ourselves.

SE: Does anyone ever ask for data?

Dally: We’ve had discussions with people about data sharing.

Alben: In both directions. It’s obviously a big asset in the industry right now.

SE: The world seems to be split on automotive now between car companies that have said this is going to take too long, so they’re just going to have one camera now instead of multiple cameras, and those that are rushing toward autonomous driving, probably within some sort of geofenced area.

Dally: People are wrong to look at this as a threshold thing, where you get to some point and all of a sudden all cars drive themselves. What I see happening is that as we develop this technology, even though a lot of it is being pushed for Level 4 driving, it winds up reaching current cars at Level 2 or 3. And it makes everybody a lot safer right now. If you look at some of the features that have rolled out in the last couple of model years, people are running neural networks in their perception stacks that do automatic emergency braking, and it’s way better than it was in the past.

SE: That’s an interesting observation, because in the past most of those improvements came out of very expensive cars. Now you’re starting to see them come out of the development of the technology itself, and it seems to be more universal.

Alben: There is still a tiering.

Dally: Back in the 1990s, when we were going from half-micron to 0.25 micron, that was a 3X improvement per watt. If you lagged behind that technology shift, you were not competitive. Today, going between two adjacent nodes, say 10nm and 7nm, or 7nm and 5nm, you might get a 20% improvement. It’s not the 3X it once was. If you look at our generations of GPUs, we’ve been doubling our performance on AI from generation to generation. That’s from architecture, not from process. Process helps a little bit.
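Compounding the two scaling sources Dally contrasts makes the gap concrete. A small sketch using the figures from his answer (roughly 20% better performance per watt per modern node transition, versus about 2x per GPU generation from architecture):

```python
# Compounding gains over three steps of each scaling source mentioned above:
# ~20% per modern node transition vs. ~2x per architectural generation.

def compound(gain_per_step: float, steps: int) -> float:
    """Total gain after `steps` transitions of `gain_per_step` each."""
    return gain_per_step ** steps

print(f"process alone:      {compound(1.2, 3):.2f}x")
print(f"architecture alone: {compound(2.0, 3):.1f}x")
```

Three node shrinks alone compound to well under 2x, while three architectural doublings compound to 8x, which is why the architecture, not the process, carries the generational AI gains.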

SE: Is power an issue, or is it still pretty much all performance?

Dally: Performance per watt is what matters, and we will deliver as much performance as we can in an envelope.

