The more compute power, the better. But what’s the best way to get there?
Semiconductor Engineering sat down to discuss chip scaling, transistors, new architectures, and packaging with Jerry Chen, head of global business development for manufacturing & industrials at Nvidia; David Fried, vice president of computational products at Lam Research; Mark Shirey, vice president of marketing and applications at KLA; and Aki Fujimura, CEO of D2S. What follows are excerpts of that conversation.
SE: For decades, chipmakers have implemented IC scaling to advance a design, but the costs are escalating and the benefits of scaling are diminishing at each node. What’s your take on Moore’s Law? Do we need 2nm processes and beyond? And do we need more compute power or not?
Fujimura: Absolutely, yes. There is no question about it in my mind. For example, D2S provides GPU-accelerated computing for the semiconductor manufacturing industry. For the type of things that we do, whether it’s for wafer or mask manufacturing, we can already see today how we could use 10 times as much computing per dollar as we get right now. Thankfully, because we are focused on GPU acceleration, Moore’s Law continues to scale for us. We don’t rely on clock speed scaling. We rely on bit-width scaling. Nvidia packs more and more cores into the same chip for the same amount of dollars. We take advantage of that and scale along with it. We just announced our seventh-generation computational design platform, which has 1.8 petaflops of computing power. It’s an amazing amount of stuff that we can do in one computer rack. But we could easily use 18 petaflops. With 1.8 petaflops, we can do a lot, but there’s a lot more that we would like to be able to do. We could simulate more accurately, for example, taking more sophisticated effects into account without approximation. But we can’t use those simulations today because they would require 10 days of computing. With 18 petaflops, we could get that down to one day of computing. And so there is a segment of the computing community, like ourselves, where there is an insatiable demand for more computing power. And certainly, deep learning has accelerated that.
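As a rough check of the numbers Fujimura cites, and assuming runtime scales inversely with available throughput (an assumption for illustration, not something stated by the panel), the arithmetic works out as:

$$ T_{\text{new}} \approx T_{\text{old}} \times \frac{P_{\text{old}}}{P_{\text{new}}} = 10\ \text{days} \times \frac{1.8\ \text{petaflops}}{18\ \text{petaflops}} = 1\ \text{day} $$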
Fried: We could use 10X more computing. It’s across the board. Every node, every point of user interaction, every compute element, and every memory point needs 10X more computing power. That insatiable need is pervasive across the entire world right now. Working remotely and being at home has only exacerbated it.
SE: Let’s take a look at the evolution of GPU scaling. At the 180nm node in 2002, Nvidia’s GPUs had 61 million transistors. The latest 7nm GPUs from Nvidia have 54 billion transistors. They also incorporate high-bandwidth memory (HBM) in an advanced package. Is there a limit?
Chen: The benefits that you can get just from process improvements are clearly starting to taper off quite a bit. At the same time, the market’s appetite for ever more compute power is as insatiable as ever. The thing about our architectural strategy, from the very beginning as a full stack accelerated computing company, is that we haven’t been singularly dependent on process improvements to give us continually faster clocks and faster scalar performance. We’ve always been able to optimize in three ways. Number one is through parallelism. Number two, by innovating at the architectural level. Number three, by optimizing across the full software stack, top to bottom. As a result, we’ve been able to sustain our dramatic performance improvements consistently, year after year, far more so than if we’d been dependent just on increasing clock frequencies and process improvements.
Fujimura: Software algorithms that are written to take advantage of the single-instruction multiple data (SIMD) architecture can scale perfectly with bit-width scaling of the GPUs. So as we grow from 5,000 cores per chip to 10,000 cores over the coming years at about the same cost per chip, SIMD software will be able to scale in performance linearly. In such programs, all of the cores execute the same instructions at any given time, with the data being the only difference among the processors. This is very different from having multiple threads in multiple cores of a CPU, where different instructions are operating on different data in each core. Only certain types of software, and only software that was written specifically for SIMD, will work well and benefit from the bit-width scaling of GPUs. But thankfully, almost anything in nature is inherently SIMD. Physics, chemistry, and math operate the same on any unit. It’s the data that’s different, causing different and complex overall behavior. So weather forecasting, neural network computing, lithography simulation, mask simulation, and image processing all can be cast into SIMD-based architectures quite naturally.
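To make the SIMD model Fujimura describes concrete, below is a minimal CUDA sketch, not D2S’s or Nvidia’s actual code; the kernel name, array size, and the simple gain operation are illustrative assumptions. Every thread executes the same instructions, and only the data element it handles differs:

```cuda
// Minimal SIMD-style (data-parallel) sketch: every thread runs the same
// instructions; only the element of data it touches is different.
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: apply the same gain to every pixel of an image-like array.
__global__ void scaleImage(const float* in, float* out, float gain, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // each thread owns one element
    if (i < n) {
        out[i] = gain * in[i];                      // same instruction, different data
    }
}

int main() {
    const int n = 1 << 20;                          // 1M elements (illustrative size)
    const size_t bytes = n * sizeof(float);

    float* h_in = (float*)malloc(bytes);
    float* h_out = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) h_in[i] = float(i % 256) / 255.0f;

    float *d_in, *d_out;
    cudaMalloc((void**)&d_in, bytes);
    cudaMalloc((void**)&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    scaleImage<<<blocks, threads>>>(d_in, d_out, 0.5f, n);
    cudaDeviceSynchronize();

    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    printf("out[100] = %f\n", h_out[100]);

    cudaFree(d_in);
    cudaFree(d_out);
    free(h_in);
    free(h_out);
    return 0;
}
```

Because the kernel body is identical for every element, adding more cores lets more threads run concurrently and throughput grows roughly linearly, which is the bit-width scaling D2S relies on.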
SE: On the flip side, the industry is running into various challenges in transistor scaling. We’re talking about power, performance, area, cost, and time. We are encountering the power wall, RC delay, and the limits of area scaling. What are some of the challenges that you’re running into here?
Chen: First of all, we need to think about how we move data around between different parts of the GPU on-die. Not only that, we need to be clever about data movement at every level, from package-level fabric all the way up to data center-level fabric. There’s a large body of research showing that data movement is the biggest driver of power consumption at all levels. Many people only think of power-limited solutions in the context of mobile devices. But the reality is that everything nowadays is power limited — even the highest performance supercomputers. So our architects and VLSI engineers focus on optimizing our architectures to get as much performance as we can out of any given power budget.
Fried: Power, performance, area, and yield (PPAY), or PPAC if we want to specify cost instead of yield, has always been the aggregate envelope for product development. We’ve always been up against a barrier in each one of these. We’ve always been bound by PPAC or PPAY. We’ve always pushed the envelope on all of these. Sometimes, you take big steps in one of those parameters, some bigger than others. But it’s always this elastic set of barriers that we’ve been pushing out against. One thing that’s important is total system-level performance. That’s all that really matters at the end of the day. And at certain points in history, huge steps in system-level performance were enabled just by the clock frequency of the chip. At other times, big system-level improvements have been enabled through power management technologies. So, we’ve always been up against the same things — power, performance, area, and yield or cost. You need improvements in at least one of those areas to enable total system performance, and it’s not always the same area. I would suggest that baseline transistor scaling, whether through step-wise improvements in performance, power, or uniformity, has always been a significant piece of overall system performance improvements. We’re definitely not in a place where transistor scaling doesn’t matter anymore. It still matters. It’s being leveraged and taken advantage of in different ways. If we get density improvements from scaling, even without performance enhancements, most designers are going to take them by putting more cores in the reticle field. Some may not care if we gain additional performance from those transistors. But if you can get 10% more cores on, let’s say, a GPU, that’s a huge system-level advantage enabled by transistor scaling. At that point, you don’t have to worry about moving those bits on and off a chip. You move bits around the chip, and it’s much faster. It’s a huge system-level advantage just to get additional monolithic integration by scaling. We’re limited by the same parametric boundaries as before. We’ve just been continually pushing them in different directions. At the end of the day, system-level performance is all that matters. This is no big change. This isn’t a major inflection point where we are changing our entire PPAC or PPAY methodology. We are just pushing on some parameters, and we can keep enhancing system-level performance as long as the market keeps insisting and demanding that we provide additional compute power and memory.
SE: Starting in 2011, the industry moved from planar transistors to next-generation finFETs. Chipmakers continue to extend today’s finFET transistors at advanced nodes. Some will extend the finFET to 3nm, while others will move to gate-all-around nanosheet FETs at 3nm/2nm. How do you see this playing out?
Fig. 1: Planar vs. finFET vs. gate-all-around. Source: Lam Research
Fried: The planar-to-finFET transition was driven predominantly by gate-length scaling. To get better electrostatic control of the device, we went to essentially a double-gated device. That provided us with multiple nanometers of gate scaling, and it also opened up a new axis in transistor scaling. We could start increasing fin height to provide more active width per footprint. That was a nice transition. Going to gate-all-around gives you full electrostatic control of the device. It should give you a couple of extra nanometers of gate scaling. Those are the nanometers that we need, and it opens up another axis of scaling. In the future, if we can get to complementary FETs, like nFETs and pFETs stacked on top of each other, it gives us an additional logic scaling advantage. We’re starting with an electrostatic advantage to get gate-length scaling, and by doing so we’re opening up a whole new scaling parameter set. That said, the transition from finFET to gate-all-around, either nanowire or nanosheet, is going to be a little rougher. The architecture requires us to perform processes underneath the structure. That’s a big change, and it’s really challenging. With finFETs, we had to learn how to perform semiconductor processing on sidewalls better than we had previously, but we could still pretty much see everything we were doing. In gate-all-around nanosheets/nanowires, we have to do processing underneath the structure, where we can’t see and where it’s much more challenging to measure. And that’s going to be a much more difficult transition.
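One way to see the “more active width per footprint” point Fried makes: the drive width of a finFET is commonly approximated from the fin geometry, where the symbols below are introduced here for illustration and were not used by the panel:

$$ W_{\text{eff}} \approx n_{\text{fin}} \left( 2H_{\text{fin}} + T_{\text{fin}} \right) $$

A planar device’s effective width is essentially its layout footprint, so it grows only by consuming more area. For a finFET, raising the fin height increases drive width with no additional footprint, and a gate-all-around nanosheet extends the same idea by wrapping the gate around the full perimeter of each stacked sheet.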
Shirey: It’s a challenge to see below the surface for gate-all-around architectures. We experienced similar challenges, but on a smaller scale, when first measuring finFET structures. GAA continues the trend started by finFET of scaling transistors by implementing vertical architectures. With nanosheet and its evolutions, such as forksheet FETs and complementary FETs, the industry will continue to see this roadmap of more and more 3D structures, all of which have to be measured and inspected. From an inspection and metrology standpoint, we are beginning to implement different illumination sources to detect and measure critical process and patterning anomalies. For many of these device structures, we are pursuing different optical wavelengths to see below the surface and extract signal from variations or defects. And in parallel, we also are looking at innovations in illumination source — things like X-ray and e-beam technologies — to get below the surface and see what’s going on there.
SE: How do you see the roadmap from your vantage point? We can see the path to gate-all-around nanosheets. What’s beyond that? Do we have the processes and the tools to be able to do that?
Shirey: Obviously, the primary industry focus in the near term is getting gate-all-around technologies like nanosheets/nanowires integrated and working. It’s still relatively early in terms of characterizing those devices, and we’re figuring out the most efficient way to measure them. Our analyses show that forksheet FETs and complementary FETs are emerging, with many papers demonstrating feasibility, but they won’t show up in device integration in the near term. Once evolutionary improvements on nanosheets run out of steam, the industry will need to switch to something else like complementary FETs, which could double the transistor density. If we allocate enough engineering focus and resources to the tooling and the processes, those technologies may be adopted.
SE: Do we need alternative architectures like advanced packaging, monolithic integration, and others?
Shirey: I have two thoughts along those lines. First, many are talking about innovation through chip architectures. And it seems like that is a huge area for achieving performance improvements on the scale of what’s been obtained through Moore’s Law. We’ve gone from a Swiss Army knife-type CPU to a specialized GPU. These GPUs are designed to do a very specific task and can execute that task more efficiently from a power and performance point of view. It feels like there’s a gold rush of emerging chip architectures out there, and each one has ramp and yield challenges. Also, the chips are often big, which in and of itself presents a yield challenge. So, we see an explosion in device architectures related to the move from Swiss Army knife-like CPUs to highly customized GPUs, and as a result, process control intensity increases in order to help achieve the overall system performance requirements. My second thought is the one thing we haven’t talked about yet, which is combining all these chips through advanced wafer-level packaging, or more futuristic things like hybrid wafer bonding and hybrid die bonding. Stacking chips — like a multi-die DRAM stack or stacking DRAM directly onto a logic chip — is a huge area where performance gains can also be achieved at the system level.
Fried: At the system level, the answer is going to be yes to all innovations. We’re going to need transistor scaling. We are going to need chip architecture improvements. We’re going to need 3D integrated packaging. We’re going to need all of these to deliver the final system performance requirements. There’s a spectrum of market fragmentation out there. We’re coming from a previous place where there was almost no system-level fragmentation, where everything was like a single CPU. You can look at our prior approach to system-level improvements like it was a Swiss Army knife. And so we made a lot of transistor decisions, interconnect decisions, packaging decisions, and integration decisions based on our existence in a completely consolidated system space. On the other end of that spectrum is a completely fragmented space, where every system has its own requirements. If we go down that completely fragmented path, we might make various different transistor scaling decisions, packaging decisions, and interconnect decisions. You’re going to want to optimize each system differently. So the way some product designers want to perform 3D integration in terms of where they place the memory, I/O, and compute in a 3D integrated package is going to be very different from somebody who’s developing a different system with different priorities and requirements. There’s this huge spectrum of choices being made. Every chip architecture will drive different decisions in technology, packaging, and interconnect. It will be really interesting to see where we end up in that spectrum of fragmentation to meet system-level performance requirements.
Fujimura: Nvidia’s use of different technologies is a very good example of what everybody’s talking about. For every generation of their products, they survey all the techniques that are available out there and pick the best combination. They have the practical constraints of needing high enough yields and staying economical. Nvidia’s GPUs are high-volume products that are sensitive to yield, so picking leading-edge technologies that will work well on time is always a difficult task. Having used Nvidia’s GPUs for over 10 years, we’ve always been amazed by how much packaging and integration technology they pack in for each generation.
Chen: We were among the first to see the benefits of advanced packaging and integration technologies back in 2016 with our Pascal generation. But frankly, the whole industry has a great appetite for the benefits of these technologies. You see this reflected by how aggressively the foundries and memory manufacturers have been investing in these sorts of technologies. For example, TSMC has invested aggressively in their 3DFabric portfolio, which includes advanced 2.5D and 3D packaging and fabrics. These technologies open up additional dimensions for pushing full device performance beyond the die.
Related Stories
New Transistor Structures At 3nm/2nm
Gate-all-around FETs will replace finFETs, but the transition will be costly and difficult.
What Goes Wrong In Advanced Packages
More heterogeneous designs and packaging options add challenges across the supply chain, from design to manufacturing and into the field.
Momentum Builds For Advanced Packaging
Increasing density in more dimensions with faster time to market.