Moore And More

Chiplets, packaging and some interesting new challenges.

For more than 50 years, the semiconductor industry has enjoyed the benefits of Moore’s Law — or so it seemed. In reality, there were three laws rolled up into one:

  1. Each process generation would have a higher clock speed at the same power. This was not discovered by Moore, but by Dennard, who also invented the DRAM. Process generations continue to get faster and lower power, but the power does not come down fast enough to allow clock speeds to increase much.
  2. Each process generation would have smaller transistors, so you could get more transistors onto each die.
  3. All those transistors would be cheaper. The rough rule was that the density would double while the cost per wafer went up by less than 15%, making transistors roughly 40% cheaper.
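
As a quick sanity check on that arithmetic (a back-of-the-envelope illustration, not figures from the original observation): if density doubles while wafer cost rises 15%, the relative cost per transistor is

$$\frac{\text{new cost per transistor}}{\text{old cost per transistor}} = \frac{1.15}{2} \approx 0.58,$$

a drop of a bit more than 40% per generation.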

Dennard Scaling ended about 20 years ago when clock speeds topped out at 3+ GHz. We are left with a weakened version of that first law: transistors still get faster and lower power with each process generation, just not by enough to let clock speeds keep rising. Meanwhile, item No. 3 changed somewhere around 28nm. Now, at each process generation, the cost of processing a wafer is going up faster than the density. So the cost per transistor is rising with each generation.

Yes, we get more transistors at each process generation, but they also cost more. We can cram more functionality on the chip, but the chip will be more expensive than if we’d built it in an older generation.

Or, put simply, there are good reasons for using an advanced node—more transistors, lower power, higher performance—but it costs you. Those transistors are not getting cheaper, as they did for the first 40 or so years of Moore’s Law. If you need that extra performance and lower power, then use the advanced node. But if you don’t, then don’t. This is why 28nm continues to be a hugely popular node for design starts and continues to run in huge volumes.

Moore’s Law was actually an economic law. In the original article (based on just four data points), Gordon Moore wrote that the economically optimal number of transistors on a chip was doubling every couple of years. If you tried to cram more than that on the chip, it didn’t yield well. And if you went for less, you’d need more chips for the same functionality.

The main reason transistors are getting so expensive is that fabs cost so much to build and processes cost so much to develop. An EUV stepper costs $100 million to $200 million, and that’s just one piece of equipment. You need more than one to equip an advanced-node fab.

Moore knew this day would come. In fact, he expected it much sooner. He never expected his law to last for more than 50 years. In a video interview at SEMICON West a couple of years ago, when asked what he’d like to be remembered for, he replied, “Anything but Moore’s Law.”

Nevertheless, Moore’s observation extended beyond a single die: “It may prove to be more economical to build large systems out of smaller functions, which are separately packaged and interconnected.”

Well, that day has come, and it points toward another trend that has been under way for some time — complex packaging. Putting more than one die in a package has become more economical. Like all mass-production technologies, packaging has been driven largely by volume learning. Large microprocessors use interposer-type technologies. Smaller (in both transistor count and physical size) communication chips have used fan-out wafer-level packaging (FO-WLP). Since smartphones ship about 1.5B units per year, meaning any individual model may ship in the hundreds of millions, that is a lot of learning. Using advanced packaging to combine multiple die is often known as “More than Moore.”

Putting these things together, the balance has changed between using scaling as the main lever to build larger and more complex systems, versus using multiple die to do so. The economics of manufacturing a huge number of transistors on the same chip, versus building smaller die and packaging them together, is now a complex decision. Until recently, at least for large designs, the economics always came down to a single SoC. That is changing for several reasons.

Die Size
Smaller die yield better than large die. If fatal defects are randomly spread across a wafer, then a large die is more likely to be hit by one. A large die also wastes more area around the edge of the wafer, since there are more places where a whole die simply doesn’t fit. In the past, despite this, it was more economical to suck it up and build a big SoC rather than build separate die and package them together. The economics now favor building smaller die, especially if a full system can make use of multiple copies of the same die. It is not too challenging to build a many-core microprocessor this way, or an FPGA.
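
A minimal sketch of that yield effect, using the standard Poisson defect model (the defect density and die sizes here are assumptions for illustration, not data from any particular fab):

```python
import math

DEFECT_DENSITY = 0.1  # fatal defects per cm^2 (assumed for illustration)

def poisson_yield(area_cm2):
    """Probability that a die of the given area has no fatal defect."""
    return math.exp(-area_cm2 * DEFECT_DENSITY)

# One big 800 mm^2 SoC versus four 200 mm^2 chiplets.
print(f"800 mm^2 SoC yield:     {poisson_yield(8.0):.1%}")
print(f"200 mm^2 chiplet yield: {poisson_yield(2.0):.1%}")
```

With these assumptions, fewer than half of the big die are good, while more than 80% of the chiplets are, and each bad chiplet throws away only a quarter as much silicon.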

There is another problem with very large designs. The lithography process has a maximum reticle size (about 26mm x 33mm, or roughly 850mm² of die area). If the design is larger than that, then splitting it up is the only option.

Keep Your Memory Close
All high-performance processors, whether CPUs, GPUs, deep-learning processors, or anything else, require access to large memories, either as caches or for directly storing the (big) data. A huge amount of power consumption in most computations is simply moving the data around, not doing the actual calculations. A lot of the latency in the overall calculation comes from this movement, too. So an obvious thing to do is to move the memory closer to the processor, which reduces power and improves performance.
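
To put rough numbers on that (ballpark figures often quoted for a roughly 45nm-class process; treat them as illustrative assumptions): a double-precision floating-point operation costs on the order of 20pJ, while fetching an operand from off-chip DRAM costs on the order of 2nJ:

$$\frac{E_{\text{DRAM access}}}{E_{\text{FP operation}}} \approx \frac{2{,}000\,\text{pJ}}{20\,\text{pJ}} = 100$$

Moving the memory closer attacks that 100x gap directly.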

The “obvious” way to do this would be to put the DRAM on the same chip as the processor, but there are two problems. First, there are all the die-size limitations discussed earlier. Second, although it is possible to mix DRAM and logic processes, it is costly. You can’t add DRAM to a logic chip with just a couple of process steps.

The earliest approach to this is known as package-in-package (or PiP). This slightly odd term is to distinguish it from package-on-package (PoP) where two ball-grid-array (BGA) packages are literally stacked on top of each other. Two die, such as a smartphone application processor and a DRAM, are put in the same package and everything is wire-bonded to avoid the complexity of things like through-silicon-vias (TSVs). Smartphones have been doing this for years.

For high-performance computing (HPC), this approach doesn’t allow for enough memory. This slice of the market typically wants to access several high-bandwidth memory (HBM or HBM2) stacks, each consisting of a logic die with four or eight DRAM die stacked on top, everything connected with TSVs. The stack is then put on an interposer alongside the processor.
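
For scale, using the headline JEDEC HBM2 figures (a 1,024-bit interface at up to 2.0 Gb/s per pin), one stack delivers

$$1{,}024\ \text{pins} \times 2.0\ \text{Gb/s per pin} = 2{,}048\ \text{Gb/s} = 256\ \text{GB/s},$$

so a processor flanked by four stacks can see around 1 TB/s of memory bandwidth.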

There is also a JEDEC Wide I/O standard, intended to standardize the interface (so the memory doesn’t depend on the particular design), in which the DRAM, with TSVs, is placed directly on top of the logic die. Since Wide I/O has more than a thousand pins, it can deliver very high bandwidth without the power overhead of a full DDR interface, which pushes a lot of data through just a few pins.

Heterogeneity
Another motivation for separate die is not just to split up a design in a single process, but to package die from different processes. A processor can be manufactured in the most advanced and expensive node, while the I/O can be developed at a less advanced and cheaper node.

The reason for doing this is two-fold. First, the I/O interfaces don’t benefit from the more advanced node, and in the modern era advanced nodes are more expensive per transistor, so the economic push is to hold back rather than move to the newest node as aggressively as possible. But there is also a second, more subtle reason. All the I/O (and other routine blocks) have already seen silicon, either in production or at least in test chips. If the I/O die is also done in the most advanced process, then test chips for things like high-speed SerDes become part of the critical path to getting the whole system out.

Moving RF and analog to the most advanced node isn’t beneficial, either. It is very difficult to design analog circuits in finFET processes, because finFETs are quantized. Transistors have a uniform, fixed length, and width is an integer number of fins. In planar processes, the analog circuit designer could pick the widths and lengths of the transistors freely, and often what matters most in analog design is the ratio between the sizes of critical transistors. In a finFET process you can’t build two transistors with an arbitrary ratio like that, so conventional analog design techniques don’t carry over (see the sketch below). It makes much more sense to keep analog design in a planar process like 28nm, or perhaps an even less advanced node such as 65nm, where the design has already been well characterized and seen high-volume production.
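
A small illustration of that quantization (the target ratio and fin budget are hypothetical numbers chosen for the example):

```python
TARGET = 1.37   # hypothetical width ratio an analog designer might want
MAX_FINS = 8    # assumed practical fin budget per device

# In a finFET process, every achievable width ratio is a ratio of
# integer fin counts, so only certain values can be hit.
best = min(((a, b) for a in range(1, MAX_FINS + 1)
                   for b in range(1, MAX_FINS + 1)),
           key=lambda p: abs(p[0] / p[1] - TARGET))
ratio = best[0] / best[1]
print(f"closest achievable ratio {best[0]}:{best[1]} fins = {ratio:.3f} "
      f"(target {TARGET}, error {abs(ratio - TARGET):.3f})")
```

With an eight-fin budget the closest match is 7:5 = 1.4, an error of about 2% that simply would not exist in a planar process where widths are continuous.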

Another area where it can be attractive to use a separate die is photonics. Even if some of the photonics are on the main die, it is unlikely that the lasers themselves can be. Usually, they are manufactured in InP (indium phosphide) or some other esoteric non-silicon process.

Chiplets
So far, the discussion about multiple die in a package has assumed everything is designed by the same company, except for the DRAMs. But there is another possibility: in-package components could become available commercially. This is the chiplet approach.

There are some technical challenges to chiplets, roughly equivalent to other in-package integration issues, along with two additional challenges — standardization and market.

If the same team is designing two die that have to go in the same package, they can pretty much use any communication scheme they choose. But if the chiplets are standard in some sense, such as a high-speed SerDes chiplet or a Wi-Fi chiplet, then the SoC has to use whatever interface the chiplet provides.

To keep things simple, it is better if the interfaces are well-proven and standard. Inside a package, the distances are short, so it doesn’t make sense to use the same type of long-reach interfaces that would be appropriate for running across a PCB or a backplane. Another advantage inside a package is that connections are relatively cheap compared to running through a package onto a board, so there is no need to resort to very fast serial interfaces just to reduce pin count, as there is at the board level.
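
A back-of-the-envelope comparison of the two styles of link (every per-lane rate and energy figure below is a ballpark assumption for the sake of the sketch, not a measured number):

```python
# Two ways to move 1 Tb/s between die in the same package.
TOTAL_GBPS = 1000  # 1 Tb/s aggregate die-to-die bandwidth

links = {
    "long-reach SerDes (board-style)": (56, 5.0),  # Gb/s per lane, pJ/bit
    "wide in-package parallel":        (4, 0.5),
}

for name, (gbps_per_pin, pj_per_bit) in links.items():
    pins = TOTAL_GBPS / gbps_per_pin               # signal pins needed
    watts = TOTAL_GBPS * 1e9 * pj_per_bit * 1e-12  # link power
    print(f"{name}: ~{pins:.0f} signal pins, ~{watts:.1f} W")
```

On a board, 250 signal pins would be prohibitive, so the fast serial link wins; inside a package, the extra connections are cheap and the wide parallel link saves an order of magnitude in power.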

The dream for chiplet proponents is a commercial marketplace for known-good-die. Design becomes more like board-level system design, with purchased standard components, and perhaps a single SoC designed as the heart of the system.

There is a lot of skepticism that this will happen, as the problems of inventory seem hard to deal with. On the other hand, it’s not that different from how electronic systems were designed in the 1970s, with standard TTL components available from several manufacturers.

The value proposition for chiplets would be:

  • Flexibility in picking the best process node for the part—in particular, SerDes I/O, RF, and analog do not need to be on the “core” process node;
  • Better yield due to small die size;
  • Shorter IC design cycle and integration complexity by using pre-existing chiplets;
  • Lower manufacturing costs by purchasing known-good die (KGD), and
  • Volume manufacturing cost advantage when the same chiplet(s) are used in many designs.

The first two bullets apply to any system-in-package (SiP) solution. The benefits of the other three are greatest if you can simply buy chiplets from a distributor, but they mostly hold even if the chiplets have to be manufactured on demand for each system. The promise is that you can build big, high-performance systems without the complexity of integrating everything onto an enormous SoC, and that would be a very big change.


