New Architectures, Much Faster Chips

Massive innovation to drive orders of magnitude improvements in performance.


The chip industry is making progress in multiple physical dimensions and with multiple architectural approaches, setting the stage for huge performance increases based on more modular and heterogeneous designs, new advanced packaging options, and continued scaling of digital logic for at least a couple more process nodes.

A number of these changes have been discussed in recent conferences. Individually, they are potentially significant. But taken as a whole, they point to some important trends as the benefits of device scaling dwindle and market needs change. Among them:

  • For high-performance applications, chips are being designed around much more limited data movement and near-memory computing. This shows up in floor plans where I/Os sit on the perimeter of the chip rather than in the center, an approach that increases performance by shortening the distance data needs to travel and, consequently, lowers overall power consumption. (A rough energy sketch after this list illustrates why that distance matters.)
  • Scaling of digital logic will continue beyond 3nm using high-NA EUV, a variety of gate-all-around FETs (CFETs, nanosheet/nanowire FETs), and carbon nanotube devices. At the same time, reticle sizes will increase to allow more components to fit into a package, if not on a single die. Together, these moves add substantially more usable real estate, the first by shrinking features and the second by expanding the available area, allowing for greater compute density. In addition, scaling of SRAM will continue, and more layers will be added for high-bandwidth memory (HBM) modules and for 3D-NAND flash.
  • Designs are becoming both more modular and more heterogeneous, setting the stage for more customization and faster time to market. All of the major foundries and OSATs are now endorsing a chiplet strategy, and they are offering multiple options based upon price and performance requirements.
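
The impact of keeping data close to the compute elements can be made concrete with a rough, back-of-envelope energy model, sketched below in Python. The per-bit energies are illustrative assumptions rather than figures for any particular chip or node, but they reflect the commonly cited gap of roughly two orders of magnitude between a small local SRAM access and a trip off-chip to DRAM.

# Back-of-envelope data-movement energy model. All per-bit energies are
# illustrative assumptions, not measured values for any product or node.
ENERGY_PJ_PER_BIT = {
    "local_sram": 0.1,      # small SRAM next to the compute element (assumed)
    "cross_chip_noc": 1.0,  # traversing the die over the on-chip interconnect (assumed)
    "off_chip_dram": 20.0,  # through the I/O PHY out to external DRAM (assumed)
}

def movement_energy_mj(bytes_moved, path):
    """Energy in millijoules to move bytes_moved over the given path."""
    picojoules = bytes_moved * 8 * ENERGY_PJ_PER_BIT[path]
    return picojoules * 1e-9  # 1 pJ = 1e-9 mJ

if __name__ == "__main__":
    working_set = 64 * 1024 * 1024  # a hypothetical 64MB of weights/activations
    for path in ENERGY_PJ_PER_BIT:
        print(f"{path:>14}: {movement_energy_mj(working_set, path):7.2f} mJ")

With numbers in that ballpark, a single pass over a 64MB working set costs on the order of 10 mJ if the data comes from off-chip, versus well under 1 mJ if it stays near the compute elements, which is why the floor plans described below push I/O to the edges.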

Some of this has been in the works for years, but much of the development has been piecemeal. There is no longer a single industry roadmap, which in the past served as a guide for how all of these developments fit together. Work continues on all fronts in its absence, but it frequently is difficult to see how the big picture is developing because not everything is moving in sync. For example, ASML was talking publicly about high-numerical aperture EUV, which replaces today’s uniform-magnification projection optics with anamorphic optics, even before EUV was commercially viable. And companies such as ASE and Amkor have been working on multiple versions of fan-outs, 2.5D and 3D-ICs for the better part of this decade, even though the markets for these packaging schemes turned out to be very different than initially expected.

There are many new developments on the horizon, as well. Major foundries such as TSMC, UMC, GlobalFoundries and Samsung are building advanced packaging capabilities into the backend of manufacturing. TSMC also is planning to add chiplets into the frontend using bump-less hybrid bonding, which it calls SoIC. All of these will likely require significant changes across the industry, from EDA tools to test and post-silicon monitoring.

How quickly all of these different elements come together is unclear. No one likes to be first, and at this point, it’s not obvious which of these approaches and technologies will win, or even whether they will compete with each other. But change is essential as the volume of data continues to grow. This is driving more customized solutions to process and utilize that data closer to the source, which includes some level of intelligence nearly everywhere.

In the past, solutions were developed around the most advanced hardware or software based on the assumption that the next process generation would add big improvements in performance. That no longer works. Scaling is becoming more difficult and expensive, and power/performance benefits of shrinking features are diminishing. In addition, one size no longer fits all. It can vary greatly depending upon where end customers are in the compute hierarchy — end point, edge, or cloud — and how data needs to be structured and prioritized. As a result, chipmakers have shifted their focus to new and more modular architectures that are capable of everything from massive simulations and training algorithms in the cloud, to weeding out useless image and streaming video data at the source.

Put in perspective, more processing needs to happen everywhere faster, and it needs to be done using the same or less power. In addition, systems need to be created faster, and they need the ability to change much more quickly as market needs evolve and algorithms continue to change.

Architectural shifts
To make that happen, hardware architectures need to change. Chipmakers have seen this coming for some time. For example, IBM’s new Power 10 chip concentrates customized compute elements in the center of the chip and moves peripherals and I/O to the edges.

“Acceleration needs to get pushed into the processor core,” said Bill Starke, the chip’s chief architect, at the recent Hot Chips conference. “Around the chip perimeter are PHYs.” IBM also introduced pod-level clustering, and added a new microarchitecture to support all of this.


Fig. 1: IBM’s Power 10 chip (L., from Hot Chips 2020), with processing cores concentrated in the middle of the chip served by localized memory and shared L3, vs. Power 9 (R., from Hot Chips 2018) with off-chip interconnect in center. Source: IBM/Hot Chips 2018/20

Others are taking similar approaches. Intel introduced a new architecture based on internally developed chiplets, which clusters modular processing elements together and uses its Embedded Multi-die Interconnect Bridge (EMIB) to connect them to HBM modules. In addition, it has updated its latest server chip architecture to minimize data movement.


Fig. 2: Intel’s latest server processor architecture (R.) reduces movement of data compared to the previous generation (L.) Source: Intel/Hot Chips

Likewise, Tenstorrent, which makes AI systems, created a highly modular system that includes 120 self-contained cores connected with a 2D bi-directional torus NoC. “Every core progresses at its own pace,” according to Jasmina Vasiljevic, director of software engineering at Tenstorrent.
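
To get a feel for what a bi-directional 2D torus buys in an array like this, the short sketch below computes worst-case and average hop counts for a hypothetical 120-core grid. The 10 x 12 arrangement is an assumption chosen only to match the core count, not Tenstorrent's actual floor plan.

from itertools import product

def torus_hops(src, dst, rows, cols):
    """Minimal hop count between two nodes on a bi-directional 2D torus.

    Each dimension wraps around, so the per-axis distance is the shorter of
    going forward or backward around that ring.
    """
    dr = abs(src[0] - dst[0])
    dc = abs(src[1] - dst[1])
    return min(dr, rows - dr) + min(dc, cols - dc)

if __name__ == "__main__":
    ROWS, COLS = 10, 12  # hypothetical arrangement of the 120 cores
    nodes = list(product(range(ROWS), range(COLS)))
    hops = [torus_hops(a, b, ROWS, COLS) for a in nodes for b in nodes if a != b]
    print("worst-case hops:", max(hops))
    print("average hops   : %.2f" % (sum(hops) / len(hops)))

The wraparound links roughly halve the worst-case distance compared with a plain mesh of the same dimensions, which keeps traffic between distant cores from dominating.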

Scaling continues
Data center chips are far less cost-sensitive than those in consumer applications, so they tend to lead the industry in performance. A high-performance server amortizes chip development costs through the price of the system rather than through volume, which is essential for a mobile phone application processor, for example. So despite a never-ending stream of predictions about the end of Moore’s Law, the digital logic inside many of these devices will continue to use the latest process geometries for density reasons.

What’s different, though, is that less performance-critical circuitry, as well as analog blocks, increasingly is being shunted off to separate chips, which are connected using high-speed interfaces.

“You now can partition by node,” said Matt Hogan, product director at Mentor, a Siemens Business. “So you can determine what is the correct technology for a particular portion of a design. That also allows you to scale some of the side effects.”

This kind of partitioning was anticipated by Gordon Moore in his now-famous 1965 paper, which noted that it may prove more economical to build large systems out of smaller functions that are separately packaged and interconnected.

“With the rapid evolution of process technology, it was typically cheaper to go with an off-the-shelf solution instead of developing custom chips,” said Tim Kogel, principal applications engineer at Synopsys. “By now, the free lunch of higher performance and lower power with every new process node is all but over. On the other hand, killer applications like AI, autonomous driving, AR/VR, etc., have an unquenchable demand for processing power and computational efficiency. Famous examples like Google’s TPU and Tesla’s FSD chips show the impressive ROI of tailoring the architecture to the specific characteristics of the target workload.”

Still, the value of Moore’s Law as originally written is waning, and that has both economic and technological implications. The economic benefits of planar scaling ended with the introduction of finFETs, when cost per transistor stopped decreasing from the previous node. Likewise, power/performance benefits have been decreasing since about 90nm. Y.J. Mii, senior vice president of R&D at TSMC, said that 3nm will bring performance improvements of just 10% to 15% for the same power, or 25% to 30% power reduction for the same speed.

This is hardly a dead end from a technology standpoint, however. Architectural improvements, including different packaging approaches and 3D layouts, can boost that performance by orders of magnitude. And scaling still helps to pack more density into those packages, even if the scaled-down transistors themselves aren’t running significantly faster.
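
Simple arithmetic, taking the 10% to 15% per-node figure above as a given, shows why architecture and packaging now have to supply the bigger multipliers. The sketch below compounds those per-node gains over a few hypothetical node transitions.

# Compound the per-node speedups cited above over four hypothetical node
# transitions. The node count is arbitrary; the per-node gains come from
# the figures quoted in this article.
for per_node_gain in (0.10, 0.15):
    speedup = 1.0
    for _ in range(4):
        speedup *= 1.0 + per_node_gain
    print(f"{per_node_gain:.0%} per node, 4 nodes: {speedup:.2f}x total speedup")

Four node transitions compound to less than 2X, while the architectural and packaging changes described here are being counted on for 10X or more.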

“We have been bombarded by More-than-Moore topics for many years now,” said Tom Wong, director of marketing for design IP at Cadence. “But is it really area reduction, power reduction or transistor performance improvements (traditional PPA) that are driving these discussions, or is it silicon economics and the limitations of lithography/equipment that caused us to hit the brick wall? As it turns out, silicon economics and the limits of reticle size are the two things driving a disruption, which is necessitating designers to look at new ways of designing chips, and turning to new architectures.”

Both economics and reticle size limits are being addressed through different packaging schemes and a boost in reticle sizes, which allow for much bigger individual die. Doug Yu, vice president of R&D at TSMC, said that reticle sizes will increase by 1.7X with the foundry’s InFO (integrated fan-out) packaging approach. In addition, TSMC plans to introduce a 110 x 110 mm² reticle in Q1 of next year, which will increase reticle size by 2.5X.

All of this is necessary as the cost of putting everything onto a single die continues to rise. Modularity allows chipmakers to customize chips relatively quickly based upon a platform type of approach. CPU, GPU and FPGA chip designers figured this out more than five years ago, and have since started the march to disaggregated implementation by going to multi-die, and letting the interposer/packaging take care of the integration. This is one reason why die-to-die connectivity IP is taking center stage today, Wong said.

“CPUs, GPUs and FPGAs have all gone the route of chiplets because these companies design the chips (chiplets) themselves and need not rely on a commercial chiplet ecosystem. They can take advantage of what a chiplet-based design can offer,” Wong noted. “Multi-core designs, including CPUs, GPUs and FPGAs, can benefit from this architectural change/trend. SoC designs that can separate ‘core compute’ and high-speed I/Os also can benefit from this. AI acceleration SoCs and crypto SoCs are two examples. And datacenter switches and fabrics, such as 25.6Tb/s for hyperscale compute and cloud builders, also can benefit from this architectural change to chiplet-based design. These designs can be as complex as 20 billion+ transistors.”

So far this approach has been utilized primarily by large chipmakers such as Intel, AMD and Marvell, each creating its own modular schemes and interconnects. So rather than building a chip and trying to pitch its benefits across a broad range of customers, they offer a menu of options using chiplets and, in Intel’s case, an assortment of connectivity options such as high-speed bridges.

Changes everywhere, some big, some tiny
Putting all these changes into perspective is often difficult because the whole industry is in motion, although not necessarily at the same velocity or for the same reasons. So while processors and processes change, for example, memory lags well behind.

In addition, some technologies need to be completely rethought while others stay the same. This is particularly evident with GPUs, which have been the go-to solution for AI/ML training because they are cheap and scalable. But they are not the most energy-efficient approach.

“We’ve seen it with bandwidth, we’ve seen it with power,” said Kristof Beets, senior director of product management and technology marketing at Imagination Technologies. “All of these different constraints come into play. From a GPU point of view it’s been a tricky evolution, because obviously GPUs are massive number crunchers, displays get ever bigger, and devices get ever smaller. So a lot of these problems keep hitting. There’s been a phase of brute force, which kind of depended on Moore’s Law. We were doubling the GPU, and for a while that was okay, because process technology kept up. But now that return is diminishing, so while we can put more logic down, we basically can’t turn it on anymore because it consumes too much power. So the brute force trick doesn’t work.”

Dynamic voltage and frequency scaling (DVFS) has helped a bit to ramp down the voltage, allowing for even bigger GPUs running at lower frequencies. Nevertheless, even that approach has limits because there are only so many GPU cores that can be used within a fixed power budget. “This gives us better FPS (frames per second) per watt, but even that is now starting to slow because leakage is going up again,” Beets said. “This is where, for GPUs, ray tracing has been interesting. It’s a way of switching away from brute force. They are very flexible. We’re seeing the same with AI and neural network processing. It’s exactly the same concept. This is where you’ve truly seen orders of magnitude solutions that are 10, 20 times better than the GPU by taking into account the data flow, the specific operations, so it’s quite interesting. It’s not quite as bad as the old days of fixed function processing. We’re not back there yet. But some of it is definitely starting to return with more dedicated processing types.”
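
The tradeoff Beets describes can be sketched with the textbook dynamic-power relationship, in which dynamic power scales roughly with C x V^2 x f, plus a leakage term. Every constant in the toy model below is an arbitrary placeholder chosen to show the shape of the curve rather than to model any real GPU: spreading the same throughput across more, slower cores improves frames per watt for a while, until the leakage of the extra silicon claws the gain back.

def frames_per_watt(cores, volts, freq_ghz,
                    work_per_frame=200.0,   # core-GHz of work per frame (assumed)
                    c_eff=2.0,              # effective switched capacitance per core (assumed)
                    leak_w_per_core=0.3):   # leakage per core at nominal voltage (assumed)
    """Toy DVFS model: throughput ~ cores*f, dynamic power ~ C*V^2*f, leakage ~ V."""
    fps = cores * freq_ghz / work_per_frame
    p_dynamic = cores * c_eff * volts ** 2 * freq_ghz
    p_leakage = cores * leak_w_per_core * volts  # crude linear approximation
    return fps / (p_dynamic + p_leakage)

if __name__ == "__main__":
    # Three designs delivering the same nominal throughput (cores x GHz = 16).
    print(f"8 cores   @ 1.0V, 2.000GHz: {frames_per_watt(8, 1.0, 2.0):.4f} FPS/W")
    print(f"16 cores  @ 0.8V, 1.000GHz: {frames_per_watt(16, 0.8, 1.0):.4f} FPS/W")
    print(f"128 cores @ 0.5V, 0.125GHz: {frames_per_watt(128, 0.5, 0.125):.4f} FPS/W")

In this toy model the middle configuration wins; pushing further only adds leakage, which is the limit Beets points to.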

There are many approaches to augmenting scaling performance. “There have been a few areas, such as application processors, GPUs, MCUs, DSPs, where we’ve had fairly general-purpose architectures exploiting Moore’s Law to do more and more,” said Roddy Urquhart, senior marketing director at Codasip. “But now there are a huge number of ideas around trying out novel architectures, novel structures, with a range of programmability. At the systolic array end, there are things that tend to be either hardwired processing elements, or they have processors that have firmware uploaded and left in a static condition for some time. At the other extreme are domain-specific processors, which are highly programmable. I see a return to innovation in the highly parallel, highly pipelined, array-type structures, which is a very good fit with neural networks of different sorts. At the other end, people are thinking more outside the box for moving out of the silos of MCU, GPU, DSP and application processors, and creating something that is more of a blended version of some of these things to meet particular needs.”

Micro-architectures
Alongside these broad architectural shifts are micro-architectural innovations. In many respects, this is a partitioning problem, in which some compute functions are given priority over others within a larger system. That can have a big impact on both performance and computational efficiency.

“Taking advantage of the inherent parallelism, the application should be mapped to an optimal set of heterogeneous processing elements,” said Synopsys’ Kogel. “Choosing for each function a processing core that provides the minimum required flexibility gives the highest possible computational efficiency. Also, the organization of the memory architecture has a very high impact on performance and power. Since external memory accesses are expensive, data should be kept in on-chip memories, close to where it is processed.”

This is easier said than done, however, and it requires multi-disciplinary and, increasingly, multi-dimensional planning. “It’s quite a challenge to manage the complexity and predict the dynamic effects of a highly parallel application running on a heterogeneous multi-processing platform with distributed memories,” Kogel said. “We propose the use of virtual prototyping to quantitatively analyze architecture tradeoffs early in the development process. This enables the collaboration of stakeholders from application and implementation teams, before committing to an implementation specification.”
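
A drastically simplified version of the kind of early, quantitative analysis Kogel describes might look like the sketch below, which exhaustively maps three kernels onto a hypothetical mix of processing elements and picks the lowest-energy assignment. Every throughput and power figure is an invented placeholder; a real virtual prototype would replace the lookup tables with simulation data.

from itertools import product

# Hypothetical processing elements and their power draw in watts (assumed).
POWER_W = {"cpu": 2.0, "dsp": 1.5, "npu": 4.0}

# Effective throughput in operations/cycle per (kernel, element) pair (assumed).
# Specialized engines are fast only on the work they were built for.
OPS_PER_CYCLE = {
    ("preprocess", "cpu"): 4,  ("preprocess", "dsp"): 8,   ("preprocess", "npu"): 1,
    ("inference", "cpu"): 4,   ("inference", "dsp"): 16,   ("inference", "npu"): 128,
    ("postprocess", "cpu"): 4, ("postprocess", "dsp"): 2,  ("postprocess", "npu"): 1,
}

KERNEL_OPS = {"preprocess": 2e7, "inference": 5e8, "postprocess": 1e7}  # ops per frame (assumed)
CLOCK_HZ = 1e9  # 1GHz assumed for every element

def evaluate(mapping):
    """Per-frame latency (s) and energy (J), assuming kernels run back-to-back."""
    latency = energy = 0.0
    for kernel, element in mapping.items():
        t = KERNEL_OPS[kernel] / (OPS_PER_CYCLE[(kernel, element)] * CLOCK_HZ)
        latency += t
        energy += t * POWER_W[element]
    return latency, energy

if __name__ == "__main__":
    candidates = (dict(zip(KERNEL_OPS, choice))
                  for choice in product(POWER_W, repeat=len(KERNEL_OPS)))
    best = min(candidates, key=lambda m: evaluate(m)[1])  # lowest energy per frame
    latency, energy = evaluate(best)
    print("best mapping:", best)
    print(f"latency = {latency * 1e3:.2f} ms, energy = {energy * 1e3:.2f} mJ")

Even with made-up numbers, the winning assignment is heterogeneous: the inference kernel lands on the specialized engine, while the control-heavy pre- and post-processing stay on more flexible cores.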

New tradeoffs
Going forward, how to proceed with power and performance tradeoffs depends on the market. Some markets are highly cost-sensitive, so they haven’t run into this problem yet. At the same time, others are less cost-sensitive and more latency-sensitive.

“People are increasingly impatient. You want to get stuff that you want as quickly as possible,” said Mike Mayberry, CTO of Intel, during a panel presentation at DARPA’s recent Electronics Resurgence Initiative (ERI) Summit. “But we’re also seeing balanced systems and more compute near the data, and that’s one of the trends we see continuing.”

Mayberry noted there is no hard stop on density scaling, but increasingly it will include the Z axis. “We’re also seeing novel beyond-CMOS devices that will enable heterogeneous architectures. A decade from now, you’ll see those on shelves.”

Intel, among others, is looking at ways to grow devices in addition to depositing and etching different materials. This has been talked about for years with such approaches as directed self-assembly. At some point that still may be economically viable, but the general consensus is probably not until after 3nm.

Alongside all of this, photonics is beginning to gather momentum as a way of moving large quantities of data in and around these increasingly dense structures with minimal heat. One of the more novel approaches involves using light for processing. Lightmatter CEO Nick Harris said that optical devices eliminate leakage effects, resulting in lower heat and more consistent performance. What makes this approach particularly interesting is that light can be partitioned into different wavelengths, allowing different colors to be prioritized.

“With 100GHz wavelength spacing, which is really small, we can fit 1,000 colors,” Harris said. The downside is that lasers don’t last forever, so there needs to be enough redundancy to allow these systems to last throughout their expected lifetimes.

For more traditional computing, the number of process node options is increasing, as well. Foundries are offering in-between nodes, which improve performance or power without a complete redesign. For example, TSMC uncorked its N4 process, which will enter risk production at the end of next year. C.C. Wei, CEO of TSMC, said in a presentation that IP used in both N5 (5nm) and N4 will be compatible, which allows companies to improve density and lower power with minimal redesign.

Still, the number of options is dizzying. In addition to different node numbers, there also are different process options for low power and for high performance. On top of that, different substrate materials are beginning to gain traction, including silicon carbide and gallium nitride for power transistors, and silicon-on-insulator for lower-cost, low-power applications.

All of that has a big impact on design rules, which are used to prevent failures. “If you’re designing a chiplet, you don’t know how it’s going to be used or placed,” said Mentor’s Hogan. “You don’t know if it’s going to be next to an MCU, so you have to figure out how to do that in a thoughtful way. You need to protect it from electromagnetic effects and other potential issues.”

And because chips are expected to function properly for longer periods of time — in the case of automotive, it may be as long as 18 years for leading-node logic — all of this needs to be done in the context of aging. This can get extremely complicated, particularly in multi-chip packages.

“You need to look at things like threshold shifts with different stimuli and scenarios,” said Vic Kulkarni, vice president of marketing and chief strategist for the semiconductor business unit at Ansys. “You can do a precise analysis of registers, but if the Vdd is not going down and the Vt is not going down, there isn’t much margin left. You also need to think about things like electrical overstress. The fabs are not willing to take that on.”

Tradeoffs range from power, performance, and cost, to quality of service.

“We used to always have lossless compression,” said Imagination’s Beets. “And about one or two years ago, we introduced lossy, as well, so we could trade off on quality. In GPUs, we’re starting to see across the board a tradeoff of quality versus cost, and the lossy compression allows the quality to be decreased, which also saves on bandwidth and power. In GPU processing, we’re starting to see the same thing, which is variable rate shading. This is basically when you look at a video, you’d say all you really care about is the face, and you want that in full detail, so the background doesn’t matter. Games essentially do the same thing. For example, in a racing game the car is very sharp and has a lot of detail, but the rest has a motion blur on it.”
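
The arithmetic behind variable rate shading is straightforward, as the toy calculation below shows: shade the tiles that matter at full rate and everything else at a coarser rate, and the shader invocation count drops sharply. The frame size, tile size, and importance split are all assumptions for illustration only.

FRAME_W, FRAME_H = 1920, 1080  # frame size in pixels (assumed)
TILE = 16                      # tile edge in pixels (assumed)
IMPORTANT_SHARE = 0.25         # fraction of tiles needing full detail (assumed)
COARSE = 4                     # background shaded once per 4x4 pixel block (assumed)

tiles = (FRAME_W // TILE) * (FRAME_H // TILE)  # frame rounded down to whole tiles
important = int(tiles * IMPORTANT_SHARE)

full_rate = tiles * TILE * TILE  # one shader invocation per pixel
vrs = important * TILE * TILE + (tiles - important) * TILE * TILE // (COARSE * COARSE)

print(f"full-rate shading: {full_rate:,} invocations")
print(f"variable-rate    : {vrs:,} invocations ({vrs / full_rate:.0%} of full rate)")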

There also are tradeoffs in precision. Lower precision can greatly speed up processing, and sparser algorithms can be written to be less precise, whether that’s 16-bit precision or even 1-bit precision. But that precision also can be controlled by the hardware and firmware, and it can have a big impact on overall system performance where some functions are made more accurate than others.
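
A minimal sketch of that precision tradeoff appears below, using naive symmetric quantization and stopping at 2 bits, since this simple scheme needs a sign bit plus at least one magnitude bit. Storage and bandwidth shrink in direct proportion to the bit width while the representational error grows; real low-precision flows calibrate scales per tensor or per channel, so treat this purely as an illustration.

import random

def quantize(values, bits):
    """Symmetric uniform quantization to signed integers with the given bit width."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    return [round(v / scale) * scale for v in values]  # de-quantized for comparison

if __name__ == "__main__":
    random.seed(0)
    weights = [random.gauss(0.0, 1.0) for _ in range(10_000)]  # stand-in for a weight tensor
    for bits in (16, 8, 4, 2):
        restored = quantize(weights, bits)
        mse = sum((w - r) ** 2 for w, r in zip(weights, restored)) / len(weights)
        print(f"{bits:2d}-bit: {bits / 32:.0%} of float32 storage, mean squared error = {mse:.2e}")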

Conclusion
For the first 40 years or so of Moore’s Law, power, performance and area improvements were sufficient for most applications, and the growth in data generally was manageable through classical scaling. After 90nm, classical scaling started showing signs of stress. So the writing has been on the wall for some time, but it has not gone unheeded.

What’s surprising, though, is just how many avenues are still available for massive improvements in performance, lower power and potentially cost savings. Engineering teams are innovating in new and interesting ways. Decades of research into what seemed like obscure topics or tangents at the time are now paying off, and there is plenty more in the pipeline.

Related Stories
EUV’s Uncertain Future At 3nm And Below
Manufacturing chips at future nodes is possible from a technology standpoint, but that’s not the only consideration.
The Next Advanced Packages
New approaches aim for better performance, more flexibility — and for some, lower cost.
The Good And Bad Of Chiplets
IDMs leverage chiplet models, others are still working on it.
Big Changes For Mainstream Chip Architectures (2018 for comparison)
AI-enabled systems are being designed to process more data locally as device scaling benefits decline.


