As chip companies customize designs, the number of possible pitfalls is growing. Tighter partnerships and acquisitions may help.
Diminishing returns from process scaling, coupled with pervasive connectedness and an exponential increase in data, are driving broad changes in how chips are designed, what they’re expected to do, and how quickly they’re supposed to do it.
In the past, tradeoffs between performance, power, and cost were defined mostly by large OEMs within the confines of an industry-wide scaling roadmap. Chipmakers designed chips to meet the narrow specifications set forth by those OEMs. But as Moore’s Law slows, and as more sensors and electronic devices generate more data everywhere, design goals and the means of achieving them are shifting. Some of the largest systems companies have taken chip design in-house to focus on specific data types and use cases. Traditional chipmakers, meanwhile, are creating flexible architectures that can be re-used and easily modified for a broader range of applications.
In this new design scheme, the speed at which data needs to be processed, and the accuracy of results, can vary widely. Depending upon the context — whether it will be used in safety- or mission-critical applications, for example, or whether it is in proximity to other components that may generate heat or noise — architects can trade off raw performance, performance per watt, and total cost of ownership, which includes reliability and security. That, in turn, determines the type of package, memory, layout, and how much redundancy is required. And it adds new concerns, such as synchronizing clocks across systems of systems, different aging rates of components in a package, and unknowns stemming from insufficient industry learning about how the various pieces go together and what can go wrong.
As these designs roll out, there are some innovative approaches emerging for customization, as well as some consistent themes. At the recent Hot Chips 34 conference, Jack Choquette, senior principal engineer at NVIDIA, gave a preview of the company’s new 80-billion-transistor GPU. The new architecture takes into account spatial locality, allowing data from different locations to be processed by available processing elements, and temporal locality, in which multiple kernels can operate on data. The goal is to allow more blocks to operate on fragments of data, synchronously or asynchronously, to improve efficiency and speed. That contrasts with existing approaches, in which all threads must wait for all of the data to arrive before processing begins.
Fig. 1: Thread block cluster, allowing co-scheduling of some processing on adjacent multiprocessors. Source: NVIDIA/Hot Chips 34
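The scheduling idea can be seen in miniature outside of GPU code. The sketch below is plain Python rather than NVIDIA’s programming model, and the fragment counts and delays are invented, but it contrasts a barrier-style flow, in which nothing is processed until every fragment has arrived, with an asynchronous flow that begins work on each fragment as soon as it lands.

    # Conceptual sketch only -- not NVIDIA's API. Contrasts waiting for all data
    # before processing with processing each fragment as soon as it arrives.
    import concurrent.futures
    import random
    import time

    def fetch_fragment(i):
        # Simulate a fragment of data arriving after a variable delay.
        time.sleep(random.uniform(0.01, 0.10))
        return list(range(i, i + 1024))

    def process(fragment):
        # Stand-in for a kernel operating on one fragment.
        time.sleep(0.02)
        return sum(fragment)

    def barrier_style(n):
        # All fragments must arrive before any processing begins.
        with concurrent.futures.ThreadPoolExecutor() as pool:
            fragments = list(pool.map(fetch_fragment, range(n)))
        return [process(f) for f in fragments]

    def asynchronous_style(n):
        # Each fragment is processed as soon as it is available,
        # overlapping processing with the remaining fetches.
        results = []
        with concurrent.futures.ThreadPoolExecutor() as pool:
            futures = [pool.submit(fetch_fragment, i) for i in range(n)]
            for fut in concurrent.futures.as_completed(futures):
                results.append(process(fut.result()))
        return results

    if __name__ == "__main__":
        start = time.time()
        barrier_style(16)
        print(f"barrier-style:      {time.time() - start:.2f} s")

        start = time.time()
        asynchronous_style(16)
        print(f"asynchronous-style: {time.time() - start:.2f} s")

With the processing step overlapping the remaining fetches, the asynchronous version finishes sooner, which is the same effect the thread block cluster approach is after at a much larger scale.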
Alan Smith, AMD senior fellow, likewise introduced a “workload-optimized compute architecture” at the conference. In AMD’s design, the data path has been widened for data forwarding and reuse. As with NVIDIA’s architecture, the goal is to remove bottlenecks from data paths, to streamline operations, and to increase utilization of various compute elements. To improve performance, AMD eliminated the need for constant copies to back up memories, which significantly reduces data movement.
AMD’s new Instinct chip includes a flexible high-speed I/O and a 2.5D elevated bridge that connects various compute elements. High-speed bridges — first introduced commercially by Intel with its Embedded Multi-Die Interconnect Bridge (EMIB) — are used to make two or more chips act as one. Apple has used this approach, bridging two Arm-based M1 SoCs to create its M1 Ultra chip.
Fig. 2: AMD’s multi-die approach with fan-out bridge. Source: AMD/Hot Chips
All of these architectures are more flexible than previous versions, and the chiplet/tile approach provides a way for big chipmakers to customize their chips while still serving a broad customer base. Meanwhile, systems companies such as Google, Meta, and Alibaba are taking this a step further, designing chips from scratch that are tuned specifically to their data type and processing goals.
Tesla’s data center chip architecture is a case in point. “In the early parts of the AI revolution, the compute requirements were scaling roughly with Moore’s Law,” said Peter Bannon, vice president of low voltage and silicon engineering at Tesla, during a presentation at the recent TSMC Tech Symposium. “But over the last five years, there has been a distinct change in the trajectory, where the compute needs have been doubling every three or four months as people have figured out how to train bigger and bigger models that continue to deliver better and better results.”
The Tesla design team set a goal of scaling up “with no practical limit on the size of the machine,” said Bannon. “The thinking was, ‘If the machine wasn’t big enough for a particular model, we would just grow the machine and make it bigger.’ We wanted to be able to exploit multiple levels of parallelism — both data and model-level parallelism at the training level, as well as parallelism inside the inherent operations you’re doing when you’re training the convolution and matrix multiplication. And we wanted it to be fully programmable and flexible hardware.”
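To put Bannon’s two growth rates side by side, a quick back-of-the-envelope calculation is enough. The snippet below assumes a Moore’s Law-style doubling roughly every 24 months against compute demand doubling every three to four months; the figures are illustrative, not Tesla’s own projections.

    def growth(months, doubling_period_months):
        # Compound growth: how many times the starting value after `months`.
        return 2 ** (months / doubling_period_months)

    five_years = 60  # months

    print(f"Process scaling (24-month doubling): ~{growth(five_years, 24):.1f}x over 5 years")
    print(f"Training compute (4-month doubling): ~{growth(five_years, 4):,.0f}x over 5 years")
    print(f"Training compute (3-month doubling): ~{growth(five_years, 3):,.0f}x over 5 years")

Even at the slower of the two rates, the gap after five years is several orders of magnitude, which is why the design goal became a machine that can simply be grown larger.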
What’s different
ASICs always have been customized, but at each new process node the cost is rising to the point where only the highest-volume applications, such as smartphones or PCs, can recoup the design and manufacturing costs. Increasingly, systems companies are absorbing that rising cost by using the chips they design internally, and they are looking to extend those customized architectures over longer periods of time.
To squeeze more performance per watt out of these designs, they also are optimizing chips for specific software functionality, as well as how the software takes advantage of the hardware — a complex and often iterative process that requires continuous fine-tuning with regular software updates. In the case of a data center, for example, these chips can improve performance per watt and run cooler, which has the added benefit of reducing electricity costs for powering and cooling racks of servers.
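A rough rack-level calculation shows why that matters. The rack power, PUE, and electricity price below are placeholder assumptions rather than figures from any particular data center, but they illustrate how a modest gain in performance per watt compounds through power delivery and cooling.

    def annual_energy_cost(rack_power_kw, pue, price_per_kwh, hours_per_year=8760):
        # Facility-level electricity cost for one rack, including cooling overhead (PUE).
        # All input values are illustrative assumptions.
        return rack_power_kw * pue * hours_per_year * price_per_kwh

    baseline  = annual_energy_cost(rack_power_kw=12.0, pue=1.5, price_per_kwh=0.10)
    optimized = annual_energy_cost(rack_power_kw=9.0,  pue=1.4, price_per_kwh=0.10)

    print(f"baseline rack:  ${baseline:,.0f} per year")
    print(f"optimized rack: ${optimized:,.0f} per year")
    print(f"savings:        ${baseline - optimized:,.0f} per rack, per year")

Multiplied across thousands of racks, even a few thousand dollars per rack per year can justify a great deal of architectural fine-tuning.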
There are other considerations, as well.
All of this makes the design, verification, and debug process much more difficult, and it can create problems in manufacturing if there is insufficient volume and knowledge about where anomalies may show up. This explains why more EDA, IP, test/analytics, and security companies are starting to offer services to supplement work by in-house design teams.
“It’s no longer about designing a CPU that’s going to do x, y, and z functions for every workload without thinking about the overhead,” said Sailesh Chittipeddi, executive vice president at Renesas. “That’s why all these companies are now getting more verticalized. They’re driving solutions they need. That includes AI at the system level. And it includes an interplay between electrical and mechanical features, down to the level of where you place a particular connector. It’s also driving more CAD companies to get into system-level support and system-level design.”
This shift is happening across a growing number of vertical markets, from handsets and automotive to industrial applications, and it is driving a wave of smaller acquisitions that fall well below the radar as chipmakers look to position their hardware for a broad swath of new markets. Renesas’ acquisition of Reality Analytics in June, for example, is aimed at creating AI models for a variety of industrial market segments.
“This technology can be used to look at vibration in a system and predict when a particular part is going to fail,” Chittipeddi said. “If you look at mining, for example, if a drill bit breaks, it can cause massive problems. We can import those models onto our MCUs, which can be used to control these systems.”
Who does what
However, domain-specific solutions ratchet up the pressure on EDA companies to figure out which commonalities can be automated. That was much easier with planar chips developed at a single process node. But as more markets are digitalized — whether that is automotive, industrial, mil/aero, commercial, or consumer — their goals are becoming increasingly divergent.
That divergence is only expected to grow as chiplets developed at different process nodes are combined into customized packages, which may be based on everything from pillars in fan-outs to full 3D-IC implementations. In some cases, there may even be a combination of both 2.5D and 3D-ICs, which Siemens EDA has labeled 5.5D.
The good news for EDA and IP companies is that this has significantly increased demand for simulation, emulation, prototyping, and modeling. Large systems vendors also have been pressuring EDA vendors to automate more of the systems companies’ design processes, but there isn’t enough volume to warrant that investment. Instead, the systems companies have reached out to EDA and IP companies for expert services, shifting from a transactional relationship to a much deeper partnership, and giving EDA companies a closer look at how various tools are being used and where the gaps are that could foster new opportunities.
“A lot of new players are more vertically integrated, so they’re doing more in-house,” said Niels Faché, vice president and general manager of design and simulation at Keysight Technologies. “There’s much more interest in system-level simulation, and there is a growing need for collaborative workflows within companies and between companies. We’re also seeing more iterating through design. So you have a development team, a quality team, and you update the design continually.”
For chip companies that are designing chips for OEMs, that’s only part of the challenge. “If you look at the automotive market, there is a shift away from designing chipsets against requirements,” said Faché. “In the initial stage, the chip company may build a reference design with the software, and set in the context of how it will get used. The OEM then will look to optimize that. What that does is to push collaboration into the traditional food chain. If you’re developing a radar chip, for example, it’s not just a radar subsystem. It’s radar in the context of a larger technology stack.”
That stack might include the RF package, antennas, and the receiver, while the OEM builds the radio with support from EDA tools and workflows.
Application-specific vs. general-purpose
A big challenge for design teams is that more of the design is becoming front-loaded. Instead of just creating the chip architecture and then working out the details in the design process, more of it needs to be addressed right at the architectural level.
“There was one case when a chip company shipped a chip that used too much power, and the OEM wasn’t happy,” said Joe Sawicki, executive vice president at Siemens Digital Industries Software. “But you wouldn’t know that just running applications. And AI has made that a much bigger problem because it’s not just about the software. Now you have all this inferencing running on it. If you don’t care about latency, you can have a general-purpose chip sitting in the cloud where you just communicate to the cloud and get the data back. But if you have stuff that’s real-time, where it needs to respond immediately, you can’t afford that latency and you want low power. So then, at least for the accelerator, you want to have that custom designed.”
Gordon Cooper, product marketing manager at Synopsys, agreed. “If you’re using AI, is it being used 100% of the time, or is it a nice-to-have? If I just want to say I have AI on my chip, maybe I just need to use a DSP to do the AI,” he said. “There’s a tradeoff, and it depends on the context. If you want full-blown AI 100% of the time, maybe you need to add external IP or additional IP.”
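Both tradeoffs, latency and duty cycle, come down to simple arithmetic before they become architecture decisions. The sketch below uses invented but plausible numbers for round-trip network time, inference time, and power; it is meant only to show how the break-even points move with context.

    def end_to_end_latency_ms(network_rtt_ms, inference_ms, queueing_ms=0.0):
        # Total response time as seen by the application.
        return network_rtt_ms + queueing_ms + inference_ms

    def average_power_mw(active_mw, idle_mw, duty_cycle):
        # Time-weighted average power for an engine that is only sometimes busy.
        return active_mw * duty_cycle + idle_mw * (1.0 - duty_cycle)

    # Latency: cloud round-trip vs. an on-device accelerator, against a 20 ms budget.
    # All numbers are illustrative assumptions.
    cloud = end_to_end_latency_ms(network_rtt_ms=40.0, inference_ms=5.0, queueing_ms=10.0)
    local = end_to_end_latency_ms(network_rtt_ms=0.0, inference_ms=12.0)
    print(f"cloud: {cloud:.0f} ms, local: {local:.0f} ms (budget: 20 ms)")

    # Duty cycle: reuse an existing DSP vs. add dedicated AI IP that also leaks when idle.
    for duty in (0.05, 1.0):
        dsp = average_power_mw(active_mw=300.0, idle_mw=0.0, duty_cycle=duty)  # DSP is already on-chip
        npu = average_power_mw(active_mw=60.0, idle_mw=20.0, duty_cycle=duty)
        print(f"duty cycle {duty:.0%}: DSP {dsp:.0f} mW vs. dedicated IP {npu:.0f} mW")

With these assumptions, the occasional-AI case favors reusing the DSP, while always-on AI within a tight latency budget favors a dedicated, locally attached accelerator.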
One of the big challenges with AI is keeping a device current, because the algorithms are constantly being updated. This becomes much more difficult if the design is a one-off and everything is optimized for one or more algorithms. So while architectures need to be scalable in terms of performance, they also need to be scalable over time, as well as in the context of other components in a system.
Software updates can create havoc with clocks. “Anything you do with the quality of synchronization on a chip will impact latency, performance, power, and time-to-market,” said Mo Faisal, CEO of Movellus, in a presentation at the AI Hardware Summit 2022. “People are building bigger and bigger chips — reticle-sized chips — where you optimize the core and make sure it plays nicely with the software. This is matrix multiplication, graph compute, and the more cores you throw at it in parallel, the better. However, these chips are now running into challenges. Before, this was a problem for one or two teams at Intel and AMD. Now it’s everybody’s problem.”
Keeping everything in sync is becoming a process rather than a single function. “You may have different workloads,” Faisal said. “So you may only want to use 50 cores for one workload, and for the next you want to use 500 cores. But when you turn on the next 500 cores, you end up putting pressure on the power network and causing droop.”
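A first-order estimate shows why that droop is hard to ignore. The sketch below applies the usual V = L·di/dt + I·R approximation to the power delivery network; the current step, ramp time, and parasitics are illustrative assumptions, not measured silicon data.

    def supply_droop_mv(delta_current_a, ramp_time_s, inductance_h, resistance_ohm):
        # First-order droop: inductive kick from the current ramp plus resistive (IR) drop.
        di_dt = delta_current_a / ramp_time_s
        return (inductance_h * di_dt + delta_current_a * resistance_ohm) * 1e3

    # Stepping from 50 active cores to 500, at ~0.2 A per core, ramping over 100 ns.
    delta_i = (500 - 50) * 0.2  # 90 A load step
    droop = supply_droop_mv(delta_i, ramp_time_s=100e-9, inductance_h=20e-12, resistance_ohm=0.5e-3)
    print(f"Estimated droop: ~{droop:.0f} mV on a ~750 mV supply")

Roughly 60 mV on a 750 mV rail is close to a 10% hit, which is why clock and power management increasingly have to respond to workload changes rather than rely on static margin.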
There also are problems with simultaneous switching noise. In the past, some of this could be addressed with margin. But at advanced nodes, that margin adds to the time and energy required to drive signals through very thin wires, where higher resistance increases thermal dissipation. So the tradeoffs become much more complex at each new node, and the interactions between different components in a package are additive.
“If you look at 5G, that means something different in automotive than for the data center or the consumer,” said Frank Schirrmeister, group director for product marketing at Cadence at the time of this interview. “They all have different latency and throughput requirements. The same is true for AI/ML. It depends on the domain. And then, because everything is hyper-connected, it’s not only within one domain. So it essentially requires many variations of the same chip, and that’s where heterogeneous integration gets interesting. The whole disintegration of the SoC comes in handy because you can do different performance levels based on things like binning. But it’s no longer a design per se, because some of the rules no longer apply.”
Conclusion
The entire chip design ecosystem is in flux, and that extends all the way to software. In the past, design teams could be assured that software written at a high level of abstraction would work well enough, and that there would be a regular improvement with the introduction of each new node. But with the benefits of scaling dropping off, and a continuing increase in data that needs to be processed more quickly, everyone now has to work harder, and they have to work much more collaboratively with groups with which they have never had much contact in the past.
The best way forward, at least as far as power and performance are concerned, is to design chips for specific purposes using customized or semi-customized architectures. But that creates its own set of issues, and those issues will take time to iron out. Tools for 2.5D and 3D designs are just starting to roll out, and chipmakers are sorting out whether they will get very specific, or stay general enough to leverage their architectures across multiple designs. Either way, engineers in every discipline will need to start looking beyond their area of focus to systems of chips, and systems of systems. The future is bright, but it’s also much more challenging.
Related Reading
Toward Domain-Specific EDA
Is the tools market really changing, or has this always been the case?
Big Changes In Architectures, Transistors, Materials
Who’s doing what in next-gen chips, and when they expect to do it.
Scaling, Advanced Packaging, Or Both
Number of options is growing, but so is the list of tradeoffs.