Chip Design Shifts As Fundamental Laws Run Out Of Steam

How prepared the EDA community is to address upcoming challenges isn’t clear.


Dennard scaling is gone, Amdahl’s Law is reaching its limit, and Moore’s Law is becoming difficult and expensive to follow, particularly as power and performance benefits diminish. And while none of that has reduced opportunities for much faster, lower-power chips, it has significantly shifted the dynamics for their design and manufacturing.

Rather than just different process nodes and half nodes, companies developing chips — traditional chip companies, automotive OEMs, fabless and non-fabless IDMs, and large systems companies — are now wrestling with more options and more unique challenges as they seek optimal solutions for their specific applications. And they are all demanding more from an EDA ecosystem, which is racing to keep up with these changes, including various types of advanced packaging, chiplets, and a demand for integrated and customized hardware and software.

“While heterogeneous integration predates the end of Dennard scaling or the flattening of Moore’s Law by several years, silicon designers and system architects are embracing this paradigm now to retain their pursuit of PPA goals — without empirical law and its derivatives,” said Saugat Sen, vice president of R&D at Cadence. “While there are many architectural and design challenges in this era, addressing thermal concerns rises to the top. Efficiency in design and implementation has been intricately linked to closed-loop integration with multi-physics analyses for a while. More-than-Moore has created a compelling case for the implementation-analyses microcosm to transcend across the fabrics of system design, from silicon to package, and even beyond, and more so in the systems companies that are at the bleeding edge of design innovation.”

Defining power and energy requirements for systems processors is becoming much more difficult.

Power consumption and total energy usage of compute is a huge concern, and it’s getting bigger by the moment due to geopolitical developments, the rising cost of energy, and environmental concerns, said Christian Jacobi, IBM fellow and CTO for system architecture and design for IBM Z Systems. “At the same time, since Moore’s Law and Dennard scaling are essentially over, as architects we want to keep adding features, functions, performance, and more cores to every chip without exploding in the energy footprint. So we have to get smarter about how we manage energy in the chips, from how to optimize power consumption versus performance at any point in time, how to use periods of lesser activity where not every compute resource is being used to its fullest, to reducing the power consumption on components of the chip.”
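A minimal sketch of that kind of activity-driven power management, assuming a simple three-point voltage/frequency table and the standard dynamic-power relationship P ≈ C·V²·f (the operating points and capacitance value below are illustrative assumptions, not IBM data):

```python
# Minimal sketch of activity-driven power management (illustrative only).
# Dynamic power is modeled as P = C * V^2 * f; the operating points and
# capacitance below are assumptions for illustration, not real silicon data.

OPERATING_POINTS = {          # (voltage in V, frequency in GHz)
    "turbo":   (0.95, 3.2),
    "nominal": (0.80, 2.4),
    "idle":    (0.65, 1.2),
}
C_EFF = 1.0e-9                # effective switched capacitance (F), assumed

def dynamic_power(voltage, freq_ghz):
    return C_EFF * voltage**2 * (freq_ghz * 1e9)   # watts

def pick_point(utilization):
    """Choose an operating point from measured core utilization (0..1)."""
    if utilization > 0.75:
        return "turbo"
    if utilization > 0.25:
        return "nominal"
    return "idle"

for util in (0.9, 0.5, 0.1):
    point = pick_point(util)
    v, f = OPERATING_POINTS[point]
    print(f"utilization {util:.0%}: {point:8s} -> {dynamic_power(v, f):.2f} W")
```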

IBM’s solution for its Z Systems is to integrate AI into the processor chip. “We can access the data where it already lives,” Jacobi said. “If the data is in the processor chip, and in the caches of the processor chip — because that’s where whatever other business process was computing on that data, like a banking transaction, or a credit card transaction — I don’t need to take that data and move it somewhere else, to a different device or across a network or through a PCI interface to an I/O attached adapter. Instead, I have my localized AI engine, and I can access that data there. I don’t have to move it half a meter or a meter or a kilometer to go to a different device. That obviously reduces the energy footprint quite significantly for doing the AI. The actual computation itself, the adds and multiplies, they still consume power. But at least we can reduce that overhead of getting the data to the compute and back.”
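A back-of-the-envelope sketch of why that locality matters, using coarse, assumed per-operation energies (the picojoule figures below are generic ballpark values for illustration, not measurements of IBM Z hardware):

```python
# Rough energy comparison for running a small inference on data that is
# already in an on-chip cache vs. shipping it to an off-chip accelerator.
# The per-event energies are coarse, assumed ballpark values (picojoules),
# not measurements of any specific product.

E_MAC_PJ          = 1.0    # one multiply-accumulate, on-chip
E_SRAM_BYTE_PJ    = 5.0    # read one byte from on-chip cache/SRAM
E_OFFCHIP_BYTE_PJ = 500.0  # move one byte over PCIe/DRAM to another device

def inference_energy_pj(data_bytes, macs, off_chip):
    move = E_OFFCHIP_BYTE_PJ if off_chip else E_SRAM_BYTE_PJ
    return data_bytes * move + macs * E_MAC_PJ

data_bytes = 4 * 1024          # e.g., features for one transaction
macs       = 200_000           # a small scoring model

local  = inference_energy_pj(data_bytes, macs, off_chip=False)
remote = inference_energy_pj(data_bytes, macs, off_chip=True)
print(f"on-chip:  {local/1e6:.3f} uJ")
print(f"off-chip: {remote/1e6:.3f} uJ  ({remote/local:.1f}x more energy)")
```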

What that means for the rest of the ecosystem is complex, because not every chip or package will do things the same way. “There are still a number of changes that have to happen to support the ecosystems and product complexity,” said Chris Mueth, senior manager of new markets and digital twin program manager at Keysight. “Product complexity is the main driver, because everybody wants more miniaturization. Everybody wants more functions within the products that they have. So there’s more integration required. And while it seems like we’re approaching asymptotic conditions, I don’t think we’re dead yet.”

In fact, there are at least several more process nodes on the Moore’s Law roadmap, and all three of the leading-edge foundries — Samsung, TSMC, and Intel — have roadmaps that extend into the 1.x nanometer range. “That’s really important because we have to make the transistor smaller for two reasons,” said Mueth. “One is for speed, the other is for thermal. As you’re clocking a bazillion transistors on a chip, you’re generating a lot of heat. The way around that is to shrink everything down, but at some point we’re going to reach an asymptotic peak.”

Steven Woo, fellow and distinguished inventor at Rambus, agreed. “Now that Dennard scaling has basically stopped, you really can’t get the reliable decreases in power anymore,” he said. “So if you want to keep getting performance, and you want to keep increasing the compute density, you’re going to have to figure out a way to wick away the heat. There are only a couple of ways out of the box. That’s one of them.”

In electric vehicles, for example, this means the ECUs must be designed within the constraints of a very limited power budget inside the overall electrical system. Hardware designers traditionally have approached this type of problem by adding operating modes that can be throttled back and monitored to meter what the system is doing, such as slowing things down.

“What we’re seeing more in AI, which will likely play out in all fields, is that the software engineers really understand the tradeoff between the performance of the system and the precision of the system,” Woo said. “If they are somehow limited in bandwidth, energy, or otherwise, they turn it into a software problem. If they need more bandwidth, they can reduce the precision of the numbers, and they train specifically for reduced precision or for sparsity. In the AI field, there is a holistic view of integration between the software side of things and the hardware side. In the same way over the last 20 years, programmers have been forced to become more architecturally aware, considering cache size and processor architecture. In the future, programmers will have to be more cognizant about things like power limitations in the system, and try to use tools and APIs that let them trade off power for performance. That evolution will happen. It will take time. It’s taken about a generation of programmers to really understand that you can’t be as abstracted anymore in what the architecture looks like when you’re writing your programs. It’s probably going to move in that direction over the next 20 years.”
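A minimal sketch of the precision-for-bandwidth trade Woo describes, using a generic symmetric int8 quantizer (not any particular framework's scheme): dropping activations from fp32 to int8 cuts the bytes that have to cross the memory interface by 4x, at the cost of a small, bounded quantization error.

```python
import numpy as np

# Generic symmetric int8 quantization -- illustrates trading numeric
# precision for memory bandwidth, not any specific vendor's scheme.
def quantize_int8(x):
    scale = np.max(np.abs(x)) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

activations = np.random.randn(1_000_000).astype(np.float32)
q, scale = quantize_int8(activations)

fp32_bytes = activations.nbytes          # 4 MB
int8_bytes = q.nbytes                    # 1 MB -> 4x less traffic
err = np.max(np.abs(dequantize(q, scale) - activations))

print(f"bytes moved: {fp32_bytes/1e6:.1f} MB -> {int8_bytes/1e6:.1f} MB")
print(f"worst-case quantization error: {err:.4f}")
```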

This is particularly true in automotive, where chips need to perform reliably over time, and need to be updated as algorithms and communications protocols change.

“One of the mega trends that we see is the monitoring of the health status,” said Roland Jancke, design methodology head in Fraunhofer IIS’ Engineering of Adaptive Systems Division. “If you’re no longer able to control the behavior of the chip at design time, then you need to have something during operation to monitor it and switch over to spare parts or some other backup. For automotive electronics, you need to think about everything that can happen during operation. But if you say, ‘Let’s develop this based on the possibility that parts will fail,’ and you put in some spare parts, then you will exceed your monetary budget.”

Jancke said the key there is the ability to fail over to another system in case of an emergency, but that can be a very complex process. Like many of the changes underway in chip design, it requires a breakdown of some of the traditional silos, where systems, semiconductor, packaging, and software engineers work collaboratively on heterogeneous architectures.

“A heterogeneous architecture is not a new concept,” said Vik Karvat, senior vice president of products, marketing, and planning at Movellus. “It’s evolved and expanded in many verticals, including mobile, automotive, and AI. The difference now is that heterogeneous compute elements are much larger and more powerful. This is exemplified by NVIDIA’s Hopper + Grace solution, and Intel’s Sapphire Rapids and Falcon Shores platforms. However, as these elements get larger, and data center compute demand and density targets continue on their geometric growth curves, heterogeneous monolithic designs will transition to heterogeneous chiplet approaches in order to continue to scale. This requires system, semi, and packaging companies to all come together.”

Where we’ve been
In 1974, Robert Dennard and his IBM colleagues published a paper on MOSFET scaling, which showed that as transistors get smaller, their power density stays constant. That held up until about 2005, when leakage power started to become problematic. “That was really the engine behind Moore’s Law,” said Roddy Urquhart, senior technical marketing director at Codasip. “Dennard scaling and Moore’s Law enabled you to take advantage of new generations of silicon geometry, basically double the number of transistors, and push up the clock rate by about 40% with each generation. Interestingly, around that time, Intel planned the Pentium 5 processor, and wanted it to go to 5 or 7 GHz, but they had to cancel it in the end because of thermal problems.”
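The arithmetic behind that engine is straightforward to restate. Under classic Dennard scaling, shrinking dimensions and supply voltage by a factor k keeps power density constant; once voltage stops scaling, as it largely did after the mid-2000s, power density climbs roughly as k². A short sketch of those ratios (the scale factor is the conventional ~1.4x per node, used here purely for illustration):

```python
# Classic Dennard scaling vs. the post-2005 regime, per node shrink.
# P_dyn ~ C * V^2 * f, with C ~ 1/k, f ~ k, and transistor area ~ 1/k^2.

k = 1.4   # ~one process generation (0.7x linear shrink)

def power_density_ratio(voltage_scales):
    c_ratio = 1 / k                       # capacitance shrinks with size
    v_ratio = (1 / k if voltage_scales else 1.0) ** 2
    f_ratio = k                           # clocks ran ~40% faster per node
    p_per_transistor = c_ratio * v_ratio * f_ratio
    density = k ** 2                      # transistors per unit area
    return p_per_transistor * density     # power per unit area

print("Dennard era :", round(power_density_ratio(True), 2), "x per node")
print("post-Dennard:", round(power_density_ratio(False), 2), "x per node")
```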

Another limiter on processor design was the ceiling on CMOS clock rates, which flattened out in the mid-2000s.

Because of these limitations, there was a clear shift to multi-core design, starting with mobile devices. “First was having processors that were specialized for a particular function, like a GPU for mobile phone graphics, or dedicated microcontrollers handling things like Wi-Fi or Bluetooth,” Urquhart said. “Second, there were multi-core systems, initially dual-core. Today, there are four-core systems for running things like Android. For running something like an operating system, there are some operations that can be parallelized, and others that are inherently sequential, which is where Amdahl’s Law applies.”
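Amdahl's Law puts a number on that limit: if a fraction p of the work can be parallelized across n cores, the speedup is 1 / ((1 - p) + p/n), so even a modest sequential fraction caps what extra cores can deliver. A quick worked sketch:

```python
# Amdahl's Law: speedup from n cores when only a fraction p of the work
# is parallelizable. The serial remainder (1 - p) sets a hard ceiling.
def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

for p in (0.50, 0.90, 0.95):
    print(f"p = {p:.0%}: "
          + ", ".join(f"{n} cores -> {amdahl_speedup(p, n):.2f}x"
                      for n in (2, 4, 64)))
    print(f"         ceiling as n -> inf: {1/(1-p):.1f}x")
```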

Emerging workloads like AI/ML can take advantage of data parallelism, opening the door to specialized architectures for solving very specific problems. And there are other opportunities within embedded devices. For example, Urquhart described some research Codasip has been doing with a conventional three-stage-pipeline, 32-bit RISC-V core, using Google’s TensorFlow Lite for Microcontrollers to do the quantization. The company is then creating custom RISC-V instructions to accelerate neural networks using very limited computing resources.
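For context, the hot loop such custom instructions typically target is a quantized multiply-accumulate. The sketch below is a generic int8 dot product with a widened accumulator, the shape of kernel that falls out of TensorFlow Lite for Microcontrollers-style quantization; a custom RISC-V instruction would aim to collapse several of these multiply-accumulates into a single operation. It is an illustration of the kernel, not Codasip's actual instruction set extension.

```python
import numpy as np

# Reference kernel for a quantized (int8) dot product with a widened
# accumulator -- the loop body a custom RISC-V MAC-style instruction
# would accelerate. Illustrative only.
def int8_dot(a_q, b_q, scale_a, scale_b):
    acc = 0
    for x, y in zip(a_q, b_q):
        acc += int(x) * int(y)            # widen before accumulating
    return acc * scale_a * scale_b        # dequantize the result

a = np.random.randn(64).astype(np.float32)
b = np.random.randn(64).astype(np.float32)
sa, sb = np.abs(a).max() / 127, np.abs(b).max() / 127
a_q = np.round(a / sa).astype(np.int8)
b_q = np.round(b / sb).astype(np.int8)

print("quantized:", round(int8_dot(a_q, b_q, sa, sb), 3))
print("float32  :", round(float(np.dot(a, b)), 3))
```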

Urquhart said this would work well for IoT devices, where there is simple sensing or simple video processing to do. “For applications like augmented reality or autonomous driving, you’re dealing with much greater quantities of video data. The way to handle that is going to be exploiting the inherent parallelism of the data. There are a number of examples of this. Google is said to be using its Tensor Processing Units for doing image recognition on its server farms. The TPU is a systolic array, so it processes matrices very efficiently. That’s one approach the industry is taking.”
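To make the systolic-array idea concrete, here is a small cycle-level model of a generic output-stationary array (an illustration of the principle, not Google's TPU design): each processing element keeps one running sum and only consumes operands arriving from its neighbors, which is why the structure maps so efficiently onto dense matrix math.

```python
import numpy as np

def systolic_matmul(A, B):
    """Cycle-by-cycle model of an output-stationary systolic array.

    PE(i, j) holds a running sum for C[i][j]. Row i of A enters from the
    left delayed by i cycles; column j of B enters from the top delayed by
    j cycles, so a[i][k] and b[k][j] meet at PE(i, j) on cycle k + i + j.
    """
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))
    cycles = k + n + m - 2              # enough cycles to drain the array
    for t in range(cycles):
        for i in range(n):
            for j in range(m):
                kk = t - i - j          # operand pair reaching PE(i, j) now
                if 0 <= kk < k:
                    C[i, j] += A[i, kk] * B[kk, j]
    return C

A = np.random.rand(4, 6)
B = np.random.rand(6, 3)
assert np.allclose(systolic_matmul(A, B), A @ B)
```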

Where we’re going
In order to move forward in computational performance, one approach is to take relatively conventional cores and enhance them with either additional instructions or additional processing units, where something can be accelerated but still retain a certain amount of general-purpose capabilities. “Otherwise, you’re going to have to go to special arrays that some people are talking about for AI/ML purposes,” Urquhart said. “There are also some novel approaches. You wouldn’t have thought 10 years ago about using analog for matrix processing, but companies like Mythic have found that for inferencing purposes they don’t need super high precision. So they’ve been exploiting analog arrays to do the matrix processing.”
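The analog approach Urquhart refers to leans on basic circuit laws rather than digital arithmetic: store each weight as a conductance, apply inputs as voltages, and Ohm's and Kirchhoff's laws sum the products as currents on a bitline. The sketch below is an idealized numeric model of that principle, with no device noise or non-linearity, and is not a model of Mythic's implementation.

```python
import numpy as np

# Idealized model of analog in-memory matrix-vector multiply:
# weights are stored as conductances G (siemens), inputs applied as
# voltages V, and each output line's current is I_j = sum_i G[i, j] * V[i]
# (Ohm's law per cell, Kirchhoff's current law on the bitline).
# Negative weights would use differential cell pairs in real hardware.
rng = np.random.default_rng(0)

weights = rng.uniform(-1, 1, size=(8, 4))       # logical weights
g_scale = 1e-6                                  # map weight 1.0 -> 1 uS
G = weights * g_scale                           # conductance array
V = rng.uniform(0, 0.5, size=8)                 # input voltages

I = V @ G                                       # bitline currents (amps)
result = I / g_scale                            # read back logical values

assert np.allclose(result, V @ weights)
print("bitline currents (uA):", np.round(I * 1e6, 3))
```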

This speaks to the swing toward bespoke silicon approaches proliferating today, which has a pull-through effect on what the EDA ecosystem needs to deliver. EDA is always racing to provide solutions to solve the architecture, design and verification problems that design teams have.

“EDA doesn’t drive these things, but rather works to make people’s inventions possible,” said Simon Davidmann, CEO of Imperas Software, noting that EDA also tries to help people who are pushing the boundaries of everything. “Typically what happens is that those people pushing the boundaries often will come up with their own ways of doing things. Then EDA comes to help, to make it more cost-effective, scalable, and shareable. Then the industry can move forward and it’s not full of proprietary, one-off approaches. It tends to evolve if there really is a market for it with EDA. If someone has a wild idea to do something, but nobody else wants to do it that way, EDA won’t touch it. They’ll have to build it themselves. Whereas if they’re pioneering a new way of doing things, which is a generalized problem, then EDA leaps on it to try and build technologies it can sell to them.”

It also shows off the capabilities of EDA engineers. “The people building semiconductors and architectures are very smart, but EDA has to be as smart to understand what they need and then help them do it — and do it in a way that is more general and cost-effective,” Davidmann said. “Yet there’s a tension between the invention of the inventors and EDA providing help to them. EDA works very hard to be close to the leaders and build them solutions that work. The changing/ending of these laws impacts EDA in that the industry has to keep rushing to do better and more on everything, as it always has. As architectures change, the engineering teams need the new technologies. EDA listens to the challenges that customers are facing and fighting, then tries to provide them with solutions. The laws ending keeps the EDA world on its toes. You have to be very agile and fit in the EDA industry, or your technology becomes mainstream and irrelevant. The goal in EDA is to solve the world’s design implementation problems, do it better, and help customers do it.”

Even as the traditional laws governing chip design run out of steam, much is still possible. “It’s really a question of the scope of optimization able to be tackled, and how the systems are partitioned,” said Neil Hand, strategy director for digital verification technologies at Siemens EDA.

Until very recently, most designs relied on an initial best effort system decomposition, and then local implementation optimization for each part of the system. “While this has worked, it leaves a lot of potential optimization on the table,” Hand said. “The key to unlocking this potential will be new and/or enhanced tools/methods that enable a model-based, cybertronic-system-engineering design approach (MBCSE), including informed functional allocation within the system. These tools and methods will allow system designers to do system analysis and tradeoffs as part of the design process, and monitor as the design evolves.”

While the concept isn’t new, and has been successfully applied in other design disciplines, it needs to be adapted to “electronics on top” systems and traditional EDA tool users. Hand noted that vertically integrated systems houses have an advantage here, because they control all parts of the design, and various internal groups can work together. “In addition to these new tools and methods, the EDA industry needs to work with industry on building an ecosystem that enables sharing of system design data and the creation of virtually vertically integrated systems houses. So it’s not just about the tools and methods, but also effective data and metadata sharing.”

This includes workflows. Keysight’s Mueth observed that workflows are the new frontier of where the EDA industry is headed.

“In EDA technology, a lot of it is largely mature,” Mueth said. “While everybody’s making incremental headway and chipping away at things, the biggest bottlenecks now are these workflows around these complex systems. You have to take into account the whole product development cycle, because that is the task at hand. Let’s say you have a team that’s put together some of the workflow from multiple functions that they are all working towards to get this concept designed. Let’s get it verified. And let’s get it transferred to production. So it goes from concept to design and design verification, then to prototype DVT test. That’s the verification in the hardware land. Then there’s the pilot production where you do some limited runs and figure out how to make this really efficient for manufacturing. Then comes manufacturing. This means there are six major steps for a product development flow. The workflow today consists of many manual processes. The trick is to remove those, link everything to share the IP, introduce digital threads, and include interoperability for data and tools. That has to be part of the ecosystem, but things are so complex you can’t manage those manually anymore.”

How this applies in a market with increasingly customized designs remains to be seen. “Already we see in many applications that software-based SoC- or server-side solutions are no longer sufficient or competitive,” said Stuart Clubb, CSD marketing director for Siemens EDA. “Custom hardware accelerators are gaining ground as a solution that delivers lower power and higher performance with the ability to tune the hardware to specific application needs.”

Many of these accelerators are highly algorithmic in nature, and their design and verification in RTL remain a challenge, both in terms of time and engineering resources. Systems companies are addressing their specific needs by building their own SoCs tuned to the role at hand. Chip companies, in contrast, need to respond with broad offerings, often containing variants of what might be the same accelerators for different markets, Clubb said.

This is where high-level synthesis (HLS) and high-level verification (HLV) are gaining traction. Clubb explained that HLS/HLV work in combination to significantly reduce design and verification time over traditional RTL, while delivering more competitive solutions in the accelerator space. He expects this market demand and application will continue to increase across a broad range of vertical markets, from battery-sensitive edge applications all the way to solutions sitting in the server farm. “System architects and chip designers need to build smarter and more specialized hardware to make use of the process nodes and transistors available, yet being mindful of the physical limitations that we are now seeing as Amdahl’s Law and Dennard scaling start to bend and break,” he said.

Urquhart also noted that some of the big improvements in computational performance stem from the ASIC revolution in the ’90s.

“Then, in the early 2000s, more general-purpose computational units took over because they were able to do the heavy lifting at that time, enabled by EDA tools including synthesis,” he said. “With the transition to SoCs in the last decade, and some of the other interesting things like creating chiplets and putting systems in packages, one of the key enablers — particularly in the SoC — has been the availability of processor IP. But we’re seeing the limitations of that. Even Arm has a tremendously wide product family, from application processors down to embedded processors. If you’re going to get more performance, you’re going to have to have further specialization. That means you’re going to have to have a much wider community of people involved in designing, or more likely fine-tuning or customizing, processor cores. That becomes an EDA problem. There are great opportunities for processor design automation, and with that design automation we’re going to have to enable a wider community of people to design or modify processors. In the past, it has either been people in microprocessor companies like Intel and AMD, or in the processor IP companies, like Arm, Synopsys, and Cadence, but we’re going to have to open it up to a wider community.”

Conclusion
As chip makers shift from monolithic to multi-die solutions, new and fundamental challenges arise that will require innovative solutions. “Semiconductor vendors will be contending with OCV issues and closing timing on reticle-sized die,” said Movellus’ Karvat. “There will be binning and power challenges at package level, and we will need to figure out how to make multi-die solutions behave like monolithic solutions from performance, verification, and reliability standpoints. EDA plays a pivotal role in this.”

But that requires a substantive shift in semiconductor design. IBM’s Jacobi contends the semiconductor ecosystem hasn’t fully come to grips with what the end of Dennard scaling really will mean. “It’s going to drive innovation, and it’s going to drive other changes. Architects are going to contribute more by figuring out how things should work in this world, where we’re not able to take advantage of the value generated over the last 20 or 30 years that came from Moore’s Law and Dennard scaling. That trend is shifting, and the architectural profession is getting even more important than it used to be.”

Related
Designing Chips In A ‘Lawless’ Industry
The guideposts for designing chips are disappearing or becoming less relevant. While engineers today have many more options for customizing a design, they have little direction about what works best for specific applications or what the return on investment will be for those efforts.
Foundational Changes In Chip Architectures
New memory approaches and challenges in scaling CMOS point to radical changes — and potentially huge improvements — in semiconductor designs.



12 comments

Robert Anderson says:

There is a need for EDA and design facilities for legacy process nodes. The two or three large EDA companies bought out the competition, killed the products, and raised prices, killing off our ability to compete with Asia in 0.13- to 1-micron processes.

Hong Xiao says:

Fundamental laws, like Ohm’s Law and Newton’s Laws, will never run out of steam. Moore’s “law” was, and is, not a fundamental law. It was an observation, a prediction, and an amazing vision that lasted decades longer than many people, including experts, expected.

Bowie Poag says:

Faster compute is great, but what good is any of it if the code it’s executing grows more and more inefficient at a faster rate than the hardware improves?

We have to face the facts, eventually. Modern software design literally encourages bloat.

The solution isn’t to dump more and more effort into faster and faster hardware. The solution is to weed out the people, systems, languages, and approaches to software design that foster, encourage, and breed inefficient code.

Douglas says:

My 1st PC was an IBM PCjr: an Intel 8088 CPU @ 4.77 MHz with 64 KB of base memory. It ran WordPerfect or MS Word from two 5.25-inch disks, with a dictionary on the second 5.25-inch drive. It does basically the same job as my present PC for writing.

CPlusPlus4Ever says:

It is true that the SW industry has some serious and fundamental issues. Why do we need a supercomputer to display a simple patient record? It has the same data as 20 years ago.

Matthew says:

Inefficient software is what came to mind while reading this. Yes, some stuff needs a lot of processing power, but for others you have to wonder why a simple program needs 1 GB of disk space and 4 GB of RAM.

Time to find a terminator and rip out its CPU.

Douglas MacIntyre says:

“That’s really important because we have to make the transistor smaller for two reasons,” said Mueth. “One is for speed, the other is for thermal. As you’re clocking a bazillion transistors on a chip, you’re generating a lot of heat. The way around that is to shrink everything down, but at some point we’re going to reach an asymptotic peak.”

Asymptotes are limits, not “peaks”.

IBM’s solution for its Z Systems is to integrate AI into the processor chip. “We can access the data where it already lives,” Jacobi said. “If the data is in the processor chip, and in the caches of the processor chip — because that’s where whatever other business process was computing on that data, like a banking transaction, or a credit card transaction — I don’t need to take that data and move it somewhere else, to a different device or across a network…”

This guy doesn’t understand what cache is. It’s processor-local memory of limited size to hold frequently used data. AI typically needs huge data sets that are not going to fit into a cache– unless you want to redefine the word “cache”.

Matthew Slyman says:

You missed Professor Roger Needham’s article on the CMOS Endpoint

madmax2069 says:

With each jump in processing power, software becomes more inefficient, and devs rely more on brute-force performance to achieve the same thing as efficient software running on much older and slower hardware.

Modern software has become a bloated mess of subscription services and inefficient code.

Tanj Bennett says:

It is data that fills your CPU memory, not code. And it does that because it is cheap to do so, and because users voted overwhelmingly for inefficient, bloated data systems like HTML. If you look at your task list you will likely see memory dominated by the browser and a bunch of Electron (directly bound browser code) based apps.

Why did they do that? Because it is cheap and versatile. Customers can run them on a $100 tablet. Sure, it contains 4 GB of memory – $15 added to the retail price. The 4-core 2 GHz CPU includes a GPU, neural network, signal processor, and a bunch of other odds and ends on one chip for $10. So, they don’t care. As a result, the developers use easy and productive tools.

A 40-year-old PCjr was usable for maybe a million people, who were generally young nerds. The modern tablet is usable by 7 billion people, most of whom care not at all about how it works; they just want it cheap, ubiquitous, and versatile.

The way forward, like it or not, is more of the same. The bloat in AI is astounding. But if it works and costs are mastered, that will be it.

Tanj Bennett says:

The cache in the latest IBM Z machines is coherently shared with the AI co-processor, and is even accessible over links to other CPUs. It is worth reading about their architectural choices, which are quite interesting.

Mueth is subtly wrong about the shrinking for another reason. When Dennard scaling worked, it was true that you had constant power density per cm² as devices got smaller. That was pretty much the whole point of Dennard. Since we reached the endpoint of Dennard in logic nearly 20 years ago, smaller devices have meant denser power, and heat per cm² has steadily increased to around 100 W/cm² in the latest machines.

Smaller logic is now useful for improving perf, and it does slightly improve power per unit of work, mostly due to shortening distances, although that requires a host of supporting improvements in wiring methods, materials, and routing, not just smaller transistors. It can also make the overall system cheaper by fitting more function into fewer packages.
