Foundational Changes In Chip Architectures

New memory approaches and challenges in scaling CMOS point to radical changes — and potentially huge improvements — in semiconductor designs.


We take many things in the semiconductor world for granted, but what if some of the decisions made decades ago are no longer viable or optimal? We saw a small example of this with finFETs, when the planar transistor would no longer scale. Today we are facing several bigger disruptions that will have much larger ripple effects.

Technology often progresses in a linear fashion. Each step provides an incremental improvement over what existed before, or overcomes some new challenge, such as a new node, a new physical effect, or a new limitation. While this works very well, and many of the individual steps are brilliant, it builds a house of cards. If something at the base were to fundamentally change, the ripple effects throughout design, implementation, and verification could be very significant.

Single contiguous memory
One of these changes has been in the works for some time. The von Neumann processor architecture, first described in 1945, with its single contiguous memory space, was an absolute breakthrough. It provided a Turing-complete solution on which any finite problem could be solved. This became the de facto architecture for almost all computers.

Memory quickly became a limitation, both in size and performance. To overcome that, caches were introduced to make cheap, bulk memory appear to behave like much more expensive, faster memory. Over time, those caches became multi-level, coherent across multiple masters, and capable of covering increasingly large address spaces.
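The property a cache exploits is locality of reference. As a rough illustration (a toy model, not any particular processor's cache, with all sizes chosen arbitrarily), a small direct-mapped cache simulation shows why sequential access appears fast while scattered access over a large space does not:

```python
import random

def hit_rate(addresses, num_lines=64, line_size=16):
    # Toy direct-mapped cache: each line remembers which memory block it holds.
    lines = {}
    hits = 0
    for addr in addresses:
        block = addr // line_size   # which memory block this byte belongs to
        index = block % num_lines   # which cache line that block maps to
        if lines.get(index) == block:
            hits += 1               # block already cached: fast path
        else:
            lines[index] = block    # miss: fetch the block from slow memory
    return hits / len(addresses)

sequential = list(range(10_000))
scattered = [random.randrange(1_000_000) for _ in range(10_000)]

print(hit_rate(sequential))  # 0.9375: one miss per 16-byte line
print(hit_rate(scattered))   # near zero: almost every access misses
```

The same cheap DRAM backs both runs; only the access pattern changes the apparent speed, which is the behavior the cache hierarchy was built to exploit.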

But that is no longer a requirement for many modern computation functions. In an age of object-based software and domain-specific computing, that memory organization can be detrimental. It is based on the premise that a program can randomly access absolutely anything it wants, something security experts wish were not true.

The full cost of cache and coherency has to be considered. “Coherency is complicated and expensive to implement in silicon,” says Simon Davidmann, founder and CEO for Imperas Software. “When you go to multi-levels of caching, the memory hierarchy becomes more and more complex, and increasingly full of bugs, and consumes increasing amounts of power.”

When the tasks are well understood, this overhead can be avoided. “In a data flow engine, coherence is less important, because you’re shipping the data that is moving on the edge, directly from one accelerator to the other,” says Michael Frank, fellow and system architect at Arteris IP. “If you partition the data set, coherence gets in the way, because it costs you extra cycles. You have to use look-up tables. You have to provide the update information.”

The adoption of object-oriented systems, along with strongly typed languages that restrict type conversion and a few restrictions placed on programmers, could make execution flow predictable and avoid the need for a single, contiguous memory space. Tasks, such as those found in graphics and machine learning, operate on bounded memory blocks and do not benefit from complex memory management or hardware control over memory.

Domain-specific computation is causing many aspects of this to be rethought. “As an example, DSPs tend to provide a pool of distributed memories, often directly managed in software,” says Matt Horsnell, senior principal research engineer for Arm’s Research and Development group. “This can be a better fit for the bandwidth requirements and access patterns of specialized applications than traditional shared-memory systems. These processors often offer some form of memory specialization by providing direct support for specific access patterns (e.g., N-buffering, FIFOs, line buffers, compression, etc.)”

New memory types
There are very big implications associated with changing the memory architecture. “The challenge is that in the past, people had a nice abstract model for thinking about computing systems,” says Steven Woo, fellow and distinguished inventor at Rambus. “They never really had to think about memory. It came along for free, and the programming model just made it such that when you did references to memory, it just happened. You never had to be explicit about what you were doing. What started to happen, with Moore’s Law slowing and power scaling stopped, is people started to realize there are lots of new kinds of memories that could enter the equation. But to make them really useful, you have to get rid of the very abstract view that we used to have.”

The second, and related, change is coming through new memory technologies. SRAM and DRAM have been optimized for speed, density, and performance over a long period. But DRAM scaling has stalled, and SRAM suffers from variability at the newest nodes, making it difficult to maintain density. New memory types, based on different physics, could end up being better, although raw performance is possibly not their primary benefit.

For example, if ReRAM is adopted, memory cells inherently become analog, and this opens up a number of possibilities. “One of the foundational ideas of analog is you can actually compute in the memory cell itself,” says Tim Vehling, senior vice president for product and business development at Mythic. “You actually eliminate that whole memory movement issue, and therefore power comes down substantially. You have efficient computation and little data movement when analog comes into play. With analog compute-in-memory technology, it’s actually orders of magnitude more power efficient than the digital equivalent.”

This aligns perfectly with the multiply/accumulate functions required by machine learning. “The amount of energy consumed by performing these MAC operations is humongous,” says Sumit Vishwakarma, product manager at Siemens EDA. “Neural networks have weights, and these weights sit in memory. They have to keep accessing the memory, which is a very energy-consuming task. The power of compute is one-tenth the power needed to transfer the data. To solve this problem, companies and universities are looking into analog computing, which stores the weights in memory. Now I just have to feed in some input and get an output, which is basically a multiplication of those weights with my inputs.”
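A toy numeric model can make the compute-in-memory idea concrete. In an analog array the weights are the conductances of the cells themselves, so the entire matrix of MAC operations happens where the weights sit; the sketch below (all array sizes and the noise scale are illustrative assumptions, not figures from any vendor) models that digitally, with a small noise term standing in for analog imprecision:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# In an analog compute-in-memory array, the weights are the conductances of
# the memory cells themselves; they are never moved to a separate ALU.
weights = rng.standard_normal((4, 8))   # 4 outputs, 8 inputs (toy sizes)
inputs = rng.standard_normal(8)         # applied as voltages on the rows

# Every multiply-accumulate happens in place: conceptually, Ohm's law does
# each multiply and Kirchhoff's current law does each accumulate, so one
# "read" of the array yields the whole matrix-vector product.
ideal = weights @ inputs

# Analog cells are imprecise, so a small noise term stands in for device
# variation; a digital design would instead pay to move every weight.
analog_output = ideal + rng.normal(scale=0.01, size=ideal.shape)
```

The trade is visible even in the toy: the digital version touches all 32 weights per inference, while the analog array only moves the 8 inputs in and the 4 (slightly noisy) outputs out.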

When analog and digital are decoupled, the analog circuits are no longer hamstrung. “We can design analog circuits that provide equivalent, or even better, functions in some cases than digital, and we can do it in older nodes,” says Tim Vang, vice president of marketing and applications for Semtech’s Signal Integrity Solutions Group. “The cost can be lower because we don’t need all the digital functions, so the die sizes can be smaller. We can reduce power because we don’t have as much functionality.”

When memory changes, everything in the software stack is affected. “What usually happens is there is an algorithm in place, and we see a way to optimize it, optimize the memory, so that the algorithm is much better implemented,” says Prasad Saggurti, director of product marketing at Synopsys. “On the flip side, we have these different types of memory. Can you change your algorithm to make use of these new kinds of memories? In the past, using TCAMs was mostly a networking domain construct to look up IP addresses. More recently, ML training engines are starting to use TCAMs. This needs software or firmware to change, based on the types of memories available.”
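A TCAM is a good example of hardware whose behavior the software must be written around: it matches a key against every stored pattern at once, with don't-care bits, and returns the highest-priority hit. The sketch below is a serial toy model of that behavior (the route table and port names are invented for illustration):

```python
def tcam_lookup(key, entries):
    # A TCAM compares the key against every stored pattern simultaneously and
    # returns the highest-priority match; this toy model walks them serially.
    # 'X' marks a don't-care bit in a pattern.
    for pattern, result in entries:
        if all(p == 'X' or p == k for p, k in zip(pattern, key)):
            return result
    return None

# Illustrative longest-prefix-match route table, as used for IP lookup:
# longer (more specific) prefixes are stored at higher priority.
routes = [
    ("110101XX", "port A"),
    ("1101XXXX", "port B"),
    ("XXXXXXXX", "default"),
]

print(tcam_lookup("11010110", routes))  # port A
print(tcam_lookup("11011001", routes))  # port B
print(tcam_lookup("00000000", routes))  # default
</```

An ML training engine reusing the same hardware would store different patterns and interpret the match differently, which is exactly the kind of firmware-level change Saggurti describes.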

The end for CMOS
But by far the biggest potential change on the horizon is the end of CMOS. As devices get smaller, controlling doping becomes challenging, and this causes significant variability in the threshold voltage of devices. Doping defines the polarity of a device, making it either PMOS or NMOS, and it is the pairing of these two types that creates the CMOS structures on which all digital functions are built. As the industry migrates toward gate-all-around transistor structures, a new possibility emerges.

“With horizontally stacked nanowires, you can actually build transistors with two gates,” said Giovanni De Micheli, professor of electrical engineering and computer science at École Polytechnique Fédérale de Lausanne, in a DAC 2022 keynote. “You use a second gate to polarize the transistor and make the transistor either a P or an N transistor (see figure 1). You get a transistor that is more powerful because it creates a comparator rather than a switch. Now, with these types of devices, you can have completely new topologies.”

Fig 1. 3-D Conceptual view of a GAA polarity gate. Source: Michele De Marchi thesis, EPFL, 2015


This can, in theory, be taken even further by splitting the polarity gate into two. Each transistor could then also become a high- or low-threshold-voltage device, in addition to being either p-type or n-type, so each transistor could take on different power/performance characteristics during operation.

Let’s go back to the logic abstraction. “We have designed digital circuits with NANDs and NORs for decades,” says De Micheli. “Why? Because we were brainwashed when we started, because in CMOS that is the most convenient implementation. But if you think in terms of majority logic (see figure 2), you realize this is the key operator to do addition and multiplication. Today, all the circuits we are implementing for machine learning, the bread and butter in there is to do additions or multiplications. And that’s why majority is extremely important. Besides, majority logic is a natural model for many technologies like superconductors, optical technology, non-volatile logic in memory, and so on.”
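Why majority is "the key operator to do addition" can be seen in a full adder: the carry-out bit is literally a three-input majority vote, and the sum bit has a known majority-inverter decomposition. A minimal sketch (the particular sum decomposition is one standard form from the majority-inverter-graph literature, verified exhaustively below):

```python
def maj(a, b, c):
    # Three-input majority: 1 iff at least two of the inputs are 1
    return (a & b) | (b & c) | (a & c)

def full_adder_mig(a, b, cin):
    # Carry-out of binary addition is exactly a majority vote.
    cout = maj(a, b, cin)
    # One known majority-plus-inversion decomposition of the sum bit.
    s = maj(maj(a, b, 1 - cin), 1 - cout, cin)
    return s, cout

# Exhaustive check against ordinary binary addition: for every input
# combination, the (cout, s) pair equals a + b + cin in two bits.
for a in (0, 1):
    for b in (0, 1):
        for cin in (0, 1):
            s, cout = full_adder_mig(a, b, cin)
            assert (cout << 1) | s == a + b + cin
```

Only majority gates and inverters appear, with no NAND or NOR, which is why technologies that natively offer a majority primitive map so directly onto the adders and multipliers at the heart of machine-learning circuits.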

Fig 2. New logic elements based on polarity gate devices. Source: De Micheli/EPFL


De Micheli’s research indicates that circuits designed with majority logic would reduce delays by 15% to 20%, using only slightly modified versions of today’s EDA tools.

But these types of changes do require significant rethinking of synthesis and other steps. “If this turns out to be a promising vector, you really need to rethink the synthesis engine completely,” says Rob Aitken, technology strategist at Synopsys. “Instead of effectively taking NAND/NOR circuits and building things from that, many of the new devices would be natively tuned to be XORs, or majority gates, or some other logic function. What would happen? Synthesis cares about the fundamental thing that you’re building, and while it is an oversimplification, logic synthesis takes a PLA and then collapses it into a multi-level object. Rethinking in a different logic style matters.”

Changing the fundamental transistor functionality has a major impact on many aspects of the flow. For example, with devices now having four or five terminals, rather than three, what impact will that have on place and route? How will it impact fan-in and fan-out and congestion?

Change is difficult. A promising technology has to overcome decades of optimization of the incumbent technology, and that creates huge inertial challenges. It also may require many parts of a solution to change at the same time, such as hardware and software, or tools throughout an implementation chain. But as the industry approaches some of the fundamental physical limits of semiconductors, it needs to become more flexible and willing to change.

Related Reading
Improving Memory Efficiency And Performance
CXL and OMI will facilitate memory sharing and pooling, but how well and where they work best remains debatable.
CXL And OMI: Competing Or Complementary?
It depends on whom you ask, but there are advantages to both.
IC Architectures Shift As OEMs Narrow Their Focus
As chip companies customize designs, the number of possible pitfalls is growing. Tighter partnerships and acquisitions may help.


Steve says:

I found this to be a fascinating article and very enlightening! I work in semiconductors, and this would really be a game changer…and for the better!

RigTig says:

Very timely article for me. I’m a software person exploring a RISC-V board and it occurred to me that register access is really fast, but memory access is way slower. So, I had the idea that the registers need to be objects that are directly manipulated by an object-sensitive processor (with a small instruction set ala RISC cf CISC). It’s out of my league to implement that in silicon, though I’ll use the idea in some software and look at how it might work. And then I read about how to change how memory can be implemented in very different ways. Ooohh! That has really got me thinking.

Schrodinger's Cat's Advocate says:

What happens when you get a long glitch or noise on PG? What if PG oscillates or is metastable?

Jung Yoon says:

Thank you for this insightful article, connecting compute architecture to Si scaling to software and security. It is a very interesting time to be in the field of advanced compute/storage driven by semiconductor innovation, and this article connects the key driving forces and challenges in an elegant way.

Chris @ crossPORt says:

Change is hard. And it is hardest when it comes from an unexpected direction. That said, can you imagine if all this work around addressing fundamental issues like heat, latency, bandwidth, asynchronous compute and data exchange/transmission could be solved overnight? Most won’t even try to imagine it’s possible. Even when it has already happened and is demonstrable. But change is hard.
