Power Impact At The Physical Layer Causes Downstream Effects

PHYs have a growing impact on performance and power in both planar and multi-die designs.


Data movement is rapidly emerging as one of the top design challenges, and it is being complicated by new chip architectures and physical effects caused by increasing density at advanced nodes and in multi-chip systems.

Until the introduction of the latest revs of high-bandwidth memory, as well as GDDR6, memory was considered the next big bottleneck. But other compute bottlenecks have been exposed since then, particularly as more heterogeneous approaches are used to compensate for a slowdown in Moore’s Law and the end of Dennard scaling. While those heterogeneous approaches are effective at boosting the amount of compute horsepower in a device, nothing comes for free.

“There’s a lot of heterogeneous computing with CPUs, GPUs, FPGAs as accelerators, custom ASICs, along with a new class of DPUs (data processing units),” said Suresh Andani, senior director, IP cores at Rambus. “Because of this heterogeneous computing/heterogeneous architecture, the data is being moved around quite a bit. In the data center, the data comes in from the switches into the servers, and gets to the CPU somehow. There’s GPUs, FPGAs and ASICs hanging off it. If those interfaces are not fast enough, they are not serving their purpose. Interfaces such as PCI Express are moving very fast from generation 4.0 to 5.0, and 6.0 is already at revision 0.5. With Ethernet, it’s the same thing. It’s getting faster and faster, and there are some latency requirements also that are coming through.”

Latency is key, and it has an impact on overall device performance as well as bandwidth requirements. “If you have to use PCI Express, first you’ve got to wrap the packet and use the PCIe protocol,” said Steven Woo, fellow and distinguished inventor at Rambus. “You’ve got to unwrap it on the other side, do something with it, eventually get your data, wrap it all back up and send it. That starts to become a lot of overhead, because now you’re sending all this extra wrapping in addition to the data.”
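The wrapping overhead Woo describes can be sketched in a few lines. The numbers below are purely illustrative (the real per-packet overhead depends on the PCIe generation, TLP header size, and flow-control settings), but they show why small payloads pay disproportionately for the same wrapper:

```python
# Illustrative sketch (hypothetical numbers): how per-packet framing
# overhead erodes the effective bandwidth of a packetized link like PCIe.

def effective_bandwidth(raw_gbps: float, payload_bytes: int, overhead_bytes: int) -> float:
    """Bandwidth left for payload after per-packet framing overhead."""
    efficiency = payload_bytes / (payload_bytes + overhead_bytes)
    return raw_gbps * efficiency

# Small payloads pay proportionally more for the same wrapper.
for payload in (64, 128, 512):
    bw = effective_bandwidth(64.0, payload, 24)  # assume 64 Gb/s raw, 24 B overhead
    print(f"{payload:4d} B payload -> {bw:.1f} Gb/s effective")
```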

This is why there is so much buzz around interconnects such as CCIX and CXL. “Part of why CXL offers a lot of promise is because it gives the memory-level semantics, which keep the latency low and allows you to get the most effective bandwidth out there. It allows you to make the most of that link, and what we’re seeing with interfaces and links and such, in the case of some memory interfaces and things like that on the high-end processors, is a third to half of the power budget is really around the I/Os on that chip. That said, data movement is a real limiter here. If you could get rid of that, or even some fraction of it, you can make a huge dent and spend that more on the compute engines,” he pointed out.

But latency needs to be addressed in the context of changing workloads within the box. There are many new types of heterogeneous workloads, including everything from genomics to AI training and inferencing applications like voice and video recognition.

“These applications require a lot of parallel processing, and for that reason these accelerators and other processors are coming to help them in CPUs, like Intel, AMD or IBM CPUs,” said Manmeet Walia, senior product manager for high speed SerDes IP at Synopsys. “All of those are connected through PCIe until you get to the smart NIC card, where Ethernet takes over. Within a rack, the Ethernet needs to travel to the top-of-rack switch.”

This makes design tradeoffs trickier, because PHY analog/mixed-signal IP typically does not scale with the move to smaller process geometries. So at every process node, architectural enhancements must be implemented to achieve better power and area and keep it cost-effective.

“Mixed-signal/analog PHYs are getting more difficult as we are scaling down the process technologies,” Walia said. “Going from 7 to 5nm, the layout rules are a lot tighter, and the parasitics are higher in some cases, so it’s getting more challenging to implement these in the smaller geometries. For that reason, one of the biggest trends is die-to-die PHYs, which essentially means the central chip will continue to scale down. For example, let’s say your processor chip has the digital logic. It will continue to scale down and will be connected with very simple, minimalistic, portable die-to-die PHYs. There’s nothing fancy in them. It’s just something that can take your signal to the other chiplet/companion chip. The companion chip market is where the I/Os will be. These I/O chips can stay in a larger process geometry, which has already been silicon proven. That will offer a lower risk, lower cost and better time to market.”

This may sound logical enough, but it represents another evolution in chip architectures.

“In the past there were two types of design philosophies,” said Andani. “Either you were going after the consumer/client market, which is very power-sensitive — every milliwatt and microwatt mattered. You built your interfaces very customized to that kind of market. Then there was the enterprise/higher-end market, where performance was the key criterion. Tradeoffs were an ‘or.’ Is it a power-optimized design, or is it a performance-optimized design? Now, as we go forward, even in the enterprise space, it’s not an ‘or’ anymore. It’s an ‘and.’ Not only do you have to meet the performance requirements, it’s performance and power.”

Power is particularly complex because it needs to be considered in context. “One of the interesting aspects of power is that it’s both a local and global phenomenon,” said Marc Swinnen, director of product marketing for the Semiconductor Division of Ansys. “Even if you make every block iron-clad and you test it extensively so there’s not going to be any dynamic voltage drop on any particular cell, when you put all of these IP blocks together — and each core can have its own power supply — you still get resonance effects between these blocks. These blocks can be switching at a particular tempo, and if you hit a particular resonance spot then the voltage will start resonating up and down, and you have a voltage problem that any particular block will not have seen. It’s a problem in the mesh. But you can’t detect that unless you do a total system analysis.”
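The resonance Swinnen describes can be approximated to first order as an LC tank formed by the package inductance and on-die decoupling capacitance. The component values below are hypothetical, but they land in the tens-of-MHz range where on-chip switching activity can realistically excite the mesh:

```python
import math

# First-order sketch (hypothetical values): package inductance and on-die
# decoupling capacitance form an LC tank. Blocks switching near its
# resonant frequency amplify supply noise instead of damping it.

def pdn_resonance_hz(l_henry: float, c_farad: float) -> float:
    """Resonant frequency of a simple LC power-delivery loop."""
    return 1.0 / (2.0 * math.pi * math.sqrt(l_henry * c_farad))

f = pdn_resonance_hz(1e-9, 100e-9)  # assume 1 nH package L, 100 nF on-die decap
print(f"resonance is roughly {f / 1e6:.1f} MHz")
```

A full chip-package-board analysis replaces this single LC pair with a distributed model, which is why the total-system analysis Swinnen mentions is needed to catch the real resonance spots.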

That complicates how signals are routed and where different blocks are placed. “From the analysis side, that’s relatively easy,” said Swinnen. “The problem is the accuracy, because some of the blocks haven’t been made yet. It’s a rough black-box model for the block, and these resonances are hard to predict up front. There are two levels of power analysis, static and dynamic. Static analysis can be done at the floor-planning stage. What is the general power draw? You can build that into the original power mesh. The dynamic analysis, which is where the switching comes into account, only happens after you have placement. Then the activity can be applied.”

All of that needs to be considered in the context of local routing and floorplanning and global power budgets. “If you look at all these accelerator cards, which are 75 watts, if your PCIe link itself along with some DDR and all that is going to take up 40 to 45 watts, you have barely 30 watts left to do any compute on that card. You need to make sure your interfaces are burning as low power as possible. This is forcing us to think about both power and performance when we design our next-gen interfaces, whether it’s PCIe, Ethernet, chiplets, XSR, HBI, HBM, whatever it is. If you want more performance, logically it means you have to burn more power in order to get to that performance, but that’s where IP vendors can differentiate themselves, because every spec out there says this is the performance you need to meet. There are very few, if any, specs out there that say, ‘You must meet this power for this interface,'” Andani said.
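The arithmetic in Andani's example is simple but worth making explicit, since it is the constraint driving the power/performance tradeoff he describes:

```python
# Back-of-the-envelope from the numbers in the quote: a 75 W accelerator
# card where the PCIe link plus DDR consumes 40-45 W leaves only
# 30-35 W for actual compute.

def compute_budget_w(card_limit_w: float, interface_w: float) -> float:
    """Watts left for compute after interface/memory power is spent."""
    return card_limit_w - interface_w

for io_power in (40.0, 45.0):
    left = compute_budget_w(75.0, io_power)
    print(f"I/O at {io_power:.0f} W -> {left:.0f} W left for compute")
```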

Further, Woo noted that to make a design flexible, you’d like to have some type of protocol that allows you to arrange things. “However, if that protocol is too heavy, then you start impacting the bandwidth and the latency,” he said. “These two forces are tugging in opposite directions. The question will be finding the right balance, and it wouldn’t be too much of a surprise if there were a couple of different ways that people thought about doing it because the use cases can be very different between some of the big data center type people.”

More data, more computing, more complications
The amount of digital data in the world continues to grow very rapidly, regardless of industry sector. Use case by use case, problem sets are becoming larger and more complicated.

“The number one thing is to continue as best we can on that historic curve where every new standard — which is about every five to six years — doubles the per-pin data rate, while also trying to double and quadruple the capacity over what we’ve had before,” said Woo. “That seems to have solved the historic need. No one would complain if we could go faster than that, but it’s a real challenge with the various kind of process technologies that are out there, and how quickly these can be advanced.”
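The historic curve Woo refers to compounds quickly. A minimal sketch, with an assumed starting rate chosen purely for illustration:

```python
# Sketch of the historic curve described above: per-pin data rate doubling
# with each memory-standard generation, roughly every five to six years.
# The 3.2 Gb/s starting point is a hypothetical example, not a spec value.

def per_pin_rate_gbps(base_gbps: float, years: float,
                      years_per_doubling: float = 5.5) -> float:
    """Projected per-pin data rate after `years` on a doubling curve."""
    return base_gbps * 2.0 ** (years / years_per_doubling)

print(per_pin_rate_gbps(3.2, 11.0))  # two doublings out -> 12.8
```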

It’s more difficult to maintain a similar power envelope and still double the data rate and double the capacity. “As an industry, we try and shoot for that target,” he said. “The questions that have started coming up now are whether we are at a point where maybe DRAM process technology isn’t advancing as much as it has in the past. It’s subject to the same things as Moore’s Law and Dennard scaling. What are we going to do to stay on that curve? We’ve seen the emergence of things like HBM memory, which is stacked, but it’s not really the best solution for things like main memory, so you start to wonder if other types of memory will move in that direction. Things will evolve along the way as we try to hit those targets of capacity and bandwidth. Those will be the most important things to hit first, under the same power envelope or better.”

This means there are PHY-related considerations that reach beyond the SoC architecture.

And even though Moore’s Law has slowed, the march to the next node continues — at least for now. “When multi-patterning first started to become reality, everybody expected the big IDMs and a couple of the big leaders to jump to a multi-patterning node, but the rest of the industry would hang out at 20 or 28 forever,” said Michael White, Calibre product marketing director at Mentor, a Siemens Business. “The opposite has come to pass. In fact, everybody continues to march down the Moore’s Law curve. Even for relatively small IP providers, to be competitive and to have the offerings the industry is looking for, they’re going to 7nm, 5nm, and maybe they’re talking about 3nm as part of the roadmap.”

Developing and validating IP gets more difficult at each new node. “There are another 25% to 30% more DRC checks that a fabless or IP provider has to go through for every new node, and that progression has just continued node over node,” said White. “We’ve seen a little bit less growth from 7nm to 5nm, because in that jump EUV was first brought in for some layers. EUV made the DRC checks a little simpler for those layers in that they can be single patterned instead of double, triple or quadruple patterned. But EUV is expensive and is capacity limited, so the foundries are using it only on the layers where they absolutely have to in order to get reasonable lithography quality, reasonable CD uniformity, and so on. So multi-patterning has not gone away. Folks were thinking years ago that we’d get EUV and physical verification would get simple. Nope. Even though it has slowed the rate of increase node over node, between 7nm and 5nm, at 5nm the lithography k1 factor has marched down enough that we’re actually starting to use double patterning with EUV. We got a holiday for a node but we’re back to the same march that we’ve been on for 30 years as far as rate of increase and increasing complexity node over node.”
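Compounding that 25% to 30% per-node growth shows how quickly verification effort snowballs. The starting count below is hypothetical, only the growth rate comes from the quote:

```python
# Compounding the 25-30% per-node growth in DRC check counts that White
# describes. The 5,000-check starting point is an illustrative assumption.

def drc_checks(base_checks: int, nodes: int, growth: float = 0.27) -> int:
    """Projected DRC check count after `nodes` node transitions."""
    return round(base_checks * (1.0 + growth) ** nodes)

for n in range(5):
    print(f"after {n} node transitions: ~{drc_checks(5000, n):,} checks")
```

Four node transitions at 27% growth more than two-and-a-half-fold the check count, which matches the steady node-over-node progression White describes.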

What this means for memory and system interfaces is that they need to be coloring-aware and multi-patterning-aware, considerations that did not exist at 28nm.

“Given that different foundries had slightly different approaches as you got to 16nm and 14nm, which were that first inroad into multi-patterning with some foundry ecosystems, you still didn’t need to think about it too much with certain foundries, while with others you did. Certainly, anybody at 7nm these days needs to be aware of that. You need to be aware that there are preferred masks for critical nets that will impact your performance. Say you are designing some super-high-speed SerDes at a multi-patterning node. You may want to make some design choices about the mask layer — is it on mask one or mask two? — to make sure you get the performance that you want. Similarly for analog designers, if you’re building analog at any of these nodes, it’s something to pay attention to now.”

With the era of 3D coming on strongly, there is an impact to interface choice, White said. “Whether you stay planar SoC, or you make choices to start using one form or another of stacking, it’s a deep, complex discussion. The foundries and the industry are still trying to figure out in what situations does it make business sense for me to start stacking in one form or another.”

The chip industry is still doing coarse segmentation of designs where an SoC is built with high-bandwidth memory sitting next to it on a silicon interposer. That is likely to change.

“That’s as coarse as you can get,” White said. “You’ve got the entire SoC, you’ve got your memory, and that kind of thing is now becoming commonplace for large, high-performance servers or GPUs or microprocessors because it allows these companies to focus on their area of expertise, buy the HBM from a memory manufacturer, and they know they’re going to get much better performance out of it because of that intimate connection between the memory and the SoC. It’s becoming a pretty obvious choice if you’re that class of design house. Similarly, if I am a microcontroller expert and I want to use RF technology from some other supplier — and maybe I’m not an RF guru — we see those kinds of choices being made more and more, where segmentation of IP is being done that way versus buying the RF IP and integrating it into my SoC. That choice is especially being made if the microcontroller or the SoC really needs to be on an advanced node from a performance or feature perspective. But if you can get away with an older technology node for the peripheral IP, you want to do that because then you’re minimizing the overall cost of this integrated package.”

Choosing and abusing the PHY
All of this circles back to how all of these pieces are connected over the physical layer.

“For any PHY, it comes down to performance, power and area,” said Synopsys’ Walia. “Those are still the three big metrics. Beyond that, we also focus on latency, because for many applications that use co-processors, much is now moving towards direct memory access, cache coherency, and so on. What becomes really important is that you are operating at the lowest possible latency. Imagine your two dies talking to each other and they’re cut in the middle, with the CPUs sitting on either side. They need to know what is happening in the memory that is residing in the other die. For that reason, latency is critical. It’s almost as critical as PPA.”

Beyond that, data centers are harsh environments because of temperature and power supply variation, as well as cyclical cooling and heating. When these IPs are deployed there, they have to continue functioning regardless of voltage and temperature changes.

“The process is fine, but they have to be very robust against voltage and temperature corners,” Walia said. “What that means is that we have to go above and beyond what the spec calls for. We have to cycle these temperatures very quickly from -40°C to 125°C, and back down, from hot to cold, cold to hot, and we have to do it very quickly. We want to make sure the PHY can self-calibrate and self-adapt to the changing environment, so we are also moving a lot of this functionality into the digital domain. A lot of the brains of the PHY, all the protocol-agnostic calibrations and adaptations, all the startup and boot-up routines, and everything that’s happening, is digitally managed and assigned to a microprocessor. That’s critical for some of these applications.”

Finally, in the battery-powered space, whether it is IoT devices or augmented and virtual reality in personal devices, it’s all about low power. And that means not only low active power, but low standby power.

“Think about your 5G phone,” he said. “It’s only going to need PCIe 4.0 type of bandwidth at the very peak of its functionality. Most of the time it’s using lower speeds and lower functionality. For that reason your active power needs to be very carefully managed, not only for the highest data rate but also for the lower data rates. On top of that, in standby modes, the power really needs to come down, so we implement some features like power gates and power islands so all the circuitry can be shut off to cut down on the leakage to close to zero.”
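A quick time-weighted average shows why the standby leakage Walia mentions dominates for a device that is mostly idle. All of the power numbers below are made up for illustration:

```python
# Rough sketch (hypothetical numbers): when a phone PHY is active only a
# small fraction of the time, ungated standby leakage dominates the
# battery budget, which is why power gates and power islands matter.

def average_power_mw(active_mw: float, standby_mw: float,
                     duty_active: float) -> float:
    """Time-weighted average power for a given active duty cycle."""
    return active_mw * duty_active + standby_mw * (1.0 - duty_active)

# Assume a PHY that is active 2% of the time at 500 mW.
print(average_power_mw(500.0, 5.0, 0.02))   # ungated: 5 mW standby leakage
print(average_power_mw(500.0, 0.05, 0.02))  # power-gated: leakage near zero
```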

—Ed Sperling contributed to this report.
