Slower Metal Bogs Down SoC Performance

Interconnects are becoming the limiter at advanced nodes.


Metal interconnect delays are rising, offsetting some of the gains from faster transistors at each successive process node.

Older architectures were born in a time when compute time was the limiter. But with interconnects increasingly viewed as the limiter on advanced nodes, there’s an opportunity to rethink how we build systems-on-chips (SoCs).

“Interconnect delay is a fundamental tradeoff point for any computer architecture,” said Steve Williams, senior product marketing manager at Cadence. “Processor architecture has always been informed by interconnect delays.”

Past architectural nods to interconnect delay, however, focused largely on moving data between chips. Increasingly, data may take a non-trivial amount of time to go to where it’s needed even while remaining on-chip. This is resulting in new high-level architectural approaches to SoCs.

Moving in opposite directions
The goals of shrinking process dimensions are fundamentally twofold — to create faster transistors, and to squeeze more of them into a given silicon area. Both have been successful. But connecting those faster transistors requires interconnects, and if those interconnects take up too much space, then the integration goal will not be met.

Chipmakers have squeezed those interconnects in with increasingly narrow lines, which are placed ever closer together. But line resistance is inversely proportional to the cross-sectional area of the conductor, and making a line narrower shrinks that cross-section. That could be compensated for by making the lines taller (similar to the approach taken with DRAM storage capacitors), but if placed aggressively close together, such tall lines would effectively be metal plates with high capacitance. That, in turn, would increase delays.

Fig. 1: Metal lines have resistance inversely proportional to their cross-section. The top left is a conceptual illustration of older wide lines. The top right shows those lines narrowed and closer together, with reduced cross-section and higher resistance. The bottom version shows an attempt to maintain the original cross-sectional area with close spacing, but it results in high mutual capacitance. Source: Bryon Moyer/Semiconductor Engineering

So a balance is struck, trading cross-section and resistance against line height and mutual capacitance. The net effect is that metal delays have increased by roughly an order of magnitude from what they were about 20 years ago.
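
The tradeoff can be made concrete with a first-order model. The following is a minimal sketch, not a process-accurate calculation: it assumes a rectangular line, bulk copper resistivity, and only sidewall (parallel-plate) coupling to two neighbors. All dimensions, constants, and function names are illustrative assumptions.

```python
# First-order wire model. Every number here is an illustrative assumption.
RHO = 1.7e-8         # ohm*m, bulk copper; real nanoscale lines are more resistive
EPS0 = 8.854e-12     # F/m, vacuum permittivity
K_DIELECTRIC = 3.0   # assumed relative permittivity of the material between lines


def wire_resistance(length, width, height):
    """R = rho * L / (w * h): resistance rises as the cross-section shrinks."""
    return RHO * length / (width * height)


def coupling_capacitance(length, height, spacing):
    """Parallel-plate estimate of sidewall coupling to two neighboring lines."""
    return 2 * K_DIELECTRIC * EPS0 * height * length / spacing


def rc_product(length, width, height, spacing):
    """R * C in seconds, used here as a rough proxy for wire delay."""
    return (wire_resistance(length, width, height)
            * coupling_capacitance(length, height, spacing))


LENGTH = 1e-3  # 1 mm of wire, in meters
wide = rc_product(LENGTH, width=100e-9, height=100e-9, spacing=100e-9)
narrow = rc_product(LENGTH, width=25e-9, height=100e-9, spacing=25e-9)
tall = rc_product(LENGTH, width=25e-9, height=400e-9, spacing=25e-9)

print(f"narrow vs. wide RC: {narrow / wide:.1f}x")
print(f"tall vs. narrow RC: {tall / narrow:.1f}x")
```

In this simplified model the line height cancels out of the RC product, so the tall line recovers the original resistance but gains sidewall capacitance in the same proportion. Real metal stacks also couple to the layers above and below, so height is not exactly neutral in practice.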

“At that time, 1 millimeter of wire had about 100 picoseconds of delay,” said Rado Danilak, CEO of Tachyum. “Today, 1 millimeter of wire has a 1,200-picosecond delay.” That works against improvements in transistor speed, but it also changes the balance of delay contributions between transistors and interconnect.
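
To get a feel for what those per-millimeter figures mean, the sketch below estimates how far a signal could travel in one clock period. It assumes delay scales linearly with length (which is only roughly true for well-buffered wires) and uses a 5GHz clock purely as an illustrative assumption. The function name is hypothetical.

```python
def reach_mm(clock_ghz, delay_ps_per_mm):
    """Distance (mm) a signal covers in one clock period, assuming linear delay."""
    period_ps = 1000.0 / clock_ghz
    return period_ps / delay_ps_per_mm


CLOCK_GHZ = 5.0  # assumed clock rate, giving a 200ps period

then_mm = reach_mm(CLOCK_GHZ, 100.0)    # about 100ps per mm, per the quote
now_mm = reach_mm(CLOCK_GHZ, 1200.0)    # about 1,200ps per mm, per the quote

print(f"reach per cycle then: {then_mm:.2f} mm")
print(f"reach per cycle now:  {now_mm:.2f} mm")
```

Under these assumptions the reachable distance per cycle shrinks by the same factor of 12 that separates the two quoted delay figures.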

While wires have become slower, that doesn’t always make as big a difference in the actual delay of a given signal. “Yes, technically, interconnect resistance and capacitance have gone up,” noted João Geada, chief technologist at Ansys. “But the distance between transistors, on average, is significantly smaller.” Nonetheless, designers and architects are paying more attention than ever to these on-chip delays.

Modern metal stacks can have as many as 16 layers at the 5nm node, up from 10 layers at 28nm. Not all of these layers suffer from this slowdown, though. The lowest layers with the smallest lines see the greatest effect.

“After 28nm, the layer stack started telescoping,” said Eliot Gerstner, senior design engineering architect at Cadence. “You’ve got those bottom layers that are being double-patterned. And you can really communicate only three or four cell-widths away on those lower-level metals because they’re so resistant.”

As a result, signals that need to travel farther than that may need to be promoted to a higher layer, where wider, less-resistive metal can carry the signals over longer distances. But even there, challenges remain. The vias and via pillars that help to move signals between layers also are becoming increasingly resistive. And because the advanced transistors have lower drive than prior generations, long lines are more susceptible to noise, and these long signals may need to be buffered.

That means running the signal back down through the metal stack to the silicon, where a buffer restores the signal before it is pushed back up to the higher metal layer to travel farther. “By using upper metal layers, you get better long-distance communication because you’re having thicker metal at the top,” said Michael Frank, fellow and chief architect at Arteris IP. “The cost is that you have to go to multiple buffer stages to drive these long and heavy wires.”
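
The reason buffering pays off despite that cost can be sketched with a simple distributed-RC (Elmore-style) estimate, in which the delay of an unbuffered wire grows with the square of its length. The per-millimeter resistance and capacitance, the buffer delay, and the function names below are all assumptions chosen for illustration. The model ignores driver resistance and the via trips up and down the stack.

```python
def unbuffered_delay(r_per_mm, c_per_mm, length_mm):
    """Elmore estimate for a distributed RC line: 0.5 * r * c * L^2 (seconds)."""
    return 0.5 * r_per_mm * c_per_mm * length_mm ** 2


def buffered_delay(r_per_mm, c_per_mm, length_mm, n_segments, t_buffer):
    """Split the wire into equal segments, paying one buffer delay per segment."""
    segment_mm = length_mm / n_segments
    return n_segments * (unbuffered_delay(r_per_mm, c_per_mm, segment_mm) + t_buffer)


def best_segment_count(r_per_mm, c_per_mm, length_mm, t_buffer, max_segments=64):
    """Brute-force search for the segment count with the lowest total delay."""
    return min(
        range(1, max_segments + 1),
        key=lambda n: buffered_delay(r_per_mm, c_per_mm, length_mm, n, t_buffer),
    )


# Assumed values: 1 kilohm/mm, 0.2 pF/mm, a 4 mm route, 20 ps per buffer stage.
R, C, LENGTH_MM, T_BUF = 1000.0, 0.2e-12, 4.0, 20e-12

n_best = best_segment_count(R, C, LENGTH_MM, T_BUF)
plain = unbuffered_delay(R, C, LENGTH_MM)
buffered = buffered_delay(R, C, LENGTH_MM, n_best, T_BUF)

print(f"unbuffered: {plain * 1e12:.0f} ps")
print(f"buffered with {n_best} segments: {buffered * 1e12:.0f} ps")
```

Splitting the wire turns the quadratic growth into something closer to linear, but each added buffer contributes its own delay, so there is an optimum segment count rather than a "more is better" rule.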

As usual, it’s a matter of tradeoffs. “Deep inside a processor, structures like multipliers and register files are limited by all the ‘wires’ needed to route operands around and to enable multiple entry/exit ports,” said Cadence’s Williams. “Too many wires and your area and speed suffer. Not enough wires and you aren’t getting the most from your design.”

There are three levels at which these delays can be addressed. The most fundamental level involves the process itself. Beyond that, delay challenges typically have been addressed at the implementation level. But when things become yet harder, architecture becomes an important aspect of dealing with metal delays.

Process and implementation
At the process level, wire delays have resulted in a re-evaluation of which metals to use. When lines get thin, copper’s lattice structure becomes a weakness. Vibrations in the lattice (phonons) shorten the mean-free-path of electrons, increasing resistivity. “We are getting to lattice and quantum mechanical effects such that, at very narrow widths, the copper lattice in the metal has interactions between phonons and the charge carriers,” said Arteris IP’s Frank.

This is why cobalt is being considered in these applications. “Cobalt has a different lattice structure,” Frank explained. While not as good a conductor as copper for big wires, it becomes less resistive than copper for very fine wires. “If you go below 20 or 30nm wires, cobalt has an edge,” he said. That, plus moves to use cobalt instead of tungsten in vias, can help relieve some of the delay impact at its source.

At the implementation level, designers rely on sophisticated EDA tools as well as manual manipulations to coax a design to closure. The two classic approaches to higher clock speeds are parallelism and pipelining.

Lower-level parallelism sacrifices gate count for speed. “If your building blocks get too big, you break your function up into multiple parallel units, with multiple parallel data paths,” said Williams. This can mean doing the same calculation in multiple places.

“As long as you can afford the power, it is sometimes cheaper to recalculate results rather than to transport them from here to there,” said Frank.

Pipelining, meanwhile, shortens paths for a faster clock period at the potential expense of latency. “To get the core cycle times higher, you take the work you do, and you split it into smaller chunks,” said Steven Woo, fellow and distinguished inventor at Rambus. “It’s a larger number of steps, but each step is a little bit smaller.”

Williams agreed. “When you can’t address delays with buffers, you use pipelining to break up long events into a series of short ones that each can be done quickly, allowing a higher clock rate,” he said.
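
The tradeoff between clock period and latency can be illustrated with a small calculation. This sketch assumes the logic divides evenly among the stages and that each stage pays a fixed flip-flop overhead. The delay values and the function name are hypothetical.

```python
def pipeline_timing(logic_delay_ps, stages, flop_overhead_ps):
    """Return (clock period, end-to-end latency) in ps for an evenly split pipeline."""
    period = logic_delay_ps / stages + flop_overhead_ps
    latency = stages * period
    return period, latency


LOGIC_PS = 1000.0   # assumed total combinational delay of the path
FLOP_PS = 30.0      # assumed per-stage flip-flop overhead

single_period, single_latency = pipeline_timing(LOGIC_PS, 1, FLOP_PS)
piped_period, piped_latency = pipeline_timing(LOGIC_PS, 5, FLOP_PS)

print(f"1 stage:  period {single_period:.0f} ps, latency {single_latency:.0f} ps")
print(f"5 stages: period {piped_period:.0f} ps, latency {piped_latency:.0f} ps")
```

With these assumed numbers the five-stage version clocks more than four times faster, while the total time for one result to emerge grows slightly because of the added flip-flops.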

Both techniques require additional gates or flip-flops, but net area still can be reduced by lowering the burden on the transistors. “Designers actually are going to consume less area and power at a given frequency than they will if they don’t have those flops,” noted Cadence’s Gerstner, referring to the extra flip-flops needed for pipelining.

But there’s only so much that can be done during implementation. At some point, metal delays must be considered at the architectural level, long before the design work starts.

NoCs and clocks
Where logic delays once dominated against finite but fast metal delays, now those metal delays make up a much larger percentage of the performance problem. “Distance is extremely expensive,” said Gerstner. Fundamental architecture decisions that were made in a time when metal delay mattered less may be challenged with the new reality.

One architectural change sees the notion of the bus for major chip interconnect giving way to the network-on-chip (NoC). “NoC companies used [the pipelining idea] to break up long interconnects into a chain of small ones,” said Williams.

Arteris IP’s Frank echoed this observation. “This whole transition from 180 to 5nm has pushed a lot of people to go for NoCs rather than bus structures, because you cannot close timing over large areas,” he said.

NoCs increasingly are relied on for advanced, large chips. “Almost all SoCs with more than about 20 IP blocks use NoCs today,” said Kurt Shuler, vice president of marketing at Arteris IP. He noted that nearly half of the company’s NoC designs are on 7nm or smaller processes.

There is a cost to using a NoC, however. The bulk of the signals using the NoC require arbitration in order to place a packet on the network, and that can take tens of clock cycles, which adds to latency. “You need to think about all those arbitrations that you have in the interconnect that are creating congestion issues,” noted Pierre-Xavier Thomas, group director for technical and strategic marketing at Cadence.

Parallelism has a role here, as well. “If the cost of communication is high, you have to communicate a lot in one shot,” said Gerstner. “And so for the next generation, we’re already planning on 1,024-bit interfaces.”

This helps to amortize arbitration delays or other interconnect overhead. “When you pay the latency cost, you get more data back,” noted Williams.
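
A rough way to see the amortization is to count cycles per transfer. The sketch below assumes a fixed arbitration cost paid once per transfer, followed by one data beat per cycle. The 20-cycle arbitration figure, the payload size, and the function names are illustrative assumptions rather than measurements of any particular NoC.

```python
def transfer_cycles(payload_bits, width_bits, arbitration_cycles):
    """Cycles for one transfer: fixed arbitration cost plus one cycle per data beat."""
    beats = -(-payload_bits // width_bits)  # ceiling division
    return arbitration_cycles + beats


def bits_per_cycle(payload_bits, width_bits, arbitration_cycles):
    """Effective throughput once the arbitration overhead is included."""
    return payload_bits / transfer_cycles(payload_bits, width_bits, arbitration_cycles)


PAYLOAD_BITS = 4096   # assumed 512-byte transfer
ARBITRATION = 20      # assumed arbitration cost, in clock cycles

for width in (128, 1024):
    cycles = transfer_cycles(PAYLOAD_BITS, width, ARBITRATION)
    rate = bits_per_cycle(PAYLOAD_BITS, width, ARBITRATION)
    print(f"{width:>4}-bit interface: {cycles} cycles, {rate:.1f} bits/cycle")
```

The arbitration cost is identical in both cases. The wider interface simply spends fewer cycles on data beats, so the fixed overhead buys more data per transfer.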

Another fundamental aspect of architectural change involves clock domains. The challenge of maintaining consistent timing across a chip that is getting bigger (by delay standards) has spurred a rethinking of wide-ranging clock domains in favor of “locally synchronous, globally asynchronous” clocking.

“People think more about asynchronous clock domains nowadays than they did 10 years ago,” said Frank. “This directly impacts the architecture, because every transition over a clock boundary adds latency.”

With this approach, one optimizes for a particular clock domain only within a given radius. Beyond that, one can think of distant destinations as having their own timing. Signals would need to be synchronized for long runs, but it relieves the timing-closure challenge of maintaining complete synchrony across long distances and between large blocks.

SRAMs pose their own unique challenges. Performance isn’t changing at the rate of the rest of the chip. “The memories have not been shrinking as fast as the standard cells,” noted Gerstner. Cadence’s approach is to “flop,” or register, data going into and coming out of the memory. “On our newer architectures, we are now flopping both the inbound memory request and the outbound results,” said Gerstner.

Synopsys is taking that one step further. “In our next generation, the SRAM is going to be running in its own clock domain,” said Carlos Basto, principal engineer, ARC processor IP at Synopsys. “The speed of the SRAM will be completely decoupled from the speed of the rest of the core. The tradeoff there is an increased latency in accessing that memory.”

That means, of course, that appropriate clock-domain crossings must be provided to ensure reliable signaling. “The cycle before and after that SRAM access have to be designed extremely, extremely carefully,” said Basto.

Fig. 2: SRAM timing can be eased through pipelining (left) or decoupled from core timing by placing the SRAM in its own domain (right). Clock-domain crossing must be used where the signals move from one domain to the other. Source: Bryon Moyer/Semiconductor Engineering

In addition, the sizes of memory blocks are limited by foundries. “Memory compilers aren’t expanding the maximum size of a single macro cell,” said Gerstner. “As a result, we have to bank memories considerably more.”
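
As a rough illustration of what banking involves, the sketch below splits a logical memory across several macros and maps each address to a bank and a row within that bank. The macro size limit, the low-order interleaving scheme, and the function names are assumptions made for the example. Actual designs may partition addresses differently.

```python
def banks_needed(total_words, max_words_per_macro):
    """Number of macros required when one macro cannot hold the whole memory."""
    return -(-total_words // max_words_per_macro)  # ceiling division


def map_address(addr, num_banks):
    """Low-order interleaving: consecutive addresses land in consecutive banks."""
    return addr % num_banks, addr // num_banks  # (bank index, row within bank)


TOTAL_WORDS = 65536       # assumed logical memory depth
MAX_MACRO_WORDS = 8192    # assumed maximum depth of a single macro

num_banks = banks_needed(TOTAL_WORDS, MAX_MACRO_WORDS)
bank, row = map_address(13, num_banks)

print(f"{num_banks} banks; address 13 maps to bank {bank}, row {row}")
```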

Changing ISAs
The change in delay contributions can even impact instruction sets and associated software development tools. “Because data movement is such a limiter to performance, suddenly the guys making instruction sets, the guys making compilers, and the programmers themselves can no longer treat that hardware as abstract,” said Rambus’ Woo. “You actually have to understand the underlying structure.”

“The architecture and micro-architecture of processors and accelerators are adapting to ensure that pipelines can be fed efficiently,” said Francisco Socal, senior product manager, architecture and technology group at Arm. “At the architecture level, features such as GEMM [general matrix multiplication] allow more efficient use of memory by software, while micro-architectures continue to evolve techniques such as speculation, caching and buffering.”

Tachyum, a processor startup, is attempting to take advantage of this change with a new clean-sheet instruction set architecture (ISA). The company illustrates its approach with a discussion of what it takes to achieve a 5GHz clock — a 200 picosecond clock period (for convenient math, but not unrealistic, according to Tachyum). The question is, what can be done in 200ps? Anything that can’t be completed in that timeframe either would need to be broken down into smaller chunks, through pipelining, or span more than one clock cycle. The ISA is one area where architects have the flexibility to go either way.

Tachyum’s assertion is that many currently prominent ISAs were developed back when transistor delays dominated. As those delays have shrunk, the amount of time it takes the arithmetic logic units (ALUs) to do their work has come down. In the past, logic delays would have used most of that 200ps cycle. But now logic may account for well under 100 of those picoseconds. “Computation is less than half the time, and half the time is getting ALU data from other ALUs,” said Danilak.

An example of how delays have affected the ISA has to do with getting data to an ALU. Given multiple parallel ALUs, a given operation at one of those ALUs may take its input from one of three sources — a register, the ALU itself (with the result of its prior operation), or a different ALU. Tachyum said that the first two can be done within 100ps. If the data comes from a different ALU, however, it needs more than that 100ps.

The company’s solution is to split the instruction set. Single-cycle instructions are used where the data source permits. Two-cycle instructions are used otherwise. The compiler makes the decision, because in most cases the compiler knows where the ALU inputs will reside.

Fig. 3: Tachyum’s ISA has 1- and 2-cycle instructions, depending on the source of the data. Data from registers and the same ALU can arrive in 1 cycle. Data from other ALUs needs 2 cycles to arrive. The compiler selects the appropriate version of the instruction. Source: Bryon Moyer/Semiconductor Engineering

It’s possible, however, that, with dynamic libraries, the location of the data won’t be known at compile time. In this case, the compiler needs to work optimistically, assuming the data will be close by. But Tachyum has added a backstop in case that assumption is wrong. “We have hardware that detects when we try to consume data too early, and it will stall the machine,” said Danilak. This provides for two versions of some of these instructions — the one-cycle version and the two-cycle version. But it helps only if the one-cycle version is needed frequently enough to make a difference. Tachyum claims 93% of instructions use the faster version.
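
The selection logic described above can be sketched as follows. This is a toy model for illustration only, not Tachyum's compiler or hardware: the names are hypothetical, and it assumes that a hardware stall simply makes up the one-cycle difference when the optimistic choice turns out to be wrong.

```python
from enum import Enum
from typing import Optional


class Source(Enum):
    REGISTER = "register"
    SAME_ALU = "same_alu"
    OTHER_ALU = "other_alu"


def encoded_cycles(known_source: Optional[Source]) -> int:
    """Compiler's choice: 2 cycles only when the operand is known to come
    from a different ALU. An unknown source (None) is treated optimistically."""
    return 2 if known_source is Source.OTHER_ALU else 1


def executed_cycles(encoded: int, actual_source: Source) -> int:
    """Cycles actually spent. If the operand arrives later than the encoding
    assumed, the (assumed) hardware stall covers the difference."""
    needed = 2 if actual_source is Source.OTHER_ALU else 1
    return max(encoded, needed)


# Source unknown at compile time, for example because of a dynamic library.
optimistic = encoded_cycles(None)

print(executed_cycles(optimistic, Source.REGISTER))    # guess was right
print(executed_cycles(optimistic, Source.OTHER_ALU))   # guess was wrong, stall
```

The optimistic guess costs nothing when it is right and falls back to the slower timing when it is wrong, which is why the share of instructions that can use the one-cycle version matters.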

Playing with the cycle count also can be a strategy for processor architectures like Cadence’s Tensilica, which allows custom instructions for an application. They provide flexibility when defining the number of clock cycles that a given custom instruction will consume. “The native instructions have a fixed cycle count,” said Gerstner. “Any additional custom instructions will get a cycle count per design.”

ISA changes have huge implications, and companies that have to support legacy code may not have the freedom to redo their ISA. In the case of custom instructions in a Tensilica core, these are usually specific to an embedded application. Those cores are not likely to be executing a wide range of programs created by others, making legacy less of a concern.

The challenge with any architectural approaches is that they must be considered very early on in the planning. The benefit, however, is that they may reduce the burden on implementation, ultimately providing both faster time-to-market and faster performance. We’re likely to see a continued focus on architecture as a way to adapt to changing delay dynamics.
