More factors enter into decisions about how to optimize designs for specific applications.
Rightsizing chip architectures is getting much more complicated. There are more options to choose from, more potential bottlenecks, and many more choices about what process to use at what process node and for which markets and price points.
Rightsizing is a way of targeting chips to specific application needs, supplying sufficient performance while minimizing power and cost. It has been a topic of conversation across the semiconductor industry for years, because as power becomes a bigger issue, improving the efficiency of designs by limiting compute resources is a big opportunity. ARM's push into the server arena and its rollout of the big.LITTLE architecture is a case in point. So is Intel's rollout of Quark and Atom chips to supplement its server and PC chips.
Compute cycles are cheaper than in the past, but they still have a measurable cost. In the mobile market, this cost shows up in battery life. In data centers, it is reflected in the utility costs of powering and cooling server racks. This also accounts for why new memory types, such as ReRAM and 3D XPoint, are under development. All of these are attempts to deal with similar issues from the memory side.
But rightsizing has become much more difficult than just pairing a processor’s frequency or size to a specific application or changing the memory size or type. Increasingly, it can involve decisions about:
• Where the processing is done, whether it’s on-chip, at the edge of the network, or in the cloud;
• What materials are being used for the substrate and insulation;
• Which process node works best for a particular application, and
• How various components, or all of them, are packaged.
“What’s changed is that the processor speed has been progressing along with Moore’s Law for several decades while other things haven’t kept up,” said Steven Woo, vice president of solutions marketing at Rambus. “The bottleneck is access speed to memory and the network.”
There are a variety of new developments creeping into this equation. Historically, increasing the size and amount of memory has improved performance. That isn’t necessarily true anymore.
“If you use smaller memory chips, you can run that memory faster than if you’re adding an enormous memory system,” said Woo. “In the past, the memory hierarchy was on-chip cache, off-chip cache in the same package, discrete DRAM and solid state drives or other storage. We’re likely to see more levels added, which could include high-bandwidth memory or the Hybrid Memory Cube. So rather than just adding larger-capacity memory, the metric may be more about power per bit or power/performance per dollar.”
The rule of thumb used to be that as chipmakers ascended the memory hierarchy, latency would drop by a factor of 10 and bandwidth would increase by the same factor, with the cost per bit rising at each level. But so many kinds of memories are being added into the mix that those tradeoffs are becoming harder to quantify.
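To make that old rule of thumb concrete, here is a minimal Python sketch that applies a uniform 10X step per level. The baseline latency, bandwidth, and cost figures are assumptions chosen only for illustration, not measured values.

```python
# Illustrative sketch of the classic memory-hierarchy rule of thumb:
# each step up toward the processor cuts latency ~10X, raises bandwidth ~10X,
# and costs more per bit. All numbers are hypothetical, for illustration only.

LEVELS = ["SSD/storage", "discrete DRAM", "off-chip cache", "on-chip cache"]

def hierarchy_table(base_latency_ns=100_000.0,   # assumed storage latency
                    base_bandwidth_gbs=1.0,      # assumed storage bandwidth
                    base_cost_per_gb=0.10,       # assumed storage $/GB
                    step=10.0):
    """Apply the 10X-per-level rule of thumb, starting from storage."""
    rows = []
    for i, level in enumerate(LEVELS):
        rows.append({
            "level": level,
            "latency_ns": base_latency_ns / step**i,
            "bandwidth_gbs": base_bandwidth_gbs * step**i,
            "cost_per_gb": base_cost_per_gb * step**i,
        })
    return rows

if __name__ == "__main__":
    for row in hierarchy_table():
        print(f"{row['level']:>15}: {row['latency_ns']:>10.1f} ns, "
              f"{row['bandwidth_gbs']:>7.1f} GB/s, ${row['cost_per_gb']:.2f}/GB")
```

Once HBM, the Hybrid Memory Cube, ReRAM or 3D XPoint are wedged between those levels, the steps are no longer uniform, which is exactly why the tradeoffs are getting harder to quantify.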
FinFETs complicate this equation as well, because the tradeoff between power and performance is much more abrupt. Rather than a gentle slope, the curve looks more like a cliff.
“When you look at delay versus a supply voltage, reducing supply voltage is the key to reducing power,” said Tobias Bjerregaard, CEO of Teklatech. “But what happens typically at non-finFET technologies is that when you reduce the supply voltage, at some point you start getting a much lower performance and the curve slopes faster upwards. At some point, you just can’t get the thing to work. FinFET is not sloping up in terms of delay as fast. It’s quite flat almost all the way down to the point where it just breaks, and just can’t work anymore. This is a good thing because it allows you to reduce the supply voltage quite far, but it also means that if you violate your power integrity margin, then it doesn’t work at all. It doesn’t just get slower. It doesn’t work.”
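The shape Bjerregaard describes can be sketched with the textbook alpha-power-law delay model, where gate delay is roughly proportional to Vdd/(Vdd − Vth)^α. The parameters below are assumptions picked only to reproduce the qualitative curves, a planar-like device that slows down gradually versus a finFET-like device that stays flat until it stops working; they are not measured data.

```python
# Minimal sketch of delay vs. supply voltage using the alpha-power-law
# delay model: delay ~ Vdd / (Vdd - Vth)^alpha.
# The device parameters are assumptions chosen only to reproduce the
# qualitative shapes described above, not real planar or finFET data.

def relative_delay(vdd, vth, alpha):
    """Alpha-power-law gate delay; diverges as Vdd approaches Vth."""
    if vdd <= vth:
        return float("inf")   # the circuit simply stops working
    return vdd / (vdd - vth) ** alpha

PLANAR = {"vth": 0.45, "alpha": 1.3}   # hypothetical planar-like device
FINFET = {"vth": 0.30, "alpha": 1.8}   # hypothetical finFET-like device

if __name__ == "__main__":
    print(" Vdd   planar   finFET   (delay relative to 1.0 V)")
    for mv in range(1000, 350, -50):
        vdd = mv / 1000.0
        p_rel = relative_delay(vdd, **PLANAR) / relative_delay(1.0, **PLANAR)
        f_rel = relative_delay(vdd, **FINFET) / relative_delay(1.0, **FINFET)
        print(f"{vdd:4.2f}  {p_rel:7.2f}  {f_rel:7.2f}")
```

With these assumed parameters the planar-like curve climbs steeply as Vdd drops, while the finFET-like curve stays comparatively flat and then goes straight to "inf," mirroring the cliff behavior described above.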
That makes the choice about what kind of transistor and where to use it much more difficult. Does an architect choose 28nm or 40nm, or a 16/14nm finFET, particularly if there are going to be derivative chips for multiple markets? Those decisions become even more complicated if the roadmap extends beyond just 16/14nm, too. The next iterations of transistors involve lateral or vertical nanowire FETs, where the impact of these kinds of decisions is untested.
“People want to utilize that potential to reduce the supply voltage, but the drawback is that you have a much harder power integrity wall that you hit,” said Bjerregaard. “This is where having control over your power integrity becomes key to actually utilizing that potential for scaling down voltages. When rightsizing memories, of course, large memories are more efficient in terms of area and so on, but they are also slower and we see that a lot of designs have their critical path around the memories. People want to make them smaller and faster, but on the other hand they are busting area and we are back to the profitability bar. It’s all connected—timing, area, profitability, power. We are working down to 10nm where these issues are becoming even more critical than they were in previous technologies.”
Leaky pipes
A big driver for rightsizing in the past has been static leakage. Between 180nm and 20nm, leakage current steadily increased to the point where “on” and “off” became relative rather than absolute terms. It improved again at 16/14nm with finFETs due to better gate control, but the downward spiral will begin all over again at 10nm and 7nm, this time coupled with increasing dynamic power density.
That puts a new spin on rightsizing, because if leakage current cannot be brought under control, then the advantages of moving to the next node (or at least moving everything to the next node) are diminished.
“Leakage is a constant, which provides a good reason to run fast and shut off a circuit,” said Drew Wingard, CTO of Sonics. “Basically what you’re doing is rightsizing at the level of the application. If you look at video encoding, you build a decoder for the highest-resolution video. But most video shipped today is not at the highest resolution. So you can run it at full HD streaming and use a quarter of the processing capability, or you can cut the clock by three-quarters and get it done in the same time.”
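The arithmetic behind "run fast and shut off" can be sketched directly. In the example below (Python, with assumed power numbers rather than figures from Sonics), dynamic energy per frame is roughly the same either way, but leakage is paid for the whole frame when the clock is cut and for only a quarter of it when the decoder races to idle and is then power-gated.

```python
# Back-of-the-envelope sketch of the video-decode example: hardware sized for
# 4X the resolution actually being decoded. All power numbers are assumed,
# purely for illustration.
#
#   Option A ("race to idle"): run at full clock, finish in 1/4 of the frame
#   time, then power-gate the block so it stops leaking.
#   Option B: cut the clock to 1/4 and stay on (and leaking) the whole frame.

FRAME_TIME_MS = 16.7          # one frame at ~60 fps
P_DYNAMIC_FULL_MW = 400.0     # assumed dynamic power at full clock
P_LEAKAGE_MW = 100.0          # assumed leakage power while powered on

def energy_race_to_idle():
    active = FRAME_TIME_MS / 4                           # done in 1/4 of the frame
    return (P_DYNAMIC_FULL_MW + P_LEAKAGE_MW) * active   # then gated: no leakage

def energy_slow_clock():
    # Same work spread over the whole frame at 1/4 the clock (same voltage):
    # dynamic energy per frame is roughly unchanged, but leakage runs all frame.
    return P_DYNAMIC_FULL_MW / 4 * FRAME_TIME_MS + P_LEAKAGE_MW * FRAME_TIME_MS

if __name__ == "__main__":
    print(f"race to idle : {energy_race_to_idle() / 1000:.2f} mJ per frame")
    print(f"slow clock   : {energy_slow_clock() / 1000:.2f} mJ per frame")
```

If cutting the clock also allows the supply voltage to drop, the slow option claws back dynamic energy, which is why this is a rightsizing decision rather than a fixed rule.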
It gets more complex from there, Wingard noted. The processor clock speed has to be in sync with the wires, and as wires get thinner and longer, resistance and capacitance take a bigger toll on delay. In addition, the memory subsystem has to match the processor.
“If you run the DRAM subsystem fast and then cut it off, that only works if you run the other things fast,” he said. “It becomes a holistic challenge. You can power gate using software. But you also can put a lot of smarts in the hardware if the driver is not running very often so that the driver does not do as much. You shut off the processor running that driver.”
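Whether shutting a block off actually saves energy usually comes down to a break-even check: the idle period must be long enough for the leakage saved to outweigh the cost of saving state, gating and waking back up, and the wake-up latency must fit the latency budget. Here is a minimal sketch of that decision, with hypothetical numbers.

```python
# Hypothetical break-even check for power gating a block (for example, the
# processor running an infrequently used driver). Numbers are assumptions
# chosen for illustration only.

P_LEAK_MW = 50.0          # leakage power while the block sits idle but powered
E_GATE_UJ = 120.0         # assumed energy cost to save state, gate, and wake up
T_WAKE_MS = 0.2           # assumed wake-up latency

def worth_gating(predicted_idle_ms, latency_budget_ms):
    """Gate only if the leakage saved beats the gating overhead and the
    wake-up latency fits the latency budget."""
    leakage_saved_uj = P_LEAK_MW * predicted_idle_ms   # mW * ms = uJ
    return leakage_saved_uj > E_GATE_UJ and T_WAKE_MS <= latency_budget_ms

if __name__ == "__main__":
    for idle_ms in (1.0, 5.0, 50.0):
        print(f"predicted idle {idle_ms:5.1f} ms -> gate: {worth_gating(idle_ms, 1.0)}")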
Thinking bigger
That holistic view is beginning to show up in a number of areas. While many factors can be handled locally, power is a global issue for chips and the systems that contain them. Chipmakers are well aware of that, which is why they are using more tools simultaneously than ever before, particularly for such things as modeling. Whether that is an integrated package from one vendor, or a collection of tools from multiple vendors pulled together by the chipmaker, varies from one company to the next. What is clear is that no single tool does everything, and more tools are needed to get a full picture of how to create a more efficient design.
Gordon Allan, Questa product manager at Mentor Graphics, said the key areas where there has been a lot of activity lately involve performance/throughput, latency and parallelism. “Designers are also using the communications fabric on chip to scale to more cores. We’re seeing a lot more system-level analysis and verification of cache coherency. We’re seeing this in the demand for verification IP for ARM AMBA 5 and related technologies. There is a lot more standardization going on around fabrics, architectures, topologies and memories.”
There is good reason for that. Data is exploding in every facet of design, and rightsizing has to make sense of all of it. “We’re seeing more bugs in corner cases that were not anticipated,” Allan noted. “That’s also why we’re seeing more on-chip/off-chip equations. As packaging options become more affordable with multiple die, it will pave the way for more compute architectures. There is an interesting abstraction going on. Abstraction is the enemy of rightsizing because you’re dealing with software abstractions, hardware abstractions, and rationalization of storage and memory. The next round of innovation in tools will be focused on rightsizing.”
Aart de Geus, Synopsys chairman and co-CEO, alluded to this as well in his keynote speech at the Synopsys User Group last month, but from a different angle. He said that if the industry can achieve 100X improvement in performance and power, it will have huge implications for everything from wearable electronics to assisted and autonomous driving and industrial automation. Improvements of that scale require rightsizing and more informed decisions about what to use where and how to put it all together.
The software piece
Software and IP are the glue behind all of this. But if software can be more closely linked to the hardware, then fewer processor cycles are required in the first place. This is the thinking behind Linaro, an independent organization working on improving the performance and efficiency of open-source software running on the ARM instruction set.
“More companies are taking a holistic view of software these days,” said Tom Beckley, senior vice president of the custom IC and PCB group at Cadence. “The system architecture role is becoming more important. They want to craft system-level models and be able to bring them back up. That’s where we are today. If you run a power test, what’s in the software, what’s in the hardware?”
Aveek Sarkar, vice president of product engineering and support at Ansys, agrees. “What you’re really doing here is optimizing the design, and to do that you have to have confidence that you are looking at all scenarios. Then you can do the optimization. So if there is a voltage drop, you need to be able to profile that across multiple different profiles. The pattern will drive where you can optimize. If you’re putting a power grid across entire designs, in most cases that is overkill.”
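One way to act on that is to record the worst voltage drop per region across many activity scenarios and reinforce the power grid only where those profiles exceed the margin. The sketch below is a hypothetical illustration of that idea, not an Ansys tool flow; the region names, drop values, and threshold are made up.

```python
# Hypothetical sketch of scenario-based IR-drop profiling: find, per region,
# the worst voltage drop seen across several activity scenarios, and only flag
# regions whose worst case exceeds the margin for grid reinforcement.
# Region names, drops (in mV), and the threshold are invented for illustration.

from collections import defaultdict

SCENARIOS = {
    "video_decode": {"cpu0": 38, "gpu": 72, "modem": 12},
    "camera_burst": {"cpu0": 55, "gpu": 41, "modem": 18},
    "idle_standby": {"cpu0":  6, "gpu":  4, "modem": 25},
}
IR_DROP_LIMIT_MV = 50

def regions_to_reinforce(scenarios, limit_mv):
    worst = defaultdict(int)
    for drops in scenarios.values():
        for region, mv in drops.items():
            worst[region] = max(worst[region], mv)
    return {r: mv for r, mv in worst.items() if mv > limit_mv}

if __name__ == "__main__":
    print(regions_to_reinforce(SCENARIOS, IR_DROP_LIMIT_MV))
    # -> only 'cpu0' and 'gpu' exceed the margin here; the rest of the grid
    #    does not need to be over-designed.
```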