How Many Cores? (Part 2)

Part 2: Fan-outs and 2.5D will change how cores perform and how they are used; hybrid architectures evolve.


New chip architectures and new packaging options—including fan-outs and 2.5D—are changing basic design considerations for how many cores are needed, what they are used for, and how to solve some increasingly troublesome bottlenecks.

As reported in part one, just adding more cores doesn’t necessarily improve performance, and adding the wrong size or kinds of cores wastes power. That has set the stage for a couple of broad shifts in the semiconductor industry. On the planar, shrink-everything front, the focus has moved from architectures to microarchitectures and software. On the 2.5D and packaging front, the commercial rollout of high-bandwidth memory and the Hybrid Memory Cube has substantially reduced performance bottlenecks, but there are lingering questions about just how mature the design tools are and how quickly the cost of these designs will drop.

Both of these shifts will have significant impacts on the types and numbers of cores used in designs, as well as where the big challenges will be in the future. In addition, they make it imperative for system architects to drill down much further into what cores work best where, how they will be used, where they will be located, and whether there are alternatives that may improve price/performance/area. That requires much more work on the front end of the design because some cores can be resized or replaced with different kinds of cores to meet different throughput numbers, and potentially more verification on the back end because these new approaches can introduce unexpected corner cases.

“We’re seeing different traffic patterns,” said Sundari Mitra, CEO and co-founder of NetSpeed Systems. “The sub-optimal way is to size everything to meet all of the requirements, or to size everything at the same time so that all peak requirements are met. You need to do latency and bandwidth analysis and make this more heterogeneous so you can add in ‘what if’ analysis.”

This is roughly the equivalent of replacing a single bell curve with an aggregated model of bell curves to get a more detailed and accurate picture of how a device will be used. It represents a combination of the physical attributes of various cores and compute elements, the software that runs on them and connects them, the memories, where they are placed, and how all of these elements are packaged together.
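As a rough illustration of why sizing for every peak at once is sub-optimal, the sketch below compares the sum of per-source peak bandwidths with the peak of the combined traffic. The source names, the traces, and the GB/s figures are invented for this example and are not taken from NetSpeed or any real design.

```python
# Hypothetical bandwidth demand (GB/s) per time slot for three traffic sources.
traces = {
    "cpu": [4, 6, 2, 1, 3],
    "gpu": [1, 2, 8, 3, 1],
    "dsp": [2, 1, 1, 5, 2],
}

def sum_of_peaks(traces):
    """Capacity needed if every source is provisioned for its own peak."""
    return sum(max(t) for t in traces.values())

def peak_of_sum(traces):
    """Capacity needed if provisioning follows the combined demand per slot."""
    return max(sum(slot) for slot in zip(*traces.values()))

worst_case = sum_of_peaks(traces)   # every peak assumed to coincide
combined = peak_of_sum(traces)      # busiest single time slot actually seen
print(worst_case, combined)
```

The gap between the two numbers is the over-provisioning that a per-workload latency and bandwidth analysis is meant to expose. A real analysis would also have to account for latency targets, which this sketch ignores.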

“The issue for multi-core systems doesn’t lie in the hardware, which can be scaled depending on the needs,” said Zibi Zalewski, general manager of Aldec’s Hardware Division. “The problem is on the software side: how to use such multi-dimensional processing power efficiently in the application. This was one of the reasons engineering teams started looking at using FPGAs to accelerate algorithms instead of multi-core processors. FPGAs resolved the speed problem while compilers and C-like languages for software developers became available. The recent merger of Intel and Altera only confirms that process going forward. A joint architecture of a traditional processor and FPGA provides a new solution for the never-ending speed race.”

FPGA vendors were among the first to embrace 2.5D packaging, although not for performance reasons. Xilinx and Altera (now part of Intel) adopted interposers in quad-core chips as a way of improving yield, because smaller chips yield better than a single large chip. Interposers simply eliminated the performance overhead of breaking apart one chip into four pieces.
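The yield argument can be made concrete with a simple Poisson defect model, in which the fraction of defect-free die falls exponentially with die area. The sketch below is a minimal illustration under that assumption. The die area and defect density are made-up numbers, and the model ignores interposer cost, assembly yield, and test cost.

```python
import math

def poisson_yield(area_cm2, defects_per_cm2):
    """Fraction of defect-free die under a simple Poisson defect model."""
    return math.exp(-area_cm2 * defects_per_cm2)

def silicon_per_good_device(total_area_cm2, pieces, defects_per_cm2):
    """Silicon area (cm^2) consumed per working device when the design is
    split into `pieces` equal die that are tested before assembly."""
    piece_area = total_area_cm2 / pieces
    good_fraction = poisson_yield(piece_area, defects_per_cm2)
    return pieces * piece_area / good_fraction

# Illustrative numbers only: a 6 cm^2 design at 0.2 defects per cm^2.
monolithic = silicon_per_good_device(6.0, 1, 0.2)
four_slices = silicon_per_good_device(6.0, 4, 0.2)
print(f"monolithic: {monolithic:.2f} cm^2, four slices: {four_slices:.2f} cm^2")
```

Under these assumptions the four-slice version consumes less silicon per working device, because a single defect scraps only a quarter of the area rather than the whole die.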

In the ASIC world, advanced packaging has put the focus on higher clock speeds because dynamic power, leakage current, and thermal effects can be more easily isolated in a package than on a die. Fan-outs and 2.5D provide much higher-speed interconnects than a planar configuration, which has a direct impact on resistance and capacitance.

“The more cores you have, the more access to memory will be required,” said Bill Isaacson, director of ASIC marketing at eSilicon. “That puts a lot of stress on an ASIC and the routing capability of that ASIC, particularly how you partition a design.”

The cost equation
One of the reasons that FPGA vendors could appreciate the cost benefits so quickly was that the FPGA is the system. The business structure for most SoC vendors, particularly fabless companies, is entirely different. It rewards efficiencies at every level of the design, which often means the individual components or blocks. That has stymied the adoption of FD-SOI for years, and it has slowed the adoption of fan-outs and 2.5D until it became so painful to move to the next process node that many companies began looking for alternatives.

In the case of 2.5D, the real breakthrough was HBM-2, which is now offered commercially by both Samsung and SK Hynix. Each company has said that economies of scale will follow the same path as for other memories. For that and performance/power reasons, 2.5D has become popular outside of the FPGA market, first in networking equipment and servers because of the performance benefits of interposers, and more recently with a growing number of designs reportedly under development across a broad spectrum of markets.

As 2.5D goes mainstream, many of the early concerns about new packaging approaches are coming into focus. One involves known good die. While that remains an issue for 3D-IC, it has proved to be far less of an issue for 2.5D and fan-outs. Handling of die and even thinning of those die has been perfected over the past several years by foundries and OSATs. The bigger issue is the ability to identify and fix potential problems before those die are packaged together. eSilicon’s Isaacson said testing methodologies are now in place for 2.5D, and while the industry still has not hit high-volume production on this kind of packaging outside of networking and server chips, there is a recognition that debug needs to be part of the packaging.

Design teams need to understand how to work with physical effects in those architectures, as well. “The number one problem is thermal,” said Arvind Shanmugavel, senior director of applications engineering in Ansys’ Apache business unit. “The challenge is how to simulate true thermal behavior and how to model the electrical behavior of the interface in 2.5D. You have to model and simulate the entire system in one package.”

But what needs to be modeled also is beginning to change. With power, performance and cost all now in flux inside a package, and software being much more tightly integrated, the tradeoffs are changing significantly.

Coherency and other memory considerations
In the past, almost every discussion about multiprocessing or multiple cores included coherency. Memory had to be updated to keep all of the cores or processing elements in sync. This is no longer a straightforward discussion. As designs become more complex and include more cores and more memory types, there are three options:

• No coherency. Cores can be run asymmetrically and independently.
• Limited coherency. Some cores are coherent while others are not.
• System-level coherency. The coherency discussion shifts from the CPU memory to the overall system using a variety of compute elements. This is particularly important in high-performance computing, which uses a collection of processors ranging from CPUs and GPUs to FPGAs.

What is not clear is how the boundaries between these worlds will shift in the future as new packaging approaches begin to roll out.

“If you look at the heterogeneous compute paradigm, the software guys demand that all the addresses are consistent,” said Drew Wingard, CTO of Sonics. “Everything needs to be cache coherent. You can migrate parts of a program, but they still have to be cache coherent with each other. At the same time, the energy cost for global cache coherency is too high. It may take multiple fetches to retrieve data. And the accelerators do not work well if there are not caches. That leads to a memory bottleneck.”

How HBM-2 plays into this picture isn’t completely clear yet. “There’s still a challenge in leveraging that bandwidth,” Wingard said. “HBM needs to spread traffic evenly. That’s why people are looking at things like multiple memories in parallel.”
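One common way to spread traffic is to interleave addresses across memory channels at a fine granularity, so that a sequential stream touches every channel in turn. The sketch below is a hypothetical illustration, not a description of any vendor's controller. The channel count, the 256-byte interleave granule, and the 1 MiB contiguous regions are all assumptions chosen for the example.

```python
from collections import Counter

def channel_interleaved(addr, channels=8, granule=256):
    """Rotate through channels every `granule` bytes (assumed granularity)."""
    return (addr // granule) % channels

def channel_contiguous(addr, channels=8, region_bytes=1 << 20):
    """Give each channel one large contiguous region (assumed 1 MiB)."""
    return (addr // region_bytes) % channels

def load_per_channel(addresses, mapper):
    """Count how many accesses land on each channel under a given mapping."""
    return Counter(mapper(a) for a in addresses)

# A sequential 64 KiB stream read in 64-byte accesses.
stream = range(0, 64 * 1024, 64)
interleaved = load_per_channel(stream, channel_interleaved)
contiguous = load_per_channel(stream, channel_contiguous)
print(sorted(interleaved.items()))
print(sorted(contiguous.items()))
```

With the contiguous mapping the whole stream lands on a single channel, leaving the rest of the bandwidth unused. The interleaved mapping divides the same accesses evenly across all channels.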

At the same time, these packaging approaches open new options for using cores and memories differently. For example, a core that can run at 2GHz using a highly congested memory architecture may be able to run at twice that speed in a package connected to memory using microbumps, with encryption/decryption running in the background on a separate asymmetric core at a fraction of the power.

But the changes aren’t limited to 2.5D or fan-outs. Cores connected to common memory architectures, regardless of the packaging, also can add some resiliency into systems. That represents a completely different way of using cores connected together as part of a system rather than just a fail-over for redundancy.

“In the future you’re going to see more cores available for software, but not the same cores,” said Kurt Shuler, vice president of marketing at Arteris. “They may communicate with each other, but they generally only have to control part of the system. So in the human brain you’ve got sight, hearing and smell. If something happens to the part with smell, other parts of the brain take over. They don’t do that as well, but it still works.”

Memory architectures can play an important role here, as well. Most of the current approaches use on-chip SRAM and off-chip DRAM. But different packaging options, coupled with different memory architectures, can change the formula.

“You can certainly rebalance things,” said Steven Woo, vice president of solutions marketing at Rambus. “You can aggregate memory and make that available to a processor, which allows you to use less processors in a system. If you look at a data center, CPUs are 10% utilized. That means you can use 1/10th the number of CPUs and the same memory capacity.”

This same formula applies on an SoC or in a 2.5D package, as well. “The number of cores is growing faster than the natural growth rate of memory,” said Woo. “Cores are having trouble getting enough memory, and in some cases, cores are being starved out. We see this in multicore processing, where sometimes the cores sit idle because there is not enough memory capacity.”
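Woo's data-center arithmetic can be written down in a few lines. The sketch below assumes the work can be consolidated perfectly onto fewer CPUs once memory is pooled, which is the idealized case in the quote. The fleet size and the 50% target are illustrative values, not figures from Rambus.

```python
def cpus_needed(current_cpus, utilization_pct, target_utilization_pct=100):
    """CPUs required to carry the same work at a higher target utilization."""
    busy_cpu_percent = current_cpus * utilization_pct
    # Ceiling division keeps the result a whole number of CPUs.
    return -(-busy_cpu_percent // target_utilization_pct)

ideal = cpus_needed(1000, 10)              # fully utilized CPUs
with_headroom = cpus_needed(1000, 10, 50)  # leave half of each CPU free
print(ideal, with_headroom)
```

The ideal case reproduces the one-tenth figure, while any realistic utilization target below 100% raises the CPU count accordingly.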

Moving this up one level of abstraction, all of these pieces need to be viewed at the system level. At that level the big problem is communication, according to Maximilian Odendahl, CEO of Silexica Software Solutions, which develops multicore compilers. “The multicore compiler now needs to decide what to put where. It’s all about what do you want to communicate, not only what goes where. But there is no way to do this manually anymore. It’s too complicated.” He added that the communications issues are the same, whether it’s 2.5D or planar.

It’s easier to view this from a macro level. But on the ground, the number of changes required to make this work smoothly is significant. Those changes affect everything from how chips are architected and assembled to where components are sourced, what process is used, how they are packaged, and even which skill sets and interactions engineering teams need.

“The problem is all those elements need to cooperate with each other at the very high speeds,” said Aldec’s Zalewski. “That includes fast memory access for the operations accelerated in the FPGA, with the ability to exchange the data with a processor or even a software API. Memory and FPGA vendors are all working on the solutions. Fast dynamic memories, built-in FPGA memory blocks, integrated controllers optimized for cooperation in such an architecture, or even integrated chips with a processor and programmable logic—these are becoming very critical tasks. The speed of the processor core itself is no longer a mainstream subject. R&D teams are working on the high-performance infrastructure and FPGA-accelerated projects for the new computing architectures, with market challenges coming from big data processing server farms, automotive driver assistance systems or Internet security.”

So how many cores will be required? There is no simple answer to that question, and the answer will likely get increasingly complicated over the next few years as the number of options and tradeoffs continues to increase and evolve.

