Integration has been one of the natural paths for the electronics industry, but in some cases it may no longer be the right one.
In the Electronics Butterfly Effect story, the observation was made that the electronics industry has gone non-linear, no longer supported by the incremental density and cost improvements that Moore’s Law promised with each new node. Those incremental changes, sustained over several decades, meant that design and architecture followed a predictable path with very few new ideas coming in along the way. Even though significant perturbations are looming on the horizon, inertia has built up within the system and new architectures are likely to be resisted.
But with the economics of following Moore’s Law looking risky for all but a few applications, design and semiconductor houses need new ways to differentiate their products. They can no longer count on more transistors, higher performance or lower cost just from moving to the next node. The industry has to take architecture a lot more seriously and look for new ways to tackle existing problems.
That article concentrated on some of the impacts that new memories and 3D stacking could have. It postulated that they could lead to a breaking down of the memory hierarchy, and that simplification would result. But that is only one possibility, and several people within the industry see the opposite happening.
Pulling more memory into the package has been the path the industry has followed since the invention of the semiconductor integrated circuit back in 1959. The primary reasons are cost, because integrating more into the package reduces system integration costs, and performance. Running interconnect off-chip significantly adds to the parasitics, which slows communications and consumes power. “Most of the power consumption today is in the interconnect, but if you make them in close proximity, the power will go down,” says the president and CEO of MonolithIC 3D. “The human mind does not need a lot of power because it is 3D and memory and logic are tied tightly.”
Drew Wingard, chief technology officer at Sonics, is one of the skeptics when it comes to new memories causing significant architectural change. “The fundamental issue that has plagued computer systems for the last 30 years is that you can never get enough memory close enough to the processor to keep up with the operations inside the processor. The communications delay to get to the memory cells never scales in the way you want it, and so you may be able to reduce it somewhat but memory hierarchy does not go away.”
The desire to bring memory into the package stems from the 1000:1 disparity between memory access speeds and processor speeds. This has been complicated by the migration from single processor to multi-processor, a trend that started when it was no longer possible to keep producing processors with higher single-threaded performance, even with the advances of Moore’s Law.
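To see why that gap dominates, consider a crude single-level cache model. The latencies below are assumptions chosen only for illustration, not figures from any particular device, but they show how quickly average access time is dragged toward DRAM speed as the hit rate falls.

```python
# Illustrative back-of-the-envelope model of the processor/memory gap.
# The cycle counts below are assumptions chosen for illustration, not
# measurements from any particular device.

CPU_CYCLE_NS = 0.3          # ~3 GHz core
L1_HIT_NS = 1.2             # a few cycles for an on-chip cache hit
DRAM_ACCESS_NS = 100.0      # off-package DRAM access

def average_access_ns(hit_rate: float) -> float:
    """Average memory access time for a single-level cache model."""
    return hit_rate * L1_HIT_NS + (1.0 - hit_rate) * DRAM_ACCESS_NS

for hit_rate in (0.90, 0.99, 0.999):
    stall_cycles = average_access_ns(hit_rate) / CPU_CYCLE_NS
    print(f"hit rate {hit_rate:.1%}: ~{stall_cycles:.0f} CPU cycles per access")
```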
Because of the desire to make it as easy as possible to migrate existing software, a symmetric multi-processing architecture was adopted, and this required complex cache coherency mechanisms to be created. “We long ago went past the point where hardware was dominant in an SoC, yet companies still spend millions to license a core and then optimize the software for that platform,” says Neil Hand, vice president of marketing and business development at Codasip. “All of the effort in the industry is to make this process more efficient, but it is still an incorrect premise. What is needed is a totally different approach, and tools where hardware is created to support the needs of software.”
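For readers unfamiliar with what that coherency machinery involves, the following toy sketch of a heavily simplified MSI protocol for a single cache line hints at the bookkeeping every shared line requires. Real coherence fabrics are far more elaborate; this is only meant to show why symmetric multiprocessing needs protocol machinery at all.

```python
# A toy MSI (Modified/Shared/Invalid) coherence sketch for one cache line.
# (current_state, observed_event) -> next_state
MSI = {
    ("I", "local_read"):   "S",
    ("I", "local_write"):  "M",
    ("S", "local_write"):  "M",
    ("S", "remote_write"): "I",
    ("M", "remote_read"):  "S",   # write back, then share
    ("M", "remote_write"): "I",   # write back, then invalidate
}

def next_state(state: str, event: str) -> str:
    """Apply one observed event; unlisted combinations leave the state alone."""
    return MSI.get((state, event), state)

state = "I"
for event in ("local_read", "remote_write", "local_write", "remote_read"):
    state = next_state(state, event)
    print(f"after {event:13s} -> {state}")
```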
What if different architectural decisions were made? What if, instead of clustering processing power and integrating more into a single die, it was distributed? What if compute power, rather than memory, was made into a hierarchy? “I don’t think we have found, except for memory, effective ways of throwing transistors at problems to just build gigantic blocks,” says Wingard. “We tend to find the best energy result when we partition into smaller chunks, which can execute much more quickly with more local processing and memory and tighter coupling. This leads to the problem where I have a lot of little things, and the challenge is that, as human beings, we are not good at reasoning about levels of complexity.”
Some aspects of this distributed architecture have always been in place, using dedicated accelerators for specific tasks where a different architecture can perform the task faster or more efficiently than a general-purpose processor. Examples include GPUs, DSPs and, more recently, neural networks for vision processing. This amounts to an acceptance that the general-purpose processor is the least efficient way of performing almost any task. It just happens to be the most flexible, which allows decisions about functionality to be delayed until later, or even made in the field.
We have also seen processor specialization emerge more recently as a power-reduction technique. Two processors of differing capability are closely coupled so that the right amount of compute power can be assigned to each job. An example of this is the ARM big.LITTLE architecture.
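The idea can be sketched as a simple placement policy. The core parameters and threshold rule below are assumptions for illustration only; production schedulers use utilization tracking and detailed energy models.

```python
# A minimal sketch of heterogeneous (big.LITTLE-style) task placement.
from dataclasses import dataclass

@dataclass
class Core:
    name: str
    perf: float     # relative throughput (assumed)
    power_w: float  # active power (assumed)

LITTLE = Core("little", perf=1.0, power_w=0.15)
BIG    = Core("big",    perf=3.0, power_w=1.00)

def place(task_load: float, deadline_budget: float) -> Core:
    """Run on the little core unless it cannot meet the deadline."""
    little_time = task_load / LITTLE.perf
    return LITTLE if little_time <= deadline_budget else BIG

for load, budget in ((1.0, 2.0), (5.0, 2.0)):
    core = place(load, budget)
    print(f"load={load}, budget={budget}s -> {core.name} core")
```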
But that is just one pre-packaged solution. “We already recognize a huge energy efficiency and latency range from small, specialized embedded hard-real-time processing subsystems compared to general-purpose CPU clusters running big operating systems,” points out Rowen, a fellow at Cadence. “This is approaching a 100x difference for the same compute. By building a hierarchy of processing subsystems, each layer providing increasing capacity, generality and intelligence, we can create systems with energy efficiency driven by the lowest ‘always alert’ layer, and with generality driven by the highest ‘scalable application processing’ layer.”
Rowen believes we may see one layer for each order-of-magnitude improvement, from lowest-level sensor processing at the micro-watt level up to cloud processing at the watt level, with about six distinct layers of processors and subsystems.
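One way to picture such a hierarchy is as a ladder of layers roughly an order of magnitude apart in power. The layer names and power figures below are purely illustrative assumptions, not a taxonomy from Rowen or Cadence.

```python
# Illustrative ladder of processing layers, each roughly 10x the previous
# in power budget, from an always-alert sensor node up toward the cloud.
layers = [
    ("always-alert sensor node",       1e-6),   # ~1 uW
    ("wake-up / keyword detection",    1e-5),
    ("local DSP / feature extraction", 1e-4),
    ("embedded real-time subsystem",   1e-3),
    ("application processor",          1e-2),
    ("scalable application / cloud",   1e-1),   # approaching the watt level
]

for i, (name, watts) in enumerate(layers, start=1):
    print(f"layer {i}: {name:32s} ~{watts:.0e} W")
```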
When we add non-algorithmic platforms and programming into the mix, the reasons for segregation become even more pronounced. “The proliferation of exciting new brain-inspired pattern recognition methods like Convolutional Neural Networks (CNNs) is a harbinger for a whole category of platform architectures and programming models that look nothing like today’s sequential processors and programming languages,” says Rowen. “As applied in vision and speech CNNs, the system developer does not supply an algorithmic description but simply a set of input data and associated expected output, and the tools perform the training of the network to implement a recognition subsystem. These techniques are surprisingly versatile, giving system architects a whole new category of building blocks for implementing fast, efficient, scalable and powerful new cognitive capabilities.”
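The contrast with conventional programming can be shown with even the smallest trainable model. The sketch below is not a CNN; it is a minimal NumPy logistic-regression example, but it captures the workflow Rowen describes: the developer supplies inputs and expected outputs, and a training loop, rather than a hand-written algorithm, produces the recognizer.

```python
# Training from (input, expected output) pairs instead of writing an algorithm.
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: classify points by whether x0 + x1 > 1 (the "expected output").
X = rng.uniform(0.0, 1.0, size=(200, 2))
y = (X.sum(axis=1) > 1.0).astype(float)

w = np.zeros(2)   # model parameters learned from data
b = 0.0
lr = 0.5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(500):
    p = sigmoid(X @ w + b)           # forward pass
    grad_w = X.T @ (p - y) / len(y)  # gradient of the cross-entropy loss
    grad_b = np.mean(p - y)
    w -= lr * grad_w                 # gradient-descent update
    b -= lr * grad_b

accuracy = np.mean((sigmoid(X @ w + b) > 0.5) == (y > 0.5))
print(f"trained accuracy: {accuracy:.2%}")
```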
“There are good reasons why the processing elements will not all be the same, because the types of processing are different,” says Wingard. “I/O processing is different from video processing and is different from speech processing or image recognition. We will have hierarchies. Hierarchy is the way we manage complexity.”
Kevin Cameron, principal at Cameron EDA, talked about a new approach that was highlighted at the 2015 Flash Memory Summit. “Rather than bringing memory contents to the processor, Vladimir Alves, CTO of NxGn Data, suggests that you take the processor to the memory and embed it in-situ. This distributes the computation task and can result in some significant gains for cloud-based applications such as Hadoop and MapReduce.”
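A simple way to see the appeal is to count the bytes that cross the storage interface. The sketch below compares shipping every record to the host against filtering next to each data partition; the record size, partition layout and predicate are assumptions used only for illustration.

```python
# Near-data processing sketch: run the filter (map step) at each partition
# and only ship the reduced results to the host.
RECORD_BYTES = 256  # assumed record size

def host_side(partitions, predicate):
    """Conventional approach: ship every record to the host, then filter."""
    moved = sum(len(p) for p in partitions) * RECORD_BYTES
    matches = [r for p in partitions for r in p if predicate(r)]
    return matches, moved

def in_situ(partitions, predicate):
    """Near-data approach: each partition filters locally, ships only hits."""
    matches, moved = [], 0
    for p in partitions:
        hits = [r for r in p if predicate(r)]   # runs "at the memory"
        moved += len(hits) * RECORD_BYTES
        matches.extend(hits)
    return matches, moved

partitions = [list(range(i, i + 10_000)) for i in range(0, 40_000, 10_000)]
wanted = lambda r: r % 1000 == 0

_, host_bytes = host_side(partitions, wanted)
_, local_bytes = in_situ(partitions, wanted)
print(f"bytes moved: host-side filter {host_bytes:,}, in-situ filter {local_bytes:,}")
```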
Additional analysis can also be performed on the application to find ways to distribute the task. “Can I avoid performance issues by going to a processor and memory architecture where you allocate stuff into memory in a way that it has locality and for that chunk of memory you will have a processor associated with it?” asks Cameron. “If you now have an event that needs to get propagated, you send it to that memory. Now you have to think about how to abstract the communications and you’re probably doing that with a hardware communication channel rather than with shared memory.”
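Cameron's suggestion can be sketched as ownership plus message passing: each chunk of memory has one associated processor, and events are routed to it over an explicit channel rather than through shared memory. The hash-based ownership rule and the names below are illustrative assumptions.

```python
# Per-chunk ownership with explicit event channels instead of shared memory.
from queue import Queue

NUM_OWNERS = 4
channels = [Queue() for _ in range(NUM_OWNERS)]      # one channel per owner
local_memory = [dict() for _ in range(NUM_OWNERS)]   # each owner's memory chunk

def owner_of(key: str) -> int:
    """Locality rule: a key always lives with (and is processed by) one owner."""
    return hash(key) % NUM_OWNERS

def send_event(key: str, value: int) -> None:
    channels[owner_of(key)].put((key, value))        # route the event to its owner

def drain(owner: int) -> None:
    """The per-chunk processor: applies events to its own memory only."""
    while not channels[owner].empty():
        key, value = channels[owner].get()
        local_memory[owner][key] = local_memory[owner].get(key, 0) + value

for key, value in [("alpha", 1), ("beta", 2), ("alpha", 3)]:
    send_event(key, value)

for owner in range(NUM_OWNERS):
    drain(owner)
    if local_memory[owner]:
        print(f"owner {owner}: {local_memory[owner]}")
```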
Many of these concepts can be taken even further, and emerging applications may force the issue. A hierarchy of processing is required rather than a centralized architecture, with each level balancing the cost of compute against the cost of communications.
Unfortunately, when looking at architectural optimization tools, there are few on offer. Virtual prototyping does enable a model of the system to be constructed, but the industry lacks system specification languages, system-level verification tools, tools that allow multiple architectures to be compared, analysis tools that utilize statistical techniques, and many other capabilities that would target high-level decision making.
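In the absence of such tools, architects often fall back on spreadsheet-style analytic models. The sketch below compares a centralized and a distributed organization using assumed cost coefficients; a real flow would calibrate them against a virtual prototype or silicon measurements.

```python
# A crude analytic comparison of two architectures. All coefficients are
# assumptions for illustration, not measured values.

def centralized_cost(tasks, bytes_per_task,
                     compute_nj=1.0, transfer_nj_per_byte=0.5):
    """Every task's data is shipped to one large central processor."""
    return tasks * (compute_nj + bytes_per_task * transfer_nj_per_byte)

def distributed_cost(tasks, bytes_per_task, local_fraction=0.9,
                     compute_nj=1.5, transfer_nj_per_byte=0.5):
    """Less efficient local compute, but most data never leaves the node."""
    remote_bytes = bytes_per_task * (1.0 - local_fraction)
    return tasks * (compute_nj + remote_bytes * transfer_nj_per_byte)

for bytes_per_task in (8, 64, 1024):
    c = centralized_cost(1_000_000, bytes_per_task)
    d = distributed_cost(1_000_000, bytes_per_task)
    better = "distributed" if d < c else "centralized"
    print(f"{bytes_per_task:5d} B/task: centralized {c:.2e} nJ, "
          f"distributed {d:.2e} nJ -> {better}")
```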
The problem has always been that there are too few system architects to make the creation of tools worthwhile, and virtual prototypes have only made the cut because they can support early software development and debug. This means that architects have to work in the dark and draw upon past experiences, but that may just lead to more inertia.