What will it take to achieve mass customization at the edge, with high performance and low power?
Academia has been looking at specialization for many years, but those solutions were rejected because general-purpose processors were advancing fast enough to keep up with most application requirements. That is no longer the case. The introduction and support of the RISC-V processor architecture has attracted a lot of attention, but whether it is the right direction for the majority of modern computation remains an open question.
The architectural question of whether to specialize or generalize has had different answers over time, a pattern captured in the 1990s by what became known as Makimoto’s Wave. Tsugio Makimoto, then at Hitachi and later CTO of Sony, observed that the industry swings between standardization and customization roughly every ten years.
Fig. 1: Makimoto’s Wave. Source: Semiconductor Engineering
One of the hidden gems of DAC each year is the IEEE CEDA distinguished speaker luncheon. This year’s talk was entitled, “Computing Systems at the Edge: Specialize or Generalize,” given by Tulika Mitra, vice-provost (academic affairs) and provost’s chair professor of computer science at the National University of Singapore.
What follows is a condensed version of Mitra’s talk that concentrated on the question of specialization versus generalization for edge IoT devices.
“I am here to take you through the debate between specialization and generalization,” said Mitra. The first part of her talk provided a brief history of the semiconductor industry, tracing the early stages of Makimoto’s Wave starting from the LSI era. “They were boasting about integrating 22 components onto a single IC. The first integrated circuits were very specialized. They were application-specific.”
That changed when Intel introduced the 4004 processor in 1971. “By today’s standards, this was a very simple design with only 2,300 transistors. But it was the first general-purpose programmable microprocessor that used the von Neumann architecture and the stored-program concept. In that sense it was revolutionary, because it allows you to have a circuit that is re-used to execute different algorithms. With this, the industry moved into the domain of generalization, where we have a fixed instruction set architecture that determines the contract between the hardware designers and the software developers. It provides a very high degree of programmability, and you have support for multiple applications.”
The problem with these architectures is that they spend a lot of energy extracting parallelism rather than performing useful computation. “By 2005 we had an end to frequency scaling because of the breakdown of Dennard scaling. Single-core optimization was no longer possible, and the industry moved to multi-core and many-core architectures. The problem is that software design has not kept up with this kind of parallelism.”
Even these architectures have a limit. “Amdahl’s Law says that even if you can parallelize 99% of your application, that leads to a maximum speedup of 100, no matter how many processors you are using. Today, we are also hitting the end of the road with Moore’s Law, so we are forced back into the specialized domain.”
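The ceiling Mitra cites follows directly from Amdahl’s formula. If a fraction p of an application can be parallelized across n processors, the speedup is

S(n) = 1 / ((1 - p) + p/n)

which approaches 1/(1 - p) as n grows without bound. With p = 0.99, the remaining serial 1% caps the speedup at 1/0.01 = 100, no matter how many cores are thrown at the problem.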
Her talk looked at some of the accelerators being introduced, including the Google Tensor Processing Unit and the Darwin genomics processor. “If you look at the GPU, it is a massively parallel architecture. These things are actually designed for a particular domain. And because they have massive parallelism, you can get very high performance. And because you’re designing it for a particular domain, the hardware can be very simple and it can be low power. But at the same time it is not easily programmable.”
Mitra then introduced some of the specific requirements for edge computing, including the need to do more processing close to the data rather than transmitting it to the cloud. “This is both inefficient and dependent upon a reliable connection, which cannot always be guaranteed.”
A finger-movement recognition example was introduced, and a baseline was set by determining the power and performance of doing this on an Arm Cortex-M3 processor. A further data point was established using a mobile phone. “While the M3 provided a low-power solution, it was not fast enough, and the mobile phone used too much power for an edge application.”
A common solution deployed in the industry is to create custom accelerators that are tuned for particular tasks. “There are many accelerators, and there has been an explosion in terms of edge AI accelerators. They can get very high efficiency, attaining 3 to 4 TOPS per watt. There are other accelerators. For example, MIT has one for navigation. But how can you combine them?”
Again, she looked at existing commercial systems, pointing out the number of accelerators and the size of the processors. “If I am looking at tiny IoT devices for the edge, having several domain-specific accelerators plus multiple general-purpose cores is not tenable.”
An era of specialization
There are some attributes that make a task suitable for specialization. “The application has to have parallelism. You need computation that is performed at regular intervals. And the computation has to have memory locality. But there are drawbacks to specialization. Given the design cost, you had better have enough volume to amortize those costs. That’s not possible for all kinds of applications. And while specialization may help some applications, it can make other applications suffer.”
There is a tradeoff between specialization and generalization. “On the X axis (see figure 2), you are seeing the speed-up from specialization compared to a general-purpose processor. On the Y axis you have the production volume. It shows the number of chips you must produce in order to benefit from specialization. The light blue area is where most of us were targeting. You were getting a lot of performance improvement from the general-purpose processors. That’s why you did not see specialization for a while, because you needed to have high speed-up from the specialization or very high production volume. But now we are in the post-Moore era, where we are not getting that much benefit from the general-purpose microprocessor. There are more applications that are going to benefit from specialization. But even then, you have a portion that does not benefit from specialization at all.”
Fig. 2: The economics of specialization. Source: Tulika Mitra/DAC
One approach is to use incremental specialization. “You start with a general-purpose processor and add a little bit of a specific function through custom instructions. That takes a recurring operation and turns it into a custom instruction. This was a hot topic 20 years ago and people were looking at how you take an application and find all the specializations. Today, with RISC-V processors, we have custom instruction set extensions where you can add these instructions. You can create a library and you can have these micro-accelerators.”
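As a rough illustration of how such recurring operations might be found (a sketch of the general idea, not any specific tool from the talk), the following scans a toy dataflow trace for producer/consumer pairs that a single custom instruction could fuse:

```python
from collections import Counter

# A toy dataflow trace: (op, dest, src1, src2) tuples.
trace = [
    ("mul", "t0", "a", "b"), ("add", "t1", "t0", "c"),
    ("mul", "t2", "d", "e"), ("add", "t3", "t2", "f"),
    ("add", "t4", "t1", "t3"),
]

def count_fusable_pairs(trace):
    """Count adjacent producer/consumer op pairs a custom instruction could fuse."""
    pairs = Counter()
    for producer, consumer in zip(trace, trace[1:]):
        if producer[1] in consumer[2:]:      # consumer reads what producer just wrote
            pairs[(producer[0], consumer[0])] += 1
    return pairs

print(count_fusable_pairs(trace))
# Counter({('mul', 'add'): 2, ('add', 'add'): 1})
```

The dominant pair here, a multiply feeding an add, is the kind of multiply-accumulate pattern that a custom instruction set extension would typically implement as a single instruction.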
But can these processor customizations be done for multi-core and many-core architectures? “The idea was to make the processor as simple as possible, and then add a custom instruction to each of these processors. We also tried to fragment the custom instructions and spread them across multiple processors, which were then stitched together. We were successful in the sense that we could meet the real-time requirements of the kinds of applications that we were looking at. We could now do finger gesture recognition. But at 139mW, this is not exactly what we were hoping for. One of the reasons is that the processors still take away a lot of power.”
Fig. 3: Many-core processor customization. Source: Tulika Mitra/DAC
The universal accelerator
Just as with the von Neumann machine, Mitra was looking for an accelerator that could run any algorithm. If that were possible, the NRE cost could be amortized over multiple applications while still meeting the performance and power goals. “You have to keep the hardware as simple as possible, otherwise it will take a lot of power. So you push the complexity to the software. That means software needs to be aware of what is going on in the hardware. It’s not enough for you to just expose the interface of the instruction set architecture. You need to expose the underlying architecture to the software.”
One way to do that is to move from the von Neumann computing model, which is inherently sequential, to a dataflow computing model that offers asynchronous parallelism. “In this model you expose the entire computation in the form of a dataflow graph, and that exposes the dependencies between the operations and exposes the parallelism. You map this entire dataflow graph onto hardware, and let it execute. You don’t need to extract the parallelism.”
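A minimal sketch of what that exposure looks like (illustrative only, not Mitra’s framework): represent the computation as a dependence graph and compute each operation’s earliest firing level. Operations on the same level have no mutual dependencies, so they can execute in parallel with no hardware-side extraction:

```python
# A dataflow graph for y = (a*b + c*d) + (e*f + g*h): node -> its inputs.
deps = {
    "mul1": [], "mul2": [], "mul3": [], "mul4": [],
    "add1": ["mul1", "mul2"],
    "add2": ["mul3", "mul4"],
    "add3": ["add1", "add2"],
}

def asap_levels(deps):
    """Earliest level at which each op can fire, given its data dependencies."""
    level = {}
    def visit(node):
        if node not in level:
            level[node] = 1 + max((visit(d) for d in deps[node]), default=-1)
        return level[node]
    for node in deps:
        visit(node)
    return level

print(asap_levels(deps))
# All four multiplies land on level 0 and fire together; the adds follow.
```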
The first dataflow processor was described in 1975. It was not widely adopted because general-purpose processors were doing well. There also were problems with the software stack. “Today, AI accelerators follow this data flow computing model. However, the hardware is instantiated for a specific data flow. We wanted to see if different data flows could be supported on the same hardware, and that’s the idea for software-defined hardware. You want the software to instantiate different data flows corresponding to different applications on the same hardware. Then you get a universal accelerator.”
One way to do this is to use coarse-grained reconfigurable arrays (CGRAs). Examples of this first surfaced around 2000. Today, a number of companies have a CGRA in their portfolio, including Samsung, Intel and Renesas.
A CGRA is a 2D array of processing elements, where each processing element has a very simple ALU and connects north, south, east, and west to its four neighbors. “What is interesting about this architecture is that you can change the operation that a particular PE is executing every cycle. And you can reconfigure the network to do different routings. You have powerful cycle-by-cycle reconfigurability. The software sends this configuration data, which is stored in the memory. But because the configuration memory is limited, you can only map things where you have recurrent computation.”
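To make that concrete, here is a toy model of such per-cycle configuration contexts (a sketch under assumed field widths, not HyCUBE’s actual encoding). Each cycle’s context gives every PE an opcode and tells it which neighbor each input comes from:

```python
# Per-cycle contexts for a 2x2 CGRA: (row, col) -> (opcode, input A from, input B from).
contexts = [
    {(0, 0): ("mul", "W", "N"), (0, 1): ("mul", "E", "N"),      # cycle 0
     (1, 0): ("nop", None, None), (1, 1): ("nop", None, None)},
    {(0, 0): ("route", "E", None), (0, 1): ("nop", None, None),  # cycle 1: same PEs,
     (1, 0): ("add", "N", "E"), (1, 1): ("route", "N", None)},   # different ops/routes
]

def config_bits(contexts, opcode_bits=5, route_bits=3):
    """Rough configuration-memory footprint of a schedule (assumed field widths)."""
    per_pe = opcode_bits + 2 * route_bits      # one opcode field, two input selects
    return len(contexts) * len(contexts[0]) * per_pe

print(config_bits(contexts), "bits of configuration memory")   # 88
```

Because the contexts repeat on every loop iteration, only this small memory is needed, which is why recurrent computation maps well while straight-line code does not.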
Fig. 4: CGRA – Efficiency with programmability. Source: Tulika Mitra/DAC
“The TPU is specialized for dataflows with matrix multiplications. But with a CGRA, you can give it any dataflow and the software will instantiate the same thing on that architecture. The advantage of doing that is it can do speech recognition, it can do FFT, it can do matrix multiplication, it can do convolution. All these dataflows can be mapped onto the CGRA, and we get very good efficiency and performance.”
The challenge is to do dataflow synthesis onto the CGRA. “This is very similar to place-and-route. We want to take the dataflow graph and map it onto the PE array. So you do placement, then you route the dependencies, which are the connections from one PE to another. What is interesting here is that this is not just a spatial mapping, but a spatio-temporal mapping. With spatio-temporal mapping, one PE is executing this operation in one cycle, and it is executing a different operation in the next cycle. Similarly, all the routings are different. This gives it a lot of power, and it is not limited to one iteration at a time. You can actually juxtapose multiple iterations in parallel using software pipelining techniques.”
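The standard software-pipelining bound gives a feel for what the mapper is optimizing (a generic modulo-scheduling sketch, not Morpher’s actual cost model). The initiation interval (II), the number of cycles between successive loop iterations entering the array, is bounded both by PE resources and by cross-iteration recurrences:

```python
import math

def res_mii(num_ops, num_pes):
    """Resource bound: one iteration's ops must fit on the PEs within II cycles."""
    return math.ceil(num_ops / num_pes)

def rec_mii(recurrences):
    """Recurrence bound from (latency, iteration distance) of each dependence cycle."""
    return max(math.ceil(lat / dist) for lat, dist in recurrences)

# A 7-op loop body on a 4x4 CGRA, with one cross-iteration dependence
# of latency 2 carried over a single iteration.
ii = max(res_mii(7, 16), rec_mii([(2, 1)]))
print("minimum initiation interval:", ii)   # 2 -> a new iteration starts every 2 cycles
```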
Mitra discussed the similarities and differences between the CGRA and FPGA, pointing to two major differences. The first is the level of abstraction: an FPGA is configured at the bit level, whereas a CGRA is configured at the level of word-wide operations. The second is the cycle-by-cycle temporal reconfigurability of the CGRA.
The National University of Singapore produced its first CGRA, called HyCUBE, in 2019. It clocked impressive performance characteristics compared with processors, FPGAs, and the Samsung Reconfigurable Processor. The university has since released a much larger CGRA that adds a RISC-V-based controller and some other architectural advances. “It has the highest efficiency among CGRAs today, 582 GOPS per watt at 0.45V, and we are using a 40nm ultra-low-power process. Most of our competitors are using a 22 nanometer process, and we estimate performance would be 1 TOPS/W in that technology.”
The complexity is in the software. “CGRAs were not successful in the past because we did not have the software toolchain. Spatial accelerators use place-and-route heuristics. These heuristics are dependent on the accelerator you are using, and so you need to hand-craft the compiler. That takes a lot of time, and place-and-route makes very local decisions, which might not be optimal for the whole system. So we wondered if machine learning techniques could automate compiler generation for a new CGRA.”
This led to the creation of LISA, an automated compiler generator. “We created a lot of synthetic dataflow graphs (DFGs) and used them to create a training set of data. For a new CGRA, we build a graph neural network model that tries to estimate certain attributes from this graph, such as the scheduling order and the spatial and temporal distance between two DFG nodes. Then, when you have a new dataflow graph, you generate labels that inform our simulated annealing-based mapping approach. That gives us a very portable compiler with very high-quality mapping, in record time, on a diverse set of accelerators.”
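The overall shape of such a mapper might look like the following (a generic simulated-annealing sketch; LISA’s actual objective is driven by the GNN-predicted labels described above, abstracted here as a `cost` callback):

```python
import math
import random

def anneal(nodes, slots, cost, steps=5000, t0=2.0):
    """Map DFG nodes onto PE slots by simulated annealing, minimizing `cost`."""
    place = {n: random.choice(slots) for n in nodes}   # random initial mapping
    cur = cost(place)
    for step in range(steps):
        t = t0 * (1.0 - step / steps) + 1e-9           # linear cooling schedule
        n = random.choice(nodes)
        old = place[n]
        place[n] = random.choice(slots)                # propose moving one node
        new = cost(place)
        if new <= cur or random.random() < math.exp((cur - new) / t):
            cur = new                                  # accept (sometimes uphill)
        else:
            place[n] = old                             # reject and undo
    return place, cur

# Toy use: spread 4 DFG nodes over a 2x2 PE array; the cost penalizes slot
# collisions, standing in for the learned placement labels.
nodes = ["a", "b", "c", "d"]
slots = [(x, y) for x in range(2) for y in range(2)]
collisions = lambda p: len(nodes) - len(set(p.values()))
print(anneal(nodes, slots, collisions))
```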
Initial results had long compile times, so several techniques were used to speed things up, including abstraction, common reuse patterns, and clustering. The whole thing has been packaged up as an open-source CGRA toolchain, shown in figure 5.
Fig. 5: NUS Morpher open-source CGRA toolchain. Source: Tulika Mitra/DAC
“We have this abstract architecture model, so you can model any architecture and you can have any application. It will map that application onto your architecture. It will also allow you to do an FPGA emulation. And more importantly, we actually have a simulation and verification flow, which is missing from state-of-the-art CGRA tools.”
Challenges remain, however. “A significant portion of embedded applications include control divergence inside the loop body, and CGRAs are not good at that. Plus, CGRAs are highly parallel architectures, and you need to bring a lot of data into the system. We have made some progress, but one can do a whole lot more. We are trying to investigate that in the context of a large project, where we are generating next-generation software spanning all the way from application to circuit.”
Related Reading
Programming Processors In Heterogeneous Architectures
Optimizing PPA for different processor types requires very different approaches, and all of them are now included in the same design.
Processor Tradeoffs For AI Workloads
Gaps are widening between technology advances and demands, and closing them is becoming more difficult.