
Making Sense Of New Edge-Inference Architectures

How to navigate a flood of confusing choices and terminology.


New edge-inference machine-learning architectures have been arriving at an astounding rate over the last year. Making sense of them all is a challenge.

To begin with, not all ML architectures are alike. One of the complicating factors in understanding the different machine-learning architectures is the nomenclature used to describe them. You’ll see terms like “sea-of-MACs,” “systolic array,” “dataflow architecture,” “graph processor,” and “streaming graph processor.” Unless you’re a computing expert, it can be hard to tell what the implications of these distinctions are — or whether they’re even important to understand.

Sea-of-MACs architectures, such as those featuring GPUs, provide a massive multi-core solution to the one-core problem. But in the end, the processors are simply being fed code that represents the network. Simpler accelerators like the micro-NPUs used for the NXP i.MX 9 family are MAC accelerators for modest work. “We see the need for different levels of capability, not just a ‘big-TOP’ accelerator, but cost-efficient products with a smaller NPU,” said Amanda McGregor, head of product innovation for applications processors for edge processing at NXP.

Such architectures may include dedicated hardware to promote inference efficiency, however. “It has hardware acceleration for compression, pruning, zero weight removals, and things like that,” said Gowri Chindalore, head of strategy for edge processing at NXP.

Graph processors, meanwhile, operate at least notionally at the graph level rather than the code level. In a static implementation, different portions of an array of processors are assigned to different layers of the graph. “The compiler is essentially trying to trade off how many cores I’ve got, but also how many layers I can fit together [into a tile group] without overflowing,” said David Hough, distinguished systems architect at Imagination Technologies.
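The tradeoff Hough describes can be illustrated with a minimal Python sketch that packs consecutive layers into tile groups under a fixed on-chip memory budget. This is a toy illustration, not Imagination's compiler: the `group_layers` function, the per-layer footprints, and the 700 KB budget are all hypothetical.

```python
from typing import List


def group_layers(footprint_kb: List[int], budget_kb: int) -> List[List[int]]:
    """Greedily pack consecutive layers into tile groups.

    footprint_kb[i] is a hypothetical on-chip memory footprint for layer i.
    A group is closed as soon as adding the next layer would overflow budget_kb.
    """
    groups: List[List[int]] = []
    current: List[int] = []
    used = 0
    for layer, size in enumerate(footprint_kb):
        if size > budget_kb:
            raise ValueError(f"layer {layer} alone exceeds the budget")
        if current and used + size > budget_kb:
            groups.append(current)
            current, used = [], 0
        current.append(layer)
        used += size
    if current:
        groups.append(current)
    return groups


footprints = [300, 200, 400, 100, 600, 50]  # hypothetical values, in KB
print(group_layers(footprints, budget_kb=700))
```

A real compiler would also weigh how many cores are available for each group, which this sketch ignores.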

Dynamic versions may be called streaming graph processors because the network itself streams through the architecture, affecting decisions as to what will happen when and where. For these architectures, the compiler assembles metadata about the network, and that metadata informs the dispatch decisions made in real-time. “We’re preserving the graph throughout our entire software tool chain all the way into hardware, and the hardware understands the dependencies and manages them dynamically,” said Rajesh Anantharaman, senior director of products at Blaize.

This keeps things working at a higher level – one that Blaize calls task parallelism. “The instruction pointer is not the mechanism for describing the next task to be completed,” said Val Cook, chief software architect at Blaize. “It’s actually a root node of a DAG [directed acyclic graph].”
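One way to picture dispatch driven by graph dependencies rather than by an instruction pointer is the short Python sketch below. It uses the standard library's `graphlib.TopologicalSorter` (Python 3.9 or later) to release tasks as soon as their inputs are complete. The node names and the `dispatch_waves` helper are hypothetical, and this is not a description of how Blaize's hardware actually schedules work.

```python
from graphlib import TopologicalSorter


def dispatch_waves(deps):
    """Return tasks grouped into waves that could run in parallel.

    deps maps each task to the set of tasks it depends on.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        # Everything returned here has all of its dependencies satisfied.
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical network graph: two branches that rejoin before classification.
graph = {
    "conv1": set(),
    "conv2a": {"conv1"},
    "conv2b": {"conv1"},
    "concat": {"conv2a", "conv2b"},
    "classify": {"concat"},
}
for wave in dispatch_waves(graph):
    print(wave)
```

The two branch nodes land in the same wave, which is the kind of task-level parallelism the graph makes visible.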

Atlazo, in the meantime, also claims a streaming graph architecture, but one focused on voice for use in extremely small, power-stingy applications like true-wireless devices. “Voice is the UI for triggering voice commands, speaker recognition, or even scene awareness,” said Karim Arabi, CEO of Atlazo.

Proponents of dynamic architectures point to complex real-world situations as motivators. Rehan Hameed, CTO and co-founder of Deep Vision, described an in-store vision application. “The processor in the camera would be detecting people and then tracking them to the store,” he said. “If it finds the person, then it will be doing pose estimation to figure out if the person is actually picking something up from a shelf. And if it detects that, then it has another model for product identification. It’s all dynamically being decided.”

Whether static or dynamic, scheduling or placement is done in a manner that makes the output of one kernel immediately available to the next kernel, without the need to move that output from one place in memory to another.

This characteristic places some limits on how much pre-compilation one can do. It is certainly possible to optimize a network, pruning connections and fusing layers or nodes as necessary. But those efforts still leave a graph in place for the processor. The notion of “graph lowering,” where the graph is further compiled, would destroy the graph character, and so would be of limited utility for a streaming graph processor.

More architectural styles
There are still other architectural variants. Syntiant, a company that has been in the edge business longer than many of the new players, relies on a completely different approach to the whole notion of taking a TensorFlow network and adapting it for the hardware platform. Instead of running a translator like TensorFlow Lite or compiling down to native code, it designed a processor that uses TensorFlow directly as its instruction set.

While Syntiant hasn’t invested in model-adaptation tools, it has focused more closely on the training process, since that’s where all of the application-specific tuning is done. “We’ve developed this full training pipeline that we can go from dirty data, clean the data, structure it, go through a training pipeline, and offer a turnkey solution,” said Kurt Busch, CEO of Syntiant.

David Garrett, vice president of hardware at Syntiant, explained the benefits of this approach. “TensorFlow releases a binary that executes on our device, and it already knows speed/area/power,” he said. “So as part of the training, you can train against a hyper-parameter search of energy efficiency and inference rate.”

Fig. 1: A block diagram for devices using Syntiant’s second-generation core. The processor executes TensorFlow instructions natively. Source: Syntiant

GrAI Matter, meanwhile, is treading a line between neuromorphic SNNs and the more common artificial neural network (ANN). “We’ve taken a hybrid approach. We do event-based processing,” said Mahesh Makhijani, vice president of business development at GrAI Matter. While SNNs work well for such an approach, they’re also harder to train – and suitable data sets for training are hard to come by.

GrAI is using a more traditional architecture, but rather than fully processing each video frame, it processes only those pixels that have changed from the prior frame. So the first frame is processed as a whole, while subsequent frames require less processing. “We’ve inherited the event-based architecture from spiking networks, but we have parallel processing like Hailo and Mythic and others,” continued Makhijani. “We don’t do in-memory compute as analog, but we do digital near-memory compute.”

The company doesn’t allocate one processor per node, so frames that need more processing may have higher latency if there is a need to time-share processors. For sparsely changing frames, latency and power are reduced.
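The effect of processing only changed pixels can be sketched in a few lines of Python. The example below simply counts how many pixels would need work in each frame, under the assumption that a pixel counts as changed when its value differs from the prior frame by more than a threshold. The function names, the threshold, and the tiny frames are hypothetical, and GrAI Matter's actual mechanism is not described at this level of detail.

```python
def changed_pixels(prev, curr, threshold=0):
    """Return (row, col) positions whose value moved by more than threshold."""
    return [
        (r, c)
        for r, row in enumerate(curr)
        for c, value in enumerate(row)
        if abs(value - prev[r][c]) > threshold
    ]


def work_per_frame(frames, threshold=0):
    """Count pixels needing processing: all of frame 0, then only the changes."""
    work = []
    prev = None
    for frame in frames:
        if prev is None:
            work.append(sum(len(row) for row in frame))
        else:
            work.append(len(changed_pixels(prev, frame, threshold)))
        prev = frame
    return work


frames = [
    [[0, 0, 0], [0, 0, 0]],
    [[0, 9, 0], [0, 0, 0]],  # one pixel changes
    [[0, 9, 0], [0, 0, 7]],  # one more pixel changes
]
print(work_per_frame(frames))
```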

Fig. 2: In the top image, every node executes for every frame. In the bottom image, only those pixels that have changed are processed, keeping many nodes inactive for frames with few changes. Source: GrAI Matter

Finally, there are whispers of photonics-based machine-learning approaches. (This will be the subject of future coverage.)

FPGA-based processors
There is, at present, one unique dynamic AI processor, and it leverages an FPGA fabric. Flex Logix can implement network nodes in hardware, but that hardware can be reconfigured in real time. “We program the connections like an FPGA, and then we reconfigure them for every layer of the model,” said Geoff Tate, CEO of Flex Logix.

Fig. 3: High-level view of the Flex Logix architecture. Source: Flex Logix

This is unlike other dynamic processors, which allocate cores for executing software. Flex Logix instead allocates hardware, precompiling the design for the FPGA fabric. Layers can be fused, for example, to reduce the number of iterations it takes to get through a network. “When we see layers with large intermediate activations, we see if we can combine them into a single fused combined layer so the outputs of one layer go directly into the inputs of the next layer,” said Tate. “This makes a 50% to 75% improvement in performance compared to doing only one layer at a time for YOLO v3.”

Fig. 4: Two layers are shown fused together, with intervening activation logic. Source: Flex Logix

The key lies in being able to reconfigure quickly. Flex Logix claims to be able to do so in 4 µs, with a configuration buffer and a weight buffer that allow it to hide loading the new pattern and weights behind the current execution. While one set of nodes is being processed, the next one is being readied, making for a quick switch-over.

“When we finish processing the layer, we reconfigure in microseconds,” said Tate. “These layers take milliseconds to run, because YOLO v3, for example, is 300 billion MACs for a single megapixel image. And then we change the configuration into a different tensor operation to execute layers required for the next step. We bring the configuration and the weights in from DRAM while we’re executing the previous layer.”

Fig. 5: While Layer 0 is processed, Layer 1 can be loaded. Reconfiguration takes about 4 µs. Source: Flex Logix
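A rough timeline model shows why hiding the load behind execution matters. The Python sketch below compares loading, reconfiguring, and executing each layer in sequence against an overlapped scheme in which the next layer's configuration and weights are fetched while the current layer runs. Only the 4 µs reconfiguration figure comes from Flex Logix. The per-layer execution and load times are hypothetical, and the model assumes one layer can be preloaded at a time.

```python
RECONFIG_US = 4  # reconfiguration time cited by Flex Logix


def serial_time_us(exec_us, load_us):
    """Load, reconfigure, then execute each layer with no overlap."""
    return sum(load + RECONFIG_US + run for run, load in zip(exec_us, load_us))


def overlapped_time_us(exec_us, load_us):
    """Load layer i+1 while layer i executes; only the first load is exposed."""
    total = load_us[0]
    for i, run in enumerate(exec_us):
        next_load = load_us[i + 1] if i + 1 < len(load_us) else 0
        # The layer occupies the fabric until both its own execution and
        # the background load of the next layer have finished.
        total += RECONFIG_US + max(run, next_load)
    return total


exec_us = [2000, 3000, 1500]  # hypothetical per-layer execution times
load_us = [500, 800, 400]     # hypothetical per-layer load times
print(serial_time_us(exec_us, load_us), overlapped_time_us(exec_us, load_us))
```

As long as each load is shorter than the execution it hides behind, the only exposed costs are the first load and the microsecond-scale switches.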

Pure FPGAs take this general notion further, with less dedicated hardware in the chip. “Neuromorphic guys are using FPGAs today,” noted Nick Ni, director of product marketing, AI and software at Xilinx. The downside is that the implementation is less area- or cost-efficient, because all of the logic is programmable. The upside is that you can react to the latest models.

“With many AI-focused SoCs introduced in this space, the architecture is fixed and not adaptable at the fundamental hardware level,” said Ni. “This works great for the AI models for which they were designed a few years back, but they often run very poorly on modern networks. FPGAs can be reprogrammed with the most optimal domain-specific architecture without creating a new chip.”

Whole network vs. partial network
While dynamic architectures may handle a piece of the network at a time, static ones often attempt to house an entire model in a single chip. The clear benefit is there is no longer a need to move configurations and weights around in real time. The model is loaded, and then data flows through it from beginning to end.

There are several flavors of this arrangement as well. The simplest is the systolic array, so called because the data enters at one point and flows through the architecture much the way blood flows through our veins. Depending on how this is set up, one could initiate one set of data and, after it starts flowing, initiate a new set of data right behind it. This is possible only because the hardware isn’t re-used for different nodes. Each execution unit becomes dedicated to a specific piece of the model.
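The benefit of sending a new set of data right behind the previous one can be put into numbers with a small Python sketch. It assumes, purely for illustration, that every stage takes one cycle and that re-used hardware must finish one input completely before starting the next. Both function names are hypothetical.

```python
def reused_hardware_cycles(n_inputs, n_stages):
    """Each input passes through all stages before the next one may start."""
    return n_inputs * n_stages


def pipelined_cycles(n_inputs, n_stages):
    """Dedicated stages: a new input enters every cycle once the first is in flight."""
    if n_inputs == 0:
        return 0
    # The first input needs n_stages cycles; each later one finishes a cycle apart.
    return n_stages + (n_inputs - 1)


print(reused_hardware_cycles(10, 5), pipelined_cycles(10, 5))
```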

But this approach realistically works only if all of the weights can be stored near their execution units. That creates the need for an enormous amount of memory, given that some models can require billions of weights.

A given system is also likely to have multiple applications running at the same time. In a static architecture, given sufficient resources, those applications each can get their own hardware resources. In a dynamic architecture, one can choose to allocate the reusable hardware to the different applications, allowing them to run alongside each other. That may be tough to manage, however. It may be easier to use the same resources for all applications, time-sharing to keep them all moving.

All model parameters in-memory
Another major category of static platform leverages what is variously called in-memory compute (IMC) or compute-in-memory (CIM). This exploits a feature of non-volatile memory arrays that allows them to perform multiply-accumulate functions by accessing rows and columns, with no explicit multipliers or adders. The math is done in an analog fashion.

The idea is that the weights are stored in these arrays, and the incoming signals are then multiplied by the weights. Given enough array space, the entire model can be loaded at once. For non-volatile-memory-based arrays, those weights can even survive being powered off, making it necessary to load them only once. DRAM is then not needed for the weights. “There’s no external DRAM,” said Tim Vehling, senior vice president for product and business development at Mythic. “We don’t need it, and we don’t want it.”

Calibration is used to eliminate the effects of variation during manufacturing. It’s also necessary for real-time use to correct any aging effects. This comes at a cost, of course. “Without compensation and calibration, these analog circuits would be 50 times more efficient, rather than 5 to 10 times,” said GP Singh, CEO at Ambient Scientific.

DACs and ADCs also are required. Digital values are converted into analog ones for the multiply-accumulate operation, and the results are then converted back into digital form. Those DACs and ADCs, which have to be of very high quality, can take a lot of die area. That creates a tradeoff between the number of analog arrays (smaller ones sprinkled around ease signal routing) and die size, because peripheral array circuitry has to be replicated for each array. “Fusing” arrays saves silicon since, for example, a single sense amp can be used for multiple calculations, better amortizing the circuit area.
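The signal path through such an array (digital in, analog multiply-accumulate, digital out) can be modeled numerically. The Python sketch below is a simplified behavioral model, not any vendor's circuit: it assumes activations and weights normalized to the range 0 to 1, ideal 8-bit converters, and no noise or device variation. The `quantize` and `imc_matvec` names are hypothetical.

```python
def quantize(x, bits, full_scale):
    """Clamp x to [0, full_scale] and snap it to one of 2**bits evenly spaced levels."""
    steps = (1 << bits) - 1
    x = min(max(x, 0.0), full_scale)
    return round(x / full_scale * steps) * full_scale / steps


def imc_matvec(weights, activations, dac_bits=8, adc_bits=8):
    """Model one array: weights[i][j] in [0, 1] links input row i to output column j."""
    # DAC step: digital activations (assumed in [0, 1]) become analog row signals.
    analog_in = [quantize(a, dac_bits, 1.0) for a in activations]
    n_rows, n_cols = len(weights), len(weights[0])
    # Analog step: each column accumulates input * weight contributions.
    column_sums = [
        sum(analog_in[i] * weights[i][j] for i in range(n_rows))
        for j in range(n_cols)
    ]
    # ADC step: the largest possible column sum is n_rows, so use that as full scale.
    return [quantize(s, adc_bits, float(n_rows)) for s in column_sums]


weights = [[1.0, 0.0],
           [0.5, 1.0]]
exact = [1.0 * 1.0 + 0.5 * 0.5, 0.5 * 1.0]  # ideal result for activations [1.0, 0.5]
approx = imc_matvec(weights, [1.0, 0.5])
print(exact, approx)
```

The small gap between the ideal and modeled outputs comes entirely from converter resolution, which is one reason converter quality matters so much in these designs.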

Mythic has been working on a flash-based version of this technology. “It’s NOR flash, and yet it has highly proprietary programming techniques,” said Mike Henry, co-founder and CEO of Mythic. The company’s differentiation comes from this new architecture rather than from aggressive silicon technology. “We don’t have to use crazy process technology,” added Vehling. “We can stay in standard, almost lagging-edge, nodes.”

Fig. 6: Mythic’s architecture uses flash-based arrays to store weights, with a nano-processor for performing other calculations in conjunction with a SIMD block. Local SRAM stores activations from neighboring cells, delivered over a NoC. Source: Mythic

Ambient Scientific, on the other hand, uses what it calls 3D SRAM for its cell. This builds upon an SRAM cell to include a multiplication capability. The cell will lose its contents on power-down, but sleeping saves power. “All of our SRAMs are naturally in sleep mode, but they wake up with zero latency,” said Singh. Rather than calibrating its DACs and ADCs independently, Ambient Scientific calibrates them together in a loop such that they can converge with no independent reference. “The DAC and ADC are done in conjunction,” he said.

Fig. 7: The Ambient Scientific in-memory compute approach. SRAM cells are used to store and present the values for computation in digital form, transforming to analog for the actual math. Source: Ambient Scientific

Finally, Imec and GlobalFoundries collaborated on a research project whose goal was to achieve 10,000 TOPS/W efficiency. While they got close, they did so with an SRAM-based approach that’s not very silicon-efficient. “The objective of the program [is] to understand which of [the bit-cell] options is best for this particular application,” said Diederik Verkest, distinguished member of technical staff at Imec. They are exploring many new architectural ideas that will warrant more detailed discussion if results prove them out.

The GlobalFoundries/Imec work appears different from other IMC architectures. “Activations are input to the DAC, which translates them into pulse width on the activation line,” said Ioannes Papista, R&D engineer at Imec. “Depending on the pulse width and the weight loaded into the compute cell, we get the charge on the summation line, which is translated to a six-bit word on the ADC.”

Fig. 8: Imec and GlobalFoundries have collaborated on an IMC architecture that, for now, leverages SRAM. Source: GlobalFoundries

New NVM cells like PCRAM and RRAM are being considered for IMC, but there is still a lot of development work required to prove out their long-term stability and reliability. This promises to be an area of continued exploration.

Feed-forward and feedback
Another characteristic of architectures relates to the direction of flow. Vision-based applications tend to be feed-forward. That means that the outputs of one layer move only to the next layer. When represented as a graph, such graphs are called “directed acyclic graphs,” or DAGs. The “acyclic” part means there are no “cycles” – that is, nothing feeds back to come through again.

Other networks, like long short-term memory (LSTM) networks and recurrent neural networks (RNNs), do involve some amount of feedback. The feedback might involve activations remaining in place within a node for further processing (while also feeding forward), or it may skip back more than one layer.

In some cases, the hardware itself will handle the feedback. Atlazo’s architecture is expressly intended for these kinds of applications, so while the company hasn’t disclosed the details, it’s likely the hardware expressly builds this in.

Blaize, on the other hand, uses a different approach. The data remains in place for use in another node, but the control is sent back to the host for rescheduling. As long as the feedback doesn’t span more than a few layers (the exact number isn’t public), the graph metadata allows the control circuits to anticipate the feedback.

Accelerator vs. processor
Some chips are intended to be used as accelerators. Such devices operate under the control of a host, receiving their workload and kernel assignments from that host.

Others act as self-sufficient processors. While a host may help to control configurations or start and stop processing, the hardware gets direct access to the data, and the host remains in the background unless some control situation requires intervention.

Such devices may require specialized I/Os if the data to be processed is coming directly from sensors. And some of those sensors, like microphones, may be analog.

Fig. 9: An example processor with an AI accelerator on-chip. Source: Maxim Integrated

Whole application vs. network-only
Inference is typically only part of an overall application. With vision, for example, images may need some pre-processing to normalize sizes or color schemes, or even to clean up the images with DSP algorithms.

For an inference-only architecture, those functions are handled by the host or some other processor. “A lot of times the data that goes into the accelerator needs to be massaged somehow, and that’s really what the RISC-V is there for in our architecture,” said Kris Ardis, executive director of micros, security and software business unit at Maxim Integrated.

Other architectures are intended to handle the entire application. This mostly tends to be possible in dynamic architectures, where hardware is being repurposed not only for different layers of a network, but for other non-network functions as well. Blaize, for example, can run either network kernels or other code through their processing units.

Similarly, Ambient Scientific has a full-system chip, and it has focused the low-power efforts on that whole chip, not just the AI part. But it’s not re-using the same circuits for both AI and other functions. When inference is running, for example, the host Arm core will shut down until a result is available.

Dealing with the memory wall
The biggest challenge for many of these architectures is to lower power by improving data fetches. That means keeping data local as much as possible, although there are widely varying ways to do that. In some cases, it’s about where the data is located – so-called “near-memory” computing. In other cases, it’s about hiding the fetches. “We have been able to hide the fetch problem behind the compute cycles,” said Krishna Rangasayee, CEO at SiMa.ai. While that hides latency, it may not hide the power required to move so much data.

Meanwhile, some networks apparently don’t have this limitation. “We’re compute-bound,” said Pete Warden, technical lead of TensorFlow Micro at Google, speaking about the company’s tinyML approach. “We don’t need more memory accesses. We load a few activations and weight values at a time, and then do a lot of different combinations of multiply/adds on them in registers.”
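One possible reading of Warden's description is that every loaded activation gets combined with every loaded weight while both sit in registers. Under that assumption, which is an interpretation rather than anything Google has specified here, a short Python sketch shows how the work done per value loaded grows with the number of values held. The `macs_per_load` name is hypothetical.

```python
def macs_per_load(n_activations, n_weights):
    """MACs performed per value loaded when every activation meets every weight."""
    macs = n_activations * n_weights
    loads = n_activations + n_weights
    return macs / loads


for n in (1, 4, 8):
    print(n, macs_per_load(n, n))
```

Holding more values in registers raises the ratio of arithmetic to memory traffic, which is what pushes a workload toward being compute-bound.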

Compare directly at your own risk
We wanted to summarize these different architectures in some succinct manner, but it’s very hard to do. Apples-to-apples comparisons are almost impossible, unless a common model on a common network is run on different architectures. Benchmarks are intended to provide this, but there simply aren’t benchmark results available for most of these architectures. And even if there were, the existing benchmark networks are considered too small to give meaningful results.

What we’re left with are marketing numbers. They aren’t necessarily evil, but they do have to be viewed with a clear eye. We’ve pulled a few simple parameters together into a table. They’re best for getting a sense of which applications the different devices are intended for. They’re not useful for directly comparing which device is better. All data (unless otherwise noted) comes from the respective company, edited only for brevity.

The three main parameters are performance, power, and efficiency:

  • TOPS (tera-operations per second) is a common, but not universal, speed metric. It’s a very imperfect measure, but unlike latency it is a feature of the hardware rather than the application. It’s also incomplete. A clever architecture with a lower TOPS number may be faster than a less-clever one with a higher TOPS rating.
  • Power reflects the general power range. It’s not a guarantee, but it helps to position the architecture in the chart shown in the first part of this survey.
  • Efficiency is performance per watt. This isn’t the arithmetic combination of the first two numbers, because a high-performance configuration typically is different from a low-power configuration. Efficiency needs to reflect performance and power in the same configuration. That’s why this is listed separately.
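The reason efficiency cannot be derived from the headline performance and power numbers is easiest to see with figures. The Python sketch below uses two made-up operating points for a single device. The `Config` class, the `naive_efficiency` helper, and all of the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    name: str
    tops: float
    watts: float

    @property
    def tops_per_watt(self) -> float:
        """Efficiency measured within this one configuration."""
        return self.tops / self.watts


def naive_efficiency(configs):
    """Misleading figure: best-case TOPS divided by best-case power."""
    return max(c.tops for c in configs) / min(c.watts for c in configs)


configs = [
    Config("high-performance", tops=16.0, watts=8.0),  # hypothetical
    Config("low-power", tops=4.0, watts=1.0),          # hypothetical
]
for c in configs:
    print(c.name, c.tops_per_watt)
print("naive", naive_efficiency(configs))
```

Neither real operating point comes close to the naive figure, because the peak TOPS and the minimum power never occur in the same configuration.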

We also list a few other considerations:

  • The “brand” is the device, architecture, or technology brand, which may not be the same as the company name.
  • Differentiation will be a few words summarizing “what’s different about” the architecture.
  • The “how sold” box reflects how the architecture is available. At least one is an IP block; the others are silicon. Some are available by the chip, others are available only on modules or cards. There are various different card formats available. The table doesn’t detail that. It details whether cards are an option. “Cards” and “modules” are both called “cards” in the table.
  • Application focus is the company’s take on the kinds of tasks they’re addressing. Different architectures can be used for other things, because it’s often a question of market focus rather than technology. But a good vision architecture might not do well for non-vision applications. And even within vision, not all applications are alike.

* Not a full AI engine, but a micro-GPU for use in an engine
Fig. 10: A brief summary of the architectures covered. Not appropriate for apples-to-apples comparisons. Source: Bryon Moyer/Semiconductor Engineering

This should remain an extremely dynamic space, with lots of growth for those that do well. “AI accelerators are likely to get to more than 2X growth over the next four to five years,” said Hiren Majmudar, vice president and general manager, computing business unit at GlobalFoundries, adding this will keep it an attractive market.
