Inferencing with lower power and improved performance.
Edge inference engines often run a slimmed-down real-time engine that interprets a neural-network model, invoking kernels as it goes. But higher performance can be achieved by pre-compiling the model and running it directly, with no interpretation — as long as the use case permits it.
At compile time, optimizations are possible that aren't available when interpreting. Quantizing automatically, merging nodes and kernels, and binding variables into constants where feasible can yield substantial efficiency improvements. That means inference will run more quickly with less power.
“Running a neural network efficiently and achieving good performance on an edge device have two significant challenges — processing compute-intensive convolutions and manipulating large amounts of data,” said Steve Steel, director of product marketing, machine-learning group at Arm. “Both of these challenges must be solved in order to realize a balanced system design.”
A typical engine
Systems at the edge have resource constraints that data centers don’t have. While power is important for all AI inference, it’s of greater importance at the edge, particularly where batteries may drive the device. Memory and computing power are also much more limited, creating an opportunity to make inference engines more efficient. “If you’re designing an IoT device or something of that nature, the networks are all known beforehand,” said Suhas Mitra, product marketing director for Tensilica AI products at Cadence. “And if your model is known a priori, you can go through a tool chain ahead of time.”
A typical configuration has the inference engine running a limited version of one of the machine-learning frameworks, like TensorFlow Lite. That engine takes a relatively abstract version of the neural-network model and interprets the required execution, calling on small, self-contained programs called “kernels” that perform various functions. A convolution would be run by a kernel. An activation function might be another kernel. The run-time interpreter invokes the kernels as it processes the network.
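In rough terms, the pattern looks like the sketch below. The kernel registry and graph format are made up for illustration, not any particular framework's API; the point is the per-node dispatch that happens at runtime.

```python
import numpy as np

# Each "kernel" is a small, self-contained routine. The registry and graph
# format here are hypothetical, not any framework's actual API.
def dense_kernel(x, params):
    return x @ params["weights"] + params["bias"]

def relu_kernel(x, params):
    return np.maximum(x, 0.0)

KERNELS = {"dense": dense_kernel, "relu": relu_kernel}

def run_interpreted(graph, x):
    """Walk the model graph node by node, dispatching a kernel for each one."""
    for op_name, params in graph:        # dispatch decided at runtime, per node
        x = KERNELS[op_name](x, params)
    return x

# A tiny two-node graph, interpreted at runtime.
graph = [("dense", {"weights": np.ones((4, 2)), "bias": np.zeros(2)}),
         ("relu", {})]
print(run_interpreted(graph, np.array([1.0, -2.0, 3.0, -4.0])))
```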
But interpretation adds a layer of computing to the whole problem. In addition to performing the actual inference, the engine must translate the high-level model into specific execution instructions, and that takes both time and energy.
Interpretation is useful when developing a model, because the model changes frequently. Having a generic engine that can take any model and run it saves time and effort. But once in production, some are finding there are benefits to compiling the model into native code rather than running an interpreter.
Interpretation also makes sense for applications that may run on a variety of platforms. Android smartphones provide an example. While they share a common Android platform, the underlying hardware — the processors, accelerators, memory, and other critical pieces of the hardware architecture — may be different in each phone.
In this case, compilation wouldn’t make sense. You would need to compile for each phone model, which is impractical. Interpretation gets you onto all of the variants. “The whole idea is that you should be able to run any application that comes through, and you have to be able to run that,” said Pulin Desai, product management group director, Tensilica vision and AI DSP IP at Cadence. “That’s the flexibility that the neural-network API gives.”
Interpretation also may be necessary when the workload isn’t locked down. “Whether to pre-compile or interpret the machine learning workload will depend on the use case, and some edge devices may do both,” said Steel. “If the ML workload is known in advance, then pre-compiling and deploying a binary will usually give the most efficient execution in terms of processor cycles and energy use.”
Even with interpretation, a certain amount of preparation can be done before the first inference with a newly loaded model. “I can understand what the different levels and what the graph are, and I can optimize them, but it is not optimized to the extent that was optimized in the ahead-of-time compilation,” said Mitra. That preparation doesn’t have to be repeated for every inference. “You amortize the cost by doing the first level of preparation, followed by multiple levels of execution.”
The obvious benefit of compiling is that the model can then be executed directly in the native language of the processor. That removes layers of overhead and reduces the amount of code running. Compiling a model takes seconds or perhaps minutes, so it’s not a particularly time-consuming process.
Of course, if the inference engine doesn’t involve a processor, then there is no interpreted option. “The notion of compilation or interpretation applies only to processor-based engines,” said Geoff Tate, CEO of Flex Logix. “With an FPGA-based engine, every model is compiled. The fact that it runs in hardware makes it faster and more efficient.”
In addition to the removal of processing overhead, however, compile-time optimizations can simplify a model and provide further increases in performance. “It starts with your high-level graph,” said Markus Levy, director of machine-learning technologies at NXP. “And then it creates a low-level intermediate representation, which is more or less target-agnostic. So this is optimization of the model itself.”
Facebook has taken on such a task with its Glow project, an open-source approach that “lowers” the abstraction level of a neural-network graph. (“Glow” comes from “graph lowering”). A number of companies, including Cadence and NXP, have engaged with Glow, both by contributing to the open-source project and by writing optimizations for their specific platforms.
Fig. 1: An illustration of the compilation process for neural networks. Source: Cadence
Glow uses a static representation of the model that can be loaded all at once. “It works with a single statically allocated block, and all the tensors have known offsets inside it at any given point in time,” said Roman Levenstein, software engineering manager at Facebook. “So you save a lot on dynamic memory allocation at runtime, and you don’t need any operating system or services.”
This reflects the fact that the project originally focused on data-center computation, where memory is freely available in high quantities. That may be less the case for edge applications, and there may be more opportunities to improve things further. But the point is that memory can be allocated statically, and it can be used in a flexible fashion.
Cadence noted that memory must be allocated at some point, but the algorithm may not be in control of that. “We simply give you an API and say, ‘Give me a pointer to the blob that contains the input,’ which would be a camera input or whatever,” said Mitra. “‘Tell me where you want me to put down the outputs. And then give me some arena or working space.’”
With these pointers, it’s not necessary to allocate any new memory. It’s up to the developer to allocate those regions, and they can do it statically or dynamically — using malloc, for instance.
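A sketch of what such an entry point might look like follows. The function name, shapes, and weight constants are all hypothetical; the point is that the caller supplies the input, output, and arena regions, and the compiled model works entirely inside them.

```python
import numpy as np

# Hypothetical entry point of a compiled model. The caller owns all memory and
# simply passes in the regions: input blob, output location, and a working arena.
WEIGHTS_L1 = np.random.rand(16, 8).astype(np.float32)   # placeholder constants,
WEIGHTS_L2 = np.random.rand(8, 10).astype(np.float32)   # bound at compile time

def run_compiled(input_buf, output_buf, arena):
    hidden = arena[:8]                   # intermediate tensor at a fixed offset
    hidden[:] = input_buf @ WEIGHTS_L1   # all work happens in caller-owned memory
    np.maximum(hidden, 0.0, out=hidden)
    output_buf[:] = hidden @ WEIGHTS_L2

# The developer allocates these regions once, statically or with malloc-style calls.
input_buf  = np.zeros(16, dtype=np.float32)
output_buf = np.zeros(10, dtype=np.float32)
arena      = np.zeros(64, dtype=np.float32)
run_compiled(input_buf, output_buf, arena)
```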
Arm noted that some applications can’t rely on a workload that’s known a priori. “How allocation is done depends on the device,” said Steel. “For example, an embedded device based on an Arm Cortex-M CPU and running a simple RTOS typically will be designed to perform a relatively simple, fixed ML task with the memory pre-allocated. Whereas a device based on an Arm Cortex-A CPU and running a platform OS, such as Linux with multiple concurrent ML applications, would have the memory allocated dynamically by the operating system.”
Optimizing the models
Two optimizations are frequently mentioned in neural-net discussions. One is pruning, which finds connections with weights so small that they can be removed without measurably affecting the accuracy of the model. Such connections are said to have been pruned away.
Fig. 2: Pruning removes connections that will have minimal impact on the result. Source: Bryon Moyer/Semiconductor Engineering
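In its simplest magnitude-based form, pruning amounts to the following sketch. The threshold is illustrative, and real tool chains make the decision more carefully, often followed by fine-tuning.

```python
import numpy as np

# Magnitude pruning sketch: connections whose weights fall below a threshold
# are zeroed out (removed). The threshold value here is illustrative only.
def prune(weights, threshold=0.01):
    mask = np.abs(weights) >= threshold
    return weights * mask, mask            # sparse weights plus a keep-mask

w = np.array([[0.8, -0.002, 0.3], [0.001, 0.5, -0.004]])
pruned, mask = prune(w)
print(pruned)          # small-magnitude connections are now exactly zero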
A second commonly cited optimization is layer fusion, where two consecutive layers of a model have their kernels combined, so that a single pass through the fused layer does the work of both.
Fig. 3: Layer fusion can result in traversal of a single fused layer once instead of twice, once each for two layers. Source: Bryon Moyer/Semiconductor Engineering
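One concrete and widely used instance, folding a batch-normalization layer into the dense or convolution layer ahead of it, can be done entirely at compile time. The sketch below assumes a dense layer of the form y = xW + b followed by batch norm; it isn't necessarily the exact fusion any given compiler performs.

```python
import numpy as np

# Fold batch norm into the preceding dense layer so inference runs one layer
# instead of two. Shapes and values are illustrative.
def fuse_dense_batchnorm(W, b, gamma, beta, mean, var, eps=1e-5):
    scale = gamma / np.sqrt(var + eps)     # per-output-channel scale factor
    W_fused = W * scale                    # broadcast over output columns
    b_fused = (b - mean) * scale + beta
    return W_fused, b_fused

W, b = np.random.rand(4, 3), np.random.rand(3)
gamma, beta = np.ones(3), np.zeros(3)
mean, var = np.random.rand(3), np.random.rand(3) + 0.5
Wf, bf = fuse_dense_batchnorm(W, b, gamma, beta, mean, var)

# The fused layer gives the same result in a single pass.
x = np.random.rand(4)
two_layers = gamma * ((x @ W + b) - mean) / np.sqrt(var + 1e-5) + beta
assert np.allclose(x @ Wf + bf, two_layers)
```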
But there are many other opportunities for optimization, both at the graph level and at the lower instruction level. Because run-time interpreters have to be able to handle any network that’s thrown at them, they have lots of parameters that can be set – like the number of inputs, the size of matrices, and the selected activations. But for any given static model, those aren’t variables. They’re known quantities. By eliminating the need to accommodate any neural network in favor of focusing on one specific network, there’s a lot of efficiency to be gained.
High-level optimizations focus on graphs. “We just treat the graph like a sea of nodes that we’re going to optimize,” said Jordan Fix, research scientist at Facebook. Low-level optimizations focus on lower-level code. LLVM is one way to establish an intermediate representation (IR) with code at a relatively low level while still remaining hardware-agnostic.
That said, some optimizations can take advantage of specific platform characteristics – accelerators, special instructions, memory structures – and so both the high- and low-level optimizations can take place generically or with specific backends in mind.
High-level optimization
One optimization Glow performs is to merge fully connected nodes that share the same inputs. “In general, it’s often beneficial to merge dense linear algebra to use deeper sizes, so that we can better take advantage of vectorization/data parallel execution on an architecture,” said Fix.
Some platforms will have matrix operators that prefer a specific size of matrix, and that may not coincide with the size of matrices being processed. “If it’s best optimized or best utilized by having a 32×32 multiplication, and we have some really skinny matrices that are being multiplied, it’s possible that we could merge matrix multiplications together and get better utilization of the hardware that we’re running on,” said Fix. He also noted that if two such skinny matrices are multiplied by the same larger matrix, the skinny ones can simply be stacked (rather than creating “islands” as shown in the image below).
Fig. 4: If 8 x 8 matrices are handled more efficiently in a particular engine, then smaller matrices can be concatenated into an 8 x 8 matrix, and a single multiplication can replace two multiplications. Source: Bryon Moyer/Semiconductor Engineering
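A sketch of the stacking case Fix describes, with made-up sizes, shows why a single multiplication can do the work of two:

```python
import numpy as np

# Two skinny matrices multiplied by the same larger matrix can be concatenated
# so one better-utilized matmul replaces two. Sizes are illustrative.
B  = np.random.rand(8, 8)          # shared larger matrix
A1 = np.random.rand(2, 8)          # skinny operand 1
A2 = np.random.rand(3, 8)          # skinny operand 2

stacked = np.vstack([A1, A2]) @ B  # single multiplication
C1, C2 = stacked[:2], stacked[2:]  # split the results back out

assert np.allclose(C1, A1 @ B) and np.allclose(C2, A2 @ B)
```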
Quantizing is an important step for any model. “Once you quantize [floating point to integer], you immediately have a 4X reduction in memory size,” said Levy. Quantization can happen automatically by running workloads on the abstract version of the model and creating a histogram of the observed values. If used directly, the minimum and maximum values establish the dynamic range. For a given bit width, a wider dynamic range gives less internal resolution than a narrow one. If the outer edges reflect outliers, a user could choose to narrow the range in order to provide better resolution.
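The range selection might look roughly like the sketch below, where the percentile-based clipping is purely illustrative rather than Glow's actual policy:

```python
import numpy as np

# Choose a quantization range from profiled activation values. Clipping
# outliers narrows the range and makes each 8-bit step finer.
def quant_params(observed, bits=8, clip_percentile=None):
    if clip_percentile is not None:
        lo = np.percentile(observed, clip_percentile)
        hi = np.percentile(observed, 100 - clip_percentile)
    else:
        lo, hi = observed.min(), observed.max()
    scale = (hi - lo) / (2**bits - 1)        # size of one quantized step
    zero_point = int(round(-lo / scale))
    return scale, zero_point

acts = np.random.randn(10000)                # profiled activations
print(quant_params(acts))                    # full range, outliers included
print(quant_params(acts, clip_percentile=1)) # narrower range, finer resolution
```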
But that quantization can interact with other functions, possibly making them redundant. “Some models/operators use clipping internally as part of the operator,” said Fix. “In Glow, if the backend does not have a specialized implementation of Logit, we will ‘lower’ it to a series of simpler nodes, and, in this case, we create a Clip node as part of that lowering process. Thanks to this, we are able to apply optimizations that generically look for Clip nodes. For example, if the previous node before the Logit was a ReLU, then we can optimize this by merging the ReLU and Clip together.”
Similarly, if quantization is done in a fashion that includes only positive numbers, then that’s also redundant with clipping and ReLU functions. “Let’s say you have a convolution where the possible values coming out of it are -5 to 5, and then you have a ReLU that comes right after in the graph,” said Fix. “We can remove the ReLU and change the convolution to have a minimum range of zero, because we would end up cutting off all those negative values regardless. So we get better resolution on the quantized values that are coming out of the operation, and we also don’t have to perform the ReLU.”
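Expressed as a small sketch of the bookkeeping involved (the numbers mirror Fix's example, and the function is hypothetical):

```python
# If a ReLU follows the convolution, negative outputs would be cut off anyway,
# so the quantized output range can start at zero and the ReLU node disappears.
def fold_relu_into_range(out_min, out_max, followed_by_relu):
    if followed_by_relu:
        out_min = max(out_min, 0.0)      # all 8-bit codes now cover [0, out_max]
    return out_min, out_max

print(fold_relu_into_range(-5.0, 5.0, followed_by_relu=True))   # (0.0, 5.0)
# The step size shrinks from 10/255 to 5/255, doubling the output resolution.
```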
There are also situations where matrices need to be transposed. “If we have a weight that we see is transposed in the graph, then we can transpose it at compile time and not have to do that at runtime,” said Fix. These will be constant matrices that are known at compile time.
“There is another related optimization, where the graph may contain a chain of transpose operations applied to a given tensor, which doesn’t need to be constant,” noted Levenstein. “Such chains of transposes may arise as a result of other graph transformations. The net effect of such a chain of transpose operations can be often represented as a single transpose operation.”
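A chain of axis permutations composes into a single permutation, as the short sketch below shows; for constant weights, the result can simply be materialized at compile time.

```python
import numpy as np

# Collapse a chain of transpose (axis-permute) operations into one permutation.
def compose_transposes(perms):
    composed = list(range(len(perms[0])))
    for p in perms:
        composed = [composed[i] for i in p]
    return composed

x = np.random.rand(2, 3, 4)
chain = x.transpose(1, 0, 2).transpose(2, 1, 0)                  # two transposes
single = x.transpose(compose_transposes([(1, 0, 2), (2, 1, 0)])) # just one
assert np.allclose(chain, single)
```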
Some, although not all, of these higher-level optimizations may impact the accuracy of the model. Quantization and pruning, for example, explicitly remove information from the graph on the assumption that what has been taken out will have minimal impact on inference results. But that might not be the case.
In order to be sure, the model must be run with selected optimizations to ensure that accuracy remains high enough. “We give different knobs, and we try to make sure that we never go beyond 1% accuracy loss,” said Cadence’s Mitra. “The tool chain does this search and gives you numbers on a runtime basis telling you what the accuracy numbers look like for these optimizations.” This may involve a design of experiments with different optimizations to find the combination that provides the best savings in execution time, power, and footprint while maintaining the required level of accuracy.
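The kind of search Mitra describes can be pictured as a loop over candidate configurations, as in the hypothetical sketch below. The compile_model() and evaluate() calls are stand-in stubs, not any vendor's actual tool-chain API.

```python
from itertools import product

def compile_model(model, **config):          # stand-in for a real tool-chain call
    return (model, config)

def evaluate(binary, dataset):               # stand-in: returns (accuracy, latency)
    return 0.905, 12.3

def search(model, dataset, baseline_acc, knobs):
    viable = []
    for combo in product(*knobs.values()):
        config = dict(zip(knobs.keys(), combo))
        acc, latency = evaluate(compile_model(model, **config), dataset)
        if baseline_acc - acc <= 0.01:       # stay within a 1% accuracy-loss budget
            viable.append((latency, config))
    return min(viable, key=lambda v: v[0])   # fastest configuration that still passes

knobs = {"bits": [8, 16], "prune": [0.0, 0.01], "fuse_layers": [True, False]}
print(search("model", "test_set", baseline_acc=0.91, knobs=knobs))
```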
Alternatively, retraining might be necessary. Knowing how the model will be quantized in advance can help to avoid retraining, since the original training can be done with quantization in mind. “We’re always trying to recommend customers to do quantization-aware training because it helps to maintain the accuracy,” said Levy.
Lower-level optimizations
Working at the lower level brings in more concrete real-world notions. “When we lower this high-level graph into an instruction stream, we introduce the concept of memory,” said Levenstein. “For each input, or for each unique tensor in the graph, we create a memory buffer and instructions. They operate now with side effects.”
One available option for this is buffer sharing. Tensor buffers are pre-allocated, but tensor lifetimes have an effect on how those buffers are used. “If we see that two tensors have non-overlapping lifetimes, we can reuse the same block of memory for both of them,” said Levenstein. In that fashion, buffers aren’t required for every tensor that will ever exist during processing – only for the number that will be active at the same time.
It also may be possible to manipulate tensor lifetimes to increase the opportunities for buffer sharing. If a particular tensor is used early and late, then between uses the buffer remains occupied and idle. “We don’t want this tensor to be alive for such a long period of time so that we can re-use its memory as soon as possible,” said Levenstein. “We may move some computations around if we can prove that they won’t change the semantics of the overall result.”
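A greatly simplified sketch gives a feel for the bookkeeping. Tensors are described by hypothetical (first use, last use) step numbers, and any block whose previous tenant is already dead gets reused.

```python
# Lifetime-based buffer sharing: tensors with non-overlapping lifetimes share
# the same memory block. Greedy assignment; the lifetimes below are made up.
def assign_buffers(lifetimes):
    buffers = []                       # each entry holds its current tenant's last use
    assignment = {}
    for name, (start, end) in sorted(lifetimes.items(), key=lambda kv: kv[1][0]):
        for i, free_at in enumerate(buffers):
            if start > free_at:        # previous tenant is dead; reuse this block
                buffers[i] = end
                assignment[name] = i
                break
        else:
            buffers.append(end)        # no reusable block; allocate a new one
            assignment[name] = len(buffers) - 1
    return assignment, len(buffers)

lifetimes = {"t0": (0, 2), "t1": (1, 3), "t2": (3, 5), "t3": (4, 6)}
print(assign_buffers(lifetimes))       # four tensors fit in two buffers
```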
There is one exception to this as implemented now — all weights are allocated statically as constant tensors. If the model will be used for multiple successive inferences, this is an efficient way to do things, providing there is enough memory available. It means those weights will never need to be fetched from storage. They’re always in place.
But if memory is in short supply, there is an opportunity to re-use the weight storage. As processing moves through the layers of the network, the weights associated with the first layers will no longer be relevant, and their storage could be re-used for successive layers. Done carefully, those buffers could be reloaded ahead of when they’re needed, hiding at least some of the latency of fetching the new weights. “We currently use a model where we’ve got all the weights in the beginning, but [re-using model buffers] could have been easily done if required, so it’s not an architectural constraint,” said Levenstein. This does, however, involve more data movement, which can increase power.
Another technique is referred to as “kernel stacking.” If several kernels operate in succession, there’s an opportunity to generate significant efficiencies because these are tensor operations – usually nested loops. “If you have a sequence of element-wise operations, like multiplication and addition followed by maybe a ReLU, then, normally, all of them have their standalone separate kernels, and you would call them in sequence,” said Levenstein. “And this would work. The problem is, in terms of performance, it’s not ideal. If you have big tensors, it will trash your cache because you will iterate over each tensor multiple times, so instead we do loop fusion.”
Rather than doing multiple loops, each with one operation, those loops can be combined into a single loop, where all of the kernel functions are performed in succession on each entry of the tensor. All of the data is then still in the cache. When the loop moves to another entry, the prior entries are complete and their cached values are no longer needed.
Fig. 5: Individual loops on the same tensor may require reloading the cache on each loop. By combining them into a single loop, all data remains local for a given iteration. Source: Bryon Moyer/Semiconductor Engineering
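The sketch below shows the structure of the transformation. In Python the fused loop is only illustrative, since the payoff comes when a compiler turns that single loop into tight native code.

```python
import numpy as np

# Three separate element-wise passes versus one fused pass over the tensor.
def unfused(x, a, b):
    t1 = x * a                # pass 1 over the whole tensor
    t2 = t1 + b               # pass 2
    return np.maximum(t2, 0)  # pass 3 (ReLU)

def fused(x, a, b):
    out = np.empty_like(x)
    for i in range(x.size):   # single pass: multiply, add, ReLU per element
        out.flat[i] = max(x.flat[i] * a + b, 0.0)
    return out

x = np.random.randn(1000)
assert np.allclose(unfused(x, 2.0, 0.5), fused(x, 2.0, 0.5))
```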
This is also the level where the known dimensions of tensors have great value. “We statically know the shapes [of the tensors],” said Levenstein. For a generic engine that must handle any tensors thrown at it, the bounds on loops and other operations remain as parameters, and there is code to validate the values that occur during operation. “Inside the convolution kernel, you have six or seven loops that implement the convolution. They have bounds checks and stuff like that,” he said.
But if the sizes of the tensors are known in advance, which they will be for a specific model, then those variables can be replaced with constants, and the validation and other related code can be eliminated. “If you substitute in the exact bounds as constants, then LLVM does a pretty decent job in vectorizing those kernels.”
Fig. 6: By replacing variable loop limits with constant limits, validation and possibly other code can be removed. Source: Bryon Moyer/Semiconductor Engineering
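As a rough illustration, compare a generic dot-product kernel, which must validate and loop over whatever size it is handed, with a version specialized for a known size of 64 (an arbitrary example):

```python
# A generic kernel must validate its inputs and loop over variable bounds.
def dot_generic(a, b, n):
    assert len(a) >= n and len(b) >= n   # runtime bounds validation
    acc = 0.0
    for i in range(n):                   # n is only known at runtime
        acc += a[i] * b[i]
    return acc

# The specialized version bakes the known size in as a constant, so the checks
# vanish and a compiler has a fixed trip count it can unroll or vectorize.
def dot_64(a, b):
    acc = 0.0
    for i in range(64):
        acc += a[i] * b[i]
    return acc

a, b = [1.0] * 64, [2.0] * 64
assert dot_generic(a, b, 64) == dot_64(a, b)
```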
Another source of code savings involves the libraries provided by hardware platform builders. These libraries provide access to functions optimized for that platform. But not every model uses every function in the library. It’s typical for a small subset to be used. And yet the entire library would typically be linked into the project, including the pieces that aren’t used.
“Resnet, as an example of the famous computer vision network, mostly uses convolution, smart moves, and a couple of other operations,” said Levenstein. “But we have 40 or 50 kernels available [in the library] for other operations as well. Now, because we have them in the same LLVM IR module, it will see that most of them are not involved, and it will just strip away their code,” shrinking the size of the code base with no impact on operation.
Other classic software-optimization techniques like dead-code and common subexpression elimination can also be performed to clean up the algorithm. Cadence also has a large number of such code-level optimizations in addition to what Glow provides.
Making it platform-specific
By performing these operations at the LLVM level, they can be done generically while possibly tuning for platform characteristics like the optimal matrix-multiplication size. At that point, as long as the target hardware has support for LLVM, the IR can be compiled all the way down to machine code. In fact, “We can compile the network literally into a self-contained object file,” noted Levenstein.
But different architectures may also have their own specific tricks and optimizations that they can use. And how those are done may impact the order of other optimizations. So Glow is modular enough that someone writing a backend for a specific piece of hardware can add, remove, or even reorder optimizations from the default, giving platform makers the opportunity to put their hardware to best use. “There is a default order, which would do the normal Glow flow, but back-ends are free to redefine it,” said Levenstein.
Mitra concurred. “People build their own tool chain where they break the model into smaller parts,” he said. “And then they do graph optimizations, and then they figure out what their back-end is, and they optimize it for the back-end.” He mentioned pruning as an example. “We do the pruning logic for our own hardware because we know what will best fit. So we don’t have to use the pruning from them as is.”
One set of results
NXP engaged with Glow, optimizing it for microcontrollers. The company gathered performance data using both TensorFlow Lite and Glow, and compared out-of-the-box Glow with Glow after optimizing the backend for its MCUs. Finally, NXP compared different MCUs – some using only the CMSIS-NN library and others using the Tensilica HiFi4 DSP.
The compiled model, using an optimized CMSIS-NN library, ran 3.1 times faster than the TensorFlow Lite implementation. For the compiled version, NXP’s optimized Glow implementation using the CMSIS-NN library operated at about twice the speed of that using un-optimized Glow. Using the HiFi4 DSP instead made for a 15X improvement over TensorFlow Lite.
Fig. 7: Performance improvements as a result of compilation using Glow. Source: NXP
With respect to power, Levy noted that it doesn’t affect power so much, but it does affect energy. “Getting the inference done quicker equals energy savings. Less memory transfers equals energy savings,” he said.
Glow is by no means the only compilation option available. There are also Apache TVM, TensorFlow MLIR, and Microsoft’s ELL. NXP picked Glow because it found it to be the most mature and flexible of the options. But continued development on all of these fronts may make compiled neural-network models more accessible, making for better edge inference.