Implementing AI Activation Functions

Why flexibility, area, and performance are traded off in AI inferencing designs.


Activation functions play a critical role in AI inference, allowing models to capture nonlinear behaviors. This makes them an integral part of any neural network, but nonlinear functions can be fussy to build in silicon.

Is it better to have a CPU calculate them? Should hardware function units be laid down to execute them? Or would a lookup table (LUT) suffice? Most architectures include some sort of hardware implementation, but different neural processors make different choices on how that hardware should look. It all comes down to the applications the processor will serve, which set the relative importance of cost, performance, and generality.

“Activation functions react nonlinearly to their inputs, controlling how the information flows between layers, which is somewhat akin to the activation or firing of the neurons in the brain,” said Sandip Parikh, senior software architect in Cadence’s Silicon Solutions Group.

Without a nonlinear function in the pipeline, a basic deep neural network would be one giant linear system that could be solved in one (large) go, but it would reveal only linear relationships. “Nonlinearities are what make neural networks do interesting things,” said Parikh. “Without them, a neural network, no matter how deep or large, would be unable to capture complex relationships between its inputs and outputs.”
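
To see why, consider a minimal sketch in plain Python, where scalar “layers” stand in for full matrix operations: two stacked linear layers collapse into a single equivalent linear layer, while inserting a ReLU between them breaks that equivalence.

```python
def layer(w, b, x):
    return w * x + b               # a toy 1-D "layer": scalar weight and bias

def relu(x):
    return x if x > 0 else 0.0

w1, b1, w2, b2 = 2.0, 1.0, -3.0, 0.5

for x in (-2.0, -0.5, 1.0, 4.0):
    stacked   = layer(w2, b2, layer(w1, b1, x))          # two linear layers
    collapsed = layer(w2 * w1, w2 * b1 + b2, x)          # one equivalent layer
    gated     = layer(w2, b2, relu(layer(w1, b1, x)))    # ReLU in between
    print(x, stacked, collapsed, gated)  # stacked == collapsed; gated differs
```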

One way to think about this is that training fits data to a curve. Activation functions specify those curves, and the question is which curve gives the best fit? A linear function would fit the data only to a straight line, whereas nonlinear functions can capture more interesting behaviors. “The curve of your activation function is introducing that nonlinearity, and what’s going to map it to your end data points will be the weights at the different levels,” explained Russ Klein, program director for the High-Level Synthesis Division at Siemens EDA.

Different layers can have different activation functions, but a given layer will have a fixed function established prior to training.

Picking the right nonlinear function
Although activation functions must be nonlinear, not just any nonlinear function will do. Activation functions have at least three basic requirements beyond nonlinearity:

  • They must be relatively easy to compute;
  • They must be smoothly differentiable for training; and
  • The derivative must be relatively easy to compute.

A more subtle requirement is that they should not suffer from what’s called the vanishing gradient problem, where training can slow or stop because gradients shrink exponentially as small derivatives are multiplied together across many layers. Early examples such as the sigmoid and hyperbolic tangent functions suffered from this issue.

That aside, model architects play with different functions to improve accuracy. The process appears to be at least partially empirical, and over time, different functions have risen to the fore. ReLU turned out to work well in early models, and it’s still popular now. It’s a simple function that zeros out the result when the input is negative and passes the input through when positive. It’s easy to calculate and, while it’s not technically differentiable at zero, the derivative on either side of zero is trivial to compute.

“In the early days of neural networks, sigmoid and tanh were the common activation functions with two important characteristics — they are smooth, differentiable functions with ranges of [0,1] and [-1,1], respectively,” said Parikh. “However, it was the simple rectified linear unit (ReLU) that ushered in the current revolution, starting with AlexNet. A key advantage of ReLU over sigmoid and tanh was overcoming their vanishing gradient problem.”
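
The vanishing gradient issue is easy to see in the derivatives that backpropagation multiplies together, layer after layer. A small, illustrative Python sketch (not production training code):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Derivatives seen during backpropagation
def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)            # peaks at 0.25, falls toward 0 for large |x|

def d_tanh(x):
    return 1.0 - math.tanh(x) ** 2  # also decays rapidly away from 0

def d_relu(x):
    return 1.0 if x > 0 else 0.0    # stays at 1 for all positive inputs

for x in (0.5, 2.0, 5.0, 10.0):
    print(x, f"{d_sigmoid(x):.2e}", f"{d_tanh(x):.2e}", d_relu(x))
```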

Many improved activation functions keep the basic idea of ReLU, a self-gating factor between 0 and 1, but use a smooth curve and allow negative values. “These include functions like SiLU, Swish, and GELU,” Parikh said. “More recently, gated activations have gained popularity by introducing an additional gating mechanism to control information flowing through the network. These are similar to the gates used in recurrent neural networks, and include GLU, GEGLU, and SwiGLU.”
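
As a rough illustration of those definitions, here is a Python sketch using scalar inputs. In a real SwiGLU layer, the gate and value would come from two separate linear projections of the same input, not plain scalars.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def swish(x, beta=1.0):
    # Self-gated: the input scaled by a sigmoid "gate" between 0 and 1
    return x * sigmoid(beta * x)

def silu(x):
    # SiLU is Swish with beta = 1
    return swish(x, 1.0)

def gelu(x):
    # Exact GELU: the input weighted by the Gaussian CDF
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def swiglu(gate, value):
    # Gated activation: one projection passes through SiLU and gates the other
    return silu(gate) * value

for x in (-2.0, -0.5, 0.0, 1.0, 3.0):
    print(x, round(silu(x), 4), round(gelu(x), 4))
print("SwiGLU example:", round(swiglu(1.0, 2.0), 4))
```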

Functions vary widely
Table 1 shows several popular activation functions, including two that suffer from the vanishing gradient problem. It’s from formulas such as these that the question of implementation arises.

Table 1: Examples of activation functions, ranging from simple to complex. Source: Semiconductor Engineering

As the table shows, the complexity of the functions varies widely. The GELU function, in particular, has alternative approximations that simplify calculation, although their usefulness depends on how critical accuracy is. Swish, for its part, is built around the sigmoid function.
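
For example, the exact GELU relies on the Gaussian error function, while two widely published approximations substitute tanh or sigmoid terms that are cheaper to evaluate. A quick, illustrative comparison in Python:

```python
import math

def gelu_exact(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):     # common tanh-based approximation
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
                                      * (x + 0.044715 * x ** 3)))

def gelu_sigmoid(x):  # cheaper sigmoid-based approximation
    return x / (1.0 + math.exp(-1.702 * x))

worst_tanh = max(abs(gelu_exact(x / 100) - gelu_tanh(x / 100))
                 for x in range(-600, 601))
worst_sig  = max(abs(gelu_exact(x / 100) - gelu_sigmoid(x / 100))
                 for x in range(-600, 601))
print(worst_tanh, worst_sig)  # the tanh form tracks the exact curve more closely
```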

“The large and growing number of activation functions poses a challenge, particularly for low-power embedded accelerators and DSPs, which focus on efficient fixed-point inference,” noted Parikh. “The hardware needs to be flexible enough to handle evolving activation functions with good accuracy while remaining efficient.”

In theory, it would be nice if an augmentation to the training process could monitor the effectiveness of the chosen function and suggest an alternative. “Could you have AI that’s trained to find the optimal activation function for a particular design or for a particular layer?” Klein said. “It could watch the training process and say, ‘If we wiggle these one way or another, we can get an activation function that is going to better fit this set of data.’”

No such capability is available today.

Three implementation options are available, all of which have been employed in different chips — having a CPU perform the calculation, building the functions in hardware, and approximating the functions using lookup tables (LUTs). The best one depends both on the environment — data center vs. deeply embedded — and the requirements of the application.

Three alternative ways to execute
A CPU can handle any function, with accuracy limited only by the number system and CPU function units. Generality is the main benefit of this approach. If a new activation function becomes popular, even existing deployed units could migrate to it through a software update.

In a data center, this can be viable. “If we’re in the data center and we’ve got lots of CPUs at our disposal, the activation functions usually aren’t a big part of the overall computation, and doing them on CPUs is quick,” noted Klein.

With edge implementations, however, performance will suffer depending on the primitives available in the CPU architecture, and a limited number of CPUs would likely be a bottleneck for any sizeable model. Power would suffer as well, given the number of cycles some of the functions require. From a silicon-area standpoint, the system will likely already have at least one CPU, so the approach comes essentially for free, but at the cost of that bottleneck.

Hardened circuits can turn out results far more quickly than a CPU can — potentially in as little as a single clock, depending on the circuit design and clock rate. But they require dedicated silicon area. For a chip whose target application is tightly focused, a single activation function can serve with as many parallel units as necessary to complete the activations quickly.

“Hard silicon is useful when you’re trying to get performance and power,” said Sharad Chole, chief scientist and co-founder at Expedera. “We call this a fusion of operators, and then harden the fusion. That’s beneficial when you know that the activation function is going to be 40% of the execution time, and I need to get it down to 10%.”

Accuracy is determined by the number format and circuit design. Power also would likely benefit since, in theory, the much larger and more energy-hungry CPU could even sleep while the circuits compute the functions, and a single function unit should require far less energy than a CPU. Once shipped, however, that chip will not be able to pivot to a new activation function in the future.

ReLU is cheap and easy
The ReLU function is particularly attractive in hardware because it consists of a multiplexor. This helps explain why ReLU is so popular, even if other functions might give slightly more accurate results. “When ReLU is our activation function, it just says, ‘Is the number greater than zero? Great, use the number, or else use zero,’” said Klein. “That’s three lines of Verilog, and it works in a bunch of cases. But sometimes it doesn’t. What I’m doing with our customers is helping them migrate those algorithms off the CPU into dedicated hardware. If we have to do a lot of floating-point math, you end up with really big, slow implementations. So the activation function that I most often use is plain old ReLU.”

As a result, many models use nothing but ReLU, and that was standard for many of the early vision models. “With the exception of YOLO models, you have only ReLUs,” said Nigel Drego, co-founder and CTO of Quadric.

The third option is to approximate any function using a LUT. Since this can be just a simple small-SRAM lookup — especially for embedded models whose inputs have been quantized to a small number of possible values — performance should be fast. The benefit of a LUT is that it provides the ultimate in flexibility, possibly more than a CPU. The ability to reprogram a LUT to a new function gives the generality necessary for inference engines that anticipate a long life with varied workloads.

This came in handy for those YOLO models Drego mentioned. “When you get into YOLO, where they have GELU and its variants, you can implement them without going through the floating-point math using a lookup table,” he said.
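
A minimal sketch of that idea, assuming a hypothetical int8 quantization with a single fixed scale shared by input and output (real accelerators pick scales per tensor and handle saturation more carefully). The entire GELU curve collapses into a 256-entry table indexed directly by the quantized input code:

```python
import math

def gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

# Assumed toy quantization: real value = code * scale, same scale in and out
IN_SCALE, OUT_SCALE = 0.1, 0.1

# 256-entry table covering every possible int8 input code
lut = [max(-128, min(127, round(gelu(q * IN_SCALE) / OUT_SCALE)))
       for q in range(-128, 128)]

def gelu_lut(q_in):                # q_in is an int8 code in [-128, 127]
    return lut[q_in + 128]         # one small-SRAM lookup, no math at runtime

# Compare against the reference computation for a few codes
for q in (-100, -10, 0, 10, 100):
    ref = round(gelu(q * IN_SCALE) / OUT_SCALE)
    print(q, gelu_lut(q), ref)
```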

In embedded models, LUTs may require more precision than the matrix multiplications. “Convolutions, matrix multiplies, and other linear operations in large neural networks can easily be quantized to 8b or even 4b,” noted Parikh. “But nonlinear functions are much harder to quantize and often require higher precision.”

Larger LUTs may not be a free lunch, however. “As you’re scaling the lookup to higher bit precision values or different data types, [a LUT] might not always work based on the curve fit,” said Chole. “If it’s not always a monotonically increasing or decreasing curve, then the accuracy becomes challenging. At higher precisions, the table can get large, and so it’s worth investigating techniques to reduce the table size without significantly trading off accuracy. Fusing activation functions with preceding computation operators like convolution, matrix multiply, elementwise ops, etc., allows the use of higher precision for the activation computations without increasing storage or memory bandwidth.”
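
One well-known way to shrink such a table, though not necessarily the approach Chole has in mind, is to store only a sparse set of breakpoints and interpolate linearly between them. A rough Python sketch:

```python
import math

def gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

LO, HI, ENTRIES = -6.0, 6.0, 33          # 33 breakpoints instead of one per code
STEP = (HI - LO) / (ENTRIES - 1)
table = [gelu(LO + i * STEP) for i in range(ENTRIES)]

def gelu_pwl(x):
    # Outside the covered range a real design would pass x through (positive
    # side) or return 0 (negative side); here we simply clamp.
    x = max(LO, min(HI, x))
    pos = (x - LO) / STEP
    i = min(int(pos), ENTRIES - 2)
    frac = pos - i
    return table[i] + frac * (table[i + 1] - table[i])

worst = max(abs(gelu(x / 1000) - gelu_pwl(x / 1000)) for x in range(-6000, 6001))
print(worst)   # modest error from a tiny table; more breakpoints shrink it
```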

Variations on LUTs
One could even tier the LUTs based on accuracy and choose accordingly. “You can have a lookup table that’s super-fast and power-efficient,” explained Chole. “Then you can have a second-level table that’s maybe slightly slower, but you can still get better precision. Then you can build a third table that’s exact. Depending on the layer-specific precision needs, you can decide what type of lookup table to use.”

Rewritability also would be helpful in architectures not structured around a specific model. In such cases, no hardware is associated with any particular layer, so resources must be alterable as layers progress. If two layers had different activation functions, then a LUT serving both layers would require rewriting when switching functions. “If we’re doing a lookup table, then we would populate that lookup table on a per layer basis,” said Klein.

If, on the other hand, either a single function served every layer in the model or hardware were allocated by layer, a ROM LUT could be possible. Such an implementation wouldn’t permit future updates to the activation function, however.

For non-trivial embedded models, CPUs aren’t typically employed because of the performance hit, even though activation functions make up a small percentage of the overall inference work. That leaves hardened circuits and LUTs as the most common edge approaches.

“If we’re taking the entire implementation down into hardware, we don’t have a floating-point multiplier anymore,” explained Klein. “We don’t have a math co-processor lying around. We want to take advantage of things that will easily translate into hardware or simple lookup tables, and simple operations on the outputs of those lookup tables, and that’s going to give us a very fast, efficient implementation down in hardware.”

It can come down to the notion of “good enough,” as Expedera found for vision models. “You can run your images and then look at the output and see if the accuracy is sufficient,” said Prem Theivendran, director of software engineering at Expedera. “If you have to put in an FP16 function just for that activation layer, it takes up area, and that often isn’t worth it, so designers say, ‘We’ll just use the LUT.’”

Targeted vs. general
AI can serve a wide range of computing tasks, and the list is growing by the hour. The number of activation-function options is also growing, although some of the newer functions are motivated by new types of neural networks, such as the LLMs that are prevalent today. Deciding on a suitable implementation will depend strongly on the nature of the application.

Some systems specify narrowly targeted behavior. An inspection tool requiring AI to recognize images as representing good or bad units has a very specific brief. It must do one thing for all its life. The model itself could be updated with new weights that provide better accuracy over time, but as long as the chosen activation function remained the same, a hardware implementation could serve well, gaining higher performance and lower power at the expense of flexibility that it doesn’t need.

At the other end of the workload spectrum, an inference engine for use in a data center must take on all comers, and there’s no predicting what will be in the models. Any of the approaches could work there.

Edge architectures could also straddle the hardware and LUT techniques, providing function units for the most popular functions and an extra LUT for anything new that might arise. The function units would be more accurate, but the LUT could be good enough to keep the engine relevant and extend its lifetime. This would come, of course, at the expense of silicon area. But given the cost of some of these chips, a little extra silicon could pay for itself by putting off the purchase of a new board.
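
A sketch of how such a hybrid might be driven from software, using hypothetical names. The only point is that per-layer configuration selects either a hardened unit or a reprogrammable table:

```python
import math

# Hypothetical model of a hybrid engine: hardened units for the most common
# activations, plus one reprogrammable LUT for anything that comes along later.
HARDENED = {
    "relu": lambda x: max(0.0, x),
    "gelu": lambda x: 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0))),
}

class FallbackLUT:
    """Stands in for a small SRAM table reloaded on a per-layer basis."""
    def __init__(self, fn, scale=0.1):
        self.scale = scale
        self.table = [fn(q * scale) for q in range(-128, 128)]
    def __call__(self, x):
        q = max(-128, min(127, round(x / self.scale)))
        return self.table[q + 128]

def activation_for_layer(name, reference_fn=None):
    # Use a hardened unit when one exists; otherwise program the LUT
    if name in HARDENED:
        return HARDENED[name]
    return FallbackLUT(reference_fn)

# Example: a function (Mish) that the hardened units don't cover
mish = lambda x: x * math.tanh(math.log1p(math.exp(x)))
act = activation_for_layer("mish", mish)
print(act(1.5), mish(1.5))
```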

No one right answer
In practice, architects make the hardware/LUT decision differently according to their perception of their target market, particularly at the edge. Some embedded systems include hardware units, while others turn to LUTs. Clearly, each camp believes in its chosen approach. And as of now, we don’t appear to be done finding new functions. The question comes back to how easy they will be to implement at the edge.

“We’re constantly going to be inventing new ones, because you could slightly modify that curve and get a slightly better answer out of the network,” said Klein. “The data scientists are always going to want a more complicated curve, and they’re going to want to adjust it a little bit to get that slight additional accuracy. On the deeply embedded side, you’re going to look at it the other way and say, ‘Well, how can we simplify this?’”

That said, some say that newer functions will build off the mathematical pieces of older ones. “If you look at the ONNX operator list, it hasn’t been growing even linearly in the last six years,” said Chole. “So we’re not creating new operators or new math. What we’re doing is changing the way the operators are composed.”

At least one internet source lists more than 80 possible activation functions, most of which are obscure. If the industry settles on a few standard functions, then hardware would likely provide the most accurate, highest-performance result. An SRAM LUT can still provide quick answers, and it may give the architect a better night’s sleep knowing that, were a change to come, they could handle it.

But it’s still early days. There’s still much research underway, not just about activation functions, but about everything involved in AI. So it’s likely we’ll see a mix of implementations for quite some time.

Related Reading
What’s The Best Way To Sell An Inference Engine?
The hardware choices for AI inference engines are chips, chiplets, and IP. Multiple considerations must be weighed.
Normalization Keeps AI Numbers In Check
It’s mostly for data scientists, but not always.
GPU Or ASIC For LLM Scale-Up?
LLMs are just getting started.


