New AI Data Types Emerge

Several PPA considerations mean no single type of data is ideal for all AI models.


AI is all about data, and how that data is represented matters greatly. But after focusing primarily on 8-bit integers and 32‑bit floating-point numbers, the industry is now looking at new formats.

There is no single best type for every situation, because the choice depends on the type of AI model, whether accuracy, performance, or power is prioritized, and where the computing happens, with the edge being less forgiving.

While AI training will continue to use 32‑bit floating-point numbers (FP32), AI inference favors other types. Eight-bit integer (INT8) was the darling of convolutional neural networks (CNNs), with bigger or smaller integers employed where possible, either to boost accuracy or to reduce the demands on memory bandwidth. Floating-point formats work better for the large language models (LLMs) that are so popular today. But other types tread new ground.

Developers are still figuring out what to use where, since there is no one best format for every application. “It’s not physics,” said Ian Bratt, vice president of machine-learning technology and fellow at Arm. “There’s not the one true data type.”

Data footprint versus accuracy
Single-precision floating-point (32 bits) is considered the gold standard for accuracy, which is why AI training relies almost exclusively on that format. It requires more computing, storage, and bandwidth, but in theory, training happens once in a resource-rich data center, so one can justify the expenditure to get a model off to a good start. By contrast, inference happens repeatedly, so inefficiencies will be felt more strongly.

Choosing the optimal data format matters because it affects the cost of an engine, as well as the critical operational parameters of performance, power, and accuracy.

FP32’s accuracy comes at a cost of additional hardware, slower computation, and a higher energy expenditure per outcome.

The latter result is of growing concern. “Power used to be a secondary thought,” said Todd Koelling, senior director of product and solutions marketing at Synopsys. “But with AI driving the power up and up and up, it’s now at the forefront of everybody’s mind.”

Given the wide variety of inference use cases — data center, on-premise server, phone, IoT device — inference needs more flexibility in reducing its impact.

Since training largely sticks with FP32, dealing with other formats is a job that falls mostly to those looking to implement inference. In addition, the tradeoffs may differ depending on whether that inference occurs in a beefy data-center server or in a smartwatch. So it’s not a problem to be solved once for all instances of a given model.

INT8 became the default standard for CNNs. There were then discussions on getting down to INT4, INT2 — and even binary neural networks (BNNs) — but none of those formats have achieved commercial acceptance.

Computation versus storage
It’s important to distinguish the data type used for computation from the one employed for storage. To date, the two mostly have been the same, but accuracy is driven only by the data type used for computing. Memory bandwidth and capacity concerns can be met through compression. “AI involves storing and moving a lot of data,” noted Russ Klein, program director, high-level synthesis division at Siemens EDA.

For instance, INT8 data employed for computing may be stored as INT4 or some other, smaller format, but it’s decompressed before being used. This loophole allows a level of accuracy with less of a burden on bandwidth and capacity, provided the cost of compression and decompression is minimal and the compression is lossless. Computing performance and power aren’t affected, but the energy consumed by moving data around drops.

Some of the newer formats, while theoretically usable for computing, are being viewed primarily for their ability to save on storage and, more importantly, data communication. Some have irregular numbers of bits, requiring custom logic for computation. For the time being, many designers are turning to new formats only for the compression they provide, decompressing when fetching and recompressing when storing.

As an example, Bratt noted the available options for one of the newer six-bit formats: “You are free to keep everything always 6-bit and then build 6-bit data paths. You could also have it stored in your internal structure as 6-bit and then unpack it to 8-bit and work it on 8-bit data paths. Or you can even unpack it when you bring it in from memory into some internal structure.”
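
As a rough illustration of that second option, the Python sketch below packs hypothetical 6-bit weight codes into bytes for storage and unpacks them back to 8-bit values for the datapath. It assumes unsigned codes and is not tied to any particular commercial format.

```python
# Hypothetical sketch (not any vendor's actual format): store weights as packed
# 6-bit codes and unpack them to 8-bit integers for the compute datapath.
# Assumes unsigned codes in the range 0..63.

def pack_6bit(values):
    """Pack 6-bit integer codes into a compact byte string for storage."""
    bits, nbits, out = 0, 0, bytearray()
    for v in values:
        assert 0 <= v < 64
        bits = (bits << 6) | v
        nbits += 6
        while nbits >= 8:
            nbits -= 8
            out.append((bits >> nbits) & 0xFF)
    if nbits:                                   # flush any trailing bits, left-aligned
        out.append((bits << (8 - nbits)) & 0xFF)
    return bytes(out)

def unpack_6bit(data, count):
    """Unpack `count` codes back into plain 8-bit integers for computation."""
    bits, nbits, values = 0, 0, []
    stream = iter(data)
    while len(values) < count:
        if nbits < 6:
            bits = (bits << 8) | next(stream)
            nbits += 8
        nbits -= 6
        values.append((bits >> nbits) & 0x3F)
    return values

weights = [0, 1, 31, 63, 42, 7]
packed = pack_6bit(weights)                     # 6 values x 6 bits = 36 bits -> 5 bytes
assert unpack_6bit(packed, len(weights)) == weights
```

Storage and traffic shrink by a quarter relative to 8-bit, while the math units still see familiar 8-bit operands.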

Floating point adoption renewed
Which data type is “good enough” is typically determined empirically. This is done by running a series of models with different data types and comparing them to what FP32 would yield to decide the accuracy cost. This must be done with many models, since they won’t all be the same. This is how INT8 became the CNN favorite (but not INT4 et al).

Once attention-based models arrived, INT8 fell out of favor. “For CNNs, it was accepted that you can quantize to INT8 and not lose much accuracy, with a few exceptions,” said Gordon Cooper, product manager for ARC AI processors at Synopsys. “Transformers started to change that.”

For one thing, functions such as softmax operate on floating-point numbers, which by itself puts INT8 on notice. Beyond that, such model types have stuck with floating point throughout. “The issue with INT is that, as you’re progressing through the model, the error accumulates at each layer, and when it comes to the end, that’s where you lose those few percentage points of correlation,” explained Prem Theivendran, director of software engineering at Expedera.

Floating-point storage differs from integer storage in that it has three fields rather than two. Integers have a sign bit and a number. Floating-point numbers have a sign bit, an exponent, and a mantissa. The exponent specifies a power of two, while the mantissa (also called the fraction or significand) holds the rest of the number. The 31 bits remaining after the sign bit are split up, and the standard representation employs 8 bits for the exponent and 23 bits for the mantissa. Such a representation is sometimes notated as E8M23.
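
For readers who want to see the E8M23 split concretely, this short Python sketch pulls the three fields out of a standard FP32 bit pattern (the example value is arbitrary):

```python
# Minimal sketch of the E8M23 layout: extract the sign, exponent, and mantissa
# fields from a standard FP32 bit pattern.
import struct

def fp32_fields(x: float):
    bits = struct.unpack(">I", struct.pack(">f", x))[0]   # raw 32-bit pattern
    sign = bits >> 31                                      # 1 bit
    exponent = (bits >> 23) & 0xFF                         # 8 bits, biased by 127
    mantissa = bits & 0x7FFFFF                             # 23 bits
    return sign, exponent, mantissa

sign, exp, man = fp32_fields(-6.25)
# -6.25 = -1.5625 x 2^2, so sign=1, unbiased exponent=2, fractional mantissa=0.5625
print(sign, exp - 127, man / 2**23)    # 1 2 0.5625
```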

That split is important in that the exponent determines the dynamic range of representable numbers. The more exponent bits, the higher the possible exponent and the larger possible range. But those extra exponent bits come at the expense of the mantissa, and fewer mantissa bits means less precision and bigger gaps between representable numbers.

Although a half-precision FP16 format exists, it cuts down both the exponent and the mantissa to get to 16 bits: E5M10. That reduces both the dynamic range and the precision. An alternative being explored for AI is the Brain floating-point format (bfloat16, or BF16). With an E8M7 structure, it preserves the FP32 dynamic range at the expense of precision compared with FP16.
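
One way to see why BF16 preserves dynamic range is that it can be formed, to a first approximation, by simply keeping the top 16 bits of an FP32 pattern. The sketch below does exactly that, ignoring rounding, which real converters handle more carefully; the 3.0e38 test value is just an illustration.

```python
# Hedged sketch: a BF16 value can be formed (ignoring rounding) by keeping only the
# top 16 bits of an FP32 pattern, i.e. the sign, the full 8-bit exponent, and 7
# mantissa bits. Real converters round to nearest even rather than truncating.
import struct

def fp32_to_bf16_bits(x: float) -> int:
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 16                       # sign + 8-bit exponent + 7-bit mantissa

def bf16_bits_to_fp32(b: int) -> float:
    return struct.unpack(">f", struct.pack(">I", b << 16))[0]

x = 3.0e38                                  # near the top of FP32's range
print(bf16_bits_to_fp32(fp32_to_bf16_bits(x)))   # still finite, just less precise
# FP16 (E5M10) tops out around 65504, so the same value would overflow to infinity.
```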

“Operating on billions of parameters in 32‑bit floating point will hit the memory limit in no time,” said Arif Khan, senior product marketing group director in Cadence’s Silicon Solutions Group. “Switching to 16‑bit floating point is a naïve but inaccurate solution, as the loss of precision would be too great to be useful. Google proposed the bfloat16 format used for most calculations in the LLMs and for memory operations related to them.”

That said, BF16 hasn’t yet pervaded the industry. “Not too many people use bfloat,” said Theivendran. “Our customers don’t ask us for it.”

Other lesser-known floating-point variants include NVIDIA’s TensorFloat-32 format, with an odd 19‑bit size (E8M10), AMD’s FP24 (E7M16), and even a PXR24 format from Pixar (E8M15). Moving further, NVIDIA has proposed an FP8 format. It’s not simply a further truncation of FP32, because there would be insufficient bits for that. Instead, it comes in two versions, E5M2 for higher dynamic range and E4M3 for higher precision. Few other companies, if any, have signed onto this format, however. “I can’t recall getting a specific request for FP8,” said Paul Karazuba, vice president of marketing at Expedera.
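
The range-versus-precision tradeoff between the two FP8 variants can be estimated with a little arithmetic. The sketch below computes the approximate largest normal value of a generic "EeMm" minifloat; it deliberately ignores subnormals and the special-value encodings that the actual FP8 proposals define (the proposed E4M3, for instance, reclaims codes and reaches 448 rather than the naive 240 computed here).

```python
# Illustrative only: approximate largest normal value of a generic EeMm minifloat,
# ignoring subnormals and special-value encodings. Actual FP8 definitions differ
# in those details.

def max_normal(e_bits: int, m_bits: int) -> float:
    bias = 2 ** (e_bits - 1) - 1
    max_exp = (2 ** e_bits - 2) - bias       # top exponent code reserved for inf/NaN
    max_mantissa = 2 - 2 ** (-m_bits)        # 1.111...1 in binary
    return max_mantissa * 2 ** max_exp

print(max_normal(5, 2))   # E5M2: wide dynamic range, coarse steps (57344.0)
print(max_normal(4, 3))   # E4M3: narrower range, finer steps (240.0)
```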

Logarithms join the party
A few new data formats exploit logarithms. The obvious benefit with exponents is that multiplication becomes addition, meaning you don’t need multipliers. The challenge is the addition, though. Brute force would require transforming to logarithms for multiplication and transforming back into the linear domain for addition. But that’s not practical for high performance.

Recogni (pronounced like “recognize” without the “z”) is launching such a format, which it calls Pareto. It’s working on its second generation, with 8- and 16-bit versions. The approach takes a floating-point number and jettisons the mantissa, keeping the exponent. But rather than an integer exponent, which FP formats have, it allows fractional exponents.

For the first generation, it dealt with addition by using the so-called Mitchell approximation, which states that

log2(1+x) ≈ x for 0<x<1
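
To make the idea concrete, here is a minimal Python sketch of multiplication by way of Mitchell-approximated logarithms: take the approximate base-2 log of each operand, add, and convert back. It is only the textbook identity, not Recogni's production scheme or its error correction, and it assumes positive inputs (signs would be handled separately).

```python
# Minimal sketch of multiplication via Mitchell-approximated logarithms. This is
# just the textbook identity, not Recogni's actual format or error correction.
# Assumes positive inputs; signs would be handled separately.
import math

def mitchell_log2(n: float) -> float:
    e = math.floor(math.log2(n))       # integer exponent (the FP exponent field in hardware)
    x = n / 2**e - 1.0                 # fractional part of the significand, 0 <= x < 1
    return e + x                       # log2(n) = e + log2(1+x) ~= e + x

def mitchell_multiply(a: float, b: float) -> float:
    s = mitchell_log2(a) + mitchell_log2(b)    # multiplication becomes addition
    i = math.floor(s)
    f = s - i
    return 2**i * (1.0 + f)                    # inverse approximation: 2^f ~= 1 + f

print(mitchell_multiply(3.0, 5.0))   # 14.0 vs. the exact 15 (error can approach ~11%)
```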

The company also originally relied on quantization-aware training to train in this domain, but with several necessary training passes, such a technique is impractical for enormous models such as those necessary for generative AI.

That meant it had to move to post-training quantization, making the data format invisible to users. It also required a lot of research into how to deal with the small errors from the Mitchell approximation, which can accumulate in these large models. “You look at where the Mitchell approximation was applied, and then depending on how big that x was, you apply a different kind of error correction,” said Gilles Backhus, co-founder and vice president of AI and product at Recogni. He declined to provide further details, as the solution gives the company competitive differentiation as a data-center-class inference engine with no multipliers in its MAC array.

The company claims significant benefits. “When doing a multiply and add, you’ll probably save around 3X [67%] in area and power,” he said.

For that reason, Recogni won’t attempt to standardize its format. “We intend to keep this proprietary, at least for the foreseeable future,” said RK Anand, founder and chief product officer of Recogni.

Posits take a completely different approach, using a dynamic encoding scheme that concentrates accuracy on values near 1 and -1. However, this approach has not yet achieved commercial traction. “Certainly no one has come and asked us for posit support,” observed Nigel Drego, co-founder and chief technology officer at Quadric.

Block floating-point formats
A newly proposed family of data types starts from the observation that if a block of neighboring floating-point numbers all share the same exponent, storing that exponent with every value is wasteful. In the new proposal, that exponent would be segregated and stored separately, just once per block. “They’re what traditionally has been called a block floating-point format, so you have a shared exponent for a block of values,” said Arm’s Bratt. Microsoft and Meta have proposed MX data types that have this feature.

Specific MX formats identified in the paper are MX4, MX6, and MX9, selected partly with hardware and memory efficiency in mind. All three variants share the same size of shared exponent, whereas the mantissas differ, having 2, 4, and 7 bits, respectively. Although designers are theoretically free to use any bit-width they wish, they’re more likely to stick with one or two standard ones unless using more makes a big difference. “The diversity of data formats makes for a complex space given subtle trade-offs between efficiency, accuracy, and friction,” noted Rouhani et al. of Microsoft and Meta in their paper.
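
For intuition, the sketch below quantizes a small block of values to one shared exponent plus short signed mantissas and then reconstructs them. The field widths and scaling rule are illustrative, not the exact MX4/MX6/MX9 encodings described in the paper.

```python
# Rough sketch of the shared-exponent idea behind block floating point: one exponent
# is stored per block, plus a short signed mantissa per element. The widths and
# scaling rule here are illustrative, not the exact MX encodings.
import math

def quantize_block(values, mantissa_bits=4):
    max_abs = max(abs(v) for v in values)
    shared_exp = math.floor(math.log2(max_abs)) if max_abs > 0 else 0
    scale = 2.0 ** (shared_exp - (mantissa_bits - 2))   # largest element fits the mantissa range
    limit = 2 ** (mantissa_bits - 1) - 1
    mantissas = [max(-limit, min(limit, round(v / scale))) for v in values]  # clamp edge cases
    return shared_exp, mantissas

def dequantize_block(shared_exp, mantissas, mantissa_bits=4):
    scale = 2.0 ** (shared_exp - (mantissa_bits - 2))
    return [m * scale for m in mantissas]

block = [0.31, -0.08, 0.02, 0.45]
exp, mants = quantize_block(block)
print(exp, mants)                        # one exponent, four 4-bit mantissas
print(dequantize_block(exp, mants))      # approximate reconstruction of the block
```

The exponent is amortized across the whole block, which is where the storage savings come from.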

One of the major benefits claimed for the MX data types is that they can serve both for training and inference. That means inference hardware designed to support the format requires no quantization or data translation from the original trained model. “You can do the training in these smaller data types,” said Bratt. “And when you’re looking at training these gigantic models, you can save some energy and storage doing that.” 

Higher complexity for higher efficiency
Training and inference have come a long way using some very basic data types, principally FP32 and INT8. But we’re moving from an early-maturity stage where we’re just trying to figure out how to make training and inference work to a more advanced stage where we now want to perform those operations more efficiently. That effort is moving away from simple, intuitive data types to more complex versions.

In addition, some companies have moved beyond picking one for an entire model. “You’re starting to see mixed formats in the same network, layer by layer,” said Jason Lawley, product marketing director for AI IP at Cadence. “But your software tools have to support the ability to do that.”

Technology comes with a natural tension between doing things simply and doing them efficiently. There are many examples of more efficient ideas that never took hold because they were too complex, while the simple ones were good enough. What happens with AI depends on how well purveyors of AI technology can embed the data types in a way that ensures users don’t need to understand them in detail.

AI is already a field where a few data scientists wrestle with the intricacies of the technology. Commercial success typically comes to those companies that can abstract away those details through software, simplifying the process for users. Most engineers should be able to use the technology without opening the hood and peering inside.

If data types can be similarly abstracted, then complex ones may be able to take hold. If engineers implementing inference must manually decide which data types to employ for which layers or portions of a model, the resulting friction may discourage their widespread use. The challenge for the AI community lies in building tools that can bring in these efficiencies without requiring users to become experts.

What seems clear, however, is that the search for even better data types isn’t over. “In the CNN domain, we found that INT8 was all that was required, and everyone standardized on that,” said Quadric’s Drego. “In the area of large language models, it would not surprise me at all if we ended up with a dominant data format. I just don’t think we’ve seen it yet.”

Arm’s Bratt agreed: “There will be more data types in the future, I’m sure.”



