Advantages Of BFloat16 For AI Inference

Getting a good balance between inference throughput, accuracy, and ease of use.


Essentially all AI training is done with 32-bit floating point.

But doing AI inference with 32-bit floating point is expensive, power-hungry and slow.

And quantizing models to 8-bit integer (INT8), which delivers the highest throughput at the lowest power, is a major investment of money, scarce engineering resources and time.

Now BFloat16 (BF16) offers an attractive balance for many users. BF16 delivers essentially the same prediction accuracy as 32-bit floating point while greatly reducing power and improving throughput, with no investment of time or money.

BF16 has exactly the same exponent size as 32-bit floating point, so converting a 32-bit floating point number is a simple matter of truncating (or, more precisely, rounding) the fraction from 23 bits to 7 bits.
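The conversion described above can be sketched in a few lines. This is a minimal illustration (not any vendor's implementation): it keeps FP32's sign bit and 8 exponent bits, and drops the low 16 bits of the fraction using round-to-nearest-even rather than plain truncation.

```python
import struct

def fp32_to_bf16_bits(x: float) -> int:
    """Convert a float to a 16-bit BF16 bit pattern (returned as an int).

    BF16 keeps FP32's sign bit and 8 exponent bits, so only the fraction
    is shortened: from 23 bits to 7. We round to nearest even on the
    16 dropped bits instead of simply truncating.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # FP32 bit pattern
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)          # nearest-even bias
    return (bits + rounding_bias) >> 16                  # keep the top 16 bits

def bf16_bits_to_fp32(bf16: int) -> float:
    """Widen a BF16 bit pattern back to FP32 by appending 16 zero bits."""
    return struct.unpack("<f", struct.pack("<I", bf16 << 16))[0]
```

Because the exponent is untouched, the round trip preserves the full FP32 dynamic range; only fraction precision is reduced (to roughly 2-3 significant decimal digits).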

With this conversion, a model can be run quickly on any accelerator that supports BF16. Compared to 32-bit floating point, throughput will approximately double with approximately half the memory bandwidth (and power). It might seem that dropping so many fraction bits would cut prediction accuracy, but Google said in a recent article: “Based on our years of experience training and deploying a wide variety of neural networks across Google’s products and services, we knew when we designed Cloud TPUs that neural networks are far more sensitive to the size of the exponent than that of the mantissa.”

Note that accelerators that support FP16 do not offer such an easy conversion, since FP16's exponent is smaller (5 bits instead of 8). Converting an FP32 model to FP16 requires an effort similar to INT8 quantization.
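The range problem is easy to demonstrate with Python's standard library, whose `struct` module supports IEEE half precision via the `"e"` format code. FP16's 5-bit exponent tops out near 65504, so FP32 values beyond that overflow, while BF16 (same 8-bit exponent as FP32) would accept them unchanged.

```python
import struct

def fits_in_fp16(x: float) -> bool:
    """Return True if x is within IEEE half-precision (FP16) range."""
    try:
        struct.pack("<e", x)  # "e" = IEEE 754 binary16 (half precision)
        return True
    except OverflowError:     # raised when |x| exceeds FP16's max (~65504)
        return False

print(fits_in_fp16(1000.0))    # True
print(fits_in_fp16(70000.0))   # False: beyond FP16's 5-bit exponent range
```

Out-of-range activations or weights like this are why an FP32-to-FP16 port typically needs rescaling work, much like INT8 quantization.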

The silicon savings are even more significant, as Google said in a recent article: “The physical size of a hardware multiplier scales with the square of the mantissa width. With fewer mantissa bits than FP16, the bfloat16 multipliers are about half the size in silicon of a typical FP16 multiplier, and they are eight times smaller than an FP32 multiplier!”
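Google's ratios can be checked with back-of-the-envelope arithmetic, assuming multiplier area scales with the square of the significand width (stored mantissa bits plus the implicit leading 1); the exact figures depend on the multiplier design.

```python
# Significand width = stored mantissa bits + 1 implicit leading bit.
significand_bits = {"FP32": 23 + 1, "FP16": 10 + 1, "BF16": 7 + 1}

# Area ~ significand width squared; express relative to BF16.
bf16_area = significand_bits["BF16"] ** 2
for fmt, bits in significand_bits.items():
    print(f"{fmt}: {bits ** 2 / bf16_area:.1f}x the BF16 multiplier area")
```

This yields roughly 1.9x for FP16 and 9x for FP32, consistent with Google's "about half the size" and "eight times smaller" figures.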

Google invented BF16 for its Cloud TPUs, and the list of companies supporting it in their accelerators now includes Arm, Flex Logix, Habana Labs, Intel and Wave Computing.

BF16 won’t eliminate INT8 because INT8 can again double throughput at half the memory bandwidth. But for many users, it will be much easier to get started on an accelerator with BF16 and switch to INT8 later when the model is stable and the volumes warrant the investment.

Given these advantages, BF16 adoption is likely to approach 100% of accelerators shipped as PCIe or other card formats.

Among inference IP available for integration into SoCs, all options are INT-only except Flex Logix's nnMAX, which offers BF16 as well as INT.
