Advantages Of BFloat16 For AI Inference

Striking a good balance between inference throughput, accuracy, and ease of use.


Essentially all AI training is done with 32-bit floating point.

But doing AI inference with 32-bit floating point is expensive, power-hungry and slow.

And quantizing models to 8-bit integer (INT8), which delivers the highest throughput at the lowest power, requires a major investment of money, scarce engineering resources and time.

Now BFloat16 (BF16) offers an attractive balance for many users: essentially the same prediction accuracy as 32-bit floating point, with much lower power and higher throughput, and no investment of time or money.

BF16 has exactly the same exponent size as 32-bit floating point (FP32), so converting FP32 numbers is a simple matter of truncating (or, more precisely, rounding) the fraction from 23 bits to 7 bits.
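
For illustration, here is a minimal NumPy sketch of that conversion. The function names fp32_to_bf16 and bf16_to_fp32 are illustrative, not taken from any particular framework; real frameworks and accelerators do this in their kernels or directly in hardware.

import numpy as np

def fp32_to_bf16(x):
    # Keep FP32's sign and 8-bit exponent; round the 23-bit fraction to 7 bits
    # (round-to-nearest-even) and drop the low 16 bits of each 32-bit word.
    bits = np.asarray(x, dtype=np.float32).view(np.uint32).astype(np.uint64)
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16).astype(np.uint16)

def bf16_to_fp32(bf16_bits):
    # Expand BF16 bit patterns back to FP32 by appending 16 zero bits.
    return (bf16_bits.astype(np.uint32) << 16).view(np.float32)

w = np.array([3.14159265, -0.0001234, 1000.5], dtype=np.float32)
print(bf16_to_fp32(fp32_to_bf16(w)))  # matches the FP32 values to roughly 2-3 decimal digits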

With this conversion, a model can run quickly on any accelerator that supports BF16. Compared to 32-bit floating point, throughput roughly doubles and memory bandwidth (and power) is roughly halved. It might seem that dropping so many fraction bits would hurt prediction accuracy, but Google said in a recent article: “Based on our years of experience training and deploying a wide variety of neural networks across Google’s products and services, we knew when we designed Cloud TPUs that neural networks are far more sensitive to the size of the exponent than that of the mantissa.”

Note that accelerators that support FP16 do not offer such an easy conversion, since FP16’s exponent is smaller (5 bits instead of 8), which shrinks the representable range. Converting an FP32 model to FP16 therefore requires an effort similar to INT8 quantization.
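
A quick numerical illustration of that range difference, using the ml_dtypes package for a NumPy bfloat16 type (the package choice is an assumption for this sketch, not something the article depends on):

import numpy as np
from ml_dtypes import bfloat16  # assumption: ml_dtypes is installed

# FP32 values spanning a range that is routine for weights and activations.
x = np.array([1.0e-8, 0.05, 7.5, 70000.0], dtype=np.float32)

# BF16 keeps FP32's 8-bit exponent, so the dynamic range is preserved;
# only precision is reduced.
print(x.astype(bfloat16).astype(np.float32))   # approx. [1.0e-08  5.0e-02  7.5  7.0e+04]

# FP16 has a 5-bit exponent (max normal ~65504, min normal ~6.1e-5), so
# out-of-range values flush toward zero or overflow to infinity, which is
# why converting an FP32 model to FP16 usually needs rescaling and calibration.
print(x.astype(np.float16).astype(np.float32))  # approx. [0.0  5.0e-02  7.5  inf]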

The silicon savings are even more significant. As Google also noted: “The physical size of a hardware multiplier scales with the square of the mantissa width. With fewer mantissa bits than FP16, the bfloat16 multipliers are about half the size in silicon of a typical FP16 multiplier, and they are eight times smaller than an FP32 multiplier!”
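
As a rough sanity check of those ratios, assume multiplier area grows with the square of the significand width (fraction bits plus the implicit leading 1); real area depends on the full multiplier design, so this is only a back-of-the-envelope sketch.

# Significand widths including the implicit leading 1 bit.
significand_bits = {"FP32": 24, "FP16": 11, "BF16": 8}
area = {fmt: bits ** 2 for fmt, bits in significand_bits.items()}
print(area["FP16"] / area["BF16"])  # ~1.9x, i.e. a BF16 multiplier is about half an FP16 multiplier
print(area["FP32"] / area["BF16"])  # 9.0x, in line with the "eight times smaller" claim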

Google first invented BF16 for its 3rd-generation TPU, and the list of companies supporting it in their accelerators now includes ARM, Flex Logix, Habana Labs, Intel and Wave Computing.

BF16 won’t eliminate INT8 because INT8 can again double throughput at half the memory bandwidth. But for many users, it will be much easier to get started on an accelerator with BF16 and switch to INT8 later when the model is stable and the volumes warrant the investment.

Given these advantages, it is likely that BF16 support will become universal in accelerators shipped as PCIe or other card formats.

For inference IP intended for integration into SoCs, the available options are all INT-only, with the exception of Flex Logix nnMAX, which offers BF16 as well as INT.


