Knowledge Center

Tensor Processing Unit (TPU)

Google-designed ASIC processing unit for machine learning that works with TensorFlow ecosystem.


A tensor processing unit (TPU)—sometimes referred to as a TensorFlow processing unit—is a special-purpose accelerator for machine learning. It is a processing IC designed by Google to handle neural network processing using TensorFlow. TPUs are ASICs (application-specific integrated circuits) that accelerate specific machine learning workloads using processing elements—small DSPs with local memory—connected on a network so that the elements can communicate with each other and pass data through.

TensorFlow is an open-source platform for machine learning used in image classification, object detection, language modeling, and speech recognition, among other applications.

TPUs have libraries of optimized models, use on-chip high-bandwidth memory (HBM), and in each core have scalar, vector, and matrix units (MXUs). Each MXU performs 16K (16,384) multiply-accumulate operations per cycle. Inputs and outputs are 32-bit floating point, but the MXU simplifies the arithmetic internally using the bfloat16 format. Cores execute user computations (XLA ops) independently. Google offers access to Cloud TPUs on its servers.
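The bfloat16 trick mentioned above is simple at the bit level: bfloat16 is just the top 16 bits of a float32, keeping the full 8-bit exponent but only 7 mantissa bits. The sketch below (plain Python, not TPU code) illustrates the conversion; the function names are illustrative, not part of any TensorFlow API.

```python
import struct

def float32_to_bfloat16_bits(x: float) -> int:
    """Truncate a float32 to bfloat16 by keeping its top 16 bits
    (1 sign bit, 8 exponent bits, 7 mantissa bits)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return bits >> 16

def bfloat16_bits_to_float32(b: int) -> float:
    """Expand bfloat16 bits back to float32 by zero-padding the mantissa."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]

# bfloat16 keeps float32's full exponent range, so magnitudes survive;
# only mantissa precision is lost (7 bits instead of 23).
print(bfloat16_bits_to_float32(float32_to_bfloat16_bits(1.0)))      # 1.0 (exact)
print(bfloat16_bits_to_float32(float32_to_bfloat16_bits(3.14159)))  # 3.140625
```

Because the exponent field is unchanged, float32 values can be demoted to bfloat16 and back without overflow or underflow, which is why it suits matrix-unit inputs so well.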

Google says TPUs are useful for:

  • Models dominated by matrix computations
  • Models with no custom TensorFlow operations inside the main training loop
  • Models that train for weeks or months
  • Larger and very large models with very large effective batch sizes

Otherwise, CPUs and GPUs are better suited to quick prototyping, simple models, small and medium batch sizes, pre-existing code that cannot be changed, and certain classes of math problems, among others. See more at Cloud Tensor Processing Units (TPUs).

It became apparent to Google in 2013 that it would have to double the number of data centers it operated unless it could design a chip that could handle machine learning inferencing. The resulting TPU, Google says, has “15–30X higher performance and 30–80X higher performance-per-watt than contemporary CPUs and GPUs.”

“The fundamental trend that drives that phenomenon is specialization versus general-purpose. Using a GPU from Nvidia for an ML application is about 84% inefficient. You waste 84% of that part. If you’re deploying millions and millions of graphics processors at Google, you’ve got a pretty big incentive to go build a TPU instead of buying a GPU from Nvidia. That’s true across the board,” said Jack Harding of eSilicon.

TensorFlow processing unit architecture. Source: Google

The latest Google TPU contains 65,536 8-bit MAC blocks and consumes so much power that the chip has to be water-cooled. The power consumption of a TPU is likely between 200 W and 300 W.
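To see what 65,536 MAC blocks buy in raw throughput, a back-of-the-envelope calculation helps. The clock rate below is an assumption for illustration (700 MHz is the figure reported for the first-generation TPU, which had the same 256 × 256 MAC array); the real clock varies by generation.

```python
macs_per_cycle = 65_536   # a 256 x 256 systolic array of 8-bit MAC blocks
clock_hz = 700e6          # assumed clock; 700 MHz was reported for the original TPU
ops_per_mac = 2           # each MAC counts as two ops: one multiply + one add

# Peak 8-bit operations per second at full array utilization.
peak_ops = macs_per_cycle * clock_hz * ops_per_mac
print(f"{peak_ops / 1e12:.0f} TOPS peak")  # 92 TOPS
```

Real workloads rarely keep the array fully fed, so sustained throughput is lower; the number is a ceiling, not a benchmark.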

Versions of the TPU include single device and pod configurations:

  • Cloud TPU v2
    • 180 teraflops
    • 64 GB HBM
  • Cloud TPU v3
    • 420 teraflops
    • 128 GB HBM
  • Cloud TPU v2 Pod (beta)
    • 11.5 petaflops
    • 4 TB HBM
    • 2-D toroidal mesh network
  • Cloud TPU v3 Pod (beta)
    • 100+ petaflops
    • 32 TB HBM
    • 2-D toroidal mesh network
  • Edge TPU Inference Accelerator

Pods are multiple devices linked together. See Google’s TPU pages for more information.
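The 2-D toroidal mesh listed for the pod configurations means the chips form a grid whose edges wrap around, so every chip has exactly four neighbors and no chip sits on a boundary. A minimal sketch of the wrap-around addressing (hypothetical coordinates, not Google's actual routing code):

```python
def torus_neighbors(x, y, width, height):
    """Return the four neighbors of chip (x, y) in a width x height
    2-D toroidal mesh. Edges wrap, so corner chips still have four links."""
    return [
        ((x - 1) % width, y),   # west
        ((x + 1) % width, y),   # east
        (x, (y - 1) % height),  # north
        (x, (y + 1) % height),  # south
    ]

# A corner chip in an 8x8 torus gets two of its four neighbors via wraparound.
print(torus_neighbors(0, 0, 8, 8))  # [(7, 0), (1, 0), (0, 7), (0, 1)]
```

The wrap-around links halve the worst-case hop count versus a plain mesh, which matters for the all-reduce communication patterns used in large-batch training.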

Google TPU module, 2019

Inferencing Efficiency