Porting functions that don’t exist in common graph manipulation tools to an NPU.
A variety of new and complicated transformer models have emerged in the past 18 to 24 months as "must-have" networks for advanced automotive use cases. These novel architectures often introduce new network operators, or novel ways of combining tensors – often from different types of sensors – that enhance detection and recognition of objects in L3 / L4 / L5 ADAS and autonomous driving software stacks. Examples include the Deformable Attention Transformer (DAT) and DAT++, SegFormer, the Detection Transformer (DETR), and Bird's Eye View depth estimation – BEVDepth. The new functions and operations in these advanced networks can pose extreme porting challenges for SoC programmers trying to run the models on existing NPU accelerators!
If the SoC team – or the car platform team – is lucky, a new graph operator in a novel AI network can be supported by the existing hardwired, less-than-fully-programmable NPU. If they are unlucky, the operation needs to fall back to the slow companion CPU or DSP – which usually means performance so slow as to be unviable. But what if the breakthrough function is so new, so novel, that it doesn't exist in the common graph manipulation tools? What if the new function is not a graph operator at all?
One such function is the Voxel Pooling operation found in the BEVDepth network (available on Megvii's public GitHub repo). A trusty AI search engine describes Voxel Pooling as: "a technique where 3D point cloud data from multiple camera views is aggregated into a bird's-eye view (BEV) feature map by dividing the space into voxels (3D grid cells) and combining the features of points falling within each voxel, essentially creating a unified representation of the scene in a 2D grid format, which is then used for further processing like object detection." Sounds complicated. In fact, in the specific case of BEVDepth, the complex Voxel Pooling function is written as a custom CUDA kernel, because it has no equivalent built-in graph operator in PyTorch today – and certainly none in the interchange formats commonly used by NPU vendor toolchains: ONNX and TFLite.
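Stripped of the jargon, the core idea is simple: each 3D point is assigned to a grid cell, and its feature vector is folded into that cell's running aggregate. As a minimal illustration of the binning step only – the names and parameters below are our own, not identifiers from the BEVDepth source – the cell assignment looks something like this:

    #include <cmath>

    // Illustrative sketch only: map a continuous (x, y) position to an integer
    // bird's-eye-view grid cell. x_min, y_min and voxel_size are assumed grid
    // parameters, not names taken from the BEVDepth code.
    struct BevCell {
      int ix;
      int iy;
    };

    BevCell to_bev_cell(float x, float y,
                        float x_min, float y_min, float voxel_size) {
      BevCell cell;
      cell.ix = static_cast<int>(std::floor((x - x_min) / voxel_size));
      cell.iy = static_cast<int>(std::floor((y - y_min) / voxel_size));
      return cell;
    }

Every point that lands in the same (ix, iy) cell then has its features combined – typically summed – which collapses the z dimension and yields the 2D BEV feature map.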
So how does the algorithm porting team port Voxel Pooling onto an embedded NPU if the function is not represented in the graph interchange format supported by the NPU vendor's graph compiler toolchain? With conventional NPUs, the team is stuck – there's simply no way to get that important new function ported quickly to the NPU, and very likely it can never be ported. BEVDepth might simply not be able to run on the platform at all.
There is one embedded AI processor on the market that does support novel, custom operator functions written in C++. Quadric's Chimera GPNPU – general purpose NPU – is programmed in C++. The advanced Chimera Graph Compiler (CGC) ingests networks in the ONNX format – from PyTorch, TensorFlow, or any other training framework. CGC then converts the entire network into optimized C++ for later compilation by the LLVM compiler. But for functions like Voxel Pooling that are written in pure CUDA, no direct compilation is possible – not with our CGC graph compiler, nor with any other NPU graph compiler. The solution for Voxel Pooling is to capture the function as a custom kernel in our CCL dialect of C++ – exactly analogous to the way BEVDepth's Voxel Pooling was written as a custom CUDA kernel for the Nvidia platform.
But is writing Voxel Pooling – or similar functions – difficult? Not for a skilled programmer who can already write CUDA code. The entire Voxel Pooling kernel is some sixty lines of CCL code (including comments!); the full listing is available in the Quadric Dev Studio tutorial described below.
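To give a flavor of what that kernel has to do, here is a simplified, generic C++ sketch of scatter-style voxel pooling – our own illustrative code with invented names, not the CCL source – assuming per-point integer voxel coordinates and sum-pooling into the BEV grid:

    #include <cstdint>

    // Illustrative sketch only (not the CCL kernel): scatter-style voxel pooling.
    //   geom  : per-point integer voxel coordinates, laid out [num_points][3] as (x, y, z)
    //   feats : per-point feature vectors,           laid out [num_points][num_channels]
    //   bev   : output BEV feature map,              laid out [grid_y][grid_x][num_channels]
    // Points outside the grid are skipped; points landing in the same (x, y)
    // cell have their features summed, collapsing z into the 2D bird's-eye plane.
    void voxel_pooling_sketch(const int32_t* geom, const float* feats, float* bev,
                              int num_points, int num_channels,
                              int grid_x, int grid_y, int grid_z) {
      for (int p = 0; p < num_points; ++p) {
        const int32_t x = geom[p * 3 + 0];
        const int32_t y = geom[p * 3 + 1];
        const int32_t z = geom[p * 3 + 2];
        // Discard points that fall outside the voxel grid.
        if (x < 0 || x >= grid_x || y < 0 || y >= grid_y || z < 0 || z >= grid_z) {
          continue;
        }
        // Accumulate this point's feature vector into its BEV cell.
        float* cell = bev + (static_cast<int64_t>(y) * grid_x + x) * num_channels;
        const float* f = feats + static_cast<int64_t>(p) * num_channels;
        for (int c = 0; c < num_channels; ++c) {
          cell[c] += f[c];
        }
      }
    }

(In the original BEVDepth code, the same loop is parallelized across CUDA threads using atomic adds.)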
That custom C++ kernel is quickly stitched together with the auto-generated C++ produced by the CGC graph compiler, and the full network then runs entirely on the Chimera GPNPU – none of the graph needs to run on any other IP block: no CPU load, no companion DSP needed.
The skeptical reader (you!) is probably already thinking: "OK, fine, you can run Voxel Pooling. But at what speed? If it's not fast, it's not useful!" Fear not! The implementation of Voxel Pooling on the Chimera GPNPU is actually faster than the original CUDA code running on a 450-watt Nvidia RTX 3090 board (with the GA102 GPU chip). The Chimera GPNPU core – burning just a couple of watts (your mileage will vary with process node, clock frequency, and choice of off-chip memory interface) – outperforms the 450 W full-chip GPU by a factor of 2X.
The skeptical reader might also wonder, "How can that be true?" Simple. GPUs – such as the RTX 3090 – are designed for ease of use in model training and for drawing polygons, aimed mostly at data centers where electricity use is not priority #1. GPUs use cache-based architectures (caches burn power!) and hardware-heavy warp-and-thread programming models. The Chimera GPNPU, by contrast, is designed for embedded use. Quantized models, ahead-of-time offline compilation, and a DMA-based memory hierarchy all save gobs and gobs of power.
Want to know more about the implementation of Voxel Pooling on Chimera cores? We have a detailed tutorial – including source code, performance profiling, and more data – for registered users of the Quadric Dev Studio.
The Chimera GPNPU runs all AI/ML graph structures. And it runs non-graph code – so you can run things currently captured in CUDA, C++ (for DSP functions), and even some Python! The Chimera GPNPU integrates fully programmable 32-bit ALUs with systolic-array-style matrix engines in a fine-grained architecture: up to 1,024 ALUs in a single core, with only one instruction fetch and one AXI data port. That's over 32,000 bits of parallel, fully programmable performance. Scalable up to 864 TOPS for bleeding-edge ADAS applications, Chimera GPNPUs have matched and balanced compute throughput for both MAC and ALU operations, so no matter what type of network you choose to run, it runs fast, at low power, and with high parallelism. Learn more at www.quadric.io.