An emerging AI architecture for embedded autonomy challenges edge efficiency.
The AI model type capturing the most attention across robotics and autonomous vehicles right now is the vision-language-action model, or VLA. At embedded AI conferences this year, particularly the recently held Embedded Vision Summit, VLAs were a main topic of discussion – not as a research curiosity, but as the architecture that teams building autonomous systems are actively targeting. If you design silicon for robots or autonomous vehicles, you will encounter them and will need to react to VLAs as they continue to rapidly evolve.
This post covers what VLAs are, how they’re built, where they become hard to deploy on embedded hardware, and what it takes to run one efficiently at the edge.
A vision-language-action model is an end-to-end neural network that takes sensor inputs (camera images, joint positions, natural-language instructions) and outputs a sequence of physical actions. A VLA replaces the traditional perception-planning-control stack with a single unified model that learns to do all of it.
The VLA name is literal. Three components chain together: a vision encoder, a language model, and an action expert.
Each of those components is built from transformer layers—multi-head attention and feed-forward networks—that SoC architects are likely already familiar with from ViT and LLM work. The novelty in VLAs is in how the three components connect, how they’re trained jointly, and a small number of new operators that appear in the action expert. Let’s examine one specific VLA for more detail.
Pi-0.5 is a 3.3-billion-parameter open-source VLA released by Physical Intelligence. The model weights and architecture are public, and there is published performance data to compare against, which makes it a good concrete reference to examine in our quest to better understand VLAs.

The architecture runs in three stages:
Stage 1 — SigLIP encoder (~400M parameters). A vision transformer (ViT) that processes 16×16-pixel patches from each camera image and produces a set of 256 patch tokens per camera. This stage of the model is compute-dominated.
Stage 2 — Gemma 2B language model (~2.6B parameters). A decoder-only LLM that takes the vision tokens, a text instruction describing the task, and the robot’s current joint positions, and builds a representation of the situation. Structurally, this is the prefill phase of a standard LLM inference request: both compute-intensive and bandwidth-intensive, since loading 2.6B parameters from DRAM for each inference is unavoidable.
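As a rough sanity check on the bandwidth point, the time needed to stream the Gemma weights once from DRAM can be estimated from parameter count, bytes per parameter, and memory bandwidth. The sketch below is a back-of-the-envelope calculation, not a measurement. The 273 GB/s figure is the bandwidth quoted in the comparison table later in this post; the INT8 and BF16 weight widths and the function name are assumptions chosen for illustration.

```python
def weight_stream_ms(params: float, bytes_per_param: float, bandwidth_gb_s: float) -> float:
    """Lower bound on the time (ms) to read every weight once from DRAM."""
    total_bytes = params * bytes_per_param
    return total_bytes / (bandwidth_gb_s * 1e9) * 1e3

GEMMA_PARAMS = 2.6e9      # parameter count quoted above
BANDWIDTH_GB_S = 273.0    # assumed: bandwidth figure from the comparison table

int8_ms = weight_stream_ms(GEMMA_PARAMS, 1, BANDWIDTH_GB_S)   # assumed 1 byte per weight
bf16_ms = weight_stream_ms(GEMMA_PARAMS, 2, BANDWIDTH_GB_S)   # assumed 2 bytes per weight
print(f"INT8: {int8_ms:.1f} ms, BF16: {bf16_ms:.1f} ms")
```

Under these assumptions the weight traffic alone costs roughly 9.5 ms at INT8 and about twice that at BF16, before any compute or activation traffic is counted.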
Stage 3 — Action expert (~300M parameters). A transformer decoder that starts with a noise vector representing randomly initialized candidate actions, then uses a cross-attention mechanism against the Gemma output to iteratively refine those actions. The refinement loop runs approximately 10 times per inference—this is called flow matching. Each iteration asks: given what the model understands about the scene and the task, how should these candidate actions be corrected? After 10 iterations, the model outputs roughly 50 action tokens describing what the robot should do in the next short time window.
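The refinement loop can be pictured as numerically integrating a velocity field from noise toward actions. The sketch below is a toy illustration only: `toy_velocity` is a hypothetical stand-in for the action expert (a real VLA would run a transformer pass conditioned on the Gemma output), the time convention (noise at t = 0, actions at t = 1) is an assumption, and the `target` values are made up so that the loop has something to converge to.

```python
import random

NUM_STEPS = 10      # refinement iterations per inference, as described above
NUM_ACTIONS = 50    # action tokens produced per inference

def toy_velocity(actions, t, target):
    """Hypothetical stand-in for the action expert: the velocity that moves
    each value along a straight line so that it reaches `target` at t = 1."""
    return [(g - a) / (1.0 - t) for a, g in zip(actions, target)]

def refine_actions(target, num_steps=NUM_STEPS, seed=0):
    """Euler integration from random noise (t = 0) toward actions (t = 1)."""
    rng = random.Random(seed)
    actions = [rng.gauss(0.0, 1.0) for _ in target]   # noise initialization
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        velocity = toy_velocity(actions, t, target)   # one network pass in a real VLA
        actions = [a + dt * v for a, v in zip(actions, velocity)]
    return actions

target = [0.01 * i for i in range(NUM_ACTIONS)]       # made-up "correct" actions
result = refine_actions(target)
max_error = max(abs(r - g) for r, g in zip(result, target))
print(f"max error after {NUM_STEPS} steps: {max_error:.2e}")
```

The point of the sketch is the shape of the loop: every iteration is a full pass through the action expert, so any per-pass overhead is paid ten times per inference.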
The action expert in Pi-0.5 introduces one graph operator generally not found in earlier standard transformer stacks: AdaRMSNorm (adaptive RMS normalization). AdaRMSNorm conditions its normalization parameters on context, specifically on which refinement step the flow-matching loop is currently executing. Because the operator is uncommon in prior state-of-the-art vision and language transformers, it is unlikely to be supported by the fixed-function accelerators found in most heterogeneous NPU architectures. Where it does not map to fixed-function logic, the operation must fall back to the CPU or DSP paired with the accelerator, and AdaRMSNorm becomes a critical performance bottleneck on those NPU solutions.
The three stages of the Pi-0.5 model put different demands on hardware. The vision encoder needs sustained compute throughput. The language model needs compute throughput and memory bandwidth. The action expert’s flow-matching loop is lighter per step but runs 10 times, and its AdaRMSNorm requires general-purpose compute capability that fixed-function accelerators generally don’t provide.
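To make the operator concrete, here is a minimal sketch of the general adaptive-normalization pattern: plain RMS normalization followed by a scale and shift computed from a conditioning vector. The exact parameterization used in Pi-0.5 may differ; the function names, the `1 + scale` form, and the weight matrices `w_scale` and `w_shift` are assumptions for illustration.

```python
import math

def rms_norm(x, eps=1e-6):
    """Plain RMS normalization: divide by the root-mean-square of x."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def matvec(w, c):
    """Multiply matrix w (list of rows) by vector c."""
    return [sum(wij * cj for wij, cj in zip(row, c)) for row in w]

def ada_rms_norm(x, cond, w_scale, w_shift, eps=1e-6):
    """Adaptive RMSNorm sketch: scale and shift are linear functions of a
    conditioning vector (for example, an embedding of the refinement step)."""
    scale = matvec(w_scale, cond)
    shift = matvec(w_shift, cond)
    normed = rms_norm(x, eps)
    return [n * (1.0 + s) + b for n, s, b in zip(normed, scale, shift)]

x = [1.0, 2.0, 3.0, 4.0]
cond = [0.5, -0.5]                                  # hypothetical step embedding
zeros = [[0.0, 0.0] for _ in x]
w_scale = [[0.1, 0.0], [0.0, 0.1], [0.2, 0.0], [0.0, 0.2]]

plain = ada_rms_norm(x, cond, zeros, zeros)         # reduces to rms_norm(x)
adapted = ada_rms_norm(x, cond, w_scale, zeros)     # changes with cond
print(plain)
print(adapted)
```

The arithmetic is cheap, but it mixes a reduction, a square root, a small matrix-vector product, and elementwise operations whose parameters change on every refinement step, which is the kind of data-dependent work that hardwired normalization blocks tend not to cover.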
Running Pi-0 on an NVIDIA RTX 4090 takes 73 milliseconds (Black et al., arXiv 2410.24164, three cameras, BF16). The RTX 4090 draws 450 watts. This is the server-class baseline. (See comparison table below for details.)
NVIDIA’s Jetson Thor brings the GPU approach to an edge-adjacent class of compute: roughly 120–130 watts with roughly 517 INT8 TOPS. That exceeds the power budget of most embedded autonomous systems, where the main SoC is typically expected to dissipate no more than 10 W to 20 W. A delivery robot, a drone, or a sensor module in a production vehicle typically won’t have 120 watts of thermal headroom for a single inference processor.
A published roofline analysis for Pi-0 on Jetson Thor (Jiang et al., arXiv 2602.18397) puts the theoretical best-case latency at approximately 53 milliseconds. In practice, real systems achieve 75–85% of roofline performance after extensive and often manual optimization effort, which puts likely Jetson Thor latency in the 62–71 ms range under working conditions.
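The step from roofline to expected latency is simple division: if a system reaches only a fraction of roofline performance, latency grows by the inverse of that fraction. The short calculation below reproduces the estimate from the 53 ms roofline and the 75–85% efficiency band quoted above; the function name is hypothetical.

```python
def achieved_latency_ms(roofline_ms: float, efficiency: float) -> float:
    """Latency when only `efficiency` (0..1) of roofline performance is reached."""
    return roofline_ms / efficiency

ROOFLINE_MS = 53.0
best = achieved_latency_ms(ROOFLINE_MS, 0.85)    # optimistic end of the band
worst = achieved_latency_ms(ROOFLINE_MS, 0.75)   # pessimistic end of the band
print(f"{best:.1f} ms to {worst:.1f} ms")
```

This gives about 62.4 ms at 85% efficiency and about 70.7 ms at 75%.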
The common embedded AI inference approach pairs a fixed-function NPU with a CPU or DSP. The NPU efficiently handles matrix multiplications and selected graph operators hardwired into silicon; anything the NPU can’t execute falls back to the general-purpose processor.
For a VLA like Pi-0.5, this creates a compounding problem. Every transformer layer mixes MAC-heavy matrix operations with non-MAC operations: softmax in every attention block, layer normalization throughout, and, in the action expert, AdaRMSNorm. In most NPU architectures, very few of the non-MAC operators run in the fixed-function accelerator; the rest must run on the programmable core paired with the accelerator.
Research on LLM inference with heterogeneous NPU architectures (Xu et al., “Fast On-device LLM Inference with NPUs,” arXiv 2407.05858) measured NPU utilization on a 1.8B-parameter LLM and found the NPU idle 37% of the time due to CPU fallback overhead. With careful and often tedious manual scheduling, that idle time can be reduced but not eliminated. Even if it were eliminated, a heterogeneous multicore solution would still pay a significant power penalty for shuffling data between engines. For Pi-0.5 specifically, heterogeneous NPU operator partitioning generates 712 round-trips between NPU and CPU per inference, along with 762 MB of extra memory transfers. That overhead accumulates across all 10 iterations of the flow-matching loop.
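To get a feel for what those numbers could cost, the sketch below splits the fallback overhead into two parts: the time to move the extra bytes and the time lost to synchronization on each round-trip. It is an illustrative estimate, not measured data. The 712 round-trips and 762 MB come from the paragraph above; the 273 GB/s bandwidth is borrowed from the comparison table later in this post, and the 10 µs per-round-trip synchronization cost is a purely hypothetical placeholder that will vary widely between SoCs.

```python
def fallback_overhead_ms(round_trips, extra_bytes, bandwidth_gb_s, sync_us_per_trip):
    """Return (transfer_ms, sync_ms) for NPU <-> CPU fallback traffic."""
    transfer_ms = extra_bytes / (bandwidth_gb_s * 1e9) * 1e3
    sync_ms = round_trips * sync_us_per_trip / 1e3
    return transfer_ms, sync_ms

transfer_ms, sync_ms = fallback_overhead_ms(
    round_trips=712,          # per inference, from the text above
    extra_bytes=762e6,        # 762 MB of extra transfers, from the text above
    bandwidth_gb_s=273.0,     # assumed: table bandwidth figure
    sync_us_per_trip=10.0,    # hypothetical synchronization cost per round-trip
)
print(f"transfer: {transfer_ms:.2f} ms, sync: {sync_ms:.2f} ms")
```

With these placeholder values the synchronization term is larger than the raw transfer time, which illustrates why the number of partition boundaries can matter as much as the volume of data crossing them.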

Figure: The heterogeneous NPU partitioning problem.
The scheduling problem doesn’t get easier as models evolve. Operators that don’t exist today will appear in next year’s VLA releases, and more the year after. Rinse and repeat. A fixed-function NPU today locks in today’s operator set. The fallback path becomes the default execution path for whatever is new.
Quadric’s Chimera GPNPU (General-Purpose Neural Processing Unit) is a fully programmable AI acceleration solution. The Chimera core is a single processor pipeline containing an array of processing elements (PEs). Each PE contains multiply-accumulate units, a complete 32-bit scalar ALU, local memory, and a mesh interconnect to neighboring PEs. The entire array operates under software control with no hardware-managed cache and no separate fallback processor.
Every operator in Pi-0.5—including AdaRMSNorm—runs natively on Chimera cores.
Mapping the vision encoder. SigLIP produces 256 patch tokens per camera image (from 16×16 patch decomposition). A Chimera QC Perform processor has 256 PEs. The mapping is direct: one PE per patch token. Through the entire attention block (projections, dot products, softmax), data stays in place on the PE array, with no reshuffling between stages within a layer. The larger Chimera Ultra processor handles four of those 16×16 patch grids simultaneously across its 32×32 array of PEs.
Weight tiling for the language model. Gemma’s feed-forward weights are too large to fit entirely in on-chip SRAM. Software tiles the weights across the PE array, processes each tile, and uses DMA to prefetch the next tile while the current tile is computing. Memory latency hides behind compute. This works under software-managed memory control in Quadric’s SRAM-based, DMA-driven memory architecture. In competing heterogeneous NPU solutions, hardware-managed caching in the programmable CPU or DSP introduces non-deterministic eviction behavior that often breaks the prefetch/compute overlap.
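The prefetch/compute overlap described above is classic double buffering. The sketch below is a simple timing model, not Quadric’s implementation: it assumes two on-chip buffers, a synchronization point after each tile, and made-up per-tile load and compute times, and compares the overlapped schedule against a fully serial one.

```python
def serial_time(load_ms, compute_ms):
    """No overlap: every tile is loaded, then computed, one after another."""
    return sum(load_ms) + sum(compute_ms)

def double_buffered_time(load_ms, compute_ms):
    """Load tile i+1 by DMA while tile i computes (two on-chip buffers)."""
    total = load_ms[0]                      # the first load cannot be hidden
    for i in range(len(compute_ms)):
        next_load = load_ms[i + 1] if i + 1 < len(load_ms) else 0.0
        total += max(compute_ms[i], next_load)   # slower of the two sets the pace
    return total

# Hypothetical workload: 8 weight tiles, 1.0 ms to load and 1.5 ms to compute each.
loads = [1.0] * 8
computes = [1.5] * 8

print("serial:         ", serial_time(loads, computes), "ms")
print("double-buffered:", double_buffered_time(loads, computes), "ms")
```

In this model the loads disappear behind compute whenever a tile takes longer to process than to fetch, and the overlap only holds if each load finishes when software expects it to, which is why deterministic, software-managed memory matters here.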
Custom operators in place. AdaRMSNorm runs on the same PE array in the same data layout already established from the preceding attention block. There is no dispatch to a separate processor and no data movement to initiate the operation.
The table below compares three platforms running Pi-0 or Pi-0.5:
| | RTX 4090 | Jetson Thor | GPNPU (8× QC-U) |
| --- | --- | --- | --- |
| Model | Pi-0 (3B) | Pi-0 (3B) | Pi-0.5 (3.3B) |
| INT8 TOPS | ~1,300 | ~517 | 445 |
| DDR Bandwidth | 1,008 GB/s | 273 GB/s | 273 GB/s |
| Power | 450W | 120–130W | ~11W (cores) |
| E2E Latency | 73 ms (measured) | ~53 ms (roofline) | ~45 ms (simulated) |
The Jetson Thor latency of 53 ms is a roofline figure: a theoretical best case that assumes a fully optimized implementation. Actual measured performance will likely achieve 75–85% of roofline after significant optimization effort, which puts real Jetson Thor latency in the 62–71 ms range. The Chimera figure of 45 ms comes from real code running on a cycle-approximate simulator. It reflects the current implementation, not a best case, and there is still optimization headroom.
Note that the Chimera GPNPU is running Pi-0.5 (3.3B parameters) while the other two columns show Pi-0 (3B parameters). The larger model on Chimera is already faster than the theoretical maximum for the smaller model on Jetson Thor.
DDR bandwidth figures are matched between Jetson Thor and the Chimera configuration at 273 GB/s, holding that variable constant. The power difference is approximately 10:1: roughly 11 watts for the GPNPU cores versus 120–130 watts for Jetson Thor. This is, of course, an apples-to-oranges comparison, because the 11-watt figure covers only the processor cores. The other elements of a full SoC, including the DDR memory interfaces, will add another 10+ watts of total power dissipation on top of the power consumed by the Quadric processors. Even so, a 20 W or 25 W chip that outperforms the 130 W NVIDIA device is a clear winner.
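One way to combine the power and latency columns is energy per inference, approximated as power multiplied by latency. The sketch below uses the table figures directly and should be read as a rough bound, not a measurement: it treats the quoted power as constant for the whole inference, uses the 130 W upper figure and the 53 ms roofline latency for Jetson Thor, and adds a 25 W full-SoC case based on the estimate in the paragraph above.

```python
def energy_joules(power_w: float, latency_ms: float) -> float:
    """Energy per inference, assuming constant power draw for the whole run."""
    return power_w * latency_ms / 1e3

# (power in watts, end-to-end latency in ms), taken from the comparison table
platforms = {
    "RTX 4090": (450.0, 73.0),
    "Jetson Thor": (130.0, 53.0),
    "GPNPU cores only": (11.0, 45.0),
    "GPNPU-based SoC (assumed 25 W)": (25.0, 45.0),
}

energy = {name: energy_joules(p, t) for name, (p, t) in platforms.items()}
for name, joules in energy.items():
    print(f"{name}: {joules:.2f} J per inference")
```

Even with the full-SoC power assumption, the estimate stays several times below the Jetson Thor figure under these simplifications.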
The deployment challenge for VLAs is not purely a TOPS problem. A system can have enough raw compute throughput and still fail on VLAs because it can’t execute the full operator graph without partitioning across heterogeneous processors. Every partition boundary adds round-trip latency and memory transfer overhead that compounds through the inference pipeline.
VLA architectures will keep evolving. Physical Intelligence will update Pi-0.5. Other groups are building competing VLAs with different action expert designs and conditioning mechanisms. The operators that distinguish future models from today’s SOTA networks likely won’t be in any fixed-function NPU designed today, condemning the chip design team that chooses a fixed-function NPU to respin silicon to adapt to the next model breakthrough.
A processor that runs the full graph natively with software-programmable execution and no fallback path handles that evolution by recompiling, not by re-spinning silicon. That processor is Quadric’s Chimera GPNPU.
If you’re evaluating processor IP for robotics, autonomous vehicles, or other embedded autonomy applications, contact us to discuss VLA benchmarks and Chimera GPNPU performance data in detail.
Contact Quadric: https://quadric.ai/contact