Packet-based architecture enables out-of-order execution to optimize hardware utilization without retraining the model.
Cloud AI enables features like voice assistants and recommendations via centralized data centers, but it relies on consistent network connectivity, which often fails in real-world conditions. Edge-native AI shifts inference to devices such as phones, cars, and sensors, enabling real-time processing, enhanced privacy, and operational resilience.
Edge AI addresses key limitations of cloud systems by improving latency, privacy, and cost. Cloud inference is subject to variable network delays and resource contention, while edge processing runs where the data is generated, delivering predictable sub-millisecond responses even when no network connection is available.
Data remains on-device to reduce exposure risks, and distributing compute across endpoints lowers infrastructure demands compared to hyperscale facilities.
Cloud environments support high-power GPUs with active cooling and abundant memory, and they can tolerate low utilization rates of 20-40% by compensating with raw compute capacity. Edge NPUs must operate under battery, thermal, and memory constraints, which prevent the direct deployment of large cloud models.
Mismatches between layer dimensions and hardware compute blocks cause inefficiencies: layers smaller than a block leave compute units idle, while larger layers must be split into fragments that require repeated memory accesses.
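To make the mismatch concrete, here is a minimal sketch of how padding a layer onto fixed-size compute blocks wastes multiply-accumulate slots. The 64x64 block size, the layer shapes, and the function name `tile_utilization` are illustrative assumptions, not figures from any particular NPU.

```python
import math

def tile_utilization(rows: int, cols: int, block: int) -> float:
    """Fraction of compute slots doing useful work when a rows x cols
    layer is mapped onto square block x block compute tiles (hypothetical model)."""
    # Each dimension is rounded up to a whole number of tiles.
    tiles = math.ceil(rows / block) * math.ceil(cols / block)
    return (rows * cols) / (tiles * block * block)

BLOCK = 64  # assumed block size, for illustration only

# Smaller than the block: one tile, but much of it sits idle.
small = tile_utilization(48, 48, BLOCK)    # 0.5625

# Slightly larger than the block: fragments into 4 tiles, mostly padding.
large = tile_utilization(80, 80, BLOCK)    # 0.390625

# Exact multiple of the block: every slot is used.
exact = tile_utilization(128, 128, BLOCK)  # 1.0

for name, u in [("48x48", small), ("80x80", large), ("128x128", exact)]:
    print(f"{name}: {u:.1%} of compute slots used")
```

This simple model counts only idle compute slots. It does not model the additional memory traffic that fragmentation causes, which would make the oversized case worse still.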
One approach is to reshape the network: retraining it with adjusted layer shapes (heights, widths, and depths) that better align with NPU compute blocks such as matrix engines. Redesigning convolutions or transformer components to fit hardware dimensions in this way can reduce idling and fragmentation.
However, reshaping requires significant effort to maintain accuracy under edge constraints and is limited by the fixed nature of layer-based architectures, often yielding only marginal gains, with utilization typically around 50%.
Expedera’s packet-based NPU architecture divides layers into context-preserving packets, enabling out-of-order execution to optimize hardware utilization without retraining the model. This achieves 60-80% utilization and reduces DDR memory accesses by 75-79% for models such as Llama 3.2 and Qwen2.
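As a rough back-of-the-envelope illustration of what these percentages mean, the sketch below converts the utilization ranges quoted in this article into effective throughput, and the DDR reduction into remaining memory traffic. The 10 TOPS peak figure and the function names are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def effective_tops(peak_tops: float, utilization: float) -> float:
    """Throughput actually delivered at a given hardware utilization."""
    return peak_tops * utilization

def remaining_ddr_fraction(reduction: float) -> float:
    """Share of the original DDR accesses left after a fractional reduction."""
    return 1.0 - reduction

PEAK_TOPS = 10.0  # hypothetical peak rating, not a vendor figure

# Utilization ranges as quoted in the article text.
scenarios = {
    "layer-based, unoptimized": (0.20, 0.40),
    "reshaped network": (0.50, 0.50),
    "packet-based": (0.60, 0.80),
}

for name, (low, high) in scenarios.items():
    lo_tops = effective_tops(PEAK_TOPS, low)
    hi_tops = effective_tops(PEAK_TOPS, high)
    print(f"{name}: {lo_tops:.1f} to {hi_tops:.1f} effective TOPS")

# A 75-79% reduction leaves roughly a quarter to a fifth of the accesses.
for reduction in (0.75, 0.79):
    print(f"{reduction:.0%} reduction leaves {remaining_ddr_fraction(reduction):.0%} of DDR accesses")
```

The comparison assumes the same peak rating in every scenario, so the differences come from utilization alone.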
The Origin Evolution platform supports customization through workload analysis, selection of attention, vector, and feed-forward blocks, and tailored on-chip SRAM sizing, with reported peak utilization of 90% in production.
Smartphone OEMs have implemented these advances: one design delivered a 20x throughput improvement and a 50% power reduction at 11.6 TOPS/W across over 10 million devices; another achieved a 2x throughput increase and a 60% power reduction at 16 TOPS/W within strict area limits.
Successful implementations prioritize edge-native design over cloud augmentation.
This approach positions teams to integrate edge AI effectively into product roadmaps. For a deeper dive, see my article "The Coming Breakup Between AI And The Cloud."