SPONSOR BLOG

How To Start Building Edge-Native AI

Packet-based architecture enables out-of-order execution to optimize hardware utilization without retraining the model.

June 11th, 2026 - By: Sharad Chole

Cloud AI enables features like voice assistants and recommendations via centralized data centers, but it relies on consistent network connectivity, which often fails in real-world conditions. Edge-native AI shifts inference to devices such as phones, cars, and sensors, enabling real-time processing, enhanced privacy, and operational resilience.

Why edge AI outpaces cloud

Edge AI addresses key limitations of cloud systems in latency, privacy, and cost. Cloud inference is subject to variable network delays and resource contention, whereas edge processing runs where the data is generated, so it can deliver predictable sub-millisecond responses even without connectivity.

Data remains on-device to reduce exposure risks, and distributing compute across endpoints lowers infrastructure demands compared to hyperscale facilities.

Edge physics: Power and efficiency hurdles

Cloud environments support high-power GPUs with active cooling and abundant memory, and they tolerate low utilization rates of 20-40% by compensating with raw compute capacity. Edge NPUs must operate under battery, thermal, and memory constraints, which prevent the direct deployment of large cloud models.

Mismatches between layer dimensions and hardware blocks cause inefficiencies: layers smaller than a compute block leave part of it idle, while larger layers must be split into fragments, which incurs repeated memory accesses.
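To make the mismatch concrete, here is a minimal sketch of the arithmetic. It assumes a hypothetical 64x64 matrix engine and made-up layer shapes; neither comes from a specific NPU, and real hardware has more factors than this model captures.

```python
import math

def passes_needed(rows: int, cols: int, tile_rows: int, tile_cols: int) -> int:
    """Number of hardware blocks (or passes) needed to cover a rows x cols layer."""
    return math.ceil(rows / tile_rows) * math.ceil(cols / tile_cols)

def tile_utilization(rows: int, cols: int, tile_rows: int, tile_cols: int) -> float:
    """Fraction of multiply-accumulate slots doing useful work when the layer
    is mapped onto fixed-size tile_rows x tile_cols compute blocks."""
    slots = passes_needed(rows, cols, tile_rows, tile_cols) * tile_rows * tile_cols
    return (rows * cols) / slots

# Hypothetical 64x64 matrix engine; the layer shapes are illustrative only.
TILE = (64, 64)
for shape in [(64, 64), (48, 48), (65, 65), (256, 100)]:
    util = tile_utilization(*shape, *TILE)
    passes = passes_needed(*shape, *TILE)
    print(f"layer {shape}: {passes} pass(es), {util:.1%} of slots used")
```

A layer that is one row and one column larger than the block (65x65) needs four passes and leaves most slots empty, which is the fragmentation case described above.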

Reshaping networks: Tradeoffs and limits

Reshaping networks is one approach: the model is retrained with adjusted layer shapes (heights, widths, and depths) so that they align more closely with NPU compute blocks such as matrix engines. Redesigning convolutions or transformer components to fit the hardware dimensions can reduce idling and fragmentation.

However, reshaping requires significant effort to maintain accuracy under edge constraints and is limited by the fixed nature of layer-based architectures. It often yields only marginal gains, with utilization typically around 50%.

Packet-based innovation unlocks potential

Expedera’s packet-based NPU architecture divides layers into context-preserving packets, enabling out-of-order execution to optimize hardware utilization without retraining the model. This achieves 60-80% utilization and reduces DDR memory accesses by 75-79% for models such as Llama 3.2 and Qwen2.
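As a rough intuition for why splitting layers into smaller units can cut DDR traffic, the sketch below compares two schedules in a toy model. It is a generic depth-first tiling illustration, not Expedera's scheduler, and every name and number in it is an assumption: activation sizes, SRAM capacity, and weight size are invented, weights are assumed to stream from DRAM exactly once in both schedules, and overlap between neighboring packets is ignored.

```python
import math

def layer_by_layer_traffic(act_kib, sram_kib, weight_kib=0):
    """DRAM traffic (KiB) when each layer runs to completion before the next.
    act_kib[0] is the network input and act_kib[-1] the final output."""
    traffic = weight_kib + act_kib[0] + act_kib[-1]  # weights, input read, output write
    for size in act_kib[1:-1]:                       # intermediate activations
        if size > sram_kib:
            traffic += 2 * size                      # spill: write out, read back
    return traffic

def packetized_traffic(act_kib, sram_kib, weight_kib=0):
    """DRAM traffic (KiB) when work is cut into packets small enough that a
    packet's input and output for any layer fit on chip together.
    Returns (traffic, number_of_packets)."""
    working_set = max(a + b for a, b in zip(act_kib, act_kib[1:]))
    packets = max(1, math.ceil(working_set / sram_kib))
    return weight_kib + act_kib[0] + act_kib[-1], packets

# Toy numbers in KiB, chosen only for illustration.
activations = [1536, 8192, 8192, 4096, 512]
sram, weights = 2048, 6144

baseline = layer_by_layer_traffic(activations, sram, weights)
fused, packets = packetized_traffic(activations, sram, weights)
print(f"layer-by-layer: {baseline} KiB, packetized: {fused} KiB in {packets} packets")
print(f"reduction in this toy model: {1 - fused / baseline:.0%}")
```

The output depends entirely on the invented inputs, so it should not be read as reproducing the figures quoted above; the point is only that intermediates which stay on chip never cost a DRAM round trip.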

The Origin Evolution platform supports customization through workload analysis, selection of attention, vector, and feed-forward blocks, and tailored on-chip SRAM sizing, with reported peak utilization of 90% in production.

Proven gains in production

Smartphone OEMs have implemented these advances: one design delivered a 20x throughput improvement and a 50% power reduction at 11.6 TOPS/W across over 10 million devices; another achieved a 2x throughput increase and a 60% power reduction at 16 TOPS/W within strict area limits.
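The throughput and power figures above can be combined into energy per inference. The sketch below does that arithmetic, assuming (the source does not say this) that each throughput gain and power reduction were measured at the same operating point. It also converts TOPS/W into picojoules per operation, which follows directly from the units.

```python
def energy_per_inference_ratio(throughput_gain: float, power_reduction: float) -> float:
    """New energy per inference relative to the old design.
    Energy per inference = power / throughput, so the ratio is
    (1 - power_reduction) / throughput_gain."""
    return (1.0 - power_reduction) / throughput_gain

def picojoules_per_op(tops_per_watt: float) -> float:
    """1 TOPS/W is 1e12 operations per joule, i.e. 1 pJ per operation."""
    return 1.0 / tops_per_watt

# Figures quoted in the article; the combination is an illustrative assumption.
for gain, cut, efficiency in [(20, 0.50, 11.6), (2, 0.60, 16.0)]:
    ratio = energy_per_inference_ratio(gain, cut)
    print(f"{gain}x throughput, {cut:.0%} less power: "
          f"{ratio:.3f}x energy per inference, "
          f"{picojoules_per_op(efficiency):.4f} pJ/op at {efficiency} TOPS/W")
```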

Industry roadmaps shift edge-first

Consumer devices: On-device language models for low-latency, private interactions.
Automotive: Reliable driver monitoring and ADAS systems independent of cloud.
Industrial: Local predictive maintenance and anomaly detection.
Healthcare: Privacy-focused continuous monitoring.
Retail: Edge-based vision for behavior analysis and loss prevention.

Successful implementations prioritize edge-native design over cloud augmentation.

Actionable playbook for technical teams

Select Efficient Hardware: Evaluate platforms for high utilization (>60%), low memory bandwidth needs, and performance-per-watt rather than peak TOPS.
Optimize Models: Apply quantization and pruning; consider layer reshaping alongside hardware-adaptive techniques like packet execution.
Conduct Pilots: Test in latency- or privacy-constrained applications, measuring metrics like response time and energy use.
Engage Solution Providers: Partner with experts providing support from evaluation kits through production scaling.
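For the pilot step, a small measurement harness is often enough to get started. The sketch below is one possible shape for it, not a prescribed method: run_inference is a hypothetical stand-in for whatever call runs a model on the target device, and average power is assumed to come from an external measurement such as a power monitor.

```python
import time

def measure_latency(run_inference, warmup=5, iterations=100):
    """Time repeated calls and return median, tail, and worst-case latency in ms."""
    for _ in range(warmup):              # let caches and clocks settle first
        run_inference()
    samples_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_inference()
        samples_ms.append((time.perf_counter() - start) * 1e3)
    samples_ms.sort()

    def percentile(p):
        index = min(len(samples_ms) - 1, int(p / 100 * len(samples_ms)))
        return samples_ms[index]

    return {"p50_ms": percentile(50), "p99_ms": percentile(99), "max_ms": samples_ms[-1]}

def energy_mj(latency_ms: float, average_power_w: float) -> float:
    """Energy per inference in millijoules (ms x W = mJ)."""
    return latency_ms * average_power_w

# Stand-in workload; replace with the real on-device inference call.
def fake_inference():
    return sum(i * i for i in range(10_000))

stats = measure_latency(fake_inference, warmup=3, iterations=50)
print(stats)
# 0.5 W is a placeholder for a measured average power draw.
print(f"energy at p50: {energy_mj(stats['p50_ms'], 0.5):.3f} mJ")
```

Reporting the tail (p99 and max) alongside the median matters here because the case for edge inference rests on predictable response times, not just fast averages.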

This approach positions teams to integrate edge AI effectively into product roadmaps. For a deeper dive, see my article "The Coming Breakup Between AI And The Cloud."

Sharad Chole is chief scientist and co-founder at Expedera.
