Achieve low-latency, human-like dialogue without sending data outside the local environment.
Cloud-based AI dominates the headlines, but responsive and private interaction lies at the edge. This blog post shows how to build a fully offline, real-time voice assistant using the Arm-based NVIDIA DGX Spark platform. The system integrates open-source components such as faster-whisper and vLLM. It delivers low-latency, human-like dialogue without sending data outside the local environment.
Get started now. You can find the complete example and step-by-step instructions from this blog post on the Arm Learning Path.
In many enterprise environments, technical staff need fast access to internal documents or real-time assistance. However, relying on cloud APIs introduces three critical bottlenecks:
To solve this, we have designed a pipeline on the DGX Spark, which is built on the Grace-Blackwell GB10 architecture. The pipeline treats the CPU as an active, latency-optimized engine.
To ensure maximum flexibility and performance, the entire system runs using the following open-source tools:
| Component | Software / Model | Software / Model | License / Accessibility |
| Audio Capture | PyAudio | Real-time 16 kHz microphone streaming. | MIT License |
| Speech Detection | WebRTC VAD | 30 ms frame‑based voice/silence detection. | BSD‑style |
| Speech‑to‑Text | faster‑whisper | Efficient, high‑quality transcription on Arm CPU. | MIT License |
| Inference Engine | vLLM | GPU‑accelerated LLM serving with quantized model support. | Apache‑2.0 |
| Language Model | Mistral‑7B-Instruct / Llama-3-70B (GPTQ) | Local reasoning and natural language response. | HF Model License / Model Card Terms |
The system captures 16 kHz mono audio and uses WebRTC voice activity detection (VAD) to detect speech in 30ms frames. This approach ensures that we process only valid utterances and ignore background noise and gaps.
Instead of offloading short, latency-sensitive tasks to the GPU, we use the high-performance Arm CPU complex (the Cortex-X and A cores).
After transcription, the text moves into vLLM. DGX Spark uses Unified Memory so the CPU and GPU share a single memory space. This design lets the GPU directly access CPU output and removes the need for explicit data transfers or PCIe copy overhead.
This flowchart illustrates a high-performance heterogeneous pipeline on DGX Spark. Tasks are allocated to the most efficient compute units to reduce latency.
This pipeline uses Arm Cortex-X and A- CPU cores to handle latency-sensitive audio capture and speech-to-text transcription. This approach delivers response times below 100ms. The system uses Unified Memory so the GPU can directly access transcribed data in shared DRAM. This removes traditional PCIe transfer overhead. The process ends with the NVIDIA GPU executing the vLLM engine to generate intelligent responses. This delivers a high-throughput and private conversational experience.

Fig. 1: DGX Spark Heterogeneous Pipeline. Arm CPUs handle STT transcription and GPU generate the response to minimize interaction latency.
We validated the system using a multi-turn subscription cancellation scenario. The system produced verified, grounded answers without hallucinations.
The following data tracks the exact time from the end of the user’s speech to the start of the LLM’s response (response latency).

| Dialogue Turn | Speech End Time | vLLM Response Start | Response Latency (s) |
| Turn 1 | 00:10 | 00:13 | 3 seconds |
| Turn 2 | 00:24 | 00:28 | 4 seconds |
| Turn 3 | 00:41 | 00:45 | 4 seconds |
| Turn 4 | 00:54 | 00:59 | 5 seconds |
Observation: All turns achieved an average response latency of four seconds. This performance is competitive with cloud-based solutions and provides stronger privacy with no connectivity requirements.
We believe the best way to understand the power of Arm-based AI is to experience it firsthand. We have prepared a comprehensive, step-by-step Learning Path that helps you deploy this pipeline.
The tutorial shows you how to:
Build an offline voice chatbot with faster-whisper and vLLM on DGX Spark Learning Path.
Leave a Reply