What are VLMs, why do they matter, and what can they do?
Just when you thought the pace of change in AI models couldn’t get any faster, it accelerates yet again. In the popular press, the January 2025 introduction of DeepSeek captured headlines in every newspaper and on every website, drawing comparisons to the Sputnik moment of 1957. But rapid change is also happening in quarters hidden from view of the chat-app-using general public. The rapid emergence of Vision Language Models (VLMs) in the automotive/ADAS sector is one of those under-the-radar shifts shaking up a different industry.
What are VLMs?
Vision Language Models are a rapidly emerging class of multimodal AI models growing in importance in the automotive world. Market leader NVIDIA offers a concise definition: Vision Language Models are multimodal AI systems built by combining a large language model (LLM) with a vision encoder, giving the LLM the ability to “see.”
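To make that definition concrete, here is a minimal, hedged sketch of the basic VLM structure in PyTorch: a vision encoder turns an image into tokens, a projection layer maps those tokens into the LLM’s embedding space, and the LLM attends over image tokens and text tokens together. All module sizes and components are illustrative placeholders, not the architecture of any specific production VLM.

```python
# Minimal sketch of the vision-encoder + LLM pattern (illustrative only).
import torch
import torch.nn as nn

class TinyVLM(nn.Module):
    def __init__(self, vision_dim=256, llm_dim=512, vocab_size=32000, llm_layers=2):
        super().__init__()
        # Stand-in vision encoder: a real VLM would use a ViT / CLIP-style encoder.
        self.vision_encoder = nn.Sequential(
            nn.Linear(3 * 32 * 32, vision_dim), nn.GELU(),
            nn.Linear(vision_dim, vision_dim),
        )
        # Projection that aligns vision features with the LLM token space.
        self.projector = nn.Linear(vision_dim, llm_dim)
        # Stand-in "LLM": a small transformer over embedded tokens.
        self.token_embed = nn.Embedding(vocab_size, llm_dim)
        layer = nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True)
        self.llm = nn.TransformerEncoder(layer, num_layers=llm_layers)
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, image_patches, text_ids):
        # image_patches: (B, num_patches, 3*32*32), text_ids: (B, seq_len)
        img_tokens = self.projector(self.vision_encoder(image_patches))
        txt_tokens = self.token_embed(text_ids)
        # The LLM "sees" by attending over image tokens prepended to the text.
        fused = torch.cat([img_tokens, txt_tokens], dim=1)
        return self.lm_head(self.llm(fused))

model = TinyVLM()
logits = model(torch.randn(1, 49, 3 * 32 * 32), torch.randint(0, 32000, (1, 16)))
print(logits.shape)  # (1, 49 + 16, vocab_size)
```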
In practical terms, that means a VLM can describe in text form – and therefore in a form understandable by the human driver – what a camera or radar sensor “sees” in a car’s driver assistance system. A properly calibrated system pairing VLMs with multiple cameras could therefore be designed to generate verbal warnings to the driver, such as “A pedestrian is about to enter the crosswalk from the right curb 200 meters down the road.” Or such scene descriptions, gathered from multiple sensors over several seconds of data, could be fed into other AI models that make the decisions controlling an autonomous vehicle.
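The “camera frame in, driver-readable text out” idea can be sketched with an off-the-shelf image-captioning VLM from the Hugging Face `transformers` library. The model checkpoint, the file path, and the toy keyword rule below are illustrative assumptions, not part of any production ADAS stack, which would fuse many frames and sensors before alerting a driver.

```python
# Hedged sketch: caption a camera frame, then decide whether to warn the driver.
from transformers import pipeline

# Any image-to-text VLM checkpoint could be substituted here (example model name).
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

def describe_frame(image_path: str) -> str:
    # Returns a natural-language description of what the camera "sees".
    result = captioner(image_path)
    return result[0]["generated_text"]

def maybe_warn_driver(description: str) -> str | None:
    # Toy policy: surface the description as a warning if it mentions a
    # vulnerable road user. A real system would use far richer logic.
    for keyword in ("pedestrian", "person", "cyclist", "bicycle"):
        if keyword in description.lower():
            return f"Warning: {description}"
    return None

description = describe_frame("dashcam_frame.jpg")  # placeholder image path
print(description)
print(maybe_warn_driver(description))
```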
Rapid Explosion of VLMs since 2023
How new are VLMs? Very, very new. A GitHub repo that surveys and tracks automotive VLM evolution (https://github.com/ge25nab/Awesome-VLM-AD-ITS) lists over 50 technical papers from arxiv.org describing VLMs in the autonomous driving space, with 95% of the entries coming from 2023 and 2024. In our business at Quadric – where we have significant customer traction in the AD/ADAS segment – vision language models were rarely mentioned 18 months ago by customers doing IP benchmark comparisons. Only in 2024 did large language models (LLMs) in cars become a “thing,” with designers of automotive silicon beginning to ask for LLM performance benchmarks. Now, barely 12 months later, the VLM is starting to emerge as a possible litmus-test benchmark for AI acceleration engines in automotive SoCs.
How Can You Design an Accelerator if the Reference Benchmark Keeps Moving?
Imagine the head-spinning changes faced by designers of hardware “accelerators” over the past four years. In the 2020–2022 period, the state-of-the-art benchmarks everyone tried to implement were CNNs (convolutional neural networks). By 2023 the industry had pivoted to transformers – such as the SWIN transformer (shifted-window transformer) – as the must-have solution. Then last year it was newer transformers – such as BEVFormer (bird’s-eye-view transformer) or BEVDepth – plus LLMs such as Llama 2 and Llama 3. And today, pile on VLMs in addition to needing to run all the CNNs, vision transformers, and LLMs. So many networks, so many machine-learning operators in the graphs! And in some cases, such as the BEV networks, the functions are so new that the frameworks and standards (PyTorch, ONNX) don’t support them, and hence they are implemented purely in CUDA code.
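As a hedged illustration of why such operators end up in hand-written CUDA, consider a scatter-style “voxel pooling” step of the kind BEV pipelines use to collapse lifted camera features onto a bird’s-eye-view grid. There is no single standard ONNX operator for it, so production stacks typically bind a custom CUDA kernel into PyTorch. The sketch below uses a pure-PyTorch reference implementation so it actually runs; the `bev_pool_cuda` binding mentioned in the comment is a hypothetical name, not a real package.

```python
# Hedged sketch of a BEV-style custom operator wrapped as a PyTorch autograd Function.
import torch

def voxel_pool_reference(feats, voxel_idx, num_voxels):
    # feats:     (N, C) per-point camera features lifted into 3D
    # voxel_idx: (N,)   flattened BEV-grid cell index for each point
    # Returns    (num_voxels, C) features summed per BEV cell.
    out = feats.new_zeros(num_voxels, feats.shape[1])
    out.index_add_(0, voxel_idx, feats)
    return out

class VoxelPool(torch.autograd.Function):
    @staticmethod
    def forward(ctx, feats, voxel_idx, num_voxels):
        ctx.save_for_backward(voxel_idx)
        # In a real BEV stack this would dispatch to a fused CUDA kernel,
        # e.g. bev_pool_cuda.forward(...) (hypothetical binding).
        return voxel_pool_reference(feats, voxel_idx, num_voxels)

    @staticmethod
    def backward(ctx, grad_out):
        (voxel_idx,) = ctx.saved_tensors
        # Gradient of a scatter-sum is a gather back to the points.
        return grad_out[voxel_idx], None, None

feats = torch.randn(1000, 64, requires_grad=True)
voxel_idx = torch.randint(0, 128 * 128, (1000,))
pooled = VoxelPool.apply(feats, voxel_idx, 128 * 128)
pooled.sum().backward()
print(pooled.shape, feats.grad.shape)
```

Operators like this are exactly what makes ONNX export and fixed-function acceleration awkward: the graph contains a custom op the standard toolchains have never seen.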
Run all networks. Run all operators. Run C++ code, such as CUDA ops? No hardwired accelerator can do all that. And running on a legacy DSP or legacy CPU won’t yield sufficient performance. Is there an alternative?
The Fully Programmable, Universal, High-Performance GPNPU Solution
Yes, there is a solution that has been shown to run all of those innovative AI workloads – and run them at high speed. The revolutionary Chimera GPNPU processor integrates fully programmable 32-bit ALUs with systolic-array-style matrix engines: up to 1024 ALUs in a single core, with only one instruction fetch and one AXI data port. That’s over 32,000 bits of parallel, fully programmable compute. Scalable up to 864 TOPS for bleeding-edge ADAS applications, Chimera GPNPUs have matched and balanced compute throughput for both MAC and ALU operations, so no matter what type of network you choose to run – or whatever style of network gets invented in 2026 – they all run fast, at low power, and in a highly parallel fashion. Learn more at www.quadric.io.