Processing input data from multiple modalities on mobile and embedded devices.
Transformer-based models have rapidly spread from text to speech, vision, and other modalities. This spread has created challenges for the development of Neural Processing Units (NPUs), which must now efficiently perform matrix computations on model weights and propagate activations through long stacks of attention blocks. Increasingly, NPUs must be able to process models with multiple input modalities with acceptably low latency, power, and area.
Transformers accept any form of tokenized data. Multimodal Large Language Models (LLMs) make use of this feature by processing multiple data types as inputs. Combining modalities frequently improves the quality of the input: text adds context to an image, and an image illustrates what may be otherwise ambiguous or unclear in text.
Multiple modalities are necessary for many tasks. Consider this example from the authors of MiniCPM-V, a multimodal LLM designed for mobile deployment:
Source: MiniCPM-V GitHub
In the captioned image of the bolt, neither the image nor the text alone conveys sufficient information. One could try to capture the detail of the image in many words or illustrate the question in a set of images. But the question is better posed – of higher quality with less data – by including both image and text.
Multimodal LLMs contain a modality-specific encoder, an LLM, and a “connector” that bridges the modalities. The encoder and LLM are typically pre-trained. For instance, LLaVA uses CLIP ViT-L/14 as its image encoder and Vicuna as its LLM decoder; Vicuna is LLaMA fine-tuned on conversations from ShareGPT.
Both the ViT encoder and the Vicuna decoder arrive pre-trained; only the connector, a single linear layer, is trained from scratch. Other examples of multimodal LLMs include OpenAI’s GPT-4, OpenGVLab’s InternVL, and Alibaba Cloud’s Qwen-VL.
OpenAI’s multimodal CLIP model learned to classify images via raw captions rather than explicit labels. The model attained comparable accuracy to ResNet-50 on ImageNet without being trained on any of the images in the dataset. The CLIP architecture works with different image encoders but attains best performance using the vision transformer architecture (ViT).
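To make that training recipe concrete, here is a minimal sketch of a CLIP-style symmetric contrastive loss, loosely following the pseudocode in the CLIP paper. The encoders are replaced by random placeholder embeddings, and the batch size, embedding width, and temperature are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched (image, caption) pairs.

    image_features, text_features: [batch, dim] embeddings from the two encoders
    (placeholder tensors here; in CLIP they come from a ViT and a text transformer).
    """
    # L2-normalize so the dot product is a cosine similarity
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # [batch, batch] similarity matrix: entry (i, j) scores image i against caption j
    logits = image_features @ text_features.t() / temperature

    # The matching caption for image i is caption i, so the target is the diagonal
    targets = torch.arange(logits.shape[0])

    # Cross-entropy in both directions: image-to-caption and caption-to-image
    loss_i = F.cross_entropy(logits, targets)
    loss_t = F.cross_entropy(logits.t(), targets)
    return (loss_i + loss_t) / 2

# Toy usage with random embeddings standing in for encoder outputs
imgs, caps = torch.randn(8, 512), torch.randn(8, 512)
print(clip_contrastive_loss(imgs, caps))
```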
ViT encodes images by splitting them into patches and generating embeddings for these patches and their respective positions. ViT then processes these embeddings through the original transformer architecture: multi-headed self-attention with layer normalization and skip connections:
Source: “An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale” (Dosovitskiy et al. 2020)
The transformer is remarkably simple. A “tokenizer” converts source data such as text or pixels into model inputs. The original vision transformer used 16×16 patches as tokens, but other sizes are possible. Once tokenized, any form of data can be converted to embedding vectors and fed through an encoder and decoder to generate predictions.
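As a concrete illustration, here is a minimal sketch of a ViT-style patch tokenizer: the image is split into 16×16 patches, each patch is linearly projected, and a position embedding is added. The class token used by the original ViT is omitted for brevity, and the dimensions are illustrative.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Turn an image into a sequence of patch embeddings (illustrative dimensions)."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2  # 14 * 14 = 196 for 224/16
        # A strided convolution implements "split into patches + linear projection" in one step
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size, stride=patch_size)
        # Learned position embedding, one vector per patch
        self.pos_embed = nn.Parameter(torch.randn(1, self.num_patches, embed_dim) * 0.02)

    def forward(self, x):                      # x: [batch, 3, 224, 224]
        x = self.proj(x)                       # [batch, 768, 14, 14]
        x = x.flatten(2).transpose(1, 2)       # [batch, 196, 768] -- one token per patch
        return x + self.pos_embed              # add position information

tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```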
The attention blocks in the encoder and decoder use matrix multiplications to apply learned model parameters; the remainder of the attention computation consists of vector operations. For instance, MobileViT-XS (a smaller, mobile-oriented ViT variant) contains add, mean, softmax, reshape, transpose, and other vector layers in its attention blocks.
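The split between matrix multiplications and vector operations is easiest to see in code. Below is a single attention head written out as a sketch; the weight matrices are random placeholders and the sequence length and width are illustrative.

```python
import math
import torch

def single_head_attention(x, w_q, w_k, w_v):
    """One self-attention head; x: [seq, dim], w_*: [dim, dim] learned weights."""
    # Matrix multiplications with learned parameters (MAC-array work on an NPU)
    q, k, v = x @ w_q, x @ w_k, x @ w_v

    # Vector operations: scaling and softmax (a full block adds norms, adds, transposes)
    scores = (q @ k.t()) / math.sqrt(q.shape[-1])    # [seq, seq]
    weights = torch.softmax(scores, dim=-1)          # row-wise softmax, a vector op

    return weights @ v                               # another matmul: attention-weighted sum

seq, dim = 196, 768
x = torch.randn(seq, dim)
out = single_head_attention(x, *(torch.randn(dim, dim) for _ in range(3)))
print(out.shape)  # torch.Size([196, 768])
```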
The LLM component of multimodal models has the same general transformer architecture. The connector in LLaVA is a straightforward matrix multiplication translating image features (the output from the visual encoder) into visual tokens in the language model’s embedding space (the input to the language decoder).
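A sketch of a LLaVA-style connector is shown below: a single linear layer projecting visual-encoder features into the language model’s embedding space. The widths used here (1024 for CLIP ViT-L/14 features, 4096 for Vicuna-7B embeddings) and the patch count are illustrative assumptions, not a reproduction of the released code.

```python
import torch
import torch.nn as nn

# Illustrative dimensions: CLIP ViT-L/14 patch features are assumed 1024-wide,
# Vicuna-7B token embeddings 4096-wide.
VISION_DIM, LLM_DIM = 1024, 4096

# A LLaVA-style connector is a single linear projection like this one
connector = nn.Linear(VISION_DIM, LLM_DIM)

num_patches = 256                                   # depends on input resolution; illustrative
image_features = torch.randn(1, num_patches, VISION_DIM)   # output of the vision encoder
visual_tokens = connector(image_features)                   # [1, 256, 4096]

# The projected "visual tokens" are concatenated with text token embeddings
# and fed to the language model as a single sequence.
print(visual_tokens.shape)
```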
Transformers can handle different tasks with minimal retraining. Scientists at Google developing the first transformer noticed that the same architecture used for machine translation could also generate fake Wikipedia articles. The name GPT alludes to the pre-trained nature of OpenAI’s foundational models.
Transformers are thus ideally suited for multimodal learning. They can accept any tokenized input and can be tuned to generate outputs across diverse domains. For instance, OpenAI’s GPT-4o generates text, image, and audio outputs from text, image, audio, and video inputs.
GPT-4 captured people’s attention by building websites from whiteboard diagrams and (controversially) generating one-click lawsuits to sue robocallers, among other mind-blowing applications. Multimodal AI is already impacting industries from healthcare to autonomous driving. Google AI applied its multimodal PaLM model to medical question-answering (Med-PaLM) and robotics (PaLM-E). In autonomous driving, Multimodal LLMs have been used for scenario generation (GAIA-1) and perception (Talk2BEV).
Foundation LLMs contain billions of parameters and require massive amounts of power to train from scratch. However, multimodal models often need to be deployed at low power and latency within small memory and silicon budgets. One solution is to deploy a model with fewer weights and activations, such as MiniCPM-V or TinyLLaVA. These models attain sufficiently high accuracy for some mobile and embedded applications while reducing compute and memory footprint.
It is also possible to apply software and hardware optimizations to the model. Structured sparsity and weight compression reduce the number of cycles needed to compute matrix multiplications. Quantization is all but essential for deploying any multimodal model efficiently; LLM weights can even be quantized to 4 bits using QLoRA and related methods.
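As a minimal illustration of what quantization buys, the sketch below applies symmetric per-tensor 8-bit quantization to one weight matrix. Production flows, and 4-bit schemes such as the NF4 format used by QLoRA, are considerably more involved; the matrix size here is illustrative.

```python
import torch

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: w ~ scale * q, with q in [-127, 127]."""
    scale = w.abs().max() / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

w = torch.randn(4096, 4096)          # one illustrative fp32 weight matrix
q, scale = quantize_int8(w)
w_hat = q.float() * scale            # dequantized weights used at compute time

print(w.numel() * 4 / 2**20, "MiB fp32")   # 64.0 MiB
print(q.numel() * 1 / 2**20, "MiB int8")   # 16.0 MiB
print((w - w_hat).abs().max())             # worst-case rounding error
```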
An NPU designed for multimodal inference must be able to handle sparsity and compression and support the necessary range of input data precisions. With larger multimodal models, computation is typically bottlenecked by the writing and reading of weights to and from external memory. As models become more efficient, choosing a high-utilization NPU becomes more critical.
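A back-of-the-envelope sketch shows why weight traffic dominates: during autoregressive decoding, each generated token reads roughly every weight once, so the required external-memory bandwidth scales with parameter count times bytes per weight. The parameter count and token rate below are illustrative assumptions, and KV-cache traffic is ignored.

```python
# Rough bandwidth needed to stream weights from external memory during decoding.
# Every generated token reads (roughly) all weights once; KV-cache traffic is ignored.
params = 7e9                 # illustrative 7B-parameter LLM decoder
tokens_per_second = 10       # illustrative target decode rate

for bits in (16, 8, 4):
    gb_per_token = params * bits / 8 / 1e9
    print(f"{bits:2d}-bit weights: {gb_per_token:5.2f} GB/token, "
          f"{gb_per_token * tokens_per_second:6.1f} GB/s at {tokens_per_second} tok/s")
# 16-bit: 14 GB/token -> 140 GB/s; 4-bit: 3.5 GB/token -> 35 GB/s
```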
An effective NPU must also support all the necessary operators in software. Since multimodal learning is a rapidly growing field, future-proofing is often critical. A well-designed NPU will be adaptable to the requirements of future models, in addition to attaining state-of-the-art performance on current models. While the precise design of these models is a work in progress, it is almost certain that an increasing number of models will process input data from multiple modalities.