An option for when cost, efficiency, speed, and ease of deployment are prioritized.
While Large Language Models (LLMs) like GPT-3 and GPT-4 have quickly become synonymous with AI, LLM mass deployments in both training and inference applications have, to date, been predominantly cloud-based. This is primarily due to the sheer size of the models; the resulting processing and memory requirements often overwhelm the capabilities of edge-based systems. While the efficiency of NPUs such as Expedera’s Origin continues to increase dramatically, most experts believe that memory will remain the bottleneck for some time in large-scale edge deployments of LLMs. In Expedera’s customer engagements, we have yet to find an instance where a desired edge LLM deployment was not memory-bound; the amount of memory required for reasonable LLM training or inference performance has consistently exceeded the edge device’s power, cost, and size budgets.
While memory catches up, the AI industry has focused on a new model type, the Small Language Model (SLM). Like LLMs, SLMs are language models. However, unlike LLMs, which can have hundreds of billions of parameters, SLMs have significantly fewer parameters, often only a few million to several hundred million.
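To make the memory gap concrete, a rough back-of-the-envelope calculation helps: weight storage alone scales with parameter count times bytes per parameter. The sketch below is illustrative only (the model sizes and precisions are assumptions for comparison) and ignores activations, KV caches, and runtime overhead, which make the real footprint larger still.

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Illustrative only; ignores activations, KV caches, and runtime overhead.

def weight_memory_mb(num_params: float, bytes_per_param: float) -> float:
    """Approximate weight storage in MiB."""
    return num_params * bytes_per_param / (1024 ** 2)

# A 7B-parameter LLM stored in FP16 (2 bytes per parameter): roughly 13 GB of weights alone.
print(f"7B LLM  @ FP16: {weight_memory_mb(7e9, 2) / 1024:.1f} GB")

# A 66M-parameter SLM such as DistilBERT in INT8 (1 byte per parameter): roughly 63 MB.
print(f"66M SLM @ INT8: {weight_memory_mb(66e6, 1):.0f} MB")
```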
SLMs offer several significant advantages over LLMs when considering deployment on edge devices, including:

- Far smaller memory and compute footprints, which fit within edge power, cost, and size budgets
- Enhanced privacy, since processing can remain entirely on-device
- Lower energy consumption, prolonging battery life
- Easier fine-tuning for specific tasks and simpler ongoing maintenance
However, SLMs do have drawbacks compared to their larger brethren, including:

- Reduced language comprehension and generation capability, limiting their suitability for complex, open-ended tasks
- Narrower general knowledge than models trained with hundreds of billions of parameters
Even with these drawbacks, SLMs are seen as a near-term path forward for edge deployment of language models. Here is a breakdown of some example SLMs:
| Model | Parameters (M) | Description | Use Cases |
| --- | --- | --- | --- |
| DistilBERT | 66 | A smaller, faster, and lighter version of BERT, reportedly retaining about 97% of BERT’s language understanding while being 40% smaller and 60% faster | Text classification, sentiment analysis, and question answering |
| TinyBERT | 14.5 | Optimized for efficient inference, with further compression than DistilBERT | Intent recognition, voice assistants, and contextual search in apps |
| ALBERT | 12 | Reduces BERT’s size by sharing parameters across layers and using factorized embedding parameterization, making it more lightweight and memory-efficient | Document classification, named entity recognition (NER), and others |
| MiniLM | 22 | A distilled version of Microsoft’s Transformer models | Text summarization, machine translation, and search engines |
| Reformer and Longformer | 41 (Longformer) | Optimized to handle long text sequences more efficiently than traditional transformers, allowing small- to medium-sized models to process long inputs without major memory usage | Document analysis, summarization, and handling long transcripts or legal documents in customer service and content moderation |
| Ada and Babbage | 350 (Ada) | Smaller versions of OpenAI’s GPT-3 family of language models | Classification, text completion, and basic conversational AI tasks |
| T5-Small and T5-Base | 60 (T5-Small) | Smaller variants of the T5 (Text-To-Text Transfer Transformer) models | Summarization, translation, and other language generation tasks |
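As a sense of how lightweight these models are in practice, the sketch below runs the DistilBERT use case from the table (sentiment analysis) with the Hugging Face transformers library. It assumes that library and a PyTorch backend are installed on the target device; the checkpoint name is one publicly available DistilBERT model fine-tuned for sentiment, used here purely for illustration.

```python
# Minimal sketch: on-device sentiment analysis with a ~66M-parameter DistilBERT model.
# Assumes the Hugging Face transformers library (with a PyTorch backend) is installed.
from transformers import pipeline

# Publicly available DistilBERT checkpoint fine-tuned for sentiment classification.
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

print(classifier("The new firmware update made the device noticeably faster."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```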
Small Language Models may be preferred over Large Language Models for edge deployments where cost, efficiency, speed, and ease of deployment are prioritized. SLMs offer enhanced privacy by enabling on-device processing, require significantly less energy, and prolong battery life. They are also easier to fine-tune for specific tasks and more manageable to maintain, making them ideal for high-volume, real-time, or specialized edge applications where advanced language comprehension isn’t essential. A minimal fine-tuning sketch follows below.
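To illustrate the fine-tuning point, the sketch below adapts DistilBERT to a specific classification task with the Hugging Face transformers and datasets libraries. The dataset, subset sizes, and hyperparameters are illustrative assumptions, not a tuned recipe; the key takeaway is that a model this small can be fine-tuned with a short, standard training loop.

```python
# Minimal sketch: fine-tuning a small model (DistilBERT) for a specific classification task.
# Assumes the transformers and datasets libraries are installed; the dataset and
# hyperparameters below are illustrative, not a tuned recipe.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

# Public sentiment dataset used here purely as an example target task.
dataset = load_dataset("imdb")
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length"),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="distilbert-finetuned",
        num_train_epochs=1,
        per_device_train_batch_size=16,
    ),
    # Small subsets keep the example quick; use the full splits for a real run.
    train_dataset=tokenized["train"].shuffle(seed=42).select(range(2000)),
    eval_dataset=tokenized["test"].select(range(500)),
)
trainer.train()
```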