Language’s Role In Embodied Agents

How language-trained models can improve the abilities of mobile robots and self-driving vehicles.


Large Language Models (LLMs) and models cross-trained on natural language are a major growth area for edge applications of neural networks and Artificial Intelligence (AI). Among these applications, embodied agents stand out as a rapidly developing focal point. This article addresses developments in this space and how language-trained models improve the abilities of mobile robots and self-driving vehicles. It does not focus on the advantages of communicating with human operators; those benefits are more self-evident and are better weighed case by case against the intended use. Instead, it focuses on three benefits embodied agents can realize for their own operation and function.

The first benefit of cross-training embodied agent models on language appears, on the surface, to be communicative: the ability to translate real-world instructions delivered by humans into efficient action taken by the agent. There are numerous ways to accomplish this, such as OpenPAL’s hybridization of reinforcement learning with more traditional LLMs to yield a joint representation capable of responding efficiently to arbitrary instructions and environments. However, this result hides a more significant improvement that has been replicated in other models: adding language improves embodied agent reasoning, even in non-linguistic tasks and settings. The clearest case study is the Dynalang model, developed at UC Berkeley on the foundation of Danijar Hafner’s earlier Dreamer models. Dreamer was a fully non-linguistic model that competed on the same terms as other non-linguistic models with iterative learning. Dynalang kept the same structure as Dreamer, but cross-training on language significantly improved its ability to navigate ambiguous, unfamiliar environments that did not match the training set (while also achieving the original intent of letting the model learn from textual data). It also demonstrated an ability to reason when the stated commands and goals could not be accomplished within the provided environment. This versatility and improvement in reasoning is a core part of how language integration improves the performance of embodied agents.

The second benefit is that cross-training on language lets these models use language as a kind of abstracted, lossy memory and data compression. In many of these models (Dynalang again being a good example), cross-training on language allows the model to store representations of what it has seen in a much smaller format. The model does not need to store or reference hours of video when it navigates a situation it has encountered before, nor must its analysis and appraisal of the situation be entirely de novo every time, much as we humans use language-based identifiers for recalling and conveying directions. This benefit manifests itself in other settings as well: Microsoft’s Recall feature takes in a stream of visual data in the form of frequent screenshots, but its reference database of the user’s past activity is represented as natural-language text. That allows it to maintain an extensive knowledge base acquired from gigabytes of visual data but compressed to a tiny on-disk text representation, which is far faster to access and reference.
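The memory idea described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `TextMemory` class, the keyword-overlap recall, and the 640x480 RGB frame size are hypothetical and do not describe how Dynalang or Recall are actually implemented. It only shows that a caption per observation occupies far less space than the frame it summarizes, while remaining searchable.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumed size of one uncompressed 640x480 RGB frame, used only for comparison.
FRAME_BYTES = 640 * 480 * 3


@dataclass
class TextMemory:
    """Hypothetical store that keeps a text caption per step instead of a frame."""
    entries: List[Tuple[int, str]] = field(default_factory=list)

    def remember(self, step: int, caption: str) -> None:
        # A real agent would obtain the caption from a vision-to-language model.
        self.entries.append((step, caption))

    def recall(self, query: str) -> List[Tuple[int, str]]:
        # Naive keyword overlap; a real system would likely use learned embeddings.
        words = set(query.lower().split())
        return [(s, c) for s, c in self.entries if words & set(c.lower().split())]

    def size_bytes(self) -> int:
        # Total UTF-8 size of all stored captions.
        return sum(len(c.encode("utf-8")) for _, c in self.entries)


if __name__ == "__main__":
    memory = TextMemory()
    memory.remember(0, "hallway with a red door on the left")
    memory.remember(1, "kitchen with a table and two chairs")
    print(memory.recall("red door"))
    print(memory.size_bytes(), "bytes of text vs", 2 * FRAME_BYTES, "bytes of raw frames")
```

The compression is lossy by design: anything the caption omits cannot be recalled later, which is the trade-off the paragraph above describes.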

The third benefit of integrating language in embodied agents at the edge is that it allows for the efficient repackaging of visual data for subsequent analysis. A language narrative can serve as the synthesis of the visual or environmental inputs, so two lower-complexity models can be trained at lower cost than a single unified architecture. Typically, this means a vision-to-language model to interpret the environment and a language model to act as the decision-maker for the actions to take. This can be seen in practice with the LINGO-2 model for self-driving cars. Additionally, there have been numerous academic explorations of a similar concept, taking an off-the-shelf foundation model such as LLaMA (rather than building the model themselves) and fine-tuning it for use as a decision-maker for embodied agents; excellent further discussion of the concept can be found in Choi et al.’s “LoTa-Bench” paper on benchmarking LLMs as task planners. Language forms a natural abstraction of the inputs, reducing them to their salient features, and it offers the potential to mitigate the burdensome training associated with reinforcement learning methods. It can also distribute processing requirements in a way that may be easier to map onto hardware at the edge: the two coordinated models can be hosted separately, and the only data that needs to pass between them is a text stream, which lets separate processors be optimized for each workload.
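The two-model split described above can be sketched as follows. This is a toy illustration under stated assumptions, not the architecture of LINGO-2 or of any system in the LoTa-Bench paper: `Observation`, `describe_scene`, `decide_action`, and `run_pipeline` are hypothetical names, and both "models" are rule-based stand-ins for a vision-to-language model and a language-model planner. The point it shows is that only a text payload crosses the boundary between the two stages.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Observation:
    """Hypothetical perception result; a real system would start from camera frames."""
    objects: List[str]
    ego_speed_kph: float


def describe_scene(obs: Observation) -> str:
    """Stand-in for a vision-to-language model: condenses the scene into text."""
    seen = ", ".join(obs.objects) if obs.objects else "nothing notable"
    return f"Ego speed {obs.ego_speed_kph:.0f} km/h. Visible: {seen}."


def decide_action(narrative: str) -> str:
    """Stand-in for a language-model planner: maps the narrative to an action."""
    text = narrative.lower()
    if "pedestrian" in text or "red light" in text:
        return "brake"
    if "slow vehicle" in text:
        return "change_lane"
    return "maintain_speed"


def run_pipeline(obs: Observation) -> Tuple[str, str]:
    narrative = describe_scene(obs)
    # Only this UTF-8 text would need to travel between the two processors.
    payload = narrative.encode("utf-8")
    action = decide_action(payload.decode("utf-8"))
    return narrative, action


if __name__ == "__main__":
    print(run_pipeline(Observation(["pedestrian", "parked car"], 30.0)))
    print(run_pipeline(Observation([], 50.0)))
```

Because the interface is plain text, either stage could be swapped out or moved to a different processor without changing the other, as long as the narrative format stays compatible.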

Natural language in embodied agents is an integral part of the trajectory of the AI field. Expedera’s view is that language’s role as a way of structuring thoughts and sharing collective memory is transferable to AI models, in much the same way that biological structures formed the inspiration for neural networks. Including language in these models is not just a way to let users talk to AI models; it also allows those models to be more efficient and capable even at tasks where language is not a critical component, such as autonomous driving.
