Voice Recognition’s Role In Safer, More Secure Car Design

Making voice recognition work in the car requires getting the best of both local processing and the cloud.


By Soshun Arai, ARM, and Mark Sykes, Recognition Technologies

Look around the dashboard of a modern car and you will see dials, buttons and knobs everywhere. While each has its own purpose, they can confuse and distract people, especially when a driver should be paying attention to the road. Add to this new laws that promise harsher punishments for drivers using mobile devices, and you can see that things need to change inside the vehicle. Voice-activated in-vehicle infotainment (IVI) is the front line in offering a better user experience to drivers and passengers, for a safer, more comfortable journey. But, to date, technological challenges have limited our ability to deliver on a seamless, safe and secure solution.

This is partly due to the successes that cloud-based technologies have brought to bear on IVI and other mobile features. The cloud has brought us any number of powerful, useful recognition technologies, such as speech, speaker, and facial recognition and natural language understanding. These rely on the remarkably efficient distributed processing that the cloud delivers. But the cloud has its limitations, especially with regard to delivering real-time automotive cockpit functionality that requires user privacy, security and reliable connectivity.

Voice recognition needs to improve for it to become the de facto IVI interface

But advances in IP, SoC and system design are breaking down barriers to delivering that real-time cockpit capability. Indeed, a new model of collaborative embedded-cloud operation is available that promotes uninterrupted connectivity and addresses emerging cloud challenges for the cockpit. We call this the embedded cloud recognition glue layer.

What is an Embedded Recognition Glue Layer?
An embedded recognition glue layer is a native service that provides functional redundancy for improved recognition performance and reliability. It does so by collaborating as a counterpart with cloud API services. The embedded glue layer provides the localized control, user security and privacy, configuration and connectivity redundancy that cloud computing cannot guarantee.

With cloud computing, it is not always possible to depend on bandwidth allocation and connectivity. With an embedded glue layer, voice or visual data can be captured and processed locally, without complete dependence on the cloud. In other words, the glue layer is an embedded service that collaborates with the cloud-based service to provide native on-device processing. Where user or vehicle security, privacy and protection are required, the glue layer allows mission-critical voice tasks to be processed natively on the device while ensuring continuous availability. Non-mission-critical tasks, such as natural language understanding, can be processed in the cloud using low-bandwidth textual data as the mode of bilateral transmission. The embedded recognition glue layer provides nearly the same scope as a cloud-based service, albeit natively.

Easing the path to a fully connected car
So why is this hybrid approach, cloud plus native compute, so important? According to VDC Research, major threats against connected cars fall into two categories: safety and data privacy. All connected devices are at risk of some form of attack. Transmitting user voice data from a car over a wireless network presents serious user safety and privacy concerns. Attackers able to penetrate either the car or the cloud provider could access users’ personal biometric information. Once biometric information is compromised, it cannot be changed like a password.

Additionally, cloud-based recognition systems face two streaming limitations when transmitting voice audio and visual data over a wireless network. The first is the data rate: audio is transmitted at 128 kbps on average, meaning that each second of user voice audio comprises about 128 kilobits of data. This rate is achieved using Ogg Vorbis, MP3, or other lossy compression techniques. The second is the bandwidth, or connection speed to the cloud, which limits the native application’s ability to transmit audio to the cloud. Cloud providers pay for their bandwidth usage and must balance high-quality performance against network bandwidth costs. Transferring large amounts of audio data in real time is therefore problematic from both a cost and a reliability perspective.

In contrast, the bandwidth required to transmit textual data is minimal, well within the capacity of even the slowest modem. For example, the fastest speaker in the world speaks at a rate of 586 words per minute, which comes to fewer than 10 words per second. The average word has 4.5 letters, so at one byte per letter the data rate is about 45 bytes per second, or about 0.36 kilobits per second. The average person speaks far more slowly, so the bandwidth needed in practice is lower still.
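The arithmetic above can be reproduced in a few lines. This is only a back-of-the-envelope sketch: it assumes one byte per letter and ignores spaces, punctuation and protocol overhead, and the function and constant names are illustrative rather than taken from any product.

```python
# Figures quoted in the article.
AUDIO_KBPS = 128.0        # average compressed audio data rate
WORDS_PER_MINUTE = 586    # rate of the fastest speaker in the world
LETTERS_PER_WORD = 4.5    # average word length


def text_rate_kbps(words_per_second, letters_per_word=LETTERS_PER_WORD):
    """Text data rate in kilobits per second, assuming one byte per letter."""
    bytes_per_second = words_per_second * letters_per_word
    return bytes_per_second * 8 / 1000


# The article rounds the speaking rate up to 10 words per second.
rounded_rate = text_rate_kbps(10)

# Without rounding, 586 words per minute is slightly under 10 words per second.
exact_rate = text_rate_kbps(WORDS_PER_MINUTE / 60)

# How many times more data the audio stream carries than the equivalent text.
audio_to_text_ratio = AUDIO_KBPS / rounded_rate

print(f"text, rounded to 10 words/s: {rounded_rate:.2f} kbps")
print(f"text, exact 586 words/min:   {exact_rate:.2f} kbps")
print(f"audio is roughly {audio_to_text_ratio:.0f} times larger than text")
```

Under these assumptions, even the fastest speaker generates text at a rate hundreds of times lower than the compressed audio stream.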

Separating mission critical tasks for greater security and performance
A hybrid recognition model incorporating an embedded recognition glue layer with cloud computing offers several significant advantages over cloud computing alone. It remedies several limitations of the cloud service model and helps to boost interaction utility, user security and privacy and uninterrupted connectivity. The implementation of an embedded recognition glue layer offers native processing options for user data captured locally, with tasks separated as follows:

  • Mission-critical tasks: The most mission-critical tasks and sensitive user data are captured and processed locally. The embedded recognition glue layer stores only statistical models of the user and their biometrics, not the biometrics themselves. Therefore, the user’s biometrics cannot be compromised during network transmission.
  • Non-mission-critical tasks: These tasks are pushed to the cloud, where they leverage distributed computing power for processing-intensive work.

Prioritizing and modularizing both mission-critical and non-mission-critical processes makes it possible to extend system functionality, increase system efficiency (for example, through bandwidth optimization) and provide critical user security, privacy and protection.
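The task separation described above can be pictured as a small routing decision. The following is a minimal sketch, not the authors' implementation: the `VoiceTask` type, the `route` function, the example task names and the local fallback when the cloud is unreachable are all hypothetical, chosen only to illustrate the idea.

```python
from dataclasses import dataclass
from enum import Enum


class Target(Enum):
    LOCAL = "local"   # processed natively by the embedded glue layer
    CLOUD = "cloud"   # sent to the cloud service as low-bandwidth text


@dataclass(frozen=True)
class VoiceTask:
    name: str
    mission_critical: bool  # hypothetical flag: touches biometrics or safety


def route(task: VoiceTask, cloud_available: bool) -> Target:
    """Decide where a task runs under the hybrid model."""
    if task.mission_critical:
        # Sensitive data never leaves the device.
        return Target.LOCAL
    if not cloud_available:
        # Assumed redundancy: keep working locally when connectivity drops.
        return Target.LOCAL
    return Target.CLOUD


# Illustrative tasks; the classification of each is an assumption.
tasks = [
    VoiceTask("speaker_verification", mission_critical=True),
    VoiceTask("natural_language_understanding", mission_critical=False),
]

for task in tasks:
    print(task.name, "->", route(task, cloud_available=True).value)
```

In a sketch like this, the mission-critical flag is the single point where privacy policy is enforced, so a lost connection changes only where non-sensitive work runs.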

Voice recognition engines, like automotive engines, are not all built equal. As automakers embrace new cockpit technologies, natural language voice interaction built on this hybrid model offers both drivers and passengers a better hands-free experience and increased safety. You can learn more about how the hybrid model works in the white paper, Reimagining Voice In The Driving Seat.

And if you want to know more about these and many other topics, check out the upcoming ARM Research Summit, Sept. 11-13, in Cambridge, UK. You can help define the topics, propose presentations, and register here.