Full speech recognition will require fundamental innovations that allow processing at very high performance per watt.
Speech recognition has become an increasingly important feature in a wide range of devices. Wakewords such as Alexa, OK Google, or Siri are now a standard feature of wearables, smart speakers, mobile phones, and even laptops. These devices have already shipped in the millions of units, and consumers are getting better at using this feature. Wakeword recognition is slowly evolving into wakeword personalization, which enables users to set their own word to wake the device. A further extension of this feature is command recognition, which enables devices to recognize dozens of spoken commands.
Speech can come from a wide range of environments, some of them noisy or windy. Emerging speech-related use cases deal with background noise elimination, speech enhancement, or active noise cancellation. An example of such a use case would be a device that eliminates the noise of a vacuum cleaner running in the background. Some device manufacturers are even thinking about enabling complete speech recognition capability on a device. A device with such a feature would be able to listen to questions from the user, understand context, and provide an answer. For example, one could ask a microwave about the best settings for microwaving popcorn, and the device could respond by describing those settings.
Speech processing is computationally expensive. On the edge, the compute required for speech applications can range from MegaOPS to GigaOPS (or even higher if there’s no compression). Any chip supporting speech applications must provide the necessary compute within the performance, power, and cost envelope dictated by these devices. This gives AI chip companies a new market in which to grow.
There are several challenges these chips must overcome to make this a sustainable, long-term business. First, there’s the performance per watt limitation, which is particularly critical for battery-powered devices. The chip must provide the highest possible compute within the available energy to enable efficient processing of speech. The chip should also meet performance requirements such as latency within the required performance per watt envelope.
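To make the performance per watt constraint concrete, here is a rough back-of-the-envelope sketch in Python. Every number in it is a hypothetical placeholder chosen for illustration (the workload sizes, the 1 TOPS/W efficiency, and the battery capacity are assumptions, not figures from any vendor or from this article). It only shows how average power and battery life follow from a workload's operations per second and a chip's operations per second per watt.

```python
def required_power_mw(ops_per_second: float, ops_per_second_per_watt: float) -> float:
    """Average power (milliwatts) needed to sustain a workload.

    power [W] = workload [ops/s] / efficiency [ops/s per W]
    """
    if ops_per_second_per_watt <= 0:
        raise ValueError("efficiency must be positive")
    return ops_per_second / ops_per_second_per_watt * 1e3


def battery_life_hours(battery_mwh: float, power_mw: float) -> float:
    """Hours a battery lasts if the speech workload were the only load."""
    if power_mw <= 0:
        raise ValueError("power must be positive")
    return battery_mwh / power_mw


# Hypothetical workloads spanning the MegaOPS-to-GigaOPS range (assumed values).
WAKEWORD_OPS = 10e6      # 10 MegaOPS, assumed wakeword-class workload
FULL_SPEECH_OPS = 10e9   # 10 GigaOPS, assumed full-recognition-class workload

# Hypothetical chip efficiency and battery capacity (assumed values).
EFFICIENCY = 1e12        # 1 TOPS/W
BATTERY_MWH = 1000.0     # 1000 mWh

for name, ops in [("wakeword", WAKEWORD_OPS), ("full speech", FULL_SPEECH_OPS)]:
    power = required_power_mw(ops, EFFICIENCY)
    hours = battery_life_hours(BATTERY_MWH, power)
    print(f"{name}: {power:.3g} mW, {hours:.3g} h of battery")
```

Because power scales linearly with compute at a fixed efficiency, a thousandfold jump in operations per second means a thousandfold jump in power draw unless performance per watt improves by a similar factor.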
Today’s popular chips for wakeword detection are based on CPU/DSP architectures. Companies such as Syntiant, Synaptics, and Ambiq, which have shipped millions of units, use CPU/DSP architectures. Some other vendors, such as Analog Devices, use systolic arrays for AI algorithm acceleration. However, given the limitations imposed by semiconductor node physics, it will be extremely hard, if not impossible, for these architectures to scale to the compute required for full speech recognition while staying within the energy envelope. To get to that level of compute, some fundamental innovations may be necessary. New architectures that allow processing at very high performance per watt, such as Processing in Memory (PIM) or the Legendre Memory Unit, might be necessary.
Then there’s the Bill of Materials (BOM) restriction. Speech is one of the many features available on these devices, and OEMs must balance the available budget with silicon spend. It is unclear if OEMs would be willing to pay separately for a new class of chips that enable speech functionality and, if so, how much. This might put a limit on the Average Selling Price (ASP) of such chips. OEMs might demand maximum functionality at the lowest possible price. For example, they might ask for complete speech recognition functionality at, say, $1, the same price at which they get only wakeword detection. Today the industry stands at the latter: wakeword detection alone at that price point.
Additionally, there’s the challenge of evolving algorithms. Speech applications could use classic, shallow machine learning algorithms or modern, deeper neural network-based algorithms. Speech application pipelines could also demand support for speech decode/encode and DSP. All these concepts are relatively new and evolving. Mapping these software algorithms to a given chip architecture poses a great challenge. Compressing and optimizing for the best performance poses an even larger challenge.
The trend is nevertheless positive as of 2022: OEMs are announcing products and vendors are announcing funding rounds. To make the most of the opportunity, the chip world must overcome several challenges and keep innovating. In time, we will know whether speech applications will lead to a new category of AI chips.