Recognizing someone based on the characteristics of their speech instead of the words being spoken.
In recent years, the way we interact with our TVs has changed. Multiple button presses to navigate an on-screen keyboard have been replaced with direct interaction through our voices. While this has resulted in significant improvements to the Digital Television (DTV) user experience, more can be done to provide immersive and engaging experiences.
Imagine you say, “recommend me a film” or “what is my calendar looking like today” and your TV provides a personalized answer because it knows who is talking. This is what speaker identification can offer. Speaker identification is the process of recognizing someone based on the characteristics of their speech instead of the words being spoken.
Current AI trends are pushing new models to be larger and more powerful than ever before. While these large models produce impressive results, the cost, potential security risks, and latency issues mean edge AI can be a more attractive solution.
Speaker recognition is a field that aims to solve two main tasks: speaker identification and speaker verification. Speaker verification confirms whether a speaker is who they claim to be and is centered on security applications. Speaker identification, meanwhile, aims to determine a speaker's identity by matching against an enrollment database. Speaker identification is an open set problem: the speaker may be a known user enrolled on the system or an unknown, or “guest,” speaker.
Speaker identification has been an active research area for several years. More recently, deep neural networks (DNNs) have been employed, delivering significant improvements in recognition accuracy.
This blog post shows the results of a research project into speaker identification using a DNN that can run on Arm devices without the need for cloud-based inference. A high-level overview of how speaker identification systems work is provided, before diving into the design and implementation used in this project. The key performance results are discussed, as well as a section on quantization. The conclusion section gives an overview of the findings and highlights potential future work.
The image below shows a typical speaker identification system. As shown, the raw audio input is fed into a pre-processing stage which extracts the key features of the audio. The pre-processed audio is then fed to the model which performs embedding extraction. The model outputs a vector representation of the input audio called an embedding.
Fig. 1: A typical speaker identification system.
Embeddings can be compared using a scoring function, typically cosine similarity; the higher the score, the more similar two embeddings are. If the score is above a certain threshold, we can say the two speakers are the same.
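As a concrete illustration, the sketch below compares two embeddings with cosine similarity. The 192-dimension embedding size matches the TitaNet paper, while the 0.7 threshold is an illustrative assumption; a real deployment would tune it for the model and application.

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Dot product of the two vectors, normalized by their magnitudes.
    # Scores range from -1 to 1; higher means more similar.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Dummy embeddings standing in for two model outputs
emb_a = np.random.rand(192)
emb_b = np.random.rand(192)

THRESHOLD = 0.7  # assumption; tune per model and application
same_speaker = cosine_similarity(emb_a, emb_b) > THRESHOLD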
When researching suitable models for this project, the main considerations were model size, recognition accuracy, and inference time on resource-constrained hardware.
The model used in the final implementation is “TitaNet Small”, described in the paper “TitaNet: Neural model for speaker representation with 1D depth-wise separable convolutions and global context”, Koluguri et al., 2021. It builds on the “ContextNet” (Han et al., 2020) encoder-decoder architecture and was trained end-to-end with additive angular margin loss, which helps to optimize the cosine distance between speaker embeddings. More information on the model is available in the linked paper above, as well as in the model card available here.
The TitaNet Small model is less than half the size of TitaNet Large, and the paper reports error rates of 1.15 and 0.68 percent respectively. This means TitaNet Small provides an excellent trade-off between model size and accuracy, while offering very good inference times even in resource-constrained environments.
TitaNet Small – like most speech-based machine learning models – does not accept raw audio as input. Pre-processing is needed to compute features in the audio. In this case, Mel Spectrogram audio processing is used.
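As an illustration of this stage, the sketch below computes a log Mel spectrogram with librosa. The parameters shown (80 Mel bands, a 25 ms window, and a 10 ms hop at 16 kHz) are typical for speech models and are assumptions here, not necessarily the exact configuration used in the app.

import librosa
import numpy as np

# Load mono audio at 16 kHz (assumed sample rate)
audio, sr = librosa.load("speech.wav", sr=16000)

mel = librosa.feature.melspectrogram(
    y=audio,
    sr=sr,
    n_fft=512,
    win_length=int(0.025 * sr),  # 25 ms analysis window
    hop_length=int(0.010 * sr),  # 10 ms hop between frames
    n_mels=80,                   # number of Mel bands
)
log_mel = np.log(mel + 1e-9)  # log compression; epsilon avoids log(0)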
In speaker identification systems, it is useful to support audio inputs of different lengths, particularly during enrollment, where users can stop the audio recording at any time. This means machine learning (ML) models designed to process audio often have dynamic axes as inputs, enabling them to handle audio of varying length.
Unfortunately, many of the available TFLite inference backends have either limited support for dynamic input axes, or do not support them at all.
This application is designed around a model with dynamic inputs. However, a fixed axes model was also exported to enable testing inference on the fixed-axes-only GPU delegate for TFLite. Results for this, as well as comparisons between the fixed and dynamic axes models, are found in the performance results section later in this post.
Figure 2 shows a high-level block diagram of the application structure. The app was developed for Android, but the software stack could easily be ported to other popular DTV operating systems.
Fig. 2: A high-level block diagram of the application structure.
The majority of the app logic and performance critical work is handled by the C++ layer of the application.
The TitaNet Small model was originally developed using the PyTorch framework. While PyTorch is suited to testing and developing models, to maximize on-device inference performance, this model was converted to TensorFlow Lite (TFLite).
Conversion of ML models is often challenging. While there are many available tools, finding the correct one that supports the operators in your model can be an issue. In this project, the model was first exported to ONNX format, then the “onnx-tensorflow” tool was used to convert to a TensorFlow model.
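For the first step, the PyTorch model can be exported with torch.onnx.export. The sketch below is illustrative: the input name, example shape, and dynamic axis are assumptions based on the model taking variable-length, pre-processed audio features.

import os
import torch

# 'model' is the loaded PyTorch TitaNet Small model
# Hypothetical example input: batch of 1, 80 Mel bands, 300 frames
dummy_features = torch.randn(1, 80, 300)

torch.onnx.export(
    model,
    dummy_features,
    f"{os.path.join(export_path, export_name)}.onnx",
    input_names=["audio_features"],
    output_names=["embedding"],
    dynamic_axes={"audio_features": {2: "num_frames"}},  # variable audio length
)

The resulting ONNX model is then converted to TensorFlow with the onnx-tensorflow tool, as shown below.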
import os

import onnx
from onnx_tf.backend import prepare

# Load the exported ONNX model
onnx_model = onnx.load(f"{os.path.join(export_path, export_name)}.onnx")

# Convert to a TensorFlow representation and save as a SavedModel
tf_rep = prepare(onnx_model, device='CPU')
tf_rep.export_graph(os.path.join(export_path, export_name))
Once in TensorFlow format, the TFLiteConverter provided by the TensorFlow library was used to export a TFLite model.
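A minimal sketch of that step, assuming the SavedModel directory produced by the previous snippet:

import os
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model(
    os.path.join(export_path, export_name)
)
tflite_model = converter.convert()

with open(f"{os.path.join(export_path, export_name)}.tflite", "wb") as f:
    f.write(tflite_model)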
A quantized model was also produced using TFLite’s post training quantization optimizations. More information is available in this post under Quantization.
The app supports both pre-recorded audio and live audio. Below is a screenshot of the pre-recorded data page. As shown, when two clips of the same speaker are chosen, the system is able to verify this. Likewise, when two different speakers are selected, they are reported as different speakers.
Fig. 3: The pre-recorded data page.
In live data mode, the app can identify speakers from live audio input. The microphone continuously records incoming audio, and when enough audio is recorded (every 2s), the identification pipeline is executed. New users can be enrolled into the system. If the speaker is not recognized, the system will show that there is an unknown speaker.
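To make the open-set logic concrete, below is a minimal sketch of matching a live embedding against an enrollment database. The identify function, the enrolled dictionary, and the 0.7 threshold are illustrative assumptions, not the app's actual implementation.

import numpy as np

def identify(embedding, enrolled, threshold=0.7):
    # enrolled: dict mapping speaker name -> enrolled embedding
    best_name, best_score = None, -1.0
    for name, ref in enrolled.items():
        score = float(np.dot(embedding, ref) /
                      (np.linalg.norm(embedding) * np.linalg.norm(ref)))
        if score > best_score:
            best_name, best_score = name, score
    # Open set decision: below the threshold, report an unknown speaker
    if best_score < threshold:
        return "unknown", best_score
    return best_name, best_score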
Fig. 4: New user enrolment and identification.
The key performance metrics for this application are pipeline execution time and CPU/GPU utilization.
Pipeline execution time is important as the faster the pipeline executes, the more time there is for the processors to handle other tasks running on the DTV.
Understanding CPU/GPU utilization helps to paint a picture of how well the application is using the resources available when running inference. It also shows what is going on between inference runs.
When testing the pipeline, it was found that inference time dominates the time taken for the pipeline to execute. This is illustrated in the figure below which shows normalized pipeline timings. Therefore, it is important to understand how different devices and backends affect inference. Arm Streamline Performance Analyzer was used to benchmark timing performance as well as to capture CPU/GPU utilization.
Fig. 5: Normalized pipeline timings.
To compare pipeline execution times between devices, the application was run on a quad core Cortex-A76 CPU and a quad core Cortex-A55 CPU, using the XNNPack backend for TFLite. The test was then repeated using only a single core on each processor. These results are summarized in the graph below.
Fig. 6: Results of pipeline execution time between devices.
As expected, the average pipeline execution time on the Cortex-A76 is faster than on the Cortex-A55, showing that more powerful hardware delivers improved execution times. The graph also shows significant gains from quad core execution compared to single core: an average speedup of 3x on the quad core Cortex-A76 and 3.8x on the quad core Cortex-A55. While not all of the pipeline is parallelizable, a large part of the inference is, resulting in these performance improvements.
The less time spent running the pipeline, the more time there is for the CPU to run other tasks, yielding a better overall user experience. These tests also show that it is possible to run speaker identification on a single Cortex-A55 core and still achieve acceptable inference times, although there were noticeable impacts on UI responsiveness.
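As a rough illustration of how thread count can be configured, the sketch below uses the Python TFLite API; the application itself drives inference from its C++ layer, so treat this as an assumption-laden sketch rather than the app's code.

import tensorflow as tf

# Quad core run: give the interpreter four threads
# (set num_threads=1 for the single core comparison)
interpreter = tf.lite.Interpreter(
    model_path="titanet_small.tflite",
    num_threads=4,
)
interpreter.allocate_tensors()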
Arm Mali GPUs can also be leveraged for ML workloads, further reducing CPU utilization and leaving more headroom for other tasks. As mentioned previously, a fixed axes model must be used with the GPU delegate for TFLite (at the time of writing). The graph below compares the fixed axes model running entirely on a quad core Cortex-A76 with split execution between the GPU delegate and the quad core Cortex-A76.
Fig. 7: Pipeline execution time vs audio length.
Certain operators in the TitaNet Small model are not supported for GPU execution, meaning these operators fall back on CPU execution. As a result, average inference times are just over 20 percent (21.8 percent) slower when using the GPU delegate. However, when using the GPU delegate, CPU utilization is reduced by just over 50 percent (51.2 percent). This is illustrated by the graph below.
Fig. 8: Average CPU utilization during pipeline execution.
These results show that in cases where the GPU may be less active, part of the workload can be offloaded, which reduces the strain on the CPU. Running inference on the GPU offers the ability to perform load balancing and make use of all the available hardware.
The inference times presented here are collected from running on an Arm Mali-G610 MC4 GPU, using v2.16 of the GPU delegate. These times will not be the same across all hardware, models and applications, as exact configuration and operator support directly impact the execution speed.
Model quantization is an optimization technique that reduces the computational and memory costs when running inference. This is done by representing the weights and activations with smaller data types, for example, 8-bit integer instead of 32-bit floating point. More information is found in this blog post.
Post training quantization is an optimization applied during the model conversion process. As the name suggests, it enables quantization of already-trained models and can be split into three main techniques: dynamic range quantization, full integer quantization, and float16 quantization.
In this blog post, only dynamic range quantization is explored as this technique does not require a representative dataset during the conversion process.
Dynamic range quantization is where only the model weights are quantized to integers. The model executes with operations that mix integer and float computation where possible. If this is not possible, execution will fall back to float32. More information on post training quantization for TFLite is found here.
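A minimal sketch of dynamic range quantization with the TFLiteConverter; only the optimizations flag is needed, as no representative dataset is required:

import tensorflow as tf

# 'saved_model_dir' is the TensorFlow SavedModel produced earlier
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # quantize weights to int8
tflite_quant_model = converter.convert()

with open("titanet_small_quant.tflite", "wb") as f:
    f.write(tflite_quant_model)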
The TitaNet Small quantized model showed a speedup of around 10 percent, and memory footprint reduction of around 50 percent. Post-training quantization can result in a small loss in quality, but the benefits can often outweigh the drawbacks. In our experiments, there was a negligible reduction in the accuracy of the model.
Fig. 9: Normalized memory usage of quantized model vs non-quantized model.
The key takeaway from this blog is that Arm-powered DTVs are capable of running edge AI workloads and improving user experiences, while maintaining data privacy.
Speech is the natural way we communicate. Understanding not only what is being said, but who is saying it, is an important part of streamlining the way users interact with their TVs.
Our results show that higher performance Arm CPUs offer the opportunity to combine this technology with other AI based speech tasks, such as diarization, speech-to-text, and, in the future, on-device large language models (LLMs).
Do not miss the resources and blogs available under the AI and ML section on developer.arm.com.