Major gaps hinder AI algorithm proof-of-concepts from becoming real hardware deployments.
The field of artificial intelligence (AI) moves swiftly, and the pace of innovation is only accelerating. While the software industry has been successful in deploying AI in production, the hardware industry – including automotive, industrial, and smart retail – is still in its infancy when it comes to AI productization. Major gaps still hinder AI algorithm proofs-of-concept (PoCs) from becoming real hardware deployments. These gaps stem largely from small-data problems, imperfect inputs, and ever-changing “state-of-the-art” models. How can software developers and AI scientists overcome these challenges? The answer lies in adaptable hardware.
Internet giants such as Google and Facebook routinely collect and analyze massive amounts of data every day. In the hardware industry, by contrast, the availability of big data is far more limited, resulting in less mature AI models. Naturally, there is a major push to collect more data and to perform “online” training and inference on the same deployed hardware, continuously improving accuracy.
To address this, adaptive computing – such as FPGAs and adaptable SoCs that are already proven at the edge – can run both inference and training, constantly updating models with newly captured data. Traditional AI training requires the cloud or large on-premise data centers and takes days or weeks to complete, while the real data is generated mostly at the edge. Running both AI inference and training on the same edge device not only improves total cost of ownership (TCO) but also reduces latency and the risk of security breaches.
While it’s becoming easier to publish an AI model PoC – showing, for example, improved accuracy of COVID-19 detection from X-ray images – these PoCs are almost always based on well cleaned-up input images. In real life, camera and sensor inputs from medical devices, robots, and moving cars exhibit random distortions such as dark images and objects captured at odd angles. These inputs must first go through sophisticated preprocessing to be cleaned up and reformatted before they can be fed into AI models. Postprocessing is equally important, turning the AI model’s raw outputs into meaningful, actionable decisions.
Indeed, some chips may be very good at AI inference acceleration, but they almost always accelerate only a portion of the full application. Taking smart retail as an example, preprocessing includes multi-stream video decode followed by conventional computer vision algorithms to resize, reshape, and format-convert the video; postprocessing includes object tracking and database look-up. End customers care less about the raw speed of AI inference than about whether the full application pipeline meets the video-stream performance and/or real-time responsiveness requirements. FPGAs and adaptable SoCs have a proven track record of accelerating these pre- and postprocessing algorithms using domain-specific architectures (DSAs), and adding an AI inference DSA allows the whole system to be optimized end to end to meet the product requirements.
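To make the full-pipeline point concrete, here is a minimal Python sketch of a decode-preprocess-infer-postprocess loop using OpenCV and NumPy. The stream name, labels, and `run_inference` stub are illustrative placeholders, not part of any vendor API; in a real deployment each stage would be mapped to an appropriate DSA.

```python
import cv2
import numpy as np

def preprocess(frame, size=(224, 224)):
    """Resize, format-convert, and normalize a raw camera frame."""
    resized = cv2.resize(frame, size)               # resize/reshape
    rgb = cv2.cvtColor(resized, cv2.COLOR_BGR2RGB)  # format convert
    return rgb.astype(np.float32) / 255.0           # normalize to [0, 1]

def run_inference(tensor):
    # Placeholder for the deployed, accelerated model; returns dummy scores.
    return np.random.rand(3)

def postprocess(scores, labels, threshold=0.5):
    """Turn raw model outputs into an application-level decision."""
    best = int(np.argmax(scores))
    return labels[best] if scores[best] >= threshold else None

cap = cv2.VideoCapture("retail_stream.mp4")  # hypothetical video source
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    decision = postprocess(run_inference(preprocess(frame)),
                           labels=["person", "cart", "shelf"])
```

Note that the AI call is a single line; the surrounding decode, preprocessing, and decision logic is where much of the real-time budget is actually spent.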
Fig. 1: DSA needed to accelerate AI and non-AI.
The AI research community is arguably the most active in computing, with new AI models invented daily by top researchers around the world. These models improve accuracy, reduce computational requirements, and address new types of AI applications. This rapid innovation puts continual pressure on existing semiconductor devices, demanding newer architectures to efficiently support modern algorithms. Standard benchmarks such as MLPerf show that state-of-the-art CPUs, GPUs, and AI ASICs often achieve well below 30% of their vendor-advertised performance when running real-life AI workloads, constantly pushing the need for new DSAs that keep up with the innovation.
Several recent trends are pushing the need for new DSAs. Depthwise convolution is one emerging layer type; it requires large memory bandwidth and specialized internal memory caching to run efficiently. Typical AI chips and GPUs have a fixed L1/L2/L3 cache hierarchy and limited internal memory bandwidth, resulting in very low efficiency.
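As an illustration, a depthwise convolution in PyTorch is simply a grouped convolution with one group per channel; the sketch below pairs it with the 1×1 pointwise convolution used in MobileNet-style models. Each output element reuses very few weights, which is why memory bandwidth, rather than arithmetic, dominates the cost.

```python
import torch
import torch.nn as nn

channels = 32
# Depthwise: one 3x3 filter per input channel (groups == channels), so the
# compute per byte fetched is low and memory bandwidth becomes the bottleneck.
depthwise = nn.Conv2d(channels, channels, kernel_size=3,
                      padding=1, groups=channels)
# Pointwise 1x1 convolution mixes information across channels.
pointwise = nn.Conv2d(channels, 64, kernel_size=1)

x = torch.randn(1, channels, 56, 56)
y = pointwise(depthwise(x))
print(y.shape)  # torch.Size([1, 64, 56, 56])
```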
Researchers are also constantly inventing new custom layers for which today’s chips simply have no native support. These layers must then run on the host CPU without acceleration, often becoming the performance bottleneck.
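As a hypothetical example, the custom activation below (a Swish variant with a learnable beta) is trivial to express in PyTorch, but a fixed-function accelerator with no native kernel for it would have to hand the operation back to the host CPU.

```python
import torch
import torch.nn as nn

class SwishBeta(nn.Module):
    """Custom activation x * sigmoid(beta * x) with a learnable beta --
    the kind of layer a fixed accelerator may have no native kernel for."""
    def __init__(self):
        super().__init__()
        self.beta = nn.Parameter(torch.ones(1))

    def forward(self, x):
        return x * torch.sigmoid(self.beta * x)

out = SwishBeta()(torch.randn(4, 16))
```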
Sparse neural networks are another promising optimization, in which networks are heavily pruned – sometimes by up to 99% – by trimming network edges, removing fine-grained matrix values in convolutions, and so on. However, running such networks efficiently requires a specialized sparse architecture, plus an encoder and decoder for these operations, which most chips simply do not have.
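A minimal sketch of unstructured magnitude pruning with PyTorch’s built-in utility is shown below. It zeroes out the weights, but, as noted above, those zeros only translate into speed and power savings on hardware that can actually skip them.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(1024, 1024)
# Zero out 90% of the weights by magnitude (unstructured pruning).
prune.l1_unstructured(layer, name="weight", amount=0.9)

sparsity = (layer.weight == 0).float().mean().item()
print(f"sparsity: {sparsity:.0%}")  # ~90%
# Exploiting these zeros requires a sparse-aware compute architecture plus
# encode/decode logic for the compressed weight format.
```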
Binary and ternary networks are the extreme optimization, reducing all math operations to bit manipulations. Most AI chips and GPUs have only 8-bit, 16-bit, or floating-point calculation units, so extremely low precision yields no gain in performance or power efficiency on them. FPGAs and adaptable SoCs are ideal here, as a developer can build a custom DSA and reprogram the existing device for the exact workload of the product. As a proof point, the latest MLPerf round included a submission by Xilinx, in collaboration with Mipsology, that achieved 100% of the hardware datasheet performance on the standard ResNet-50 benchmark.
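For intuition, the sketch below binarizes weights and activations to ±1 in plain PyTorch. On hardware with bit-level datapaths, such as an FPGA DSA, each multiply-accumulate in the matrix product collapses into an XNOR plus a popcount; on an 8/16-bit or floating-point datapath it buys nothing.

```python
import torch

def binarize(t):
    """Map a tensor to {-1, +1} (zero mapped to +1)."""
    return torch.where(t >= 0, torch.ones_like(t), -torch.ones_like(t))

w = binarize(torch.randn(128, 128))  # binary weights
x = binarize(torch.randn(1, 128))    # binary activations
y = x @ w.t()  # on bit-level hardware: XNOR + popcount instead of MAC
```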
Historically, the biggest challenge for FPGAs and adaptable SoCs has been the hardware expertise needed to implement and deploy DSAs. The good news is that tools now exist – like the Vitis unified software platform – that support C++, Python, and popular AI frameworks such as TensorFlow and PyTorch, closing the gap for software and AI developers.
In addition to advances in software abstraction tools, open-source libraries such as the Vitis hardware-accelerated libraries are significantly boosting adoption within the developer community. In its most recent design contest, Xilinx attracted more than 1,000 developers and published many innovative projects, from a hand-gesture-controlled drone to reinforcement learning with a binarized neural network. Importantly, most of the submitted projects came from software and AI developers who had no previous FPGA experience – proof that the FPGA industry is taking the right steps to enable software and AI developers to solve real-world AI productization challenges.
Fig. 3: Adaptive Intelligence of Things.
Until recently, unlocking the power of hardware adaptability was out of reach for the average software developer and AI scientist because it required specific hardware expertise. Thanks to new open-source tools, software developers are now empowered to use adaptable hardware. With this new ease of programming, FPGAs and adaptable SoCs will become accessible to hundreds of thousands of software developers and AI scientists, making these devices the hardware solution of choice for next-generation applications. Indeed, DSAs represent the future of AI inference, built by software developers and AI scientists harnessing hardware adaptability.