Using AI To Speed Up Edge Computing

Optimizing a system’s behavior can improve PPA and extend its useful lifetime.


AI is being designed into a growing number of chips and systems at the edge, where it is being used to speed up the processing of massive amounts of data, and to reduce power by partitioning and prioritization. That, in turn, allows systems to act upon that data more rapidly.

Processing data at the edge rather than in the cloud provides a number of well-documented benefits. Because the physical distance is shorter between where data is generated and where it gets processed, latency is significantly reduced. That also reduces the amount of infrastructure needed to move data, because there is less of it to route after the initial processing. And it reduces the amount of power needed to move that data, as well as the cost of storing it. Yet all of those benefits can be extended by leveraging some form of artificial intelligence.

“The cloud certainly will play a role,” said Thomas Rosteck, Connected Secure Systems Division president at Infineon. “But there has to be some intelligence to reduce the amount of data that goes to the cloud, concentrating it, and then getting an answer back. That’s an architectural issue.”

AI is a relatively new twist on edge design, where it is being used to identify and prioritize resources at both the chip and system levels. So while edge computing already is broadly deployed in many different sectors — including multiple layers of processing, which can span everything from within a sensor to layers of on-premise and off-premise servers — there is a recognition that much more value can be extracted from data and the systems that process it.

“If the wall box for my car can communicate with my solar system on the roof, they can agree when it’s a good time to load up the battery in the car,” Rosteck said. “If it’s cloudy, maybe that’s not the best time, so I only load it to 50% and wait until I have better weather conditions for the rest. Or if I know I’m driving to Los Angeles tomorrow, I will need a full battery so I can override my system and say I need more than 80%. In a building, we save energy by controlling the blinds to keep the sun out.”

According to the State of the Edge Report 2021, cumulative investments of up to $800 billion will be made in edge computing between 2019 and 2028. About half of those investments will go to edge devices, with the other half going to edge infrastructure.

Fig. 1: Market segments utilizing edge computing.

Some of those expenditures will include AI/ML, which helps to optimize the compute systems at the edge, particularly on the inferencing side. While most experts believe training will continue to be done in large data centers using huge datasets, inferencing can be done locally using a variety of processing elements, including GPUs, FPGAs, eFPGAs, NPUs, and accelerators/co-processors. This local variety is important for a wide range of applications where bandwidth is limited or inconsistent, and where processing is constrained by the size or type of battery.

“Edge computing has many applications in the consumer space, such as virtual conferencing, either using a laptop or conferencing platform,” said Ashraf Takla, founder and CEO of Mixel. “During virtual conferencing, edge processing can be used for face detection, background blur, gesture control, intelligent muting, and object detection. Consumer wearables also could benefit from edge computing to detect important objects and sounds around the user, including local voice commands, advanced wake words to simplify device UI, and for facial and object recognition. However, because the resources — including power — are limited at the edge, AI is key to identifying and prioritizing the most relevant information to be further processed or transmitted to the cloud.”
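The prioritization Takla describes can be pictured as a filter that runs on the device before anything is sent upstream. The following is a minimal sketch under stated assumptions, not a description of any vendor's implementation: the `Detection` record, the confidence threshold, and the upload budget are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    label: str         # hypothetical class name produced by an on-device model
    confidence: float  # model score in the range [0, 1]


def select_for_upload(detections, min_confidence=0.6, budget=2):
    """Keep detections at or above the threshold, most confident first, up to `budget`."""
    relevant = [d for d in detections if d.confidence >= min_confidence]
    relevant.sort(key=lambda d: d.confidence, reverse=True)
    return relevant[:budget]


# One frame's worth of illustrative detections (made-up values).
frame = [
    Detection("person", 0.92),
    Detection("cat", 0.40),
    Detection("package", 0.75),
    Detection("car", 0.65),
]

uploads = select_for_upload(frame)
print([d.label for d in uploads])  # ['person', 'package']
```

A tighter power or bandwidth budget would translate into a smaller `budget` or a higher `min_confidence`, which is the kind of tradeoff the quote points to.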

Fig. 2: Comparison of edge and cloud computing. Source: Mixel

The initial idea for IoT devices has evolved over time, merging into the broader edge computing concept. But rather than just a collection of dumb sensors that send data off to the cloud for processing through some gateway, many edge devices are now much more advanced.

“We have seen our customers use MIPI for SoCs powering smart devices like security cameras or other IoT devices relying on video and audio inputs. The SoC needs to interface with external sources (video and audio) and send the data to a pre-processing unit to make the image and audio more usable for the neural network,” Takla said. “The neural network clusters and SRAM are where the main processing occurs, including segmentation, identification, inferences, and where other functions take place.”

Many of these devices can locally partition and prioritize data, and they can do it using very little power. But they are more difficult and costly to design and manufacture.

“There are multi-level simulations needed,” said Roland Jancke, design methodology head in Fraunhofer IIS’ Engineering of Adaptive Systems Division. “You have a complex model that reflects all the functionality, and for different parts of the model you go deeper into the details. You don’t need that kind of detail if you’re just bringing in the data or seeing how it connects to other parts, but you do need to decide which parts to model in detail.”

Designing AI into chips
AI architectures frequently are designed around high data throughput with numerous processing elements working in parallel, often with small, localized memories. And for complex edge devices, which could include anything from a car to a smartphone, design tools utilize AI to create better AI chips, which are often combined with other chips in a package.

“It’s not just the silicon or the device that matters,” said John Lee, vice president and general manager of the Ansys Semiconductor business unit. “It’s also the software that goes along with that. The challenge that we see in this area is dynamic thermal management, where you’re designing silicon and you want maximum performance at some point. You need to throttle back that performance because temperature limits are being exceeded. And if you don’t do dynamic thermal management properly, then the performance of your system may be substandard. The only way to do that well is to understand what the real workloads are, which is the software itself, and then use emulation to boot up and run a complete workload. So it’s extremely important. AI/ML techniques are being used to address those challenges.”
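As a rough illustration of the throttling behavior Lee describes, the sketch below steps a clock down when a temperature limit is reached and back up once the die has cooled. The operating points, the 85°C limit, and the hysteresis band are assumptions made for this example; real dynamic thermal management is tuned against measured workloads, as the quote stresses.

```python
# Hypothetical operating points, lowest to highest performance.
FREQ_STEPS_MHZ = [400, 800, 1200, 1600]


def next_freq_index(index, temp_c, limit_c=85.0, hysteresis_c=10.0):
    """Step down at or above the limit; step up only after cooling below the band."""
    if temp_c >= limit_c and index > 0:
        return index - 1
    if temp_c <= limit_c - hysteresis_c and index < len(FREQ_STEPS_MHZ) - 1:
        return index + 1
    return index


def run_trace(temps_c, start_index=3):
    """Apply the policy to a sequence of temperature readings."""
    index = start_index
    chosen = []
    for temp in temps_c:
        index = next_freq_index(index, temp)
        chosen.append(FREQ_STEPS_MHZ[index])
    return chosen


print(run_trace([70, 86, 90, 80, 74]))  # [1600, 1200, 800, 800, 1200]
```

The hysteresis band keeps the controller from oscillating between two operating points when the temperature hovers near the limit.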

Still, working with AI/ML adds other issues. “Our customers utilize AI/ML in various capacities to add value to their ADAS systems,” said Paul Graykowski, senior technical marketing manager at Arteris IP. “These systems use the capabilities of cameras, lidar, and other sensors to build a view of the world around the vehicle. For instance, when reading a speed limit sign, machine learning may detect the presence of a new speed limit. A new limit may then be fed to the cloud for additional learning, and then that data can be pushed back down to other vehicles on the road.”

But what counts as good-enough accuracy for these AI systems, which typically is expressed as a probability based on data distributions, may vary by application and by user. “In the case of striving toward automation with ADAS, we must ensure compliance with standards such as ISO 26262,” Graykowski said. “Safety islands in SoCs, redundancy, and failure mode analysis are just a few of the techniques in play. All of these must be tested accordingly.”

The higher the accuracy, in general, the more compute resources and energy required to achieve it. This is especially evident in automotive, where design teams often are facing conflicting goals. “The real problem is meeting the environmental requirements in Europe, the United States, and Japan, because they all have different emissions targets,” said David Fritz, vice president of hybrid-physical and virtual systems for automotive and mil/aero at Siemens Digital Industries Software. “At the same time, chipmakers need to reduce the power consumption and add more compute capability as these cars get more intelligent. So you have to balance those two things. But how do you make it smarter without consuming more power?”

Many of these systems have a fixed power budget, and that can determine how a device is used or even what kind of batteries are used.

“If you put something like a camera up on the front of your house, you don’t want to have to go and change the batteries constantly,” said Rob Aitken, R&D fellow at Arm (at the time of this interview). “You just want it to work. That’s a fairly isolated application, but it’s representative of what’s going to happen in these other situations. If you have three different battery types in a car that operate in different modes at different times, you’ll have to design the system that includes the sensors, and whatever local processing you do or don’t do on those sensors that feeds that back into some kind of centralized compute system. All of that stuff is going to have some battery profile, but it’s operating on there. And the way that the computer is run — basically the partitioning of tasks on that stuff, the orchestration of them — is going to depend on what that battery is. From an Arm standpoint, we were mostly in the position of, ‘Tell us what your battery does and we can figure out maybe something that will optimize its performance. And we can produce a combination of hardware/firmware/software that will allow you to tailor your operation to whatever it is that you’re trying to optimize in your particular battery world.'”

Pre-trained models reduce ML training time
AI can help in all of these cases. The problem is that training AI to do what you want is time-consuming. A proliferation of pre-trained models has simplified this process, even if they are not fully optimized for a particular device. Integrating a pre-trained model is more cost-effective than ML training from scratch. For a retail application, for example, it would take extensive vision training for the edge device to learn what a human being looks like and how to count humans as they move around.

It is not unusual for an ML model to need millions of images in training before it can make accurate predictions. In addition, programming GPUs involves a learning curve. By starting from a pre-trained YOLOv4 object detection model, developers potentially can bypass most of the training process. YOLOv4 is a real-time object detection model whose reference implementation is built on Darknet, an open-source framework written in C and CUDA.

“AI-based edge device developers often encounter the challenge of coming up with a cost-effective approach to training the devices,” said Sam Fuller, senior director of inference marketing, Flex Logix. “With pre-trained models, developers can reduce the design cycle and go to test much quicker. Using the pre-trained EasyVision platform with an X1M chip (50 FPS of YOLOv4 detection), ML can yield great results. As a rough comparison, this combination will produce a performance 80 times greater than the same algorithm running on an Intel i5 processor without acceleration.”
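The figures in Fuller's comparison imply a rough baseline. The short calculation below assumes that "performance" in the quote means throughput in frames per second, which the quote does not state explicitly; only the 50 FPS and 80x values come from the source.

```python
def implied_baseline(accelerated_fps, speedup):
    """Return (baseline FPS, accelerated ms per frame, baseline ms per frame)."""
    baseline_fps = accelerated_fps / speedup
    accelerated_frame_ms = 1000.0 / accelerated_fps
    baseline_frame_ms = 1000.0 / baseline_fps
    return baseline_fps, accelerated_frame_ms, baseline_frame_ms


# 50 FPS and 80x are the values quoted above.
baseline_fps, accel_ms, baseline_ms = implied_baseline(50.0, 80.0)
print(baseline_fps)  # 0.625
print(accel_ms)      # 20.0
print(baseline_ms)   # 1600.0
```

Under that assumption, the unaccelerated processor would handle about one frame every 1.6 seconds, compared with one frame every 20 milliseconds on the accelerator.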

Understanding PPA requirements is key
One of the big challenges for design teams is figuring out all the various possibilities and tradeoffs for a particular design, and understanding how a device will be used. Unlike in the past, when chips were primarily designed for a specification, even fabless design teams often have a good understanding of how and where a chip will be used. In some cases, they are working directly with the manufacturer to fine-tune a solution.

“Edge deployment scenarios can result in various potential solutions,” said Suhas Mitra, product marketing director for Tensilica AI products at Cadence. “To decide which hardware solution is more appropriate for an application depends a lot on understanding the key power, performance, and area (PPA) requirements during the design phase. This can result in many different possibilities or variants. For instance, a battery-operated tiny edge device (hearable or wearable) may require very low power and energy, but may not require high throughput. A well-tuned AI accelerator, DSP or MCU class of hardware could all suffice, but the final choice could depend on the area and power budget for the SoC.”
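One way to picture the selection Mitra describes is as a constrained choice: discard options that break the power, area, or throughput requirements, then pick the lowest-power survivor. The sketch below is illustrative only. The candidate names echo the quote, but every number is invented, and a real PPA analysis weighs far more factors.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    power_mw: float         # made-up figure for illustration
    area_mm2: float         # made-up figure for illustration
    throughput_gops: float  # made-up figure for illustration


def pick_candidate(candidates, max_power_mw, max_area_mm2,
                   min_throughput_gops) -> Optional[Candidate]:
    """Return the lowest-power candidate that meets all three constraints, or None."""
    feasible = [
        c for c in candidates
        if c.power_mw <= max_power_mw
        and c.area_mm2 <= max_area_mm2
        and c.throughput_gops >= min_throughput_gops
    ]
    return min(feasible, key=lambda c: c.power_mw, default=None)


options = [
    Candidate("MCU", 5.0, 0.5, 1.0),
    Candidate("DSP", 15.0, 1.0, 10.0),
    Candidate("AI accelerator", 40.0, 2.5, 100.0),
]

# A hypothetical hearable: tight power and area budget, modest throughput need.
choice = pick_candidate(options, max_power_mw=20.0, max_area_mm2=1.5,
                        min_throughput_gops=0.5)
print(choice.name)  # MCU
```

Raising the throughput requirement to 5 GOPS under the same budget shifts the choice to the DSP, which mirrors the point that several hardware classes can suffice until the budget decides.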

AI adds a whole new level of optimization possibilities, as well as some challenges.

“For software, end users will have to train their network with data to fine-tune AI workloads,” Mitra said. “Many deployment scenarios use open-source AI models that can alleviate the need to create and iterate on new AI models. At a network compile stage, there are two fundamental flows for processing AI workloads. One involves ahead-of-time (AOT) compilation, and the other one is more similar to a run-time based flow. Both of those flows exist today in various deployment scenarios based on the application needs. With programmable IP like DSPs, the end product could also receive over-the-air (OTA) updates in the same way our phones and various electronics do. This way, as the AI algorithm gets more refined/tuned/accurate, it could be sent over to the end product, leading to an overall better experience and performance improvement.”

AI design also involves numerous parameters. Developers have to consider which Arm or RISC-V control code will run the operating system. Then the other embedded processors, including vector DSPs and NPU accelerators, must be taken into account. The big question is how to partition the workload across them to achieve maximum performance.

One approach is to integrate all of these functions into a single IP with code optimization. “Integrating the NPU, DSP, and the real-time CPU into a single IP saves developers a great deal of programming time and headaches,” said Steve Roddy, CMO of Quadric. “Digging into a function to determine the best partitioning across three different processor IP blocks takes a lot of effort. It is more efficient to run the performance-critical control and the DSP and ML graph codes on a single core.”

Security at the edge
Edge devices connected to sensors and networks are often vulnerable and need to be secured. The best way is to incorporate solutions with built-in security, and that can be both passive and active. It also can affect overall performance and power, and ultimately cost.

“It is expected that most edge devices will be self-contained,” said Gijs Willemse, senior director of product management at Rambus. “Nevertheless, they are vulnerable due to their interfaces and ability to receive software updates. Secure boot, device authentication, and secure communication, along with protection of provisioned key material are critical for devices that operate in the public domain and/or could be confiscated. This requires a hardware root of trust, and depending on the performance and latency requirements of the application, hardware acceleration to encrypt/decrypt the data transferred over its interfaces. These hardware security cores should include anti-tamper protections to guard against side-channel and fault injection attacks.”
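The check at the heart of secure boot and authenticated updates is that an image is accepted only if its tag verifies against protected key material. The sketch below uses HMAC-SHA256 from the Python standard library as a stand-in. That is an assumption made for brevity: production secure boot commonly relies on asymmetric signatures with keys anchored in a hardware root of trust, and the key and image bytes here are placeholders.

```python
import hashlib
import hmac


def sign_image(key: bytes, image: bytes) -> bytes:
    """Compute an HMAC-SHA256 tag over a firmware image."""
    return hmac.new(key, image, hashlib.sha256).digest()


def verify_image(key: bytes, image: bytes, tag: bytes) -> bool:
    """Recompute the tag and compare in constant time."""
    return hmac.compare_digest(sign_image(key, image), tag)


# Placeholder key and image; a real device would keep the key in protected hardware.
device_key = b"example-provisioned-key"
firmware = b"firmware v1.2 payload"
tag = sign_image(device_key, firmware)

print(verify_image(device_key, firmware, tag))                # True
print(verify_image(device_key, firmware + b"tampered", tag))  # False
```

`hmac.compare_digest` is used instead of `==` so that the comparison time does not depend on how many leading bytes match, which removes one simple timing side channel.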

Edge security is a growing concern. More connected devices widen the attack surface for other connected devices, as well as the device where the initial breach occurs.

“We don’t want someone to interfere with devices sitting at the edge,” said Arteris’ Graykowski. “One of the techniques we employ, which is built into our network-on-chip, is the ability to have firewalls embedded with the network to ensure only intended traffic reaches critical systems.”
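Conceptually, a firewall of this kind behaves like an allowlist of which initiators may reach which targets. The toy model below is only meant to make that idea concrete. It is not a description of how Arteris implements its network-on-chip firewalls, and the block names are hypothetical.

```python
# Hypothetical (initiator, target) pairs that are allowed to communicate.
ALLOWED_ROUTES = frozenset({
    ("cpu", "dram"),
    ("npu", "dram"),
    ("safety_island", "brake_ctrl"),
})


def is_allowed(initiator: str, target: str, rules=ALLOWED_ROUTES) -> bool:
    """Permit a transaction only if its route is explicitly allowlisted."""
    return (initiator, target) in rules


print(is_allowed("safety_island", "brake_ctrl"))  # True
print(is_allowed("npu", "brake_ctrl"))            # False
```

The default-deny shape matters here: any route that is not listed is blocked, so a compromised block cannot reach a critical system just because nobody thought to forbid it.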

Latest AI-based edge design test methods
The AI-based edge design process is complex because it has many moving parts. It involves machine learning, training, inferencing, and selecting the best AI chips/solutions and sensors. In addition, an AI system by definition is supposed to adapt and optimize. Choosing the correct test methods and models to reduce errors early in the design cycle is crucial.

This becomes more challenging the longer chips are in use in the field, and more important as they are used for longer periods in safety- or mission-critical applications. A chip that adapts over a couple of decades may look very different from how it looked when it was first manufactured. And if it is interfacing with other systems, its behavior may be difficult to predict at the outset.

“Starting even at the design stage, how do we add these in-silicon monitors into the chip itself — whether it’s a single die or a multi-die system — and then collect data as we go from production to in-field operation?” asks Bari Biswas, senior vice president for the Silicon Realization Group at Synopsys. “That falls into that space of software-defined hardware, where not only do we monitor, but we actually optimize there. We do a similar type of optimization that we do with our EDA design software. There are autonomous design systems that will optimize the design creation process. Now imagine those kinds of systems operating in the field, and then optimizing the variables that allow configuration of GPUs and CPUs.”

Still, there are a lot of moving pieces in this equation, and figuring out how a device will behave over time is difficult. “In general, testing for AI edge design involves various aspects, starting from model training to inference and the deployment phase,” said Cadence’s Mitra. “The goal is to design better, more robust AI networks. Monitoring various KPIs during pilot run phases and detecting anomalies are important prior to the deployment phase. Adopting a more continuous cycle of testing and monitoring helps in understanding how to make better networks by collecting and monitoring both normal and adversarial use cases.”
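The KPI monitoring Mitra mentions can be as simple as comparing pilot-run measurements against a baseline. The sketch below flags samples that fall more than a chosen number of standard deviations from the baseline mean. The z-score test, the threshold of 3, and the latency values are assumptions for illustration, not a method attributed to any of the companies quoted.

```python
from statistics import mean, pstdev


def find_anomalies(baseline, samples, z_threshold=3.0):
    """Return indices of samples whose z-score against the baseline exceeds the threshold."""
    mu = mean(baseline)
    sigma = pstdev(baseline)
    if sigma == 0:
        # A flat baseline has no spread, so any deviation at all is flagged.
        return [i for i, x in enumerate(samples) if x != mu]
    return [i for i, x in enumerate(samples) if abs(x - mu) / sigma > z_threshold]


# Made-up inference latencies in milliseconds.
baseline_ms = [10.0, 11.0, 9.0, 10.0, 10.0, 11.0, 9.0, 10.0]
pilot_ms = [10.5, 9.2, 14.0, 10.1]

print(find_anomalies(baseline_ms, pilot_ms))  # [2]
```

In a continuous test-and-monitor cycle, flagged samples would be the ones worth collecting and examining as possible adversarial or out-of-distribution cases.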

As chips take on more edge computing and AI functions, it is important to reduce errors in the design process. Today, register transfer level (RTL) descriptions, written in hardware description languages such as Verilog and VHDL, remain the most common way to design SoCs, FPGAs, and ASICs. Whether it is an edge-only or an AI-based edge design, the ultimate goal is to optimize power, performance, and area (PPA). Catching errors early in the design cycle can mean major cost savings.

Digital hardware design has moved from the gate level to the register transfer level. Today, high-level synthesis (HLS) is used to synthesize algorithms written in C++ or SystemC into RTL. In a system design, successful functional verification at the RTL greatly reduces the number of errors that reach later stages.

“In AI-based edge computing, decisions at the edge, including smart IoT, are made based on sensor input and analytics. This way, the cloud servers do not need to be involved unless extensive computations are required, cutting down on the cloud traffic,” commented Anoop Saha, senior manager, Strategy and Business Development, Siemens EDA. “While there are benefits to AI-based edge computing, designing such systems can be challenging. Because of the complexity of AI and AI chips, errors can be introduced in the design process. To reduce the cost of redesign, it is important to use the right verification tools such as HLS to carefully perform pre-HLS and post-HLS verifications. By taking this approach, the designers will be able to eliminate errors in the AI-based algorithm and architecture.”
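One way to picture checking a design at two levels is to run the same test vectors through a reference algorithm and through a model of the hardware-friendly version, then compare the results against a tolerance. The sketch below does this for a dot product, with a fixed-point model standing in for the synthesized implementation. It is a simplified analogy, not the flow of any HLS tool, and the vectors, bit widths, and tolerance are assumptions.

```python
def reference_dot(a, b):
    """Floating-point reference model of the algorithm."""
    return sum(x * y for x, y in zip(a, b))


def to_fixed(x, frac_bits=8):
    """Quantize a value to a signed fixed-point integer with `frac_bits` fractional bits."""
    return round(x * (1 << frac_bits))


def fixed_dot(a, b, frac_bits=8):
    """Integer multiply-accumulate, as hardware might do it, scaled back to a float."""
    acc = sum(to_fixed(x, frac_bits) * to_fixed(y, frac_bits) for x, y in zip(a, b))
    return acc / float(1 << (2 * frac_bits))


def max_abs_error(vectors, frac_bits=8):
    """Largest difference between the two models over all test vectors."""
    return max(abs(reference_dot(a, b) - fixed_dot(a, b, frac_bits)) for a, b in vectors)


# Illustrative test vectors.
vectors = [
    ([0.5, 0.25], [1.0, -0.5]),
    ([0.1, 0.2, 0.3], [0.3, 0.2, 0.1]),
]

print(max_abs_error(vectors) < 0.01)               # True
print(max_abs_error(vectors, frac_bits=2) < 0.01)  # False
```

With 8 fractional bits the two models agree within the tolerance, while 2 fractional bits does not, which is the kind of mismatch that is cheaper to catch before it reaches RTL.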


Fig. 3: A good HLS test method includes both pre-HLS and post-HLS verifications. Source: Siemens EDA

Edge computing has many benefits, including low latency, reduced cloud traffic, local decision making, and overall cost reduction. While embedded AI will enhance edge computing performance, its deployment presents some challenges. The need to implement ML training and inferencing is being eased by pre-trained models, standards-based models, and integrated IP. Prioritizing security and using the latest AI-based edge design test methods will also help.

The bottom line: AI is expected to be an increasingly integral part of edge computing.

—Ed Sperling contributed to this report.


MIPI in Next Generation of AI IoT Devices at the Edge | Mixel, Inc.

State of the Edge Report 2021 – State of the Edge

Download SNUG presentation: Using machine learning for characterization of NoC components (
