Optimizing AI Workloads For Edge Computing

Performance enhancement, cost reduction, data security, and improved energy efficiency are the end goals for optimizing AI workloads at the edge.


Experts At The Table: Semiconductor Engineering gathered a group of experts to discuss how some AI workloads are better suited for on-device processing to achieve consistent performance, avoid network connectivity issues, reduce cloud computing costs, and ensure privacy. The panel included Frank Ferro, group director in the Silicon Solutions Group at Cadence; Eduardo Montanez, vice president and head of PSOC Edge Microcontrollers & Edge AI Solutions, IoT, Wireless and Compute Business at Infineon; Alexander Petr, senior director at Keysight EDA; Raj Uppala, senior director of marketing and partnerships for Silicon IP at Rambus; Niranjan Sitapure, central AI product manager at Siemens EDA; and Gordon Cooper, principal product manager at Synopsys. What follows are excerpts of that discussion. To read part one, click here.

L-R: Cadence’s Ferro, Infineon’s Montanez, Keysight’s Petr, Rambus’ Uppala, Siemens’ Sitapure, and Synopsys’ Cooper.

SE: When moving models from the data center to the edge, what is done to allow them to run optimally? How are design teams migrating those models?

Sitapure: A lot of my PhD work was on these models for industrial applications. You have to build very fast models for controllers and optimizations that run on manufacturing floors and so forth. Typically, there’s an axis of what type of ML (machine learning) or AI (artificial intelligence) model it is, such as an ML/RL (reinforcement learning), an LLM (large language model), an SLM (small language model), or a VLM (vision language model). Then, there is the axis of how big it is. If it’s ML/RL, they prune it, so instead of having a 100-million-parameter model, they prune the layers or the specific weights that are not actually required, to make it a smaller model that gives 90% of the same result. In the visual space, this is what Tesla does. In Tesla’s FSD (full self-driving) they do quantization, where there is a big vision model that is trained on H100s with full 32-bit floating point operations. It’s terabytes of model, but then you bring it down to 8-bit floating point or 4-bit. Instead of having 4,096 colors, you can have 256 colors, but it still makes sense because it’s a convolutional neural network (CNN) at some point, as well. Those two I’ve seen. One that has come up in the last one or two years, and you might have seen a few months back, Meta launched a series of models where they use distillation. Here, you take a bigger model, which is more like a teacher model and is a more expensive LLM, maybe an 8-, 20-, or 30-billion parameter model. You can then train a model that ranges from less than 1 billion up to 1 or 2 billion [parameters], which can run in an iPhone’s memory, like 4 GB. Long story short: quantization, pruning, and distillation.
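
To make two of the techniques named above concrete, here is a minimal Python sketch of magnitude pruning followed by symmetric 8-bit quantization on a toy weight list. It is an illustration only, not the method used by any company mentioned in the discussion. The function names, the 50% keep fraction, and the sample weights are all assumptions chosen for the example.

```python
def magnitude_prune(weights, keep_fraction):
    """Zero out the smallest-magnitude weights, keeping roughly keep_fraction of them."""
    k = max(1, int(round(len(weights) * keep_fraction)))
    # The k-th largest magnitude becomes the cutoff; ties may keep a few extra weights.
    threshold = sorted((abs(w) for w in weights), reverse=True)[k - 1]
    return [w if abs(w) >= threshold else 0.0 for w in weights]


def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats onto integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights) or 1.0  # avoid a zero scale for all-zero input
    scale = max_abs / 127.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integer codes."""
    return [q * scale for q in quantized]


# Hypothetical toy weights, for illustration only.
weights = [0.82, -0.03, 0.41, 0.002, -0.67, 0.09, -0.25, 0.01]

pruned = magnitude_prune(weights, keep_fraction=0.5)
quantized, scale = quantize_int8(pruned)
restored = dequantize(quantized, scale)

# Rounding error per weight is bounded by half of one quantization step.
max_error = max(abs(a - b) for a, b in zip(pruned, restored))
print(pruned)
print(quantized)
print(max_error <= scale / 2 + 1e-12)
```

In practice, frameworks apply these steps per layer or per channel and usually fine-tune afterward to recover accuracy. Distillation is a separate training procedure and is not shown here.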

Petr: This means we have to understand what the application needs. Not everything needs a 30-billion-parameter massive LLM. There are techniques to create small LLMs, and there’s a very clear trend that those are getting better and better over time. We now have the first generation of reasoning models that are SLMs, which can be deployed. They are 8 or 20 billion parameters. You can run those on small GPUs. You don’t need a massive amount of compute, and if you have NPUs or TPUs, it can even go to battery-driven devices. There’s also the whole domain of neural networks, which are a trained solution for a designated purpose, and those don’t have to be massive. They’re sometimes really small because they serve a very designated purpose, like ‘I have a certain amount of data. Please fit this data. Give me something which I can use to model the system.’ And they can be really small. We’re talking about megabytes, not even gigabytes. So, depending on where you go with this, you don’t need a data center, and the pruning techniques mentioned are very simple. For example, if you sit in your car, you don’t need to know who the 23rd president of the U.S. was. Who cares? You want to know whether it’s a pedestrian or not.

Montanez: There is a partitioning of the use case, which we can fine-tune at the edge. Specifically, if you’re putting a small language model on a fitness watch, for example, then it’s going to care about cardiovascular fitness. It’s not going to care about who the president was. There’s an aspect of reducing that model for the particular use case that you’re trying to provide at the edge.

Uppala: To add to that, there are also constraints in every application. This means that the kind of power you supply to the application translates into architectural choices. For example, you have to pick the right kind of memory if power is a constraint, where LPDDR might be a good fit, etc. Similarly, you also have to look at what kind of flops you need to address the compute needs of this edge device. Some may be more, some may be less, and all of that translates into the requirements of the end application, and how you architect around that based on those constraints.

Ferro: The first challenge is to get the model parameters down. Until now, it’s been pretty brute force. We’re going to just throw GPUs at the problem, and now we can’t do that anymore, especially at the edge. So how do we rearchitect both the CPU, in terms of compute power, and the memory subsystem? The compute has pretty much outpaced the memory in terms of bandwidth. Of course, the models have grown faster than everything. Typically, you’ll see, even in a data center, things are very memory bandwidth limited, like in the old classic memory wall, or you see the roofline models. What we’re seeing now, as Raj mentioned, is how we move from just DIMMs in a data center to, say, LPDDR. Once Nvidia announced using LPDDR, everybody clustered around that. This includes using different form factors, like for lower power and smaller size form factors, instead of using DIMMs, and so that’s really where we’ve been. We’ve seen a lot of activity around how to get similar bandwidth, similar memory capacity, but using lower power memories, lower cost memories.
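
The roofline model mentioned here can be written down in a few lines. The sketch below, in Python, computes attainable throughput as the smaller of peak compute and memory bandwidth multiplied by arithmetic intensity. The hardware figures are hypothetical placeholders, not specifications of any real device.

```python
def attainable_flops(peak_flops, mem_bandwidth, arithmetic_intensity):
    """Roofline bound: min(peak compute, bandwidth * FLOPs performed per byte moved)."""
    return min(peak_flops, mem_bandwidth * arithmetic_intensity)


def ridge_point(peak_flops, mem_bandwidth):
    """Arithmetic intensity (FLOPs/byte) above which a kernel becomes compute bound."""
    return peak_flops / mem_bandwidth


# Hypothetical edge accelerator: 10 TFLOP/s peak, 50 GB/s memory bandwidth.
PEAK = 10e12
BANDWIDTH = 50e9

# Assumed workload: a matrix-vector product over 8-bit weights performs about
# 2 FLOPs (multiply + add) for each weight byte read from memory.
intensity = 2.0

bound = attainable_flops(PEAK, BANDWIDTH, intensity)
memory_bound = intensity < ridge_point(PEAK, BANDWIDTH)

print(f"attainable: {bound:.3g} FLOP/s, memory bound: {memory_bound}")
```

With these assumed numbers the kernel reaches only 1% of peak compute, which is the memory-wall situation described above: adding compute does nothing until bandwidth or arithmetic intensity improves.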

Cooper: From a hardware point of view, when you go from a bank of GPUs to inference on the edge, you’ve got power and area [to consider]. What performance can I get for a certain power budget, because I have a battery? What performance can I get for a certain area? This means cost. I’m going to have a lot of multiples. So for the cost I have a little memory, which means you have to get the algorithms down. What’s interesting is, you look at the bandwidth versus compute, and you have to balance that as well, because if you have just language, maybe you need more bandwidth but not a lot of compute. But now you’re adding LMMs (large multimodal models), and you have vision, so you need more compute again. Getting that mix right is challenging on the edge, and there are all sorts of interesting things that haven’t shaken out yet. With a convolutional neural network, we kind of defaulted to INT-8. Everybody said, ‘All right, INT-8. We kind of agree on that.’ But I don’t think there’s agreement yet. There’s maybe floating point 16 times INT-4 for coefficients, or maybe it’s floating point 8. I don’t think we’ve resolved that yet. There’s a lot of churn on the hardware side to figure out that sweet spot as well.
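
The trade-off between number format, memory, and bandwidth can be estimated with simple arithmetic. The Python sketch below assumes that weight storage is parameters times bits per weight, and that a language model generating one token at a time must read every weight once per token, so bandwidth divided by model size gives a rough upper bound on token rate. Both are common simplifications, and the 1-billion-parameter model and 50 GB/s bandwidth are hypothetical.

```python
# Bits per stored weight for the formats discussed above.
BITS_PER_WEIGHT = {"fp32": 32, "fp16": 16, "fp8": 8, "int8": 8, "int4": 4}


def weight_bytes(num_params, fmt):
    """Memory needed to hold the weights alone (activations and caches excluded)."""
    return num_params * BITS_PER_WEIGHT[fmt] / 8


def max_tokens_per_second(num_params, fmt, bandwidth_bytes_per_s):
    """Rough ceiling on decode rate when every weight is read once per token."""
    return bandwidth_bytes_per_s / weight_bytes(num_params, fmt)


# Hypothetical 1-billion-parameter model on a device with 50 GB/s of bandwidth.
NUM_PARAMS = 1_000_000_000
BANDWIDTH = 50e9

for fmt in ("fp32", "fp16", "int8", "int4"):
    size_gb = weight_bytes(NUM_PARAMS, fmt) / 1e9
    rate = max_tokens_per_second(NUM_PARAMS, fmt, BANDWIDTH)
    print(f"{fmt}: {size_gb:.1f} GB of weights, at most {rate:.1f} tokens/s")
```

Under these assumptions, halving the bits per weight halves the memory footprint and doubles the bandwidth-limited token rate, which is one reason the choice of format is still being debated for language workloads.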

Petr: CNNs might not be the best solution for everything. Different architectures may require different hardware.

Cooper: That’s true. But as we move away to transformers and then large language models, you have to move away from INT-8 only and try to figure out what the right mix is.

SE: Let’s jump to security, because this is such an important aspect. What is the security impact of moving an AI application to the edge, and how does that affect the architecture?

Uppala: Data centers are typically very secure. There are a lot of security checks that happen before anyone can get into the data center, so it’s protected in that sense. Once you start putting some of these devices on the edge, the attack surface widens across the board. You could do things like side-channel attacks, spoofing, and many other things where you don’t have these sorts of issues in the data center, but once you get to the edge, it becomes a huge challenge. Security also adds cost. So there’s that balance of where you’ll need to work toward security, but at what cost and what capability? It boils down to the application’s needs on what it can deliver at a certain cost point. To summarize, you have a much bigger attack surface compared to what you have at the data center, and then you have to address all these requirements and standardizations and things like that, which would definitely help for certain applications and IoT.

Montanez: The attack surface grows at the edge. One approach to think about is how to integrate a secure element. That’s something we’re doing at a microcontroller level, making sure that we have a secure boot, that we have the ability to encrypt and authenticate, and secure those keys from side-channel attacks. What’s super important about edge security, as well, is the fact that there’s a lot of IP in corporations. We’re going to be developing these very valuable models that make our products unique, and the ability to secure those and make sure that our IP and our customers’ data is protected is a really critical aspect. We need to have integrated-level security to reduce that attack surface.
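
As a simplified illustration of the "authenticate" step, the Python sketch below checks a model file against a keyed hash before it is loaded. This is an assumption-laden stand-in, not a description of any vendor's secure boot: real devices typically verify asymmetric signatures and keep keys inside a secure element, whereas here the key is an ordinary byte string and the function names are invented for the example.

```python
import hashlib
import hmac


def sign_model(model_bytes: bytes, key: bytes) -> bytes:
    """Compute an HMAC-SHA256 tag over the model image."""
    return hmac.new(key, model_bytes, hashlib.sha256).digest()


def verify_model(model_bytes: bytes, key: bytes, expected_tag: bytes) -> bool:
    """Return True only if the model image matches the tag it shipped with."""
    actual_tag = sign_model(model_bytes, key)
    # compare_digest runs in constant time, which avoids leaking how many
    # leading bytes matched through timing differences.
    return hmac.compare_digest(actual_tag, expected_tag)


# Hypothetical key and model contents; a real key would never appear in source code.
device_key = b"hypothetical-device-key"
model_blob = b"\x00\x01 example model weights"

tag = sign_model(model_blob, device_key)

ok = verify_model(model_blob, device_key, tag)
tampered_ok = verify_model(model_blob + b"x", device_key, tag)
print(ok, tampered_ok)
```

The device would refuse to load the model whenever verification fails, so a modified or swapped model image is rejected before it ever runs.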

Petr: I want to throw a bit of distinction into this discussion. What has just been discussed speaks to the hardware itself. Data center security versus you now have your own edge device. That can be a watch, a phone, a data center in your company. For that hardware piece, we heard good arguments. But there’s also a software piece to this whole discussion. The software piece, which we keep seeing, is that there are a handful of providers who basically offer you LLM access to very big LLMs. Those LLMs are run on the cloud, and those companies have access to everything you do, even though they say it’s secure. We’ve seen breaches from Google, from OpenAI, where they indexed questions from a private company and made it public, and it became public in the open space. We’ve seen where emails from CEOs got passed and published in the news. So there’s another aspect here, which is not the hardware piece, but who owns and who runs the LLMs? The LLMs, which are run in the big companies in the cloud, are inherently insecure because those companies, frankly, run at a speed that no one can keep up with. Security is not necessarily the most important factor when they look at metrics of how we move forward. A lot of companies that have IP and want to protect the IP shy away from using those big LLMs, which are run on those cloud systems, and that also drives the move to the edge. There are companies like Anthropic that offer Claude or other solutions, which you can also get into your premises. But, still, there’s a certain amount of connectivity back to the big mothership, which is still considered a risk. When we talk about, for example, all of us having IP in-house, would you upload that IP to a big LLM to the cloud? The answer most likely is no. So for those kinds of applications, we also see a massive shift to where our customers say, ‘We don’t want them. You need to provide us with a solution that we can ideally put in an air gap solution, so that this thing never talks back to the internet.’ That also drives edge compute. But here we’re not talking about battery-powered compute. We’re talking about massive, on-prem high-performance cloud compute systems, which are owned and secured by those companies to have their own large language models or small language models, which consume that information. You also see that in governments, there’s a whole gov cloud initiative, which basically decouples from the big clouds for the very same reason. No security institute wants to upload any of this data to the big guys. I just want to make that distinction. There’s a distinction between hardware security, but there’s also a distinction between LLM security and who owns it.

SE: There’s a level above the device level that is important here. How does this look today in the context of an organization’s IP and still differentiating and keeping up with the competition?

Petr: What I can tell you is that my team and I are building AI solutions for customers, and no one wants to use the cloud. They are completely afraid of it. There have been too many bad examples, so we have to build solutions that operate at the edge, and that makes it difficult and that makes it slow. People ask us when we are going to have a solution, and say we are significantly lagging behind the big guys, but they scrape the internet, and they keep pushing out stuff. For us, security is the most important KPI in that whole discussion.

Cooper: I certainly agree that from a security point of view, we have to look at the security of the algorithms. For example, if I do a ton of effort on training, and I’ve got to secure that, it’s much harder to do on the edge than in the data center because there’s more access there. There’s the security of the user’s data that’s a problem, and at least in the hardware side, there are hardware approaches. However, there are also system-level approaches — not necessarily inside the NPU or inside the AI accelerator, but certainly at the chip level — that have to be handled. And there are IP providers, like us, that have licensable security. But security has to be addressed, and it has to be a system-level issue for somebody who’s making the hardware.

Uppala: Eduardo touched upon secure boot and these kinds of applications, and Alex touched upon the higher-level software side of things. But there’s also the supply chain aspect of the security that needs to be considered, all the way from where it’s manufactured to where it’s being delivered to the end user. That needs to be secure. Otherwise, you can introduce a lot of malicious actors into the process who could gain access to the devices.

Petr: We had massive internet breaches around SolarWinds and others. There have been incidents in the last couple of years, which highlighted that our network is not as secure as we think it is. It basically boiled down to the supply chain — especially that the software stack in a supply chain is not secure — and governments are now driving supply chains and supply chain software security for everything that’s used in the design of hardware, and also what’s used to operate the networks. We see network operations as a commodity. It needs to be protected against malicious actors. What Raj just pointed out is, like every stack in that system, the hardware and the software used to design those solutions need to be secured so that no one has the opportunity to inject something that shouldn’t be there. We’ve seen backdoors being injected. For us in the whole industry — which operates at every level, from transistors all the way up to data centers and network operations — every level has its own security requirement, and that has also been identified as one of the biggest investment needs.

Related Reading
Moving AI Workloads To The Edge
There are benefits and challenges of processing AI workloads on-device to enhance performance, reduce costs, and ensure data privacy.


