AI/ML Workloads Need Extra Security

Faulty models, poison code, and corrupt data could cause widespread disruption.


The need for security is pervading all electronic systems. But given the growth in data-center machine-learning computing, which deals with extremely valuable data, some companies are paying particular attention to handling that data securely.

All of the usual data-center security solutions must be brought to bear, but extra effort is needed to ensure that models and data sets are protected when stored, when being transferred to and from accelerator blades, and when being processed on a system that hosts more than one tenant at the same time within the same server.

“Inference models, inference algorithms, training models, and training data sets are considered valuable intellectual property and need protection — especially since these valuable assets are handed off to data centers for processing on shared resources,” said Bart Stevens, senior director of product marketing for security IP at Rambus, in a recent presentation.

Any tampering with AI training data can cause the creation of a faulty model. And any changes to a well-trained model can result in incorrect conclusions being drawn by the AI engine. “All three main types of learning (supervised, unsupervised, and reinforcement) use weighted calculations to produce a result,” said Gajinder Panesar, fellow at Siemens EDA. “If those weightings are stale, corrupted, or tampered with, then the outcome can be a result that is simply wrong.”

The implications of an attack on an AI workload will depend on the application, but the result will never be good. The only question is whether it will cause serious damage or injury.

While attacks are the main focus for protection, they are not the only areas of concern. “The ‘threats’ fall into two broad categories — intentional interference by a bad actor and unintentional problems, which generally can be thought of as bugs, either in the hardware or the software,” said Panesar.

The security foundation
There are fundamental security notions that apply to any computing environment, and AI computing is no exception. While special attention must be paid to certain aspects of an AI workload, it is not just that workload that must be protected. “We have to think about the integrity of operation of the whole system, not just the particular chip or on-chip subsystem we’re dealing with,” said Panesar.

As outlined by Stevens, there are four aspects of security that must be handled. First, the data and computing must be kept private. Second, it should not be possible for an attacker to alter any of the data anywhere at any time. Third, all entities participating in the computing must be known to be authentic. And fourth, it should not be possible for an attacker to interfere with the normal operation of the computing platform.

This leads to some basic security concepts that hopefully will be familiar to anyone involved in secure-system design. The first of these is the protection of data in three phases:

1. Data at rest, which includes any stored data;
2. Data in motion as it’s communicated from one place to another; and
3. Data in use, which is active and alive in the computing platform as it is being worked on.

Another familiar requirement is the trusted execution environment (TEE). This is a computing environment limited to highly trusted software and accessible to the rest of the computing platform only through highly controlled and trusted channels. Any critical hardware or other assets that must not be compromised will be placed in this environment and will not be directly accessible outside the TEE.

The TEE provides a fundamental way of handling critical security operations in a way that’s far less subject to interference by outside software. It keeps application software separate from lower-level security operations. It also manages the boot process to ensure it proceeds securely and reliably, catching any attempts to boot inauthentic code.

There is a wide range of operations required for secure computing. Authentication ensures that entities with whom one is communicating are truly who they say they are. Encryption keeps data safe from prying eyes. Software and other data artifacts can have their provenance vouched for by hashing and signing operations. And all of these functions require keys of sufficient strength to protect against brute force hacking, and that makes effective key provisioning and management essential.
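As a rough illustration of the hashing and signing step, the sketch below checks that a model file has not changed since it was tagged. It is a minimal example under stated assumptions, not a production design: it uses only Python's standard library, and it substitutes a shared-key HMAC for the public-key signature a real deployment would more likely use. The function names (`digest`, `make_tag`, `verify_tag`) and the sample data are hypothetical.

```python
import hashlib
import hmac
import secrets


def digest(artifact: bytes) -> bytes:
    """Return the SHA-256 digest of an artifact (model, data set, firmware)."""
    return hashlib.sha256(artifact).digest()


def make_tag(key: bytes, artifact: bytes) -> bytes:
    """Compute an HMAC-SHA256 tag over the artifact's digest.

    Assumption: HMAC stands in for a real public-key signature.
    """
    return hmac.new(key, digest(artifact), hashlib.sha256).digest()


def verify_tag(key: bytes, artifact: bytes, tag: bytes) -> bool:
    """Recompute the tag and compare it in constant time."""
    return hmac.compare_digest(make_tag(key, artifact), tag)


# Hypothetical usage: a 256-bit key and a stand-in for model weights.
key = secrets.token_bytes(32)
model = b"example model weights"
tag = make_tag(key, model)

print(verify_tag(key, model, tag))            # untouched artifact
print(verify_tag(key, model + b"!", tag))     # tampered artifact
```

The constant-time comparison matters because an ordinary equality check can leak, through timing, how many leading bytes of a forged tag were correct.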

Additional protections are provided by ensuring that TEEs and other critical security circuits are protected from attempts either to break in or to disrupt operation. Side channels must be protected to ensure that there is no way to snoop data or keys by measuring externally detectable electronic artifacts like power or electromagnetic radiation.

And finally, a further layer of protection can be provided by circuits that monitor the internal goings-on to raise an alert if something suspicious appears to be afoot.

Applying this specifically to AI
Keeping AI workloads secure starts with these basic security requirements, whether training or inferring, and whether doing so in a data center, a local server, or in edge equipment. But there are additional considerations specific to AI workloads that must be taken into account.

“Secure AI implementations are required to prevent the extraction or stealing of inference algorithms, models and parameters, training algorithms, and training sets,” explained Stevens. “This would also mean preventing the unintended replacement of these assets with malicious algorithms or data sets. This would avoid poisoning the system to alter the inference results, causing mis-classification.”

The new AI processing hardware architectures provide another part of the system that needs protection. “The heart of the system is obviously the array of powerful accelerator chips, ranging from a handful to a large matrix of dedicated AI processing units with their own pool of memory and with only one task, which is to process as much data as possible in the shortest time frame,” noted Stevens.

Designers must first account for the specific assets that need protection. Most obvious is the training or inference hardware. “Typically seen on blades is a gateway CPU, with a dedicated flash and DDR,” said Stevens. “Its task is to manage models, add the assets, and control accelerators. Then there is the connection to the fabric: a high-speed network or PCIe-4 or -5 interfaces. Some blades also have proprietary inter-blade links.”

Fig. 1: A generalized AI blade for a data center. In addition to the usual CPU, dynamic memory, and network connection, accelerators will do the heavy lifting, assisted by internal SRAM. Source: Rambus

In addition, there are various types of data to be protected, and those depend on whether the operation is training or inference. When training a model, the training data samples and the basic model being trained must be protected. When inferring, the trained model, all of the weights, the input data, and the output results need protection.

Operationally, this is a new, rapidly evolving area, so debugging will likely be needed. Any debug must be performed securely, and any debug capabilities must be shut down when not in authenticated use.
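One way to picture "shut down when not in authenticated use" is a debug port that stays locked until the requester answers a fresh challenge. The sketch below is an assumption-laden illustration rather than any vendor's mechanism: the `DebugPort` class, the `respond` helper, and the shared debug key are all hypothetical, and real hardware would keep the key inside a root of trust.

```python
import hashlib
import hmac
import secrets


def respond(key: bytes, challenge: bytes) -> bytes:
    """Hypothetical debugger side: answer a challenge with HMAC-SHA256."""
    return hmac.new(key, challenge, hashlib.sha256).digest()


class DebugPort:
    """Hypothetical debug port that is locked unless a challenge is answered."""

    def __init__(self, key: bytes) -> None:
        self._key = key
        self._challenge = None
        self.unlocked = False

    def new_challenge(self) -> bytes:
        """Issue a fresh random challenge; each one can be used only once."""
        self._challenge = secrets.token_bytes(16)
        return self._challenge

    def unlock(self, response: bytes) -> bool:
        """Unlock only if the response matches the outstanding challenge."""
        if self._challenge is None:
            return False
        expected = respond(self._key, self._challenge)
        self._challenge = None  # discard so a captured response cannot be replayed
        self.unlocked = hmac.compare_digest(expected, response)
        return self.unlocked

    def lock(self) -> None:
        """Close the port again once the debug session ends."""
        self.unlocked = False


# Hypothetical usage.
debug_key = secrets.token_bytes(32)
port = DebugPort(debug_key)
answer = respond(debug_key, port.new_challenge())
print(port.unlock(answer))   # authenticated session
port.lock()
print(port.unlock(answer))   # replayed answer, no outstanding challenge
```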

And changes to code or any of the other assets must be delivered in well-secured updates. In particular, it’s likely that models will improve over time. So there must be a way to replace old versions with newer ones, while at the same time not allowing any unauthorized person to replace a valid model with an inauthentic one.
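A minimal sketch of such an update check follows. It assumes, beyond what the article states, that each model carries a monotonically increasing version number and that the device refuses anything not newer than what is installed. HMAC again stands in for a public-key signature, and `ModelUpdate`, `sign_update`, and `accept_update` are hypothetical names.

```python
import hashlib
import hmac
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelUpdate:
    version: int      # assumed monotonically increasing
    payload: bytes    # the new model contents
    tag: bytes        # authentication tag over version and payload


def _message(version: int, payload: bytes) -> bytes:
    """Bind the version number to the payload digest so neither can be swapped."""
    return version.to_bytes(8, "big") + hashlib.sha256(payload).digest()


def sign_update(key: bytes, version: int, payload: bytes) -> ModelUpdate:
    """Publisher side: tag the update (HMAC is a stand-in for a signature)."""
    tag = hmac.new(key, _message(version, payload), hashlib.sha256).digest()
    return ModelUpdate(version, payload, tag)


def accept_update(key: bytes, installed_version: int, update: ModelUpdate) -> bool:
    """Device side: accept only authentic updates that are strictly newer."""
    expected = hmac.new(
        key, _message(update.version, update.payload), hashlib.sha256
    ).digest()
    if not hmac.compare_digest(expected, update.tag):
        return False  # inauthentic or altered
    return update.version > installed_version  # reject replays of old models


# Hypothetical usage.
key = b"\x11" * 32
update = sign_update(key, 2, b"weights-v2")
print(accept_update(key, 1, update))   # newer and authentic
print(accept_update(key, 2, update))   # same version, rejected
```

Checking the tag before the version keeps an attacker from learning anything about the installed version by probing with forged updates.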

“Secure firmware updates, as well as the ability to be able to debug the system in a secure way, are becoming table stakes these days,” noted Stevens.

Risks of data breaches
It’s pretty obvious that the data must be protected against being stolen. Any such theft is clearly a confidentiality breach, but the ramifications of that are even more dire where government regulations are involved. Examples of such regulation are the GDPR rules in Europe and the HIPAA health-care rules in the United States.

But in addition to outright theft, manipulation of the data is also of concern. Training data, for example, could be altered either as a means of sleuthing out some secret or simply to poison the training so that the resulting model would work poorly.

Much of the computing — especially when training a model — will occur in a data center, and that may involve multi-tenant servers for lower-cost operation. “More companies and teams are relying on shared cloud computing resources for a variety of reasons, mostly for scalability and cost,” observed Dana Neustadter, senior product marketing manager for security IP at Synopsys.

That means multiple jobs co-existing on the same hardware. And yet those jobs must execute no less securely than if they were on separate servers. They must be isolated by software in a manner that keeps anything – data or otherwise – from leaking from one job to another.

“Moving computing to the cloud can bring potential security risks when the system is no longer under your control,” said Neustadter. “Whether mistaken or malicious, one user’s data can be another user’s malware. The users need to trust the cloud provider to meet compliance standards, perform risk assessments, control user access, and so on.”

Containerization usually helps to isolate processes in a multi-tenant environment, but it’s still possible for one rogue process to affect others. “A problem that causes an application to hog processing resources may affect other tenants,” noted Panesar. “This is especially important in critical environments such as medical reporting, or anywhere the tenants have a binding SLA (service-level agreement).”

Finally, while it may not affect the specific outcome of a computation or confidentiality of data, data-center operations must ensure that administrative operations are safe from tinkering. “Security should also be present to ensure proper billing of services and to prevent unethical use, such as racial profiling,” Stevens pointed out.

New standards will help developers to ensure that they’re covering all the necessary bases.

“The industry is developing standards like PCIe-interface security, with the PCI-SIG driving an integrity and data encryption (IDE) specification, complemented by component measurement and authentication (CMA) and trusted execution-environment I/O (TEE-I/O),” said Neustadter. “The assignable device interface security protocol (ADISP) and other protocols expand the virtualization capabilities of the trusted virtual machines used to keep confidential computing workloads isolated from hosting environments, backed by strong authentication and key management.”

Fig. 2: AI computing involves a number of assets, and each has specific security needs. Source: Rambus

Implementing protections
Given a typical AI computing environment, then, there are several steps that must be taken to lock down operations. They start with a hardware root of trust (HRoT).

An HRoT is a trusted, opaque environment where secure operations like authentication and encryption can be performed without exposing the keys or other secrets being used. It could be a critical component of a TEE. HRoTs are usually associated with a processor in a classic architecture, but here there is typically more than one processing element.

In particular, the newer hardware chips dedicated to AI processing don’t have built-in root-of-trust capabilities. “Many recent AI/ML accelerator designs — especially by startups — have focused mainly on getting the most optimal NPU processing on board,” explained Stevens in a follow-up interview. “Security was not the main focus, or was not on their radar.”

That means a system will need to provide an HRoT elsewhere, and there are a couple of options for that.

One approach, which focuses on data in use, is to give each computing element — the host chip and the accelerator chip, for example — its own HRoT. Each HRoT would handle its own keys and perform operations at the direction of its associated processor. They may be monolithically integrated on SoCs, although that’s not currently the case for neural processors.

The other option, which focuses on data in motion, is to provide an HRoT at the network connection to ensure that all data entering the board is clean. “For data in motion, the throughput requirements are extremely high, with very low latency requirements,” said Stevens. “The systems use ephemeral keys, as they typically work with session keys.”

“For authentication, a blade would need to get an identification number, which doesn’t necessarily need to be kept secret,” he continued. “It just needs to be unique and immutable. It can be many IDs, one for each chip, or one for the blade or appliance itself.”
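To make the idea of ephemeral session keys concrete, the sketch below derives a fresh key for each session from a long-lived secret, a per-session nonce, and the blade's public ID. It implements the HKDF construction from RFC 5869 with Python's standard `hmac` module. How the shared secret is established (for example, by a key exchange inside the HRoT) is outside the sketch and assumed, and `derive_session_key` and the `blade-link|` label are hypothetical.

```python
import hashlib
import hmac
import secrets

HASH_LEN = hashlib.sha256().digest_size  # 32 bytes


def hkdf_extract(salt: bytes, ikm: bytes) -> bytes:
    """HKDF-Extract (RFC 5869): condense input keying material into a PRK."""
    return hmac.new(salt, ikm, hashlib.sha256).digest()


def hkdf_expand(prk: bytes, info: bytes, length: int) -> bytes:
    """HKDF-Expand (RFC 5869): stretch the PRK into `length` output bytes."""
    if length > 255 * HASH_LEN:
        raise ValueError("requested length too large for HKDF-SHA256")
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def derive_session_key(shared_secret: bytes, session_nonce: bytes,
                       blade_id: bytes, length: int = 32) -> bytes:
    """Derive a per-session key bound to a (non-secret) blade ID."""
    prk = hkdf_extract(session_nonce, shared_secret)
    return hkdf_expand(prk, b"blade-link|" + blade_id, length)


# Hypothetical usage: two sessions on the same blade get unrelated keys.
shared_secret = secrets.token_bytes(32)
blade_id = b"blade-0001"
key_a = derive_session_key(shared_secret, secrets.token_bytes(16), blade_id)
key_b = derive_session_key(shared_secret, secrets.token_bytes(16), blade_id)
print(key_a != key_b)
```

Because the blade ID only enters as context, it can be public, as Stevens notes. It simply ensures that a key derived for one blade is useless on another.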

These external HRoTs may not be needed when security is built into future neural processing units (NPUs). “Eventually, when the startups’ initial NPU proofs of concept have been shown to be successful, the architecture of their second spin of these designs will have root of trust capabilities in them, which will have more cryptographic capabilities to handle the larger workloads,” added Stevens.

Data moving from SRAM to DRAM, or vice versa, also should be encrypted to ensure it can’t be snooped. The same would apply to any direct side connection to a neighboring board.

With that much encryption embedded in an already intense computation, one runs the risk of bogging down operation. Secure operation is critical, but it serves no one if it cripples the operation itself.

“The network or PCI Express link to the fabric should be protected by inserting a high-throughput L2 or L3 protocol-aware security packet engine,” added Stevens. “Such a packet engine requires little support from the CPU.”

This can apply to memory and blade-to-blade traffic encryption as well. “The contents of the gateway CPU DDR and local AI accelerator GDDRs can be protected by an inline memory encryption engine,” he said. “If a dedicated blade-to-blade side channel exists, it can be protected by high-throughput AES-GCM [Galois/Counter Mode] link-encryption accelerators.”

Finally, standard security protections can be buttressed by ongoing monitoring that keeps track of actual operation. “You need to gather information from the hardware that can tell you how the system is behaving,” said Panesar. “This needs to be real-time, instantaneous, and long-term statistical. It also needs to be comprehensible (whether by a human or a machine) and actionable. Temperature, voltage, and timing data is all very well, but you also need higher-level, more sophisticated information.”

But this is no substitute for rigorous security. “The aim is to identify problems that might elude conventional security protections – but it’s not a substitute for such protection,” he added.
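The combination of instantaneous and long-term statistical views that Panesar describes can be sketched as a monitor that compares each new reading with a rolling baseline. This is a simplified illustration, not a description of any vendor's monitoring IP: the `TelemetryMonitor` class, the window size, and the four-sigma threshold are all assumptions, and the readings are made-up numbers.

```python
from collections import deque
from statistics import mean, pstdev


class TelemetryMonitor:
    """Hypothetical monitor: flag readings far from a rolling baseline."""

    def __init__(self, window: int = 100, threshold: float = 4.0,
                 min_samples: int = 10, min_sigma: float = 1e-6) -> None:
        self.history = deque(maxlen=window)  # long-term statistical view
        self.threshold = threshold           # allowed deviation, in sigmas
        self.min_samples = min_samples       # readings needed before judging
        self.min_sigma = min_sigma           # floor to avoid a zero-width band

    def observe(self, value: float) -> bool:
        """Return True if `value` is anomalous against the baseline."""
        anomalous = False
        if len(self.history) >= self.min_samples:
            mu = mean(self.history)
            sigma = max(pstdev(self.history), self.min_sigma)
            anomalous = abs(value - mu) > self.threshold * sigma
        if not anomalous:
            # Only normal readings update the baseline, so an attack
            # cannot slowly drag the baseline toward itself unnoticed
            # by first injecting a single extreme value.
            self.history.append(value)
        return anomalous


# Hypothetical usage: steady readings followed by a sudden spike.
monitor = TelemetryMonitor()
steady = [50.0 + (i % 5) * 0.1 for i in range(50)]
flags = [monitor.observe(v) for v in steady]
print(any(flags))             # steady readings
print(monitor.observe(75.0))  # sudden spike
```

A real system would feed this kind of check with the higher-level signals Panesar mentions, such as bus transactions or per-tenant resource use, rather than a single scalar.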

Hard work ahead
These elements aren’t simple to implement, and putting them in place requires hard work. “Resiliency, the ability to securely update a system, and the ability to recover from a successful attack are real challenges,” noted Mike Borza, security IP architect at Synopsys. “Building systems like that is very, very tough.”

But as AI computing becomes more and more routine, engineers who aren’t specialists in data modeling or security increasingly will be turning to ML services as they work AI into their applications. They need to be able to count on the infrastructure taking good care of their important data, so the models and computations they’ll be using to differentiate their products don’t end up in the wrong hands.
