New Uses For AI In Chips

ML/DL is increasing design complexity at the edge, but it’s also adding new options for improving power and performance.


Artificial intelligence is being deployed across a number of new applications, from improving performance and reducing power in a wide range of end devices to spotting irregularities in data movement for security reasons.

While most people are familiar with using machine learning and deep learning to distinguish between cats and dogs, emerging applications show how this capability can be used differently. Data prioritization and partitioning, for example, can be used to optimize the power and performance of a chip or system with no human intervention. And various flavors of AI can be used throughout the design and manufacturing flows to catch errors or flaws that humans cannot. But all of these new components and functions also make designing chips more complex, even at more mature nodes, as probabilities replace definitive answers and the number of variables increases.

“As you move AI out to the edge, the edge starts to look like the data center,” said Frank Ferro, senior director of product management at Rambus. “There’s the baseband, which is doing a lot of the same processing function. And similarly on the memory requirements, we’re seeing a lot of 5G customers running out of bandwidth and looking to HBM at the edge of the network. But instead of going to the cloud, there is more configurability in the network, and you can manage the workloads. Balancing those workloads is very important.”

Still, nothing in the AI world is simple, as AI chip designers have learned. “In an AI design, there are many questions to be answered,” said Ron Lowman, strategic marketing manager at Synopsys. “What algorithm are you trying to process? What is your power budget? What accuracy are you trying to achieve? In an image recognition application, you may not need a 32-bit floating-point processor. A lower-cost 16-bit image chip may do just fine. If a 92% accuracy is all you need, a low-cost chip may cut down your overall budget. If you know what you want to achieve, taking the IP approach will have a great deal of advantages. You can select the right AI processors, the right kind of memory (SRAM or DDR), I/O, and security. Selecting the right IP is important, but doing the modeling and benchmarking also will help developers to optimize the AI solutions and reduce errors.”
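The bit-width tradeoff Lowman describes can be sketched in a few lines. This is a hypothetical illustration, not anything specific to Synopsys IP: the weight values and the symmetric int8 scheme below are assumptions, chosen to show how narrowing the datatype shrinks storage 4x while introducing a bounded rounding error.

```python
# Hypothetical sketch: symmetric linear quantization of float32-style
# weights to 8-bit integers. The weight values are made up for illustration.
weights = [0.731, -1.892, 0.044, 2.513, -0.607, 1.245, -2.981, 0.318]

scale = max(abs(w) for w in weights) / 127.0      # map largest weight to int8 range
quantized = [round(w / scale) for w in weights]   # stored as 8-bit integers (4x smaller)
recovered = [q * scale for q in quantized]        # dequantized values used at compute time

# The rounding error is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, recovered))
assert max_error <= scale / 2 + 1e-9
print(f"scale={scale:.5f}, max rounding error={max_error:.5f}")
```

Whether that error is acceptable depends on the accuracy target; at a 92% accuracy goal, a narrower datatype is often good enough, which is exactly the budget argument above.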

Design challenges can add up very quickly for any advanced chip, and more variables require better models, more process steps, and more time. “You start from a very complex idea of what the chip is going to be performing, and then you see if there are different requirements for different parts of the chip,” said Roland Jancke, head of design methodology in Fraunhofer IIS’ Engineering of Adaptive Systems Division. “In the past, you would just design something, develop it, tape it out, and see whether it works or not. That’s no longer feasible. You really need an integrated process. From the very beginning you need to think about possible failure modes. And you may even need to start with finite element methods for simulation at the very beginning, which typically has not been done in the past, where you start with very rough models of the functionality that you want to integrate. A MATLAB model, for instance, does not reflect the physical interaction between the different parts of the chip. You need to integrate different models early in the development process — physical models, functional models — to see whether your concept is going to be functional enough.”

That becomes harder with more moving pieces, particularly when those pieces are customized or semi-customized for specific data types and use cases. But the upside is that better algorithms and compute elements also allow more data to be processed within a much smaller footprint, and with far less power than in past implementations. That, in turn, enables the processing to occur much closer to the sources of data, where it can be used to determine which data is important, where that data should be processed at any particular point in time, and which data can be discarded.

A tipping point
Most of these changes by themselves are incremental and evolutionary, but collectively they enable inferencing and training across the edge, where a spectrum of heterogeneous architectures is beginning to emerge. By leveraging various types of neural networks, processing can be sped up for targeted purposes, with varying levels of accuracy and precision for different applications.

Fig. 1: The complex AI process can be broken down into AI Stacks. Source: McKinsey & Co.

For any AI chip that performs complex algorithms and computations, there are several key requirements. First, it needs to be able to process data in parallel, using multiple compute elements and wide data paths to reduce latency. In many cases that also involves some localized memory in close proximity to compute elements, as well as high-bandwidth memory. Second, these devices need to be optimized for size, cost, and power budgets, which often requires high-throughput architectures that are sized according to projected workloads. That, in turn, requires a number of tradeoffs, which need to be balanced for the particular use case. And third, these architectures often involve a mix of processors to manage complex data flows and power management schemes, which can include CPUs, GPUs, FPGAs, eFPGAs, DSPs, NPUs, TPUs, and IPUs.
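One way to see what “sized according to projected workloads” means in practice is a back-of-envelope calculation. Every number below is a hypothetical assumption (a 5 GMAC-per-frame vision network, 30 frames/s, an 800 MHz MAC array at 60% sustained utilization), not a figure from any vendor:

```python
# Hypothetical throughput-sizing sketch: how many parallel MAC units does
# a workload need? All figures here are illustrative assumptions.
macs_per_frame = 5e9          # assumed cost of one inference pass (MACs)
frames_per_sec = 30           # assumed target frame rate
required_macs_per_sec = macs_per_frame * frames_per_sec   # 1.5e11 MAC/s

clock_hz = 800e6              # assumed MAC-array clock
utilization = 0.6             # real workloads rarely sustain the peak rate
units_needed = required_macs_per_sec / (clock_hz * utilization)

print(f"need ~{units_needed:.1f} parallel MAC units")
```

Changing any one assumption (frame rate, clock, utilization) moves the answer, which is exactly the tradeoff balancing described above: the same arithmetic runs in reverse when deciding how much workload a fixed-size array can carry.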

“In design, developers need to consider the requirements of training, inferencing, low power, connectivity, and security,” said Danny Watson, principal software product marketing manager for Infineon’s IoT, Wireless and Compute Business Unit. “This approach enables new use cases that need fast local decisions, while meeting the power budget of today’s IoT products.” Watson noted that the key is collecting the right data so applications can capitalize on it, allowing them to take advantage of technology improvements.

AI everywhere
For chip companies, this is all a very big deal. According to the latest report by Precedence Research, the AI market as a whole will grow from $87 billion in 2021 to more than $1.6 trillion by 2030. That includes data centers as well as edge devices, but the pace of growth is significant. In fact, AI is such a hot field today that almost every major tech company is investing in or making AI chips. They include Apple, AMD, Arm, Baidu, Google, Graphcore, Huawei, IBM, Intel, Meta, NVIDIA, Qualcomm, Samsung, and TSMC. The list goes on and on.

This market barely existed five years ago, and a decade ago most companies were thinking in terms of cloud computing and high-speed gateways. But as new devices are rolled out with more sensors — whether that’s cars, smart phones, or even appliances with some level of intelligence built into them — so much data is being generated that it requires architectures to be devised around the input, processing, movement, and storage of that data. That can happen on multiple levels.

“In AI applications, various techniques are being deployed,” said Paul Graykowski, senior technical marketing manager at Arteris IP. “A recent customer has developed a complex multi-channel ADAS SoC that can handle four channels of sensor data, each with its own dedicated compute and AI engine to process the data. Similarly, new AI chip architectures will continue to change to meet the requirements of new applications.”

From big to small
Time to results is usually proportional to distance, and shorter distances mean better performance and lower power. So while massive datasets still need to be crunched by hyperscale data centers, there is a concerted effort by the chip industry to move more processing downstream, whether that’s machine learning, deep learning, or some other AI variant.

Cerebras is the poster child in the deep-learning world, where speed is critical, followed closely by accuracy of results. Natalia Vassilieva, product management director at Cerebras, reported that GlaxoSmithKline increased its drug discovery efficiency by using the company’s wafer-scale device for its Epigenomic Language Models. In one scenario, GlaxoSmithKline reduced the deep neural network-based virtual screening time for a large library of compounds from 183 days, running on a GPU cluster, to 3.5 days on the Cerebras device. That “chip” has more than 2.6 trillion transistors, 850,000 AI-optimized cores, 40 GB of on-chip memory, and a memory bandwidth of 20 PB per second (one petabyte equals 1,024 terabytes). It also consumes 23 kW of power, and uses internal closed-loop, direct-to-chip liquid cooling.
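The speedup implied by those screening figures is simple arithmetic on the two numbers quoted above:

```python
# Speedup implied by the figures above: 183 days on a GPU cluster
# vs. 3.5 days on the wafer-scale device.
gpu_days = 183
wafer_scale_days = 3.5
speedup = gpu_days / wafer_scale_days

print(f"~{speedup:.0f}x faster")  # roughly a 52x reduction in screening time
```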

Graphcore took a different approach with its intelligence processing unit (IPU). By using multiple-instruction, multiple-data (MIMD) parallelism and local distributed memory, an IPU system can deliver 22.4 PFLOPS (1 PFLOPS equals 1,000 TFLOPS) while requiring only air cooling. Additionally, the IPU has a theoretical arithmetic throughput of up to 31.1 TFLOPS in single precision, faster than the 19.5 TFLOPS single-precision peak of Nvidia’s A100 GPU. In one test conducted by Twitter, the IPU outperformed the GPU.

Fig. 2: The IPU technology, taking advantage of multiple instruction, multiple data (MIMD) parallelism and local distributed memory, outperforms GPU. Source: Graphcore

AI also can go small. AI-enabled smart things, also known as the artificial intelligence of things (AIoT) or embedded AI, are flourishing. According to Valuates Reports, edge AI hardware will grow from $7 billion in 2020 to $39 billion in 2030. AI has added intelligence to edge computing, network endpoints, and mobile devices. Along with the IoT, more and more applications are using embedded AI. These include wearables, smart homes, and smart remote controls, including some that use voice recognition. Also relying on embedded AI are AR/VR gaming, smart automotive panels, object and movement detection, home health care, meter reading, smart factories, smart cities, industrial automation, and smart buildings, including control and energy management. The list goes on, limited only by one’s imagination.

“AI will make IoT computing more efficient with its ability to process data faster locally,” said Suhas Mitra, product marketing director for Tensilica AI products at Cadence. “This includes providing better response time and smaller latencies, since data being generated is also processed on the edge device instantly. Executing AI edge processing will be more reliable, since it may not always be possible to send a massive amount of data via a live wireless or wired connection to the cloud constantly. It also relieves the pressure of storing and processing enormous amounts of data in the cloud, which could contain personal and sensitive information. Privacy concerns about sending user information to the cloud may make it impossible to upstream data regardless of consent. Doing more edge computing extends the battery life, as some of the compute requires fewer cycles on the edge platform when using AI approaches. As a result, less energy is consumed, and heat dissipation is lower.”

All AI chips need to be trained before inferencing can take place. While the initial datasets are often so large that training requires a data center, further training can be done on a personal computer or development system. Developers go through a painstaking process to ensure an optimal inference algorithm is achieved. Many AI chip manufacturers furnish a list of training partners for their customers. Even with consultants’ help, developers still have to pay for the consulting time and go through the training effort.

An easier way is to start with pre-trained models, such as Flex Logix’s EasyVision platform. “With the pre-trained X1M chip for modules, developers can bypass the training process and go directly to product development and test,” said Sam Fuller, senior director of inference marketing at Flex Logix. “The pre-trained solutions have been field-tested and proven, which is much more efficient than developers’ trial-and-error approach. Often, the dedicated pre-trained chip is much more efficient than regular CPUs.”

Thinking even smaller
The possibilities for including AI in even smaller devices are growing as well, thanks to tiny machine learning (tinyML), supported by the tinyML Foundation to advance embedded on-device ML and data analysis operating in the mW range. Many of these devices can perform ML on vision, audio, inertial measurement unit (IMU), and biomedical data. The foundation also provides an open-source neural network optimization framework called ScaleDown to simplify deploying ML models to tinyML devices.

TinyML can run on any programmable AI chip, including Arduino boards. Arduino’s mission has been to provide electronic devices and software to hobbyists, students, and educators. It has evolved over the years, and solutions based on Arduino are used in many industrial segments today. Combining tinyML and Arduino hardware can provide very low-cost embedded AI solutions, with typical hardware costing less than $100.

One of the challenges of designing AI into these tiny devices is power budgeting. Synaptics has taken on the challenge of developing low-power-budget AI and sensor chips. According to Ananda Roy, senior product manager leading the low-power AI product line at Synaptics, the company’s Katana AI SoC is capable of people detection/counting and fall detection, and can run active AI vision inferencing at 30 mW when clocked at 24 MHz, or at higher power at 90 MHz. Deep sleep mode consumes less than 100 µW. Overall, it is much more power-efficient than other AI chips. To achieve efficient power management, the neural processing unit (NPU) relies on a memory architecture with multiple memory banks that can be set to ultra-low-power modes when not in use, along with scalable operating voltage and processor speed, much like stepping on the gas when you need your car to go faster.
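Power budgets like these are usually evaluated over a duty cycle, since the device spends most of its life asleep. The sketch below uses the active and deep-sleep figures quoted above; the 2% active duty cycle is a hypothetical assumption for illustration:

```python
# Duty-cycle power sketch for an always-on AI sensor.
# Active and sleep figures are those quoted above; the duty cycle is assumed.
active_mw = 30.0      # active inference power (mW), at 24 MHz
sleep_mw = 0.1        # deep-sleep power (mW), i.e. ~100 uW

duty_cycle = 0.02     # hypothetical: device is awake 2% of the time
avg_mw = duty_cycle * active_mw + (1 - duty_cycle) * sleep_mw

print(f"average power: {avg_mw:.3f} mW")  # sub-milliwatt average
```

This is why the bank-by-bank sleep modes matter: the average is dominated by the sleep-state floor, not the active peak, once the duty cycle is low enough.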

FlexSense, a sensor chip for AI applications, was designed by combining a low-power RISC CPU with an analog front end that is highly optimized for efficient conversion of inductive and capacitive sensor inputs. Together with on-board Hall-effect and temperature sensors, it integrates four sensing functions to detect inputs such as touch, force, proximity, and temperature, all in one small package (1.62 x 1.62 mm) that consumes only 240 µW, or 10 µW in sleep mode. Traditional designs would require four ICs.

Fig. 3: The low-power sensor comes in a small package (1.62 x 1.62 mm). It replaces four ICs. Source: Synaptics

Security issues and improvements
When it comes to security, AI is both a potential vulnerability and a potential solution. As AI chips are optimized for specific use cases, and as algorithms are updated, the industry has less accumulated experience with each new design, and the attack surface widens. But AI also can be used to identify unusual patterns in data traffic, sending out alerts or autonomously shutting down affected circuits until further analysis can be done.

Srikanth Jagannathan, product manager at NXP, pointed out the importance of having the right mixture of functions, chip security, and low power for battery-operated devices. The i.MX AI chip combines Arm’s low-power Cortex-M33 with Arm TrustZone and NXP’s on-chip EdgeLock, embedded ML, and multiple I/Os. Power consumption is expected to be less than 2.5 watts, yet it is able to deliver 0.5 TOPS (512 parallel multiply-accumulate operations at 1 GHz).
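The TOPS figure follows directly from the parenthetical: 512 parallel MACs at 1 GHz, counting each multiply-accumulate as a single operation (as the quoted 0.5 TOPS implies; some vendors count a MAC as two operations, which would double the number):

```python
# Checking the arithmetic behind the quoted 0.5 TOPS figure:
# 512 parallel MAC units at 1 GHz, one operation per MAC.
mac_units = 512
clock_hz = 1e9
ops_per_sec = mac_units * clock_hz   # 5.12e11 operations/s
tops = ops_per_sec / 1e12

print(f"{tops} TOPS")  # 0.512 TOPS, i.e. roughly the quoted 0.5 TOPS
```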

Fig. 4: The i.MX AI chip combines Arm’s low-power Cortex-M33 with Arm TrustZone and NXP’s on-chip EdgeLock, embedded ML, and multiple I/Os. Source: NXP

Kathy Tufto, senior product manager in Siemens EDA’s Embedded Software Division, pointed to the need for establishing a software chain of trust, noting that it starts from the hardware. The goal is to prevent any code that hasn’t been authenticated and validated from executing. Among the solutions she identified:

  • Data at rest – secure boot root of trust, and software chain-of-trust access control.
  • Data in motion – security protocols and crypto acceleration.
  • Data in use – hardware-enforced separation through a memory management unit (MMU).

“Device manufacturers must also keep in mind that security issues commonly arise after devices are deployed, which means they need to design their devices in a way that they can be updated after they have been deployed,” Tufto said. “Regulatory bodies, including the FDA, are insisting on a strategy of managing CVEs both pre- and post-release to satisfy security requirements for medical devices. Common vulnerabilities and exposures (CVE) monitoring is a process where new CVEs are evaluated against the modules in the device, allowing the device manufacturer to determine appropriate action when new CVEs are found. While a manufacturer can perform these activities itself, it is simpler and easier if you use a commercial software solution that includes security vulnerability monitoring and patches, such as Sokol Flex OS, Sokol Omni OS, and Nucleus RTOS.”

AI chips will continue to evolve and scale, and AI will be used in multiple ways both within those chips and by those chips. That will make it more difficult to design those chips, and it will make it harder to ensure they work as expected throughout their lifetimes, both from a functional and a security standpoint. It will take time to see which benefits outweigh the risks.

Developers continue trying to make AI emulate the human brain, but they are a long way from a device that actually can think for itself. Nevertheless, there are many ways to optimize these systems for specific use cases and applications, and not all of them require human intervention. As time goes on, that will likely mean more AI in more places doing more things, and it will raise design challenges involving power, performance, and security that are difficult to plan for, to identify, and ultimately to fix.

