Agentic AI Is Changing Data Center Architectures

Standalone GPUs are being replaced by heterogeneous SoCs and chiplets that combine CPUs, GPUs, and NPUs to eliminate memory bottlenecks, reduce latency, and boost efficiency.

Key Takeaways:

  • The rise of agentic AI is shifting data centers from GPU-centric number crunching to CPU-driven orchestration, where managing long-running reasoning loops and context is just as important as raw compute.
  • Integrating CPUs, GPUs, and stacked memory into tightly coupled multi-die architectures with varying workloads makes it much harder to ensure they will be reliable and efficient.
  • Hardware-level security, access control, and monitoring are becoming critical design requirements so autonomous agents cannot access forbidden data or execute untrusted code, further raising the bar for chip and system architects.

The rapid proliferation of agentic AI is forcing chip and system architects to rethink data center design from the ground up. Instead of optimizing only for raw GPU throughput, they must now validate complex hybrid systems in which CPUs orchestrate long‑running reasoning loops and manage context, memory, and data movement. GPUs and accelerators will still handle the heaviest numerical work, but that is only one piece of the overall system.

That shift also greatly expands the verification challenge. Functional and performance verification must be done together, with large‑scale emulation and prototyping, realistic agent workloads, and a deep focus on memory architecture, context swapping, power behavior, and thermal integrity in 3D-IC and stacked-memory designs. And all of this needs to be secure and reliable, with monitors and access controls that keep autonomous agents from touching forbidden data or executing untrusted code.

“The rise of agentic AI is reshaping the demands placed on the CPU,” said Satadal Bhattacharjee, global head of cloud and AI infrastructure silicon in Arm’s Cloud AI Business Unit. “As AI systems become more complex, the CPU is emerging as the orchestration and execution engine for a continuously operating intelligence loop, managing context, tool calls, memory movement, security boundaries, and accelerator utilization.”

Arm’s forecast indicates that agentic AI will require data centers to deliver up to four times the CPU core density within the same power envelope, but this doesn’t diminish the importance of accelerators. “This underscores the critical reality that accelerator performance increasingly depends on the efficiency, responsiveness, and balance of the entire system,” Bhattacharjee said.

At the same time, agentic workloads introduce more unpredictable control flow, irregular memory access patterns, synchronization requirements, and I/O intensity. “Avoiding system-level stalls will require tighter CPU-accelerator coupling, more efficient data movement, higher-bandwidth memory access, and system fabrics capable of supporting coherency, isolation, and scale,” he said. “As a result, heterogeneous architectures are becoming both more modular and more tightly integrated. Technologies such as PCIe, CXL, coherent chip-to-chip links, and advanced fabric IP give system designers new ways to balance flexibility, bandwidth, latency, and efficiency.”

The impact of agents on data center architectures is fundamental. “When we talk about AI, GPUs used to be used for matrix math and number crunching,” said Sathishkumar Balasubramanian, head of products at Siemens EDA. “That’s totally changing now, because the agentic flow is coming up. CPUs were mainly used for inputting data and loading it into different GPUs. That use is changing from data loaders to data orchestration. The entire orchestration layer is handled by the CPUs, so Intel is seeing demand going up because people have realized they need CPUs to do a lot of these agentic workflow tasks and only use a GPU when it’s necessary. Again, the rise of data orchestration is going to be very key, and we’re moving from offloading data to orchestrating the data, and that’s the new compute cluster.”

Agentic reasoning loops are now driving the infrastructure away from isolated boxes. “You only start doing GPU-heavy things when it’s needed,” Balasubramanian said. “What’s also changing is that there used to be GPUs in one rack, CPUs in another rack. The problem with that is that everything needs to access a memory where the actual data is, and there’s too much latency. So now [the processor developers] are trying to do what the server companies did in the past, having both the GPU and the CPU in the same rack.”

Because agentic AI requires complex orchestration, tool-calling, and reasoning loops that cannot rely solely on GPUs, as AI workloads largely did for the past few years, the industry is shifting back to tightly integrated heterogeneous SoCs and chiplets. This is evident in recent announcements involving Intel’s Core Ultra Series 3 mobile processors, code-named Panther Lake, Nvidia’s RTX Spark PC chips (which have Arm CPUs), Apple’s Fusion architecture, AMD’s APUs, and Nvidia’s Vera Rubin platform, among others.

The concept isn’t exactly new. Intel first introduced an SoC with a CPU and a GPU in January 2010. But the way the two processors interact has changed fundamentally. Early SoCs treated the integrated GPU as a secondary component, used to drive a display or render basic 3D graphics while relying on slow, separate memory pools. Today’s agentic AI-focused SoCs are engineered for continuous, asynchronous, multi-step execution loops. This has inspired architectural innovations that did not exist in older SoC designs.

“They merged them within the same die, and they share the same bandwidth in terms of memory protocols and everything, so they can access the unified memory,” Balasubramanian said. “The latency is completely reduced, and there’s a lot more work happening both on the CPU side and on the GPU side. The architecture is changing completely with the way things are happening. Even PCs are going to be like that, with much more beefed-up GPUs and CPUs, because you’ve got to be running your own NemoClaws, and all your own 24/7 agents, and that requires a lot of heavy lifting both on the local compute and also on the data center side.”

The architecture of these chips can vary greatly, depending on the end application.

“While agentic AI is indeed rapidly impacting data center compute allocation ratios (CPU versus GPU), the true impact of the rise of agentic AI will be felt most acutely in the overall fabric of cloud versus edge compute, and felt particularly by the AI service companies themselves,” said Steve Roddy, chief marketing officer at Quadric. “At current rates of token demand growth, the available supply of data center compute capacity will not catch up to demand, despite $1T in annual hyperscaler CapEx. As a result, in recent months, we’ve seen a groundswell of interest in pushing much more GenAI compute ‘horsepower’ into a new breed of AI-focused edge devices. Just this month, we saw Nvidia introduce a PC chipset that claims hundreds of TOPS of inference capacity to try to address this market. But that is in a high-end laptop costing $2,500 or more, with all the other PC features needed for a human-driven computer, not an agentic-focused compute solution.”

Roddy said the market needs a dedicated agentic token server, priced well below $1,000, that consumes electricity comparable to a conventional home appliance or desktop PC. “Soon we will see PetaOp-class inference in passive air-cooled appliances that are suitable for the home and office. 100 million of these agentic token engines distributed in homes and offices could collectively deliver more than a Zetta-Op of inference compute with no massive data center build out and no new power plants.”
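Roddy's aggregate figure can be sanity-checked with simple arithmetic. The sketch below assumes each appliance delivers exactly one PetaOp per second (the quote only says "PetaOp-class") and ignores utilization, networking, and scheduling losses, so it is an upper bound rather than a forecast. The function name is illustrative.

```python
# Back-of-the-envelope check of the distributed-compute claim.
PETA = 10**15
ZETTA = 10**21


def aggregate_ops(devices: int, ops_per_device: int) -> int:
    """Total ops/s across a fleet, ignoring utilization and network overhead."""
    return devices * ops_per_device


devices = 100_000_000        # 100 million appliances, from the quote
ops_per_device = 1 * PETA    # assumption: exactly 1 PetaOp/s per appliance

total = aggregate_ops(devices, ops_per_device)
print(f"{total / ZETTA:.0f} ZettaOps")  # prints "100 ZettaOps"
```

Under those assumptions the fleet totals about 100 ZettaOps, so the "more than a Zetta-Op" claim would hold even at roughly 1% effective utilization.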

Data centers will still sprout from farmland like carefully tended crops. “But they will work in concert with a vast distributed arsenal of compute power tucked into our homes and offices,” Roddy said. “The keys to enabling this new compute paradigm are:

  1. The adaptation of AI models to a decentralized compute model. The consumer chatbot and the coder’s agentic workflow will both need to have advanced models that split computation between a centralized massive-parameter model and the local 100B+ parameter model, and
  2. Power-efficient, fully-programmable, designed-for-the-edge inference processing power – not repurposed GPUs.”

The pressure on latency
Underpinning all of this is the ability to quickly move data and process it where needed. Antonio Costa, director of product management for PCIe and CXL at Synopsys, noted that just a few years ago, the focus was almost entirely on GPUs training large language models and inferencing in the cloud.

“In that environment, what we have seen in customer designs was that there was a main CPU, multiple GPUs, with a ratio of one CPU to four GPUs, or one CPU to two GPUs. In our case, PCIe will be used between the CPU and the GPU to transmit the data for the training and the parameters. Normally, that’s what the training is about: trying to define the weight of those parameters to train a model. That was the first wave of this AI revolution — training models, and inferencing after the models are trained, to enable everyone to use LLM chatbots.”

In that context, the CPU was feeding data into the GPUs, PCIe was used as a channel, and the most important aspect was bandwidth. “We need to have bandwidth between the CPU and GPU to transmit all these parameters, but latency was not a big problem because it’s just one way it’s trained,” Costa explained. “Then you read back these parameters into the system to save them. What has changed with the introduction of agentic AI is that you are no longer just feeding data into the GPU. You are using the CPU as an orchestrator of the full system. With agentic AI, the CPU is interacting with files, with network websites, with disks to read and write data, where the GPU is the brain telling you what to do next based on the instructions that the CPU provides. But the CPU is the one taking action.”

AI agents take actions based on what the LLM instructs them to do, which requires much more interaction between the CPU and the GPU. The CPU has to read data and frequently write it to the GPUs, while also interacting with surrounding devices such as the network interface card for web access and the SSD for memory expansion. Much more memory is needed because the agent must handle more data to act on what the user wants.
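The division of labor Costa describes can be sketched as a simple loop. This is a minimal illustration, not any vendor's implementation: `stub_model` stands in for GPU-side LLM inference, the entries in `TOOLS` stand in for CPU-side file and network I/O, and all names and file paths are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Action:
    tool: str       # name of the tool to call, or "done"
    argument: str


def stub_model(context: list[str]) -> Action:
    """Stand-in for GPU-side inference: picks the next action from context."""
    if not any(line.startswith("read_file") for line in context):
        return Action("read_file", "notes.txt")
    if not any(line.startswith("write_file") for line in context):
        return Action("write_file", "summary.txt")
    return Action("done", "")


# Stand-ins for CPU-side work (disk, network); real tools would do actual I/O.
TOOLS: dict[str, Callable[[str], str]] = {
    "read_file": lambda path: f"contents of {path}",
    "write_file": lambda path: f"wrote {path}",
}


def run_agent(goal: str, model=stub_model, max_steps: int = 8) -> list[str]:
    """CPU-side orchestration: ask the model, execute its action, repeat."""
    context = [f"goal: {goal}"]
    for _ in range(max_steps):
        action = model(context)                       # accelerator decides
        if action.tool == "done":
            break
        result = TOOLS[action.tool](action.argument)  # CPU takes the action
        context.append(f"{action.tool}({action.argument}) -> {result}")
    return context


history = run_agent("summarize notes")
for line in history:
    print(line)
```

Every pass through the loop is a round trip between the orchestrator and the model, which is why per-step latency, not just bulk bandwidth, determines how responsive the agent feels.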

“Say you want to create a PowerPoint,” Costa said. “You have to open the PowerPoint application. Then you have to ask the LLM model to give you the data in the PowerPoint. This is a much more CPU-centric application, which puts the CPU back in the spotlight. We’ve seen recent announcements for Arm and Intel products because of the agentic AI shift, and it means that many more PCIe links are needed to connect to all the surrounding devices and to connect to the GPU. But now the latency is critical. If you take too much time to get your response back, it means your agent is slow. So, latency is a critical aspect. PCIe is well-positioned to address the latency aspect of that, and it becomes a fundamental protocol to address those challenges because the number of lanes and connections to enable agentic AI is exploding. We have seen some customers designing some of these chips, and we see they have the need for a hundred lanes. This compares to 16 lanes of PCIe for AI training. So, it’s at least five times more requirements for the number of lanes and bandwidth than before.”
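The lane comparison is easy to check. The sketch below computes the ratio from the two figures Costa cites and, as an added assumption, converts lane counts to approximate one-direction raw bandwidth using PCIe 5.0 parameters (32 GT/s per lane, 128b/130b encoding). The article does not name a PCIe generation, and packet and protocol overhead are ignored.

```python
def pcie_gbytes_per_s(lanes: int, gt_per_s: float = 32.0,
                      encoding: float = 128 / 130) -> float:
    """Approximate one-direction raw bandwidth in GB/s.

    Defaults assume PCIe 5.0; pass other values for other generations.
    """
    return lanes * gt_per_s * encoding / 8


training_lanes = 16   # from the quote: x16 link for AI training
agentic_lanes = 100   # from the quote: lanes needed in some agentic designs

lane_ratio = agentic_lanes / training_lanes  # 6.25
print(f"lane ratio: {lane_ratio:.2f}x")
print(f"{training_lanes} lanes: {pcie_gbytes_per_s(training_lanes):.1f} GB/s")
print(f"{agentic_lanes} lanes: {pcie_gbytes_per_s(agentic_lanes):.1f} GB/s")
```

The ratio works out to 6.25, consistent with the "at least five times" figure in the quote.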

Verification challenges of agentic AI SoCs
One of the biggest challenges here is verifying everything from data movement to interactions between different types of processing elements, and between processors and memories.

“Everything gets more complex,” Balasubramanian said. “The verification workload is huge right now. There are two different paradigms of compute in a single [agentic AI] chip, and you need to verify that they work together very well, that there are no conflicts, and so forth. In terms of memory, are you able to handle the memory bottleneck? How are you structuring your memory? How are you structuring how you get your data in and how you queue your instructions? There are a lot more terms of how we verify. I’m talking about functional verification. Also, the performance verification needs to be thorough if the complexity is higher, which will equate to a huge demand for emulation.”

Anytime there is a big change in hardware architecture, developers need to start co-developing software and hardware to make sure everything works functionally.

“This means emulation and FPGA prototyping are needed,” he said. “Those are the things that are going to be huge in terms of verification to help them do the functionality correctly. That’s on the first side of the functional verification. Then you have to do the performance validation, as well. You need to make sure that you’re able to meet the high demand requirements in terms of memory to processors to GPUs, and so forth. Do you have enough? That’s something they need to worry about. The third thing is that everything is going to be 3D-IC stacked dies. That’s the way they can do it now, and it includes understanding the physical effect of everything you’re doing. You can have a very high switching bus, but what does it mean in terms of your thermal map? If it is really that hot, and you have a big HBM on top of it, what’s going to happen? Is it going to melt? Will it cause some kind of deformity in the wafer? Everything needs to work perfectly for them to have a high-performing hybrid architecture chip. It means functional verification and emulation need to change. You need to understand the protocols. You need to understand the different memory configurations. You need to architect your software to make sure that your hardware requirements are met, and vice versa. And then implementing it is a big challenge in terms of 3D-IC and all the thermal effects and everything else.”

Additionally, as industry awareness of security risks grows, customers are increasingly asking about building in hardware security, as well as incorporating security monitors. “With agents, it’s a challenge,” Balasubramanian said. “How do you make sure you have protected access control for your system on the hardware side? You have monitors built in, both security and reliability monitors, which is another angle to think through because you want to make sure the agent doesn’t execute any untrusted code or anything else. There are a lot more things that are getting in there, and it’s a huge space. Security and hardware monitoring get much more challenging with these complex architectures.”

Conclusion
While the optimal architecture for agentic AI will vary by workload, the general direction is taking shape. “AI infrastructure is evolving from accelerator-centric servers into heterogeneous rack-scale systems, where more purpose-built systems optimized for each stage and component of the agentic workflow can be optimally executed,” Arm’s Bhattacharjee said.

To Roddy, there are more questions. “Does an open hardware ecosystem evolve, where compute horsepower is modular and expandable at the outset before settling into everyday appliance utility, just as the PC started in the mid-1980s as a hobbyist BYO world of incremental hardware upgrades before evolving into today’s vanilla laptop market? Or, do various competing players try to establish proprietary closed boxes, perhaps tied to service providers, much like the cable set-top box market evolved in the 2000s and 2010s? Then, does the agentic AI software deployment model evolve to allow user migration from model to model, or will edge agentic token-servers be locked to a service provider contract, subsidized by the service contract? And how does that software model evolve? Does it start today as a means to empower OpenClaw power users with open models, and then migrate to supporting token generation for subscription service users?”

Agentic AI is transforming data centers into tightly integrated, continuously orchestrated systems where CPU-driven workflows, hybrid CPU–GPU architectures, and hardware-level security all have to be engineered and verified as one. For chip architects, the real differentiator will be how well they can co-design compute, memory, packaging, and verification flows to keep pace with these fast-evolving agent workloads without sacrificing reliability or control.


