Big Changes For Mainstream Chip Architectures

AI-enabled systems are being designed to process more data locally as device scaling benefits decline.


Chipmakers are working on new architectures that significantly increase the amount of data that can be processed per watt and per clock cycle, setting the stage for one of the biggest shifts in chip architectures in decades.

All of the major chipmakers and systems vendors are changing direction, setting off an architectural race that includes everything from how data is read and written in memories to how it is processed and managed—and ultimately how various elements that used to be on a single chip are packaged together. While node shrinks will continue, no one is banking on scaling to keep up with the explosion in data from sensors and an increasing amount of traffic between machines.

Among the changes:

  • New processor architectures are focusing on ways to process larger blocks of data per cycle, sometimes with less precision or by prioritizing specific operations over others, depending upon the application.
  • New memory architectures are under development that alter the way data is stored, read, written and accessed.
  • More targeted processing elements are being scattered around a system, with close proximity to memory. Instead of relying on one main processor that best suits the application, accelerators are being chosen by data type and application.
  • Work is underway in AI to fuse together different data types as patterns, effectively increasing data density while minimizing discrepancies between different data types.
  • Packaging is now a core component of architectures, with an increasing emphasis on the ease of modifying those designs.

“There are a few trends that are causing people to try to get the most out of what they’ve already got,” said Steven Woo, distinguished inventor at Rambus. “In data centers, you want to squeeze as much area out of hardware and software. This is the way data centers are rethinking their economics. It’s very expensive to enable something new. But the bottlenecks are shifting, which is why you’re seeing specialized silicon and ways to make compute more efficient. And if you can hold back sending data back and forth to memory and I/O, that can have a big impact.”

The changes are more apparent at the edge, and just beyond the edge, where there has been a sudden recognition by systems vendors that there will be far too much data generated by tens of billions of devices to send everything to the cloud for processing. But processing all of that data at the edge adds its own challenges, requiring huge improvements in performance without significantly altering the power budget.

“There is a new emphasis on reduced precision,” said Robert Ober, Tesla chief platform architect at Nvidia. “It’s not just more compute cycles. It’s more data packing in memory, where you use 16-bit instruction formats. So it’s not about storing more in caches as making things more efficient. And statistically, the results are consistent both ways.”
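To make the packing point concrete, here is a minimal sketch that stores the same values as 32-bit and as 16-bit IEEE 754 floats and compares footprint and error. It assumes the reduced precision Ober describes applies to 16-bit data values; the sample numbers are purely illustrative.

```python
import struct

values = [0.1, 1.5, -2.25, 1000.0]  # illustrative sample data

# Pack the same values as 32-bit ("f") and as 16-bit ("e") floats.
packed32 = struct.pack(f"<{len(values)}f", *values)
packed16 = struct.pack(f"<{len(values)}e", *values)

# Half the bytes means twice as many values per cache line or memory burst.
print(len(packed32), "bytes vs", len(packed16), "bytes")

# Round-trip through 16 bits to see how much precision is given up.
restored = struct.unpack(f"<{len(values)}e", packed16)
for original, approx in zip(values, restored):
    print(original, approx, abs(original - approx))
```

Values such as 1.5 survive the round trip exactly, while 0.1 picks up a small rounding error, which is the trade-off being accepted in exchange for denser storage.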

Ober predicts that through a series of architectural optimizations, doubling the speed of processing every couple of years should be possible for the foreseeable future. “We’re going to see the state of the art change,” he said. “There are three roof lines we have to deal with to make that happen. One is compute. The second is memory. In some models that’s memory access. In others, it’s compute. The third area is host bandwidth and I/O bandwidth. We need to do a lot of work with optimizing storage and networking.”
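The three limits Ober lists can be read as a roofline-style bound, in which achievable throughput is capped by whichever ceiling is lowest. The sketch below is one way to express that idea. The function name, the machine numbers and the intensity figures are all hypothetical and are not taken from Nvidia.

```python
def roofline(peak_gflops, mem_bw_gbs, io_bw_gbs,
             flops_per_mem_byte, flops_per_io_byte):
    """Return (limiting resource, attainable GFLOP/s) for one workload."""
    ceilings = {
        "compute": peak_gflops,
        # Bandwidth (GB/s) times work done per byte moved gives GFLOP/s.
        "memory": mem_bw_gbs * flops_per_mem_byte,
        "io": io_bw_gbs * flops_per_io_byte,
    }
    limiter = min(ceilings, key=ceilings.get)
    return limiter, ceilings[limiter]


# Hypothetical machine: 10,000 GFLOP/s peak, 500 GB/s memory, 25 GB/s host I/O.
for mem_intensity, io_intensity in ((4, 1000), (100, 1000), (100, 100)):
    print(mem_intensity, io_intensity,
          roofline(10_000, 500, 25, mem_intensity, io_intensity))
```

The same hardware ends up memory-bound, compute-bound or I/O-bound depending only on how much work each model does per byte moved, which is consistent with his remark that the limiting factor differs from model to model.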

Some of these already are being implemented. In a presentation at the Hot Chips 2018 conference, Jeff Rupley, lead architect for Samsung Austin R&D, pointed to several major architectural changes in the company’s M3 processor. One involves more instructions per cycle—six wide versus four in the previous M2. Add to that branch prediction, which is basically several neural networks doing the equivalent of pre-fetch in search, and an instruction queue that is twice as deep, and the challenges begin to come into focus.

Looked at from another angle, these changes shift the nexus for innovation from manufacturing and process technology to architecture and design on the front end, and to post-manufacturing packaging on the back end. And while innovation will continue in process technology, just eking 15% to 20% improvement in performance and power at each new node is incredibly complicated—and it’s not nearly enough to keep pace with the massive increase in data.

“Change is happening at an exponential rate,” said Victor Peng, president and CEO at Xilinx, in a presentation at Hot Chips. “There will be 10 zettabytes [1 zettabyte = 10²¹ bytes] of data generated each year, and most of it is unstructured data.”

New approaches in memory
Dealing with this much data requires rethinking every component in a system, from the way data is processed to how it is stored.

“There have been many attempts to create new memory architectures,” said Carlos Maciàn, senior director of innovation for eSilicon EMEA. “The problem is that you need to read every row and select one bit in each. One alternative is to build memory that can be read left to right and up and down. You also can take that a step further and add computation close to different memories.”

Those changes include altering the way memory is read, the location and type of processing elements, and using AI to prioritize how and where data is stored, processed and moved throughout a system.

“What if we could read just one byte at a time out of that array in the case of sparse data—or maybe eight sequential bytes out of the same byte lane, without using all the energy associated with other bytes or byte lanes we’re not interested in,” said Marc Greenberg, group director of product marketing at Cadence. “The future may be more amenable to this kind of thing. If we look at the architecture of HBM2 for example, an HBM2 die stack is arranged into 16 virtual channels of 64 bits each, and we only need to get 4 consecutive 64-bit words from any access to any virtual channel. So it would be possible to build arrays of data 1,024 bits wide and written horizontally, but read vertically 64-bits x 4 words at a time.”
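The arithmetic in Greenberg's example can be checked with a toy model. The sketch below uses the channel count, word width and burst length from the quote; the array contents and the `read_vertical` helper are hypothetical and do not represent how HBM2 is actually addressed.

```python
CHANNELS = 16     # virtual channels per HBM2 die stack (from the quote)
WORD_BITS = 64    # bits per virtual channel (from the quote)
BURST_WORDS = 4   # consecutive 64-bit words per access (from the quote)

row_bits = CHANNELS * WORD_BITS              # width of one "horizontal" row
burst_bytes = BURST_WORDS * WORD_BITS // 8   # bytes moved by one vertical read
full_row_bytes = BURST_WORDS * row_bits // 8 # bytes moved if all channels were read

# Toy array: each row is written across all 16 channels at once.
rows = [[(r, c) for c in range(CHANNELS)] for r in range(8)]


def read_vertical(rows, channel, first_row):
    """Return 4 consecutive words from one channel, touching no other channel."""
    return [rows[r][channel] for r in range(first_row, first_row + BURST_WORDS)]


column_burst = read_vertical(rows, channel=3, first_row=4)
print(row_bits, burst_bytes, full_row_bytes, column_burst)
```

In this model a vertical read moves 32 bytes, against 512 bytes for the same four rows read across every channel, which is the energy saving the quote is pointing at for sparse data.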

Memory is one of the core components of the Von Neumann architecture, but it also is becoming one of the biggest areas for experimentation. “One big nemesis is virtual memory systems, where you’re moving data through in more unnatural ways,” said Dan Bouvier, chief architect for client products at AMD. “You’ve got translations of translations. We’re used to this on the graphics side. But if you can eliminate bank conflicts in DRAM, you can get much more efficient streaming. So a discrete GPU may run DRAM in the 90% efficiency range, which is really high. But if you can get smooth streaming, you can run APUs and CPUs in the 80% to 85% efficiency range, as well.”

Fig. 1: Von Neumann architecture. Source: Semiconductor Engineering

IBM is developing a different kind of memory architecture, which is essentially a modernized version of disk striping. Rather than being confined to a single disk, the goal is to opportunistically use whatever memory is available, leveraging a connector technology that Jeff Stuecheli, systems hardware architect at IBM, calls a “Swiss army knife” for connectivity. The advantage of this approach is that it can mix and match different kinds of data.

“The CPU becomes a thing that sits in the middle of a high-performance signaling interface,” said Stuecheli. “If you modify the microarchitecture, the core can do more per cycle without pushing the frequency up.”

Connectivity and throughput are increasingly critical for making sure these architectures can handle the ballooning amount of data being generated. “The big bottlenecks now are in the data movement,” said Rambus’ Woo. “The industry has done a great job enabling better compute. But if you’re waiting for data, or specialized data patterns, you need to run memory faster. So if you look at DRAM and NVM, performance depends on the traffic pattern. If you stream data, you get very good efficiency from memory. But if you have data randomly jumping through space, it’s less efficient. And no matter what you do, with an increase in volume you have to do all of this faster.”
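Woo's point about traffic patterns can be illustrated with a deliberately simplified model in which an access is cheap when it falls in the DRAM row left open by the previous access. The row size, access size and single-bank assumption below are illustrative choices, not figures from Rambus.

```python
import random

ROW_BYTES = 2048    # assumed DRAM row (page) size, illustration only
ACCESS_BYTES = 64   # assumed bytes fetched per access


def row_hit_rate(addresses):
    """Fraction of accesses that land in the row opened by the previous access."""
    hits = 0
    open_row = None
    for addr in addresses:
        row = addr // ROW_BYTES
        if row == open_row:
            hits += 1
        open_row = row
    return hits / len(addresses)


n = 10_000
streaming = [i * ACCESS_BYTES for i in range(n)]

rng = random.Random(0)  # fixed seed so the run is repeatable
scattered = [rng.randrange(n) * ACCESS_BYTES for _ in range(n)]

print("streaming:", row_hit_rate(streaming))
print("scattered:", row_hit_rate(scattered))
```

Sequential addresses reuse the open row almost every time, while the scattered pattern rarely does, so each access pays the cost of opening a new row.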

More computing, less movement
Compounding the problem is that there are multiple different types of data being generated at different frequencies and velocities by edge devices. For that data to move smoothly between various processing elements, it has to be managed much more effectively than in the past.

“There are four main configurations: many-to-many, memory subsystems, low-power I/O, and meshes and ring topologies,” said Charlie Janac, chairman and CEO of Arteris IP. “You can put all four of those in a single chip, which is what’s happening with decision-making IoT chips. Or you can add HBM subsystems with high throughput. But the complexity is enormous because some of these workloads are very specific and there are multiple workloads and pins per chip. If you look at some of these IoT chips, they are taking in huge amounts of data. That’s especially true for things like radar and LiDAR in cars. They cannot exist without some sort of advanced interconnect.”

The challenge is how to minimize data movement while also maximizing data flow when it is required, and somehow to achieve a balance between local and centralized processing without using too much power.

“On one side it’s a bandwidth problem,” said Rajesh Ramanujam, product marketing manager at NetSpeed Systems. “You want to try not to move data if at all possible, so you move data closer to the processor. But if you do have to move data, you want to condense it as much as possible. None of this exists on its own in a vacuum, though. It all has to be looked at from a system level. There are multiple sequential axes that need to be considered at each step, and it determines whether you use memory in a traditional read-write manner or whether you leverage new memory technologies. In some cases, you may want to change the way you store the data itself. If you want faster performance, that generally means higher area costs, which affects power. And now you throw in functional safety and you have to worry about data overload.”

And this is why there is so much attention being focused on processing at the edge and throughput between various processing elements. But how and where that processing gets implemented will vary greatly as architectures are developed and refined.

Case in point: Marvell introduced an SSD controller with built-in AI so it can handle a larger compute load at the edge. The AI engine can be used for analytics within the solid state storage itself.

“You can load models directly into the hardware and do hardware processing at the SSD controller,” said Ned Varnica, principal engineer at Marvell. “Today, a host computer in the cloud does this. But if each drive was to send data to the cloud, that would create a huge amount of network traffic. It’s better to do the processing at the edge, where the host computer issues a command that is just metadata. So the more storage devices you have, the more processing power you have. The benefit in traffic reduction is enormous.”

What’s particularly noteworthy about this approach is it emphasizes flexibility in data movement, depending upon the application. So the host can generate a task and send it to the storage device for processing, after which only metadata or computational results are sent back. In another scenario the storage device can store data, pre-process it and generate metadata, tags and indexes, which are then retrieved by the host as needed for further analytics.
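The first scenario can be sketched as follows. `StorageDevice`, `read_all` and `run_task` are hypothetical names invented for illustration and are not Marvell's interface. The sketch only contrasts how much data crosses the link in each case.

```python
class StorageDevice:
    """Hypothetical drive that can run a small task next to its own data."""

    def __init__(self, records):
        self.records = records

    def read_all(self):
        # Conventional path: every record crosses the link to the host.
        return list(self.records)

    def run_task(self, predicate):
        # In-storage path: scan locally, return only a compact summary.
        matches = [i for i, rec in enumerate(self.records) if predicate(rec)]
        return {"count": len(matches), "indexes": matches}


# Illustrative data: 1,000 stored frames, one in every hundred of interest.
drive = StorageDevice(
    [f"frame-{i}:{'person' if i % 100 == 0 else 'empty'}" for i in range(1000)]
)

host_side = [r for r in drive.read_all() if r.endswith("person")]
summary = drive.run_task(lambda rec: rec.endswith("person"))

print(len(drive.read_all()), "records moved vs", summary["count"], "indexes returned")
```

Both paths find the same frames, but in the second only the indexes of the matches leave the drive, and that saving multiplies with the number of drives.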

This is one option. There are others. Samsung’s Rupley emphasized out-of-order processing and fusion idioms, which can decode two instructions and fuse them into a single operation.
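As a rough illustration of what fusing means, the sketch below merges adjacent instruction pairs into single operations. The instruction names and the `FUSIBLE` table are hypothetical, since the article does not say which idioms Samsung's core recognizes.

```python
# Hypothetical fusible pairs; real cores define their own idioms.
FUSIBLE = {
    ("cmp", "branch"): "cmp_branch",
    ("shift", "add"): "shift_add",
}


def fuse(ops):
    """Merge adjacent instruction pairs listed in FUSIBLE into one operation."""
    fused = []
    i = 0
    while i < len(ops):
        pair = tuple(ops[i:i + 2])
        if pair in FUSIBLE:
            fused.append(FUSIBLE[pair])
            i += 2  # both instructions were consumed by the fused op
        else:
            fused.append(ops[i])
            i += 1
    return fused


program = ["load", "cmp", "branch", "shift", "add", "store"]
print(fuse(program))
```

Six decoded instructions become four operations, so fewer slots are occupied downstream for the same work.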

AI oversight and optimization
Layered across all of this is artificial intelligence, and this is one of the really new elements to enter into chip architectures. Rather than letting an operating system and middleware manage functions, that oversight is being distributed around a chip, between chips, and at the system level. In some cases, that could include neural networks within chips.

“It’s not so much how you package more stuff together as much as changing the traditional way of doing things,” said Mike Gianfagna, vice president of marketing at eSilicon. “With AI and machine learning, you can sprinkle all of this stuff around a system to get more efficient and predictive processing. In other cases, it could involve separate chips that function independently in a system, or within a package.”

Arm uncorked its first machine learning chip, which it plans to roll out later this year across multiple market segments and verticals. “This is a new type of processor,” said Ian Bratt, distinguished engineer at Arm. “It includes a fundamental block, which is a compute engine, plus a MAC engine, a DMA engine with a control unit and broadcast network. In all, there are 16 compute engines capable of 4 teraOps at 1GHz using a 7nm process technology.”
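Bratt's figures imply a per-engine workload that is easy to work out. The calculation below uses only the numbers in the quote, plus one labeled assumption: that a multiply-accumulate is counted as two operations, a common convention that the article does not confirm.

```python
TOTAL_OPS_PER_SEC = 4e12  # 4 teraOps (from the quote)
CLOCK_HZ = 1e9            # 1GHz (from the quote)
ENGINES = 16              # compute engines (from the quote)

ops_per_cycle = TOTAL_OPS_PER_SEC / CLOCK_HZ        # whole processor
ops_per_cycle_per_engine = ops_per_cycle / ENGINES  # one compute engine

# Assumption: one multiply-accumulate (MAC) counts as two operations.
macs_per_cycle_per_engine = ops_per_cycle_per_engine / 2

print(ops_per_cycle, ops_per_cycle_per_engine, macs_per_cycle_per_engine)
```

Under that assumption, each engine would need to sustain on the order of 125 multiply-accumulates every clock cycle to reach the quoted peak.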

Because Arm works within an ecosystem of partners, its chip is more general-purpose and configurable than other AI/ML chips that are being developed. Rather than build everything into a monolithic structure, it compartmentalized processing by function, so each compute engine works on a different feature map. Bratt said four key ingredients are static scheduling, efficient convolutions, bandwidth reduction mechanisms and programmability to future-proof designs.
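One way to picture static scheduling with a feature map per engine is a plan computed before execution starts. The round-robin policy and the `static_schedule` helper below are assumptions for illustration, since the article does not describe how Arm actually assigns work.

```python
ENGINES = 16  # compute engines, per Bratt's description


def static_schedule(num_feature_maps, engines=ENGINES):
    """Assign each output feature map to an engine ahead of time (round-robin)."""
    plan = {engine: [] for engine in range(engines)}
    for feature_map in range(num_feature_maps):
        plan[feature_map % engines].append(feature_map)
    return plan


# A hypothetical layer with 40 output feature maps.
plan = static_schedule(40)
for engine, maps in plan.items():
    print(engine, maps)
```

Because the plan is fixed up front, no engine has to negotiate for work at run time, and the data each one needs is known in advance.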

Fig. 2: Arm’s ML processor architecture. Source: Arm/Hot Chips

Nvidia, meanwhile, took a different tack, building a dedicated deep learning engine next to a GPU to optimize traffic for processing imaging and videos.

By utilizing some or all of these approaches, chipmakers say they can double performance every couple of years, keeping pace with an explosion in data while remaining within the tight confines of power budgets. But this isn’t just providing more computers. It’s changing the starting point for chip design and system engineering, beginning with a growing volume of data rather than the limitations of hardware and software.

“When computers came into companies, a lot of people felt the world was moving so much faster,” said Aart de Geus, chairman and co-CEO of Synopsys. “They did accounting on pieces of paper with stacks of accounting books. It was an exponential change then, and we are seeing it again now. What is evolving—that maybe gives it a sense of faster—is that you could sort of understand the accounting book to the punch cards to printing it out and computing. Mentally you could follow every step. The fact that on an agricultural field you need to put water and some type of fertilizer only on a certain day where the temperature rises this much, is a machine-learning combination of things that’s an optimization that was not obvious in the past.”

He’s not alone in that assessment. “New architectures are going to be accepted,” said Wally Rhines, president and CEO of Mentor, a Siemens Business. “They’re going to be designed in. They will have machine learning in many or most cases, because your brain has the ability to learn from experience. I visited 20 or more companies doing their own special-purpose AI processor of one sort or another, and they each have their own little angle. But you’re going to see them in specific applications increasingly, and they will complement the traditional von Neumann architecture. Neuromorphic computing will become mainstream, and it’s a big piece of how we take the next step in efficiency of computation, reducing the cost, doing things in both mobile and connected environments that today we have to go to a big server farm to solve.”



Art Scott says:

No future without Reversible Computing.

Michelle says:

Wider data read/writes, parallel computations, smaller data packed together, Applied accelerators, deeper pipelines, scatter gather output… smells like a DSP to me :)…
