How Memory Design Optimizes System Performance

Changes are steady in the memory hierarchy, but how and where that memory is accessed is having a big impact.

Exponential increases in data, and in the demand for better performance to process that data, have spawned a variety of new approaches to processor design and packaging. They also are driving big changes on the memory side.

While the underlying technology still looks very familiar, the real shift is in the way those memories are connected to processing elements and various components within a system. That can have a big impact on system performance, power consumption, and even the overall resource utilization.

Many different types of memories have emerged over the years, most with a well-defined purpose despite some crossovers and unique use cases. Among them are DRAM and SRAM, flash, and other specialty memories. DRAM and SRAM are volatile memories, meaning they require power to maintain data. Non-volatile memories do not require power to retain data, but the number of read/write operations is limited, and they do wear out over time.

All of these fit into the so-called memory hierarchy, starting with SRAM, a very fast memory that typically is used for various levels of cache. Its applications are limited by its high cost per bit. Also at the lowest level, and often embedded into an SoC or attached to a PCB, is NOR flash, which typically is used for booting up devices. It’s optimized for random access, so it does not have to follow any particular sequence for storage locations.

Moving up a step in the memory hierarchy, DRAM is by far the most popular option, in part because of its capacity and resilience, and in part because of its low cost per bit. That is partially due to the fact that the leading DRAM vendors have fully depreciated their fabs and equipment, but as new types of DRAM come online, the price has been rising, opening the door to new competitors.

There has been talk of replacing DRAM for decades, but DRAM has proved to be much more resilient from a market standpoint than anyone would have anticipated. In 3D configurations of high-bandwidth memory (HBM), it has proven to be an extremely fast, low-power option, as well.

JEDEC defines four main types of DRAM:

  • Double data rate (DDRx) for standard memory;
  • Low-power DDR (LPDDRx), primarily used in mobile or battery-operated devices;
  • Graphics DDR (GDDRx), which initially was designed for high-speed graphics applications but also is used for other applications, and
  • High-bandwidth memory (HBMx), which is used primarily for high-performance applications such as AI or inside of data centers.

NAND flash, meanwhile, typically is used as removable storage (SSDs and USB sticks). Because of its longer erase/write cycles and shorter lifespan, flash is not suitable for CPU/GPU and system applications.

“The double data rate (DDR5) and low-power (LPDDR5) specifications are being refined by JEDEC, the memory standards body,” said Paul Morrison, solutions product engineer for Siemens EDA. “DDR6 and LPDDR6 are being worked on, as well. Other popular DRAM memories include high-bandwidth memory (HBM2 and HBM3), and graphic DDR (GDDR6, with a GDDR7 release imminent).”

But the growth of small, battery-operated devices and the need for rapid boot-up of devices also have pushed up demand for flash memory. NOR flash memories are typically smaller, on the order of 1 Gbit. NAND flash, in contrast, is used in SSDs. Density now varies from one bit per cell to four bits per cell, with five- and six-bit-per-cell versions expected. Additionally, the move from 2D to 3D arrays has further increased densities.

“In AI and many other fields, memory performance is critical to good system performance,” said Steven Woo, fellow and distinguished inventor at Rambus. “Running memory at the highest data rates will improve system performance but thinking through how data structures map to memory can improve bandwidth, power-efficiency, and capacity utilization as well. Increasing memory capacity can also lead to better performance, and the introduction of CXL will offer a way for AI and other processors to add memory capacity beyond what direct-connect memory technologies can currently offer.”
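One way to picture Woo's point about mapping data structures to memory is to count how many cache lines a loop touches under two different layouts. The sketch below is illustrative only. The 64-byte line size, the 32-byte record, and the helper name lines_touched are assumptions made for the example, not details from Rambus.

```python
LINE_BYTES = 64  # assumed cache-line size


def lines_touched(addresses, size, line_bytes=LINE_BYTES):
    """Return how many distinct cache lines are covered by the given accesses."""
    lines = set()
    for addr in addresses:
        first = addr // line_bytes
        last = (addr + size - 1) // line_bytes
        lines.update(range(first, last + 1))
    return len(lines)


N = 1024      # number of records (hypothetical)
FIELD = 4     # bytes in the one field the loop actually reads
RECORD = 32   # bytes per record in the array-of-structs layout (hypothetical)

# Array of structs: the wanted field sits at the start of every 32-byte record.
aos_addresses = [i * RECORD for i in range(N)]
# Struct of arrays: the wanted fields are packed back to back.
soa_addresses = [i * FIELD for i in range(N)]

aos_lines = lines_touched(aos_addresses, FIELD)
soa_lines = lines_touched(soa_addresses, FIELD)
useful_bytes = N * FIELD

print(aos_lines, soa_lines)                       # 512 64
print(useful_bytes / (aos_lines * LINE_BYTES))    # 0.125
print(useful_bytes / (soa_lines * LINE_BYTES))    # 1.0
```

Under these assumptions the packed layout moves one-eighth as many cache lines for the same useful data, which is the kind of bandwidth and power saving that comes from layout alone rather than from a faster memory.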

Proximity matters
Distance between memory and processors used to be a floor-planning issue, but as the amount of data that needs to be processed increases, and as features shrink, the amount of energy required to move more data back and forth between memory and processing elements increases. Thinner wires require more power to move electrons, and moving them longer distances takes still more power and increases latency.

This has spawned new interest in near-memory and in-memory computing, where at least some of the data can be partitioned and prioritized, processed, and significantly reduced. That reduces the total amount of energy used, and it can have a significant impact on performance.

In-memory computing (a.k.a. processing-in-memory or compute-in-memory) refers to performing the processing or computation inside the memory, such as the RAM. Before this was done at the chip level, it was demonstrated that distributing data across multiple RAM storage units and combining that with parallel processing could make workloads in areas such as investment banking 100 times faster. So while in-memory/near-memory computing has been around for a long time, and has gotten another boost from AI designs, only recently have chipmakers started to demonstrate some successes with this approach.

In 2021, Samsung’s memory business unit introduced processing-in-memory (PIM) technology with AI cores integrated inside HBM memory. In a speech-recognition test using the Xilinx Virtex UltraScale+ (Alveo) AI accelerator, the PIM technology achieved a 2.5X performance increase and a 62% reduction in energy consumption. Other memory chip makers, such as SK Hynix and Micron Technology, also are looking at this approach.

In the domain of in-memory computation, a breakthrough announcement recently came from the international research team headed by Weier Wan, a recent Ph.D. graduate in the lab of Philip Wong at Stanford University, who worked on this idea while at UC San Diego. Other Ph.D. graduates at UC San Diego who made major contributions to this research are now running their own labs at Notre Dame University and the University of Pittsburgh.

By tightly coupling neuromorphic computing with resistive random-access memory, the NeuRRAM chip performs AI edge computation with high accuracy — 99% accuracy on an MNIST handwritten digit recognition task, and 85.7% on a CIFAR-10 image classification task. Compared with the state-of-the-art edge AI chips available today, the NeuRRAM chip was able to deliver 1.6 to 2.3X lower energy-delay product (EDP; less is better) and 7 to 13X higher computational density. This presents opportunities to lower the power of chips running a variety of AI tasks without compromising accuracy and performance in the years to come.
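Energy-delay product is energy per task multiplied by the time that task takes, so a design can improve EDP by saving energy, by finishing sooner, or both. The figures in the sketch below are hypothetical and are not NeuRRAM measurements. They only show how a ratio such as 1.6X can arise.

```python
def energy_delay_product(energy_j, delay_s):
    """EDP = energy per task (joules) x time per task (seconds); lower is better."""
    return energy_j * delay_s


# Made-up numbers for a baseline chip and a compute-in-memory design.
baseline_edp = energy_delay_product(energy_j=2.0e-6, delay_s=1.0e-3)
candidate_edp = energy_delay_product(energy_j=1.0e-6, delay_s=1.25e-3)

improvement = baseline_edp / candidate_edp
print(f"EDP improvement: {improvement:.2f}X")  # EDP improvement: 1.60X
```

In this made-up case the candidate is 25% slower per task but uses half the energy, and it still comes out ahead on EDP.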

“One of the key factors in improving memory performance has always been minimizing data movement,” said Ben Whitehead, technical product manager for Siemens EDA. “By doing so, it also reduces power consumption. Using SSD as an example, a data lookup can increase transfer speed by 400 to 4,000 times. Another way to do this is to move computation close to the memory. The concept of compute-in-memory is not new. Adding intelligence inside the memory would reduce data movement. The concept is similar to edge computing by performing local computations rather than sending data to the cloud back and forth. Compute-in-memory in DRAM is still at its early stage, but this will continue to be the trend for future memory development.”

Memory standards update
There are three main standards groups/efforts underway that could have a significant impact on all of this:

  1. JEDEC: The organization continues its 50-year-plus role as the leading standards body for memory in the microelectronics industry. It has developed and published many standards focusing on main memory (DDR4 and DDR5 SDRAM), flash memory (UFS, e.MMC, SSD, XFMD), mobile memory (LPDDR, Wide I/O), and more. JEDEC recently published two new standards. In August 2022, it released the DDR5 SDRAM Specification, which defines the minimum set of requirements for 8 Gb through 32 Gb x4, x8, and x16 DDR5 SDRAM devices. That work was based on the DDR4 standards, and on parts of the DDR, DDR2, DDR3, and LPDDR4 standards. In addition, in July 2021 JEDEC added LPDDR5 and LPDDR5X, defining the minimum requirements for x16 one-channel and x8 one-channel SDRAM devices with densities ranging from 2 Gb through 32 Gb. That work was based on previous specifications, including DDR2, DDR3, DDR4, LPDDR, LPDDR2, LPDDR3, and LPDDR4.
  2. CXL: The CXL Consortium is an open industry standard group that supports the Compute Express Link (CXL), an industry-supported cache-coherent interconnect for processors, memory expansion, and accelerators. The technology defines interconnects between the CPU memory space and memory on attached devices for resource-sharing, which can boost performance while minimizing software and system cost. It also helps define the accelerators used in AI/ML. The consortium recently released the CXL 2.0 Specification, which added switching to enable device fan-out, memory scaling, expansion, memory pooling, link-level integrity, and data encryption (CXL IDE) for data protection.
  3. UCIe: On the chiplet side is the recently released Universal Chiplet Interconnect Express (UCIe) standard. Chipmakers will continue adopting UCIe to connect chiplets, including memories. The current focus includes the physical layer (die-to-die I/O with industry leading KPIs) and protocols (CXL/PCIe) to ensure interoperability.

“CXL helps accelerators stay coherent with the rest of the system, so passing data, messages, and doing semaphores is more efficient,” said Debendra Das Sharma, senior fellow at Intel and chair of the CXL Consortium Board Technical Task Force. “Additionally, CXL addresses memory capacity and bandwidth needs for these applications. CXL will drive significant innovations in memory technologies and accelerators going forward.”

Ideas on performance optimization
Some of these memory approaches have been around for decades, but nothing is standing still. Memory is still seen as a key element in the power, performance and area/cost paradigm, and tradeoffs can have a big impact on all of those elements.

“Memory technology evolves continually,” said Gordon Allan, product manager at Siemens EDA. “For example, HBM is the perfect choice for AI applications right now, but it may be different tomorrow. The main memory standards body, JEDEC, defines DDR4 and 5, DIMM 4 and 5, LRDIMM, and other memories today. But for future memory extension, the CXL standard, used in defining the interfaces in PCIe and UCIe, is gaining acceptance and momentum.”

Every processor requires memory to store data, so it is important to understand the characteristics of memory and how its behavior will affect overall system performance. Some key considerations in designing and selecting memories include:

  • Maximizing performance within a given unit of energy;
  • Power budgeting and thermal management;
  • Matching memories to processing needs, such as AI systems, which demand higher memory performance, and
  • Reuse of design, density, and packaging (2D, 2.5D, 3D-IC).

Depending on the application, it’s also important to consider how data is transferred within a system and between systems.

“To optimize performance, you need to look at the system level,” said Marc Greenberg, group product marketing director for DDR, HBM, flash/storage and MIPI IP at Cadence. “For your system to achieve high throughput, it may require more than 80 memories connected to the processor. There are different ways to improve efficiency. One of them is to optimize the order of traffic access and maximize the number of tasks done with minimum bus cycles at a given clock frequency. A simple analogy is the checkout process in a grocery store. For example, a customer has five cans of pineapple, a watermelon, and something else. To achieve checkout efficiency, you would present all five cans as a group instead of presenting one can and then the watermelon, followed by another can. The same concept applies to memory. Additionally, having a smart memory controller (PHY and controller IP) at a single point to manage the traffic protocol of multiple memories will achieve much better optimization in memory design.”
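Greenberg's checkout analogy can be sketched as a toy request scheduler. The model below is a deliberate simplification and not Cadence's controller design: it assumes a single bank with one open row at a time, treats every row change as one costly activation, and ignores ordering hazards such as a read that must follow a write. The function names and the request stream are hypothetical.

```python
def count_row_activations(requests):
    """Count how many times the open row changes for a stream of (row, column) requests."""
    activations = 0
    open_row = None
    for row, _column in requests:
        if row != open_row:
            activations += 1
            open_row = row
    return activations


def group_by_row(requests):
    """Reorder requests so accesses to the same row are issued back to back.

    Rows are served in order of first arrival, and requests within a row
    keep their original relative order.
    """
    groups = {}
    for row, column in requests:
        groups.setdefault(row, []).append((row, column))
    return [request for group in groups.values() for request in group]


# Hypothetical arrival order: row 0 plays the cans, rows 1 and 2 the other items.
arrival_order = [(0, 1), (1, 0), (0, 2), (2, 5), (0, 3), (1, 4), (0, 7)]
reordered = group_by_row(arrival_order)

print(count_row_activations(arrival_order))  # 7
print(count_row_activations(reordered))      # 3
```

The same seven requests are served in both cases, but grouping them cuts row activations from seven to three in this toy model. A production controller must also weigh fairness, latency limits, and data dependencies before reordering traffic.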

The rollout of AI in many devices has made these kinds of considerations more essential.

“In AI training, memories that provide the highest bandwidth, capacity, and power efficiency are important,” said Rambus’ Woo. “HBM2E memory is a great fit for many training applications, especially with large models and large training sets. Systems using HBM2E can be more complex to implement, but if this complexity can be tolerated, it’s a great choice. For many inferencing applications, on the other hand, high bandwidth, low latency, and good power efficiency are needed at a good price-performance point. For these applications, GDDR6 memory can be a better fit. For end-point applications such as IoT, on-chip memory that can also be coupled with LPDDR can make sense.”

Fig. 1: HBM memory controller and PHY IP optimizes memory management functions. Source: Rambus

According to Micron, memory systems are more complicated than they appear. Within a given memory bandwidth, system performance can be influenced by factors such as access pattern, locality, and time to solution. For example, a natural language processing model would require 50 TB/s of memory bandwidth to support a 7 ms latency for time to solution. If longer latencies can be tolerated, then memory bandwidth can be moderated accordingly.
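The trade between latency and bandwidth can be put into rough numbers. The sketch below assumes, purely for illustration, that time to solution is dominated by moving a fixed volume of data, so required bandwidth scales inversely with the tolerated latency. That assumption and the function name are not from Micron. Only the 50 TB/s and 7 ms starting point comes from the example above.

```python
def required_bandwidth_tbps(data_moved_tb, latency_s):
    """Bandwidth (TB/s) needed to move a fixed data volume within a latency budget."""
    return data_moved_tb / latency_s


BASE_BANDWIDTH_TBPS = 50.0   # from the example in the text
BASE_LATENCY_S = 7e-3        # 7 ms, from the example in the text

# Implied data volume per solution under the fixed-volume assumption (about 0.35 TB).
data_moved_tb = BASE_BANDWIDTH_TBPS * BASE_LATENCY_S

for latency_ms in (7, 14, 35):
    bandwidth = required_bandwidth_tbps(data_moved_tb, latency_ms / 1000)
    print(f"{latency_ms} ms -> {bandwidth:.1f} TB/s")
# 7 ms -> 50.0 TB/s
# 14 ms -> 25.0 TB/s
# 35 ms -> 10.0 TB/s
```

Real workloads rarely scale this cleanly, because access pattern and locality change how much of the peak bandwidth is usable, but the inverse relationship is a reasonable first estimate.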

Micron noted that architectures improve with a complete understanding of the solution stack — software to architecture to memory systems. Therefore, the starting point is to optimize the access patterns, data placement, and latency mitigation (i.e., data prefetch) within the algorithms while leveraging the memory architecture’s inherent strengths and addressing its limitations.

JEDEC has continued to make memory improvements, tackling such challenges as higher density, low latency, low power, higher bandwidth, and more. By following the specifications, memory makers and system designers will be able to take advantage of new innovations. In recent years, advanced tools from companies such as Synopsys and Siemens EDA have become available to perform needed functions such as test, simulation, and verification.

“One of the goals of JEDEC is to continue to provide the architecture innovations which support further scaling,” said Anand Thiruvengadam, product marketing director in Synopsys’ Custom Design & Manufacturing Group. “Newer memory specifications will continue to achieve higher density, lower power, and higher performance. For example, the power requirement for DDR4 is 1.2V while DDR5 is 1.1V. During this voltage scaling, factors like signal integrity and how to open up the eye pattern have to be considered. Thermal management also has been improved. DDR5 has two to three temperature sensors per pin, an improvement over DDR4, which has only one. Therefore, it is beneficial to follow the specification.”

But following the specification is one thing. Meeting the specifications is another. “It is important to test the product according to the specification, making sure that it passes all worst-case scenarios,” Thiruvengadam said. “The complicated analysis and simulation may take weeks. Fortunately, the newer simulation software solution can cut this down to days.”

Conclusion
JEDEC will continue to define and update memory specifications covering DRAM, SRAM, flash, and more. With the addition of the CXL and UCIe standards, the memory development community will benefit from future system and chiplet interconnectivity. Even though UCIe is relatively new, it is expected to open up a new world of chiplets in the ecosystem.

Additionally, AI/ML is expected to continue to drive demand for high-performance, high-throughput memory designs. The constant struggle will be balancing low-power requirements and performance. But breakthroughs involving in-memory computing will open the door to much faster development. More importantly, these advanced memory developments will help propel future AI-based edge and endpoint (IoT) applications.
