It depends on whom you ask, but there are advantages to both.
System designers are looking at any ideas they can find to increase memory bandwidth and capacity, from improvements to existing memory to entirely new memory types. But higher-level architectural changes can help fulfill both needs, even as memory types are abstracted away from CPUs.
Two new protocols, CXL and OMI, are helping to make this possible. But there is a looming question about whether they will co-exist, or whether one will win out over the other.
“There’s a broad acknowledgement of folks wanting to get more memory bandwidth and memory capacity to CPU cores as the CPU core count grows in the processor,” said Mark Orthodoxou, vice president of marketing for data center products at Rambus. “Folks are running out of the ability to add DRAM channels.”
While the two new protocols share some high-level conceptual similarities, they are not the same. But there appears to be a lot of confusion as to whether they actually compete with each other. There are even widespread misconceptions floating around, especially with respect to OMI.
These days, everyone is focused on data, both in terms of growing volume and how best to manage it.
“Financial services want to add more data sources for fraud detection that can deliver instant results,” said Charles Fan, co-founder and CEO of MemVerge. “Social media wants more data sources to profile users, but deliver instant results. E-commerce retailers want more data sources, but instant recommendations. Chips are being designed with 1 trillion transistors, but they need to get to market in the same time cycle as previous generations. Genomic researchers want more cell data, but they want to shorten time to vaccine discovery.”
All of this requires more memory to service more computing. “A thousand times more compute is needed and a hundred times more memory – in just the next two years,” Fan said.
Memory and storage
Modern computing systems have a two-tiered memory structure. There’s working memory, which is local to the processor for fast access, and it’s typically some form of DRAM. Then there’s storage, a form of memory, which is logically and usually physically farther away from the processor. This is normally non-volatile memory like flash or even a hard drive.
This arrangement reflects a mix of functionality, cost, and access. “Memory” tends to be faster technology, albeit at higher cost than storage technologies. Even given the speed, it’s not fast enough to keep up with modern processors, which is why SRAM cache on a processor is so critical for performance.
“Storage” tends to consist of very high-capacity memories that are very inexpensive on a per-bit basis. But their access times can be orders of magnitude slower than what DRAM can provide.
There has been a lot of talk over the last decade of storage-class memory, which shares some of the characteristics of storage but has the performance of memory. MRAMs, RRAMs, and PCRAMs are the poster-children for this crossover category – and there are other ideas earlier in the research cycle.
The promise of using a single technology for both memory and storage is tantalizing, but it will create challenges for designers of ICs that need to interface with memory. Most chips have interfaces built specifically for DRAM. If MRAM or RRAM could be used instead, which interface should the CPU connect to? These memories may all have different access protocols.
Storage has different challenges, but the proliferation of memory types creates a similar quandary. In addition, data in storage typically must be retrieved in bulk for actual use. That copy operation takes time and consumes energy.
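To make that copy cost concrete, the short C sketch below contrasts the two access patterns on a POSIX system: an explicit bulk read that pulls an entire file from storage into a DRAM buffer, versus mapping the same file so pages are brought in only when they are touched. It is purely illustrative; the file name and access pattern are placeholders, not taken from any particular system.

/* Illustrative only: contrasts an explicit bulk copy from storage into
 * working memory with demand-paged access via mmap(). */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    const char *path = "dataset.bin";          /* hypothetical data file */
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);

    /* Pattern 1: bulk copy. Every byte moves from storage into DRAM
     * before the application touches any of it. */
    char *buf = malloc(st.st_size);
    if (buf != NULL) {
        ssize_t copied = read(fd, buf, st.st_size);
        printf("copied %zd bytes into DRAM\n", copied);
        free(buf);
    }

    /* Pattern 2: map the file. Pages are faulted in from storage only
     * when they are actually accessed, so cold data is never copied. */
    char *map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (map != MAP_FAILED) {
        volatile char first = map[0];          /* touches just one page */
        (void)first;
        munmap(map, st.st_size);
    }
    close(fd);
    return 0;
}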
Both cases would benefit from a way to abstract away the details of the specific memory being used, so that chip designers, and to some extent software developers, need be less concerned with the memory details of a particular system. It also may make software more portable across different systems, which is particularly valuable in the data center.
Today, it requires a higher-level program or system to manage and structure a pool of different memory and storage resources. Such “big-memory” programs provide a way to increase the bandwidth and the capacity of memory.
“The thesis around big-memory computing is that, instead of the constant struggle of making storage faster and faster, we leverage other new hardware, complemented with the right set of software,” said Fan. “We can build a pool of software-defined memory that can be the platform for all active data the application needs to process, thereby reducing or eliminating the data transfer between memory and storage for the active application data.”
The CXL and OMI protocols both provide abstraction, although at a lower level. But as emerging solutions, it’s easy to confuse the two. OMI has little in the way of fanfare available online, and awareness of it seems to be lower than awareness of CXL. Depending on whom you talk to, they do or don’t do the same thing, and therefore do or don’t compete with each other.
The emergence of CXL and/or OMI doesn’t necessarily affect the use of big-memory management systems. Rather, it makes the physical memory connections easier to deal with. “We rely on the CPU for accessing memory using its interface/memory manager, and therefore our software is agnostic to the memory interconnects, including CXL, OMI and DDR4/5,” Fan said.
Near memory and OMI
Working memory used by CPUs needs to be fast. DRAM has provided the best speed/cost mix for years, and appears likely to continue doing so as the technology evolves. Even so, there have been ways to improve that performance, but at a cost.
DRAM’s Achilles’ heel is the set of long lines driving the memories. Their high capacitance makes it hard to keep pushing memory speeds higher and adding more memories.
Two variations have helped. One is the RDIMM, which buffers the address and control signals in a register chip on the DIMM. That speeds up those signals while leaving the data signals alone. LRDIMMs take this a step further by buffering the data signals, as well. That adds a clock cycle of latency, but it speeds up the lines and allows for more memory.
Fig. 1: RDIMMs buffer address and control signals; LRDIMMs additionally buffer the data signals. The intent is to have shorter, less capacitive lines and faster access, at the expense of an extra clock cycle of latency. Source: Objective Analysis
But the ports used for access require many pins – 152 per channel for LRDIMMs, said Objective Analysis’ Jim Handy during a presentation at last year’s Hot Interconnects conference. Eight channels would cost 1,216 pins.
“Because the pin counts are very large, the area required to drive those pins is significant, as it’s a parallel interface,” said Orthodoxou.
HBM is another alternative that provides higher access speed. While expensive, it offers the highest bandwidth, but its bus is 1,024 bits wide. There are other challenges as well, described in a white paper on OMI.
“Although HBM is a help, it’s considerably more expensive than standard DRAM and is limited to stacks of no more than 12 chips, limiting its use to lower-capacity memory arrays,” the paper says. “HBM is also complex and inflexible. There’s no way to upgrade an HBM-based memory in the field. As a consequence, HBM memory is adopted only where no other solution will work.”
OMI emerged out of the OpenCAPI world, with the OMI spec separated out to keep latency to a minimum. It’s intended to address these near-memory challenges in two ways: a move to SerDes, and the use of an on-DIMM controller. DIMMs used for OMI channels are referred to as differential DIMMs, or DDIMMs.
SerDes connections will replace the current DDR-style interface, providing higher speed with far fewer signals. The controller plays a role similar to the buffers on an LRDIMM, increasing overall memory latency by around 4ns in the process.
“OMI latency includes the latency through the memory itself, and this is the round-trip read latency from the internal connection to the transmit port in the host back to the internal connection of the receive port in the host,” said Allan Cantle, technical director and board adviser at the OpenCAPI Consortium.
Fig. 2: An LRDIMM compared to a DDIMM. The blue box on the left of the DDIMM is the controller. Latency is increased by a few nanoseconds. Source: Objective Analysis
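For a rough sense of scale, assume a loaded DRAM read takes on the order of 80ns end to end. That baseline is an assumption for illustration only, not a figure from the OpenCAPI documents, but against it the controller’s contribution is a small fraction of the total access time:

\frac{t_{\text{added}}}{t_{\text{read}}} \approx \frac{4\,\text{ns}}{80\,\text{ns}} = 5\%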
In addition, however, the controller can interface to many different kinds of memory. It acts as a bridge between that memory and the processor. As far as the processor is concerned, all memory looks like OMI, and the details beyond that are handled on the DDIMM.
That allows a system builder to mix and match the types of memory being used. Each channel can be its own type of memory. In fact, a single DDIMM could have a mix of memory available as long as the controller supported that.
Fig. 3: A conceptual example of a mixed-memory system where each channel uses a different memory technology. Source: Objective Analysis
It’s unclear if systems really will be composed this way, however. Some believe the value of abstraction isn’t for creating a heterogeneous memory pool, but rather so that a single CPU with a single set of interfaces can access a homogeneous pool built out of any of those types of memory.
“Near memory will always be more of a choice of homogeneous memory and less of a need to abstract heterogeneous memory types,” said Gordon Allan, product manager for verification IP at Siemens EDA.
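Either way, the bridging function itself is easy to picture. The C sketch below is invented for this article; the type and function names are not from the OMI or OpenCAPI specifications. It shows only the pattern: a fixed, media-agnostic read path on the host side, with interchangeable back-ends for whatever media sits behind the controller.

/* Conceptual sketch only: the names below are invented for this article
 * and are not part of the OMI or OpenCAPI specifications. It illustrates
 * a fixed host-facing interface with swappable media back-ends, which is
 * the abstraction the DDIMM controller provides in hardware. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Operations every media back-end must implement. */
typedef struct {
    const char *name;
    int (*read)(uint64_t addr, void *dst, size_t len);
} media_ops;

/* Stand-in back-end for conventional DRAM (a plain array here). */
static uint8_t fake_dram[4096];
static int dram_read(uint64_t addr, void *dst, size_t len) {
    memcpy(dst, &fake_dram[addr], len);
    return 0;
}
static const media_ops dram_ops = { "DRAM", dram_read };

/* An MRAM, RRAM, or flash back-end would plug in the same way, with its
 * own read routine; the host-facing call below never changes. */

typedef struct { const media_ops *ops; uint64_t size; } memory_channel;

static int channel_read(const memory_channel *ch, uint64_t off,
                        void *dst, size_t len) {
    if (off + len > ch->size) return -1;        /* bounds check */
    return ch->ops->read(off, dst, len);
}

int main(void) {
    memory_channel ch = { &dram_ops, sizeof(fake_dram) };
    uint8_t word[8];
    if (channel_read(&ch, 0, word, sizeof(word)) == 0)
        printf("read %zu bytes via the %s back-end\n", sizeof(word), ch.ops->name);
    return 0;
}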
Bandwidth will be higher than with standard DRAM interfaces, although HBM will still be faster. That said, the lower pin count means the silicon required on the SoC for the memory channels is much smaller, making OMI more competitive with HBM on a bandwidth-per-area basis. Aggregate bandwidth also may end up higher if the smaller interface footprint allows more OMI channels than other interfaces could fit.
For this new paradigm to fully emerge, controller chips are required first, and then DDIMMs need to become widely available. That process has started, but it has further to go, and OMI uptake has been slow so far.
Fig. 4: A DDIMM showing the controller and multiple DRAM chips. A 2U version is available, as well. Source: OpenCAPI Consortium
“We are not engaged with customers asking us for that technology, but it’s still early days for OMI,” said Allan. “It’s a relatively new entrant promoted by IBM and some others. It’s still not widely adopted in the industry, but there’s certainly a lot of interest in it because it claims to expand both the capacity benefits of DDR and the performance bandwidth benefits of HBM. But it’s still a bold, unproven claim at this point.”
Far memory and CXL
The far-memory situation is more complicated. In addition to issues relating to specific types of memories, the frequent need to copy large chunks of memory is a significant concern, especially for memory- or storage-hungry applications like machine learning, and especially in the data center.
These are issues CXL tackles. “CXL optimizes and virtualizes data transfer, storage, and computation,” said Levent Caglar, group director of engineering, system design group at Synopsys.
This is useful in data center applications. “The HPC landscape consists of a plethora of computing constructs,” said Arif Khan, product marketing group director, IP Group at Cadence. “CPUs, GPUs, accelerators, FPGAs, etc., are connected to ever-increasing memory pools. CXL addresses the demands of heterogeneous computing while maintaining cache coherency and allowing for expandability of memory.”
But it’s also complicated. “We need to consider up to three different aspects of storage,” said Siemens EDA’s Allan. “First up are co-located processors and memory. At the other end of the processing pipeline, we have coherent memory and links to storage, where data has to be shared with other processing and communication elements. And we have wider-scale search and retrieval of storage in the data center. CXL sits in the second and the third of those areas.”
Fig. 5: Block diagram of a CXL controller. The CXL functionality relies on PCIe for physical interconnection. Source: Rambus
With respect to abstraction, CXL is notionally similar to OMI, acting as a bridge that allows the processor to be agnostic to memory type. “From the point of view of the rest of the system, that memory is logically as close as it can be to the CPU,” said Caglar.
But CXL has a much broader remit than OMI does, having many more use cases to cover. “OMI and CXL are very similar in terms of the near-memory problems they’re trying to address,” said Orthodoxou. “Where they differ is CXL’s attempt to solve far-memory problems.”
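In practice, that abstraction can look very ordinary to software. On current Linux systems, CXL-attached memory expanders typically appear as CPU-less NUMA nodes, so placing data on one is just a standard NUMA allocation. The sketch below assumes node 1 is such a node on the machine at hand; it is an illustration of the idea, not code from any of the companies quoted here.

/* Sketch assuming a Linux machine where a CXL memory expander is exposed
 * as a CPU-less NUMA node. Node 1 is an assumption, not a universal rule.
 * Build with: gcc cxl_alloc.c -lnuma */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma reports no NUMA support here\n");
        return 1;
    }

    int cxl_node = 1;                       /* assumed CXL-backed node */
    if (cxl_node > numa_max_node()) {
        fprintf(stderr, "node %d does not exist on this system\n", cxl_node);
        return 1;
    }

    /* Allocate 1 GiB on the CXL-attached node. To the application this is
     * ordinary load/store memory; the media behind it is abstracted away. */
    size_t size = 1UL << 30;
    void *buf = numa_alloc_onnode(size, cxl_node);
    if (buf == NULL) {
        fprintf(stderr, "allocation on node %d failed\n", cxl_node);
        return 1;
    }

    memset(buf, 0, size);                   /* touch it like normal DRAM */
    printf("placed %zu bytes on node %d\n", size, cxl_node);
    numa_free(buf, size);
    return 0;
}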
Find Part 2 of this report, Improving Memory Efficiency And Performance, here.
CXL and OMI will facilitate memory sharing and pooling, but how well and where they work best remains debatable.
As a point of clarification, OMI DDIMMs with Microchip controllers were introduced at FMS in August 2019 and have since ramped to full production with Samsung, Micron, and Smart Modular.