Improving Memory Efficiency And Performance

CXL and OMI will facilitate memory sharing and pooling, but how well and where they work best remains debatable.

March 28th, 2022 - By: Bryon Moyer

This is the second of two parts on CXL vs. OMI.

Memory pooling and sharing are gaining traction as ways of optimizing existing resources to handle increasing data volumes. Using these approaches, memory can be accessed by a number of different machines or processing elements on an as-needed basis.

Two protocols, CXL and OMI, are being used to simplify these processes across the hierarchy, from on-chip and near-chip memory out to storage, yielding memory that is “composable.” The basic idea is that by raising the abstraction level, memory can be provisioned as required for any particular job, and then allocated to another job when that one is finished. In theory, that should improve performance and efficiency. But which protocol offers the best bang for the buck for which application or architecture isn’t entirely clear at this point.

“In a memory-pooling application, any portion of memory where an SoC or controller can be allocated to the CPU is managed by something called a fabric manager, a standard software entity that’s been defined by CXL,” noted Mark Orthodoxou, vice president of marketing for data center products at Rambus. “It manages the assignment of physical memory regions so that, to the CPU, it looks like it has a memory pool that grows and shrinks dynamically.”
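The grow-and-shrink behavior described in that quote can be pictured with a small Python sketch. It is only a toy model: the class name `FabricManager`, its methods, and the fixed-size region granularity are assumptions made for illustration, not an interface defined by the CXL specification.

```python
class FabricManager:
    """Toy model of a pool manager: hands fixed-size regions to hosts."""

    def __init__(self, total_regions):
        self.free = list(range(total_regions))  # region indices nobody owns
        self.owner = {}                         # region index -> host name

    def allocate(self, host, count):
        """Grow a host's view of the pool by `count` regions."""
        if count > len(self.free):
            raise MemoryError("pool exhausted")
        granted = [self.free.pop(0) for _ in range(count)]
        for region in granted:
            self.owner[region] = host
        return granted

    def release(self, host, regions):
        """Shrink a host's view; released regions return to the pool."""
        for region in regions:
            if self.owner.get(region) != host:
                raise PermissionError(f"{host} does not own region {region}")
        for region in regions:
            del self.owner[region]
            self.free.append(region)

    def regions_of(self, host):
        """Regions currently visible to one host, and only that host."""
        return sorted(r for r, h in self.owner.items() if h == host)


fm = FabricManager(total_regions=8)
cpu0 = fm.allocate("cpu0", 3)   # cpu0's pool grows to three regions
cpu1 = fm.allocate("cpu1", 2)   # cpu1 gets two different regions
fm.release("cpu0", cpu0[:2])    # cpu0 shrinks; two regions rejoin the pool
print(fm.regions_of("cpu0"), fm.regions_of("cpu1"), len(fm.free))
```

In this model each region has exactly one owner at a time, so no host can reach memory assigned to another.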

Fig. 1: CXL can allow memory pools accessible by multiple hosts. Source: Rambus

Memory pooling is related to, but distinct from, memory sharing by different consumers. “The difference between memory pooling and memory sharing is that with pooling, any given CPU has access only to its allocated memory region,” said Orthodoxou. “Memory sharing is more for large machine-learning training sets and workloads, where you have these 500 GB, 1 TB, 3 TB training sets. If you want to have multiple CPUs go and compute on it, today you need to move data around. So how do we reduce data movement and reduce the memory footprint per CPU in the data center? You apply memory sharing.”

Memory sharing requires coherency. But this is different from processor coherency, such as that needed within a multicore processor. CXL addresses application coherency, which deals with higher-level coherency between different consumers that may be collaborating on the same problem. OMI, in contrast, has no coherency functionality, instead relying on the processor’s cache coherency infrastructure to handle that.

“Up until now we’ve had monolithic memory on a monolithic cache architecture,” said Gordon Allan, product manager for verification IP at Siemens EDA. “As we bring in multiple processing elements, coherency becomes a requirement. So it’s driven us to think differently about memory and, instead, to think about the data and the different kinds of memory that our application needs. This is part of the disaggregation process, where we’re able to describe our storage needs in a finer-grained manner and not overspend on the coherency attributes where they’re not needed. So different kinds of data can have different levels of replication or coherency. Those attributes are all supported by a fabric such as CXL, where the processing elements can indicate, on a memory range by memory range basis, how the coherency should work.”
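One way to picture coherency declared “on a memory range by memory range basis” is a lookup table keyed by address range. The Python sketch below is purely illustrative: the attribute names (`NONE`, `SHARED`), the `RangeAttributes` class, and its methods are hypothetical and do not reflect how CXL actually encodes these attributes.

```python
from enum import Enum


class Coherency(Enum):
    NONE = "none"      # hypothetical: private data, no coherency traffic
    SHARED = "shared"  # hypothetical: kept coherent across consumers


class RangeAttributes:
    """Toy table mapping address ranges to a coherency attribute."""

    def __init__(self, default=Coherency.NONE):
        self.default = default
        self.ranges = []  # list of (start, end_exclusive, Coherency)

    def set_range(self, start, length, attr):
        self.ranges.append((start, start + length, attr))

    def lookup(self, addr):
        # Later declarations win, so a range can be re-declared.
        for start, end, attr in reversed(self.ranges):
            if start <= addr < end:
                return attr
        return self.default


attrs = RangeAttributes()
attrs.set_range(0x0000, 0x4000, Coherency.SHARED)  # e.g. a shared data set
print(attrs.lookup(0x1000).value, attrs.lookup(0x8000).value)
```

Only the range that is declared shared pays for coherency; everything else falls back to the cheaper default.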

Full coherency in that regard will be supported by CXL 3.0, which isn’t yet public. Consequently, the details of how that will happen are not available.

Passing workloads between processors
CXL also allows memory to be “handed” to different nodes for processing. What’s being passed back and forth in these cases isn’t the actual data, but rather something analogous to pointers, with address translation mapping the physical addresses to each consumer’s individual memory map.

“The fabric manager orchestrates everything,” said Orthodoxou. “It tracks who’s making the request, what address they are requesting, and how that maps to the physical representation of memory behind the controller.”
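A minimal Python sketch of that bookkeeping follows. It assumes a simple table of address windows per host; the names `Mapping`, `TranslationTable`, `add_window`, and `hand_over` are hypothetical and stand in for whatever the real fabric manager does internally. The point it illustrates is that handing memory to another node changes only the mapping, not the location of the data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    host_base: int    # start address in the host's own memory map
    device_base: int  # start address in the memory behind the controller
    length: int       # window size in bytes


class TranslationTable:
    """Toy per-host address translation for one pooled memory device."""

    def __init__(self):
        self.maps = {}  # host name -> list of Mapping

    def add_window(self, host, host_base, device_base, length):
        self.maps.setdefault(host, []).append(
            Mapping(host_base, device_base, length))

    def translate(self, host, host_addr):
        """Who is asking, for which address, and where does it land?"""
        for m in self.maps.get(host, []):
            if m.host_base <= host_addr < m.host_base + m.length:
                return m.device_base + (host_addr - m.host_base)
        raise LookupError(f"{host} has no mapping for {hex(host_addr)}")

    def hand_over(self, src, dst, src_base, dst_base):
        """Move a window from src to dst; the data itself is not copied."""
        window = next(m for m in self.maps[src] if m.host_base == src_base)
        self.maps[src].remove(window)
        self.add_window(dst, dst_base, window.device_base, window.length)


table = TranslationTable()
table.add_window("cpu0", 0x1000, 0x80000, 0x1000)
before = table.translate("cpu0", 0x1010)
table.hand_over("cpu0", "cpu1", src_base=0x1000, dst_base=0x4000)
after = table.translate("cpu1", 0x4010)
print(hex(before), hex(after))  # both name the same device location
```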

To the user, this remains largely hidden. “The detail involves a lot of transactions going back and forth in order to maintain that illusion of the pointer to the data,” noted Allan.

But that’s not the only thing hidden from view. It’s also not clear which protocol is faster for a given application, because the latency impact of CXL is measured differently from that of OMI. OMI figures are quoted as an amount of latency added to existing access latency, while CXL figures cover only the round-trip interconnect latency. A 40 ns CXL figure therefore means 20 ns of transport in each direction.

Richard Solomon, technical marketing manager, CXL at Synopsys, was very specific about how that’s measured: “It’s the first bit of the packet appearing on the transmit side to the first bit of the packet appearing at the receive side,” he said.
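The two quoting conventions can be put side by side in a few lines of Python. This is a sketch under stated assumptions: only the 40 ns round-trip figure comes from the text above, while `BASELINE_ACCESS_NS` and `DELTA_NS` are placeholder values chosen for the arithmetic and say nothing about real OMI or CXL parts. The simple additive model of total latency is also an assumption.

```python
def one_way_ns(round_trip_ns):
    """Split a round-trip interconnect figure into its two directions."""
    return round_trip_ns / 2


def total_from_delta(baseline_access_ns, added_ns):
    """Delta-style quote: latency added on top of an existing access."""
    return baseline_access_ns + added_ns


def total_from_round_trip(round_trip_ns, far_side_access_ns):
    """Interconnect-style quote: the link round trip plus the time the
    memory behind the link needs, which the quoted figure leaves out."""
    return round_trip_ns + far_side_access_ns


ROUND_TRIP_NS = 40        # round-trip interconnect example from the text
BASELINE_ACCESS_NS = 90   # placeholder, NOT a measured value
DELTA_NS = 10             # placeholder, NOT a measured value

print(one_way_ns(ROUND_TRIP_NS))
print(total_from_delta(BASELINE_ACCESS_NS, DELTA_NS))
print(total_from_round_trip(ROUND_TRIP_NS, BASELINE_ACCESS_NS))
```

The two totals become comparable only once the same baseline access time is supplied to both, which is exactly the information a bare interconnect figure omits.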

Note that another memory-abstraction effort, Gen Z, has been folded into CXL. “Gen Z has conceded that CXL has the mainstream and the critical mass, so they are contributing their own particulars into that collaboration,” said Allan.

Possible overlap
There appears to be disagreement or uncertainty, however, as to whether CXL also can address near memory. Many industry spokespeople speak for either CXL or OMI (or one of the other technologies) and are unsure exactly how the “other” works. Especially when it comes to calculating latency, not everyone is clear on exactly what the numbers mean.

CXL has a CXL.mem protocol, which is said to be of use for near memory. Rather than using the OMI or DDR bus, it still uses PCIe as its interconnect. So, in a system built for CXL to handle near memory, there would be PCIe slots hosting memory cards instead of DIMM slots.

“We’re seeing people build straightforward DDR interfaces, for instance,” said Solomon. “It’s got a bridge chip that has CXL on one side and some number of DDR channels on the other. Inside are computational abilities.”

From a die-area perspective, both OMI and PCIe use serial approaches, so they both use fewer pins than current parallel schemes. “Today you’ll have some number of PCIe lanes coming off the CPU – let’s say 64 PCIe Gen 5 lanes that are broken up into 16-lane stacks,” said Orthodoxou. “And those stacks then can be bifurcated by 2, 4, 8, or 16. The way the CPUs are being designed now, each one of those stacks is capable of running in a CXL mode or in a PCIe mode. From an area- and perimeter-efficiency standpoint, there’s really no difference from OMI, because they’re both just serial interfaces with a certain number of lanes. The only difference I would argue between CXL and OMI in that regard is that, for OMI, the implementation is very specifically in these DDIMMs. So your form factors are very specific. In the case of CXL, most people are talking about implementing memory extensions through drive form factors or add-in cards.”
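The lane arithmetic in that example can be worked through in Python. The sketch assumes that “bifurcated by 2, 4, 8, or 16” refers to the width of each link carved out of a 16-lane stack; that reading, along with the function names and the per-stack mode labels, is an assumption for illustration.

```python
STACK_WIDTH = 16                  # lanes per stack, from the example
LINK_WIDTHS = (2, 4, 8, 16)       # assumed to be the allowed link widths


def stack_count(total_lanes, stack_width=STACK_WIDTH):
    """Number of stacks a CPU's lanes break into."""
    if total_lanes % stack_width:
        raise ValueError("lanes do not divide evenly into stacks")
    return total_lanes // stack_width


def links_per_stack(link_width, stack_width=STACK_WIDTH):
    """Number of equal-width links one stack yields when bifurcated."""
    if link_width not in LINK_WIDTHS:
        raise ValueError(f"unsupported link width x{link_width}")
    return stack_width // link_width


# Hypothetical configuration: each stack picks a mode and a link width.
config = [("cxl", 8), ("cxl", 16), ("pcie", 4), ("pcie", 4)]
assert len(config) == stack_count(64)

for mode, width in config:
    print(mode, f"{links_per_stack(width)} link(s) of x{width}")
```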

OMI is more independent in that it doesn’t rely on any particular PCIe generation. “OMI is running at a speed grade that’s not tightly coupled to PCIe electricals,” Orthodoxou said. “OMI is more tightly coupled into the data fabric and the processor, which allows lower latency than CXL as traditionally implemented. The CXL subsystem is coupled into the processor where the PCIe bus is, so you have a slight latency hit that comes with that.”

Whether this is practical depends to some extent on the bigger overall picture. “CXL honestly is not going to be very different – probably on the order of an additional 5 or 6 nanoseconds, round trip,” said Orthodoxou. “That’s probably in the noise.”

Others point to a larger latency hit. And this is where things get muddy, with multiple opinions given with varying confidence. In general, because CXL tends to quote only interconnect latency, while OMI quotes full access latency (or a delta), it’s an apples-to-oranges comparison. Some of the documents detailing how these numbers might be arrived at are not available to the public.

For near memory, there appear to be two camps. On one side are those that see a simple positioning — OMI is for near memory, and CXL is for far memory. And even if you could use CXL for near memory, no one would.

“CXL certainly is addressing the mid and the far memories,” said Allan. “It’s not directly applicable to the near.”

So where does OMI fit into this picture? “OMI is entirely complementary to CXL and therefore should be adopted by the CXL consortium, if nothing else, to save all of the confusion they are causing in the industry,” said Allan Cantle, technical director and board adviser at the OpenCAPI Consortium.

The other camp sees CXL as having the strongest ecosystem, so OMI won’t be used outside of IBM’s ecosystem.

“OMI has not been broadly adopted outside of the POWER portfolio and some derivatives that are targeting HPC style applications,” said Orthodoxou. “Every other CPU vendor and GPU vendor has a roadmap to intersect CXL. It’s not a question of one protocol being better than the other for memory attachment. It’s really a question of ubiquity of adoption.”

An additional position was taken by Marc Greenberg, product marketing group director, IP Group at Cadence. “I don’t see OMI vs CXL discussion as ‘near vs. far,’” he said. “I see it as ‘Version N vs Version N+1.’ CXL.mem is the latest iteration and provides a high-bandwidth and capacity DRAM solution to server-type systems.”

No extant commercial example of CXL for near memory has emerged, but OMI is also in early days. So at this point, there’s conjecture by all camps.

Less to worry about
But that doesn’t diminish the value of either CXL or OMI. These abstraction technologies will allow software developers, chip designers, system builders, and system integrators to worry less about memory specifics.

SoC designers will be able to build “generic” interfaces to both near and far memory and have that SoC work across a wide range of possible system configurations. System builders and integrators will have more flexibility in assigning memory and processors so that more systems can be used for more workloads. With reduced data traffic, execution should require less energy. And software developers can create programs that will work across a wider range of systems.

As to the differences between OMI and CXL, the market ultimately will sort that out. CXL is being widely adopted, although not necessarily for near memory. Near memory could end up being CXL or OMI – or remain DDR or HBM if neither of the abstraction standards is adopted.

Orthodoxou sees this creating fundamental change in the data center. “The appetite for more memory bandwidth, more memory capacity, and less stranded memory, or some combination of those three, will change the nature of memory interconnects in the data center.”

Bryon Moyer

Bryon Moyer is a technology editor at Semiconductor Engineering.
