A technical paper titled "Failure Tolerant Training with Persistent Memory Disaggregation over CXL" was published as a preprint by researchers at KAIST and Panmnesia.
"TRAININGCXL can efficiently process large-scale recommendation datasets in the pool of disaggregated memory while making training fault tolerant with low overhead," states the paper.
Find the technical paper here, or here (IEE...