Distributing RTL Simulation Across Thousands Of Cores On 4 IPU Sockets (EPFL)


A technical paper titled “Parendi: Thousand-Way Parallel RTL Simulation” was published by researchers at EPFL.


“Hardware development relies on simulations, particularly cycle-accurate RTL (Register Transfer Level) simulations, which consume significant time. As single-processor performance grows only slowly, conventional, single-threaded RTL simulation is becoming less practical for increasingly complex chips and systems. A solution is parallel RTL simulation, where ideally, simulators could run on thousands of parallel cores. However, existing simulators can only exploit tens of cores.
This paper studies the challenges inherent in running parallel RTL simulation on a multi-thousand-core machine (the Graphcore IPU, a 1472-core machine). Simulation performance requires balancing three factors: synchronization, communication, and computation. We experimentally evaluate each metric and analyze how it affects parallel simulation speed, drawing on contrasts between the large-scale IPU and smaller but faster x86 systems.
Using this analysis, we build Parendi, an RTL simulator for the IPU. It distributes RTL simulation across 5888 cores on 4 IPU sockets. Parendi runs large RTL designs up to 4x faster than a powerful, state-of-the-art x86 multicore system.”
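The figures in the abstract can be checked with a short sketch: 4 sockets of 1472 cores each gives the 5888 cores quoted. The cost model below it is a toy illustration of the three-way balance the abstract describes. It is not the paper's model. The function names, the logarithmic synchronization term, and every cost constant are hypothetical assumptions chosen only to show why adding cores helps large designs more than small ones.

```python
import math

CORES_PER_IPU = 1472  # from the abstract
IPU_SOCKETS = 4       # from the abstract


def total_cores(sockets: int = IPU_SOCKETS, cores_per_socket: int = CORES_PER_IPU) -> int:
    """Total cores available across all sockets."""
    return sockets * cores_per_socket


def cycle_time(work: float, cores: int, comm_cost: float, sync_cost: float) -> float:
    """Toy estimate of the time to simulate one RTL cycle (arbitrary units).

    Assumptions (hypothetical, not from the paper):
      - computation is perfectly balanced: work / cores
      - communication is a fixed per-cycle cost once more than one core is used
      - synchronization grows with log2(cores), as in a tree barrier
    """
    compute = work / cores
    if cores == 1:
        return compute  # a single core neither communicates nor synchronizes
    return compute + comm_cost + sync_cost * math.log2(cores)


def speedup(work: float, cores: int, comm_cost: float, sync_cost: float) -> float:
    """Speedup over a single core under the toy model above."""
    single = cycle_time(work, 1, comm_cost, sync_cost)
    return single / cycle_time(work, cores, comm_cost, sync_cost)


if __name__ == "__main__":
    n = total_cores()
    print("total cores:", n)
    # Hypothetical workloads: a small design and a large design.
    for work in (100.0, 1_000_000.0):
        for cores in (16, n):
            s = speedup(work, cores, comm_cost=5.0, sync_cost=1.0)
            print(f"work={work:>9.0f} cores={cores:>4} speedup={s:.1f}")
```

Under these assumed constants, the small design runs faster on 16 cores than on all 5888, because synchronization and communication dominate once the per-core work shrinks, while the large design keeps benefiting from more cores. That is consistent with the abstract's claim that the gains appear on large RTL designs.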

The technical paper is available as an arXiv preprint (arXiv:2403.04714). Published March 2024 (preprint).

Emami, Mahyar, Thomas Bourgeat, and James Larus. “Parendi: Thousand-Way Parallel RTL Simulation.” arXiv preprint arXiv:2403.04714 (2024).

Related Reading
Anatomy Of A System Simulation
Balancing the benefits of a model with the costs associated with that model is tough, but it becomes even trickier when dissimilar models are combined.
