Creating a multi-core simulation that functions as a practical software development platform
An increasing number of embedded designs are multi-core systems. At the pre-silicon stage, customers use a simulation platform for architectural exploration and software development. Architects want to quantify the impact of the number of cores, local memory size, system memory latency, and interconnect bandwidth. Software teams wish to have a practical development platform that is not excruciatingly slow.
This blog shares a recipe for simulating Cadence DSPs in a multi-core design as separate x86 processes. The purpose is to reduce simulation time for customers with simple multi-core models where cores interact only through shared memory. It uses a Vision Q8 multi-core design to illustrate the XTSC (Xtensa SystemC) model, software application, commands, and debugging. Note that the details shared are for a simulation run on an Ubuntu Linux machine, Xtensa tools version RI-2023.11, and core configuration XRC_Vision_Q8_AODP.
A complex model (figure 1) is one in which one core accesses another core’s local memory, or cores use inter-core interrupts. Such a model must be simulated as a single x86 process.
Fig. 1: A complex model.
A simple model (figure 2) is one in which cores interact only through shared memory. Shared memory is a file on the Linux host.
Fig. 2: A simple model.
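To illustrate the mechanism, below is a minimal sketch of how a host file can be mapped as memory shared across processes. How XTSC itself attaches to the file is internal to the tools; this function and its error handling are purely illustrative.

// Illustrative only: each x86 process maps the same host file, so writes by
// one simulation process are visible to the others. The file name and size
// would come from the model configuration.
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstdint>

uint8_t* map_shared_file(const char* path, size_t bytes) {
    int fd = open(path, O_RDWR | O_CREAT, 0666);
    if (fd < 0) return nullptr;
    ftruncate(fd, (off_t)bytes);                // size the backing file
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);          // MAP_SHARED: cross-process
    close(fd);                                  // the mapping stays valid
    return (p == MAP_FAILED) ? nullptr : (uint8_t*)p;
}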
As depicted in figure 3, each core is simulated using a separate x86 process. Cores use barriers and locks placed in shared memory for synchronization and data sharing. Locks are placed in uncached memory that supports exclusive subordinate access; the XTSC memory component, xtsc_memory, supports exclusive subordinate access. Cadence software tools provide a way to define memory regions as cached or uncached.
Fig. 3: Each core is simulated using a separate x86 process.
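As a rough sketch of what such a lock looks like, the following spins on a compare-and-swap of a word in shared memory. The address is made up, and __atomic_compare_exchange_n is a portable GCC/Clang stand-in for the exclusive-access operations the Xtensa cores would actually issue through the SDK.

#include <cstdint>

#define LOCK_ADDR 0x60000000u   // hypothetical uncached shared-memory address
static uint32_t* const lock_word = (uint32_t*)LOCK_ADDR;

void lock_acquire(uint32_t* l) {
    uint32_t expected = 0;
    // Spin until the word atomically changes 0 -> 1
    // (backed by exclusive access on the hardware).
    while (!__atomic_compare_exchange_n(l, &expected, 1u, false,
                                        __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
        expected = 0;
}

void lock_release(uint32_t* l) {
    __atomic_store_n(l, 0u, __ATOMIC_RELEASE);  // publish data, free the lock
}

A core brackets its critical section with lock_acquire(lock_word) and lock_release(lock_word).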
A demo application performs a 128×128 matrix multiplication. Work is divided so that each of the 32 cores computes four rows of the 128×128 result matrix. Cores use barriers to synchronize, and Cadence tools provide APIs for synchronization and locking. Note that without a higher-level lock, prints from all cores would be interleaved; therefore, in the demo application, only core#0 prints.
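A minimal sketch of the work split follows. Here my_core_id() and barrier_wait() are placeholders for the SDK’s core-ID and synchronization APIs, and the matrix element type is assumed to be int32.

#include <cstdint>

#define DIM     128
#define NCORES  32
#define ROWS_PER_CORE (DIM / NCORES)       // 4 rows per core

// Assumed to be placed in shared memory via the linker map.
extern int32_t A[DIM][DIM], B[DIM][DIM], C[DIM][DIM];

extern unsigned my_core_id(void);          // placeholder: returns 0..31
extern void     barrier_wait(void);        // placeholder: shared-memory barrier

void matmul_worker(void)
{
    unsigned first = my_core_id() * ROWS_PER_CORE;

    barrier_wait();                        // wait until inputs are ready
    for (unsigned i = first; i < first + ROWS_PER_CORE; ++i)
        for (unsigned j = 0; j < DIM; ++j) {
            int32_t acc = 0;
            for (unsigned k = 0; k < DIM; ++k)
                acc += A[i][k] * B[k][j];
            C[i][j] = acc;
        }
    barrier_wait();                        // all rows done before core#0 prints
}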
The following sample command runs the 32-core simulation so that each core is a separate x86 process. It runs the matrix multiplication application in cycle-accurate mode with logging off.
>>for (( N=0; N<32; N=N+1 )); do xtsc-run -define=NumCores=32 -define=N=$N -define=LOGGING=0 -define=TURBO=0 -define=PROG_NAME=/..path../MatMul -i=coreNN.inc & done
“xtsc-run” is a Cadence Xtensa SDK application that lets users run SystemC simulations without any C++/SystemC programming. coreNN.inc describes the model topology in a readable text format. Because each core runs as a separate x86 process, the processes synchronize with one another at the end of the elaboration phase before starting the simulation.
Figure 4 shows the output of the Linux “htop” command. It shows 32 separate simulation processes.
Fig. 4: Dump of the Linux “htop” command.
It is easy to create a custom simulation using an XTSC script. Capturing wall time with xtsc-run requires logging, which slows down the simulation, whereas a custom simulation can measure wall time directly. Custom simulations also offer per-core debug control. A single custom simulation executable can simulate each of the 32 cores as a separate x86 process.
One approach to a single custom simulation executable is to generate sc_main.cpp for, say, core#0 from coreNN.inc using the following xtsc-run command.
>>xtsc-run -define=NumCores=32 -define=N=0 -define=LOGGING=0 -define=TURBO=0 --xxdebug=sync -i=coreNN.inc -sc_main=sc_main.cpp -no_sim
Modify the sc_main.cpp generated for core#0 to create a generic sc_main.cpp to build a single simulation executable for all cores. The Xtensa SDK includes Makefile targets to build custom simulations.
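As a sketch of what the generic file might look like: the per-core model construction emitted by xtsc-run stays as generated, the core index comes from the command line so one binary serves all 32 cores, and wall time is measured around sc_start() with no logging required. Everything except the standard SystemC calls is elided or assumed here.

// Hypothetical skeleton of a generic sc_main.cpp; the model construction
// emitted by xtsc-run is elided. sc_start() is standard SystemC.
#include <systemc>
#include <chrono>
#include <cstdio>
#include <cstdlib>

int sc_main(int argc, char* argv[])
{
    // Which core this process simulates (passed on the command line).
    int core = (argc > 1) ? std::atoi(argv[1]) : 0;

    // ... per-core model construction generated by xtsc-run goes here ...

    auto t0 = std::chrono::steady_clock::now();
    sc_core::sc_start();                         // run simulation to completion
    auto t1 = std::chrono::steady_clock::now();
    std::printf("core %d wall time: %.1f s\n", core,
                std::chrono::duration<double>(t1 - t0).count());
    return 0;
}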
By default, the simulation runs in cycle-accurate mode. Fast functional (Turbo) mode provides an additional speedup over cycle-accurate mode. Note that fast functional mode has an initialization phase, so its gains are visible only when running applications with longer run times.
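Assuming coreNN.inc wires the TURBO define to the cores’ fast functional mode (as the earlier command suggests), the same multi-process run can be switched to Turbo mode by flipping that define:
>>for (( N=0; N<32; N=N+1 )); do xtsc-run -define=NumCores=32 -define=N=$N -define=LOGGING=0 -define=TURBO=1 -define=PROG_NAME=/..path../MatMul -i=coreNN.inc & done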
The table captures simulation wall time improvements. Note that these are illustrative wall time numbers. Actual wall time numbers and improvements will depend on your host machine’s performance and your application.
Simulation Type | Wall Time | Comments
Single x86 process, cycle-accurate mode | 17,500 seconds | baseline for comparison
Multiple x86 processes, cycle-accurate mode | 1,385 seconds | ~12X faster than the single process
Multiple x86 processes, Turbo mode | 415 seconds | ~3X faster than multi-process cycle-accurate mode
A debugger can be attached to each of the individual x86 core simulation processes. Synchronous stop/resume and core-specific breakpoints are also supported. Configure an Xplorer launch configuration and attach it to the running simulation processes as shown in figure 5.
Fig. 5: Debug configuration.
Figure 6 shows 32 debug contexts.
Fig. 6: 32 debug contexts.
As shown, using the Xtensa SDK, you can create a multi-core simulation that functions as a practical software development platform. Please visit the Cadence support site for information on building and simulating multi-core Xtensa systems.