Creating a multi-core simulation that functions as a practical software development platform
An increasing number of embedded designs are multi-core systems. At the pre-silicon stage, customers use a simulation platform for architectural exploration and software development. Architects want to quantify the impact of the number of cores, local memory size, system memory latency, and interconnect bandwidth. Software teams wish to have a practical development platform that is not excruciatingly slow.
This blog shares a recipe for simulating Cadence DSPs in a multi-core design as separate x86 processes. The purpose is to reduce simulation time for customers with simple multi-core models where cores interact only through shared memory. It uses a Vision Q8 multi-core design to illustrate the XTSC (Xtensa SystemC) model, software application, commands, and debugging. Note that the details shared are for a simulation run on an Ubuntu Linux machine, Xtensa tools version RI-2023.11, and core configuration XRC_Vision_Q8_AODP.
A complex model (figure 1) is one in which one core accesses another core’s local memory, or cores use inter-core interrupts. Such a model must be simulated as a single x86 process.
Fig. 1: A complex model.
A simple model (figure 2) is one in which cores interact only through shared memory. Shared memory is a file on the Linux host.
Fig. 2: A simple model.
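To illustrate the mechanism, below is a minimal sketch of how a host file can be mapped as memory shared across processes. How XTSC itself attaches to the file is internal to the tools; this function and its error handling are purely illustrative.

// Illustrative only: each x86 process maps the same host file, so writes by
// one simulation process are visible to the others. The file name and size
// would come from the model configuration.
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstdint>

uint8_t* map_shared_file(const char* path, size_t bytes) {
    int fd = open(path, O_RDWR | O_CREAT, 0666);
    if (fd < 0) return nullptr;
    ftruncate(fd, (off_t)bytes);                // size the backing file
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);          // MAP_SHARED: cross-process
    close(fd);                                  // the mapping stays valid
    return (p == MAP_FAILED) ? nullptr : (uint8_t*)p;
}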
As depicted in figure 3, each core is simulated using a separate x86 process. Cores use barriers and locks placed in shared memory for synchronization and data sharing. Locks are placed in uncached memory that supports exclusive subordinate access; the XTSC memory component, xtsc_memory, supports exclusive subordinate access. Cadence software tools provide a way to define memory regions as cached or uncached.
Fig. 3: Each core is simulated using a separate x86 process.
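As a rough sketch of what such a lock looks like, the following spins on a compare-and-swap of a word in shared memory. The address is made up, and __atomic_compare_exchange_n is a portable GCC/Clang stand-in for the exclusive-access operations the Xtensa cores would actually issue through the SDK.

#include <cstdint>

#define LOCK_ADDR 0x60000000u   // hypothetical uncached shared-memory address
static uint32_t* const lock_word = (uint32_t*)LOCK_ADDR;

void lock_acquire(uint32_t* l) {
    uint32_t expected = 0;
    // Spin until the word atomically changes 0 -> 1
    // (backed by exclusive access on the hardware).
    while (!__atomic_compare_exchange_n(l, &expected, 1u, false,
                                        __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
        expected = 0;
}

void lock_release(uint32_t* l) {
    __atomic_store_n(l, 0u, __ATOMIC_RELEASE);  // publish data, free the lock
}

A core brackets its critical section with lock_acquire(lock_word) and lock_release(lock_word).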
A demo application performs a 128×128 matrix multiplication. Work is divided so that each of the 32 cores computes four rows of the 128×128 result matrix. Cores use barriers to synchronize, and Cadence tools provide APIs for synchronization and locking. Note that without a higher-level lock, prints from all cores would be interleaved; therefore, in the demo application, only core#0 prints.
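A minimal sketch of the work split follows. Here my_core_id() and barrier_wait() are placeholders for the SDK’s core-ID and synchronization APIs, and the matrix element type is assumed to be int32.

#include <cstdint>

#define DIM     128
#define NCORES  32
#define ROWS_PER_CORE (DIM / NCORES)       // 4 rows per core

// Assumed to be placed in shared memory via the linker map.
extern int32_t A[DIM][DIM], B[DIM][DIM], C[DIM][DIM];

extern unsigned my_core_id(void);          // placeholder: returns 0..31
extern void     barrier_wait(void);        // placeholder: shared-memory barrier

void matmul_worker(void)
{
    unsigned first = my_core_id() * ROWS_PER_CORE;

    barrier_wait();                        // wait until inputs are ready
    for (unsigned i = first; i < first + ROWS_PER_CORE; ++i)
        for (unsigned j = 0; j < DIM; ++j) {
            int32_t acc = 0;
            for (unsigned k = 0; k < DIM; ++k)
                acc += A[i][k] * B[k][j];
            C[i][j] = acc;
        }
    barrier_wait();                        // all rows done before core#0 prints
}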
The following sample command runs the 32-core simulation so that each core is a separate x86 process. It runs the matrix multiplication application in cycle-accurate mode with logging off.
>>for (( N=0; N<32; N=N+1 )); do xtsc-run -define=NumCores=32 -define=N=$N -define=LOGGING=0 -define=TURBO=0 -define=PROG_NAME=/..path../MatMul -i=coreNN.inc & done
“xtsc-run” is a Cadence Xtensa SDK application that lets users run SystemC simulations without any C++/SystemC programming. coreNN.inc describes the model topology in a readable text format. Because each core runs as a separate x86 process, the processes synchronize with one another at the end of the elaboration phase before starting the simulation.
Figure 4 shows the output of the Linux “htop” command. It shows 32 separate simulation processes.
Fig. 4: Dump of the Linux “htop” command.
It is easy to create a custom simulation using an XTSC script. Capturing wall time with xtsc-run requires logging, which slows down the simulation, whereas a custom simulation can measure wall time directly. Custom simulations also offer per-core debug control. A single custom simulation executable can simulate each of the 32 cores as a separate x86 process.
One approach to a single custom simulation executable is to generate sc_main.cpp for, say, core#0 from coreNN.inc using the following xtsc-run command.
>>xtsc-run -define=NumCores=32 -define=N=0 -define=LOGGING=0 -define=TURBO=0 --xxdebug=sync -i=coreNN.inc -sc_main=sc_main.cpp -no_sim
Modify the sc_main.cpp generated for core#0 to create a generic sc_main.cpp to build a single simulation executable for all cores. The Xtensa SDK includes Makefile targets to build custom simulations.
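As a sketch of what the generic file might look like: the per-core model construction emitted by xtsc-run stays as generated, the core index comes from the command line so one binary serves all 32 cores, and wall time is measured around sc_start() with no logging required. Everything except the standard SystemC calls is elided or assumed here.

// Hypothetical skeleton of a generic sc_main.cpp; the model construction
// emitted by xtsc-run is elided. sc_start() is standard SystemC.
#include <systemc>
#include <chrono>
#include <cstdio>
#include <cstdlib>

int sc_main(int argc, char* argv[])
{
    // Which core this process simulates (passed on the command line).
    int core = (argc > 1) ? std::atoi(argv[1]) : 0;

    // ... per-core model construction generated by xtsc-run goes here ...

    auto t0 = std::chrono::steady_clock::now();
    sc_core::sc_start();                         // run simulation to completion
    auto t1 = std::chrono::steady_clock::now();
    std::printf("core %d wall time: %.1f s\n", core,
                std::chrono::duration<double>(t1 - t0).count());
    return 0;
}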
By default, the simulation runs in cycle-accurate mode. Fast functional (Turbo) mode provides an additional speedup over cycle-accurate mode. Note that fast functional mode has an initialization phase, so its gains are visible only when running applications with longer run times.
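Assuming coreNN.inc wires the TURBO define to the cores’ fast functional mode (as the earlier command suggests), the same multi-process run can be switched to Turbo mode by flipping that define:
>>for (( N=0; N<32; N=N+1 )); do xtsc-run -define=NumCores=32 -define=N=$N -define=LOGGING=0 -define=TURBO=1 -define=PROG_NAME=/..path../MatMul -i=coreNN.inc & done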
The table captures simulation wall time improvements. Note that these are illustrative wall time numbers. Actual wall time numbers and improvements will depend on your host machine’s performance and your application.
Simulation Type | Wall Time | Comments
Single x86 process, cycle-accurate mode | 17,500 seconds | baseline for comparison
Multiple x86 processes, cycle-accurate mode | 1,385 seconds | ~12X faster than the single process
Multiple x86 processes, Turbo mode | 415 seconds | ~3X faster than multi-process cycle-accurate mode
A debugger can be attached to each of the individual x86 core simulation processes. Synchronous stop/resume and core-specific breakpoints are also supported. Configure an Xplorer launch configuration and attach it to the running simulation processes as shown in figure 5.
Fig. 5: Debug configuration.
Figure 6 shows 32 debug contexts.
Fig. 6: 32 debug contexts.
As shown, using the Xtensa SDK, you can create a multi-core simulation that functions as a practical software development platform. Please visit the Cadence support site for information on building and simulating multi-core Xtensa systems.