Better Optimization For Many-Core AI Chips

System-wide functional analysis helps optimize many-core SoCs and get them to market on time.


The rise of massively parallel computing has led to an explosion of silicon complexity, driven by the need to process data for artificial intelligence (AI) and machine learning (ML) applications. This complexity is seen in designs like the Cerebras Wafer Scale Engine (figure 1), a tiled, manycore, wafer-scale die with a transistor count in the trillions and nearly a million compute cores.

Fig. 1: The Cerebras Wafer Scale Engine is a good example of a huge, complex manycore SoC.

The market for AI SoCs continues to grow and is highly competitive. Semiconductor companies find their niche based on performance, cost, and flexibility. Targeting one or another of these parameters has led to an explosion of new manycore architectures. System architects are trying many different approaches, but all of the designs are highly complex, and every chip maker wants to turn that complexity into a competitive advantage.

Of all the sources of complexity, one is particularly important in manycore AI SoCs: the functional errors and degraded performance that arise when many threads run in parallel on shared data. Traditionally, designers could use classical CPU run control to debug such problems, but that approach does not scale to manycore architectures. Between the round-trip delay, the number of cores, control and data parallelism, multiple levels of hierarchy, and interdependent processes, designers have only a slim chance of determining the root cause of software problems.

Additionally, designers need to consider hardware-software co-optimization, which requires a lot of functional analysis. To implement AI applications on the SoC, designers need to compile the source code to take advantage of the manycore architecture. This frequently requires a custom toolchain that has full knowledge of the architecture. The process involves a cycle of hardware and software optimization and testing that starts in SoC emulation and continues through first silicon and subsequent generations of the device, as shown in figure 2.

Fig. 2: System-level data is used throughout the SoC lifecycle.

Through this cycle of functional analysis, the teams can learn:

  • How effectively data is shared
  • Whether the network on chip (NoC) is over-subscribed or unbalanced
  • How to measure application performance without impacting code execution
  • How to optimize the memory controller profile for data throughput
  • How to correlate events from across the SoC
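As an illustration of the second point, the check for an over-subscribed or unbalanced NoC can be reduced to simple arithmetic on per-link counters. The following is a minimal Python sketch under stated assumptions: the link names, the busy-cycle counters, the 90% over-subscription threshold, and the 2x peak-to-average imbalance ratio are all hypothetical and are not taken from any particular monitoring product.

```python
from statistics import mean


def link_utilization(busy_cycles, window_cycles):
    """Fraction of the observation window during which each link was busy."""
    return {link: busy / window_cycles for link, busy in busy_cycles.items()}


def assess(utilization, oversubscribed_at=0.9, imbalance_ratio=2.0):
    """Return (over-subscribed links, whether the traffic looks unbalanced).

    Both thresholds are illustrative assumptions, not recommended values.
    """
    values = list(utilization.values())
    avg = mean(values)
    peak = max(values)
    hot = sorted(link for link, u in utilization.items() if u >= oversubscribed_at)
    unbalanced = avg > 0 and peak / avg >= imbalance_ratio
    return hot, unbalanced


# Hypothetical busy-cycle counts collected over a 1,000-cycle window.
busy = {"r0->r1": 950, "r1->r2": 200, "r2->r3": 150, "r3->r0": 100}
util = link_utilization(busy, window_cycles=1000)
hot, unbalanced = assess(util)
print(hot, unbalanced)
```

In this made-up data set, one link carries most of the traffic, so it is reported as over-subscribed and the NoC as a whole is flagged as unbalanced.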

Getting to this point requires a new approach to optimizing AI SoCs and the software that runs on them. It calls for a system-wide functional analysis to bring high-quality AI SoCs to market on time and to maintain optimal performance after deployment. Some features of system-wide functional analysis include:

  • Detailed insights into any subsystem or component
  • An accurate and coherent picture of the whole system from boot
  • Transaction-aware interconnect monitoring and statistics
  • Classical processor run control and trace
  • Support for all common ISAs and interconnect protocols
  • Flexibility to choose or change which subsystems are important
  • Flexible and powerful tools to generate data insights

An on-chip infrastructure of monitoring and analysis IP and software provides all these benefits from simulation to deployment. Figure 3 shows a typical architecture for SoC functional monitoring and analytics.

Fig. 3: An Embedded Analytics platform provides system-level visibility that turns chip complexity into an advantage.

Consider the example shown in figure 4. This block diagram of a manycore chip is instrumented with an on-chip NoC monitor that traces all NoC transactions into a circular buffer. Because the NoC monitor is transaction-aware, it can be configured to detect certain bus conditions, for example a deadlock that causes the duration of a transaction to exceed a set threshold (measured in cycles). When the threshold is exceeded, the NoC monitor can output the details of the deadlocked transaction and those immediately preceding it, allowing diagnosis of the problem. This requires no run-time intervention from the debug host.

Fig. 4: A block diagram of a manycore chip instrumented with an on-chip NoC monitor.
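The monitor behavior in this example can be modeled in a few lines of Python. This is a minimal software sketch, not the actual monitor IP: the `Transaction` fields, the `NocMonitor` class, the buffer depth, and the threshold are all hypothetical names and values chosen for illustration. It also simplifies one point: the model sees each transaction only when it completes, whereas a hardware monitor would more likely flag an in-flight transaction as soon as its timer passes the threshold.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    """One observed NoC transaction (all fields are hypothetical)."""
    txn_id: int
    initiator: str
    target: str
    start_cycle: int
    end_cycle: int

    @property
    def duration(self) -> int:
        return self.end_cycle - self.start_cycle


class NocMonitor:
    """Keeps the most recent transactions and dumps them on a slow one."""

    def __init__(self, depth: int, threshold_cycles: int):
        # deque with maxlen behaves as a circular buffer:
        # the oldest entry is dropped when a new one arrives at capacity.
        self.history = deque(maxlen=depth)
        self.threshold_cycles = threshold_cycles

    def observe(self, txn: Transaction):
        """Record a transaction; return the buffer contents if it was too slow."""
        self.history.append(txn)
        if txn.duration > self.threshold_cycles:
            # The offending transaction is last, preceded by its recent history.
            return list(self.history)
        return None


# Hypothetical traffic: transaction 7 stalls for 500 cycles, the rest take 8.
monitor = NocMonitor(depth=4, threshold_cycles=100)
dumps = []
for i in range(10):
    start = 10 * i
    end = start + (500 if i == 7 else 8)
    dump = monitor.observe(Transaction(i, "core0", "ddr", start, end))
    if dump is not None:
        dumps.append(dump)

for txn in dumps[0]:
    print(txn.txn_id, txn.duration)
```

The dump is produced entirely inside the monitor model. The code that feeds it traffic does not need to poll or intervene, which mirrors the point that no run-time intervention from the debug host is required.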

The same NoC monitor can be configured to trigger trace elsewhere in the system when it detects the same deadlock condition, for instance via a status monitor block that traces the behavior of a hardware accelerator, using the cross-triggering functionality of the Embedded Analytics message infrastructure.
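Cross-triggering of this kind can be pictured as a publish/subscribe pattern: one monitor raises an event and other monitors react to it. The Python sketch below is an assumption-laden illustration only. `MessageFabric`, `StatusMonitor`, the `noc_deadlock` event name, and the accelerator states are hypothetical and do not represent the actual Embedded Analytics message infrastructure or its API.

```python
from collections import defaultdict


class MessageFabric:
    """Hypothetical stand-in for an on-chip message infrastructure."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event, callback):
        self._subscribers[event].append(callback)

    def publish(self, event, cycle):
        # Deliver the trigger to every monitor that subscribed to this event.
        for callback in self._subscribers[event]:
            callback(event, cycle)


class StatusMonitor:
    """Traces a block's state, but only after it has been triggered."""

    def __init__(self, name):
        self.name = name
        self.tracing = False
        self.trace = []

    def on_trigger(self, event, cycle):
        self.tracing = True
        self.trace.append((cycle, f"trace started by {event}"))

    def sample(self, cycle, state):
        if self.tracing:
            self.trace.append((cycle, state))


fabric = MessageFabric()
accel_monitor = StatusMonitor("accelerator0")
fabric.subscribe("noc_deadlock", accel_monitor.on_trigger)

accel_monitor.sample(90, "IDLE")       # ignored: tracing not yet triggered
fabric.publish("noc_deadlock", 100)    # raised by the NoC monitor in this scenario
accel_monitor.sample(101, "WAIT_DMA")  # recorded: tracing is now active

print(accel_monitor.trace)
```

Keeping the status monitor idle until the trigger arrives means trace bandwidth and storage are spent only around the event of interest.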

Understanding the issues involved in implementing an effective system validation and optimization environment is key to the successful delivery of manycore SoCs, which is why working with a supplier that has deep expertise in this area is essential.
