How to gain insight into control flow and data management when writing and observing code.
When writing code it is often useful to add informational statements that give an insight into control flow and data management as well as aiding in observation of the actual code at runtime. As such, instrumentation is an important component of code running on a live system. The proliferation of “printf” debug statements, whereby data is output to a console, is testament to this.
Sending text data to a USART or similar peripheral via printf is perhaps the most common method of instrumentation. It does have its drawbacks: the data rate of most USARTs is relatively low, while the overhead of maintaining such communication is relatively high, involving the use of FIFOs and interrupt servicing. It can also be complicated to access a serial port connection on a production system, which may be located remotely. With this in mind, a USART can be considered a non-ideal choice for use cases involving high-performance code or the collection of remote instrumentation data.
An alternative method may be to use network devices, such as Ethernet. These devices typically afford much higher bandwidth than USARTs, and are ideal for the collection of remote data. However, this involves manually encapsulating the data in protocols such as TCP/IP, which can dramatically increase the overhead of servicing the peripheral. The overall overhead of instrumentation can therefore be higher than with a USART.
Using USARTs, Ethernet or other generic data peripherals can have detrimental effects on instrumented code. As an example, imagine a system which performs network packet data processing. With a USART, we may find that data processing throughput is limited because the code has to wait on the USART bandwidth to send its instrumentation data. With Ethernet as the transport for instrumentation, we may find that the instrumentation of packet data processing contains data on the process of instrumentation itself, since the instrumentation packets pass through the very code being observed.
It is considered desirable for instrumented code to run at close to the performance and run-time profile of non-instrumented code. This implies that instrumentation should have as little management overhead as possible, and should not markedly interfere with the operation of the non-instrumentation code. One way to solve these problems is with a device which is designed for the purpose of instrumentation.
The System Trace Macrocell
A System Trace Macrocell (STM) grants software developers the ability to instrument code utilizing the CoreSight Trace subsystem as a transport. CoreSight is a central part of most ARM SoCs, and is intended to operate at similar clock rates to the rest of the components of the system. The STM itself operates in a non-invasive fashion, requires very little overhead besides memory-mapped peripheral writes, and does not (directly) generate interrupts.
ARM defines a System Trace Macrocell Programmers’ Model Architecture Specification (currently version 1.1, referred to here as “STM Architecture”) and licenses the current CoreSight STM-500 product as an implementation of that architecture.
Further information on CoreSight Trace can be found in Eoin McCann’s 3-part blog on CoreSight.
The STM instruments using the MIPI System Trace Protocol version 2.0 (STPv2), which is available to MIPI Members. The protocol itself defines a method for both instrumentation data and metadata to be encapsulated in a trace stream, composed of data elements of varying size (from 4- to 64-bit). The instrumentation is otherwise free-form and neither the protocol nor the STM places any limitations on the data content of the stream. These aspects of the STM free the software developer from having to be concerned with instrumentation overheads and available bandwidth.
Instrumentation via STM can be identified as being output via a particular “Master,” in order to differentiate the various sources within a system. A simple implementation might attribute all instrumentation to a single Master identifier. A more complex design might assign each individual core a unique Master identifier, making it clear which core was running the software responsible for generating a particular datum of instrumentation.
Any device that can generate a memory system write can generate instrumentation, for example DMA peripherals and GPUs.
The number of masters within a system and their identifiers are part of the implementation of the system, and may or may not map directly to, for example, AXI IDs. Check the design documentation for your chosen SoC for details on which components are able to generate stimulus via memory writes, and what their STPv2 Master ID is.
Each STM implementation has access to up to 65536 instrumentation channels. Each of these channels is clearly defined in the trace stream, allowing for multiple types of instrumentation to be intermixed within a single system or single application. For instance, channel 0 could be used to encode ASCII text, while channel 10 could output packet headers in a binary format. Alternatively, one channel could be allocated to each Process within a system.
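One way to keep such a channel allocation consistent between the instrumented code and the trace decoder is to write it down in a shared header. The following is a minimal sketch only: the channel names, the per-process block starting at channel 256 and its size of 1024 are hypothetical choices for illustration, not values taken from the STM Architecture.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical channel plan for one application; the numbers are arbitrary. */
enum trace_channel {
    TRACE_CH_TEXT       = 0,   /* ASCII log text */
    TRACE_CH_PKT_HEADER = 10,  /* binary packet headers */
    TRACE_CH_PROC_BASE  = 256  /* first of a block of per-process channels */
};

#define TRACE_PROC_CHANNELS 1024u  /* assumed size of the per-process block */

/* Map a process ID onto its channel. IDs beyond the block wrap around,
 * so two processes may share a channel; a decoder must tolerate that. */
static inline uint32_t trace_channel_for_process(uint32_t pid)
{
    uint32_t ch = TRACE_CH_PROC_BASE + (pid % TRACE_PROC_CHANNELS);

    assert(ch < 65536u);  /* upper bound on channels quoted in the text */
    return ch;
}
```

The decoder can include the same header and switch on the channel number it finds in the trace stream to select the right interpretation of the payload.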
Metadata: Marks, Flags, Timestamps and Triggers
STM metadata is highly flexible, allowing one to arbitrarily Mark any trace data packet. A marked datum is typically used to identify the start of data or something of interest in the trace stream. A Flag can be used in a similar way; however, no data is associated with a Flag.
Each packet can be supplemented with a Timestamp, which takes an external clock signal and converts it into an incrementing count in the trace stream. In this manner a trace stream can be synchronized with other trace in the system, such as Instruction Trace from an ETM, or simply provide timing information to the trace decoder.
STPv2 defines the format of the timestamp to be flexible. The STM-500 outputs timestamps in a natural binary format, with the ability to encode a delta to conserve bandwidth.
A Trigger is special in that it is both output to the trace stream and can have an effect on the rest of the trace subsystem. The result of a Trigger can be routed to other components in the system. In this manner code can be instrumented and also generate additional trace from other Trace Macrocells within the system at pertinent points. This is particularly useful for post-mortem analysis use cases.
Channels are formed on the STM by way of “stimulus ports.” These are groups of registers within the SoC memory map that, when accessed, generate the desired trace output. The STM Architecture defines both “Basic” and “Extended” Stimulus Ports. A Basic Stimulus Port is simple; data is written to the port, and that data is then output.
Extended Stimulus Ports allow the data to be augmented with useful metadata, and allow the importance of that data to be indicated (Guaranteed or Invariant, discussed later). The Extended Stimulus Ports consist of a grouping of 16 registers in a 256-byte memory-mapped region, separate from the STM configuration registers.
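Because each Extended Stimulus Port occupies 256 bytes, locating the port for a given channel is simple address arithmetic. The sketch below assumes that ports are laid out contiguously in channel order from a stimulus base address. That layout, the base address itself and the helper names are assumptions to check against your SoC documentation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Each Extended Stimulus Port spans 256 bytes (from the text). */
#define STM_PORT_STRIDE   256u
/* Upper bound on channels quoted in the text; an SoC may provide fewer. */
#define STM_MAX_CHANNELS  65536u

/* Byte offset of a channel's port from the start of the stimulus region,
 * ASSUMING ports are contiguous and ordered by channel number. */
static inline size_t stm_port_offset(uint32_t channel)
{
    assert(channel < STM_MAX_CHANNELS);
    return (size_t)channel * STM_PORT_STRIDE;
}

/* Pointer to a channel's port. `stim_base` is hypothetical: it stands for
 * wherever the stimulus region has been mapped on the target. */
static inline volatile uint8_t *stm_port(volatile uint8_t *stim_base,
                                         uint32_t channel)
{
    return stim_base + stm_port_offset(channel);
}
```

With this layout, channel 10 from the earlier example would sit 2560 bytes above the stimulus base.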
Depending on the address offset of the register within a group, a different STPv2 packet is output. The offsets are defined in the STM Architecture, Section 3.1 (Table 3-1), a summary of which is shown:
The size of the data payload of each packet is determined by the size of the access made to the stimulus port offset. For example, an 8-bit store to offset 0x18 would nominally generate a ‘D8’ packet, while a 32-bit store to offset 0x18 would nominally generate a ‘D32’ packet, and so on.
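The relationship between access width and packet type can be sketched as follows. Only the 0x18 data offset comes from the text; the helper names are illustrative, and whether a 64-bit access is usable at all depends on the implementation's Fundamental Data Size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STM_OFF_DATA 0x18u  /* data offset used in the text */

/* Nominal STPv2 data packet produced by a single store of this width,
 * or NULL if the width is not a valid single access. */
static const char *stm_packet_for_width(size_t bytes)
{
    switch (bytes) {
    case 1:  return "D8";
    case 2:  return "D16";
    case 4:  return "D32";
    case 8:  return "D64";
    default: return NULL;
    }
}

/* Issue exactly one store of `bytes` width to port + offset.
 * Returns 0 on success, -1 for an unsupported width. */
static int stm_store(volatile void *port, uint32_t offset,
                     uint64_t value, size_t bytes)
{
    volatile uint8_t *addr = (volatile uint8_t *)port + offset;

    if (stm_packet_for_width(bytes) == NULL)
        return -1;
    assert((uintptr_t)addr % bytes == 0);  /* keep accesses naturally aligned */

    switch (bytes) {
    case 1:  *addr = (uint8_t)value;                       break;
    case 2:  *(volatile uint16_t *)addr = (uint16_t)value; break;
    case 4:  *(volatile uint32_t *)addr = (uint32_t)value; break;
    default: *(volatile uint64_t *)addr = value;           break; /* 8 bytes */
    }
    return 0;
}
```

The volatile-qualified pointer matters here: it stops the compiler from merging or widening the stores, which would otherwise change the packet type seen in the trace.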
To reiterate, we can “Mark” and “Timestamp” our data, and also output metadata only via “Flag” and “Trigger” mechanisms (these types of instrumentation have no data payload).
Since ARM’s STM and STM-500 IP do not implement the Basic Stimulus registers, we will not cover them here. ARM partners implementing an STM may choose to implement them per the STM Architecture. If, when designing an SoC, there is a requirement for simpler instrumentation, then it is possible that an Instrumentation Trace Macrocell (ITM) could be implemented which can provide similar functionality, although with a different programmers’ model and trace output format. Please check your SoC documentation.
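As an illustration of combining these mechanisms, the sketch below sends an ASCII message in which the first byte is marked and timestamped and a Flag closes the message. Only the 0x18 data offset is quoted in the text. The other offsets are assumptions based on a reading of the STM Architecture offset table and must be verified against the specification. The function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Offsets within one Extended Stimulus Port. Only 0x18 appears in the
 * text; every other value is ASSUMED and must be checked against the
 * STM Architecture before use. */
enum stm_offset {
    STM_G_DMTS   = 0x00,  /* data, marked, timestamped */
    STM_G_DM     = 0x08,  /* data, marked */
    STM_G_DTS    = 0x10,  /* data, timestamped */
    STM_G_D      = 0x18,  /* data only */
    STM_G_FLAGTS = 0x60,  /* flag, timestamped */
    STM_G_FLAG   = 0x68,  /* flag */
    STM_G_TRIGTS = 0x70,  /* trigger, timestamped */
    STM_G_TRIG   = 0x78   /* trigger */
};

/* Choose the data offset for the requested metadata. */
static inline uint32_t stm_data_offset(bool marked, bool timestamped)
{
    if (marked)
        return timestamped ? STM_G_DMTS : STM_G_DM;
    return timestamped ? STM_G_DTS : STM_G_D;
}

/* Emit a string as byte-wide data: the first byte is marked and
 * timestamped to delimit the message start, the rest are plain data,
 * and a Flag terminates the message. */
static void stm_puts(volatile uint8_t *port, const char *s)
{
    bool first = true;

    assert(port != NULL && s != NULL);
    for (; *s != '\0'; ++s) {
        port[stm_data_offset(first, first)] = (uint8_t)*s;
        first = false;
    }
    port[STM_G_FLAG] = 0;  /* a Flag carries no payload; the value is a dummy */
}
```

A decoder can then treat a marked datum as "start of message" and the following Flag as "end of message", without needing an in-band terminator.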
Fundamental Data Size
The STM implementation defines a “Fundamental Data Size.” This is essentially the maximum size of an access to the stimulus port registers, as determined by the implementation of the connection between the STM and the rest of the system.
For STM-500, as implemented in revision r1 of the Juno SoC, the fundamental data size is 64-bit, so a 64-bit stimulus should generate a D64 packet. Take care to establish this value for each target, as it can change the way a trace decoder must be written for application instrumentation that may run on multiple platforms.
Some SoCs implement an earlier version of the STM, the r0 revision of Juno being one example. The Fundamental Data Size is defined as 32-bit for that implementation.
An STM with a Fundamental Data Size of 64 bits may also be connected in such a way that it does not have a 64-bit wide data path; for example, there may be a ‘downsizer’ between the instrumentation source and the STM.
If a 64-bit memory system write is performed and either of the above is true, the actual trace output behavior is not defined by the STM Architecture. Take these aspects into account, as they can change the way instrumentation must be extracted within a trace decoder.
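One defensive approach is for portable instrumentation code to avoid 64-bit stores unless the whole path to the STM is known to be 64 bits wide, and to fall back to two 32-bit stores otherwise. The sketch below assumes a `path_bits` parameter supplied by the platform code (a hypothetical name). The low-word-first ordering is a convention chosen for this example that the decoder must share.

```c
#include <assert.h>
#include <stdint.h>

#define STM_OFF_DATA 0x18u  /* data offset used in the text */

/* Emit a 64-bit value. `path_bits` is the narrowest width between the
 * writer and the STM (hypothetical parameter, taken from SoC docs).
 * Returns the number of data packets the decoder should expect. */
static int stm_write_u64(volatile void *port, uint64_t value,
                         unsigned path_bits)
{
    volatile uint8_t *addr = (volatile uint8_t *)port + STM_OFF_DATA;

    assert((uintptr_t)addr % 8u == 0);  /* keep accesses naturally aligned */
    if (path_bits >= 64u) {
        *(volatile uint64_t *)addr = value;   /* nominally one D64 */
        return 1;
    }
    /* Nominally two D32 packets: low word first, then high word. */
    *(volatile uint32_t *)addr = (uint32_t)value;
    *(volatile uint32_t *)addr = (uint32_t)(value >> 32);
    return 2;
}

/* Decoder side: rebuild the value from two D32 payloads in arrival order. */
static inline uint64_t stm_join_u64(uint32_t first, uint32_t second)
{
    return ((uint64_t)second << 32) | first;
}
```

Returning the packet count makes the difference between platforms explicit, so the same decoder can be told which form to expect.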
Guaranteed and Invariant Stimulus
The STM Architecture specifies two types of transaction, accessible through the stimulus port interface at separate offsets within the port – Guaranteed and Invariant. A write to the stimulus port “guaranteed” registers must be emitted by the STM as a trace packet; additionally, if a timestamp is requested (DnTS, FLAG_TS) and timestamping is enabled in the STM configuration registers (STMTCSR), then the timestamp will be generated.
Writes to the Invariant registers allow the STM to determine whether the full scope of the instrumentation will be output. This is useful for instrumentation that can tolerate being “lossy”: for instance, the output of the state of a loop counter where intermediate loop counts can be inferred, or where timestamping is not fundamental to the instrumentation. Invariant stimulus may, when emitted, “drop” timestamps for the sake of trace bandwidth. Important instrumentation, such as an error report, can still use Guaranteed stimulus.
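In code, the choice between the two can be made per call site. The sketch below routes a loop counter through Invariant stimulus and an error code through Guaranteed stimulus. It assumes that the Invariant registers mirror the Guaranteed ones at an additional offset of 0x80 and that timestamped data sits at 0x10. Both are assumptions to verify against the STM Architecture, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define STM_OFF_G_D        0x18u  /* Guaranteed data offset from the text */
#define STM_OFF_G_DTS      0x10u  /* ASSUMED: Guaranteed data, timestamped */
#define STM_OFF_INVARIANT  0x80u  /* ASSUMED: Invariant half of the port */

/* Pick the stimulus offset for a data write. */
static inline uint32_t stm_stimulus_offset(bool timestamped, bool guaranteed)
{
    uint32_t off = timestamped ? STM_OFF_G_DTS : STM_OFF_G_D;

    return guaranteed ? off : (off | STM_OFF_INVARIANT);
}

/* One 32-bit store to the chosen offset. */
static inline void stm_emit32(volatile void *port, uint32_t off,
                              uint32_t value)
{
    volatile uint8_t *addr = (volatile uint8_t *)port + off;

    assert((uintptr_t)addr % 4u == 0);  /* keep accesses naturally aligned */
    *(volatile uint32_t *)addr = value;
}

/* Lossy progress indicator: intermediate counts can be inferred. */
static void trace_loop_count(volatile void *port, uint32_t i)
{
    stm_emit32(port, stm_stimulus_offset(true, false), i);
}

/* Error report: must appear in the trace, so use Guaranteed stimulus. */
static void trace_error(volatile void *port, uint32_t code)
{
    stm_emit32(port, stm_stimulus_offset(true, true), code);
}
```

Keeping the decision inside small wrappers like these means the policy for each kind of event is stated once rather than at every call site.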