The When, Why, And How Of Waiting And Backoff In Multi-Threaded Applications On Arm

Best practices for improving throughput and fair access to shared resources.

popularity

With multithreaded applications, there are situations where it is unavoidable or desirable to wait for other threads. Implementing such wait instruction sequences correctly is important for both multithreaded scalability and power efficiency. Scalability is measured both in terms of aggregated throughput and fairness. Fairness is when all contending threads get an equal share of the contended resource. Fairness is an important but often overlooked characteristic and with poor fairness, some threads could starve, potentially forever, and thus contribute to high tail latencies.

This blog post describes different user space delay and wait implementations for the Armv8+ architecture and recommends best practices for the purpose of improving throughput and fair access to shared resources.

When to wait

There are two common situations where a thread needs to or ought to wait:

  • When waiting for some resource to become available
    • For example, waiting to acquire a simple spinlock or waiting for its turn to enter a critical section.
  • When retrying failed CAS (atomic compare and swap) operations

Waiting in user space ‘spin-waiting’ may be preferred over invoking the operating system which requires system calls. If a thread spends more time waiting than processing, it can be questioned whether the thread should be present at all. If long and/or unpredictable waiting periods are acceptable, consider using relevant OS mechanisms for synchronization as this allows the OS to manage the processor in a more efficient way.

Why backoff

When waiting for other threads, the primary objective is to avoid memory contention. We define memory contention as multiple threads accessing the same cache line and at least one access is a write. Writes do not have to be explicit, for example software and hardware prefetching, or other processor operations, for example cache management operations, can have the same practical effect. Memory contention hurts multithreaded scalability due to the use of MESI-based cache coherency protocols. With MESI, a write requires that processor to get the corresponding cache line in exclusive state which causes the invalidation of that line from all other processors that hold a copy. Another consideration is the behavior of reads following a write. Other processors reading the line will cause the line either to migrate or lose exclusive state. When spin-polling, processors are continuously reading a memory location and reacquiring a copy of the corresponding line if necessary. This can lead to inefficiencies when the owner comes to, for example, release a lock which requires another write to the cache line and thus exclusive ownership of the line. Together, these properties expose software to bottlenecks in and communication latencies of the memory system. The execution time overhead may grow linearly with the number of processors contending for the cache line. Write/write contention may also cause CAS (atomic compare and swap) operations to fail, wasting cache coherence bandwidth without making progress.

Reducing demand to the shared memory location decreases memory contention and allows for other threads to make quicker progress. This improves aggregated throughput and fairness and enables greater system wide progress. Reducing such demand is also known as backoff.

The chart below shows the behavior of a spinlock as the number of threads contending on it increases. From the chart we can see that using a backoff approach to reduce demand when waiting scales much better with high thread counts. Both throughput and fairness improve. The backoff is implemented by the thread waiting for several cycles between accesses to the contended spinlock.

Since the waiting processor is not doing any useful work, we would also like to decrease power consumption while waiting. Avoid keeping the processor pipeline and memory system busy when no progress is being made. Ideally, the processor should enter a low power state with clocks disabled.

How to delay

Waiting is usually implemented using an instruction sequence which causes the processor to delay or stall execution. Such instruction sequences should ideally be independent of microarchitectural characteristics, for example instruction latencies and processor clock frequency. The implementation will also affect power consumption of the delay function, the lower power consumption while waiting, the better.

The Arm architecture provides a user space accessible counter-timer which is incremented at a fixed but machine-specific rate. Software can (spin) wait until the counter-timer reaches some desired value. To make the waiting more power efficient, Armv8.7-A introduced the WFET (FEAT_WFxT) instruction, which allows the processor to enter a low power state for a set time, or until an event is received. Without this feature, we can instead use the ISB instruction to decrease processor activity and thus power consumption between checks of the counter-timer. Note that we do not depend on the latency of the ISB instruction.

C code implementing an Armv8.0-A compatible delay function using the counter-timer is shown below.

static inline unsigned long
counter_read(void)
{
    unsigned long cnt;
    //Need __volatile below since CNTVCT_EL0 does not return a constant value
    __asm __volatile("mrs %0,cntvct_el0" : "=r" (cnt));
    return cnt;
}

static inline unsigned long
counter_freq(void)
{
    unsigned long freq;
    __asm("mrs %0,cntfrq_el0" : "=r" (freq));
    return freq;
}

void
nano_delay(unsigned long delay_ns)
{
    //Prevent speculation of subsequent counter/timer reads and memory accesses
    __asm __volatile("isb");
    if (delay_ns > 18)
    {
        //Adjust for overhead of initial ISB
        delay_ns -= 18;
        //unsigned long delay_ticks = delay_ns * counter_freq() / 1000000000UL;
        //With the below simplifications and adjustments,
        //we are usually within 2% of the correct value
        unsigned long delay_ticks = (delay_ns + delay_ns / 16) * counter_freq() >> 30;
        if (delay_ticks != 0)
        {
            unsigned long start = counter_read();
            if (delay_ticks > 40)
            {
                //Delay using ISB for all but the last ticks
                do
                {
                    __asm __volatile("isb" ::: "memory");
                }
                //Subtract and compare to handle counter roll-over
                while (counter_read() - start < delay_ticks - 40);
            }
            //Spin in empty loop for the last 1..40 ticks to increase accuracy
            do
            {
            }
            //Subtract and compare to handle counter roll-over
            while (counter_read() - start < delay_ticks);
        }
    }
}

Compiling this code, for example using GCC, generates the following machine code:

0000000000000000 :
   0:   d5033fdf        isb
   4:   f100481f        cmp     x0, #0x12
   8:   54000289        b.ls    58 <nano_delay+0x58>
   c:   d1004800        sub     x0, x0, #0x12
  10:   d53be001        mrs     x1, cntfrq_el0
  14:   8b401000        add     x0, x0, x0, lsr #4
  18:   9b017c00        mul     x0, x0, x1
  1c:   d35efc00        lsr     x0, x0, #30
  20:   b40001c0        cbz     x0, 58 <nano_delay+0x58>
  24:   d53be042        mrs     x2, cntvct_el0
  28:   f100a01f        cmp     x0, #0x28
  2c:   540000e9        b.ls    48 <nano_delay+0x48>
  30:   d100a003        sub     x3, x0, #0x28
  34:   d5033fdf        isb
  38:   d53be041        mrs     x1, cntvct_el0
  3c:   cb020021        sub     x1, x1, x2
  40:   eb03003f        cmp     x1, x3
  44:   54ffff83        b.cc    34 <nano_delay+0x34>
  48:   d53be041        mrs     x1, cntvct_el0
  4c:   cb020021        sub     x1, x1, x2
  50:   eb00003f        cmp     x1, x0
  54:   54ffffa3        b.cc    48 <nano_delay+0x48>
  58:   d65f03c0        ret

Potential issues are that the initial ISB instruction causes some undefined delay (approximatively 30-40 cycles) and that the counter-timer frequency may be too low to provide proper accuracy. We try to compensate for this by subtracting some nanoseconds from the requested delay time and spin-wait in an empty loop at end of the waiting period. The above implementation has the following behaviours when running on Graviton 2/3/4.

Later versions of the Arm architecture include a number of new features which improve the situation and simplify the implementation. From Armv8.6-A, the counter-timer must have the resolution of 1GHz (but actual update rate could be lower due to counter scaling). When FEAT_SB (Armv8.5-A), FEAT_ECV (Armv8.6-A) and FEAT_WFxT (Armv8.7-A) are supported, the following implementation can instead be used:

//Self-synchronising read of CNTVCTSS_EL0, no need for any speculation barrier
static inline unsigned long
counter_read_ss(void)
{
    unsigned long cnt;
    //Need __volatile below since CNTVCTSS_EL0 does not return a constant value
    __asm __volatile("mrs %0,cntvctss_el0" : "=r" (cnt)); //Requires FEAT_ECV (Armv8.6)
    return cnt;
}

void
nano_delay_v87(unsigned long delay_ns)
{
    unsigned long delay_ticks = delay_ns; //From Armv8.6, CNTFRQ_EL0 always return 1GHz (1000000000)
    if (delay_ticks != 0)
    {
        unsigned long start = counter_read_ss();
        unsigned long end = start + delay_ticks;
        do
        {
            __asm __volatile("wfet %0" : : "r" (end) : "memory"); //Requires FEAT_WFxT (Armv8.7)
                                                                  //Spurious wakeups may occur
        }
        while (counter_read_ss() - start < delay_ticks);//Handle counter roll-over
    }
    //Prevent speculation of subsequent memory accesses
    __asm __volatile("sb" ::: "memory"); //Requires FEAT_SB (Armv8.5)
}

The following machine code is generated:

0000000000000000 :
   0:   b4000100        cbz     x0, 20 <nano_delay_v87+0x20>
   4:   d53be0c2        mrs     x2, cntvctss_el0
   8:   8b020003        add     x3, x0, x2
   c:   d5031003        wfet    x3
  10:   d53be0c1        mrs     x1, cntvctss_el0
  14:   cb020021        sub     x1, x1, x2
  18:   eb00003f        cmp     x1, x0
  1c:   54ffff83        b.cc    c <nano_delay_v87+0xc>
  20:   d50330ff        sb
  24:   d65f03c0        ret

With these functions, high accuracy delay can be achieved. The actual resolution still depends on the actual update rate of the counter-timer.

Common mistakes

Many different approaches to implementing delay and backoff functionality have been attempted with varying results. Some are listed below with explanations why they are not recommended. Existing code may use the x86 PAUSE instruction to create the delay used for backoff. When porting the code to Arm, many of the approaches below have been used, often with poor results.

  • Empty loops: An empty loop can be used for causing a delay, but the actual delay will be microarchitecture- and compiler-specific. An optimizing compiler will also likely remove an empty loop. Processor implementations are constantly improving in decreasing latency and increasing throughput of existing code. A processor implementation with deep OoO capabilities might quickly speculate the whole delay loop and then also speculate subsequent memory accesses. The actual delay would be difficult to measure and this would also defeat the purpose of decreasing memory contention. Furthermore, such aggressive spin waiting would be very power consuming.
  • NOP: NOPs are used for padding. Processors attempt to minimize the overhead of executing NOPs. Indeed, many microarchitectures discard NOPs after fetch and decode so that they have essentially zero overhead. If you use NOP in a loop, it is actually the loop overhead that contributes to any delay and backoff.
  • YIELD: YIELD is allocated from the hint (NOP) space. On a processor implementation with hardware multithreading, YIELD serves as a hint to hardware that the executing thread isn’t doing anything performance critical, and the hardware could instead prioritize other hardware threads. On processors without hardware multithreading (the vast majority of Arm processor implementations), YIELD is treated as a NOP.
  • Lone WFE: WFE waits for an event and allows the processor to enter a low power state while waiting. Any event received by this processor will then wake up the processor and continue execution. Without a locally generated event (e.g. from SEVL or LDXR), WFE will wait for an indefinite time until some external source signals an event. One such source is the periodic event stream which wakes up all processors waiting in WFE at the same time which would amplify contention at that point in time. Since some threads will sleep for a very long time, memory contention will generally decrease, and throughput may look good in microbenchmarks. Usage in real code has proven to instead degrade performance.
  • SEVL+WFE: By signaling a local event using the SEVL instruction, we ensure a following WFE does not sleep indefinitely. The instruction sequence may cause delay for a small number of cycles. Since any latency in waking up from WFE sleep is detrimental, there is some pressure to decrease its latency. This makes SEVL+WFE not a future-proof solution for fixed time delay.
  • LDXR+WFE: The processor can monitor a memory location for changes using LD{A}XR (load-exclusive) and WFE instructions. WFE halts the processor while waiting for an event to be signaled. Any update to the monitored memory location will generate a local signal which will then wake up the processor. But using LDXR+WFE does not decrease memory contention. LD{A}XR will cause the processor to obtain a shared copy of the monitored cache line. If any location in the cache line is updated, the shared copy will be invalidated, and software must perform another load (which will again obtain a shared copy) to check the new value. This load and check cannot be avoided since spurious wake ups may occur. The cache coherence transactions and thus the memory contention is essentially the same as when spinning on a load.
  • ISB: The instruction synchronization barrier (ISB) synchronizes the context of the processor, which typically involves flushing the instruction pipeline, and then restarts execution. On a modern Arm out-of-order design, the delay introduced by an ISB is typically in the range of 30-35 cycles. This is likely very different from what can be expected from the x86 PAUSE instruction on modern x86 cores for which ISB is often substituted when porting existing code to Arm. ISB is commonly used in performance critical OS/kernel code and there is some pressure to decrease its latency. If the number of ISB instructions (i.e. iteration count) is tuned for Arm hardware, it could serve as part of a delay loop.
  • sched_yield(): The Linux sched_yield() call reschedules the thread for execution at some undefined future point in time. If there are no other threads to run on the processor, sched_yield() will just become an expensive busy-waiting implementation. If there are other threads to run on the processor, the imposed delay becomes non-deterministic and potentially detrimental. For further explanation why sched_yield() is not a suitable solution, read the Linux man page on sched_yield, specifically the caveats section.
  • usleep(0)/nanosleep(0): The Linux usleep() and nanosleep() calls invoke thread suspension control by the OS scheduler and will result in a context switch, the calling thread will sleep for many microseconds. They are not suitable for the short-latency waiting within the same context that we are normally after.


Leave a Reply


(Note: This name will be displayed publicly)