Streamline garbage collector scanning and improve runtime behavior for modern memory-intensive applications.
Last year, I explored how you can use the Arm Scalable Vector Extension (SVE) in .NET to unlock SIMD performance at scale. This year, my focus has shifted to something less visible but just as fundamental to runtime performance: write barriers in the CoreCLR garbage collector (GC).
Write barriers are not a feature most .NET developers ever think about. They do not change how you write C# code, and you do not see them in benchmarks unless you deliberately look. However, they are among the hottest code paths in the runtime. Every time managed code writes a reference field, which happens constantly, the write barrier runs. This behavior makes them prime candidates for micro-optimization.
In this blog post, I cover:
A write barrier is a small snippet of code that runs whenever a managed reference is stored into a field or array element. Its job is to keep the GC’s metadata in sync with program memory. Without a write barrier, the GC could not track when one object starts or stops referencing another. This gap could lead the GC to reclaim memory that is still in use, because it wrongly assumes that no object references it. The write barrier is the GC’s eyes and ears during execution.
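To make the idea concrete, here is a minimal Python model of a write barrier feeding a card table. It is a sketch under stated assumptions, not the CoreCLR implementation: the `Heap` class, the two-generation layout, and the tiny `CARD_SIZE` are all hypothetical and chosen only to keep the example readable.

```python
CARD_SIZE = 4  # hypothetical: heap slots covered by one card


class Heap:
    """Toy heap: slots below young_start are 'old', the rest are 'young'."""

    def __init__(self, num_slots, young_start):
        self.slots = [None] * num_slots  # each slot holds a slot index or None
        self.young_start = young_start
        self.cards = [False] * ((num_slots + CARD_SIZE - 1) // CARD_SIZE)

    def is_young(self, addr):
        return addr is not None and addr >= self.young_start

    def write_ref(self, dst, ref):
        """Store a reference, then run the write barrier."""
        self.slots[dst] = ref
        if self.is_young(ref):
            # Record that the card containing dst may point at a young object.
            self.cards[dst // CARD_SIZE] = True

    def write_ref_no_barrier(self, dst, ref):
        """Store a reference without telling the GC anything."""
        self.slots[dst] = ref

    def young_objects_reachable_from_old(self):
        """What a young-generation collection sees by scanning dirty cards only."""
        found = set()
        for card, dirty in enumerate(self.cards):
            if not dirty:
                continue
            start = card * CARD_SIZE
            # Only old-generation slots are scanned through the card table.
            for addr in range(start, min(start + CARD_SIZE, self.young_start)):
                if self.is_young(self.slots[addr]):
                    found.add(self.slots[addr])
        return found


with_barrier = Heap(16, young_start=8)
with_barrier.write_ref(2, 10)             # old slot 2 now references young slot 10

without_barrier = Heap(16, young_start=8)
without_barrier.write_ref_no_barrier(2, 10)

print(with_barrier.young_objects_reachable_from_old())
print(without_barrier.young_objects_reachable_from_old())
```

In the second heap, the collector never learns about the reference from the old slot, so in this model it would treat the young object as unreachable. That is the failure the write barrier exists to prevent.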
Depending on the GC mode, a write barrier might:
Because this code runs on every reference update, its performance is critical. In many applications, the write barrier is the function that runs most often. For this reason, it is written in Arm64 assembly. During its operation, the write barrier needs to know certain state in the GC and pointers to various tables. To speed this up, it uses cached copies of key GC state, which are placed nearby in memory. Some of these caches are static. Others are updated dynamically by the GC to reflect runtime configuration.
To keep the barrier fast, the card marking logic is deliberately simple. It errs on the side of marking more memory dirty than required. This design reduces time during each write, but it requires the garbage collector to scan more memory than it really needs to. On small heaps, this overhead is negligible. On large servers with tens or hundreds of gigabytes of managed memory, the additional scanning increases pause times and reduces throughput.
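The following Python sketch illustrates how a simple check can over-mark compared with a more precise one. Everything in it is an assumption made for illustration: the address layout, the `CARD_SHIFT` value, and the two check functions are hypothetical and are not taken from the runtime source. The conservative check marks a card whenever the stored reference points into the young (ephemeral) range, while the more precise check marks it only when the referenced object is younger than the object being written to.

```python
CARD_SHIFT = 8  # hypothetical: one card covers 256 bytes of address space

# Hypothetical address layout: gen2, then gen1, then gen0.
GEN1_START = 0x4000
GEN0_START = 0x6000
HEAP_END = 0x8000


def generation_of(addr):
    """Return the generation (2 = oldest, 0 = youngest) for an address."""
    if addr < GEN1_START:
        return 2
    if addr < GEN0_START:
        return 1
    return 0


def conservative_barrier(cards, dst, ref):
    # One range check: is the stored reference anywhere in gen1 or gen0?
    if GEN1_START <= ref < HEAP_END:
        cards.add(dst >> CARD_SHIFT)


def precise_barrier(cards, dst, ref):
    # Extra lookups: only an older-to-younger reference needs a dirty card.
    if generation_of(ref) < generation_of(dst):
        cards.add(dst >> CARD_SHIFT)


def count_dirty_cards(writes):
    """Run both barriers over (dst, ref) pairs and return the dirty-card counts."""
    conservative, precise = set(), set()
    for dst, ref in writes:
        conservative_barrier(conservative, dst, ref)
        precise_barrier(precise, dst, ref)
    return len(conservative), len(precise)


WRITES = [
    (0x0100, 0x6010),  # gen2 -> gen0: both barriers mark
    (0x6100, 0x6200),  # gen0 -> gen0: only the conservative barrier marks
    (0x4100, 0x4200),  # gen1 -> gen1: only the conservative barrier marks
    (0x6300, 0x4300),  # gen0 -> gen1: only the conservative barrier marks
    (0x0200, 0x0300),  # gen2 -> gen2: neither barrier marks
    (0x4400, 0x6400),  # gen1 -> gen0: both barriers mark
]

print(count_dirty_cards(WRITES))
```

For this made-up workload, the conservative barrier dirties five cards and the precise one dirties two. The real ratio depends entirely on the application, but the shape of the tradeoff is the same: a cheaper check per write leaves more cards for the GC to scan.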
The original design was therefore a compromise: it saved time on each write but paid for it later, during GC.
In .NET Runtime PR #111636, we changed the Arm64 design to match the x64 design that has used a WriteBarrierManager for years.
Instead of one universal helper, the runtime now uses 10 specialized write barrier variants. Each variant is optimized for a particular GC configuration. Some variants are tuned for server GC. Others provide more accurate dirty card marking. The most precise variant uses Armv8.1-LSE (Large System Extensions) instructions. Each variant assumes a specific GC state and removes redundant checks. This turns what used to be a complex code path into near straight-line execution. The result is smaller and faster barrier code.
The challenge is ensuring that the correct variant runs each time a reference changes. As before, all .NET code continues to call a single global write barrier function, because updating every call site would be impractical. Meanwhile, whenever relevant state changes inside the GC, the GC calls out to the WriteBarrierManager.
The WriteBarrierManager then decides which specialized write barrier function is the correct one to use and copies it over the top of the global write barrier function, flushing the instruction cache as necessary. In many cases, this happens once at startup. In others, the active barrier may switch dynamically as the program runs.
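The mechanism can be modeled in a few lines of Python. This is a sketch of the pattern only, not the runtime's code: the `GCState` fields, the three variant names, and the selection rules in `select` are hypothetical stand-ins, and the real manager copies machine code rather than swapping a function reference.

```python
from dataclasses import dataclass


@dataclass
class GCState:
    """Hypothetical subset of the GC state that influences barrier choice."""
    server_gc: bool = False
    precise_marking: bool = False


# Hypothetical specialized variants. Each one assumes a fixed GC state,
# so it does not need to re-check that state on every write.
def barrier_basic(dst, ref):
    return "basic"


def barrier_server(dst, ref):
    return "server"


def barrier_precise(dst, ref):
    return "precise"


class PatchableStub:
    """Stands in for the single global write barrier entry point."""

    def __init__(self):
        self.body = None
        self.patch_count = 0

    def patch(self, variant):
        if variant is not self.body:
            self.body = variant       # real code copies instructions here
            self.patch_count += 1     # ...and then flushes the instruction cache

    def __call__(self, dst, ref):
        return self.body(dst, ref)


class WriteBarrierManagerModel:
    def __init__(self, stub):
        self.stub = stub

    def select(self, state):
        # Assumed selection rules, for illustration only.
        if state.precise_marking:
            return barrier_precise
        if state.server_gc:
            return barrier_server
        return barrier_basic

    def on_gc_state_changed(self, state):
        self.stub.patch(self.select(state))


stub = PatchableStub()
manager = WriteBarrierManagerModel(stub)
state = GCState()
manager.on_gc_state_changed(state)

call_site = stub                      # managed code only ever knows this address
first = call_site(0x1000, 0x2000)     # runs barrier_basic

state.server_gc = True
manager.on_gc_state_changed(state)
second = call_site(0x1000, 0x2000)    # same call site, now runs barrier_server
print(first, second, stub.patch_count)
```

The point of the design is visible in `call_site`: it is bound once and never updated, yet the code it reaches changes whenever the manager patches the stub.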
The main point in this change is the tradeoff between the cost of each write and the cost during each collection.
Since writes happen very often, it may sound counterintuitive to make them more expensive. But in practice, the cost per write is tiny, and the additional work is only a couple of extra instructions. The savings during collection are significant. For example, imagine a large web service with a 64 GB heap, running background GC. Reducing the number of dirty cards by 5–10% can translate into fewer milliseconds of pause time per collection, multiplied across thousands of collections per day. That is a huge win in terms of tail latency and throughput.
This is a classic example of shifting work out of the critical path (GC pause time) into the steady-state path (writes). For modern server workloads, this tradeoff provides a clear benefit.
In the new version of the code, there are still constants that the write barrier needs to load. For example, it loads the locations of various tables and the offsets of the different regions within those tables. The offsets can change while the program runs, so the WriteBarrierManager must keep them up to date. In both the old and new versions of the code, these constants are placed in a small buffer directly after the main write barrier. This locality keeps the loads fast and lets the CPU cache them effectively.
On x64, this is done in a different way. Instead of using a buffer, x64 writes the constants directly into movabs instructions. This approach avoids loading these values from memory. It is possible because x64 instructions are variable-length, so a single 64-bit constant can be moved into a register in one instruction. Arm64 uses fixed 32-bit instructions and can move only a 16-bit constant in a single instruction. As a result, moving an arbitrary 64-bit constant takes up to four instructions: a MOVZ or MOVN followed by up to three MOVK instructions. This can be reduced if you need to move multiple constants that all share common parts.
In practice, the cost of a nearby load is low, and the saving from replacing it with four mov instructions is small, especially when you factor in the cost of a bigger code footprint. Arm does not generally recommend this optimization, so the new Arm64 code continues to load the constants from the buffer.
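The instruction count is easy to see with a small Python sketch that splits a 64-bit constant into 16-bit chunks and builds a MOVZ/MOVK sequence for it. The helper names `mov_sequence`, `execute`, and `to_assembly` are hypothetical, and the sketch ignores MOVN and other encodings a real assembler might choose.

```python
MASK64 = (1 << 64) - 1


def mov_sequence(value):
    """Return (op, imm16, shift) steps that build value with MOVZ then MOVK."""
    if not 0 <= value <= MASK64:
        raise ValueError("value must fit in 64 bits")
    steps = []
    for shift in (0, 16, 32, 48):
        chunk = (value >> shift) & 0xFFFF
        if chunk == 0:
            continue  # MOVZ zeroes every other bit, so zero chunks need no MOVK
        op = "movz" if not steps else "movk"
        steps.append((op, chunk, shift))
    if not steps:
        steps.append(("movz", 0, 0))  # the constant zero still needs one instruction
    return steps


def execute(steps):
    """Simulate the register contents after running the steps."""
    reg = 0
    for op, imm, shift in steps:
        if op == "movz":
            reg = imm << shift                     # write chunk, clear the rest
        else:
            reg &= ~(0xFFFF << shift) & MASK64     # MOVK: clear only this chunk
            reg |= imm << shift                    # ...and insert the new one
    return reg


def to_assembly(steps, reg="x0"):
    return [f"{op} {reg}, #0x{imm:x}, lsl #{shift}" for op, imm, shift in steps]


full = mov_sequence(0x0123456789ABCDEF)
sparse = mov_sequence(0x0000000000010000)

for line in to_assembly(full):
    print(line)
print(len(full), len(sparse))
```

A constant with all four 16-bit chunks non-zero needs four instructions, while one with a single non-zero chunk needs only one. Skipping zero chunks is a different saving from the shared-parts trick mentioned above, but both come from the same 16-bits-per-instruction limit.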
At the code level, the change included the following steps:
This work is part of the ongoing collaboration between Arm and Microsoft to make .NET on Arm64 as performant and mature as .NET on x64. Write barriers might appear minor, but they are among the most performance-critical components of the runtime, and improving them helps managed code run smoothly and efficiently on all platforms.
In fact, running .NET on Arm64-based Azure Cobalt 100 shows performance improvements compared with equivalent AMD systems on key workloads. With the announcement of Cobalt 200, we expect it to improve further.
The new WriteBarrierManager for Arm64 may not change how you write C# code, but it changes how the runtime behaves in subtle and important ways. By trading a little extra work per write for much more efficient GC scanning, we have made the runtime better suited to today’s memory-intensive workloads.
It is another reminder that runtime performance is not just about flashy vector intrinsics or JIT tricks. Sometimes the biggest wins come from making the invisible machinery of the garbage collector just a little bit smarter.