Accelerating Computational Lithography With GPU Rasterization

Addressing a bottleneck in nanometer-scale semiconductor manufacturing.

By Loay Hegazy, Mohamed Taher, and Sherif Hammouda

Semiconductor manufacturing at advanced nodes has become a race against physics. Even as feature sizes shrink, the tools that design and validate these circuits must operate with precision and speed. We show in a recent research paper that one part of the post-tapeout flow, rasterization, can be significantly sped up by deploying GPUs for massively parallel processing.

The challenge: Rasterization at the nanometer scale

Rasterization is the process of converting continuous geometric shapes—polygons and curves—into discrete pixels on a grid, as shown in Figure 1. In semiconductor manufacturing, rasterization underpins mask synthesis, lithography simulation and optical proximity correction (OPC), the techniques used to ensure that what’s designed actually manufactures correctly at nanometer scales.

Fig. 1: Example of converting a polygon into a pixel-based representation through rasterization.

The problem is that traditional CPU-based rasterizers are slow and may not scale to designs that contain billions of polygons and require trillions of pixel evaluations. For post-tapeout teams working under tight time-to-market pressure to order photomasks, this is unacceptable.

The stakes are equally high for accuracy. A 1 percent error in pixel coverage can translate to critical dimension variations that render a chip non-functional. Traditional binary rasterization, in which a pixel is either fully covered or not covered at all, is insufficient. The industry needs fractional pixel coverage to accurately simulate light intensity and resist behavior at nanometer scales.
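
To make the difference concrete, here is a minimal Python sketch comparing binary and fractional coverage for a single axis-aligned rectangle on a unit pixel grid. The function names, the pixel-centre sampling rule used for the binary case, and the 2.25-pixel-wide rectangle are illustrative assumptions, not details taken from the paper.

```python
def binary_coverage(rect, px, py):
    """Binary rule (assumed): 1.0 if the pixel centre lies inside the rectangle."""
    x0, y0, x1, y1 = rect
    cx, cy = px + 0.5, py + 0.5
    return 1.0 if (x0 <= cx < x1 and y0 <= cy < y1) else 0.0


def fractional_coverage(rect, px, py):
    """Exact overlap area between the rectangle and the unit pixel at (px, py)."""
    x0, y0, x1, y1 = rect
    w = max(0.0, min(x1, px + 1) - max(x0, px))
    h = max(0.0, min(y1, py + 1) - max(y0, py))
    return w * h


# Hypothetical feature: a line 2.25 pixels wide and 1 pixel tall.
rect = (0.0, 0.0, 2.25, 1.0)

# Sum coverage along the single row of three pixels the line touches.
binary_width = sum(binary_coverage(rect, px, 0) for px in range(3))
fractional_width = sum(fractional_coverage(rect, px, 0) for px in range(3))

print(binary_width, fractional_width)
```

In this example the binary rule reports a width of 2 pixels and drops the remaining quarter pixel, while the fractional rule recovers the full 2.25 pixels.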

This is where GPU acceleration enters the picture.

Why GPUs matter for semiconductor lithography

GPUs excel at data parallelism—processing thousands of independent tasks simultaneously. Rasterization, at its core, is a massively parallel problem: each pixel can be evaluated independently, and modern designs contain billions of pixels. This alignment between the problem structure and GPU architecture makes rasterization an ideal candidate for GPU acceleration.

However, simply porting a CPU rasterization algorithm to a GPU doesn’t work. GPUs have different memory hierarchies, different performance characteristics and different constraints. Irregular memory access patterns, which CPUs handle reasonably well, can cripple GPU performance. Precision requirements that are easy to meet on CPUs can be challenging on GPUs without careful algorithm design.

The semiconductor industry has recognized this opportunity. Siemens has been exploring GPU acceleration across its tool suites to tackle computationally intensive workflows, and rasterization has been a natural focus. In computational lithography, it is a particularly valuable application because:

  • The problem is inherently parallel
  • The bottleneck is severe
  • The precision requirements are non-negotiable
  • The impact on time-to-market is direct

What’s been missing is a rasterization algorithm specifically engineered for the GPU’s architecture while maintaining the precision demands of computational lithography.

A GPU-friendly approach to rasterization

Recent research has produced a massively parallel GPU rasterizer designed from the ground up for this problem. The algorithm decomposes rasterization into five key stages:

  1. Initialization and memory optimization: The algorithm reserves shared memory (fast, on-chip GPU memory) for polygon data, minimizing slower global memory access.
  2. Spatial partitioning: Polygons are assigned to GPU thread blocks based on their location, enabling simultaneous processing of multiple polygons.
  3. Bounding box calculation: Each polygon’s minimal bounding rectangle is computed, pruning unnecessary pixel evaluations outside this region.
  4. Fine-grained thread allocation: GPU threads are dynamically assigned to individual pixels or small pixel groups, enabling independent coverage calculations.
  5. Precision-preserving pixel classification: Each pixel is classified as inside, outside or on the boundary of the polygon using floating-point arithmetic.
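
The stages above can be sketched serially in Python for the simplest case, a single axis-aligned rectangle. This is an illustration under stated assumptions, not the paper's GPU kernel: the tile size `TILE`, the function names, and the rectangle-only classification are hypothetical. Stage 1 (shared-memory staging) has no Python analogue and is omitted, and the nested loops stand in for the thread blocks and threads of stages 2 and 4.

```python
import math

TILE = 4  # hypothetical tile edge in pixels; stands in for one GPU thread block


def bounding_box(rect, width, height):
    """Stage 3: pixel range [x0, x1) x [y0, y1) the rectangle can touch."""
    rx0, ry0, rx1, ry1 = rect
    return (max(0, math.floor(rx0)), max(0, math.floor(ry0)),
            min(width, math.ceil(rx1)), min(height, math.ceil(ry1)))


def tiles_for(bbox):
    """Stage 2: tiles (thread-block stand-ins) overlapped by the bounding box."""
    x0, y0, x1, y1 = bbox
    return [(tx, ty)
            for ty in range(y0 // TILE, (y1 - 1) // TILE + 1)
            for tx in range(x0 // TILE, (x1 - 1) // TILE + 1)]


def classify(rect, px, py):
    """Stage 5: label one pixel and return its fractional coverage."""
    rx0, ry0, rx1, ry1 = rect
    w = max(0.0, min(rx1, px + 1) - max(rx0, px))
    h = max(0.0, min(ry1, py + 1) - max(ry0, py))
    area = w * h
    if area <= 0.0:
        return "outside", 0.0
    if area >= 1.0:
        return "inside", 1.0
    return "boundary", area


def rasterize(rect, width, height):
    grid = [[0.0] * width for _ in range(height)]
    counts = {"inside": 0, "boundary": 0, "outside": 0}
    bx0, by0, bx1, by1 = bbox = bounding_box(rect, width, height)
    for tx, ty in tiles_for(bbox):
        # Stage 4: each (px, py) below is independent, so on a GPU it
        # could be handled by its own thread.
        for py in range(max(by0, ty * TILE), min(by1, (ty + 1) * TILE)):
            for px in range(max(bx0, tx * TILE), min(bx1, (tx + 1) * TILE)):
                label, area = classify(rect, px, py)
                counts[label] += 1
                grid[py][px] = area
    return grid, counts


# Hypothetical rectangle on an 8 x 4 pixel grid.
grid, counts = rasterize((1.5, 1.0, 6.0, 3.0), width=8, height=4)
total_coverage = sum(map(sum, grid))
```

Because pixels outside the bounding box are never visited, the work scales with the size of the shape rather than the size of the grid, which is the point of the pruning step.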

The last step deserves emphasis. Floating-point atomics—a GPU feature that allows multiple threads to safely update the same memory location with floating-point values—are critical here. When multiple polygons overlap a single pixel, floating-point atomics ensure that fractional coverage values are correctly accumulated. This enables the algorithm to compute the exact fractional area of a polygon within each pixel, essential for nanometer-scale accuracy. Figure 2 illustrates how the algorithm determines the fractional coverage of the polygon within the pixel using floating-point arithmetic.

Fig. 2: Simple example of pixel classification and processing.
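
The accumulation pattern can be imitated on a CPU. In the Python sketch below, one worker thread per polygon adds its coverage into a shared grid, and a lock plays the role that a hardware floating-point atomic add would play on a GPU. The function names, the rectangle-only shapes, and the use of a lock are assumptions made for illustration and are not the paper's implementation.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def overlap_area(rect, px, py):
    """Exact overlap between an axis-aligned rectangle and a unit pixel."""
    x0, y0, x1, y1 = rect
    w = max(0.0, min(x1, px + 1) - max(x0, px))
    h = max(0.0, min(y1, py + 1) - max(y0, py))
    return w * h


def accumulate(rects, width, height):
    grid = [0.0] * (width * height)
    lock = threading.Lock()  # stand-in for a floating-point atomic add

    def worker(rect):
        for py in range(height):
            for px in range(width):
                area = overlap_area(rect, px, py)
                if area > 0.0:
                    # Without this guard, two workers could read the same
                    # old value and one contribution would be lost.
                    with lock:
                        grid[py * width + px] += area

    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(worker, rects))  # list() surfaces worker exceptions
    return grid


# Two hypothetical rectangles that abut in the middle of pixel 1.
rects = [(0.0, 0.0, 1.5, 1.0), (1.5, 0.0, 3.0, 1.0)]
coverage = accumulate(rects, width=3, height=1)
```

Here the shared pixel receives half its area from each rectangle and ends up fully covered. In general, floating-point addition is not associative, so the order in which concurrent updates land can change the last bits of the sum even when every update is applied safely.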

For boundary pixels, the algorithm uses the trapezoidal rule to calculate the precise area of intersection between the polygon edge and the pixel. This is where sub-pixel precision comes from—not just knowing whether a pixel is covered, but knowing how much it’s covered.
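
One plausible way to realize this for a boundary pixel is sketched below in Python: clip the polygon to the pixel square, then sum the signed trapezoids between each clipped edge and the x-axis. The article does not give the paper's exact formulation, so the Sutherland-Hodgman clipping step, the function names, and the example triangle are assumptions. The sketch expects a simple polygon given as a list of (x, y) vertices.

```python
def clip_to_pixel(poly, px, py):
    """Sutherland-Hodgman clip against the unit pixel [px, px+1] x [py, py+1]."""

    def clip(points, inside, intersect):
        out = []
        for i, cur in enumerate(points):
            prev = points[i - 1]  # wraps to the last vertex when i == 0
            if inside(cur):
                if not inside(prev):
                    out.append(intersect(prev, cur))
                out.append(cur)
            elif inside(prev):
                out.append(intersect(prev, cur))
        return out

    def cross_x(x):
        # Intersection of segment a-b with the vertical line at x.
        return lambda a, b: (x, a[1] + (b[1] - a[1]) * (x - a[0]) / (b[0] - a[0]))

    def cross_y(y):
        # Intersection of segment a-b with the horizontal line at y.
        return lambda a, b: (a[0] + (b[0] - a[0]) * (y - a[1]) / (b[1] - a[1]), y)

    half_planes = [
        (lambda p: p[0] >= px, cross_x(px)),
        (lambda p: p[0] <= px + 1, cross_x(px + 1)),
        (lambda p: p[1] >= py, cross_y(py)),
        (lambda p: p[1] <= py + 1, cross_y(py + 1)),
    ]
    points = list(poly)
    for inside, intersect in half_planes:
        if not points:
            break
        points = clip(points, inside, intersect)
    return points


def trapezoid_area(points):
    """Sum signed trapezoids between each edge and the x-axis."""
    total = 0.0
    for i, (x1, y1) in enumerate(points):
        x0, y0 = points[i - 1]
        total += (x0 - x1) * (y0 + y1) / 2.0
    return abs(total)


def pixel_coverage(poly, px, py):
    return trapezoid_area(clip_to_pixel(poly, px, py))


# Hypothetical right triangle whose hypotenuse cuts pixels (1, 0) and (0, 1).
triangle = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]
coverage = {(px, py): pixel_coverage(triangle, px, py)
            for py in range(2) for px in range(2)}
```

For this triangle, the pixel at the origin is fully covered, the two pixels cut by the hypotenuse are each half covered, and the far corner pixel receives no coverage, so the four values sum to the triangle's area of 2.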

The algorithm also preserves connectivity between sub-pixel geometries. Naive pixelization can create artifacts—disconnected features or spurious gaps—that don’t exist in the original design. This algorithm prevents those artifacts, ensuring that thin features remain connected and manufacturable.

Performance and accuracy results

The algorithm was benchmarked on NVIDIA H100 GPUs against highly optimized CPU implementations across two categories of designs:

Manhattan geometries (rectilinear shapes, common in standard cell layouts): up to 290x speedup compared to optimized CPU implementations (Figure 3).

Fig. 3: CPU and GPU runtimes for Manhattan datasets. GPU achieved large speedups with pixel errors under 1% against CPU results.

Curvilinear geometries (complex curves, increasingly common in advanced designs): up to 45x speedup compared to optimized CPU implementations (Figure 4).

Fig. 4: CPU and GPU runtimes for curvilinear datasets. GPU achieved large speedups with pixel errors under 1% against CPU results.

The speedup difference reflects the relative complexity of the two shape types. Manhattan shapes are simpler to rasterize, allowing the GPU’s massive parallelism to shine. Curvilinear shapes are more computationally demanding, but even a 45x speedup is transformative for a workflow that currently takes hours.

Critically, these speedups come without sacrificing accuracy. The GPU rasterizer achieved less than 1 percent absolute error compared to reference CPU calculations. This confirms that the precision requirements for nanometer-scale manufacturing are met, even with aggressive parallelization.
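
A check of this kind can be sketched in a few lines of Python. The article does not define the error metric precisely, so this sketch assumes "absolute error" means the largest per-pixel difference in fractional coverage and treats 1 percent as 0.01 coverage units. The sample values are made up, and single-precision rounding is used only as a stand-in for whatever differences arise between a CPU reference and a GPU result.

```python
import struct


def to_float32(value):
    """Round a Python float (double precision) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", value))[0]


def max_abs_error(reference, candidate):
    """Largest per-pixel absolute difference between two coverage grids."""
    return max(abs(r - c)
               for ref_row, cand_row in zip(reference, candidate)
               for r, c in zip(ref_row, cand_row))


# Hypothetical reference coverage values, not data from the paper.
reference = [[0.0, 0.1, 1.0],
             [0.3, 0.7, 1.0]]
candidate = [[to_float32(v) for v in row] for row in reference]

TOLERANCE = 0.01  # assumed reading of "less than 1 percent absolute error"
error = max_abs_error(reference, candidate)
within_tolerance = error < TOLERANCE
```

A maximum-based metric is the strictest choice, since a single badly rasterized pixel fails the check. An average-based metric would be more forgiving, and the article does not say which was used.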

What this means for the industry

Faster iterations and competitive advantage

OPC, of which rasterization is a part, is a complicated and time-consuming process. Speedups of up to 45x for curvilinear designs and up to 290x for Manhattan designs in the rasterization step translate directly to faster turnaround times.

These performance gains matter especially for teams developing chips for automotive, security and pervasive computing applications, where design complexity is compounded by reliability requirements and high-volume manufacturing economics. Automotive-grade chips demand zero-defect manufacturing at scale, security processors require precise implementation of physical unclonable functions and side-channel resistant layouts, and pervasive computing drives demand for cost-optimized designs that must still meet strict performance and power envelopes—all scenarios where faster, more accurate lithography verification translates directly to competitive advantage.

Improved design quality and yield

The ability to run more comprehensive verification and optimization directly improves design quality. OPC is fundamentally about predicting and correcting for the distortions introduced by the lithographic process. The more thoroughly OPC can be run—the more design variations explored, the more edge cases tested—the more accurate the final mask will be.

This translates to improved yield. Better OPC means fewer critical dimension variations, fewer bridging failures, fewer open circuits. The cumulative impact across a production run can be substantial. For a high-volume product, even a 1–2 percent improvement in yield can represent millions of dollars in additional revenue.

Scalability with hardware evolution

GPU hardware is evolving rapidly. Each generation brings more cores, higher memory bandwidth and new architectural features. The GPU rasterizer algorithm is designed to scale with these improvements. Unlike CPU algorithms, which have hit fundamental limits in single-thread performance, GPU algorithms can continue to benefit from hardware scaling for years to come.

This means that investments in GPU-accelerated rasterization tools will continue to pay dividends as hardware improves. A tool deployed today will become faster and more capable as new GPU generations arrive, without requiring algorithmic changes. For tool vendors and their customers, this is a significant advantage over approaches that have already reached the limits of CPU optimization.

Precision at scale: a non-trivial achievement

The ability to maintain sub-pixel accuracy while processing billions of polygons and trillions of pixel evaluations is non-trivial. Many GPU acceleration efforts sacrifice precision for speed. This algorithm demonstrates that speed and precision are not mutually exclusive—that it’s possible to achieve both.

This is critical for computational lithography. A fast algorithm that’s inaccurate is worse than useless—it’s dangerous, because it creates a false sense of confidence in designs that may not be manufacturable. The less-than-1-percent error achieved here means that the speedups come without hidden costs or risks. Design teams can trust the results.

Looking ahead

This work is part of a larger trend in EDA: leveraging GPU acceleration to tackle computationally intensive workflows. Siemens EDA has been investing in GPU-accelerated tools across its portfolio, from simulation through to manufacturing analysis.

The next steps involve integration into production EDA platforms and optimization for heterogeneous computing environments (mixing CPUs, GPUs and potentially other accelerators).

For advanced-node semiconductor manufacturers, the message is clear: computational lithography is entering a new era where GPU acceleration is not optional but essential for competitive design cycles. The tools that can deliver both speed and precision at nanometer scales will define the next generation of semiconductor development.

Further reading

W. Liu and R. K. Cavin III, “Rasterization theory, architectures, and implementations for a class of two-dimensional problems,” Integration, vol. 6, no. 2, pp. 179–199, 1988. https://doi.org/10.1016/0167-9260(88)90038-7

T. Matsunawa, B. Yu, and D. Z. Pan, “Optical proximity correction with hierarchical Bayes model,” Journal of Micro/Nanolithography, MEMS, and MOEMS (JM3), vol. 15, no. 2, p. 021009, 2016. https://doi.org/10.1117/1.JMM.15.2.021009

Yixiao Ding, Chris Chu, and Xin Zhou, “An efficient shift invariant rasterization algorithm for all-angle mask patterns in ILT,” in Proceedings of the 52nd Annual Design Automation Conference (DAC ’15). https://doi.org/10.1145/2744769.2744797

Loay Hegazy, Mohamed Taher, and Sherif Hammouda, “A massively parallel GPU rasterizer for next-generation computational lithography,” Siemens EDA technical paper.

Mohamed Taher is a senior technical manager at Siemens EDA and professor of computer engineering at Ain Shams University.

Sherif Hammouda is a senior engineering director at Siemens EDA.


