How to make IC design more accessible to more people.
Integrated circuit (IC) design is often considered a “black art,” restricted to only those with advanced degrees or years of training in electrical engineering. Given that the semiconductor industry is struggling to expand its workforce, IC design must be rendered more accessible.
The benefit of customized computing
General-purpose computers are widely used, but their performance improvement has slowed significantly—from over 50% annually in the 1990s to only a few percent in recent years—due to challenges in power supply scaling, heat dissipation, space and cost.
Instead, the research community and industry have turned to customized computing, which improves performance by matching customized architectures to the workloads of specific application domains. A good example is the Tensor Processing Unit (TPU) announced by Google in 2017 for accelerating deep-learning workloads. Designed in 28nm CMOS technology as an application-specific integrated circuit (ASIC), the TPU demonstrated close to a 200x performance-per-watt advantage over the general-purpose Haswell central processing unit (CPU), a leading server-class CPU at the time of publication. Such customized accelerators, or domain-specific accelerators (DSAs), achieve efficiency via customized data types and operations, customized memory accesses, massive parallelism and much-reduced instruction and control overhead.
However, this customization comes at a high cost (approaching $300M at 7nm, according to McKinsey), which few can afford. Field-programmable gate arrays (FPGAs) offer an attractive, cost-efficient alternative for DSA implementation. With its programmable logic, programmable interconnects and customizable building blocks, namely block random access memory (BRAM) and digital signal processing (DSP) blocks, an FPGA can be customized to implement a DSA without going through a lengthy fabrication process, and it can be reconfigured for a new DSA in seconds. Moreover, FPGAs have become available in public clouds, such as Amazon AWS F1 and Nimbix. One can create a DSA on an FPGA in these clouds and use it at a rate of $1-$2/hour to accelerate desired applications, even if FPGAs are unavailable in the local computing facility. My research lab has developed efficient FPGA-based accelerators for multiple applications, such as data compression, sorting and genomic sequencing, with 10x–100x gains in performance/power efficiency over state-of-the-art CPUs.
The barrier to customized computing
However, creating DSAs in ASICs or FPGAs is considered hardware design, typically using register-transfer level (RTL) hardware description languages such as Verilog or VHDL, with which most software programmers are unfamiliar. According to 2020 U.S. Bureau of Labor Statistics data, there were over 1.8M software developers in the United States but fewer than 70,000 hardware engineers.
Recent progress in high-level synthesis (HLS) shows promise in making circuit design more accessible, as it can automatically compile computation kernels written in C, C++ or OpenCL into an RTL description for ASIC or FPGA implementation. The quality of the circuits generated by existing HLS tools depends heavily on the structure of the input C/C++ code and on the hardware implementation hints (called “pragmas”) provided by designers. For example, for the simple 7-line code of the one-layer convolutional neural network (CNN) widely used in deep learning, shown in Figure 1, an existing commercial HLS tool generates an FPGA-based accelerator that is 108x slower than a single-core CPU. However, after the input C code is properly restructured (to tile the computation, for example) and 28 pragmas are inserted, the final FPGA accelerator is 89x faster than a single-core CPU (over 9,000x speedup over the initial unoptimized HLS solution).
Fig. 1: Simple 7-line code of one-layer convolutional neural network (CNN) widely used in deep learning
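A kernel of this kind is typically a nest of six loops around a single multiply-accumulate statement. The following is a minimal sketch of that shape, not necessarily the exact code from Figure 1; the function name, array names and layer sizes are assumptions chosen for illustration.

```c
/* Assumed, illustrative sizes; the article does not give layer dimensions. */
#define NUM_OUT 4   /* output feature maps */
#define NUM_IN  3   /* input feature maps  */
#define ROWS    8   /* output rows         */
#define COLS    8   /* output columns      */
#define KSIZE   3   /* kernel height/width */

/* One convolutional layer: six nested loops and one multiply-accumulate.
 * The caller is expected to zero `out` before the call. */
void cnn_layer(float out[NUM_OUT][ROWS][COLS],
               float in[NUM_IN][ROWS + KSIZE - 1][COLS + KSIZE - 1],
               float weight[NUM_OUT][NUM_IN][KSIZE][KSIZE])
{
    for (int o = 0; o < NUM_OUT; o++)
        for (int r = 0; r < ROWS; r++)
            for (int c = 0; c < COLS; c++)
                for (int i = 0; i < NUM_IN; i++)
                    for (int p = 0; p < KSIZE; p++)
                        for (int q = 0; q < KSIZE; q++)
                            out[o][r][c] += weight[o][i][p][q] * in[i][r + p][c + q];
}
```

Code like this is easy for a software programmer to write, but it says nothing about how the loops should be mapped to hardware, which is why a direct HLS compilation can perform poorly.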
These pragmas (hardware design hints) inform the HLS tool where to parallelize and pipeline the computation, how to partition the data arrays to map to on-chip memory blocks, etc. However, most software programmers do not know how to perform these hardware-specific optimizations.
Our solutions
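To make the idea concrete, here is a small sketch of what such hints can look like on a dot-product kernel. The pragma spellings follow the AMD/Xilinx Vitis HLS style, which is an assumption (the article does not name the commercial tool it measured), and the function name, array size and factor of 4 are illustrative only. An ordinary C compiler ignores the unknown pragmas, possibly with a warning, so the function still runs as plain software.

```c
#define N 64  /* assumed vector length, for illustration */

int dot_product(const int a[N], const int b[N])
{
    /* Local copies stand in for on-chip buffers. */
    int buf_a[N];
    int buf_b[N];
    /* Hint: split each buffer across 4 memory banks so that
     * 4 elements can be read in the same cycle. */
#pragma HLS array_partition variable=buf_a cyclic factor=4 dim=1
#pragma HLS array_partition variable=buf_b cyclic factor=4 dim=1

    for (int i = 0; i < N; i++) {
        /* Hint: start a new loop iteration every cycle. */
#pragma HLS pipeline II=1
        buf_a[i] = a[i];
        buf_b[i] = b[i];
    }

    int sum = 0;
    for (int i = 0; i < N; i++) {
        /* Hint: replicate the loop body 4 times to run in parallel. */
#pragma HLS unroll factor=4
        sum += buf_a[i] * buf_b[i];
    }
    return sum;
}
```

Note that the unroll factor and the partition factor have to agree for the parallel copies to be fed with data; choosing such matching values is exactly the kind of hardware knowledge most software programmers lack.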
To enable more people to design DSAs based on software-programmer-friendly code, we take a three-pronged approach:
• Architecture-guided optimization
• Automated code transformation and pragma insertion
• Support of high-level domain-specific languages (DSLs)
One good example of architecture-guided optimization is the automated generation of systolic arrays (SAs), an efficient architecture that uses only local communication between adjacent processing elements. It’s used by the TPU and many other deep-learning accelerators, but it’s not easy to design. A 2017 Intel study showed that 4–18 months are required to design a high-quality SA, even with HLS tools. Our recent work, AutoSA, provides a fully automated solution. Once a programmer marks a section of C or C++ code to be implemented in the SA architecture, AutoSA can generate an array of processing elements and an associated data communication network, optimizing computation throughput. For the CNN example, AutoSA generates an optimized SA with over 9,600 lines of C code including pragmas, achieving more than 200x speedup over a single-core CPU.
For programs that do not easily fit common computation patterns (such as SA or stencil computation, for which we have good solutions using architecture-guided optimization), our second approach is to perform automated code transformation and pragma insertion to repeatedly parallelize or pipeline the computation, based on bottleneck analysis or guided by graph-based deep learning. Building upon the open-source Merlin Compiler from AMD/Xilinx (originally developed by Falcon Computing Solutions), our tool, named AutoDSE, can eliminate most, if not all, pragmas inserted by expert hardware designers and achieve comparable or even better performance (as demonstrated on Xilinx’s Vitis HLS library for vision acceleration).
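The local-communication property of a systolic array can be illustrated with a small software simulation. The sketch below is not AutoSA output; it is a hand-written model of one common SA organization (output-stationary matrix multiplication), and the function name and the sizes M, N and K are assumptions for illustration. Each processing element (PE) only ever reads from its left and upper neighbors, and only the PEs on the left and top edges touch the input matrices.

```c
#define M 3  /* PE rows    = rows of A        (assumed size) */
#define N 3  /* PE columns = columns of B     (assumed size) */
#define K 4  /* length of the reduction dimension            */

/* Cycle-by-cycle model of an M x N systolic array computing C = A * B. */
void systolic_matmul(int A[M][K], int B[K][N], int C[M][N])
{
    int a_reg[M][N] = {{0}};  /* value each PE forwards to its right neighbor */
    int b_reg[M][N] = {{0}};  /* value each PE forwards to the PE below it    */

    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            C[i][j] = 0;

    /* The last product reaches PE(M-1, N-1) at cycle K + M + N - 3. */
    for (int t = 0; t < K + M + N - 2; t++) {
        /* Visit PEs from bottom-right to top-left so that every read of a
         * neighbor register still sees the previous cycle's value. */
        for (int i = M - 1; i >= 0; i--) {
            for (int j = N - 1; j >= 0; j--) {
                int ka = t - i;  /* row i of A enters the left edge i cycles late   */
                int kb = t - j;  /* column j of B enters the top edge j cycles late */
                int a_in = (j > 0) ? a_reg[i][j - 1]
                                   : ((ka >= 0 && ka < K) ? A[i][ka] : 0);
                int b_in = (i > 0) ? b_reg[i - 1][j]
                                   : ((kb >= 0 && kb < K) ? B[kb][j] : 0);
                C[i][j] += a_in * b_in;  /* the accumulator stays inside the PE */
                a_reg[i][j] = a_in;
                b_reg[i][j] = b_in;
            }
        }
    }
}
```

Even in this toy form, the skewed input timing and the cycle count need care to get right, which hints at why a production-quality SA with its data communication network takes months to design by hand.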
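The idea of repeatedly optimizing the current bottleneck can be shown with a toy model. The sketch below is not the AutoDSE algorithm; it is a simplified greedy loop under an invented cost model, where the `Loop` struct, its fields, the `explore` function and the resource budget are all hypothetical. A real flow would obtain latency and resource numbers from the HLS tool rather than from a formula.

```c
/* Hypothetical description of one loop in the design. */
typedef struct {
    long work;       /* cycles needed when run sequentially          */
    int  parallel;   /* current parallel factor (1 = no parallelism) */
    int  unit_cost;  /* resource units consumed per parallel lane    */
} Loop;

/* Toy latency model: work divided evenly across lanes, rounded up. */
static long loop_latency(const Loop *l)
{
    return (l->work + l->parallel - 1) / l->parallel;
}

/* Greedily double the parallel factor of the slowest loop until the
 * resource budget is exhausted. Returns the total modeled latency. */
long explore(Loop loops[], int n, int budget)
{
    int used = 0;
    for (int i = 0; i < n; i++)
        used += loops[i].parallel * loops[i].unit_cost;

    for (;;) {
        /* Bottleneck analysis: find the loop with the highest latency. */
        int b = 0;
        for (int i = 1; i < n; i++)
            if (loop_latency(&loops[i]) > loop_latency(&loops[b]))
                b = i;

        /* Doubling the factor costs as many resources as the loop uses now. */
        int extra = loops[b].parallel * loops[b].unit_cost;
        if (loops[b].parallel >= loops[b].work || used + extra > budget)
            break;  /* no further gain possible, or over budget */

        loops[b].parallel *= 2;
        used += extra;
    }

    long total = 0;
    for (int i = 0; i < n; i++)
        total += loop_latency(&loops[i]);
    return total;
}
```

The point of the sketch is the control structure: measure, find the bottleneck, apply one more transformation there, and repeat. The programmer never writes a pragma; the search decides where the parallelism goes.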
The third effort is to further raise the level of design abstraction to support DSLs so that software developers in certain application domains can create DSAs easily. For example, based on the open-source HeteroCL intermediate representation, we can support Halide, a widely used image processing DSL with the advantageous property of decoupling algorithm specification from performance optimization. For the blur-filter example written in 8 lines of Halide code, our tool can generate 1,455 lines of optimized HLS C code with 439 lines of pragmas, achieving 3.9x speedup over a 28-core CPU.
These efforts combine into a programming environment and compilation flow that is friendly to software programmers, empowering them to create DSAs efficiently and affordably (especially on FPGAs). This is critical to democratizing customized computing.
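The decoupling of algorithm from optimization can be mimicked in plain C. The sketch below does not use Halide and is not the code our tool generates; it hand-writes one blur algorithm (a horizontal 3-tap average followed by a vertical one) under two different execution strategies. The image size and all function names are assumptions for illustration.

```c
#define W 8  /* assumed image width  */
#define H 8  /* assumed image height */

/* Algorithm, stage 1: horizontal 3-tap average, valid for x in [1, W-2]. */
static int blur_x_at(int in[H][W], int x, int y)
{
    return (in[y][x - 1] + in[y][x] + in[y][x + 1]) / 3;
}

/* Strategy A: compute and store the whole intermediate image first.
 * Uses more memory, but computes each stage-1 value only once. */
void blur_two_pass(int in[H][W], int out[H][W])
{
    int tmp[H][W];
    for (int y = 0; y < H; y++)
        for (int x = 1; x < W - 1; x++)
            tmp[y][x] = blur_x_at(in, x, y);

    for (int y = 1; y < H - 1; y++)
        for (int x = 1; x < W - 1; x++)
            out[y][x] = (tmp[y - 1][x] + tmp[y][x] + tmp[y + 1][x]) / 3;
}

/* Strategy B: recompute stage 1 on demand for every output pixel.
 * Needs no intermediate buffer, but does three times the stage-1 work. */
void blur_fused(int in[H][W], int out[H][W])
{
    for (int y = 1; y < H - 1; y++)
        for (int x = 1; x < W - 1; x++)
            out[y][x] = (blur_x_at(in, x, y - 1) +
                         blur_x_at(in, x, y) +
                         blur_x_at(in, x, y + 1)) / 3;
}
```

Both functions write the same interior pixels with the same values and leave the one-pixel border untouched. In C the two strategies require two separate functions, whereas a DSL such as Halide lets the programmer state the algorithm once and choose the strategy separately, which is what makes it a convenient front end for generating hardware.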
Broadening participation
In their 2018 Turing Award lecture, “A New Golden Age for Computer Architecture,” John L. Hennessy and David A. Patterson concluded, “The next decade will see a Cambrian explosion of novel computer architectures, meaning exciting times for computer architects in academia and in industry.” We would like to extend participation in this exciting journey to performance-oriented software programmers, enabling them to create their own customized architectures and accelerators on FPGAs, or even ASICs, and to achieve significant performance and energy-efficiency improvements.
This article is based on Jason Cong’s recent Vision Address at the 35th International Conference on VLSI Design. The entire talk can be found here.