Speeding Up AI With Vector Instructions

Uses, challenges and tradeoffs in working with vector engines.


A search is underway across the industry to find the best way to speed up machine learning applications, and optimizing hardware for vector instructions is gaining traction as a key element in that effort.

Vector instructions are a class of instructions that enable parallel processing of data sets. An entire array of integers or floating point numbers is processed in a single operation, eliminating the loop control mechanism typically found in processing arrays. That, in turn, improves both performance and power efficiency.
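
To make the contrast concrete, here is a minimal sketch in C. It assumes an x86 core with 256-bit AVX and an array length divisible by eight (analogous intrinsics exist for Arm and RISC-V); the scalar loop handles one element per iteration, while the vector version processes eight floats per instruction.

#include <immintrin.h>   // AVX intrinsics

// Scalar version: loop control and one add per element.
void add_scalar(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

// Vector version: one 256-bit instruction adds eight floats at a time.
// Minimal sketch only; assumes an AVX-capable core and n divisible by 8.
void add_vector(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(&a[i]);              // load 8 floats
        __m256 vb = _mm256_loadu_ps(&b[i]);
        _mm256_storeu_ps(&c[i], _mm256_add_ps(va, vb));  // 8 adds in one operation
    }
}

The pattern is the same whichever ISA supplies the vectors; what changes is the vector width and how the tail of the array is handled.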

This concept works particularly well with sparse matrix operations used for those data sets, which can achieve a substantial performance boost by being vectorized, said Shubhodeep Roy Choudhury, CEO at Valtrix Systems.

This is harder than it might appear, however. There are design issues involving moving data in and out of memories and processors, and there are verification challenges due to increased complexity and the overall size of the data sets. Nevertheless, demand for these kinds of performance/power improvements is spiking as the amount of data increases, and vector instructions are an important piece of the puzzle.

“While Nvidia did a great job in getting these applications to run on their GPUs, they’re very expensive, very power-hungry, and not really targeted at it,” said Simon Davidmann, CEO of Imperas. “Now, engineering teams are building dedicated hardware, which will run these AI frameworks quickly. Developers are looking at vectors to run machine learning algorithms fast.”

Vector instructions or extensions are not new. In fact, they are a critical part of modern CPU architectures, and are used in workloads from image processing to scientific simulation. Intel, Arm, ARC, MIPS, Tensilica, and others have paved the way for newcomers like the RISC-V ISA. What’s changing is the increasing specialization and optimization of both.

The Arm architecture first gained support for fixed-width SIMD-style vectors in Armv6, which rapidly evolved into Neon in Armv7-A, according to Martin Weidmann, director of product management in Arm’s Architecture and Technology Group. More recently, Arm introduced the Scalable Vector Extension (SVE), with support for variable vector lengths and predicates. SVE already has seen adoption, including in the world’s fastest supercomputer.

CPU architectures, such as the Arm architecture, are essentially a contract between hardware and software. “The architecture describes what behaviors the hardware must provide, and the software can rely on,” Weidmann explained. “Crucial to this is consistent behavior. Developers need to have confidence that any device implementing that architecture will give the same behaviors.”

For example, developers must be sure that any code running on an Arm-based design will exhibit the behavior described in the Arm Architecture Reference Manual. To that end, Arm created compliance-testing resources.

Writing comprehensive compliance suites for something as versatile as a modern CPU is always a challenge, Weidmann noted. Vector extensions only add to that challenge, particularly in the following areas:

  • Testing of vector-length-agnostic (VLA) behavior for SVE and SVE2 (a minimal VLA sketch follows this list);
  • Dealing with the dependencies introduced by concurrent processing of multiple elements, including exception handling and floating-point correctness; and
  • Complexities of scatter-gather load and store operations.
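
As a rough illustration of what vector-length-agnostic code looks like, here is a minimal sketch using Arm's SVE C intrinsics, assuming an SVE-capable core and compiler. The same loop runs unmodified whether the hardware implements 128-bit or 2,048-bit vectors, with the predicate masking off the final partial vector:

#include <stdint.h>
#include <arm_sve.h>   // Arm SVE C intrinsics (ACLE)

// Vector-length-agnostic add: the hardware's vector width is discovered at run time.
void vla_add(const float *a, const float *b, float *c, int64_t n) {
    for (int64_t i = 0; i < n; i += svcntw()) {   // svcntw() = 32-bit elements per vector
        svbool_t pg = svwhilelt_b32(i, n);        // predicate masks the final partial vector
        svfloat32_t va = svld1_f32(pg, &a[i]);
        svfloat32_t vb = svld1_f32(pg, &b[i]);
        svst1_f32(pg, &c[i], svadd_f32_x(pg, va, vb));
    }
}

Because the vector width is not fixed in the code, compliance and verification suites have to confirm correct behavior across every width an implementation might choose.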


Fig. 1: Example of how longer vectors can be processed in parallel. Source: Arm

Graham Wilson, product marketing manager for ARC processors at Synopsys, pointed out that engineering teams are employing a combination of processing capabilities, pairing code that traditionally would run on a controller core with code that would run on a DSP. All of that is now merging into computation done on a unified processor.

“We’re seeing more traditional vector DSPs that have taken on more of the role of scalar processing, or the ability to handle more control code, because a lot of this is driven by devices running on the IoT edge that want smaller, lower-power computation. There’s also a broader range of code — from control code to DSP code to vector code — and now there’s AI algorithm computation, all driven by small-size and low-power needs to run on a single processor. We see a trend from the more traditional vector DSPs, which have better control of computation and operation, along with a trend from the normal controller processors, like Arm cores and others, to compute more vector code. Vector extensions are the path that allows a basic controller processor to run vector operations and vector code on a single controller core.”

SoC designers get these benefits for free, as the vector unit is contained inside the CPU. “It typically doesn’t directly touch anything outside of it, so from a hardware designer’s perspective, whether they’re including a CPU that happens to have vector extensions or not, it’s essentially the same to them,” noted Russell Klein, HLS platform program director at Mentor, a Siemens Business.

It’s the software designer who is going to need to take advantage of this, and who has to worry about how to program those vector extensions, Klein said. “That’s always been a bit of a challenge. In the traditional C programming language — and the C++ that folks are using to write programs that run on these CPUs — there isn’t a direct mapping from some particular C construct into the use of the vector extensions. Typically, there are a number of different ways that you can end up accessing these features. Rather than writing C code, the most basic one is to write in assembly language, and then you can call the vector instructions directly. Most people don’t like to do that because it’s a lot of work and it doesn’t merge well with the C code they’re working with everywhere else.”

To address this, processor companies such as Arm and Intel have written libraries that take advantage of these vector instructions, providing routines for operations such as a fast Fourier transform or a matrix multiply. “They’ve gone ahead and coded everything in assembly language to take advantage of those vector processing operations in the way that the CPU designers intended,” he explained. “Then the user writing a program just calls the specialized FFT or matrix multiply, and it uses that. It’s an easy way for Intel and Arm to propagate that, and I would expect the RISC-V community to do the same thing. The Holy Grail is to have your C compiler be smart enough to look at your loops and understand that this could be vectorized.”
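
As a hedged example of that library route, the sketch below hands a small single-precision matrix multiply to a vendor-supplied BLAS (Intel's MKL, Arm Performance Libraries, and OpenBLAS all expose this CBLAS interface), so the hand-vectorized code lives in the library rather than in the application. The wrapper function name is illustrative:

#include <cblas.h>   // CBLAS interface, shipped with vendor math libraries

// C = A * B for 3x3 single-precision, row-major matrices, delegated to the
// vendor's vectorized sgemm instead of a hand-written triple loop.
void small_matmul(const float *A, const float *B, float *C) {
    const int n = 3;
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,        /* M, N, K */
                1.0f, A, n,     /* alpha, A, lda */
                B, n,           /* B, ldb */
                0.0f, C, n);    /* beta, C, ldc */
}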

This is a hard problem that hasn’t been solved in the past, although work is underway by the team building LLVM, which claims to be able to recognize vectorizable loops and emit vector instructions, Klein said.
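
For illustration, a loop like the one below is simple enough for LLVM's loop vectorizer to handle. Compiling with clang -O2 -Rpass=loop-vectorize makes the compiler report which loops it vectorized and with what vector width; the function and file names here are illustrative:

// Compile with: clang -O2 -Rpass=loop-vectorize saxpy.c -c
// The restrict qualifiers tell the compiler the arrays cannot overlap,
// which is what allows it to vectorize the loop without runtime checks.
void saxpy(int n, float a, const float *restrict x, float *restrict y) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}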

Integration concerns
Another significant consideration is how the vector unit should be integrated into a core, whether it is tightly coupled to the pipeline or implemented as an independent unit.

“If you look at the teams already working on that, nearly all of them decided to go with a separate vector unit that’s connected to the main pipeline, like in the execution stage,” said Zdenek Prikryl, CTO of Codasip. “And then, basically, you can run the operations in the vector unit separately. You don’t have to stall the main pipeline unless there is a dependency. This is something that we are targeting — to have an independent engine that communicates or is tightly coupled with the main core, but is not inside the pipeline of the main core. At the beginning, there should be some kind of queue for instructions like multiply-accumulates, or maybe for floats and integers for loads and stores, and so on. And in the end, we have some kind of common stage in which you can then put the data into the register files. Also bear in mind the memory subsystem, because if you have a vector engine, there are tons of data you have to process. So the memory subsystem is a key point as well — maybe even a bigger concern, because you have to be able to feed the engine and, at the same time, pick up the data.”

High throughput is essential for the vector engine, so wide interfaces, such as 512 bits, and tightly coupled memories (TCMs) that can supply data quickly are optimal. “These are the main questions that we have to ask at the beginning of the design,” said Prikryl. “That’s needed to create an architecture in a way that is not blocked by the memory subsystem, and is not blocked by the pipeline of the main scalar part, so the vector unit can produce outputs, can work with the memory, and asks the main core only when it’s necessary to communicate.”

The RISC-V vector engine allows for the selection of register width. “If you are targeting a smaller system, it can be smaller,” Prikryl said. “If you are targeting a big server beast, then you have really wide registers, and you basically have to tie the memory bandwidth to these registers somehow. Then you are constrained by this, so how wide are you targeting the throughput? Usually you have to live with the standards that are out there, like the AMBA standard. And then there are some limitations, as well, like 1,024 bits at the most. But at the same time, if you’re targeting such a wide interface, you usually suffer on latency or frequency because it’s quite wide. So there is some kind of compromise. We would like to provide the fast data from the TCMs to be able to feed the data in reasonably fast. At the same time, we have to think about the programming model in the case of the memory subsystem. I’d also like the possibility to load the data through the standard cache because the programming model is easier. If you write the C code, then eventually you can store the vectors not only to the vector memory, but also to memory through the cache. And then, with the scalar part, you can touch the vector and change things here and there.”
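
For illustration, here is a minimal strip-mined loop using the RISC-V vector C intrinsics, assuming a compiler that implements the v1.0 intrinsics naming. Because the hardware reports how many elements fit per iteration, the same code runs on an implementation with narrow or wide vector registers:

#include <stddef.h>
#include <riscv_vector.h>   // RVV C intrinsics; v1.0 intrinsic naming assumed

// Vector-length-agnostic add: vsetvl asks the hardware how many 32-bit elements
// fit in the register group this iteration, so the binary is independent of VLEN.
void rvv_add(const float *a, const float *b, float *c, size_t n) {
    for (size_t i = 0; i < n;) {
        size_t vl = __riscv_vsetvl_e32m8(n - i);              // SEW=32, LMUL=8
        vfloat32m8_t va = __riscv_vle32_v_f32m8(&a[i], vl);
        vfloat32m8_t vb = __riscv_vle32_v_f32m8(&b[i], vl);
        __riscv_vse32_v_f32m8(&c[i], __riscv_vfadd_vv_f32m8(va, vb, vl), vl);
        i += vl;
    }
}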

Yet another consideration is that there must be a way to feed the engine. The engine should be able to communicate with the scalar portion through the standard cached memory subsystem, and it has to be designed this way. “We have to balance the programming model with the way users are able to program the machine,” he said. “It should be as easy as possible, which means we should give them instructions on how to do vectorization. We should give them the stack manipulation, and these kinds of things. These are usually done through the combination of the main memory and TCM. It’s not just a TCM for which you must preload the data somehow. These two worlds can be combined so it’s easy to program, and then I’m able to feed it through the TCM and can still provide the data there. But if I need something that’s not a critical part of the engine, it can work through the cache. It doesn’t have to go to the TCM. In this way, the memory subsystem can be tricky.”

Mentor’s Klein noted that one area of concern is the memory subsystem tied to the registers. “You need to be able to get data into these registers for performing the operations, and then you need to get the results back out,” he said. “For example, on an Arm core you can have a register up to 2,048 bits wide. If the bus width out to memory is 128 bits wide, what’s very quickly going to happen is that the vector processing unit is going to be starved of data because you won’t be able to pull it in from main memory fast enough. Then you also want to look at the path from the caches into the CPU. That can be wider than the path out to main memory, because fundamentally it’s not very difficult to build a vector processing unit that would consume more bandwidth than you have available in and out of main memory. If that’s the case, the engine has been over-engineered and you can’t get enough data to it fast enough or drain the results away fast enough to really take advantage of that acceleration that is available there.”
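
The mismatch Klein describes is easy to quantify with the numbers in his example. This is a back-of-the-envelope sketch, not a statement about any particular core:

// Assumed numbers from the example above: a 2,048-bit vector register
// fed over a 128-bit bus needs 16 bus transfers just to fill one register,
// so a unit consuming one register per cycle would sit idle most of the time.
enum {
    VECTOR_REG_BITS = 2048,
    BUS_WIDTH_BITS  = 128,
    BEATS_PER_FILL  = VECTOR_REG_BITS / BUS_WIDTH_BITS   /* = 16 */
};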

Additionally, when going from a controller with a unified memory subsystem to one with vector operations, the data needs to be aligned and packaged up. That kind of vector work collects all of this data, and then runs it on a single SIMD (single instruction, multiple data) operation. As such, the space within the memory needs to be pre-packed and pre-allocated.
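
Below is a minimal sketch of that pre-packing in C, assuming a 512-bit vector width and an illustrative helper name. The buffer is aligned to the vector width and padded out to a whole number of vectors, so vector loads never straddle an alignment boundary and the final partial vector needs no special handling:

#include <stdlib.h>
#include <string.h>

#define VEC_BYTES 64   /* illustrative: a 512-bit vector unit */

// Allocate a SIMD-friendly buffer: aligned to the vector width, padded to a
// whole number of vectors, with the padding lanes zeroed.
float *alloc_packed(size_t n_elems) {
    size_t bytes  = n_elems * sizeof(float);
    size_t padded = (bytes + VEC_BYTES - 1) / VEC_BYTES * VEC_BYTES;  // round up
    float *buf = aligned_alloc(VEC_BYTES, padded);   // C11; size is a multiple of alignment
    if (buf)
        memset(buf, 0, padded);                      // zero the padding lanes
    return buf;
}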

“You also need to be able to bring that data in,” said Synopsys’ Wilson. “Sometimes this data, if you go to a long vector length, is quite long as well, and it’s usually much longer than the general-purpose system memory that you may have. So you will need to either expand that, or, as some of the traditional DSPs do, use a dedicated memory load-store architecture to connect to this vector data memory. That allows you to efficiently bring this in, compute, and then send it back out.”

Verifying vectors
Verification of vector instruction extensions and vector engines is generally not too different from that of scalar instructions.

“You need to verify these processors like any other,” said Darko Tomusilovic, verification manager at Vtool. “You need to understand what each instruction does, how to model it in your environment, and how to either preload a random set of instructions to stimulate it, or write proper software, which will be compiled into code you run. Apart from that, it’s a classic process like any other processor verification. It is, of course, more complex to model such instructions, but in terms of workflow it is exactly the same.”

Roy Choudhury agreed. “The approach to verification of vector instructions is not too different from that for scalar instructions. The process has to start with a comprehensive test suite, which can sweep through the configuration settings for all the instructions and compare the test results with a golden reference. Once the configurations are cleared, the focus should move on to constrained-random and interoperability testing. Use cases of vectorization also need to be covered to ensure that the workloads and applications run smoothly.”

At the same time, there are a few verification considerations when it comes to vectors, Roy Choudhury said. “Since vector instructions operate with large amounts of data, the overall processor state under observation for any test is very large. Some vector implementations, like RISC-V, are designed to be very flexible, allowing users to configure the element size, starting element, size of the vector register group, etc. So the number of configurations each instruction has to be checked against is massive. These factors add up to verification complexity.”
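
A rough, back-of-the-envelope illustration of that configuration space follows. The element widths and register-group sizes come from the RISC-V vector specification, while the sampled vector-length and starting-element counts are arbitrary assumptions made for the illustration:

#include <stdio.h>

// Each vector instruction can, in principle, be exercised across every legal
// combination of element width (SEW), register-group size (LMUL), vector length,
// and starting element, so the settings multiply quickly.
int main(void) {
    const int sew_options    = 4;   // 8-, 16-, 32-, 64-bit elements
    const int lmul_options   = 7;   // register groups of 1/8, 1/4, 1/2, 1, 2, 4, 8
    const int vl_samples     = 16;  // sampled vector-length settings (assumption)
    const int vstart_samples = 4;   // sampled starting-element values (assumption)
    printf("settings to check per instruction (sampled): %d\n",
           sew_options * lmul_options * vl_samples * vstart_samples);   // 1,792
    return 0;
}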

In other words, verification needs to shift left. “The software techniques of continuous integration and test are being adopted by hardware SoC and processor design teams,” said Imperas’ Davidmann. “The often-quoted estimate that 60% to 80% of the cost and time of a design project goes to verification is an over-simplification. Design is verification. As a design team develops a functional specification, the test plan is central to all discussions. Verification is no longer just a milestone at the end of the design phase. With the open-standard ISA of RISC-V, the specification allows for many options and configurations in addition to custom extensions. As designers select and define the required features, a detailed test plan needs to be co-developed at each stage. Co-design is really hardware, software, and verification as an ongoing, simultaneous process.”

Conclusion
In addition to all of the design and verification challenges, one of the keys to designing vector instructions is to understand the end application being targeted.

“Ultimately, you’re putting in a vector engine because you’re doing some sort of signal processing or some sort of image processing, or more often today, inferencing,” said Klein. “The reason we’re hearing a lot about this is the proliferation of machine learning algorithms. There, multiply-accumulate operations are being done on large arrays. It’s really repetitive, there’s lots of data, and it fits well with this vector math. Let’s say you’re using an inferencing algorithm. Look at the expected size of your feature maps and your convolution kernels, and how those fit into the vector unit that you’re contemplating building. If you’ve got convolution kernels that are 9 elements of 8 bits, putting in a vector processing unit that’s 1,024 bits wide isn’t going to help, because you’ve only got those 72 bits of kernel data that you’re bringing into the mix. In this way, understanding the end application, and taking into account the data patterns and computational patterns that you’re going to need to support, is the way to get to the right mix of accelerator, I/O bandwidth, and total design that will efficiently meet the application that you’re looking for.”
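
For reference, the arithmetic behind Klein's example works out as follows:

// 9 elements of 8 bits each is 72 bits of kernel data, so on a 1,024-bit
// vector unit only about 7% of the datapath would carry kernel data.
enum {
    KERNEL_ELEMS = 9,                          // e.g., a 3x3 convolution kernel
    ELEM_BITS    = 8,                          // 8-bit weights
    KERNEL_BITS  = KERNEL_ELEMS * ELEM_BITS,   // 72 bits
    VECTOR_BITS  = 1024                        // the contemplated vector width
};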

Related
RISC-V Gaining Traction
Part 1: Extensible instruction-set architecture is drawing attention from across the industry and supply chain.
RISC-V: Will There Be Other Open-Source Cores?
Part 3: The current state of open-source tools, and what the RISC-V landscape will look like by 2025.
RISC-V Markets, Security And Growth Prospects
Experts at the Table: Why RISC-V has garnered so much attention, what still needs to be done, and where it will likely find its greatest success.
RISC-V Challenges And Opportunities
Who makes money with an open-source ISA, the current state of the RISC-V ecosystem, and what differentiates one vendor from the next.
RISC-V’s Expanding Footprint
Market opportunities and technical challenges of working with the open-source ISA
Open-Source Hardware Momentum Builds
RISC-V drives new attention to this market, but the cost/benefit equation is different for open-source hardware than software.
Open-Source Verification
Sorting out what is meant by open-source verification is not easy, but it leaves the door open to new approaches.


