The Limits Of Parallelism

Tools and methodologies have improved, but problems persist.


Parallelism used to be the domain of supercomputers working on weather simulations or plutonium decay. It is now part of the architecture of most SoCs.

But just how efficient, effective and widespread has parallelism really become? There is no simple answer to that question. Even for a dual-core implementation of a processor on a chip, results can vary greatly by software application, operating system, and use case. Tools have improved, and certain functions that can be done in parallel are better defined, but gaps remain in many areas with no simple way to close them.

That said, there is at least a better understanding of what issues remain and how to solve them, even if that isn’t always possible or cost-effective.

“To achieve parallelism it is necessary to represent at some level of granularity what needs to be done concurrently,” explained John Goodacre, director of technology and systems, research at ARM. “To achieve concurrency you need to ensure there are no dependencies between what needs to be done. At each level of granularity, there are aspects that make it harder to maintain and ensure there are no dependencies—and hence scale.”

With dependencies, it’s necessary to synchronize or maintain some exclusive access over them. “The percentage of time a system requires to manage such dependencies limits the level of concurrency, and hence the level of scalability achieved by parallelism. This is well documented in Amdahl’s Law. Both hardware and software are used to express concurrency at different levels of granularity for parallelism, from the transistor to full systems, and from the compiler to entire application frameworks.”
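As a rough illustration of the Amdahl's Law limit Goodacre refers to, the sketch below computes the ideal speedup 1 / ((1 - p) + p / n), where p is the fraction of runtime that can run concurrently and n is the number of workers. The function names and the 90% figure are assumptions chosen for illustration, not numbers from the interview, and the model ignores synchronization and data-movement overhead.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Ideal speedup when only `parallel_fraction` of the runtime scales with `workers`."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


def speedup_limit(parallel_fraction: float) -> float:
    """Upper bound on speedup as the number of workers grows without limit."""
    serial_fraction = 1.0 - parallel_fraction
    return float("inf") if serial_fraction == 0.0 else 1.0 / serial_fraction


# Hypothetical workload: 90% of the runtime is free of dependencies.
for workers in (2, 4, 8, 64):
    print(workers, round(amdahl_speedup(0.9, workers), 2))
```

Under this assumed 90% figure, the speedup can never exceed 10X however many cores are added, because the remaining 10% of serial work dominates.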

Concurrency and parallelism used to be almost synonymous terms when parallel architectures were first introduced. The initial idea was that complex problems could be split into more manageable smaller ones.

“If you look at how architectures have evolved, parallelism and concurrency have gone on to mean different things,” said Anush Mohandass, vice president of marketing and business development at NetSpeed Systems. “Back in the CPU-centric days, we had a bus, and the problem with a bus was that only two guys could talk at one point. When you talk about parallelism or concurrency, the idea was how can two guys talk to two other guys at the same time. How can we get that throughput up? That was concurrency, that was parallelism, and people used to do that with crossbars. And from there they went to network-on-chip, where multiple agents could talk to multiple other agents.”

Parallelism and concurrency have since taken on different meanings as the scope of the problem comes into better focus. “Because we were still in the CPU-centric world, people used to partition memory into separate things. There was a CPU partition, there was a GPU partition, there was a DSP partition. And you can access memory within that region, but anytime you want concurrency or parallelism across different memory segments you run into problems. That’s the fundamental reason for cache coherency, where you want all these different compute engines — vision compute engines, GPUs, CPUs — having the same view of system memory and how to access that. That is cache coherency. If you define it along those lines, it’s not solved. It’s being solved as we speak,” Mohandass said.

Randy Allen, director of advanced research at Mentor Graphics, agreed. “It’s actually not that hard to make things run in parallel. Looking to figure out if things can be done at the same time is not a simple problem, but it’s not an earth-shaking type of problem, either. What you find is that in the parallel world, users go out, they take an application, they parallelize it, it looks pretty simple. So they say, ‘I’m going to run it, it’s going to run 10X faster or 20X faster.’ But when they put it down it runs 3X slower. The reason it runs 3X slower is things can only run as fast as they have operands or data inputs coming in. What happens when you parallelize it, you not only have to get things running in parallel, but you have to get the data to the right place at the right time. That’s where the cache coherency comes in. Nine times out of ten you end up flogging, waiting for the results to come from some other way or go through memory or because the cache gets restored, and that’s when you slow down.”

Adds Pulin Desai, product marketing director, IP Group at Cadence: “One way to define parallelism is to say, ‘I’m designing this SoC architecture, and I’m looking at parallelism where there is a heterogeneous environment of CPUs in an SoC, with a number of ARM processors, a GPU, along with a DSP. You have three processor units, they are working in parallel and trying to do different things. Then you start looking underneath that, and if there are something like eight ARM cores sitting there, from our DSP standpoint, I could be putting in multiple DSP cores. That’s the homogeneous environment within the heterogeneous environment. Inside it, within the DSP itself, we are achieving a lot of parallelism. That’s the whole concept of SIMD (single instruction multiple data), where we do 64-way, 8-bit SIMD plus very large instruction. So in a single cycle we could be executing 320 operations, or we could be doing an ALU operation and a load store operation. That’s parallelism, and not only single-operation, but 512 of those or 256 ALU operations plus another 64 load-store operations that I could be doing in parallel.’”

Is parallelism a solved problem?
While parallelism is well understood, solving the issues it raises is harder than it looks.

“Heterogeneity adds another level of complexity, and heterogeneous SoCs can’t necessarily parallelize the same application,” said Simon Davidmann, CEO of Imperas Software, pointing out there are two types of parallelism when it comes to SoCs. One of them is the symmetric multiprocessing problem, which is pretty much solved. Asymmetric multiprocessing in hardware is not. And when it comes to software, that adds a whole different set of issues that can render any progress on the hardware issues moot.

SoCs today are composed of lots of parallel but separate processors. “They’re not like application processors,” Davidmann said. “They’re small, controlling processors, and they may be just doing control or they may be acting as accelerators doing a specific function of algorithms of hardware, whether it’s a vision compression or encryption or an FFT (Fast Fourier Transform), maybe at a DSP level or maybe further up.”

This is what was put into silicon over the past 15 years, along with a growing emphasis on software because it could be changed after the hardware was committed to silicon. That has created a market for specific-function off-the-shelf processors, such as Synopsys’ ARC processor and Cadence’s DSPs. These devices focus on specific functions such as acceleration or power control.

“On an SoC you’ve got your main core, which does the application, but then you’ve got all these other subsystems,” Davidmann said. “If you look at them today, from a phone to an IoT type of device, what have you got on there? You’ve got a WiFi chip. That’s a processor. You’ve got a Bluetooth chip. That’s a processor. You’ve got a graphics engine. That’s a processor. You’ve got power control. That’s a processor. You’ve got some pattern analysis in an ADAS system. That’s a processor. They might be DSPs, or they might be ARM, MIPS, ARC or Tensilica chips, but what you have on your chips today are many processors—tens of processors, if not more—apart from your main application processor.”

Not all software can be parallelized
And this is one of the main gaps today. Hardware can be duplicated to create an array of identical processors. Developing software that can partition an operation across those processors, and then combine the results without huge overhead in terms of power or performance, is much more difficult. Not all problems can be broken down into discrete parts, and even when they can, the parts are not all equal in size or complexity.
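The split-and-combine pattern described above can be sketched in a few lines. This is a minimal example under stated assumptions: the function names, the sum-of-squares workload and the four-worker default are hypothetical, and Python's standard `concurrent.futures` thread pool is used only to show the structure. In standard CPython builds, threads do not execute Python bytecode on several cores at once, so this shows the shape of the problem rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, num_chunks):
    """Split `data` into `num_chunks` contiguous pieces whose sizes differ by at most one."""
    base, extra = divmod(len(data), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        size = base + (1 if i < extra else 0)
        chunks.append(data[start:start + size])
        start += size
    return chunks


def sum_of_squares(chunk):
    """Work done independently on one chunk; it depends on no other chunk."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data, num_workers=4):
    """Fork: one partial result per chunk. Join: combine the partial results serially."""
    chunks = split_into_chunks(data, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_sums = list(pool.map(sum_of_squares, chunks))
    return sum(partial_sums)  # the combine step runs on a single worker
```

This workload divides cleanly because each chunk is independent and costs about the same. When chunks differ in size or complexity, the combine step has to wait for the slowest one, which is the imbalance the paragraph above points to.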

“When multi-threading was first coming out, just putting aside that you have multiple parallel engines, even if you have one type of CPU, programming that thing or even multiples of that is really, really tough,” said Kurt Shuler, vice president of marketing at Arteris. “There are C libraries that people have used. POSIX is probably the most popular. There are custom libraries and custom tools and custom compilers. For instance, Intel has custom compilers for its x86 cores. Instead of engineers having to hand code things and say, ‘This part of the program can run in parallel with these other things, so I’m going to manually place it on a different processing core,’ it will try to do some of that automatically. And it will also use hardware acceleration code.”

To this point, another way parallel processor providers are dealing with writing software is to provide, as in the case of Cadence’s vision processors, auto-vectorization compilers. Those are in addition to libraries for certain functions. They have been optimized to help ensure the parallel operations are achieved and to keep the machine busy in every cycle, as well as to achieve the maximum parallel operations expected, Desai said.

But not all software is the same. For an operating system such as Linux, parallelism is fairly straightforward. There are often lots of jobs to do, and each job is relatively self-contained and linear. As a result, those jobs can be spread out across different processors or processor cores. But there are many other types of software in a stack—embedded RTOSes and drivers, middleware, and applications—that are much more difficult to run in parallel.

“As long as you can have software that’s in small chunks, in parallel, then there’s not a problem,” Davidmann said. “But if you have programs that need to be parallelized, if you’ve got algorithms that you need to break up, that’s when you run into trouble. POSIX threads are the way that all these Linux machines, Unix machines, Windows machines and everything work. They communicate between things. You have to lock things, and if you try and break a program up into components and run them in threads, you run the risk of them not really being in parallel and not communicating properly. The problem is that the programming languages aren’t there to give us this parallelism. It’s very hard to write good parallel programs, and the problem we have is, academically, the languages we’ve got are not good enough.”
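The locking Davidmann describes can be illustrated with a small sketch. Python's standard `threading` module stands in here for the POSIX threads API, and the function name, thread count and increment count are assumptions made for the example. Several threads update one shared counter, and a lock gives each update exclusive access.

```python
import threading


def count_with_lock(num_threads=4, increments_per_thread=10_000):
    """Increment one shared counter from several threads, serializing each update."""
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(increments_per_thread):
            # Read-modify-write on shared state: only one thread may do this at a time.
            with lock:
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter
```

The lock makes the result correct, but every increment now runs one thread at a time. The threads exist, yet the work inside the lock is effectively sequential, which is the risk of code "not really being in parallel" that the quote raises.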

Complicating matters, the majority of programs in the embedded world are written in C, while in computers and IoT devices much of the programming is done in Java, an object-oriented language. To make proper use of parallel hardware, better programming paradigms are needed.

Human limitations
Shuler contends that the languages and the libraries need to be changed to make it easier for software developers to create parallel programs. “The problem is human beings can’t think in parallel. Everything is a sequential flow. It’s really hard for them to do that, and if you do that a lot within the programming, it gets inordinately complex. It becomes chaos there, and a human being just can’t comprehend it. No matter what you do to make it easier to explicitly say these things and these things are parallelizable, and these things aren’t—the coder has to know that ahead of time and put that in the code. That’s really tough to do. The next step is trying to do some automation and some tooling to be able to look at the programmer’s intent and automatically infer that these things are parallelizable, and these things aren’t. So when I get through the compilation toolchain I’ll let the compiler know that this stuff can be run on a different core.”

And if it is hard for a person to figure that out, how much more of a challenge is it for a human to create the automation to do that?

“The compiler guys have been dealing with this, but compilers produce machine code for a particular instruction set architecture,” Shuler said. “So if you had an x86 processor, and an ARM processor in the same system, you’d still have two different compilers that you have to use to create software to run on those devices. You couldn’t have one software toolchain. You have different software toolchains. If you have an NVIDIA processor, they have their own toolchain. There’s a compiler in there, which takes software that you write, whether it is C or Fortran or whatever, and puts it into byte code/machine code that the NVIDIA GPU can understand. So at the compiler level, they can try to add in functionality to automatically parallelize/vectorize things for multiple cores running the same instruction set. That’s the software problem, and a single type of hardware problem. When you expand the different types of instruction set architectures — a DSP has a different compiler and software toolchain than an x86, or an ARM, or a graphics processor — then it gets really crazy.”

Every decade or so for the past half century there has been a big push to add more parallelism into more computer architectures. While this has worked for some applications, such as databases and video/image rendering, as well as for some operating system functions and virtualization software, the problem remains unsolved for many implementations of software.

Some industry experts believe change is coming and point to progress that has been made on a number of fronts. But whether this turns into a big change, or whether it creeps along with incremental improvements, needs to be judged over time. Despite all of the time and effort put into this problem, it’s still one that is not completely solved. And as more compute elements are added into devices, an effective solution may be even harder to develop.



