SMP, Asymmetric Multiprocessing and the HSA Foundation

Symmetric multiprocessing has gotten all the attention, but future gains will require its lesser-known relative.


When we hear the term “multiprocessing,” we often associate it with “symmetric multiprocessing (SMP).” This is because of SMP’s initial prevalence in the high-performance computing world, and now in x86/x64 servers and PCs. However, it’s been known for years that SMP’s ability to scale performance as the number of cores increases is poor. (For more information on SMP’s inability to scale well, read Jack Ganssle’s 2008 embedded.com article, “The Nulticore Effect,” or the IEEE Spectrum/Sandia Labs article, “Multicore is Bad News for Supercomputers: Adding cores slows data-intensive applications.”)

[Figure: Processor core performance graph (IEEE Spectrum/Sandia Labs)]

Processor companies serving the mobility and consumer electronics markets have avoided purely SMP solutions and instead have implemented asymmetric multiprocessing (AMP) architectures. An example of AMP is a mobile phone modem baseband SoC, which contains an ARM processor and a DSP to handle control and signal processing, respectively. We also see AMP architectures in today’s mobile phone application processors, which usually have multiple CPU cores alongside separate graphics, video, audio and imaging cores.

Battery size and heat drive asymmetric multiprocessing in mobility devices.
The mobility world has always been forced to use “the best core for the job” because of the constraints imposed by battery size and heat dissipation. As a result, mobility architectures have always started from a baseline expectation of heterogeneous-core AMP.

[Figure: Qualcomm Snapdragon S4 block diagram]

This is in contrast to the server and PC markets, which have relatively unlimited (at least compared to a mobile phone) power-consumption and heat-dissipation capabilities. In these markets, it has always been easier to add more cores of the same type, connect them using cache coherency, and re-use the legacy software to run on top.
Things are starting to change, though, as the SMP approach starts to wear thin. For example, for the server farms that power the likes of Google and Facebook, power consumption and heat dissipation have become huge cost and environmental issues. And in the PC space, we have run into a “GHz wall” where the only way to get a step-function increase in performance is to have different cores optimized for different workload types.

Why hasn’t AMP been implemented in the PC and server markets?
It’s hard.

In mobility designs, each heterogeneous processing core, whether graphics, audio, DSP, etc., usually has a custom firmware and software stack associated with it. This software must be integrated to communicate with the operating system running on the CPU cores, which necessitates coding work in the OS hardware abstraction layer and drivers.

Furthermore, these heterogeneous cores do not have a single view of system memory, so complicated synchronization schemes are usually implemented in hardware and software. Context switching and preemption are difficult to implement. And most importantly, each of these cores requires an expert programmer to code it, someone conversant in a particular core’s instruction set and tool chains. As a result, asymmetric multiprocessing has thrived in the relatively closed-to-developers/ISVs mobility and consumer electronics worlds while SMP has flourished in the wide-open world of PCs and servers.
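To make the "no single view of system memory" problem concrete, here is a minimal sketch in Python. It is a toy model, not code for any real DSP or GPU: the `RemoteCore` class and its `upload`, `run` and `download` methods are hypothetical names invented for illustration. The point is that every exchange is an explicit copy, so the host and the accelerator can silently disagree about the data.

```python
class RemoteCore:
    """Hypothetical accelerator with its own private memory (no shared view)."""

    def __init__(self):
        self._private = {}  # device-local buffers, invisible to the host

    def upload(self, name, host_buf):
        # Explicit copy, host -> device.
        self._private[name] = list(host_buf)

    def run(self, name, func):
        # Work happens only on the device's private copy.
        buf = self._private[name]
        for i, v in enumerate(buf):
            buf[i] = func(v)

    def download(self, name):
        # Explicit copy, device -> host.
        return list(self._private[name])


def demo():
    host = [1, 2, 3]
    dsp = RemoteCore()
    dsp.upload("samples", host)
    host[0] = 100                        # host changes its copy after the upload...
    dsp.run("samples", lambda v: v * 2)
    result = dsp.download("samples")     # ...but the device worked on stale data
    return host, result


if __name__ == "__main__":
    print(demo())
```

In this sketch the host ends up holding `[100, 2, 3]` while the accelerator returns `[2, 4, 6]`, computed from the old values. Keeping the two copies in step is the job of the synchronization schemes mentioned above.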

The Heterogeneous System Architecture Foundation
The HSA Foundation is a non-profit organization that intends to make it easier for the world to adopt AMP architectures.

Its goals are to:

  • Make heterogeneous programming easy and a first-class pervasive complement to CPU computing
  • Continue to increase the power efficiency of heterogeneous systems (AMP), keeping it the platform of choice from smartphones to the cloud
  • Bring to market strong development solutions (tools, libraries, OS runtimes) to drive innovative advanced content and applications
  • Foster growth of heterogeneous computing talent through HSA developer training and academic programs to drive both learning and innovation

To achieve these goals, HSA will have to innovate by providing a technical framework and architecture to address the following issues:

  • Unified Programming Model – Today, CPU and GPU (or other accelerator) cores are programmed separately, with the GPU treated as a remote processor. HSA will allow developers to target the CPU or GPU by writing in task-parallel languages, like the ones they use today when writing for multicore CPUs.
  • Unified Address Space – HSA supports virtual address translation among the heterogeneous cores with an HSA-specific memory management unit (HMMU). HSA compute engines will use the same pageable virtual address space that CPUs use today.
  • Queuing – CPUs, GPUs and other cores can queue tasks to each other and to themselves through an HSA runtime. Queuing can be managed in hardware to avoid OS system calls and enable very low latency communication between cores.
  • Preemption and Context Switching – HSA enables job preemption, job scheduling and fault handling capabilities to overcome potential problems created by rogue or faulted processes.
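The queuing and unified-address-space ideas above can be sketched with ordinary Python threads. This is only an analogy under stated assumptions: it does not use the HSA runtime or any HSA API, and the `Agent` class, `run_pipeline` function and the "cpu"/"gpu" labels are hypothetical names chosen for illustration. Each agent owns a task queue, any agent can queue work to any other (or to itself), and all of them operate on one buffer that is passed by reference rather than copied.

```python
import queue
import threading


class Agent:
    """Hypothetical stand-in for one compute agent (CPU, GPU, DSP, ...)."""

    def __init__(self, name):
        self.name = name
        self.tasks = queue.Queue()  # this agent's own task queue
        self.thread = threading.Thread(target=self._run, name=name)

    def submit(self, func, *args):
        """Queue a task on this agent; any agent, including itself, may call this."""
        self.tasks.put((func, args))

    def _run(self):
        while True:
            item = self.tasks.get()
            if item is None:  # sentinel: shut down
                return
            func, args = item
            func(*args)

    def start(self):
        self.thread.start()

    def stop(self):
        self.tasks.put(None)
        self.thread.join()


def run_pipeline(values):
    cpu, gpu = Agent("cpu"), Agent("gpu")
    shared = list(values)  # one buffer, visible to both agents, never copied
    result = {}
    done = threading.Event()

    def square_in_place(buf):
        # Runs on the "gpu" agent and modifies the shared buffer directly.
        for i, v in enumerate(buf):
            buf[i] = v * v
        # The accelerator queues follow-up work back to the CPU.
        cpu.submit(summarize, buf)

    def summarize(buf):
        # Runs on the "cpu" agent and sees the accelerator's writes.
        result["total"] = sum(buf)
        done.set()

    cpu.start()
    gpu.start()
    # A CPU-side task whose only job is to dispatch work to the "gpu" agent.
    cpu.submit(gpu.submit, square_in_place, shared)
    done.wait()
    cpu.stop()
    gpu.stop()
    return shared, result["total"]


if __name__ == "__main__":
    print(run_pipeline(range(8)))
```

Because both agents hold a reference to the same list, no upload or download step appears anywhere in the sketch. The bullets above say HSA can manage queuing in hardware; here a software queue stands in for that mechanism, so the sketch says nothing about the latency benefits.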

[Figure: HSA Solution Stack diagram]

How will HSA do this?
HSA’s goals and the issues it has chosen to address are admirable but difficult to achieve. In my next article I’ll discuss the means by which the HSA Foundation will simplify heterogeneous asymmetric processing. Specifically, I’ll introduce the HSA solution stack, comprising the HSA Assembler, Runtime, Finalizer, and Kernel Driver, as well as HSA software libraries and intermediate languages.

Sources
Ganssle, Jack. “The Nulticore Effect.” Embedded.com, Dec. 8, 2008.
Moore, Samuel K. “Multicore is Bad News for Supercomputers: Adding cores slows data-intensive applications.” IEEE Spectrum, November 2008.
Kyriazis, George (AMD). “Heterogeneous System Architecture: A Technical Review.” Whitepaper, HSA Foundation, August 2012.
Processor core performance graph is from “Multicore is Bad News for Supercomputers: Adding cores slows data-intensive applications.” IEEE Spectrum, November 2008 and Sandia Labs.
Qualcomm Snapdragon S4 block diagram is from http://www.cnx-software.com/wp-content/uploads/2011/10/qualcomm_snapdragon_s4_block_diagram.jpg.
HSA Solution Stack diagram is from Phil Rogers’ presentation at the AMD Fusion 2012 conference titled, “The Programmer’s Guide to a Universe of Possibility: The Heterogeneous System Architecture.”

—Kurt Shuler is vice president of marketing at Arteris.