How Software Utilizes Cores

Best practices and key considerations for making software run more efficiently in multicore environments.


By Ann Steffora Mutschler
When writing software, how does the design engineer determine how much power it will draw on a particular targeted platform? While the question seems straightforward, the answer is not.

The industry is just starting to develop the ability to get some data in that space,
according to Cary Chin, director of technical marketing for Synopsys’ low-power solutions group. “And when we can do that, then I think what you’ll find is mobile applications will actually be written differently than the ones you run on a laptop because they’ll be better optimized for power and may do things differently in terms of how you cache data.”

Getting to that point isn’t simple, though. Jason Parker, operating systems architect at ARM, said power-efficient software needs to be part of the design from the start. “Designers need to constantly ask themselves, ‘Is this the most power saving way of solving this problem?’ Trying to retrofit power management and efficiency into an existing design is hard work, and all the silver bullets were used up a long time ago. Multiprocessor designs open up additional techniques and constraints for power management.”

Understanding what happens below the surface is a start. Threads and processes are the software abstractions that represent CPU execution and the visible memory space. A thread represents the execution state of the CPU, e.g., the program counter, registers and flags. A process is the constrained memory space within which one or more threads execute, with the MMU used to enforce it, he explained. There is often more than one thread in a process, and they all share the same data.
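The thread/process model Parker describes can be sketched in a few lines of Python (an illustrative example, not from the source): several threads run inside one process and therefore see the same data, which is why access to it must be synchronized.

```python
import threading

# All threads in a process share the same address space, so this
# counter is visible to every worker thread below.
counter = 0
lock = threading.Lock()  # serializes access to the shared data

def worker(increments):
    global counter
    for _ in range(increments):
        with lock:  # without the lock, concurrent updates could be lost
            counter += 1

# Four threads of execution inside one process, all sharing `counter`.
threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 4000: every thread's increments land in the shared data
```

On a single core the kernel scheduler time-slices these threads; on a multiprocessor they can genuinely run at the same time, which is exactly when the lock matters.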

In a single-core processor, the CPU is shared between the threads by the OS kernel scheduler. Execution is managed by the scheduling of threads, determined by thread priority and time slicing; switching threads is known as a context switch. In comparison, a multiprocessor (MP) combines multiple high-efficiency CPUs that can deliver greater aggregate performance for less total power than a single high-performance CPU, and provides more power management options, Parker noted.

MP systems are divided into symmetric and asymmetric systems. “Asymmetric systems can have different OSes running on different cores working together to provide the whole system solution. An example would be a smart phone that has an ARM Cortex-A8 application processor for the Android user interface, a different Cortex-R4 processor running the real-time telephone stack in the RF modem, and additional cores for graphics, video and low-power audio. The advantage of these systems is that the processors and resources for each subsystem can be tailored to deliver the expected performance at minimal power. The disadvantage is that the system architecture is often fixed and may not be able to implement a future requirement, e.g., a new video format.”

Meanwhile, symmetric systems run a single OS kernel across identical cores with a coherent memory system joining them together, Parker explained. “SMP OSes will run multiple threads simultaneously, aiming to share the workload over the cores within the cluster. Well-structured code and algorithms, that are parallelizable, are able to harness the performance of the multiple cores. Existing code and serial algorithms may not be able to take advantage of multiple cores. Power management systems within SMP OSes will control power consumption by scaling performance on the cores using DVFS, and will turn off unused/underused cores.”
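The “well-structured, parallelizable” code Parker refers to typically means work that can be split into independent chunks, one per core. A minimal sketch of that decomposition (illustrative only; note that CPython’s GIL limits true CPU parallelism for threads, so a real compute workload would use processes or a language without that constraint, but the structure is the same):

```python
from concurrent.futures import ThreadPoolExecutor
import os

def partial_sum(chunk):
    # Each worker handles an independent slice, so chunks can run
    # on different cores with no shared state to coordinate.
    return sum(x * x for x in chunk)

data = list(range(10_000))
n_workers = os.cpu_count() or 2  # one worker per available core

# Split the data into one independent chunk per worker.
chunks = [data[i::n_workers] for i in range(n_workers)]

with ThreadPoolExecutor(max_workers=n_workers) as pool:
    total = sum(pool.map(partial_sum, chunks))

# The parallel decomposition produces the same answer as a serial loop.
print(total)
```

Existing serial algorithms that cannot be split this way are exactly the code that, as Parker notes, fails to benefit from additional cores.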

Today’s complex SoCs contain a mixture of SMP and AMP subsystems, with power optimized for their respective tasks. For example, “a multicore Cortex-A9 system provides the flexibility for an open-platform OS where the future application requirements are not known, whereas the CPU requirements for an LTE modem are known at design time,” he said.

Attaining optimal core utilization
But just understanding how the system is structured is not enough. Achieving the best utilization of cores by the software requires certain techniques, keeping in mind that core utilization is driven by the subsystem partitioning and by how far system code and algorithms can be parallelized. “The OS scheduler can maximize execution efficacy by keeping threads and their data on the same or local CPUs, while application software can force this by the use of thread affinity,” Parker said.
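Thread affinity is exposed directly by most OSes. As a hedged sketch: on Linux, Python wraps the kernel’s affinity calls via `os.sched_getaffinity`/`os.sched_setaffinity` (these are Linux-specific; other platforms lack them, and the code below simply skips the demonstration there):

```python
import os

# os.sched_getaffinity / os.sched_setaffinity exist only on Linux;
# elsewhere this sketch just records that affinity control is absent.
if hasattr(os, "sched_setaffinity"):
    original = os.sched_getaffinity(0)   # cores this process may run on
    # Pin the process to a single core so the scheduler keeps the
    # thread (and its cached data) local, as Parker describes.
    os.sched_setaffinity(0, {min(original)})
    pinned = os.sched_getaffinity(0)
    os.sched_setaffinity(0, original)    # restore the original mask
else:
    original, pinned = None, None

print("pinned to:", pinned)
```

Pinning keeps a thread’s working set warm in one core’s caches, which is the locality benefit the scheduler tries to achieve automatically.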

Maximizing core utilization will drive maximum performance. However, it may not be the most power-efficient solution for every silicon process, particularly those whose power management can optimize thread scheduling when the total required software load is only a fraction of total performance. For example, in a dual-core system where the total load is 80% on one CPU, key questions to ask are:

1. Does the kernel run one CPU at 100% performance, with the second one turned off?
2. Does the kernel run both CPUs at 50% performance, with lower frequency, voltage and total power?
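The intuition behind the second option follows from the standard dynamic-power relation P ≈ C·V²·f: halving frequency lets DVFS also lower the voltage, and the quadratic voltage term means two slow cores can beat one fast one. A worked sketch with assumed, normalized operating points (illustrative numbers, not measurements of any real part):

```python
# Dynamic CPU power scales roughly as P = C * V^2 * f.
C = 1.0  # effective switched capacitance (normalized)

# Option 1: one core at full frequency and voltage, second core off.
f_full, v_full = 1.0, 1.0
p_one_core = C * v_full**2 * f_full            # = 1.0 (normalized)

# Option 2: both cores at half frequency; DVFS lets the lower
# frequency run at a lower voltage (0.8 here, an assumed value).
f_half, v_half = 0.5, 0.8
p_two_cores = 2 * (C * v_half**2 * f_half)     # = 0.64

print(p_one_core, p_two_cores)
```

Under these assumptions the dual-core option delivers the same aggregate throughput for roughly two-thirds of the dynamic power, though leakage and the achievable voltage floor on a given silicon process can change the answer.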

In addition to subsystem partitioning, there are other ways to optimize how software utilizes cores, but it depends on the tasks at hand, Parker said, including consolidation of multiple OSes onto a single CPU or cluster using a hypervisor. Also, many instances of a virtualized OS can be distributed over many cores using virtualization, as in the case of Web servers. At the other end of the scale, embarrassingly parallel problems can be handed over to a GPU, using OpenCL for example in image processing.

“In the middle is where things are interesting,” he said. “How does an existing system scale across many cores? This is a challenging 30-year-old problem for performance, and more recently for power cost. Using threads is a workable solution for existing code and a few cores (less than eight), but they are hard to program. Measurement and analysis, as ever, are the engineering skills required. Without a very good understanding of your system it will be hard to make good use of multicore hardware.”

When to use multicore
Everything is headed in the direction of multiple cores today, said Synopsys’ Chin. “As the frequencies on processors are continuing to be pushed up, that pushes technology further and further and makes the power problem worse and worse. The idea of trying to increase throughput or processing capability by duplicating cores—dual-core, quad-core, hex-core or many more in some processing units—has been the path that most of the processor manufacturers have been on. People have been talking about that for the last 8 or 10 years.”

“As a result, we see lots of processors—Intel Core i5, Core i7 kinds of processors with four and six cores pretty mainstream today and very interesting, although the architecture in mobile electronics hasn’t really gone that route yet. I’d say it’s more the idea of heterogeneous cores where you are using specific cores for more specific tasks. In a mobile application there is even more demand for optimizing the processor capabilities to the specific task at hand,” he noted.

Some applications do better in multicore environments than others, however. “The big difference between the kind of performance improvement you’re going to see with regard to a server farm versus a mobile device is that on a server farm the applications like virtualization, databases, and Google searches are algorithmically well parallelized and can be threaded easily. When you’re in a cloud or server farm environment you also have the benefit of having many, many users which provides another level of parallelization and capability with the overall farm,” Chin said.

In those environments, it makes sense to parallelize and have as many cores as possible because the whole idea of starting up the farm is to raise utilization. “The idea is to have your farm running at close to 100% utilization if you can, 24/7, whether that’s with online finance applications or Christmas ordering seasonally. And you want that to be balanced with usage from other parts of the world,” he continued. “With a mobile application there’s only a certain amount of threading you can do in the OS and in the applications that you want to run. On something like a smart phone the idea isn’t to have it running all the time. In fact, the idea is the opposite. You want it running as little as possible.”