Which Software To Use?

A closer look at software development for ARM’s big.LITTLE Processing, Part II

By Achim Nohl
big.LITTLE processing refers to the concept of combining a high-performance ARM Cortex-A15 MPCore processor with an energy-efficient Cortex-A7 processor. ARM recently introduced two primary use models for big.LITTLE processing: task migration and MP. In the big.LITTLE task migration use model, applications migrate from one cluster to the other based on some criteria. The big.LITTLE MP use model, on the other hand, allows both CPUs to run simultaneously. Which software runs on the Cortex-A15 and which runs on the Cortex-A7 is likely to be decided at runtime by a power-aware scheduler in the operating system kernel.

The big.LITTLE task migration use model has almost no impact on the current Linux kernel infrastructure and can be applied today. Here, the software can seamlessly migrate from one processor to the other, depending on the use context and performance requirements. This is achieved by a new Task Migration Software Layer running underneath the OS that takes advantage of the CPUs’ hardware extensions for virtualization. The Power Management Working Group inside the Linaro organization is still working on making the required adaptations to the Linux kernel to enable the big.LITTLE MP use model.

Which application should run on which CPU?
So, how easy is it to decide which processor is used when, and for which kind of software? Can we make this decision statically, depending on the kind of application that should run on the processor? For example, is it as simple as running the video player on the Cortex-A15 and the e-mail editor on the Cortex-A7? Looking one level deeper, there are many factors beyond the type of application that have to be considered to make a good decision about which processor executes a given software task. The best criterion is one that fits the user's requirements at the actual time a service such as video or e-mail is used. Those requirements can change depending on the context in which the application is used. For best performance, the processor selection for an application should therefore be a runtime decision. So, what constitutes an application?

Applications are complex combinations of multiple services
An application is not a monolithic piece of software. Applications use a multitude of services such as GPS, sensors, radio, graphics and sound. Moreover, in the context of Java-based Android, an application also has a multitude of OS-level threads at its foundation. Furthermore, multiple applications can be executing at the same time. While a YouTube video is being watched, the e-mail editor may still be open in the background. As a result, the execution of an application such as video is interleaved with other applications such as e-mail and GPS. The figure below illustrates this complexity and these dynamics. It shows the large number of threads that are active in a time window of just a few seconds while using the Android e-mail application and YouTube in the web browser.


Figure 1 – Android services, e-mail and web browser threads, all other threads.

To determine how this diverse set of threads impacts the selection of the CPU in a big.LITTLE processing configuration, let’s assume a video has ended and the player is just waiting for the user to either repeat or close the video. If the video was running on the performance-optimized Cortex-A15, energy is wasted from the moment the video stops. It would be better to detect at the OS level that the Cortex-A15 is no longer fully loaded, and to switch the task to the Cortex-A7. However, given the amount of thread activity shown in the figure above, this may result in too many switches and a penalty in performance.

Using the OS kernel and workload analysis to drive the CPU selection
Luckily, the Linux kernel already has the right framework in place to make the best CPU selection. The framework is called CPUFREQ and is typically used to do dynamic voltage and frequency scaling (DVFS) based on the workload of the CPUs. Instead of changing frequencies, the framework can be used to trigger the switchover between CPUs. Does this solve all problems? Not really.

Imagine that, while the video is running, the application is switched from the Cortex-A15 to the Cortex-A7 every time the CPU load briefly declines. This could result in glitches in the video when the processing requirements go up again. Therefore, the cost of switching must be considered as well.
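One common way to account for the cost of switching is hysteresis: use separate thresholds for moving up and moving down, and require the load to stay low for several samples before leaving the Cortex-A15. The following is a minimal sketch of that idea, not the cpufreq implementation; the thresholds, the sample count and all names are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass

BIG, LITTLE = "Cortex-A15", "Cortex-A7"


@dataclass
class HysteresisSelector:
    """Toy CPU selector; all parameter values are assumptions."""
    up_threshold: float = 0.80    # load above this moves the task to BIG
    down_threshold: float = 0.30  # load below this counts as a "low" sample
    min_low_samples: int = 3      # consecutive low samples needed to leave BIG
    current: str = LITTLE
    switches: int = 0
    _low_streak: int = 0

    def update(self, load: float) -> str:
        """Feed one load sample (0.0 to 1.0) and return the selected CPU."""
        if self.current == LITTLE:
            if load > self.up_threshold:
                self._switch(BIG)
        elif load < self.down_threshold:
            self._low_streak += 1
            if self._low_streak >= self.min_low_samples:
                self._switch(LITTLE)
        else:
            self._low_streak = 0  # the dip was only temporary
        return self.current

    def _switch(self, target: str) -> None:
        self.current = target
        self._low_streak = 0
        self.switches += 1


def naive_switch_count(loads, threshold=0.5):
    """Count switches when every sample is compared to a single threshold."""
    current, switches = LITTLE, 0
    for load in loads:
        target = BIG if load > threshold else LITTLE
        if target != current:
            current, switches = target, switches + 1
    return switches


# A video with short dips in load, followed by the player going idle.
loads = [0.9, 0.2, 0.9, 0.2, 0.9, 0.1, 0.1, 0.1, 0.1]
selector = HysteresisSelector()
trace = [selector.update(load) for load in loads]
print(naive_switch_count(loads), selector.switches)
```

With this made-up load trace, the single-threshold policy switches on every dip, while the hysteresis policy stays on the Cortex-A15 during playback and moves to the Cortex-A7 only once the load has stayed low.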

Because of the dynamic nature of software execution, analysis of all possible task scenarios can’t be done on paper. The only reasonable way to find the best utilization of the CPUs is to compare energy and performance across a large number of usage scenarios and combinations of OS scheduler and power manager settings. Real software runtime scenarios must be profiled on a big.LITTLE processing platform. When big.LITTLE hardware is not yet available, an alternative is needed. Fortunately, you do not have to wait for big.LITTLE hardware to profile how real software behaves on the processing pair.

Need for simulation
ARM’s CPU simulation models for the ARM Cortex-A15 and the ARM Cortex-A7, called Fast Models, have been available since the end of 2011. They enable the software community, such as Linaro, to bring up the Task Migration Software Layer, as well as to provide the necessary Linux kernel changes for big.LITTLE MP. Fast Models are also suitable for a “first-order-of-magnitude” performance analysis. Fast Models do not model timing. Instead, they offer a set of meaningful performance counters such as instruction count and L1 and L2 cache events. Those high-level performance counters are more than sufficient to allow designers to make an informed decision on how to tweak software for the optimal energy/performance tradeoff. What we are looking for, from a software point of view, is the optimal utilization of the ARM Cortex-A15 and the ARM Cortex-A7 clusters.

When these Fast Models are integrated into a virtual prototype, the relative performance gap can be determined by looking at the different clock rates at which the CPUs operate. By instrumenting the highest-level power states, such as active, idle or off, with relative power consumption values, the relative energy penalty can be determined. The cache snoop traffic, such as when switching from one cluster to the other, can also be considered. This combination of Fast Models with the profiling and instrumentation of virtual prototyping technology is a good way to model power and performance states for applications. The figure below illustrates the switching activity and the relative energy spent per process for the Cortex-A15 CPU during web browsing and e-mail.


Figure 2 – big.LITTLE processing state and energy profiling.
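The relative energy accounting described above can be sketched as a weighted sum of the time spent in each power state, plus a fixed penalty per cluster switch. This is only an illustration: the power weights, the switch penalty and the function name below are hypothetical, unitless values, not measured data or part of any ARM or virtual prototyping tool.

```python
# Hypothetical relative power per (cluster, state), in arbitrary units.
RELATIVE_POWER = {
    ("A15", "active"): 100,
    ("A15", "idle"): 20,
    ("A15", "off"): 0,
    ("A7", "active"): 25,
    ("A7", "idle"): 5,
    ("A7", "off"): 0,
}
SWITCH_PENALTY = 50  # assumed relative energy cost of one cluster switch


def relative_energy(intervals, switches=0):
    """Sum relative energy over (cluster, state, duration) intervals."""
    energy = sum(RELATIVE_POWER[(cluster, state)] * duration
                 for cluster, state, duration in intervals)
    return energy + switches * SWITCH_PENALTY


# 10 s of video followed by 20 s of waiting, without migration ...
stay_on_big = [("A15", "active", 10), ("A15", "idle", 20)]
# ... and with one migration to the Cortex-A7 once the video has ended.
migrate = [("A15", "active", 10), ("A7", "idle", 20)]

print(relative_energy(stay_on_big))          # 1400
print(relative_energy(migrate, switches=1))  # 1150
```

Because only relative weights are used, the result is meaningful for comparing two policies on the same scenario, not as an absolute energy figure.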

Profiling with repeatable, deterministic usage scenarios
Hundreds of minutes-long user scenarios (e.g., web browsing, video, SMS, navigation) can be simulated overnight with different settings, such as the CPU switching threshold in the OS kernel scheduler. The usefulness of the results comes from the scale of the investigation rather than from the accuracy of each individual run. Driving the optimization with 99% accurate results from a few seconds-long scenarios is much less beneficial than using a large, more realistic batch of minutes-long scenarios.
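Such a batch investigation amounts to a parameter sweep: run every scenario with every candidate setting and compare the aggregated results. The sketch below uses a toy `simulate` function as a stand-in for one virtual prototype run; the scenarios, the load values, the energy weights and the assumed Cortex-A7 capacity limit are all invented for illustration.

```python
A7_CAPACITY = 40  # assumption: highest load (% of A15 capacity) the A7 can sustain

# Made-up load traces in percent of Cortex-A15 capacity.
SCENARIOS = {
    "video": [70, 75, 35, 80, 70, 30],
    "email": [10, 15, 45, 10, 5, 10],
    "browsing": [20, 60, 30, 50, 20, 10],
}


def simulate(loads, up_threshold):
    """Toy stand-in for one simulation run; returns (energy, overloads)."""
    energy = overloads = 0
    for load in loads:
        on_big = load > up_threshold
        energy += 100 if on_big else 25  # hypothetical relative energy per sample
        if not on_big and load > A7_CAPACITY:
            overloads += 1  # the A7 could not keep up with this sample
    return energy, overloads


def sweep(thresholds):
    """Run all scenarios for each threshold and total energy and overloads."""
    results = {}
    for threshold in thresholds:
        runs = [simulate(loads, threshold) for loads in SCENARIOS.values()]
        results[threshold] = (sum(e for e, _ in runs), sum(o for _, o in runs))
    return results


def best_threshold(results):
    """Pick the lowest-energy threshold that never overloads the A7."""
    acceptable = {t: e for t, (e, o) in results.items() if o == 0}
    return min(acceptable, key=acceptable.get)


results = sweep([20, 40, 60])
print(results)
print(best_threshold(results))
```

In this toy setup the highest threshold uses the least energy but overloads the Cortex-A7 in some samples, so the sweep selects the middle setting. Each individual run is crude, yet the trend across scenarios is what guides the choice.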

In addition, simulation can synchronize the stimuli required for the scenarios with the varying progress of the software. Here, the control and visibility of a virtual prototype play an essential role. For example, the next touch-screen event is triggered only once the application shows up on the screen. This way, deterministic comparisons can be conducted, which is a challenge on physical hardware. The advantage of the virtual prototype approach is that it does not require a complex test environment or special software hooks to automate the device control from a testbench. Moreover, it allows an apples-to-apples comparison because the execution is always performed with the same starting conditions. This is especially beneficial because even the slightest change in the initial system state can have a huge impact on the measurement (the butterfly effect).
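The idea of tying each stimulus to the state of the software rather than to a fixed point in time can be sketched as follows. `ToyDevice` is a hypothetical stand-in for a virtual prototype, not the API of any real tool; the screen names, events and delays are invented for illustration.

```python
class ToyDevice:
    """Minimal model of a device whose app launch takes a variable time."""

    def __init__(self, launch_delay):
        self.tick = 0
        self.launch_delay = launch_delay
        self.screen = "home"
        self.log = []          # (event, screen at injection time)
        self._ready_at = None

    def step(self):
        """Advance simulated time by one tick."""
        self.tick += 1
        if self.screen == "launching" and self.tick >= self._ready_at:
            self.screen = "app"

    def inject(self, event):
        """Deliver a touch event and record the state it arrived in."""
        self.log.append((event, self.screen))
        if event == "tap_icon" and self.screen == "home":
            self.screen = "launching"
            self._ready_at = self.tick + self.launch_delay


def run_script(device, script, max_ticks=1000):
    """Inject each event only once the expected screen is visible."""
    for wait_for_screen, event in script:
        while device.screen != wait_for_screen:
            if device.tick >= max_ticks:
                raise TimeoutError(f"screen {wait_for_screen!r} never appeared")
            device.step()
        device.inject(event)
    return device.log


SCRIPT = [("home", "tap_icon"), ("app", "tap_play")]

fast, slow = ToyDevice(launch_delay=3), ToyDevice(launch_delay=17)
print(run_script(fast, SCRIPT) == run_script(slow, SCRIPT))  # True
print(fast.tick, slow.tick)                                  # 3 17
```

Although the two runs take a different amount of simulated time, every event reaches the software in the same state, which is what makes the results comparable across different kernel settings.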

As the examples above show, simulation can be a powerful tool to drive the optimization of the big.LITTLE task migration use model. Instead of guessing and planning with static spreadsheets, a much more informed decision can be made. The implications of optimizations can be analyzed and understood using real software and real scenarios. Because the dynamics of the software stack are considered, insight into the side effects of decisions can be gained. And all of this can be done before hardware is available, so that software stacks with the best balance of performance and energy can be developed in parallel with the hardware.