Compiler Optimization Made Easy

Finding an optimal compiler configuration for a given workload using AI.

In a previous blog post, we discussed the benefits of using automation to maximize the performance of a system. One use case I mentioned was compiler flag mining, and the fact that there is performance to be had beyond the standard optimization flags provided by most compilers. Getting to this untapped performance is a difficult problem, but fortunately there is an easy way to approach it.

A universe of options to explore

A compiler is a program that translates instructions written in a high-level, human-readable language into machine-readable instructions. The resulting output of the compiler is a binary file that can be executed by a machine. In most cases it is preferable for the resulting executable to run as fast as possible, which is why considerable effort is put into improving compilers. This is particularly true in the field of High Performance Computing (HPC), where the compiler can sometimes have more impact on a program’s performance than the processor architecture it runs on. As a result, compilers offer a plethora of options that can be invoked at compile time, allowing engineers to adapt the compiler’s behavior to the type of application, the hardware platform it runs on, and, last but not least, the target workloads.

The challenge, however, is that the number of options offered by compilers typically runs into the hundreds. With this many options to choose from, finding the right combination of compiler flags that will produce the most performant binary for our program is quite a daunting task. Consider this: with just 100 parameters – many of which can take a whole range of values – the number of possible permutations for our compiler flags is over 10^157. In comparison, the number of atoms in the known universe is estimated to be “only” 10^82. Trying to find a needle in a haystack would be so much easier!
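
To get a feel for how quickly this space explodes, here is a rough back-of-the-envelope calculation in Python. The per-flag value counts are illustrative assumptions (the post does not list them), but even modest numbers put the configuration count far beyond anything that could be tested exhaustively.

```python
# Back-of-the-envelope estimate of the compiler-flag search space.
# The counts below are illustrative assumptions, not LLVM's actual numbers.
import math

num_flags = 100        # tunable parameters, as in the example above
values_per_flag = 40   # assumed average number of settings per flag

total_configurations = values_per_flag ** num_flags
print(f"~10^{int(math.log10(total_configurations))} possible configurations")  # ~10^160

atoms_in_universe = 10 ** 82  # commonly cited estimate
ratio = total_configurations // atoms_in_universe
print(f"...or about 10^{int(math.log10(ratio))} configurations for every atom in the universe")
```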

Identifying the right combination of compiler flags that will produce the best performing binary is a task almost impossible to complete. I say “almost” because there is actually a way to run this analysis efficiently, through the use of automation and AI.

In this blog, I will describe a compiler flag mining experiment built around the Synopsys SLM Optimizer Studio software and show what it takes to find an optimal compiler configuration for a given workload with an automated, AI-powered tool.

The experiment

For this challenge, we decided to optimize the configuration of the widely popular LLVM compiler. Our goal was to find the right combination of compiler flags that would minimize a target binary’s run time. This means we needed a target application to compile and a workload to run tests against. For that purpose, we chose XSBench, an industry-recognized benchmark. XSBench has the advantage of taking only a few minutes per run – helping us gather sample points quickly for our test. In addition, the run-to-run deviation is fairly low (0.3% on average), which means we can trust that the results we collect as part of our experiment are representative and can be reproduced reliably. The benchmark application performs a series of lookups during testing, and its performance is therefore measured by the average number of lookups per second. This is the metric we will aim to improve by optimizing the LLVM compiler settings.
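
That 0.3% figure is easy to sanity-check yourself with a few repeated runs. The minimal sketch below runs a benchmark binary several times and reports the mean and spread of the lookups-per-second figure; the binary path and the output pattern ("Lookups/s: <number>") are assumptions for illustration and should be adapted to the actual benchmark output.

```python
# Minimal sketch: measure how stable the lookups-per-second metric is
# across repeated runs. Command and output format are assumed.
import re
import statistics
import subprocess

BENCH_CMD = ["./XSBench"]                            # hypothetical invocation
PATTERN = re.compile(r"Lookups/s:\s*([\d.,eE+]+)")   # assumed output format

def run_once() -> float:
    out = subprocess.run(BENCH_CMD, capture_output=True, text=True, check=True).stdout
    match = PATTERN.search(out)
    if match is None:
        raise RuntimeError("metric not found in benchmark output")
    return float(match.group(1).replace(",", ""))

samples = [run_once() for _ in range(5)]
mean = statistics.mean(samples)
spread = (max(samples) - min(samples)) / mean * 100
print(f"mean = {mean:.0f} lookups/s, run-to-run spread = {spread:.2f}%")
```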

Before we can unleash Optimizer Studio, we need to hook the tool up to our benchmark. Fortunately, this is quite straightforward, as there are only three items to define: the list of compiler flags that the tool will be allowed to modify throughout the experiment, the command line that calls the benchmark, and finally how to extract the lookups-per-second metric from each benchmark run (a hypothetical sketch of these three pieces follows below). Once this first step is completed, we can run Optimizer Studio, grab a cup of coffee, and watch the data come up in the graphical user interface in real time.
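
To make those three items concrete, here is a plain-Python sketch of the same information. This is not Optimizer Studio’s actual configuration format; the flag names are examples of real LLVM/Clang options, while the source file, binary name, and metric pattern are placeholders.

```python
# Illustration only: the three pieces of information the optimizer needs.
# This is NOT Optimizer Studio's configuration format.
import re
import subprocess

# 1. The compiler flags the tool is allowed to vary, with their legal values.
tunable_flags = {
    "-O": ["0", "1", "2", "3", "g", "fast"],   # built-in optimization level
    "-funroll-loops": [True, False],
    "-ffast-math": [True, False],
}

# 2. How to build and run the benchmark for a given flag selection.
def build_and_run(selection: dict) -> str:
    flags = []
    for flag, value in selection.items():
        if value is True:
            flags.append(flag)
        elif value is not False:
            flags.append(f"{flag}{value}")     # e.g. "-O" + "fast" -> "-Ofast"
    subprocess.run(["clang"] + flags + ["-o", "xsbench", "Main.c"], check=True)  # paths assumed
    return subprocess.run(["./xsbench"], capture_output=True, text=True, check=True).stdout

# 3. How to extract the target metric (lookups per second) from the output.
def extract_metric(output: str) -> float:
    return float(re.search(r"Lookups/s:\s*([\d.,eE+]+)", output).group(1).replace(",", ""))
```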

The results

Since we chose a benchmark that requires only a few minutes per run, we are able to see significant results and performance improvements within only a few hours. However, the resulting dot plot shows more than just performance gains (figure 1).

Fig. 1: Snapshot of Optimizer Studio’s interactive user interface.

Each dot on this graph represents a configuration (in other words, a specific combination of LLVM compiler flags) that was tested by Optimizer Studio. The vertical axis shows the number of lookups per second, which is our target metric, while the horizontal axis indicates the timestamp, allowing us to observe performance gains as the experiment progresses. Our starting point is highlighted in green; this is our baseline configuration. In this case we chose our baseline to be a compiler call with no flags defined at all, though we could have selected any configuration as a starting point. The tool also highlights in red the best configurations and the Pareto frontier, allowing us to quickly identify the most interesting results. Let’s take a look at the results in more detail.

The first thing we notice is that after running only a couple dozen configurations, we see a huge jump in performance, and then, again, another one after testing 10 additional configurations. This rapid convergence towards improvements is typical when running experiments with Optimizer Studio. This is due to the nature of the underlying algorithms used – derived from genetic evolutionary algorithms – which are particularly well suited for this type of optimization problem within a huge solution space. Indeed, in this experiment we are manipulating 390 LLVM compiler flags simultaneously – an optimization task impossible to perform manually!
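
The post does not disclose the exact algorithm inside Optimizer Studio, but a toy example of the genetic/evolutionary family it references might look like the sketch below: keep a small population of flag configurations, score each one, keep the fastest, and create new candidates by recombining and mutating the survivors. The evaluate() callback is assumed to compile the benchmark with a candidate’s flags and return the measured lookups per second.

```python
# Toy sketch of an evolutionary search over flag configurations.
# A generic illustration of the algorithm family, not Optimizer Studio's logic.
import random

def random_config(flag_values: dict) -> dict:
    return {flag: random.choice(values) for flag, values in flag_values.items()}

def mutate(config: dict, flag_values: dict, rate: float = 0.05) -> dict:
    # Occasionally re-roll a flag to keep exploring the space.
    return {flag: (random.choice(flag_values[flag]) if random.random() < rate else value)
            for flag, value in config.items()}

def crossover(a: dict, b: dict) -> dict:
    # Each flag is inherited from one of the two parents.
    return {flag: random.choice([a[flag], b[flag]]) for flag in a}

def evolve(flag_values: dict, evaluate, generations: int = 20, pop_size: int = 12) -> dict:
    population = [random_config(flag_values) for _ in range(pop_size)]
    for _ in range(generations):
        # Higher score = more lookups per second = better configuration.
        scored = sorted(population, key=evaluate, reverse=True)
        survivors = scored[: pop_size // 2]
        children = [mutate(crossover(random.choice(survivors), random.choice(survivors)),
                           flag_values)
                    for _ in range(pop_size - len(survivors))]
        population = survivors + children
    return max(population, key=evaluate)
```

A real tool would, of course, cache measurements so no configuration is benchmarked twice, and would explore hundreds of flags rather than the handful shown here.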

We also notice a couple of “steps” in the graph. For each configuration we have access to the list of flags that were set, and upon further inspection we notice that switching the compiler’s built-in optimization flag (“O”) from -Og to -Ofast dramatically improved performance (figure 2).

Fig. 2: Zoomed-in view showing the first jump in performance.

Optimizer Studio quickly picked up on this and focused on optimizing configurations with the -Ofast flag enabled. Keep in mind that the O flag isn’t the only dial we are playing with, as each configuration chooses from over one hundred flags!
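
You can reproduce a coarse version of this particular finding by hand with a quick sketch like the one below, which compiles the same source at -Og and -Ofast and times the resulting binaries. The file name and workload are placeholders, and the measured gap will depend entirely on the code; note also that -Ofast relaxes strict floating-point semantics, so results should be validated. The value of the tool is that it finds this kind of step, plus the many smaller ones around it, without anyone having to look.

```python
# Quick manual comparison of Clang's built-in optimization levels.
# Source file and workload are placeholders for illustration.
import subprocess
import time

for level in ("-Og", "-Ofast"):
    subprocess.run(["clang", level, "-o", "bench", "Main.c"], check=True)  # path assumed
    start = time.perf_counter()
    subprocess.run(["./bench"], check=True)
    print(f"{level}: {time.perf_counter() - start:.2f} s")
```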

If we were to perform this work manually, we would most likely assume that after such a large gain in performance, it would be as good as it gets and further optimization efforts would be futile. We would be wrong, as Optimizer is able to produce a second large jump in performance as it performs its automated analysis (figure 3).

Fig. 3: Zoomed-in view showing the second jump in performance.

Final thoughts

Ultimately, Optimizer Studio determined that a specific configuration with 181 individual flags set to specific values would provide the very best performance. Finding this optimal configuration manually within a reasonable timeframe would be simply impossible, even for a compiler expert. What this experiment shows is that AI-enabled smart automation can be successfully harnessed to solve this type of very complex problem, where a large number of variables creates a mind-boggling solution space.

As a follow-up to our experiment, it would be interesting to see whether a similar level of performance can be achieved by setting fewer flags. Could Optimizer Studio’s automated approach help us refine our optimization? Could it help us better understand the impact of specific flags on performance? The answer is yes and yes, and we’ll investigate further in a future blog post.

In the meantime, you can find more information about Synopsys SLM Optimizer Studio.


