Writing Application Software Directly To The Metal

Applications must work within a power budget to gain the performance that once came free with each process node; fewer options will force major shifts.


By Ed Sperling

How necessary is an operating system?

That question would have been considered superfluous a decade ago, possibly even blasphemous and career-limiting. But it is now beginning to surface in low-power discussions, particularly around compute-intensive applications where performance and power are both critical. General-purpose operating systems constantly call on the processor for background updates, while software written straight into the metal using Verilog or SystemC can be tailored to specific cores.

Developers of highly parallelized applications such as search, particularly in bioinformatics, already are exploring writing code directly into FPGAs. And heterogeneous cores may give application developers more reason to write to the chip rather than to an operating system application programming interface (API).

For application developers, power is as much a balancing act with performance as it is for hardware developers. While classical scaling before 90nm provided both power and performance benefits at each process node, the choice now is largely one or the other. For every gain in performance, there has to be a corresponding drop in power somewhere else on the chip. Otherwise the clock speed cannot be raised without burning up the chip.

That has prompted software developers to look for different solutions. Even Intel, whose success was built almost entirely on tight integration with operating systems such as Windows, Mac OS X and Linux, is looking at utilizing some of the cores in its future chips differently.

“There is broad agreement that we need to be able to represent the ability to do parallelism at the application level and not force everything through the operating system,” said Pat Gelsinger, senior vice president in charge of Intel’s Enterprise Group. “Any time you have a call through the operating system to get a resource—whether it’s a thread or an I/O—your application has gone away for thousands of clock cycles. You want to do that when you need something that only the operating system can give you.”
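The cost Gelsinger describes can be illustrated with a rough, hedged sketch. The Python below times a call that stays inside the application against one that asks the operating system for something (here `os.getppid`, chosen only as a convenient stand-in for "a call through the operating system"). The helper names `user_space_noop`, `average_ns` and `compare` are hypothetical and not from the article. Interpreter overhead inflates both figures and results vary widely by platform, so only the gap between the two numbers is of interest, not the absolute values.

```python
import os
import time


def user_space_noop() -> int:
    """Hypothetical stand-in for work that never leaves the application."""
    return 0


def average_ns(fn, iterations: int = 100_000) -> float:
    """Return the mean wall-clock cost of one call to fn, in nanoseconds."""
    start = time.perf_counter_ns()
    for _ in range(iterations):
        fn()
    elapsed = time.perf_counter_ns() - start
    return elapsed / iterations


def compare(iterations: int = 100_000) -> dict:
    """Time a pure user-space call against a call that queries the OS."""
    return {
        "user_space_ns": average_ns(user_space_noop, iterations),
        # Assumption: os.getppid asks the operating system on each call.
        "os_call_ns": average_ns(os.getppid, iterations),
    }


if __name__ == "__main__":
    for name, cost in compare().items():
        print(f"{name}: {cost:.1f} ns per call")
```

A measurement like this says nothing about scheduling delays or blocking I/O, which is where an application can lose far more time than a single quick query costs.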

Typically the operating system acts like a layer of middleware. It makes the connections through its APIs that allow applications like Office to work together so that portions of one application can be dragged and dropped into another. But in highly parallel applications, the interactions are largely within the application rather than with other applications.

“There is an active effort to move some of this parallelism to the application level so the application programmer, given the right tools and libraries, can take advantage of that,” Gelsinger said. “Microsoft has taken steps like that recently with networking and the NPI (network programming interface) layer, moving it into the user space. Use the operating system for what you need it for, but allow parallelism to be more lightweight. Those steps are under way, and they will have great benefit. It started out as the HPC (high-performance computing) community, where they were using tens of thousands of threads.”

IBM is likewise experimenting with a thinner operating system layer for its Power architecture. Brad McCredie, chief architect of the new Power 6 chip and an IBM Fellow, said one of the first examples is hardware accelerators, which are being used to speed up applications.

“We’ve already created an architected layer in the Cell processor,” said McCredie. “It’s not exactly writing software into the metal. We gave the software programmers an architected interface, so we hid some of the messiness of the 100 gigaflop accelerator with a new generalized interface, which is OpenCL. We expect to put in multiple types of accelerators in the future.”

At some point, though, even this approach will run out of steam. McCredie said the debate inside IBM right now is when exactly that point will occur. He believes it will happen at 22nm.

“Eventually we’re going to run out of power on a chip,” he said. “The next way will be to design devices to do fewer and fewer things. That trend will happen. The question is whether we will be able to invent a more specific device that can do 80% of the workloads at less power. If it only does 10%, then no one will write a line of code for it. But if it covers 80%, then it will have much better power/performance.”