Flawed Research?

Putting together the System and Power/Performance Bits postings means that I read through a lot of research pages each week, and sometimes I come across one that just doesn’t sit right. See if you agree.


Soft errors were first discussed a long time ago, and memories, which are the most susceptible to this type of error due to their finer geometries and tighter packing, have long included protection against them. The scare proved to be larger than the real problem, but as geometries shrink, we are again hearing about the potential problems.

Researchers at the MIT Computer Science and Artificial Intelligence Lab (CSAIL) have developed a new programming framework that knows when a bit of data can be sacrificed to permit timely and energy efficient performance — while allowing for calculation of accurate results. This sounds similar to the way the brain works in that it is known to be quite imprecise, and yet when it knows an accurate result is required it can provide checks to ensure the best possible result.

The researchers say that we could simply let our computers make more mistakes. If, for instance, a few pixels in each frame of a high-definition video are improperly decoded, viewers probably won’t notice — but relaxing the requirement of perfect decoding could yield gains in speed or energy efficiency.

In anticipation of the dawning age of unreliable chips, Martin Rinard, a professor in the MIT EECS Department and a principal investigator at MIT CSAIL, together with his research group, has developed a new programming framework that enables software developers to specify when errors may be tolerable. The system then calculates the probability that the software will perform as it’s intended.

“If the hardware really is going to stop working, this is a pretty big deal for computer science,” says Rinard. “Rather than making it a problem, we’d like to make it an opportunity. What we have here is a … system that lets you reason about the effect of this potential unreliability on your program.”

The researchers’ system, which they’ve dubbed Rely, begins with a specification of the hardware on which a program is intended to run. That specification includes the expected failure rates of individual low-level instructions, such as the addition, multiplication, or comparison of two values. In its current version, Rely assumes that the hardware also has a failure-free mode of operation — one that might require slower execution or higher power consumption.

A developer who thinks that a particular program instruction can tolerate a little error simply adds a period — a “dot,” in programmers’ parlance — to the appropriate line of code. So the instruction “total = total + new_value” becomes “total = total +. new_value.” Where Rely encounters that telltale dot, it knows to evaluate the program’s execution using the failure rates in the specification. Otherwise, it assumes that the instruction needs to be executed properly.

Compilers typically produce what’s called an “intermediate representation,” a generic low-level program description that can be straightforwardly mapped onto the instruction set specific to any given chip. Rely simply steps through the intermediate representation, folding the probability that each instruction will yield the right answer into an estimation of the overall variability of the program’s output.
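To get a feel for what "folding the probability" might look like, here is a minimal sketch in Python. It is not Rely's actual analysis: it assumes straight-line code, independent failures, and made-up per-instruction reliabilities, and the names `UNRELIABLE` and `program_reliability` are hypothetical.

```python
from math import prod

# Hypothetical probabilities that an instruction yields the right answer
# when run in the unreliable mode. Illustrative numbers only; they do not
# come from Rely or from any real chip.
UNRELIABLE = {"add": 1 - 1e-7, "mul": 1 - 1e-6, "cmp": 1 - 1e-7}


def program_reliability(instructions):
    """Estimate the probability that every instruction executes correctly.

    instructions: list of (opcode, relaxed) pairs. A relaxed instruction
    (the one written with a dot, e.g. "+.") uses the unreliable rate;
    every other instruction is assumed to run in the failure-free mode.
    Failures are assumed independent, so the probabilities multiply.
    """
    return prod(UNRELIABLE[op] if relaxed else 1.0
                for op, relaxed in instructions)


# "total = total +. new_value" plus a reliable loop comparison,
# executed 1000 times.
loop_body = [("add", True), ("cmp", False)]
trace = loop_body * 1000
print(program_reliability(trace))
```

Even with a one-in-ten-million failure rate per relaxed addition, the estimate drops below 1 as the instruction count grows, which is the kind of number a developer would weigh against the speed or energy gain.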

My analysis
Now I don’t know if your mind said – hang on a minute – as you read this article. Personally, I think they are confusing two completely different issues. One is accuracy and the other is reliability. Reliability, in this sense, involves an unexpected error in a calculation. Accuracy, by contrast, concerns a deliberate reduction in the precision of the result. I accept that there are cases where performance or power can be traded off against the accuracy or resolution of the result, but what they are describing is an uncontrollable error produced by the hardware. You have to know that the error occurred before you can slow down the system.

A second issue, as I see it, is that a compiler produces a stream of instructions, but that stream is fed into a pipelined processor. In most cases it will not be possible to slow down the pipeline just because a single instruction has been marked as being less important. Perhaps a better way would be to power down a slice of the datapath, reducing the precision of the calculation, if that is what they are trying to do. I would be interested in hearing your thoughts on this subject.
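The difference between reduced precision and an unexpected hardware error can be made concrete with a small Python sketch. It models a powered-down datapath slice as simply dropping the low-order bits of each operand, and a soft error as a single flipped bit in the result. Both models, and the function names, are my own simplifying assumptions rather than anything from the research.

```python
def truncate_low_bits(value, dropped_bits):
    """Model a datapath whose low-order slice is powered down:
    the lowest `dropped_bits` bits of the operand are forced to zero."""
    return value & ~((1 << dropped_bits) - 1)


def flip_bit(value, bit):
    """Model a soft error: one bit of the value is inverted."""
    return value ^ (1 << bit)


exact = 1000 + 2345

# Reduced precision: each operand loses at most 2**4 - 1 = 15,
# so the error in the sum is known in advance to be at most 30.
approx = truncate_low_bits(1000, 4) + truncate_low_bits(2345, 4)
print("precision error:", exact - approx)

# Unreliable hardware: the size of the error depends on which bit
# happens to flip, and nothing in the result says that it happened.
corrupted = flip_bit(exact, 30)
print("soft error:", abs(corrupted - exact))
```

The truncation error has a bound the designer chooses; the bit-flip error can be as large as the word width allows.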

What do you think?
