Tackling a bug, like catching jewel thieves, is easier when you can narrow down the suspects.
Over Easter weekend in 2015 there was a jewelry heist at the safe deposit building in Hatton Garden in London. The safe deposit vault was in the basement of a building and was used by jewelers in the area to store large amounts of diamonds, jewelry, precious metals, and cash. The thieves made off with over $300 million in loot, making it the biggest heist in British history. For a while it looked like it was going to be the perfect crime. The thieves left no trace: no fingerprints, no DNA fragments, nothing identifiable. There were no witnesses and no video surveillance. Scotland Yard was left with almost no leads. For weeks after the theft, the public speculated about who might have pulled off such a spectacular feat. People envisioned an ‘Ocean’s Eleven’-type crew of highly trained criminal masterminds.
It turns out the heist was perpetrated by a group of senior citizens: elderly criminals who were bored with retirement. And they almost got away with it. It makes for fascinating reading. One account of the story can be found here: http://www.vanityfair.com/culture/2016/03/biggest-jewel-heist-in-british-history.
What did them in was the almost ubiquitous video surveillance in London. The UK has over 1.8 million video cameras watching everything (http://www.theguardian.com/uk/2011/mar/02/cctv-cameras-watching-surveillance). In addition to capturing video, many of the feeds are equipped with license plate readers. This is especially true in London, where automated surveillance of cars is used to enforce a congestion charge (https://tfl.gov.uk/modes/driving/congestion-charge), which requires drivers to buy a permit to drive in central London during peak traffic hours.
The police used this database of cars to look for vehicles that had been in the area prior to the crime but were somehow out of place. Most cars in the area belong to folks who visit on a regular basis. They work there, deliver goods, or live in the area, so their presence can be considered “usual.” Police could eliminate those cars by matching multiple hits of license plates both before and after the crime. Other folks might just be passing through, so police would see only a single occurrence of those cars. But someone planning the heist would case the crime scene: there would be multiple sightings of the car prior to the heist, but none after. Using “Big Data” techniques, the police were able to sift through the huge database of cars seen in the vicinity and identify a small number of potential suspects. They followed each of the suspects and eventually found a group of retirees who had a bit more cash than their small pensions could explain and who were bragging in their local pub about pulling off the largest heist ever. The rest is, as they say, history.
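The sifting logic itself fits in a few lines. Here is a toy sketch, with invented plates and sighting counts (none of this data comes from the actual investigation): count sightings before and after the crime, then flag cars seen repeatedly before but never after.

```python
from collections import defaultdict

# Invented sightings: (plate, days_relative_to_crime); negative means
# the car was also seen after the heist.
sightings = [
    ("AB12CDE", 5), ("AB12CDE", 3), ("AB12CDE", -2),  # regular visitor
    ("XY99ZZZ", 4),                                   # just passing through
    ("GH34IJK", 9), ("GH34IJK", 6), ("GH34IJK", 2),   # casing the scene?
]

before, after = defaultdict(int), defaultdict(int)
for plate, days in sightings:
    (before if days > 0 else after)[plate] += 1

# Suspects: multiple sightings before the crime, none after.
suspects = [p for p, n in before.items() if n > 1 and after[p] == 0]
print(suspects)  # ['GH34IJK']
```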
Looking for an anomaly is not just a good way to find the bad guys in real life, it’s also good for finding bad things in a design. There was a project at Stanford University called DIDUCE, which instrumented a compiler to save all values taken on by a variable when a program was run. The project was run by Dr. Monica Lam, and a paper on it can be found here: http://suif.stanford.edu/papers/Diduce.pdf.
Essentially, the program looks at the values taken on by all variables over time, finds the unusual values, and brings them to the attention of the programmer to see if they represent a bug. If a variable usually has values between 500 and 1000, except once when it took on the value 2,384,294,103, that might be a bug. Even though the program may not fail on any specific test case, it is probably worth looking at what is going on around this unusual occurrence. The project found a frightening number of latent bugs in several open source projects, including bugs in the Java secure socket layer. The paper is definitely worth a read.
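The core idea is easy to prototype. Here is a simplified sketch of a DIDUCE-style check (the real tool builds and relaxes invariants as the program runs; this toy version just flags any value far outside the range seen so far, and the threshold is invented):

```python
def find_outliers(values, factor=10):
    """Flag values more than `factor` times larger than anything seen before."""
    outliers, seen_max = [], None
    for i, v in enumerate(values):
        if seen_max is not None and abs(v) > factor * seen_max:
            outliers.append((i, v))  # wildly outside the history: suspicious
        seen_max = abs(v) if seen_max is None else max(seen_max, abs(v))
    return outliers

# A variable that usually sits between 500 and 1000... except once.
history = [512, 730, 801, 998, 2384294103, 640]
print(find_outliers(history))  # [(4, 2384294103)]
```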
And we can apply these same hunting techniques to our system level debug, looking for what is there but shouldn’t be.
I was recently debugging a Linux boot failure on a design running in emulation. Partway through the boot, the kernel would go into a panic and crash miserably. What made this particularly confounding was that any change to the Linux kernel would change the location and symptom of the panic #heisenbug. Interestingly, even a change to the bootstrap loader, U-Boot, or to the disk image would also move the point of the panic in the boot process. Adding a single printf to the bootstrap loader moved the panic, which occurred during the Linux boot, by more than 10 seconds. Adding a debugger to the mix was no help: single-stepping around the point of failure always made the failure move to another spot.
Codelink is a non-intrusive debug technology. It enables a debugger to show you what is going on in the processor, but it does not affect the program execution in any way. As a result, it makes it a lot easier to find problems like this one. Codelink traces the processor as it runs and stores the data in what we call a “replay database.” It uses this database to reconstruct the state of the processor. Embedded in this are all the read and write operations that took place while the program was running. Sure, it’s a lot of data — literally gigabytes for a Linux boot. But that is exactly what is needed for feeding a big data algorithm.
(Figure: The Codelink environment)
Since the Linux image and the corresponding disk image that were failing on my RTL design booted correctly on a virtual prototype of the design, I was pretty sure that the software was good. In a hardware/software system, by process of elimination, that leaves the hardware as the prime suspect. For the hardware to make the software go bad, it has to deliver some kind of bad input to the program. My suspicion fell on the bus fabric, which was written by a developer working with the AXI protocol for the first time.
Given all the reads and writes in the system, one sanity check is to verify that every read matches the most recent write to the same address. That is, if the processor writes a value of “x” to a given memory location, and then reads a value of “y” from it at a later point in time, this would be a bug.
But it’s not that simple. There are a number of cases where a read value will be different from the prior write. For example, hardware devices. A serial port will return a different data value from its data port every time it is read. The status port will return different values as well. So first, I needed to filter out all differences at the IO locations in the memory map. Also, DMA operations can update memory without the processor being involved. It was easy enough to turn off DMA operations. In a later blog post I’ll describe how to bring in the DMA operations.
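The check above can be sketched in a few lines. This is a simplified model, not the actual Codelink implementation: it assumes the replay database can be flattened into (operation, address, value) tuples, and the IO address window is invented (a real design would take it from the memory map).

```python
# Assumed IO window; reads from these addresses legitimately differ
# from the last write, so they are filtered out.
IO_BASE, IO_TOP = 0x9000_0000, 0x9FFF_FFFF

def check_consistency(trace):
    """Return (address, written, read) triples where a read mismatches
    the most recent write to that address."""
    last_write, suspects = {}, []
    for op, addr, value in trace:
        if IO_BASE <= addr <= IO_TOP:
            continue  # IO registers change between accesses by design
        if op == "W":
            last_write[addr] = value
        elif op == "R" and addr in last_write and last_write[addr] != value:
            suspects.append((addr, last_write[addr], value))
    return suspects

trace = [
    ("W", 0x8000_1000, 0xAA), ("R", 0x8000_1000, 0xAA),  # consistent
    ("W", 0x8000_2000, 0x11), ("R", 0x8000_2000, 0x22),  # suspect!
    ("W", 0x9000_0010, 0x01), ("R", 0x9000_0010, 0xFF),  # IO, filtered out
]
print(check_consistency(trace))  # one suspect: wrote 0x11, read back 0x22
```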
So from the billions of read and write operations, after filtering out the IO locations I was left with dozens of suspect values. A few dozen suspects I can hunt down.
Looking at the problem memory accesses that were left, an obvious pattern emerged: the problems came in pairs. Each pair always had one access at 0x40xxxxxx and one at 0x60xxxxxx, and the data expected and the data actually read were swapped. It turned out that the data was being returned from the wrong peripheral.
This gave me the “ah-ha” moment. I’ll spare you all the debug steps and fast forward to the solution.
But first, a brief primer on the AXI protocol. AXI allows a bus master to have multiple outstanding transactions, and the slaves can return those transactions to the master in any order. A transaction ID is used to sort out which returned data should be paired with which request. But wait, there is a bit of fine print here. The master can have multiple outstanding requests with the same transaction ID #why #dontgetmestarted, and it is the responsibility of the bus fabric to force these transactions to be returned in the same order they were sent out.
The source of the problem was that the processor was sending out two transactions with the same ID to two different peripherals. The bus fabric needed to notice this, per the spec, and force the ordering of the returns. But the guy who wrote the bus fabric missed this tiny detail (hey, it's two sentences in an 80-page document). If multiple requests with the same transaction ID were sent to the same memory device, the memory device serialized the transactions, no problem. If the requests went to different devices, most of the time the first transaction completed first, so everything was OK. But occasionally, when the transactions were sent to different slaves, the second transaction would complete first, and the data values were associated with the wrong transaction. Surprisingly, this did not always cause the Linux boot to go off the rails. About a dozen of these mismatches occurred before Linux cratered.
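The failure mode can be illustrated with a toy model. The addresses, latencies, and transaction ID below are invented, and the "compliant" path simply reorders responses back into issue order per ID, which is the observable effect of a fabric that stalls the later response as the spec requires:

```python
def completion_order(requests, compliant):
    """requests: list of (txn_id, addr, latency) in issue order.
    Returns the order in which response data reaches the master."""
    done = sorted(requests, key=lambda r: r[2])  # slaves finish fastest-first
    if not compliant:
        return [addr for _, addr, _ in done]     # buggy fabric: as they finish
    # Compliant fabric: within a transaction ID, responses follow issue order.
    per_id, taken, order = {}, {}, []
    for txn_id, addr, _ in requests:
        per_id.setdefault(txn_id, []).append(addr)
    for txn_id, _, _ in done:
        i = taken.get(txn_id, 0)
        order.append(per_id[txn_id][i])
        taken[txn_id] = i + 1
    return order

reqs = [(7, 0x4000_0000, 20),  # issued first, to a slow slave
        (7, 0x6000_0000, 5)]   # issued second, to a fast slave
print(completion_order(reqs, compliant=False))  # fast slave answers first: swap!
print(completion_order(reqs, compliant=True))   # issue order preserved
```

In the buggy case the master pairs the fast slave's data with the first request, which is exactly the swapped-peripheral pattern in the trace.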
Finding this by staring at waveforms or poking around in the debugger would have taken forever. Post-processing the trace data made zeroing in on the problem much faster — so much so that we decided to add this capability as a feature to our Codelink product. We’re calling it the memory consistency check. It is great for finding bus fabric problems like the one I described here, and also good at isolating cache coherency problems. Got a bad Heisenbug? You might want to take a look at what Codelink can do for you.