What hoops will designers have to jump through to avoid concurrency bugs?
Concurrency adds complexity for which the industry lacks appropriate tools, and the problem has grown to the point where errors can creep into designs with no easy or consistent way to detect them.
In the past, when chips were essentially a single pipeline, this wasn’t a problem. In fact, the early pioneers of EDA created a suitable language to describe and contain the necessary concurrency that existed within and across clock cycles. This language evolved into Verilog.
But when the performance of a pipeline could no longer be improved through design or device scaling, the only way forward was multiple pipelines. “It all changed around the mid-2000s when we no longer had the frequency scaling and we no longer had the power efficiencies,” says Craig Shirley, president and CEO of Oski Technology. “So while we still had transistor density scaling, we had to pivot from synchronous sequential design to parallelism. That was the beginning of the era of parallelism.”
By that time, the software industry was firmly entrenched in a sequential programming paradigm, and band-aids were created at the system level to prolong the life of those methodologies. Unfortunately, the EDA industry went in essentially the same direction, adding higher-level programming concepts from software languages rather than creating languages that could faithfully represent the new levels of concurrency being integrated into a single chip.
Time spent in verification has grown, and it will continue to grow because of this. “In complex chip designs, finding corner-case bugs is a critical part of the functional verification process,” says Tom Anderson, technical marketing consultant for OneSpin Solutions. “Corner-case bugs, in turn, are all about concurrency. They occur when specific combinations of events happen at the same time and trigger behavior not previously exercised.”
The problem is that it is impossible to verify all of the potential corner cases. “We call these simulation-resistant superbugs,” adds Shirley. “These functional bugs are resistant to simulation because extreme corner cases are required to activate and detect them. They are a side-effect of designer innovations in parallelism and concurrency to offset the slowing of frequency scaling in a post-Moore’s Law era.”
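As a rough software analogy, not drawn from any of the designs discussed here, the classic unsynchronized counter below shows how such a bug can survive thousands of apparently clean runs and only fire under one specific interleaving:

```cpp
#include <iostream>
#include <thread>

// Two workers update a shared counter without synchronization. Most runs
// appear correct; the bug surfaces only when both threads interleave their
// read-modify-write at the same moment, the software analog of a
// "simulation-resistant" corner case.
int counter = 0;

void worker() {
  for (int i = 0; i < 100000; ++i)
    ++counter;  // data race: load, increment, and store are not atomic
}

int main() {
  std::thread t1(worker), t2(worker);
  t1.join();
  t2.join();
  // Expected 200000, but the actual value varies from run to run.
  std::cout << counter << '\n';
}
```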
Some of them can be tricky to detect. “Performance problems will be seen more frequently in designs of this type,” warns Russell Klein, HLS Platform program director at Mentor, a Siemens Business. “I am seeing this with increasing frequency at customers that I work with. The goal of creating such designs is to improve throughput and performance, but side effects and unintended consequences cause unexpected performance problems.”
Fig 1. Concurrency as defined in Portable Stimulus. Source: Accellera Systems Initiative.
It all comes about because there is no way to explicitly define communications between concurrent threads. When unintended or implicit communications happen, trouble lurks.
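A minimal sketch of the alternative, with all names invented for illustration: if the only way data can move between threads is through an explicit channel, every inter-thread dependency is visible at an interface instead of lurking in shared state.

```cpp
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

// A minimal explicit channel: the only way data moves between threads is
// through send() and recv(), so every communication is declared at the
// interface rather than hidden in shared memory.
template <typename T>
class Channel {
 public:
  void send(T v) {
    std::lock_guard<std::mutex> lock(m_);
    q_.push(std::move(v));
    cv_.notify_one();
  }
  T recv() {
    std::unique_lock<std::mutex> lock(m_);
    cv_.wait(lock, [this] { return !q_.empty(); });
    T v = std::move(q_.front());
    q_.pop();
    return v;
  }
 private:
  std::queue<T> q_;
  std::mutex m_;
  std::condition_variable cv_;
};

int main() {
  Channel<int> ch;
  std::thread producer([&] { for (int i = 0; i < 5; ++i) ch.send(i); });
  for (int i = 0; i < 5; ++i) std::cout << ch.recv() << '\n';
  producer.join();
}
```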
Languages and concurrency
Languages play a vital role in how things are specified. “The hardware design languages we have are pretty good at describing concurrency,” says Klein. “I am not seeing much lacking there. On the software side there are a lot of parallel programming languages, but software developers continue to cling to their single-threaded programming models. Anant Agarwal and Tilera established a technically successful approach to parallel implementation of software. Even though it delivered about 3 times more performance for the same power than traditional processors, it never achieved much commercial success. As scaling becomes more challenging, we may see software teams reconsidering these programming models, as a way to continue increasing system performance.”
But in some niche areas concurrency languages have been somewhat successful. “If you use OpenMP or OpenCL to express parallelism at a higher level, or even CUDA—now you have a compiler that makes sure that you don’t have coherency or deadlock issues,” explains Frank Schirrmeister, senior group director, product management and marketing for Cadence. “The same is true with designs that use FPGAs. How do you program something like Xilinx’s Zynq? Whenever they speak, they refer to OpenCL, where you have the ability to express concurrency. Then you have a compiler that makes sure that the dependencies are resolved.”
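A minimal OpenMP sketch (any compiler with -fopenmp support is assumed) shows the division of labor Schirrmeister describes: the programmer declares where parallelism is legal, and the compiler and runtime handle the partitioning.

```cpp
#include <cstdio>
#include <vector>

// Compile with: g++ -fopenmp saxpy.cpp
// The pragma declares the parallelism; the runtime partitions iterations
// across threads, so there is no hand-written locking to get wrong.
int main() {
  const int n = 1 << 20;
  std::vector<float> x(n, 1.0f), y(n, 2.0f);
  const float a = 3.0f;

  #pragma omp parallel for
  for (int i = 0; i < n; ++i)
    y[i] = a * x[i] + y[i];  // each iteration is independent

  std::printf("y[0] = %f\n", y[0]);
}
```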
High-level synthesis (HLS) is the middle ground where instead of defining a suitable language, the industry went in the direction of discovering potential concurrency. “The scope of HLS is limited enough that it works,” adds Schirrmeister. “The focus is the micro-architecture level where I am unrolling loops and figuring out how to get the right level of parallelism to meet the performance requirement. But I am not sharing that parallelism with software running on the processor that accesses that block.”
“High-level synthesis tools do a pretty good job of finding and introducing concurrency,” says Klein. “In cases where it cannot extract concurrency due to the nature of the algorithm description, it will identify the dependencies for the developer, enabling them to manually modify the code to increase concurrency. However, HLS does not have global visibility. It cannot take into account activity in other modules or in software. To do a good job of partitioning—specifically between hardware and software, and partitioning software across multiple cores—one needs a global view of the flow of data through the system, encompassing both hardware and software.”
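The flavor of code in question looks something like the sketch below. The UNROLL directive shown is Xilinx Vitis HLS syntax; other tools spell the same hint differently, and the function itself is an invented example. Note that the tool's view stops at this function's boundary.

```cpp
// HLS-style C++: the tool discovers concurrency within this function but
// has no visibility into what the calling software or neighboring blocks
// are doing. The pragma below is Xilinx Vitis HLS syntax; other tools use
// different directives for the same idea.
void fir(const int coeff[8], const int sample[8], int *out) {
  int acc = 0;
accumulate:
  for (int i = 0; i < 8; ++i) {
    #pragma HLS unroll  // replicate the multiply-accumulate 8x in hardware
    acc += coeff[i] * sample[i];
  }
  *out = acc;
}
```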
And that is where the problems start.
Concurrency at the system level
If concurrency is not specified, it has to be discovered. “Today, it means running the complete system, hardware and software, and monitoring it,” explains Klein. “Of course, once you have the complete system running there usually is not a lot of redesign that is going to happen, unless it really fails to meet its requirements. The industry really needs to come up with a way to monitor and view these interactions and dependencies, across hardware and software, in a way that leads developers to understand their systems better. And do it earlier in the design cycle, when changes can be implemented that will impact the current design.”
Attempts have been made to make communications explicit. “It goes back to the days of Alberto Sangiovanni-Vincentelli, who promoted communications-based design,” says Schirrmeister. “Communication was explicit, and you would then figure out how to implement it in a safe way such that there would be no chance of deadlock. That was also the basis for MCAPI. This API enabled the separation of communications from the processing. The beauty of MCAPI is that you could predictably say that this block needs to talk to that block, and this processing needs to talk to that processing. So you separate communications from processing. You know exactly what the dependencies are, and you can implement them using automation in such a way that they are optimized and can never fail.”
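The sketch below captures the spirit of that separation in plain C++. To be clear, the Fabric, connect, send, and recv names are invented for illustration and are not the actual MCAPI calls; the point is that the topology is declared once, up front, and anything undeclared is an error.

```cpp
#include <cassert>
#include <iostream>
#include <map>
#include <queue>
#include <set>
#include <string>
#include <utility>

// Hypothetical sketch in the MCAPI spirit. The communication topology is
// declared up front, separately from processing, so it can be checked
// (here, a simple legality assert) and optimized by tooling.
struct Fabric {
  std::set<std::pair<std::string, std::string>> legal;
  std::map<std::string, std::queue<int>> inbox;

  void connect(const std::string& from, const std::string& to) {
    legal.insert({from, to});  // declares a legal channel
  }
  void send(const std::string& from, const std::string& to, int payload) {
    assert(legal.count({from, to}) && "undeclared communication");
    inbox[to].push(payload);
  }
  int recv(const std::string& who) {
    int v = inbox[who].front();
    inbox[who].pop();
    return v;
  }
};

int main() {
  Fabric f;
  // Topology: declared once, explicitly. No other paths exist.
  f.connect("dma", "codec");
  f.connect("codec", "cpu");

  // Processing: uses only declared channels.
  f.send("dma", "codec", 21);
  f.send("codec", "cpu", f.recv("codec") * 2);
  std::cout << f.recv("cpu") << '\n';  // prints 42

  // f.send("dma", "cpu", 0);  // would trip the assert: not declared
}
```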
Today, hardware and software design essentially are viewed separately. “System-level concurrency during design is not an obstacle since we provide IP that must work in a variety of systems with different timing constraints,” says Peter Greenhalgh, vice president of technology and a fellow at Arm. “Where control over concurrency is visible at the software level, established architectural techniques like locking instructions (e.g., load-exclusive, store-exclusive) and atomics are used. Data integrity is also managed and maintained using standard coherency protocols like MOESI across IP such as Coherent Interconnects.”
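In software terms, those architectural primitives surface as atomic operations. A small sketch, assuming only the C++ standard library: the compare-and-swap retry loop below behaves much like a failed store-exclusive forcing another load-exclusive.

```cpp
#include <algorithm>
#include <atomic>
#include <iostream>
#include <thread>

// std::atomic operations compile down to the load-exclusive/store-exclusive
// (or atomic read-modify-write) instructions mentioned above. The CAS loop
// retries whenever another thread won the race, so the update is correct
// under any interleaving.
std::atomic<int> counter{0};

void add_clamped(int delta, int max) {
  int cur = counter.load();
  int next;
  do {
    next = std::min(cur + delta, max);
    // compare_exchange_weak reloads `cur` on failure, much like a failed
    // store-exclusive forcing another load-exclusive.
  } while (!counter.compare_exchange_weak(cur, next));
}

int main() {
  std::thread t1([] { for (int i = 0; i < 1000; ++i) add_clamped(1, 1500); });
  std::thread t2([] { for (int i = 0; i < 1000; ++i) add_clamped(1, 1500); });
  t1.join();
  t2.join();
  std::cout << counter << '\n';  // always 1500, never a lost update
}
```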
Attempting to make hardware insensitive to asynchronous events is standard practice. “Ideally, you want to catch dependencies as early as possible in the design process,” says Dave Pursley, product manager for HLS at Cadence. “For example, you want to avoid the costly mistake of designing several large blocks of hardware only to find out that when placed in situ unexpected timing (latency) dependencies introduce a problem such as deadlock, data being dropped, or memory corruption. These situations can be avoided by using latency-insensitive interfaces on all blocks of hardware. Full handshakes ensure the data flowing through the system is processed as expected, and that no data is being dropped. It also explicitly enforces that there are no hidden assumptions about timing synchronization between blocks.”
There is a price to pay for this. “The cost for this is very small or negligible, adding an extra bit or two for handshaking block-to-block interfaces,” adds Pursley. “This design practice is especially helpful when designing IP for reuse. Latency-insensitive interfaces future-proof the IP.”
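A behavioral sketch of what such an interface does, with invented names rather than any particular library's API: data crosses a ready/valid link only when both sides agree, so neither side can bake in assumptions about the other's latency.

```cpp
#include <cstdio>
#include <deque>

// A behavioral model of a latency-insensitive (ready/valid) interface:
// data crosses the link only on cycles where both valid and ready are
// asserted, so no timing assumption is hidden between the blocks.
struct ReadyValid {
  bool valid = false, ready = false;
  int data = 0;
  bool fires() const { return valid && ready; }
};

int main() {
  ReadyValid link;
  std::deque<int> producer{1, 2, 3};
  int stall = 0;  // consumer is busy for a few cycles at a time

  for (int cycle = 0; cycle < 12; ++cycle) {
    link.valid = !producer.empty();
    if (link.valid) link.data = producer.front();
    link.ready = (stall == 0);          // consumer backpressure

    if (link.fires()) {                 // transfer only on full handshake
      std::printf("cycle %2d: got %d\n", cycle, link.data);
      producer.pop_front();
      stall = 2;                        // model variable consumer latency
    } else if (stall > 0) {
      --stall;
    }
    // No data is dropped: if the consumer stalls longer, the producer
    // simply holds valid until the handshake completes.
  }
}
```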
But there is a much bigger price being paid. Many of the security issues and vulnerabilities being exploited today are a direct result of making implicit communications possible. Coherency hardware is one consequence: if all communications were explicit, there would be no need for universal coherence. The ability to share memory between threads creates vulnerabilities, which then require even more hardware in an attempt to stop them from being exploited.
Debugging concurrency
Eventually concurrency will create a bug. “These are very difficult to debug, as they often involve software spread across multiple processors, and transactions interacting across multiple bus structures,” says Klein. “There are not a lot of tools that can show the developer what is happening across multiple cores correlated with bus transactions and hardware activity. The visibility needed to resolve these problems is hard to achieve.”
Most of the time it comes down to ad hoc methods. “We have always relied on designers who were brilliant enough to comprehend the entire operation of the design in their head,” says one designer. “As you move to larger and more complex designs, you go up a curve where fewer and fewer people on the team are capable of doing that. At some point you reach a threshold where nobody can comprehend it. How do you find bugs when you have to cross that knowledge boundary?”
Klein says that he sees “developers using a lot of $display statements, parsing simulation transcripts, and dumping VCD files from waveform databases. They stare at the data until a developer has that ‘aha’ moment. It is a very ad hoc approach, and a systematic one does not exist.”
Another way is to add performance-limiting safeguards. “If you run into a coherency issue, one way is to flush things and get back to the right state,” offers Schirrmeister. “That will cost you performance. These problems can be very difficult to find. They are beyond everything that would be done for an IP-level test. They test the interactions between the components.”
There are two ways in which most companies attempt to flush out these problems—Portable Stimulus (PSS) and formal. “With PSS you have a constraint solver that goes through all possible permutations,” says Schirrmeister. “This allows you to figure out if there are any issues. Debug happens at that higher level. You express things in a way that looks like UML diagrams and then you look at sequence charts and this shows you what is activated and when. This makes it easier to debug issues.”
This does require a different way of thinking about verification. “You have to put functionality to one side and test system component interactions,” says Dave Kelf, CMO for Breker. “This includes performance and power stress testing, system-level coherency, and critical paths. Doing this, the more complex corner-case bugs can be uncovered. Randomizing system-level interactions can reveal odd behaviors that are hard to envisage for engineers focused on the functional specification, exposing bugs or performance/power bottlenecks.”
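Real PSS is its own declarative language, but the underlying idea can be sketched in a few lines of plain C++ (the operation names below are invented): take a set of legal system-level operations and randomize their interleaving on every run.

```cpp
#include <algorithm>
#include <cstdio>
#include <random>
#include <vector>

// A plain-C++ sketch of randomized system-level stress testing: shuffle
// the interleaving of legal system operations each run, so odd orderings
// that no directed functional test would pick get exercised.
struct Op { const char* name; };

int main() {
  std::vector<Op> ops = {
      {"cpu0_write_buffer"}, {"dma_read_buffer"},
      {"cpu1_flush_cache"},  {"accel_irq"},
  };
  std::mt19937 rng(std::random_device{}());

  for (int run = 0; run < 3; ++run) {
    std::shuffle(ops.begin(), ops.end(), rng);  // a new legal ordering
    std::printf("run %d:", run);
    for (const Op& op : ops) std::printf(" %s", op.name);
    std::printf("\n");
    // In a real flow each ordering would drive the design under test and
    // be checked against coherency and performance expectations.
  }
}
```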
Doing this type of verification on more abstract models can also help. “Accurately modeling the latency and dataflow of the system using loosely or approximately timed transaction-level models (TLM) is a common way to do this,” adds Pursley. “These models simulate much faster than RTL. More importantly, these are available much sooner in the development process, often before the RTL design has begun in earnest. Then, if unexpected dependencies are found during TLM simulation, the RTL specifications can be modified to accommodate any additional dependencies that the latency-insensitive interfaces alone cannot avoid.”
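A minimal loosely-timed TLM-2.0 sketch, assuming the Accellera SystemC library is installed (the Cpu and Accel modules are invented stand-ins): latency is annotated on transactions rather than modeled cycle by cycle, which is what makes these models fast and available early.

```cpp
#include <iostream>
#include <systemc.h>
#include <tlm.h>
#include <tlm_utils/simple_initiator_socket.h>
#include <tlm_utils/simple_target_socket.h>

// Target models a block with an approximate 40 ns processing latency.
struct Accel : sc_module {
  tlm_utils::simple_target_socket<Accel> socket;
  SC_CTOR(Accel) : socket("socket") {
    socket.register_b_transport(this, &Accel::b_transport);
  }
  void b_transport(tlm::tlm_generic_payload& trans, sc_time& delay) {
    delay += sc_time(40, SC_NS);  // annotate latency, no cycle-by-cycle wait
    trans.set_response_status(tlm::TLM_OK_RESPONSE);
  }
};

// Initiator issues back-to-back writes and accumulates the annotated delay.
struct Cpu : sc_module {
  tlm_utils::simple_initiator_socket<Cpu> socket;
  SC_CTOR(Cpu) : socket("socket") { SC_THREAD(run); }
  void run() {
    unsigned char buf[4] = {0};
    tlm::tlm_generic_payload trans;
    trans.set_command(tlm::TLM_WRITE_COMMAND);
    trans.set_address(0x1000);
    trans.set_data_ptr(buf);
    trans.set_data_length(4);
    trans.set_streaming_width(4);
    sc_time delay = SC_ZERO_TIME;
    for (int i = 0; i < 4; ++i)
      socket->b_transport(trans, delay);  // latency adds up across calls
    std::cout << "accumulated latency: " << delay << std::endl;
  }
};

int sc_main(int, char*[]) {
  Cpu cpu("cpu");
  Accel accel("accel");
  cpu.socket.bind(accel.socket);
  sc_start();
  return 0;
}
```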
The other approach used to find these problems is formal. “Formal verification is especially effective at finding corner-case bugs because it exhaustively analyzes the design,” says OneSpin’s Anderson. “There’s no need to worry about missing a test case that would have found the bug. Further, formal analysis assumes that events and behaviors can happen concurrently unless specified otherwise. Although software is not directly analyzed, many of the assertions and constraints that specify design behavior reflect how the software (and firmware) can use the hardware. This provides a measure of system-level verification with a level of certainty unachievable by any other method. Debug occurs in a familiar environment since any bugs found by formal are reproduced in a simulation test.”
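Real formal tools operate symbolically on the design's state space, but a toy enumeration conveys why exhaustiveness matters. The sketch below walks every interleaving of two unsynchronized read-increment-write sequences and reports the schedules that lose an update, something a handful of random runs could easily miss.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Toy exhaustive exploration: each thread does a two-step increment
// (read into a register, then write back). We enumerate all 6 possible
// interleavings and flag the ones that lose an update.
struct State { int mem = 0, r0 = 0, r1 = 0, pc0 = 0, pc1 = 0; };

void step(State s, std::string trace, std::vector<std::string>& bad) {
  if (s.pc0 == 2 && s.pc1 == 2) {           // both threads finished
    if (s.mem != 2) bad.push_back(trace);   // lost update found
    return;
  }
  if (s.pc0 < 2) {                          // schedule thread 0 next
    State n = s;
    if (n.pc0 == 0) n.r0 = n.mem + 1; else n.mem = n.r0;
    ++n.pc0;
    step(n, trace + "0", bad);
  }
  if (s.pc1 < 2) {                          // schedule thread 1 next
    State n = s;
    if (n.pc1 == 0) n.r1 = n.mem + 1; else n.mem = n.r1;
    ++n.pc1;
    step(n, trace + "1", bad);
  }
}

int main() {
  std::vector<std::string> bad;
  step(State{}, "", bad);
  std::printf("%zu of 6 interleavings lose an update\n", bad.size());
  for (const auto& t : bad) std::printf("  schedule %s\n", t.c_str());
}
```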
Moving forward
Without action, this problem will get worse, not better. “To really address the concurrency problem at the system level, and we should because the benefits are enormous, there are two things that have to happen in the developer community,” says Klein. “One is that hardware and software cannot remain separate and distinct disciplines. The second is that software developers need to embrace parallel programming models.”
Schirrmeister adds that concurrency needs to be made explicit. “OpenCL is trying to be at a higher level of abstraction in the same way that MCAPI is separating processing from communications. If these are used, then it becomes possible to automate this. It becomes possible to choose between two implementation options, such as hardware and software. When I know all of the inputs and outputs, and who has the ability to talk to it, I can look at all of the options from a performance perspective and avoid bugs.”
When you add power into the equation, it becomes even more important that the problem is solved. “The programming languages are there, but with modest adoption,” concludes Klein. “The benefit to the developers is huge, in terms of performance and power consumption. If you look at a lineup of processors, you will find that the highest MIPS per milliwatt is in the smallest cores, not the biggest. If you distribute the software load across multiple smaller processors, you will get faster execution at lower power compared with a single-threaded program running on one big processor. Software developers won’t get that boost from scaling anymore, so perhaps they will start to look at these approaches.”
Brian,
It would have been interesting if you had used self-driving cars (an end market) to show how the issues listed in the article are addressed. I would expect numerous concurrent events in automotive software to cause many superbugs.
Bill
I gave up on Accellera and IEEE efforts to deal with concurrency and high-level design a while ago, particularly with leaving RTL (synchronous FSM) behind and moving up to a dataflow (asynchronous FSM) methodology, where it is easier to do functional verification. As an alternative I came up with an extended C++ to replace Verilog, and a way to take advantage of highly parallel architectures with legacy code:
http://parallel.cc
The EDA industry is about to discover that neural-network engineering has a lot of similar requirements, but the AI guys are going to build new tools…