How To Handle Concurrency

System complexity is skyrocketing, but tool support to handle concurrency and synchronization of heterogeneous systems remains limited.


The evolution of processing architectures has solved many problems within a chip, but for each problem solved another one was created. Concurrency is one of those issues, and it has been getting much more attention lately.

While concurrency is hardly a new problem, the complexity of today’s systems is making it increasingly difficult to properly design, implement and verify the software and hardware that collectively comprise system functionality. The EDA industry attempted to address these issues in the past, but failed. So where are we today? And who could benefit the most from the adoption of the limited tools that are available?

Systems have increasing diversity. “We expect to see a much greater use of heterogeneous structures in systems and processor architectures,” says Michael Thompson, senior manager of product marketing for the DesignWare ARC processors at Synopsys. “We have hit the wall on the maximum speeds that we can clock the processors at around 3 to 4GHz for a laptop or server, but much less for embedded systems because of the power consumption. You can move to dual core or quad core, but eventually you run into problems with that, as well, because the structure gets so large that maintaining coherence and the functional cohesion of the cluster becomes difficult. So people are likely to go to more heterogeneous structures where they will use programmable elements as well as hardware elements to meet the requirements of the task and the power that is available.”

So how do we approach this at the system level? “As humans we have certain ways to approach a problem,” says Anush Mohandass, vice president of marketing and business development at NetSpeed Systems. “One is divide and conquer. We break everything down into small chunks to digest them and then assemble them. That makes sense in homogeneous architectures where you just had the CPU interacting with itself or a few other blocks. But in heterogeneous architectures, the level of interaction is high. If you build subsystems in isolation and then put them together, you will see problems.”

The whole industry has been migrating to heterogeneous architectures over the past couple of years because they are more efficient. They use less power, and there is less of an emphasis on putting blocks to sleep and waking them up.

Fig. 1: Intel’s heterogeneous Kaby Lake-G architecture. Source: Intel

“Homogeneous processing is not the answer,” says Kurt Shuler, vice president of marketing at ArterisIP. “That only does one thing well. You can have a chip with six or more different types of software acceleration and slice up the processing. But the key is what you do in hardware versus software.”

There are several levels to the problem. “We need to separate the true system architect from the SoC architect,” says Drew Wingard, CTO at Sonics. “The system architect is responsible for the whole thing including the software and has a wider palette of choices than the chip architect. We see different choices being made from the chip people in system companies versus the chip people within semiconductor companies.”

This is because the semiconductor company has to build something that will service multiple customers and multiple system architects. “There is extra work to make the device more general purpose,” adds Wingard. “It may require a reduction in abstraction, a reduction in the palette of choices that the chip architect has available to them so that they can look conventional to a larger group of people. The system architect in a systems company can target a very specific product or service. They can make more tradeoffs and they do not need to make it as general purpose.”

Each user is looking for different solutions. “In the beginning people needed simulation,” says Simon Davidmann, CEO of Imperas. “Then they needed support for heterogeneous systems, then they needed debug environments, and then they needed verification tools to help them improve the quality and get more confidence in its correctness. Today we are seeing them want tools that help confirm that a system is sort of secure.”

At the chip level, specific problems can be defined. “At an abstract level most of the problems are scheduling and arbitration problems,” says Ashish Darbari, director of product management for OneSpin Solutions. “I would go a step further and say that most are specific problems related to maintaining ordering in systems designed to work out-of-order. Ensuring that loads and stores remain ordered with respect to each master when accessing a shared memory to keep individual caches coherent is a different verification challenge. If not done correctly, it can cause memory sub-systems to lock up.”
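The ordering requirement Darbari describes can be illustrated with a small sketch. It assumes a memory subsystem that completes transactions out of order but must hand responses back to each master in the order that master issued them. The function name, the master names and the transaction IDs are all hypothetical, and a real interconnect would do this in hardware with bounded buffers.

```python
from collections import defaultdict, deque

def release_in_order(issued, completed):
    """Return the order in which responses can be released.

    issued:    list of (master, txn_id) in issue order
    completed: list of (master, txn_id) in the order the memory finished them
    Each master must see its own transactions in issue order, but
    different masters do not constrain one another.
    """
    pending = defaultdict(deque)   # master -> txn_ids still awaiting release
    for master, txn in issued:
        pending[master].append(txn)

    done = defaultdict(set)        # master -> txn_ids finished but held back
    released = []
    for master, txn in completed:
        done[master].add(txn)
        queue = pending[master]
        # Release the oldest transaction, and any younger ones unblocked by it.
        while queue and queue[0] in done[master]:
            released.append((master, queue.popleft()))
    return released

issued = [("cpu", 0), ("cpu", 1), ("dsp", 0), ("cpu", 2)]
completed = [("cpu", 1), ("dsp", 0), ("cpu", 0), ("cpu", 2)]
order = release_in_order(issued, completed)
```

Here the CPU's transaction 1 finishes first but is held until transaction 0 completes, while the DSP's response is released immediately. A transaction that never completes would block everything behind it for that master, which is one way the lock-ups mentioned above can arise.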

Frank Schirrmeister, senior group director for product management and marketing at Cadence, offers a framework to look at the solution space. “I structure the problem into three areas. First, you need to be able to execute and observe what is going on in the system. Second, you need to be able to effectively create the testing for these issues. And third, you ideally would want some automation to help you make decisions about system design, partitioning, etc.”

Execute and observe
The starting point for everything is an executable model. “One of the tools put in place to help system architects are virtual prototypes,” explains Thompson. “These are being used more and more. This allows them to build the whole SoC in a virtual environment and they can write code to see how it interacts. That is becoming more important. The software component of these designs has grown in complexity, and so has the cost of development. Today most software teams are much larger than the hardware team. I expect that to continue because the software task continues to get more complex.”

The utility of the virtual prototype is to provide a platform for simulation and debug. “Eclipse or GDB work fine for single processor but when there are multiple, they are each in a separate window and they are not well controlled,” says Davidmann. “With symmetric multi-processing, GDB allows you to see threads, but there was no good way to control and debug when heterogeneous processing was added.”

Debug needs to span both the hardware and software. “The user interface of the system debugger needs to be connected into the HW debugger,” says Schirrmeister. “You need to be able to advance from there without having to switch through 15 windows. That is still an issue. We have tools for observing, debugging and understanding which state the system is in and this isn’t trivial. You have to combine things coming from the IP vendors as well. People want to do cycle-by-cycle accurate simulation for a multicore system and want to understand the state of the design, want to be able to stop and see all of the registers, the hardware waveforms, all of the processor software state.”

There are several abstractions at which virtual prototypes can be written and various problems they can solve. “I believe many concurrency problems can be solved with doing hardware/software architectural experiments before actually building devices and writing software,” says Jim Bruister, director of digital systems for Silvaco. “That is what ESL tools and higher abstraction simulation was supposed to provide—the ability to do ‘what if’ analysis ahead of the development.”

System-level verification

Once it’s possible to execute the system model, that needs to be migrated into a verification task. There are several ways to do that.

“You can add assertions into the system so you get a feel for how the operating system is running,” suggests Davidmann. “We have applied the concepts of assertions as used in hardware into the simulator. These are watching the software and making sure it is doing what you expect. Then you let it run lots of jobs. This helps you identify when software goes along a path that you did not expect.”
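As a rough illustration of that idea, the sketch below checks a recorded event trace against a simple rule: a shared resource may be held by only one core at a time, and only its holder may release it. The trace format, core names and resource name are assumptions made for the example. This is not the Imperas API, which the article does not describe.

```python
def check_lock_protocol(trace):
    """Return violation messages for a trace of (core, action, lock) events."""
    holder = {}        # lock name -> core currently holding it
    violations = []
    for step, (core, action, lock) in enumerate(trace):
        if action == "acquire":
            if lock in holder:
                violations.append(
                    f"step {step}: {core} acquired {lock} held by {holder[lock]}")
            else:
                holder[lock] = core
        elif action == "release":
            if holder.get(lock) != core:
                violations.append(
                    f"step {step}: {core} released {lock} it does not hold")
            else:
                del holder[lock]
    return violations

# Hypothetical trace in which two cores race for the same DMA channel.
trace = [
    ("cpu0", "acquire", "dma"),
    ("dsp0", "acquire", "dma"),
    ("cpu0", "release", "dma"),
    ("dsp0", "release", "dma"),
]
violations = check_lock_protocol(trace)
```

Running many jobs through a checker like this is what surfaces the unexpected paths: the software is never modified, only observed.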

But there are other views. “With anything involving performance or these types of analysis you don’t want the actual software to run,” says Schirrmeister. “You need to be a lot more targeted. At some point you may need to run actual software, but not to begin with. It is about effectively creating stimulus to get you into a stress situation for the software partitioning. Portable Stimulus (PS) is advancing scenario-driven verification so that you can create tests that stimulate software tasks on certain processors causing specific transactions, and you will see if the hardware reacts appropriately.”

The Portable Stimulus Working Group has been discussing a range of alternatives between these two extremes. “The Hardware Software Interface (HSI) layer will enable various abstractions of software to be substituted within a test,” explains the CEO of Breker. “That means you may have dedicated software for low-level tests, but substitute the actual driver software at some stage in the verification flow. This allows for a progressive bring up of the software environment and limits the types of hardware problems that are likely to be encountered when attempting to debug higher-level issues.”

Schirrmeister points to some examples of the types of low-level issues that need to be worked out. “You may want to figure out what happens to cache and verify coherency. Plus, you may have three different areas shutting down power at different times. Does everything remain correct when they wake up? That is where the constraint solving aspect of PSS comes in. You can define the scenario where power domains can come up and down in different ways and ask the tool to provide all of the combinations.”
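The power-domain scenario can be sketched as a small enumeration problem. The three domain names and the single constraint below are invented for illustration, and the code is plain Python rather than Portable Stimulus syntax. A real tool would use a constraint solver instead of brute force.

```python
from itertools import product

DOMAINS = ("cpu", "gpu", "modem")   # hypothetical power domains

def legal(state):
    # Hypothetical constraint: the GPU may only be powered while the CPU is on.
    return state["cpu"] or not state["gpu"]

def legal_states():
    """Every on/off combination of the domains that satisfies the constraint."""
    states = []
    for bits in product((False, True), repeat=len(DOMAINS)):
        state = dict(zip(DOMAINS, bits))
        if legal(state):
            states.append(state)
    return states

def transitions(states):
    """Ordered pairs of legal states that differ in exactly one domain,
    i.e. a single power-up or power-down event."""
    pairs = []
    for a in states:
        for b in states:
            if sum(a[d] != b[d] for d in DOMAINS) == 1:
                pairs.append((a, b))
    return pairs

states = legal_states()
events = transitions(states)
```

Each pair in `events` is a candidate test: put the system in the first state, apply the power event, and check that caches and other shared state are still correct in the second.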

But some issues can defy dynamic verification. “The most common problem that people face is deadlock,” points out Mohandass. “Everything may have been silicon-proven and all that was changed was a simple update in the cache hierarchy, and suddenly this happens. Without understanding the complexity of system-level interaction it is almost impossible to identify.”
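Structurally, a deadlock of this kind is a cycle of agents each waiting on the next. The sketch below looks for such a cycle in a wait-for graph using depth-first search. The block names and the added dependency are hypothetical, and this is a textbook illustration rather than the method any of the tools mentioned here use.

```python
def find_cycle(waits_for):
    """Return a list of nodes forming a cycle in the wait-for graph, or None.

    waits_for maps each agent to the agents it is waiting on.
    """
    WHITE, GREY, BLACK = 0, 1, 2      # unvisited, on current path, finished
    color = {node: WHITE for node in waits_for}
    path = []

    def visit(node):
        color[node] = GREY
        path.append(node)
        for nxt in waits_for.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GREY:         # back edge: nxt is already on the path
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in list(waits_for):
        if color[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None

# Hypothetical dependencies before and after a cache-hierarchy change.
before = {"cpu": ["l2"], "l2": ["dram"], "dram": []}
after = {"cpu": ["l2"], "l2": ["dram"], "dram": ["cpu"]}
```

With `before` the search finds nothing. With `after`, one added dependency closes the loop and the function returns the cycle. The difficulty in practice is not the search but building a graph that captures every system-level interaction.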

While formal verification has been extensively used to find deadlocks in RTL, applying the same techniques to the system-level would be impractical. “Architectural formal is a novel approach that leverages the exhaustive analysis capability of formal to explore all corner cases,” explains Roger Sabbagh, vice president of applications engineering for Oski Technology. “By using highly abstract architectural models we can overcome complexity barriers and enable deep analysis of design behavior. This forms a powerful combination that enables effective system-level requirements verification.”

While analysis has not been an easy problem to solve, the industry still looks for automation tools, as well. “The original goal of Imperas was to make it easier to program these concurrent, heterogeneous systems and we failed,” admits Davidmann. “We defined the APIs between tasks and then a synthesis tool would simulate it, measure the results and then map tasks to hardware. System architects were not prepared to pay enough money to make this viable.”

“The application domain and the programming model are intertwined,” says Schirrmeister. “We have not figured out how to do general compute applications and optimize it in a compiler into a hardware architecture.”

But a part of the problem is being successfully automated, and this is changing the dynamics of some markets. “This is part of the reason why it’s no longer just about chip companies,” said ArterisIP’s Shuler. “The semiconductor companies’ customers are doing their own chips. The barrier to entry for making their own chips is decreasing. People who formerly worked at chip companies are finding jobs at Google, Bosch, Amazon and Waymo.”

Network-on-chip technology is playing an increasingly important role here, particularly for heterogeneous architectures, where there are a number of possible topologies, including rings, trees and mesh networks. These are critical for keeping everything concurrent and, where necessary, coherent.

Fig. 2: Simple mesh network. Source: Wikipedia

“You’ve got local memories that are closely coupled, such as SRAMs, mixed in with processing elements,” Shuler said. “If you have an irregular mesh network with different processing elements, you need to know where to hand off from one processing element to the memory in another. And with neural nets, you can overlay that onto the mesh.”

This used to be done manually, but as complexity grows so does the reliance on automation tools.

“NoC generation tools help to plug the gap,” says Mohandass. “They provide architects with a mathematical foundation for understanding the parallelism, concurrency and coherency. This is based on graph-theory and helps them understand the dependencies. Given a specification from the architect, we can tell them if it is correct or not. We also enable what-if analysis – how many different kinds of CPU do I need, how many GPU ports, how many other elements do I need and how does that work at the system level? We provide the tools to make sense of that and tell them what the performance will look like given the set of constraints.”
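A toy version of that kind of what-if question is sketched below: for the same number of nodes, how does the average hop count of a ring compare with that of a mesh? The helper names are made up for the example, and hop count is only a crude proxy. Commercial NoC tools also model traffic patterns, bandwidth and coherency, none of which appear here.

```python
from collections import deque

def ring(n):
    """Adjacency list for a bidirectional ring of n nodes."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

def mesh(rows, cols):
    """Adjacency list for a rows x cols 2D mesh without wraparound."""
    adj = {}
    for r in range(rows):
        for c in range(cols):
            candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            adj[(r, c)] = [(rr, cc) for rr, cc in candidates
                           if 0 <= rr < rows and 0 <= cc < cols]
    return adj

def average_hops(adj):
    """Mean shortest-path length over all ordered pairs of distinct nodes."""
    total = pairs = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:                  # breadth-first search from src
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

ring_hops = average_hops(ring(16))
mesh_hops = average_hops(mesh(4, 4))
```

For 16 nodes the ring averages 64/15 hops (about 4.27), while the 4x4 mesh averages 8/3 (about 2.67), which is the sort of comparison an architect would want before committing to a topology.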

It is a multi-level problem. “The NoC companies create the pipes on which the data is transmitted but where you run into issues comes from usage,” asserts Schirrmeister. “One aspect is functional correctness and the other is performance. The NoC guys are more focused on ascertaining if the performance of the network is optimized and that you can rely on functional correctness in terms of what happens to transactions. To verify usage is where you need scenario tools to stress those combinations and figure out what is the best combination.”

Most aspects of system design and verification remain ad hoc. While there may only be a few people who make the decisions about partitioning of software and hardware, scheduling, synchronization, and architecture, the effects of those decisions are felt by a lot of people with few tools available to help them.

Relief is on the way, but the EDA industry has been burned repeatedly in this area and investment is tentative. The predominant approach is to find ways to address the ‘effect’ that design teams are facing and provide small extensions for architects to better understand the ‘cause’. This means it may take a long time for good tools to become available. Until that happens, simulation of virtual prototypes remains the core of the toolbox.


Karl Stevens says:

The system and all its components need to be described as a network of inputs, outputs, triggering events, and function. Then that structure can be parsed and debugged without having to first define software and hardware functions.

Inputs and outputs are data (values) and signals (events/status). Data manipulation is arithmetic (operators/compares), while signals are Boolean (true/false). The Boolean operators are AND, OR, and exclusive OR, and they are evaluated using Boolean algebra in the form of logic gates or LUTs. If/else sequences are error-prone and confusing, and should not be used.

For some reason the design process has just been to do more of the same and hope for a different outcome.

Starting with these two points:
1) Systems have increasing diversity. “We expect to see a much greater use of heterogeneous structures in systems and processor architectures,” says Michael Thompson, senior manager of product marketing for the DesignWare ARC processors at Synopsys.

2) Network-on-chip technology is playing an increasingly important role here, particularly for heterogeneous architectures, where there are a number of possible topologies, including rings, trees and mesh networks. These are critical for keeping everything concurrent and, where necessary, coherent.

Heterogeneous architectures share blocks of data of known size and content, which is more efficient than the cache coherency required when all memory is shared. In that respect they are similar to OOP structure.

OOP classes encapsulate data and do not require shared memory and cache coherence at every level of access. Classes are polymorphic so there is some level of function change without creating a new class for every variation.

Heterogeneous systems correlate to OOP programs where there are design tools and debuggers.

Then there are block and statement lambdas with function delegates and an Abstract Syntax Tree SyntaxWalker to visualize the control graph and expression evaluation sequence.

Come on folks, the rock is too hard to keep digging this hole deeper. Take a peek around and see that the CSharp/Roslyn Compiler and Visual Studio already do some things that the hardware world only dreams about — yes, it will take a little digging and a change in mind-set.

Another thing is that the RISC CPU is not the ultimate architecture either, because it is possible to design a computer that executes if/else, for, while, do, and expression assignments using the AST Syntax API output. The performance is on par with FPGA accelerators.

Kev says:

C++ with fine-grained parallelism extensions.

I have IP for digging out of the hole without having to rewrite your old code too.

Kev says:

Just stuck in the wrong paradigms: SMP and RTL. One has globally shared data, the other clocks – which need a lot of synchronization and shouldn’t be in a design spec.

A much better paradigm for parallel system design is CSP (Communicating Sequential Processes).

Sonics in particular could have gone the CSP route when doing NoC synthesis, but missed the opportunity (then again nobody else has taken it).
