Impacts Of Reliability On Power And Performance

Determinism and coherency are becoming increasingly important as chips are used across a wide range of applications.


Making sure a complex system performs as planned, and providing proper access to memories, requires a series of delicate tradeoffs that often were ignored in the past. But with performance improvements increasingly tied to architectures and microarchitectures, rather than just scaling to the next node, approaches such as determinism and different kinds of caching are becoming critical elements in a design.

Most computer hardware today is non-deterministic, which means that two executions of a program will not be cycle-for-cycle identical at the microarchitectural level, even if they start from the same microarchitectural state. In fact, due to uninitialized state elements, I/O, and timing variations on high-speed buses, the microarchitectural states of the two executions will evolve differently.

But if the hardware is designed to be deterministic, two executions of a program will be cycle-for-cycle identical. That simplifies system verification, and it makes hardware faults detected during bringup easier to reproduce and analyze.

“Determinism means that a system will not diverge from what you set out to do,” explained Gajinder Panesar, CTO of UltraSoC. “To get determinism, you need to avoid things like dynamic routing, dynamic contention and arbitration of the interconnect.”

Determinism is an important factor in many embedded applications, as well.

Figure 1: Deterministic design in action.

“Determinism can be implemented in a CPU at the hardware level or at the software level,” said Mike Thompson, senior product marketing manager for the Solutions Group at Synopsys. “Hardware (real-time) determinism is important in applications that require a response within a specific time following an input event. This is essential in an automotive airbag system, for example, where the airbag must fire within the specified time following a collision. But as systems become more complex and use more processors, maintaining determinism becomes more difficult.”

This includes software as well as hardware. “Software determinism can be used when it is desirable to have an application execute the same way given a set of inputs or events,” Thompson said. “Software determinism, for example, can make it easier to debug a program. A program that executes hundreds of times and then fails due to a random issue can be very difficult to debug. It may not even be possible to reproduce the event.”
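The reproducibility benefit Thompson describes can be illustrated with a small sketch. Here, a toy workload draws random numbers; seeding the generator is a hypothetical stand-in for pinning down every source of variation, so that each run is identical and a failing case can be replayed:

```python
import random

def simulate_workload(seed=None):
    """Toy workload whose result depends on random event ordering."""
    rng = random.Random(seed)  # seeded -> deterministic; unseeded -> varies per run
    events = [rng.randint(0, 99) for _ in range(10)]
    return sum(events)

# Unseeded runs can differ, so an intermittent failure is hard to reproduce.
# Fixing the seed makes every execution identical, so it can be replayed in a debugger.
run_a = simulate_workload(seed=42)
run_b = simulate_workload(seed=42)
assert run_a == run_b  # deterministic: identical results on every execution
```

In a real system the sources of variation are interrupts, I/O timing, and thread scheduling rather than a single random generator, but the debugging principle is the same.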

Determinism also can be achieved by eliminating caches, said UltraSoC’s Panesar. “As soon as you start having caches with any significant amount of code, you’re going to start missing the cache, and therefore access to data. Instructions will be different, depending on whether the item is in the cache or not.”

This is because of the way a cache works. A cache is a small local memory. If the requested data is not in that local memory, the system has to fetch it from wherever it resides.

“Then, it brings it back and puts it into the local memory,” said Panesar. “That’s okay if you’re just starting, but after a while the cache gets full, and then at some point you miss. It’s not in the cache, so you have to go and get it, but when you bring it back, you have to find space for it. Therefore, you have to take something out to make room, and whether a later access hits or misses depends on what was evicted. Maybe some cycles earlier you brought it in, so it’s still there. Maybe another time, something else came in and replaced the line that you want, so you have to go and get it again.”
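The eviction behavior Panesar describes can be sketched with a toy fully associative cache using LRU replacement and made-up hit/miss latencies. The capacity, latencies, and access trace below are illustrative assumptions, not figures from any real design:

```python
from collections import OrderedDict

# Toy fully associative cache with LRU replacement (illustrative values only).
CACHE_LINES = 4
HIT_CYCLES, MISS_CYCLES = 1, 20  # hypothetical latencies

def access(cache, addr):
    """Return the access latency for addr and update the cache state."""
    if addr in cache:
        cache.move_to_end(addr)       # hit: refresh LRU position
        return HIT_CYCLES
    if len(cache) >= CACHE_LINES:
        cache.popitem(last=False)     # full: evict the least recently used line
    cache[addr] = True                # bring the line into local memory
    return MISS_CYCLES

cache = OrderedDict()
trace = [0, 1, 2, 3, 0, 4, 0, 1]
latencies = [access(cache, a) for a in trace]
print(latencies)                      # [20, 20, 20, 20, 1, 20, 1, 20]
```

The final access to address 1 misses even though it was fetched earlier, because intervening traffic evicted it. That history dependence is exactly why cache timing is hard to predict.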

For a data cache, if the accesses have to go across the interconnect to some external memory, and that interconnect is shared by other things on the chip, they all compete for a single resource.

“It’s like the freeway,” Panesar said. “When you’re trying to get onto it, if you’ve got a wide enough freeway with many lanes, then you’ve got the bandwidth. In rush hour, you don’t have that bandwidth. A similar sort of thing happens in SoCs. When the rush hour happens, you don’t get the bandwidth that you expected, and as soon as you start having that you can’t have determinism.”

Thinking coherently
Connected to this in many systems today is coherency, where one or more consumers require the same data and the data may not be local. For example, a coherent subsystem could be four processors. One of the processors has Data Item A, but the other processors also want to use Data Item A. Coherency provides a single authoritative place to fetch the data, so that when one processor updates it, the local copies held by the others are updated or invalidated.
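The four-processor example can be sketched with a toy write-invalidate scheme, a hypothetical simplification of real protocols such as MESI. A write to a shared item invalidates every other core's cached copy, so the next reader fetches the current value:

```python
# Minimal write-invalidate coherency sketch (illustrative, not a real protocol).
class CoherentSystem:
    def __init__(self, n_cores):
        self.memory = {}                           # shared backing store
        self.caches = [dict() for _ in range(n_cores)]

    def read(self, core, addr):
        cache = self.caches[core]
        if addr not in cache:                      # miss: fetch from memory
            cache[addr] = self.memory.get(addr)
        return cache[addr]

    def write(self, core, addr, value):
        # Invalidate every other core's copy before writing.
        for i, cache in enumerate(self.caches):
            if i != core:
                cache.pop(addr, None)
        self.caches[core][addr] = value
        self.memory[addr] = value                  # write-through for simplicity

sys4 = CoherentSystem(4)
sys4.write(0, "A", 1)        # core 0 produces Data Item A
print(sys4.read(1, "A"))     # core 1 fetches it: 1
sys4.write(2, "A", 2)        # core 2 updates; stale copies are invalidated
print(sys4.read(1, "A"))     # core 1 re-fetches the new value: 2
```

Real hardware does this with snooping or directory logic rather than a shared dictionary, which is precisely the bookkeeping overhead discussed in the power tradeoffs below.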

Coherency typically is used in a system with shared memory resources, where it keeps the contents of those memories consistent.

“In a single CPU system, coherency can be maintained between the CPU and an external memory or peripheral function,” Thompson said. “In systems where multiple CPUs are being used to run an application, coherency is maintained between the data caches to ensure that the current data is used by each processor. In larger systems, coherency also may be maintained between the top level (level 2 or level 3) caches between clusters. The use of coherency has increased in recent years, with the higher usage of multiprocessor systems that are being implemented to address growing performance requirements.”

Power vs. performance
Determinism and coherency are intricately linked with power and performance. Finding the right balance requires more work, because the under-the-hood hardware blocks are keeping tabs on where the data has gone and who has requested what.

“If the data was just local, then the hardware wouldn’t have to do that extra work,” Panesar said. “It’s not compute, but there are hardware blocks keeping track of where things are going. There’s more work. One level [of tradeoffs] is going to be more power, but this depends on the system because if the system is engineered where it has to move data around, that will take more power still. But if the system has a coherent network, then it won’t.”

Data servers these days tend to be coherent, and when running jobs on a server, the data may not be local to the core running the job. It could be on another core, or in the Level 2 cache inside the interconnect. “You get better performance if you have data closer to your cores in the cache, which may be Level 2 cache inside the network,” he said. “That should save power because you’re not moving data around.”

Power is dictated by how much compute is required and how much data is moved around, a problem that has been increasingly apparent as the amount of data that needs to be processed continues to explode. The less data is moved and the closer it stays to the processor and memory, the lower the power.

“If the application that runs on the server is partitioned sensibly, you potentially reduce power,” he noted. “But if you have a system where you’re uploading something that is high performance in the cloud, it really depends on what kind of operating system you have and whether it makes efficient use of where your data is.”

This is a system-wide concern, and it is spilling over into other areas such as software engineering.

“Without coherency you can have a processor working on old or stale data, and this not only produces the wrong result but also wastes energy,” Thompson said. “You can manage this with software, but this increases the complexity of the software programming task and takes additional cycles, burning more power.”

On the other hand, a processor with a coherency unit will use more power than a processor without coherency. As such, in most applications today, hardware coherency is preferred when multiple processors will work simultaneously on a given task. This makes the design much easier to implement and debug. Coherency is available with most high-end CPUs designed for multicore implementation. This means it is much easier to use hardware coherency, and it simplifies the programmers’ task so they can focus on more important things, he said.

One of those tasks involves balancing various compute elements in a design. Using multiple cores can distribute processing as needed using lower clock frequencies, and therefore lower power, but it also reduces determinism.

“As system performance increases, tradeoffs have to be made that make the system less deterministic,” Thompson said. “Ensuring determinism will come at the cost of maximum performance in a system. This can be a big challenge in high-end system design.”

Also, reduced power modes in processors can reduce determinism because of the latency they introduce. “Reduced power modes are common on most embedded processors and are used to idle the processor when it isn’t being used. Idling the processors in many applications is critical to meet the power requirements. Returning the processor from a reduced power mode can be dependent on things that are external to the processor like memory access times or an oscillator startup time,” he added.

Effects on verification
All of this has an impact on verification, too. Ensuring the same view of memory content across a many-core system—with other coherent components accessing three levels of cache—represents one of the toughest verification challenges.

“While it is possible to design an appropriate protocol, ensuring it works in all situations requires many thousands of system infrastructure level tests devious enough to seek out bizarre corner cases,” observed Dave Kelf, chief marketing officer at Breker Verification Systems. “System-level randomization, where constraints set across a system level test and coverage is measured at that same level of abstraction, is the only real answer to find these obscure corner cases.”

But once these tests are set up, it is also possible to profile a system-level design and adjust power/performance tradeoffs. “In this case, the verification testbench can be seen as a tradeoff platform to make protocol adjustments, while ensuring the design cooperates,” Kelf said. “Tradeoff adjustment and verification now fits hand-in-hand.”

Tradeoffs are becoming more complicated as chip and system architects seek to balance power and performance against determinism and coherency. And while these techniques and approaches are not new, setting up a balance and making sure a chip will work as planned exactly when it is supposed to is a challenging job. Throw in things like artificial intelligence, security, and safety-critical requirements and it gets even tougher.
