A hardware-software contract is needed for software portability, but RISC-V is not yet defined well enough to know what that contract is.
Experts At The Table: RISC-V provides a platform for customization, but verifying those changes remains challenging. Semiconductor Engineering discussed the issue with John Min, vice president of customer service at Arteris; Zdeněk Přikryl, CTO of Codasip; Neil Hand, director of marketing at Siemens EDA (at the time of this discussion); Frank Schirrmeister, executive director for strategic programs and systems solutions at Synopsys; Ashish Darbari, CEO of Axiomise; and Dave Kelf, CEO of Breker Verification. What follows are excerpts of that discussion. To view part one, click here.
L-R: Arteris’ Min; Siemens’ Hand; Codasip’s Přikryl; Synopsys’ Schirrmeister; Axiomise’s Darbari; Breker’s Kelf.
SE: Verifying a processor is tough enough when you have a defined ISA, a defined architecture, and a complete specification. With RISC-V none of these exist. How much more difficult does this make defining conformance?
Hand: The advantage of RISC-V is that you can adapt it. The disadvantage of RISC-V is that you can adapt it. Most people don’t realize that when you make a change, you’ve pretty much thrown all the IP vendor’s verification out the window. You have to assess where it stands. You can pick a particular architecture without modifying the processor. You could take a processor from an IP vendor, verify it, treat it as golden, and put your differentiation in an accelerator, because that’s the safer thing to do. This is the risk/reward tradeoff. Where do I want to take the risk, and what is the reward I get for that? A lot of people are initially attracted to this great flexibility without realizing that this flexibility is also a huge responsibility to your customers, to make sure you don’t screw it up when you try to change something.
Min: Well, maybe go back to definitions. Maybe different definitions fit different stages. Conformance might be at the architectural level, while validation or verification is at the micro-architectural level. It depends on which specification you’re validating, verifying, or conforming to.
Kelf: We’re ignoring the elephant in the room. If you look at Arm, they have a processor architecture and an instruction set built on it, and that’s what they are verifying. They have put a huge amount of effort into making sure it works really well. With RISC-V, you have an ISA, and then you’ve got a very large number of potential architectures. Everyone has come up with all kinds of clever, different things, and that’s not even considering the fact that you can add instructions. That’s why, first, the profiles idea is so powerful. Now you’re trying to set some boundaries and organize the verification problem into something that is at least attainable. Otherwise, it’s so variable that it’s almost impossible. On top of that, you’re trying to handle this variable instruction situation. Arm did not allow it until RISC-V made such a big deal of it. So Arm now does allow it, but they have a huge firewall between the main Arm architecture, which is completely solid and not changing, and this little area where you can add instructions. Even so, they’ve got problems verifying that. With variable architectures from different companies, and then the additional instructions, profiles go some of the way toward helping with that. At least we can make some definitions that we can certify against. You need that golden model. But even then, it’s difficult. So how can you create in-depth verification tests that are independent of the architecture, independent of the additional instructions that are added, and that can ensure whoever gets one of these processors can absolutely rely on it to run their software stack without it going wrong? Most of them break.
Hand: Do you need certain things? If you look at the initial use cases for RISC-V, it was indeed for embedded. They were writing all the software. They were using a custom compiler. Strict compliance really didn’t matter. If you start to look at the RISC-V systems that are coming out now, whether it’s single-board computers or laptops, you have to worry about the ecosystem. Those processors had better be compliant with the instruction set. It really depends on what your use case is.
Přikryl: It’s not only the instruction set. That’s one basic part of the story. You have to comply with the whole profile, with the spec, and that is what’s missing at the moment. We do have the architecture tests. They live on GitHub. You can run them, but they give you almost nothing. They just say, ‘assembly good, binary good.’ That’s pretty much it. They don’t tell you that if I switch from user mode to a different mode, it is done correctly. This is where I believe certification might help, if it’s done correctly.
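To make that gap concrete, the check the architecture tests leave out is the kind that co-simulation against a golden reference model provides: comparing the device under test against a reference simulator retirement by retirement, including privilege-mode changes. Below is a minimal, hedged sketch in C of that comparison idea; the trace struct, field names, and sample data are invented for illustration and are not part of any RISC-V standard or tool.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdbool.h>

/* One retired instruction, as reported by the design under test (for example
 * over an RVFI-style trace port) and by a golden reference simulator. The
 * struct layout and field names are invented for illustration. */
typedef struct {
    uint64_t pc;        /* program counter of the retired instruction */
    uint8_t  rd;        /* destination register index (0 if none) */
    uint64_t rd_value;  /* value written to rd */
    uint8_t  priv;      /* privilege mode after retirement: 0=U, 1=S, 3=M */
} retire_event_t;

/* Compare two retirement traces and report the first divergence. Checking
 * 'priv' is what catches the "did the switch out of user mode actually happen
 * correctly" class of bug that a pass/fail signature test never sees. */
static bool compare_traces(const retire_event_t *dut, const retire_event_t *ref,
                           size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (dut[i].pc != ref[i].pc || dut[i].rd != ref[i].rd ||
            dut[i].rd_value != ref[i].rd_value || dut[i].priv != ref[i].priv) {
            fprintf(stderr,
                    "divergence at retirement %zu: pc dut=0x%llx ref=0x%llx, "
                    "priv dut=%u ref=%u\n",
                    i,
                    (unsigned long long)dut[i].pc, (unsigned long long)ref[i].pc,
                    (unsigned)dut[i].priv, (unsigned)ref[i].priv);
            return false;
        }
    }
    return true;
}

int main(void)
{
    /* Tiny hand-made example: the reference model drops to user mode after
     * the second retirement, but the DUT stays in machine mode. */
    retire_event_t ref[] = { {0x80000000, 10, 0x1, 3}, {0x80000004, 0, 0x0, 0} };
    retire_event_t dut[] = { {0x80000000, 10, 0x1, 3}, {0x80000004, 0, 0x0, 3} };

    return compare_traces(dut, ref, 2) ? 0 : 1;
}
```

A production flow would stream these events from RTL simulation and a reference ISS rather than use fixed arrays, but the comparison itself shows what a signature-based architecture test does not check: the privilege transitions and the values actually written back.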
Kelf: And that is what the community is trying to figure out right now. How do we do it correctly?
Hand: Can you do it correctly, given what you were saying? You’ve got multiple architectures, depending on the profile and the extensions, and you’ve got tens of micro-architectures implementing any one of those. There’s no shortage of different interpretations of that spec. Then you’ve got the implementation, which may introduce different things again within the micro-architecture, and you’re going into physical implementation, which adds new twists to the whole thing.
Min: That is another angle when talking about the inside of the core. But do we also need to look at the outside of the core? Which buses? Even standard buses have little ports that may not be documented. When Arm designs a processor and a NoC, they could work around that, because they treat it as an embedded system. But when we get involved, it is different.
Hand: Is that why you start to see some of these hybrid systems? When they really need to leverage the software ecosystem, they’ll go with one processor, but where they really need to optimize and squeeze the last little bit of performance out of the processor, they’ll go with a RISC-V architecture. You do start to see more of these hybrid systems that have multiple processor architectures. Perhaps they are saying, for this particular scenario, this is the safest bet, and for that scenario, that is the safest bet. People like to frame it as if it is a battle to the death. It really isn’t.
Min: It is an optimization choice.
Kelf: But you bring up a good point. Is it just the processor? Now we’re looking at the whole SoC — maybe even the interrupt controller, the memory management unit, all the basic stuff around the processor. And you can see it slowly expanding out, so it covers all these SoC items.
SE: It is one thing to verify everything that is defined to be RISC-V, but what about the things it doesn’t define? When we start talking about a hardware/software contract, this means everything that is shared between them. To have a notion of conformance, you have to know that the contract is being fulfilled. But the contract isn’t defined. How do you deal with those?
Kelf: We have a bunch of customers who are working with SoCs, and they’re bringing in RISC-V cores from other places. Many of these processors have bugs, which come from either a misunderstanding of the ISA or a complete hole in it. People think that when they get a processor as a piece of IP, it is like getting an Arm processor, and it’s going to work and be great. In many cases the processor IP was bad because the ISA, although a ton of work has gone into it, is still not defined well enough. It’s a real problem.
Hand: That is one of the advantages when you use formal methods, because in order to get an answer you’ve got to have the constraints. The constraints then become a reference for what is not defined. To get it to pass, you have to have constraints, so now at least you have documentation of the holes. Our team takes working cores, and suddenly all these errors pop up. Usually it is a gray area of the spec, or it’s an addressing mode that wasn’t defined, but now you’ve documented it. Using different techniques, you start to fill in the blanks in the processor spec. If you guys at Codasip are saying you can now have this sandboxing guaranteed, because it’s a generated design and you’re not messing with those other pieces, that becomes a very powerful thing. Now you’re saying that you can change something without breaking something else. That’s the usual problem. You go in, you tweak an addressing mode, and all of a sudden you’ve opened up a whole hornet’s nest of problems. Formal can help you identify that hornet’s nest of problems, but then you need to decide whether you meant to do that. Was it intentional, or was it not intentional?
Přikryl: We spend time identifying the right boundaries to cover the common changes, and then we create those boundaries. If you stay within them, you are fine. If you cross them, we don’t support it, and all the verification burden is on you.
Hand: There are benefits, but within the benefits you’ve got to ask, ‘What are the boundaries that I’m willing to adopt? What is the risk I’m willing to accept? Am I going to take it as a known-good IP and trust that my vendor’s IP is good? Do I trust but verify?’ In that case I get the vendor’s IP, run compliance test suites on it, and see if it works. Or do I treat it as the Wild West, because I know what I’m doing and I trust myself? With great power comes great responsibility.
Přikryl: Speaking of the holes in the specs, that was true especially at the beginning. It is much better than it was, but in the beginning we hit quite a few of them. There were times when we thought we had nailed it because we interpreted the spec in the right way. Then we talked to different vendors and asked them, ‘How did you interpret that?’ In some cases it matched, but in others it didn’t. Then we had to come to an agreement.
Darbari: When I was working on a power controller for an SoC, I found loads of design implementation issues arising from architectural issues in a processor implementation. These were picked up when verifying the power controller design with a pre-loaded firmware image inside a formal tool. This was done in the early stages of verification, as we were told the design had already been verified by simulation and that formal would just cover the last leg. However, the issues exposed meant that the entire processor design had to be re-architected. I might sound like a broken record, but formal can verify not just hardware. It also can validate the boundary of hardware and software, i.e., firmware. Cache subsystems are another example, where formal exposes subtle weak memory model bugs.
Schirrmeister: This reminds me of the LRM issues you have in simulation, where people today still debate race conditions and how to interpret the language reference manual. In other processor architectures, you know what happens when an interrupt occurs. Things get stored away, and you have a way to tell the processor to put things on the stack and save all the registers. From what I have learned, in RISC-V that’s not defined. You need to figure out the hardware/software contract and what to do when the interrupt comes. What does my processor do? The software developer doesn’t care. They don’t want to think about that. They want that to be sorted out for them by the hardware. The committee that runs the certification compliance will have to define, for a specific profile, how these things should operate.
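To make that point concrete: on a trap, the base RISC-V privileged architecture only banks a few CSRs such as mepc and mcause; saving the general-purpose registers and returning with mret is left to software, which is exactly the kind of hardware/software contract detail a profile or platform definition has to pin down. Below is a minimal, hedged sketch of a bare-metal machine-mode handler in C, assuming RV64 and the RISC-V GCC "interrupt" function attribute; the handler body and dispatch comments are illustrative only, not part of the discussion.

```c
#include <stdint.h>

/* Read a CSR by name. The CSR names (mcause, mepc) come from the RISC-V
 * privileged spec; everything else in this sketch is illustrative. Assumes
 * RV64 and a RISC-V GCC toolchain. */
#define CSR_READ(csr)                                   \
    ({ uint64_t v_;                                     \
       __asm__ volatile ("csrr %0, " #csr : "=r"(v_));  \
       v_; })

/* With GCC's RISC-V "interrupt" attribute, the compiler emits the save and
 * restore of every register the handler touches and returns with mret --
 * housekeeping the base ISA leaves entirely to software. */
void __attribute__((interrupt("machine"))) machine_trap_handler(void)
{
    uint64_t cause = CSR_READ(mcause);
    uint64_t epc   = CSR_READ(mepc);

    if ((int64_t)cause < 0) {
        /* Top bit set: asynchronous interrupt. A real handler would dispatch
         * on the cause code (timer, external, software interrupt, ...). */
    } else {
        /* Synchronous exception: emulate or skip the faulting instruction.
         * The +4 assumes a non-compressed instruction -- one more detail the
         * hardware/software contract has to spell out. */
        __asm__ volatile ("csrw mepc, %0" : : "r"(epc + 4));
    }
}
```

Without the attribute, or an equivalent hand-written assembly stub, the handler would clobber whatever the interrupted code had in its registers, which is precisely the sort of behavior software developers expect to be sorted out for them.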
Přikryl: That’s right. We do have profiles that basically dictate the ISA. And then we have platforms, where one of the tasks is to define things like which interrupt controller I should have in the system. Perhaps you need a particular OS. There are activities happening to nail down these details. This can be part of the stamp, or the certification process. You are not only complying with RVA23, or whatever profile, but you also follow that platform definition.
Hand: If you take a step back, RISC-V is already well ahead of what it has replaced, which has largely been custom processors that people built themselves. It was one vendor, one software infrastructure. What is their interpretation? Now you’ve got multiple groups, multiple interpretations, cross-checking each other. The standard is getting more robust because there are more people entering the ecosystem and interpreting it. The more people involved, the more robust the standard gets, and the more solid it becomes for the end user, the end consumer. As the ecosystem grows, it gets more robust, which is a lot better than the alternatives. When you had 100 different processor vendors, all with their own software ecosystems, they all had errata that would probably fill a textbook. You don’t find out about it until the corner case happens and your self-driving car drives off the bridge.
Min: RISC-V has come a long way over the last six or seven years. We’re getting to the point where we can address the second part of the problem — software engineers, not just CPU engineers. So far, it’s been mostly driven by hardware engineers, and it has been hardware-focused. As the focus turns more toward conformance verification, software people will drive it, because they only want to write software once and have it run on multiple instances of hardware. That conformance will be driven by software, not really by hardware or testing. Does this web browser from one company, running on a Linux computer from another company, run successfully? That itself will become a validation or verification test. Similarly, there will be benchmarks that are software-driven.