More than just the processor needs to be defined to run standard operating systems. Profiles help, but they still don't go far enough.
Experts At The Table: What’s needed to be able to trust that a RISC-V implementation will work as expected across multiple designs using standard OSes. Semiconductor Engineering discussed the issue with John Min, vice president of customer service at Arteris; Zdeněk Přikryl, CTO of Codasip; Neil Hand, director of marketing at Siemens EDA (at the time of this discussion); Frank Schirrmeister, executive director for strategic programs and systems solutions at Synopsys; Ashish Darbari, CEO of Axiomise; and Dave Kelf, CEO of Breker Verification. What follows are excerpts of that discussion. To view part one, click here. Part two is here.
L-R: Arteris’ Min; Siemens’ Hand; Codasip’s Přikryl; Synopsys’ Schirrmeister; Axiomise’s Darbari; Breker’s Kelf.
SE: Profiles are an attempt to lock down the capabilities of a RISC-V processor, but do they go far enough to define the entire hardware/software contract? Do profiles guarantee that software is portable?
Kelf: The profiles are an attempt to put that definition in place. But you have a point in that this is very hardware-driven. You do see a lot of activity going on in Europe, and it is the software people presenting and doing things. But there's a long way to go. Profiles are the right thing to do. It's a great effort to put these definitions in place, but it doesn't do enough for software. More work is needed.
Přikryl: It is definitely the starting point, because as a compiler developer in this community, you need to rely on something. You need to know, 'these are the features I can count on.' Then the compiler will work for that profile, and if I give a binary to somebody else, it's going to work. This must be done.
Min: What they want to avoid is compilation optimization for a specific CPU.
Přikryl: Eventually, you will get there.
Min: Profiles came about mostly because of the desire to run Linux and write software to that profile. There are RVA22 and RVA23, which are mostly compatible, a little bit different, but people write to a specific RVA profile for a rich operating system. Once that's available, that's the hardware abstraction layer the software developers can use.
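As a rough sketch of what treating a profile as a hardware abstraction layer could mean in practice, the following checks a hart's ISA string against a profile's mandatory extensions. The extension set shown is abbreviated and illustrative, not the normative RVA23 mandate list, and the function names are invented for this example.

```python
# Sketch: treating a profile as a hardware/software contract.
# The set below is abbreviated and illustrative, NOT the normative
# RVA23 mandatory-extension list.
RVA23U64_MANDATORY = {"m", "a", "f", "d", "c", "v", "zicsr", "zifencei", "zba", "zbb"}

def parse_isa_string(isa: str) -> set[str]:
    """Split a riscv,isa-style string (e.g. 'rv64imafdcv_zicsr_zba') into
    its single-letter and multi-letter extension names."""
    base, _, multi = isa.lower().partition("_")
    exts = set(base.removeprefix("rv64").removeprefix("rv32"))
    exts.update(e for e in multi.split("_") if e)
    return exts

def missing_extensions(isa: str, mandatory: set[str]) -> set[str]:
    """Return the mandatory extensions the hart lacks (empty set = satisfied)."""
    return mandatory - parse_isa_string(isa)

missing = missing_extensions("rv64imafdcv_zicsr_zifencei_zba_zbb", RVA23U64_MANDATORY)
print("profile satisfied" if not missing else f"missing: {sorted(missing)}")
```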
Schirrmeister: What the Certification Committee hopefully agrees on is that you need to put together the list of things that must be checked, so that a vendor can say, 'I am compatible with RVA23.' And then, once those checks pass, the software developer can have confidence to run on it.
Hand: Would it be correct to say that it depends on what level of software you want to be able to reuse? For a compiler developer, you need a certain level of certification so you can build a compiler against it. Then you need the next level of compliance to run an operating system against it, and then another level again for a full product. So as the software ecosystem grows, and interest shifts from a hardware ecosystem to a software ecosystem, compliance becomes more and more important, and the level of specificity needs to go up.
Přikryl: It is going to be a lot of work to get there. RVA23 brings in the hypervisor. How are you going to test that?
SE: The problem is clearly large, but you still have to define what completion means. How do you know how far through this conformance process you are? And if you're going to do certification, you're taking on ownership of that coverage problem.
Kelf: That’s a bit scary for the Certification Committee.
Přikryl: You do have a spec. You can slice it into individual items. Then you do have metrics, because you can say, 'I cover this point, that point, or whatever point.'
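A minimal sketch of that spec-slicing idea is to track which items have verification evidence behind them. The item IDs and descriptions below are invented for illustration, not real clause numbers from the RISC-V specifications.

```python
# Minimal sketch of spec-slicing: track which spec items have evidence.
# Item IDs and descriptions are invented for illustration.
from dataclasses import dataclass, field

@dataclass
class SpecItem:
    item_id: str          # e.g. a clause or rule identifier
    description: str
    evidence: list[str] = field(default_factory=list)  # tests/proofs that hit it

    @property
    def covered(self) -> bool:
        return bool(self.evidence)

items = [
    SpecItem("priv-3.1.7", "mtvec holds the trap-vector base address"),
    SpecItem("unpriv-2.4", "ADDI sign-extends its 12-bit immediate"),
    SpecItem("priv-3.3.1", "MRET restores the previous privilege mode"),
]

items[0].evidence.append("formal: trap_vector_property")
items[1].evidence.append("sim: rv64i_addi_directed")

covered = sum(i.covered for i in items)
print(f"spec coverage: {covered}/{len(items)} items")
for i in items:
    if not i.covered:
        print(f"  uncovered: {i.item_id} ({i.description})")
```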
Kelf: That’s a good start, but you get into these situations, like with an interrupt controller, where you need to verify that you have put the right registers on the right stacks. The spec doesn’t start to touch that. Trying to produce a coverage model to handle that kind of thing is really difficult. There will be work going on with that, but it’s a massive project. It requires a lot of maturity. People will start to build models, especially the formal guys. Can they produce formal models, which by definition will give you a certain level of coverage? There will be holes. Then, some other parts of the process will be maturing, with customers trying to use it, and that will fill in those holes. The idea of producing a coverage model for one of these processors, especially an apps processor, is just not possible.
Hand: It’s also one of the reasons why some people choose not to modify the processor. I’ve spoken with customers who say they have decided to go RISC-V. They are going to build their own custom core. Then you talk to them a few months later and find they went with a standard IP vendor, and are going to do an accelerator, because it was too daunting, too difficult. They’re willing to trust the vendor because they have seen their requirements, they have seen their coverage report. If I change something I have to take ownership of the changes. They decide to do it the old-fashioned way. Let’s put accelerators on the bus and work it that way. It’s always a tradeoff.
Schirrmeister: For the industry, it's a great opportunity. RISC-V has momentum. People will be more courageous in making changes, or in working with vendors to make the changes. For a while, there were EDA companies that did processor verification, focused on verifying instruction sets. But the market at the time wasn't big enough, so it all became internal. Then there's the big elephant that does it all in-house and guarantees it. You need tools that help figure things out: constraint solvers, and tools that allow you to add instructions. You need to be able to stop and compare the instruction in the RTL against the reference model. That's why models are so crucial. Then you need the tools to drive all the tests, because you don't want an army of verification engineers generating them manually. And there are various options to do that. There is PSS. There are other technologies that drive and generate specific tests for processors. Everybody building these systems, starting with the ISA, the extended ISA, and the processor in its environment, needs to make and run all the checks, formal and dynamic.
Hand: The important thing is no one wants to take on a burden for no reason. They’re doing it because they see advantages.
Schirrmeister: They get value out of the modification when they innovate something, but with that freedom to innovate comes great responsibility to make sure it all works.
Min: Most people are using standard configurations to help reduce their verification effort. It's about how you put a fence between the standard IP and whatever extensions they're going to add, because the customers are responsible for any custom extensions.
Hand: If it’s the first time they’re doing RISC-V, they are already taking a risk by changing the ecosystem. What is my acceptable level of risk? It goes back to the benefit question. If they really need something, they’ll spend the time. They will invest in the optimizations. They’ll invest in speed up. They’re willing to take that risk because the benefit is there. But if all they’re looking for is an embedded processor to do a job, then yes, I could make it better, but I’m already taking enough risk on a new ecosystem. Why go further?
Min: From the NoC side, even the bus interfaces aren't the same from every company. CHI is different from one company to the next. The different versions may not be compatible. They may be mostly compatible, but then there are some ambiguities in the spec.
Schirrmeister: The only way to figure this out is to either run formal or simulation and see when you plug it together if it works. That’s why the ecosystem work is so important.
Hand: And that’s the exciting thing. When you see the richness of the ecosystem that is building up around RISC-V, it creates more activity. That increased activity puts more eyes on the problem, solves more of the problems, and it just gets bigger.
Min: It gets bigger, and as that happens we solve more problems, and the problems get smaller. Looking at the richness of the cores available, we have cores that are open source and downloadable, semi-commercial processors that started off as commercial and are now open source, and full RISC-V IP companies providing cores. There are a lot of variations. Conformance is a very important topic to the ultimate consumer of these chips, which is software.
Schirrmeister: Even with compliance to one profile, multiple companies will claim to be compliant. The confidence that you can move around, and that your software stack can be ported between the different architectures with reasonable confidence, will be important. Otherwise, you get huge fragmentation.
Min: Every company is a little bit different. The business views are different. There will be fragmentation. The question is, how do we get the least common denominator optimized enough to be functional? And if customers want to go one step further to the final optimization, that last 10%, the CPU companies hopefully will allow them to do that.
Hand: The ultimate end goal is to be able to defer your IP choice to the very last second. If you can trust that the ecosystem is going to work, you can build it all out and then benchmark the IP vendors and see which one does the best job. We’re not there yet, but that’s the ultimate goal. We’re at the point today where you can choose what ecosystem you want to play in fairly early, but you’ve got to quickly lock down which IP architecture you’re going to use. That’s going to define a lot of your system-level decisions as well.
Schirrmeister: At the end of the day, you need the RISC-V system-readiness program, so that somebody can be certain that if an implementation runs through these tests and passes all of them, it will work for you. That includes things like dealing with interrupts, and whether it can deal with all the system aspects around it. That needs to be there.
Kelf: We’re going to end up with a set of tests that will include formal, and dynamic tests for emulation and simulation — but a really comprehensive one. It’s going to cover the software space and a whole bunch of architectural features that aren’t part of the ISA, like the interrupt handling and memory management. You’re going to end up with this very big, comprehensive set of tests as part of a certification suite. This is far in excess of the compliance suite that exists now. Someone told me it tested about 4% of the actual processor when they ran coverage on it. It should include all the coverage suite to prove that you hit all these coverage metrics, they all pass.
Hand: If you look at the non-RISC-V processor companies, the amount of testing they’ve done over the decades is huge. The way the RISC-V community will get there is by a distributed effort. It’s not going to be possible for one company to do it all. That’s why the ecosystem becomes critical.
Přikryl: The main aim of the ISA compliance tests is different. When we started this group, there was a discussion about validation and verification. We crossed it out. It was not possible to do it at that point in time. The only thing these tests should do is check the bare minimum, test a few instructions. That's pretty much it. At the beginning, some customers asked if we were passing these tests. But that doesn't mean anything. You have to go much further. Certification would go a step further, but it still doesn't mean that your CPU won't break if a sequence of unexpected interrupts happens.
Hand: That’s still a problem with the other processor architectures, whether it be Intel or Arm. There’s always errata of weirdness happening.
SE: We thought there was a formal definition for RISC-V. Why is this not the basis for defining conformance and putting dedicated verification around this? That would seem to me to be the obvious place to start.
Darbari: We do have a formal SAIL model, but it only defines combinational behavior. Real bugs hide, and you only find them when you do temporal verification with formal properties inside a formal tool. We are beginning to see more customers leaning toward using this to sign off their cores. With the way formal works, we find more corner cases, not just in implementations but also in specifications, including software.
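To make the combinational-versus-temporal distinction concrete, here is a hedged sketch; the signal names are illustrative, not taken from any real property suite. A combinational (single-step) check states what an instruction computes, while a temporal property constrains behavior across clock cycles and needs a property language and a formal tool:

```latex
% Signal names are illustrative only.
% Combinational (single-step) check:
\[
  \mathit{valid\_add} \;\rightarrow\; (rd = rs1 + rs2)
\]
% Temporal check (simplified; ignores vectored trap mode):
\[
  \mathbf{G}\,\bigl(\mathit{trap\_taken} \;\rightarrow\; \mathbf{X}\,(pc = \mathit{mtvec\_base})\bigr)
\]
% Globally (G), whenever a trap is taken, on the next cycle (X) the
% program counter equals the trap-vector base address.
```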
Hand: If you speak to the formal vendors, they believe that is a key part of compliance. You will find errors in almost every design, but there's a baseline that you want to run. Ultimately, though, in a software-defined product, you want to run software. You still need these other verification environments, you still need the coverage environments, you still need emulation, you still need simulation. Formal will tell you, 'I meet the spec.' But is there a hole that I didn't know about?
Kelf: There’s just too much stuff to do, and a formal tool cannot take that on. It can give you the baseline, which is necessary but not sufficient. And there’s all this complexity that comes up in these processors that a formal tool will not be able to tackle. It’s going to be real workloads running on big emulators.
Schirrmeister: This is a simple distinction. The formal bit is a necessary requirement, but it’s not sufficient for all the aspects.
Hand: It’s not going to be able to check cache.
Přikryl: It needs a combination of tools. If you zoom in on a single solution, either formal or functional, you will always find holes. If you combine them, which every vendor does, then you will have a more reliable product at the end of the day. But you need to take the best from each of them, because each gives confidence at a different level.
Hand: It’s no different than any verification. The verification of any complex SoC uses a variety of verification techniques. You’ll have some static, some formal, some simulation, some emulation, some prototyping, and it’s the idea of using the best technology for the task at hand. What are you trying to check? Am looking for a fault in the core compliance of the main instructions? Does it do anything incorrect? That’s a good starting point. What happens with my system when I’m doing a load followed by a store, followed by switching to a protected mode? It’s a complex set of interactions that you couldn’t even write a formal set of properties for. But you could easily write a simulation test for it. And now let’s say it’s an AI processor, an AI workload. You need to bring an emulator into play because that’s the only way you’re going to be able to run that scenario. Like any verification challenge, it’s going to be a mix of all of them. If you identify the most immediate need right now, that’s where compliance comes in. Once you have compliance, then you’ve got the architecture validation, and then you go into the micro-architecture, and then you’ve got the system. It’s all about what you are trying to solve at a given time.
SE: What is the process being followed by this committee? What are you hoping to achieve, and what are the priorities?
Kelf: The committee is still trying to figure that out. It hasn't been formed for that long and just got its charter worked out. It is going to be pretty customer-driven, and it is going to involve the software world as well as the hardware folks. It needs to define a set of tests that matches what other certification bodies do for other technologies. The first thing they did was a study looking at other things that are certified, like Wi-Fi. You had this organization where all the Wi-Fi vendors brought their chips and connected them together, and if one didn't work, that's how it was caught. That was how they were tested. But that is not going to work for RISC-V. Maybe there's a method where you have some standard software tests that you run on all of these implementations, and they had better work. We are going to end up with some very large, comprehensive test suite that includes a lot of different facets, and that's where it's going. But it's really early days.
Schirrmeister: You need to figure out, at the interface, whether it works with the right version of CHI, or with the nuances of the parts of CHI that are supported by the NoC.
Min: For now, the efforts are looking mostly inside the core, but the next step is to look at the system context. Maybe a plugfest: I have a RISC-V-based PC or server, and it has slots. Does it work with this graphics card, that graphics card, or that storage controller?
Hand: We are still a fair way from being able to go into a room with a rack that has every different implementation of RISC-V and run the same software on all of them. We're not even close to that yet. Looking back over the past two or three RISC-V Summits, there was talk about one of the compliance committees trying to get people to build development boards so you could actually run software on them. We are still not there.