Partitioning of blocks into manageable pieces is becoming harder, but new tools and approaches are showing promise.
Multi-FPGA prototyping of ASIC and SoC designs allows verification teams to achieve the highest clock rates among emulation techniques, but setting up the design for prototyping is complicated and challenging. This is where machine learning and other new approaches are beginning to help.
The underlying problem is that designs are becoming so large and complex that they have to be partitioned into more manageable pieces. But the number of pieces is multiplying, and the difficulty of partitioning is rising proportionately.
“You take a big blob of logic and you try to partition it in a way that most of all preserves the functionality, meaning you cannot break anything in the design,” said Juergen Jaeger, product management group director at Cadence. “You want to efficiently use the resources you partition it into, whether it is an emulator or an FPGA prototype, and you want to achieve the best possible performance. It’s like juggling multiple balls in the air.”
The industry is at an interesting juncture as a new generation of FPGAs is just emerging. “For users that do relatively small designs, it’s really a great opportunity. For users that had to partition their designs into two FPGAs before, it’s a fantastic opportunity because they eliminate partitioning and they get a much better result,” observed Johannes Stahl, senior director of product marketing at Synopsys. “For instance, instead of a design running at 20MHz before, now it can run at 50MHz, so it’s a dramatic impact. This will continue to happen in many situations where these new FPGAs come online.”
In other words, a significant increase in logic density makes it possible to implement smaller SoC designs in a single FPGA chip and run the prototype at a clock frequency close to real operating conditions, said Krzysztof Szczur, hardware verification products manager at Aldec.
At the same time, the bleeding-edge devices add another level of complexity because the most advanced FPGAs today are really not one chip.
“The high-end FPGAs from Xilinx and from Intel/Altera now contain multiple dies, multiple chips in them that are connected with wires, so you also have a partitioning inside the FPGA,” said Jaeger. “As an additional complexity, if you’re now looking at functionality, which has to be there, as well as performance and efficient use of resources, the partitioning algorithms are trying to do the ‘min-cut algorithm.’ They are trying to minimize the number of signals that have to go between multiple partitions. In addition to that, the engineers performing the partitioning are also trying to balance the utilization per FPGA so that you don’t end up with one FPGA that is 90% full and another one that’s only 10% full, because that would have a negative impact on performance.”
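As a rough illustration of the two metrics Jaeger describes, the short Python sketch below scores one candidate assignment of design blocks to two FPGAs by counting the signals that cross the FPGA boundary and summing the logic placed in each device. The block names, sizes, and netlist are invented for the example and are not drawn from any real tool or design.

```python
# Toy sketch of the two metrics described above: the number of signals that
# cross FPGA boundaries (what a min-cut algorithm minimizes) and how evenly
# the logic is spread across FPGAs. Block names, sizes, and nets are invented.

blocks = {"cpu": 40, "gpu": 55, "ddr_ctrl": 20, "noc": 15, "periph": 10}   # kLUTs
nets = [("cpu", "noc"), ("gpu", "noc"), ("ddr_ctrl", "noc"),
        ("periph", "noc"), ("cpu", "ddr_ctrl")]

# One candidate partition onto two FPGAs.
assignment = {"cpu": 0, "ddr_ctrl": 0, "noc": 0, "gpu": 1, "periph": 1}

cut = sum(1 for a, b in nets if assignment[a] != assignment[b])
util = [sum(size for blk, size in blocks.items() if assignment[blk] == f)
        for f in (0, 1)]

print(f"signals crossing FPGAs: {cut}")          # the min-cut objective
print(f"utilization per FPGA (kLUTs): {util}")   # 75 vs 65 -> reasonably balanced
```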
There are, of course, cases where the same dramatic results that Stahl cited cannot be achieved. For instance, some companies that prototype on their own boards find the challenge is not the multi-FPGA partitioning itself, but having to repeat it over and over because debug signals need to be added.
“If you have to bring out debug signals in a flow that’s mostly manual, you have to bring them out to the pins, and that impacts the partitioning,” said Stahl. “So you have to run partitioning all over again, and that’s fairly painful. One user hated that because he could never predict when the next partitioning run would close timing, and he could get the prototype out to the users again. This shows it’s really connected to debug. Debug and automatic partitioning are one single topic for the users with prototyping. It all has to work together.”
Another challenge in multi-FPGA prototyping is the connectivity of the partitions. “With increasing logic density, the FPGAs are being enclosed in larger packages, providing more I/Os,” said Aldec’s Szczur. “However, the increase in I/Os is not as spectacular as in logic resources. For instance, the largest Virtex UltraScale provides 1,456 regular I/Os, which is roughly 21% more than in the Virtex-7 family. To reduce this gap, FPGA vendors equip modern FPGAs with high-speed serial I/Os (e.g., 48 GTH lines in the XCVU440), which increase connectivity bandwidth. Such I/Os are often coupled with PHYs for standard interfaces like PCI Express, USB 3.0 or QSFP. In some cases they also can be used for inter-chip connections in multi-FPGA prototypes, but then their use is limited to transactional interfaces that implement a dedicated protocol handshake — and ideally use burst transfers to minimize the impact of the increased latency of such links. Additionally, in high-end FPGAs the standard I/Os can transfer at higher data rates thanks to support for low-voltage differential signaling (LVDS), coupled with dedicated serializer/deserializer (SerDes) logic that facilitates implementation of such links. Nevertheless, setting up either GTH/GTX or standard I/Os in LVDS mode can be tricky, so having it automated by the partitioning software saves a lot of time and headache.”
If a design is larger than a single FPGA, it must be partitioned into pieces that each fit within the capacity of a single FPGA. There are two objectives. The first is to keep FPGA resource utilization under a threshold that ensures smooth place-and-route. The second is to minimize the interconnections between partitions, which are the most significant factor in prototype speed.
“The process is relatively simple in the case of two FPGAs, but the difficulty grows fast when adding the next partitions, especially if design structure (hierarchy) does not correspond well with the prototyping board layout,” Szczur said. “One method would be manual partitioning, which requires making changes in design sources. Partition blocks are created in HDL to match prototyping board resources and connectivity. Due to FPGA I/O limitations, each partition has to be manually wrapped with the interconnect physical layer and custom implementation of multiplexers or serializers. Not only is this method error-prone and very hard to scale or encompass design changes, but it requires modification of the design that can negatively influence ASIC back-end synthesis optimizations. To mitigate this risk one would keep separate HDL source sets and configurations for FPGA prototyping and ASIC design flows. But then it is questionable what is really being verified during FPGA prototyping. Much better than manual partitioning is to use software that compiles the original design HDL source files, which facilitates grouping module instances across the design hierarchy to split it into partitions.”
Fig. 1: An automated partitioning tool. Source: Aldec
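The grouping of module instances across the design hierarchy that Szczur describes, and that tools such as the one in Fig. 1 automate, can be pictured as a mapping from hierarchical instance paths to partitions that never touches the RTL sources. The instance paths and partition names in this minimal Python sketch are hypothetical.

```python
# Hedged sketch of hierarchy-aware grouping: module instances, identified by
# their hierarchical paths, are assigned to partitions without editing the RTL.
# The paths and partition names below are invented for illustration.

partition_map = {
    "fpga_a": ["soc_top/cpu_cluster", "soc_top/l2_cache"],
    "fpga_b": ["soc_top/gpu", "soc_top/display"],
    "fpga_c": ["soc_top/ddr_ctrl", "soc_top/noc", "soc_top/periph"],
}

def partition_of(instance_path):
    """Return the partition whose root instance contains the given path."""
    for fpga, roots in partition_map.items():
        if any(instance_path == r or instance_path.startswith(r + "/") for r in roots):
            return fpga
    return None  # unassigned instances would fall back to a default partition

print(partition_of("soc_top/cpu_cluster/core0"))  # -> fpga_a
print(partition_of("soc_top/noc/router3"))        # -> fpga_c
```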
Performing partitioning
Partitioning is done by software architects, who devise the overall structure and determine how such a partitioning algorithm should work. These architects are supported by teams of engineers who do the individual implementations. “There are people who are responsible for just the pin multiplexing, i.e., the connections between the partitions,” said Cadence’s Jaeger. “Others who are doing nothing but placement are focused on how to break it up. That is then broken down into specialist teams.”
Much depends on the size of the design/verification team. “The architecture is key in coming up with a good partition,” he explained. “The second most important is probably the global placement. How do I now place the various design blocks into the various FPGAs? The connectivity and pin multiplexing is mechanical work. There’s not much creativity there. It’s just implementing it. As you can imagine, when you have these multiple things that all have to work together, you always have to start somewhere. You have to put a stake in the ground: ‘I put this here, and I put that there, and I put that there. And let’s see how that fits together now.’ All of the algorithms are also working with a seed, or random starting point. Then, those algorithms run through multiple iterations where they make variations to that starting point.”
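A toy version of that seed-and-iterate flow might look like the following Python sketch: start from a random placement (the stake in the ground), then keep any random move that does not worsen the number of inter-FPGA signals while respecting capacity. The design data is invented, and real partitioners use far more sophisticated cost functions and search strategies.

```python
import random

# Toy sketch of seed-and-iterate placement: a random seed assignment of blocks
# to FPGAs, refined by random single-block moves. All numbers are invented.

random.seed(42)                    # the "stake in the ground" / random seed
blocks = {"cpu": 40, "gpu": 55, "ddr": 20, "noc": 15, "io": 10}   # kLUTs
nets = [("cpu", "noc"), ("gpu", "noc"), ("ddr", "noc"), ("io", "noc"), ("cpu", "ddr")]
n_fpgas, capacity = 2, 90

def cut(assign):
    return sum(1 for a, b in nets if assign[a] != assign[b])

def fits(assign):
    return all(sum(s for b, s in blocks.items() if assign[b] == f) <= capacity
               for f in range(n_fpgas))

# Random seed placement, retried until it respects capacity.
assign = {b: random.randrange(n_fpgas) for b in blocks}
while not fits(assign):
    assign = {b: random.randrange(n_fpgas) for b in blocks}

# Iterative refinement: try moving one block at a time, keep non-worsening moves.
for _ in range(1000):
    blk, target = random.choice(list(blocks)), random.randrange(n_fpgas)
    trial = dict(assign, **{blk: target})
    if fits(trial) and cut(trial) <= cut(assign):
        assign = trial

print("placement:", assign, "cut:", cut(assign))
```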
This is an area where efforts are underway to bring in more predictability, including machine learning. “Predictive algorithms play a role in that and help make it more predictable, because when you make it more predictable you don’t need as many iterations and as many repetitions to get to a good result. That’s why we want to bring more machine learning into this,” Jaeger said.
It’s an issue of compute power, and it’s an issue of time. “An important aspect to this is that not every problem is solvable. Let’s say you have a certain design and you say, ‘I want to fit that into four FPGAs.’ But it may not fit for various reasons. There may be too many gates. There may be too many connections in between, and maybe something else. When you are familiar with FPGAs, it often takes days until you find out it cannot do place-and-route and that there is no solution. You want to know that as soon as possible. An ML algorithm helps you after a few minutes or half an hour, so you don’t wait two or three days until your software says, ‘Oops, I cannot do it.’ You know it up front, and you can make changes to it up front,” he added.
It’s not clear, however, whether the industry is ready to apply machine learning in this area today. “While machine learning could be implemented, the challenge is that you need to have a lot of runs to optimize this machine learning,” said Stahl. “If you spend enough compute power you can optimize, run many runs, and find a more optimal version. So like any optimization problem that is slightly heuristic in nature — and putting cables down is slightly heuristic by nature — it can benefit from artificial intelligence algorithms. We’re not yet there today in the market, but it has to go there in the future.”
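To make the idea concrete, the sketch below shows one way results from past partitioning runs could feed a simple predictive model that flags, within minutes, whether a candidate partition is likely to place-and-route. The features, training data, and use of scikit-learn are purely assumptions for illustration; no vendor flow is implied.

```python
# Hypothetical sketch of early feasibility prediction: a classifier trained on
# features of past partitioning runs estimates whether a candidate partition is
# likely to place-and-route, long before a multi-day compile confirms it.
# Features, data, and the choice of scikit-learn are assumptions for illustration.
from sklearn.linear_model import LogisticRegression

# Features per past run: [utilization fraction, cut signals / available pins, clock MHz]
X = [
    [0.55, 0.4, 20], [0.62, 0.7, 30], [0.88, 1.6, 40],
    [0.71, 0.9, 25], [0.93, 2.1, 50], [0.60, 0.5, 35],
]
y = [1, 1, 0, 1, 0, 1]   # 1 = place-and-route succeeded, 0 = failed after days

model = LogisticRegression().fit(X, y)

candidate = [[0.90, 1.8, 45]]     # a new partition proposal
prob_ok = model.predict_proba(candidate)[0][1]
print(f"estimated probability of routing successfully: {prob_ok:.0%}")
```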
Partitioning best practices
When constructing an FPGA prototype, there are some key considerations to keep in mind.
“The first approach that would be the most useful is to design with an FPGA in mind,” said Daniel Aldridge, principal systems architect in the development platforms team at Imagination Technologies. “Our preference as FPGA engineers would be that the IP core that you plan to put into an FPGA was ‘FPGA-friendly,’ meaning it has no components that will be massively consuming of FPGA resources or hard to split across FPGAs. If the design can take those into consideration up front, if the end product is going to end up in an FPGA or at least a variant of it is, that’s preferable. If the IP core has to be designed for silicon, then silicon area and speed are the primary considerations. Then we have to make do with what we have. Sometimes that means replacing components, like-for-like, with formally tested equivalents that can go into an FPGA. It comes down to, ‘Can you make what’s going into the FPGA as FPGA-friendly as possible?’”
With a multi-FPGA platform, where the design must be split, it’s preferable if it can be split on a registered boundary that has a protocol.
“That gives you register-to-register to be pulled between your FPGAs,” Aldridge said. “It gives you a relatively low pin count bus that you’ll end up having to split. If it’s something like an AMBA bus, you don’t end up with too many pins. It’s still in the hundreds, but not in the tens of thousands. But again, that isn’t always possible. So then we are looking to split a design that was always intended to sit right next to each other in silicon across FPGAs, across the PCB. And that’s where we start looking at tools to help automate that process. If you’re talking about splitting high-level blocks that are connected by valid/enable protocols, hand-editing instantiations of those sub-components is a relatively easy thing to do. If you’re going to start splitting sub-hierarchical blocks, or take many large blocks apart that have thousands or tens of thousands of signals between them, then we start looking at having tools to help automate the process. That helps you estimate the area that’s going to be consumed by the block so that you can work out which ones you should put where, and it helps you with which signals are going to go between which FPGAs and what kind of multiplexing you’re going to have to do. That’s where we start talking about the fact that the largest FPGAs now have a couple of thousand I/O pins. But if you’re going to put 10,000 signals down, you’re still going to have to do some sort of time division multiplexing to get the signals out from one IP block or sub-block of the IP to the other block.”
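The arithmetic behind that time-division multiplexing is straightforward, as the short sketch below shows. The signal counts, pin counts, and I/O clock are illustrative assumptions rather than figures from any particular board.

```python
# Back-of-the-envelope sketch of time-division multiplexing between two FPGAs:
# if the cut carries far more signals than there are physical pins, each pin
# must carry several signals per design-clock cycle. All numbers are illustrative.
import math

signals_between_fpgas = 10_000     # logical signals in the cut
pins_available = 1_600             # board traces usable between the two FPGAs

tdm_ratio = math.ceil(signals_between_fpgas / pins_available)
print(f"required TDM ratio: {tdm_ratio}x")   # each pin time-multiplexes ~7 signals

# The higher the ratio, the faster the I/O must toggle (or the slower the
# prototype clock must run) to move all signals within one design clock cycle.
io_clock_mhz = 400                 # assumed pin-mux I/O clock
max_design_clock_mhz = io_clock_mhz / tdm_ratio
print(f"upper bound on design clock: {max_design_clock_mhz:.0f} MHz")
```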
Block size has a big impact on partitioning, as well.
“If you can make a component no larger than an FPGA, it makes life easier,” Aldridge said. “In the ASIC world, that will be the square millimeter area that your sub-block might take up. Then, start thinking about what the interconnect between those blocks is, from an ASIC layout of these much larger designs. And from an FPGA point of view, when you’re prototyping, we want that to happen at the design phase. From the very beginning, when the architecture of the IP core is done, you should be thinking through what level should be breaking this down into, what hierarchical level should there be. Should I agree on a maximum I/O or a common I/O interface between them? Make sure there are pipelining stages, as that will help with FPGA timing, and also with ASIC timing. You hopefully can convince the designers up front that’s good for everybody.”
Where is multi-FPGA prototyping headed?
With so much dependent on the prototyping of designs today, there is continual evolution of what’s next.
One area largely tackled in the EDA realm is the boundary between the FPGAs. When multiple partitions are connected, there are different ways the wires can be run, including how the pin multiplexing is done and how quickly the data transitions once it is transmitted.
“More and more, you go to a hybrid and mixture of different pin multiplexing schemes,” said Jaeger. “You can have a very traditional asynchronous pin multiplexing. You run the wire at a relatively low speed and you overclock it so you can recover the data on the other side. That works very well if you have a small pin multiplexing ratio. It gives you very low latency, so the delays are pretty short in the wire. If you have hundreds of thousands of signals that you have to send over one wire, that would be a limiting factor. And so you go to a SerDes-based high-speed connectivity, where you’re running basically a transmission line, and there you run at gigahertz speed over the wires. The disadvantage is that setting up these endpoints through the serialization/deserialization adds delays, so that only makes sense if you have a very high pin multiplexing ratio. That gives you benefits.”
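The crossover Jaeger describes between simple over-clocked pin multiplexing and SerDes links can be sketched with a couple of assumed numbers: a low fixed cost that scales with the multiplexing ratio for the asynchronous scheme, versus a high fixed serialize/deserialize latency but a very fast per-bit transfer for SerDes. All figures below are invented for illustration.

```python
# Rough sketch of the trade-off between over-clocked (asynchronous) pin
# multiplexing and SerDes-based links. All latencies and rates are assumptions.

def async_mux_cycle_ns(ratio, io_clock_mhz=400):
    """Low fixed overhead; transfer time grows directly with the mux ratio."""
    return ratio * (1_000 / io_clock_mhz)

def serdes_cycle_ns(ratio, lane_gbps=16, fixed_latency_ns=100):
    """High fixed serialize/deserialize latency, but very fast per-bit transfer."""
    return fixed_latency_ns + ratio * (1 / lane_gbps)

for ratio in (4, 32, 256, 2048):
    a, s = async_mux_cycle_ns(ratio), serdes_cycle_ns(ratio)
    better = "async mux" if a < s else "SerDes"
    print(f"ratio {ratio:>4}: async {a:7.1f} ns  serdes {s:7.1f} ns  -> {better}")
```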