Challenges In Developing A New Inferencing Chip

What’s involved in designing, developing, testing, and modifying an accelerator IC at the edge.


Cheng Wang, co-founder and senior vice president of software and engineering at Flex Logix, sat down with Semiconductor Engineering to explain the process of bringing an inferencing accelerator chip to market, from bring-up, programming and partitioning to tradeoffs involving speed and customization.


SE: Edge inferencing chips are just starting to come to market. What challenges did you run into in developing one?

Wang: We got a chip out and back last September, and we were able to bring it up relatively quickly. The first part of bringing up an inferencing chip is just bringing up the interfaces, which means being able to read and write registers and memory. When you get a chip back, the first thing you do is plug it in, and then you hope and pray that it wakes up. Our chip woke up, which means a PCIe device is able to enter an L0 state. That establishes a Gen 1 speed automatically, which is essential, because once you establish a link to the chip you can start talking to it. If you can’t do that, you have a problem, and you have to go through the I²C back door to figure out if any registers are prohibiting you from establishing a connection. Next, you have to execute the inference portion of the chip, which means getting some number of layers of a neural network to work. That took a couple weeks. The biggest hurdle was to go from a couple layers, which show the functionality of the silicon, to having all of the layers, like you’d find in a really heavy duty inference model like YOLOv3, running at the speed that we’re targeting. The performance for our chip starts at 533 MHz. Getting any FPGA-like fabric, particularly one with standard cells, to reach 533 MHz across hundreds of configurations is a challenge.
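
The first bring-up check described above, confirming that the PCIe link trained, can be sketched from the host side. This is a minimal illustration and not Flex Logix's bring-up flow. It assumes a Linux host, where the kernel exposes a `current_link_speed` file for each device under `/sys/bus/pci/devices/`. The helper names and the device address passed in are hypothetical.

```python
import re
from pathlib import Path

# PCIe per-lane transfer rates (GT/s) mapped to generation numbers.
GEN_BY_RATE = {2.5: 1, 5.0: 2, 8.0: 3, 16.0: 4, 32.0: 5}

def parse_link_speed(text):
    """Return the PCIe generation for a sysfs string such as '2.5 GT/s PCIe'.

    Returns None when the string carries no recognizable rate
    (for example 'Unknown' when the link did not train).
    """
    match = re.match(r"\s*(\d+(?:\.\d+)?)\s*GT/s", text)
    if not match:
        return None
    return GEN_BY_RATE.get(float(match.group(1)))

def link_is_up(address, sysfs_root="/sys/bus/pci/devices"):
    """Report whether the device at `address` trained to at least Gen 1."""
    speed_file = Path(sysfs_root) / address / "current_link_speed"
    if not speed_file.exists():
        return False  # device did not enumerate, or this is not a Linux host
    return parse_link_speed(speed_file.read_text()) is not None
```

A Gen 1 link reports 2.5 GT/s, so `parse_link_speed("2.5 GT/s PCIe")` returns 1. Register and memory access would be tested only after this check passes.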

SE: When a customer begins working with that, what do they need to know?

Wang: On the hardware side, that’s relatively straightforward. People are able to replicate what we have. Where they run into issues is when they use variations of the model. They may have certain operators that they’re using that we’re not supporting today, or they may have certain settings for those operators. But the issues are almost always on the software side, whether that’s translation of the model or full mapping of the operator that they’re targeting. It’s not usually in the operation of the silicon.

SE: How about characterization of the chip? How much of that falls on the customer versus the vendor?

Wang: That’s mostly between us and our service provider, which in our case is Global Unichip. GUC does the characterization for defects, the testing, the qualification for things like high-temperature operation. They also do sorting of parts based on power and performance, to make sure the parts being shipped are in the PPA range we have specified. When the customer gets the part, it’s expected to be a known good part.

SE: Do you see this as a standalone chip or something that will be included in a package?

Wang: It’s standalone silicon, but this is an accelerator. So it’s targeted to run alongside a processor that’s communicating with it via PCIe. We have up to four lanes of PCIe because we’re targeting some of the NVMe form factors for servers, which may have up to 32 NVMe slots with four slots for SSD drives. We can just replace those because the servers generally have much more capacity for NVMe than PCIe cards. A server may have six to eight slots for PCIe cards. We have dozens of NVMe slots, which gives you that much more throughput per box. That said, there are smaller boxes with only one PCIe slot. But this is not designed to be an SoC. It’s an accelerator.

SE: Can it be customized?

Wang: Many accelerators are customized. This is a broader machine learning inferencing accelerator. We’re not targeting a particular model or application, even though most applications today are vision-based. This is targeting a behavior. So this will run better on vision-based workloads at batch size one, with YOLO-based models on the order of 100 million weights. But you also can use it for natural language translation, where you can have trillions of weights, but where it doesn’t work well is on an edge device. For vision, usually you are filming with 4K or 8K, which produces a huge amount of data. It’s unfeasible to do all the processing in the cloud. You have to have some compute capability on the device itself and be able to detect that maybe 0.1% of events are interesting. Maybe those need to be sent to the cloud to be processed further, but you need to do some of the processing at the edge or it will hog all the bandwidth. This is different from translation. So you can say, ‘Hey Alexa,’ and it wakes up the device and sends a phrase to the data center. It’s a small amount of data on a huge model. What we’re trying to do now is process a lot of data on a modest-sized model. But it’s only hundreds of megabytes versus gigabytes or terabytes.

SE: And that’s one of the big changes, right? Initially, everyone thought that when we were doing AI and machine learning at the edge that it would involve massive amounts of data. In reality, inferencing can be done with a much smaller data set.

Wang: Absolutely. It doesn’t make sense to send everything to the cloud. There are too many cameras and too much data to process everything efficiently. The latency requirements are very stringent and the bandwidth requirements are too high. You have to have something closer to the device, where you’re not doing anything on a cellular or wireless network. That’s the first batch of processing. Once the first batch of intelligence has been established, whatever comes out of that is only going to give you interesting results every once in a while. Bounding objects in a frame is a much, much lower data footprint than the actual raw video. It’s much easier for the machines that come after that to digest the processed data than the raw data.
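
A rough calculation shows the scale of the gap between raw video and bounding-box output. Every figure here is an assumption chosen for illustration (uncompressed 4K at 30 frames per second, 3 bytes per pixel, 10 detections per frame, 16 bytes per detection), and none of them comes from the interview.

```python
def raw_video_bytes_per_second(width, height, bytes_per_pixel, fps):
    """Bandwidth of an uncompressed video stream."""
    return width * height * bytes_per_pixel * fps

def detection_bytes_per_second(boxes_per_frame, bytes_per_box, fps):
    """Bandwidth of bounding-box metadata only."""
    return boxes_per_frame * bytes_per_box * fps

# Assumed: 4K UHD (3840x2160), 24-bit colour, 30 frames per second.
raw = raw_video_bytes_per_second(3840, 2160, 3, 30)
# Assumed: 10 boxes per frame, each stored as four 32-bit values (16 bytes).
boxes = detection_bytes_per_second(10, 16, 30)

print(f"raw video:  {raw / 1e6:.1f} MB/s")    # 746.5 MB/s
print(f"detections: {boxes / 1e3:.1f} kB/s")  # 4.8 kB/s
print(f"reduction:  {raw / boxes:,.0f}x")     # 155,520x
```

Real cameras compress their output, so the practical ratio is smaller, but under these assumptions the detections are about five orders of magnitude lighter than the raw pixels.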

SE: What’s the next challenge?

Wang: The main issue is retraining the network, which is like retraining the last few layers of the model so it can recognize what it wants to recognize. The machine learning models, when you train them on the regular data sets, may be recognizing cats and dogs and people and bicycles. But when you’re in medical imaging or factory automation, you’re trying to detect very specific objects. It may be organs or bacteria or tools. You have to retrain your model to recognize very specific things. The machine learning only has a limited number of classes of images it can recognize. It may be 80 or 100, but for each model that has been trained it will only recognize that many types of objects. If you train them on something broad, it won’t be able to identify something specific, and vice versa.

SE: So that means in order to identify more things, you need more inference chips?

Wang: Training and inferencing are different. Training is like coding. You can train in a different device, but when you deploy, you have to deploy in a specific device. Now, do you need different devices to recognize an object? Not really. If you really need to recognize different classes of objects, you could have multiple models running on a device. If you’re in factory automation, you may first run a model to detect if there are any defects in the material you’re scanning. Then, you run a model to determine if there is any dirt or debris. You could always split your input across multiple accelerators, each running a different model in parallel, or a different portion of a model in serial. If you really have a model that you think is too large to run sufficiently, either because of the size of its weights or because you want to re-use the cache layer, you can split the model across different sub-models. You can do that in a pipeline model. Each model is partitioned, and then you can run the model more quickly in a pipeline fashion.
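
The pipelined split mentioned above can be sketched as cutting a model's layers into contiguous stages, one per accelerator, so that the slowest stage is as short as possible. This is an illustrative sketch and not Flex Logix's compiler. The per-layer costs are made-up integers, and `partition_layers` is a hypothetical helper that uses a textbook binary search over the bottleneck cost.

```python
def stages_needed(costs, limit):
    """Count the contiguous stages needed if no stage may exceed `limit`."""
    stages, current = 1, 0
    for cost in costs:
        if current + cost > limit:
            stages += 1
            current = 0
        current += cost
    return stages

def partition_layers(costs, num_stages):
    """Split `costs` into at most `num_stages` contiguous groups,
    minimizing the cost of the most expensive group (the bottleneck)."""
    low, high = max(costs), sum(costs)
    while low < high:
        mid = (low + high) // 2
        if stages_needed(costs, mid) <= num_stages:
            high = mid          # feasible: try a smaller bottleneck
        else:
            low = mid + 1       # infeasible: the bottleneck must grow
    # Rebuild the groups greedily using the smallest feasible bottleneck.
    groups, current = [], []
    for cost in costs:
        if current and sum(current) + cost > low:
            groups.append(current)
            current = []
        current.append(cost)
    groups.append(current)
    return groups

# Hypothetical per-layer costs (arbitrary units) split over three devices.
stages = partition_layers([4, 2, 7, 3, 5, 1], 3)
```

With these invented costs, `stages` is `[[4, 2], [7], [3, 5, 1]]`, whose most expensive stage costs 9. In a pipeline, steady-state throughput is set by that slowest stage, which is why the sketch minimizes it rather than the total.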

SE: Where does prioritization fit into that?

Wang: If you have one model, you’re executing the same model on all of the input images. But if you have multiple models, and you only want to identify things that are interesting, then you run a heavier model. You can even run the heavier model on a different device, or in the cloud, because the assumption is that with most of the events that a machine learning accelerator is looking at, you do not need further processing.
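
The two-tier arrangement described here, a light model on every input and a heavier model only on the interesting ones, can be sketched as a simple cascade. Both models below are hypothetical stand-in functions, and the 0.5 threshold is an arbitrary assumption.

```python
def cascade(frames, light_model, heavy_model, threshold=0.5):
    """Run `light_model` on every frame and `heavy_model` only on frames
    whose light score reaches `threshold`.

    Returns (results, heavy_calls), where results maps the frame index to
    the heavy model's output for escalated frames only.
    """
    results = {}
    heavy_calls = 0
    for index, frame in enumerate(frames):
        if light_model(frame) >= threshold:
            results[index] = heavy_model(frame)
            heavy_calls += 1
    return results, heavy_calls

# Stand-in models: a "frame" is just a number in this toy example.
def toy_light_model(frame):
    return 1.0 if frame > 90 else 0.0   # cheap interest score

def toy_heavy_model(frame):
    return f"object@{frame}"            # expensive classification

results, heavy_calls = cascade(range(100), toy_light_model, toy_heavy_model)
```

In this toy run the heavy model is called for only 9 of 100 frames. In a real deployment `heavy_model` could be a call to another device or to the cloud.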

SE: The more you can narrow this down, the better the power and performance, right?

Wang: Yes. The simpler the model that we can execute, even if we can have a high-performance accelerator, the more you can take advantage of a number of tricks to reduce power, such as when a device is not being fully utilized. You save a lot of power if your model is simpler. Even if your performance will allow you to use a heavier weight model, you may not need to do that.

SE: Where do people go wrong with this approach?

Wang: In terms of hardware inference, it’s when people are able to exceed the performance you think you can achieve. With CPUs, it’s a little more difficult to predict what kind of throughput you will get. With configurable hardware, we can still tell cycle-accurately when your data will arrive for each memory block. We partition the graph across multiple partitions, so we know exactly for each of those configurations cycle-accurately what the performance will be. Most of the machine learning accelerator SoCs are processor-based, and it’s much harder to predict the performance. If you have 64 processors, you will never get 64X the performance. In our case, you can get extremely close to 64X because we can partition the hardware and control it in a cycle-accurate way. In a processor-based architecture, it’s very difficult to have that level of scalability.
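
When the cycle count of each configuration is known exactly, predicted throughput becomes simple arithmetic. The sketch below assumes configurations run one after another on a single device and that throughput scales linearly across independent devices, which is an idealization. The cycle counts are invented for illustration, and only the 533 MHz clock comes from the interview.

```python
CLOCK_HZ = 533_000_000  # clock rate quoted in the interview

def frame_latency_seconds(cycles_per_config, clock_hz=CLOCK_HZ):
    """Latency of one inference when configurations execute sequentially."""
    return sum(cycles_per_config) / clock_hz

def frames_per_second(cycles_per_config, devices=1, clock_hz=CLOCK_HZ):
    """Throughput, assuming ideal linear scaling across independent devices."""
    return devices * clock_hz / sum(cycles_per_config)

# Hypothetical cycle counts for three configurations of one model
# (5.33 million cycles in total).
cycles = [2_000_000, 2_330_000, 1_000_000]

latency = frame_latency_seconds(cycles)           # 0.01 s per frame
single = frames_per_second(cycles)                # 100 frames per second
scaled = frames_per_second(cycles, devices=64)    # 6,400 frames per second
```

The point of the sketch is the shape of the calculation: with deterministic cycle counts there is no measured or estimated term in the formula, which is what makes the prediction exact.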

SE: So the big problem is that you really need to tightly define what you’re trying to do, right?

Wang: Yes, you really need to know how to program it, and that’s a really hard problem for people to solve. So even though we’re a chip company, we’re quickly becoming dominated by the software team — just like many other chip companies doing machine learning. You have to be able to partition a graph across multiple configurations, and for each configuration how to allocate resources. We also need to be able to synthesize RTL onto the FPGA fabric and have it run at speed. All of those are very tough problems to solve.

SE: So if you’re developing the software, how much of that is reflected in the design of the hardware?

Wang: That’s what goes into the next generation. With a programmable FPGA, there is always a fallback. But you also can determine that something would be more efficient if you harden it as a computation kernel, for example, instead of continuing to run it on the FPGA fabric. It comes down to the question of how frequently a feature is used, and how much pain does it cause in terms of performance bottlenecks or resource usage. For the next generation, we have some things that are now in soft logic that we want to push into hard logic.

SE: Part of the burden for improving power and performance depends on the company utilizing this hardware really understanding what they’re trying to do. That’s a whole different challenge for many companies.

Wang: We try to alleviate some of that burden for customers, because they are either using an off-the-shelf model or developing their own. But neither of them really understands the guts of the hardware, just as many hardware developers don’t understand the ISA of an x86, and we shouldn’t expect them to. We should plan to just give them good compiled performance out of code they have built. Our AEs will work with them, of course, to improve their model to get better PPA. But they should not be experts in power gating or clock gating or any of these lower-level techniques that we’re using to get better performance out of the accelerator. Still, with collaboration, we can help them get more out of their next-generation hardware.

SE: Five years ago, when AI really began getting a lot of design attention, most people thought it would be something akin to a HAL 9000 in the movie 2001. What’s happened instead is that we’ve automated some very specific functions.

Wang: Yes, and that’s really what computers are good at. It’s not going to do everything for you, but we’re solving a very difficult problem — how do you describe something that is natural to humans but difficult for computers. How do you describe a bird or a dog to a computer? This is something a computer can recognize, but we’re still a long way from computers being able to think like a human. Still, a computer now can write essays that are almost indistinguishable from humans. That, in a sense, is thinking like humans, but in a very specific way.
