As endpoint architectures get more complicated, EDA tools are becoming key for experimenting with different options.
High-level synthesis (HLS) is experiencing a new wave of popularity, driven by its ability to handle the matrix math of machine learning and to support iterative design efforts.
The obvious advantage of HLS is the boost in productivity designers get from working in C, C++ and other high-level languages rather than RTL. The ability to design a layout that should work, and then easily modify it to test other configurations, could save even more effort in solving thorny problems or in bringing new custom or commercial products to market quickly.
“The big advantage is when you’re creating a new piece of IP and need to make changes to get to the best overall architecture,” said Dave Pursley, senior principal product manager for HLS in the digital and signoff group at Cadence. “Designers are still trying to figure out what are the right algorithms to make it work effectively. So it’s not unusual to want to try more than one solution, and it’s really common recently with all the combinations of bit widths we’re seeing. If you’re doing a design in C with HLS, you can regenerate your code.”
By the time a design reaches RTL, chances are much greater that any change will require starting the layout again from scratch.
“Working with a high-level abstraction model allows you to create the IP more quickly, first of all,” Pursley said. “If you’re designing in RTL and a mode is added or removed, or they change something about the model, you’re done.”
HLS is a design-automation technology that originally was developed for hardware verification. It uses a higher level of abstraction than RTL and has the potential to reduce overall design time and verification costs by 25% to 50%, according to an estimate from Cadence on the impact of its own HLS product line. Synopsys, Mentor and Xilinx also sell their own HLS packages.
“This is all about how you write a floating-point unit,” said Pratik Mahajan, R&D director for the verification group at Synopsys. “When you use chips for training, maybe you want to use a 16-bit multiplier and a 32-bit adder. Making that decision in C++ is much easier. For one thing, software engineers know it and they can model with a high-level language. At the same time, while high-level synthesis is good in terms of creating the first level of RTL, it is not the best for area and power. This is why people are still hand-coding RTL for complex operations.”
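The trade-off Mahajan describes can be expressed directly in ordinary C++. The minimal sketch below is illustrative only, not any vendor’s HLS library: the multiplier and accumulator widths are template parameters, so comparing a 16-bit multiply with a 32-bit accumulate against other combinations is a one-line change rather than an RTL rewrite. Production flows would typically use arbitrary-precision types such as ac_int or ap_int instead of the standard fixed-width integers shown here.

```cpp
#include <cstdint>
#include <cstddef>

// Dot product with the operand and accumulator widths as template parameters.
// An HLS tool infers the operator widths from the C++ types.
template <typename MulT, typename AccT>
AccT dot_product(const MulT *a, const MulT *b, std::size_t n) {
    AccT acc = 0;
    for (std::size_t i = 0; i < n; ++i) {
        // Multiply the narrow operands, accumulate in the wider type.
        acc += static_cast<AccT>(a[i] * b[i]);
    }
    return acc;
}

int main() {
    std::int16_t a[4] = {1, -2, 3, 4};
    std::int16_t b[4] = {5, 6, -7, 8};
    // 16-bit operands with a 32-bit accumulator, the kind of decision
    // Mahajan describes making in C++ rather than in hand-written RTL.
    std::int32_t result = dot_product<std::int16_t, std::int32_t>(a, b, 4);
    return result == 4 ? 0 : 1;  // 5 - 12 - 21 + 32 = 4
}
```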
HLS designs can be tested and verified just like RTL. They can be used to troubleshoot power or timing problems, and HLS tools can ingest machine-learning inference-model definitions written in frameworks such as TensorFlow or Caffe.
That troubleshooting allows designers to optimize the workload to the chip, and the chip to the workload. It also makes it possible to verify that no unintentional changes were made to weights during the data-center conversion of training results into an inference model, which must still behave correctly on an endpoint device at the edge of the network, whatever the size of that device.
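As a rough illustration of that cross-check, and not any particular tool’s flow, the sketch below compares the outputs of the original floating-point reference model with those of the converted inference model on the same input. The vectors here are placeholders for real model outputs.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Returns true if the converted model's outputs agree with the floating-point
// reference within a tolerance, which catches unintentional changes to weights
// introduced during conversion.
bool outputs_match(const std::vector<float> &reference,
                   const std::vector<float> &quantized,
                   float tolerance) {
    if (reference.size() != quantized.size()) return false;
    for (std::size_t i = 0; i < reference.size(); ++i) {
        if (std::fabs(reference[i] - quantized[i]) > tolerance) return false;
    }
    return true;
}

int main() {
    // Placeholder outputs; a real check would run both models on the same
    // labeled input set.
    std::vector<float> reference = {0.91f, 0.05f, 0.04f};
    std::vector<float> quantized = {0.90f, 0.06f, 0.04f};
    return outputs_match(reference, quantized, 0.02f) ? 0 : 1;
}
```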
“When you implement the inference engine, you take the original network model and run it on an edge or mobile device,” according to Mike Fingeroff, HLS technologist at Mentor, a Siemens Business. “You reduce the area and power of the IC, but you look at performance, especially if it’s part of something like an ADAS in an automated vehicle that has some requirement for real-time performance. To get the best performance you have to tailor the hardware to a specific network and optimize the model to contribute to that.”
Fig. 1: Higher levels of abstraction are necessary for complex convolutional designs. Source: Mentor, a Siemens Business
Asking the right questions
HLS is fine for optimizing the usual challenges of chip design: power use, distribution, timing and performance. But its real value, especially in machine learning, where options are many but standards and best practices are few, comes from its ability to answer questions that may be difficult for the designers involved to address, Fingeroff said.
“It does let you experiment and find out what gets the best performance, whether that’s an ASIC, an FPGA or a PE array,” Fingeroff said. “Maybe a PE array is the most flexible hardware option, or maybe you can optimize memory traffic using a fused-layer architecture, perhaps just for the top layers of the network.”
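To make that kind of experiment concrete, here is an illustrative-only sketch in plain C++: the number of output channels computed in parallel, a stand-in for the PE-array width Fingeroff mentions, is a template parameter, so a 4-wide and a 2-wide configuration are just different instantiations of the same source. In a real HLS flow the inner loop would carry the tool’s unroll directive so it maps to parallel processing elements; that directive is omitted here because the syntax is vendor-specific.

```cpp
// Single-position convolution across all output channels, with PE output
// channels computed per pass.
template <int PE, int IN_CH, int OUT_CH, int K>
void conv_point(const float in[IN_CH][K][K],
                const float weights[OUT_CH][IN_CH][K][K],
                float out[OUT_CH]) {
    static_assert(OUT_CH % PE == 0, "OUT_CH must be a multiple of PE");
    for (int ob = 0; ob < OUT_CH; ob += PE) {
        // In a real HLS flow this loop would be unrolled into PE parallel
        // processing elements.
        for (int p = 0; p < PE; ++p) {
            float acc = 0.0f;
            for (int ic = 0; ic < IN_CH; ++ic)
                for (int ky = 0; ky < K; ++ky)
                    for (int kx = 0; kx < K; ++kx)
                        acc += in[ic][ky][kx] * weights[ob + p][ic][ky][kx];
            out[ob + p] = acc;
        }
    }
}

int main() {
    // Comparing a 4-wide PE array against a 2-wide one is just a different
    // template instantiation, not a new RTL implementation.
    static const float in[3][3][3] = {};
    static const float w[8][3][3][3] = {};
    static float out[8] = {};
    conv_point<4, 3, 8, 3>(in, w, out);
    conv_point<2, 3, 8, 3>(in, w, out);
    return 0;
}
```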
Trial and error isn’t the most efficient approach to solution development in the chip business or any other with high standards and high development costs, according to Mike Demler, analyst at The Linley Group and co-author of the February report A Guide to Processors for Deep Learning. By 2023, half of all new servers will include a deep learning accelerator (DLA), that report predicts. It also expects 1.9 billion client devices to ship with DLAs in 2022.
But if DLAs become that common, will trial and error in building ML devices become almost unnecessary?
“The higher performance you can get, the more efficient you can get, the more you’ll be able to do,” Demler said. “But there are a lot of niche markets and devices and other places for voice recognition and other things if your product works, even if your flow is not super slick.”
Getting to that determination, however, sometimes involves digging into the design team’s structure. When a company puts together a software/hardware design team to develop the machine-learning application and design the device on which it will run, it’s not a bad idea to establish where the final responsibility lies.
“In China I had a long conversation with the hardware engineer about what we were trying to do, and it eventually became clear he was not the one calling the shots,” said Kurt Shuler, vice president of marketing at Arteris IP. “It was the software architect calling the shots, so we all got together and that let us move forward once I realized the chip was defined by the algorithm, not the other way around.”
But the software architect doesn’t always have a good feel for the hardware. “The other problem we had was that, often, a software architect won’t be that good at abstracting down to the transistor level, and the hardware architect may not be good at abstracting up to the software, so you have to kind of walk them through that,” said Shuler.
Insisting on tight integration and optimization of software with hardware also may be a good way to coordinate development, but it doesn’t always reflect realistic performance requirements. Shuler noted that one way to help customers think about the problem is, rather than asking the hardware architect what would happen if the chip didn’t live up to expectations, to ask what the impact on the device would be if the custom chip were removed and replaced with an off-the-shelf inference chip that is completely generic to the application.
Joe Mallett, senior staff marketing manager for Synopsys’ Verification Group, agrees. “At the edge, what we’re seeing is people doing early architectural discovery,” he said. “They want to figure out the right algorithm quickly, which is typically an architectural investigation, and then they want to start deploying an FPGA or an ASIC flow. HLS can be used in both of those.”
Out of the data center and into ubiquity
The bulk of machine-learning activity, both training and inference, is still centered in cloud data centers. Inference there is handled mostly by commodity rack-mounted servers equipped with TPU, ASIC or FPGA accelerators, driving consumer services like Google Translate as well as commercial ML infrastructure services that give end-user companies better access to platforms on which to build their own offerings, according to Jeff Miller, senior product marketing manager at Mentor, a Siemens Business.
“We are seeing a lot of very interesting approaches to machine learning in devices, though, especially in the custom IC space,” Miller said. “We see a lot of custom digital and IP approaches, analog devices using neuromorphic designs, and processors or co-processors from big IP vendors that people can use along with a traditional SoC segment for a customized edge device that doesn’t have to be built from the ground up. We still have to be able to get the algorithms right, but it’s interesting to see useful progress on all fronts.”
It makes a lot of sense to put an inference accelerator on a smartphone and offer ML services through the cloud, because some of those services need the horsepower and, until recently, the cost of even semi-customized inference accelerators was too high for much experimentation, according to Steven Woo, distinguished inventor and vice president of enterprise solutions technology at Rambus.
There are already plenty of examples of people running things like voice-response servers on hardware that’s not optimized for ML inference. “Mostly they don’t even need to know TensorFlow or another framework,” he said.
As costs drop for devices built on FPGA chips or other semi-custom hardware, more and more devices will be running full inferencing software on hardware that is not optimized for neural-network inferencing, Woo said.
“Every mobile phone processor has or soon will have a neural engine in it,” Woo said. “That’s a lot of platform for someone who decides they want to develop machine learning applications. Huawei, Qualcomm and other mobile chip companies are opening APIs and interfaces on their phones; pretty soon we’ll find developers can do some pretty neat things with those. There’s already a lot going on in medical detection and imaging, which looks like it will be pretty important.”
Many companies do spend a lot of time testing ASICs vs. FPGAs, including comparisons of 16-bit vs. 8-bit vs. 4-bit precision for certain types of matrix multiplication, but they do this testing to make their implementations sharper and more operationally efficient. There are already pre-configured or cloud-based machine-learning functions that are likely to make the whole area just one more part of the personal/business tech infrastructure, Woo said.
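That kind of precision comparison can be prototyped in a few lines of C++ before any hardware decision is made. The sketch below, which assumes a simple symmetric, per-tensor quantization scheme purely for illustration, quantizes the same weights to 16, 8 and 4 bits, runs the same small dot product, and reports the error against the floating-point reference.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

// Symmetric, per-tensor quantize-then-dequantize of a single value
// (an illustrative scheme, not any particular tool's).
float quantize(float x, float max_abs, int bits) {
    const float levels = static_cast<float>((1 << (bits - 1)) - 1);
    const float scale = max_abs / levels;
    float q = std::round(x / scale);
    if (q > levels) q = levels;
    if (q < -levels) q = -levels;
    return q * scale;
}

int main() {
    const std::vector<float> w = {0.7f, -0.3f, 0.05f, 0.9f};   // weights
    const std::vector<float> x = {1.0f, 0.5f, -0.25f, 2.0f};   // activations

    // Full-precision reference dot product.
    float ref = 0.0f;
    for (std::size_t i = 0; i < w.size(); ++i) ref += w[i] * x[i];

    // Sweep weight precision and report the error against the reference.
    const int precisions[] = {16, 8, 4};
    for (int bits : precisions) {
        float acc = 0.0f;
        for (std::size_t i = 0; i < w.size(); ++i)
            acc += quantize(w[i], 1.0f, bits) * x[i];
        std::printf("%2d-bit weights: result %.6f, error %.6f\n",
                    bits, acc, std::fabs(acc - ref));
    }
    return 0;
}
```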
“Flex Logix announced a neural network infrastructure around its eFPGA capabilities that really let you tailor within some of the units what you’re doing on the chip,” Woo said. “There is such a range of use cases and so many ways to approach it. If you need to tailor your device to get the kind of efficiency you need, you can do it. If you don’t really care if you’re using too much power or are not at that level of super optimization and you’re okay with that, you can do a lot.”
Neural networking already is gaining traction on the edge.
“The interesting thing is that it looks as if voice as an interface has already taken off because of what we can already do,” Woo said. “You can already talk to Siri on your phone, so neural networks are already in play there, and it’s getting pretty common in other things. We may already have a big milestone and are just now realizing it.”
Conclusion
High-level synthesis is a design-automation method that is faster than RTL design and holds promise for quick, experimental designs, especially in areas that are not yet well understood. Its most useful feature may not be design itself but iteration: the ability to easily create many variations on a master layout and run verification and performance testing on those models, making it quicker to decide whether a particular type of neural network runs best on a purpose-built ML chip, an FPGA, a custom ASIC or even an x86 processor.
Either way, machine-learning inference devices are likely to continue to become more capable and more popular, creating both a resource and a market for chipmakers in the process, regardless of how they choose to approach design and verification.