AI and edge applications are driving design teams to find new ways to achieve the best performance per watt.
The rush to the edge and new applications around AI are causing a shift in design strategies toward the highest performance per watt, rather than the highest performance or lowest power.
This may sound like hair-splitting, but it has set a scramble in motion around how to process more data more quickly without just relying on faster processors and accelerators. Several factors are driving these changes, including the slowdown in Moore’s Law, which limits the number of traditional options, the rollout of AI everywhere, and a surge in data from more sensors, cameras and images with higher resolutions. In addition, more data is being run through convolutional neural networks or deep learning inferencing systems, which bring huge data processing loads.
“As semiconductor scaling slows, but processing demands increase, designers are going to need to start working harder for those performance and efficiency gains,” said Russell Klein, HLS platform director at Mentor, a Siemens Business. “When optimizing any system, you need to focus on the biggest inefficiencies first. For data processing on embedded systems, that will usually be software.”
When Moore’s Law was in its prime, processor designers had so many gates they didn’t know what to do with them all, Klein said. “One answer was to plop down more cores, but programmers were reluctant to adopt multi-core programming paradigms. Another answer was to make the processor go as fast as possible without regard to area. A feature that would add 10% to the speed of a processor was considered a win, even if it doubled the size of that processor. Over time, high-end processors picked up a lot of bloat, but no one really noticed or cared. The processors were being stamped out on increasingly efficient and dense silicon. MIPS was the only metric that mattered, but if you start to care about system level efficiency, that bloated processor, and especially the software running on it, might warrant some scrutiny.”
Software has a lot of very desirable characteristics, Klein pointed out, but even well-written software is neither fast nor efficient when compared to the same function implemented in hardware. “Moving algorithms from software on the processor into hardware can improve both performance and power consumption because software alone is not going to deliver the performance needed to meet the demands of inferencing, high resolution video processing, or 5G.”
The need for speed
At the same time, data traffic speeds are increasing, and there are new demands on high-speed interfaces to access that data. “High-speed interfaces and SerDes are an integral part of the networking chain, and these speed increases are required to support the latest technology demands of artificial intelligence (AI), Internet of Things (IoT), virtual reality (VR) and many more technologies that have yet to be envisioned,” noted Suresh Andani, senior director of IP cores at Rambus.
Best design practices for high-performance devices include defining and analyzing the solution space through accurate full-system modeling; utilizing system design and concurrent engineering to maximize first-time right silicon; ensuring tight correlation between models and silicon results; leveraging a system-aware design methodology; and including built-in test features to support bring-up, characterization and debug, he said.
There are many ways to improve performance per watt, and not just in hardware or software. Kunle Olukotun, Cadence Design Systems Professor of electrical engineering and computer science at Stanford University, said that relaxing precision, synchronization and cache coherence can reduce the amount of data that needs to be sent back and forth. That can be reduced even further by domain-specific languages, which do not require translation.
“You can have restricted expressiveness for a particular domain,” said Olukotun in a recent presentation. “You also can utilize parallel patterns and put functional data into parallel patterns based on representation. And you can optimize for locality and exploit parallelism.”
He noted that flexible mapping of data is much more efficient. That can take advantage of data parallelism, model parallelism, and dynamic precision as needed. In addition, the data flow can be made hierarchical using a wider interface between the algorithms and the hardware, allowing for parallel patterns, explicit memory hierarchies, hierarchical control and explicit parameters, all of which are very useful in boosting performance per watt in extremely performance-centric applications.
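To make the data-movement savings concrete, here is a minimal Python sketch, with illustrative layer sizes, of how relaxing precision from 32-bit floating point to 8-bit integers cuts the volume of coefficient data that has to travel between memory and the compute units by roughly 4x. The symmetric quantization scheme shown is a generic one chosen for illustration, not a method described by Olukotun.

```python
# Minimal sketch: relaxing precision from float32 to int8 cuts the bytes
# that must move between memory and the compute units by roughly 4x.
# Layer size and quantization scheme are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
weights_fp32 = rng.standard_normal((512, 512)).astype(np.float32)

# Simple symmetric quantization to int8 (one scale per tensor).
scale = np.abs(weights_fp32).max() / 127.0
weights_int8 = np.clip(np.round(weights_fp32 / scale), -128, 127).astype(np.int8)

print("float32 bytes:", weights_fp32.nbytes)   # 1,048,576
print("int8 bytes:   ", weights_int8.nbytes)   # 262,144
print("reduction:     %.1fx" % (weights_fp32.nbytes / weights_int8.nbytes))
```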
Flexibility in designs has been one of the tradeoffs in optimizing performance per watt, and many of the new AI chips under development have been struggling to combine optimally tuned hardware and software into designs while still leaving enough room for ongoing changes in algorithms and different compute tasks.
“You may spend 6 to 9 months mapping how to cut up work, and that provides a big impediment to embracing new markets quickly,” said Stuart Biles, a fellow and director of research architecture at Arm Research. “For large OSes, there is a set of functionality in the system where a particular domain is likely to execute on a general-purpose core. But you can add in flexibility for how you partition that and make the loop quicker. That basically comes down to how well you use an SoC’s resources.”
Biles noted that once a common subset is identified, certain functions can be specialized with an eFPGA or using 3D integration. “We’ve moved from the initial 3D integration to the microarchitecture, where you can cut out cycles and branch prediction. What you’re looking at is the time it takes to get from load/store to processor versus doing that vertically, and you can change the microarchitectural assumptions based on specific assumptions in 3D. That results in different delays.”
A different take on the same problem is to limit the amount of data that needs to be processed in the first place. This is particularly important in edge systems such as cars, where performance per watt is critical due to limited battery power and the need for real-time results. One way to change that equation is to sharply limit the amount of data being sent to centralized processing systems in the vehicle by pre-screening it at the sensor level. So while this does not actually speed up the processing itself, it achieves faster results using less power.
“You can provide a reasonable amount of compute power at the sensor, and you can reduce the amount of data that the sensor identifies through pre-selection,” said Benjamin Prautsch, group manager for advanced mixed-signal automation at Fraunhofer IIS’ Engineering of Adaptive Systems Division. “So if you’re looking at what is happening in a room, the first layer can identify if there are people in there. The same can be used on a manufacturing line. You also can run DNN calculations in a parallel way to be more efficient.”
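The kind of pre-selection Prautsch describes can be approximated in a few lines. The sketch below gates camera frames at the sensor so that only frames that appear to contain activity are forwarded to the central processor; the motion metric and threshold are hypothetical and would be tuned per deployment.

```python
# Minimal sketch of sensor-level pre-screening: a cheap first-layer check
# decides whether a frame is worth sending to the central processor at all.
# The motion metric and threshold are illustrative assumptions.
import numpy as np

MOTION_THRESHOLD = 12.0  # mean absolute pixel difference; tuned per deployment

def worth_forwarding(prev_frame: np.ndarray, frame: np.ndarray) -> bool:
    """Return True only if the frame differs enough from the previous one."""
    diff = np.abs(frame.astype(np.int16) - prev_frame.astype(np.int16))
    return float(diff.mean()) > MOTION_THRESHOLD

def prescreen(frames):
    """Yield only frames that pass the cheap check, cutting upstream traffic."""
    prev = None
    for frame in frames:
        if prev is None or worth_forwarding(prev, frame):
            yield frame
        prev = frame
```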
Further, AI chips, like many high performance devices, have a tendency to develop hotspots, noted Richard McPartland, technical marketing manager at Moortec. “AI chips are designed to tackle immense processing tasks for training and inference,” he said. “They are typically very large in silicon area, with hundreds or even thousands of cores on advanced finFET processes consuming high current – 100 amperes or more at supply voltages below 1 volt. With AI chip power consumptions at a minimum in the tens of watts, but often well over 100 watts, it should be no surprise that best design practices include in-chip temperature monitoring. And it’s not just one sensor, but typically tens of temperature sensors distributed throughout the clusters of processors and other blocks. In-chip monitoring should be considered early in the design flow and included up front in floor planning, and not added as an afterthought. At a minimum, temperature monitoring can provide protection from thermal runaway. But accurate temperature monitoring also supports maximizing data throughput by minimizing throttling of the compute elements.”
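As a rough illustration of how distributed temperature sensing might feed a throttling policy, the following sketch maps per-cluster temperatures to clock-scaling decisions. The thresholds and the sensor interface are illustrative assumptions, not Moortec’s monitoring IP; real parts expose monitor readings through vendor-specific registers.

```python
# Minimal sketch of a throttling policy driven by distributed in-chip
# temperature sensors. Thresholds are illustrative assumptions.
TEMP_WARN_C = 95.0
TEMP_TRIP_C = 110.0

def throttle_decision(cluster_temps_c):
    """Map per-cluster temperatures to a clock-scaling factor per cluster."""
    decisions = {}
    for cluster, temp in cluster_temps_c.items():
        if temp >= TEMP_TRIP_C:
            decisions[cluster] = 0.0          # halt: protect against thermal runaway
        elif temp >= TEMP_WARN_C:
            # scale frequency down linearly between the warn and trip points
            decisions[cluster] = 1.0 - (temp - TEMP_WARN_C) / (TEMP_TRIP_C - TEMP_WARN_C)
        else:
            decisions[cluster] = 1.0          # full speed
    return decisions

print(throttle_decision({"cluster0": 82.0, "cluster1": 101.5, "cluster2": 112.0}))
```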
In-chip voltage monitoring with multiple sense points is also recommended for high-performance devices such as AI chips, he continued. “Again, this should be included early in the design flow to monitor the supply voltages at critical circuits, such as the processor clusters, as well as supply drops between the supply pins and the circuit blocks. Voltage droops occur when the AI chips start operating under load, and being software-driven, this can be difficult to predict in the chip design phase with the software written later by another team. Including voltage sense points gives visibility about what is going on with the internal chip supplies, and is invaluable in the chip bring-up phase, as well as for reducing power consumption through minimizing guard bands.”
Process detectors are also a must-have on high-performance devices such as AI chips, McPartland said. “These enable a quick and independent verification of process performance and variation, not just die-to-die but across large individual die on advanced nodes. Further, they can be used for power optimization, such as to reduce power consumption through voltage scaling schemes where the voltage guard bands are minimized on a per-die basis based on process speed. Lower power equates to higher processing performance in the AI world, where processing power is often constrained by thermal and power issues.”
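A per-die voltage-scaling scheme of the kind McPartland describes could look, in rough outline, like the sketch below. The process-speed bins and trimmed voltages are illustrative assumptions rather than figures from any production flow.

```python
# Minimal sketch of per-die voltage guard-band trimming based on a process
# detector reading. Bin thresholds and voltages are illustrative assumptions.
NOMINAL_VDD = 0.80  # volts

def trimmed_vdd(process_speed_index: float) -> float:
    """Faster silicon can tolerate a lower supply for the same target frequency."""
    if process_speed_index > 1.05:      # fast corner: trim the guard band
        return NOMINAL_VDD - 0.05
    if process_speed_index < 0.95:      # slow corner: add margin
        return NOMINAL_VDD + 0.03
    return NOMINAL_VDD                  # typical silicon keeps the nominal supply

for idx in (0.92, 1.00, 1.10):
    print(idx, "->", round(trimmed_vdd(idx), 2), "V")
```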
AI algorithm performance challenges
An important consideration of AI and other high-performance devices is the fact that actual performance is not known until the end application is run. This raises questions for many AI processor startups that insist they can build a better hardware accelerator for matrix math and other AI algorithms than the next guy.
“That’s their key differentiation,” said Ron Lowman, strategic marketing manager for IoT at Synopsys. “Some of those companies may be in their second or third designs, whereas the bigger players are in their third or fourth designs, and they’re learning something every time. The math is changing on them just as rapidly as they can get a chip out, which is helping the situation, but it’s a game for who can get the highest performance in the data center. That’s now moving down to edge computing. Those AI accelerators are being built on local and on-premise servers now, and they want to find their niche in performance per watt and for specific applications. But in that space, they still have to accommodate many different types of AI functions, be it for voice or audio or database extraction or vision. That’s a lot of different things. Then there’s the guys building the applications, like for ADAS. That’s a very specific use case, and they can be more specific to what they’re building, so they know exactly the model they may want, although that too changes pretty rapidly.”
If the design team has a better handle on the end application and the intended use cases, they can look at each different specific space, whether it’s for mobile or edge computing, or for automotive. “You can see that the TOPS, just the pure performance, has grown orders of magnitude over the last couple of years,” Lowman said. “The initial mobile devices that were going to handle AI had under a TOPS (tera operations per second). Now you’re seeing up to 16 TOPS in those mobile devices. That’s how they start, by saying, ‘This is the general direction because we have to handle many different types of AI functions in the mobile phone.’ You look at ADAS, and those guys were even ahead of the mobile phones. Now you’re seeing up to 35 TOPS for a single instantiation for ADAS, and that continues to grow. In edge computing, they’re basically scaling down the data center devices to be more power-efficient, and those applications can range between 50 to hundreds of TOPS. That’s where you start.”
However, a first-generation AI architecture often is very inefficient for what the design team wants to accomplish because it tries to do too much. If the actual application could be run, the architecture could be tuned significantly, because it’s not just the processor or the ability to do the MACs. It’s a function of accessing the coefficients from memory, then processing them very effectively. Nor is it simply a matter of adding a bunch of on-chip SRAM to solve the problem. By modeling the IP, such as DDR instantiations with different bit widths and access capabilities, different DRAM configurations, or LPDDR versus DDR, optimal choices can be found before system development is complete using prototyping tools and system exploration tools.
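As a simple illustration of why modeling coefficient access matters, the following first-order estimate computes the DRAM bandwidth needed just to stream a model’s coefficients at a target inference rate, before accounting for any on-chip reuse. The model sizes and inference rates are illustrative assumptions.

```python
# Rough first-order sizing sketch: bandwidth needed just to stream model
# coefficients from DRAM at a target inference rate, before any on-chip
# reuse. All numbers are illustrative assumptions, not measured figures.
def coeff_bandwidth_gb_s(num_coefficients: float,
                         bytes_per_coeff: int,
                         inferences_per_s: float) -> float:
    """GB/s required to fetch every coefficient once per inference."""
    return num_coefficients * bytes_per_coeff * inferences_per_s / 1e9

# A vision CNN with ~25 million coefficients at int8, 30 inferences/s
print(coeff_bandwidth_gb_s(25e6, 1, 30))    # ~0.75 GB/s

# A billion-coefficient language model at int8, 10 inferences/s
print(coeff_bandwidth_gb_s(1e9, 1, 10))     # ~10 GB/s, a large slice of an LPDDR budget
```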
“If the development team has the real algorithm, it’s much more effective,” Lowman said. “A lot of people use ResNet-50 as a benchmark because that’s better than TOPS. But people are well beyond that. You see voice applications for natural language understanding. ResNet-50 has maybe a few million coefficients, but some of these are in the billions of coefficients now, so it’s not even representative. And the more representative you can get of the application, the more accurately you can define your SoC architecture to handle those types of things.”
With so many moving pieces, the more modeling that can be done up front with the actual IP, the better. “This is where some traction is happening, seen in many aspects. The memory pieces that are so important, the processing pieces that are so important. Just even the interfaces for the sensor inputs, like MIPI, or audio interfaces. All that architecture can be optimized based on the algorithm, and it’s no different than it always has been. If you run the actual software, you can go ahead and optimize much more effectively. But there’s a constant need to grow the performance per watt. If the estimates are to be believed, with some saying that 20% to 50% of all electricity will be consumed by AI, that’s a huge problem. That is spurring the trend to move to more localized computing, and trying to compress these things into the application itself. All of those require different types of architectures to handle the different functions and features that you’re trying to accomplish,” Lowman said.
Power plays a role here, too, because the amount of memory capacity needed changes with the number of coefficients, as well as with the number of math blocks.
“You can throw on tons of multiply/accumulates, put them all on chip, but you also have to have all the other things that are done afterward,” he said. “That includes the input of the data and conditioning of that input data. For instance, for audio, you need to make sure there are no bottlenecks. How much cache is needed for each of these data movements? There are all kinds of different architectural tradeoffs, so the more modeling you can do up front, the better your system will be if you know the application. If you create a generic one, and then run the one that you actually run in the system, you may not get the accuracy that you thought you had. There’s a lot of work being done to improve that over time, and make corrections for that to get the accuracy and power footprint that they need. You can start with some general features, but every generation I’ve seen is moving very quickly on more performance, less power, more optimized math, more optimized architectures, and the ability to do not just a standard SRAM but a multi-port SRAM. This means you’re doing two accesses at once, so you may have as many multiply/accumulates as you want. But if you can go ahead and do several reads and writes in a single cycle, that saves on power. You can optimize what that looks like when you’re accessing, and the number of multiply/accumulates you need to do for that particular stage in the pipeline.”
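A small sketch of the multi-port SRAM point: if the buffer feeding the MAC array exposes two ports, two accesses can be issued in the same cycle, roughly halving access-bound cycles for a pipeline stage. The access count below is an illustrative assumption.

```python
# Minimal sketch: a dual-port SRAM lets two accesses land in the same cycle,
# roughly halving access-bound cycles for a pipeline stage.
# The per-stage access count is an illustrative assumption.
def access_cycles(num_accesses: int, ports: int) -> int:
    """Cycles to issue num_accesses reads/writes given this many SRAM ports."""
    return -(-num_accesses // ports)   # ceiling division

accesses_per_stage = 4096
print("single-port:", access_cycles(accesses_per_stage, 1))  # 4096 cycles
print("dual-port:  ", access_cycles(accesses_per_stage, 2))  # 2048 cycles
```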
Conclusion
With so much activity in the high-performance and AI space, it’s an exciting time for the semiconductor ecosystem around these applications. There is a tremendous amount of startup activity, with the thinking evolving from a more generic mindset of, “We can do the math for neural networks,” to one in which everybody can do the math for specific neural networks in different fields, Lowman said. “You can do it for voice, you can do it for vision, you can do it for data mining, and there are specific types of vision, voice or sound where you can optimize for certain things.”
This only makes the AI market opportunity more exciting as the technology branches out into many different fields, whether extensions of current ones or new areas altogether, and as the development technologies and tool ecosystem discover new ways to make it all a reality.
—Ed Sperling contributed to this report.