The acceleration of performance improvements due to AI and disaggregation is driving significant changes at the leading edge of computing.
Supercomputers and high-performance computers are becoming increasingly difficult to differentiate due to the proliferation of AI, which is driving huge performance increases in commercial and scientific applications and raising similar challenges for both.
While the goals of supercomputing and high-performance computing (HPC) have always been similar — blazing fast processing — the markets they serve are very different. Supercomputers, such as those on the Top 500 list, typically are showcases for scientific and academic computations, with performance often measured in exaflops. HPC, on the other hand, targets more conventional applications using high-bandwidth memory, fast inter-processor communication, and a high number of floating-point operations per second (FLOPS). But with the focus on AI training and inferencing, similarities are growing between these compute architectures.
“Fundamentally, HPC is based on high-bandwidth memory access, fast and low latency inter-processor communication, and lots of single and double-precision FLOPS,” explained Paul Hylander, chief architect at Eliyan. “In the last 20 years, HPC has been riding on the coattails of server-based computing because HPC doesn’t have high enough volumes to justify dedicated networking, processing, and memory development on its own. Now, with the massive expenditures going toward AI computing, there is a renewed emphasis on higher-bandwidth memories, higher-bandwidth networking, and better thermal solutions — as well as, importantly, chiplet solutions to enable scaling the amount of compute per node.”
Today, supercomputers fall into two basic categories. “There are supercomputers that are purely processor-based, and those that have accelerators — often GPUs and the like,” said Ashley Stevens, director of product management at Arteris. “There are classes of problems that have code that goes back years. Some of it goes back even to the 1960s in scientific areas like nuclear modeling and so forth, and that can only run on general-purpose computers. But then there’s a class of problems that are newer and can be re-coded to run on accelerator systems, as well. So at the moment, the highest performing systems and the most energy-efficient systems will have accelerators — often GPUs.”
Specifically, what makes a supercomputer a supercomputer is that it contains a node with coherent interconnect, as well as a node-to-node interconnect, so the nodes can communicate with each other. “Typically, Message Passing Interface (MPI) is used,” Stevens said. “So there are ways of splitting up problems to run on multiple nodes with an MPI between the two, or sometimes remote DMA (rDMA), where one computer can DMA data to another. That’s what defines a supercomputer. They have inter-system communication.”
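To make that concrete, the sketch below is a minimal mpi4py example of the node-to-node message passing Stevens describes. It assumes a two-rank job, and the buffer size and tag are arbitrary illustrative choices rather than a reflection of any particular system.

```python
# Minimal mpi4py sketch of node-to-node message passing (MPI).
# Run with, e.g.: mpirun -n 2 python mpi_sketch.py
# Buffer size and tag are arbitrary illustrative choices.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

buf = np.empty(1024, dtype=np.float64)   # message buffer used by both ranks

if rank == 0:
    buf[:] = np.arange(1024.0)           # rank 0 finishes its part of the problem...
    comm.Send(buf, dest=1, tag=0)        # ...and ships the data to rank 1
elif rank == 1:
    comm.Recv(buf, source=0, tag=0)      # rank 1 receives and carries on
    print("rank 1 received data, sum =", buf.sum())
```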
Hybrid strategies
AI has had a profound impact on both supercomputing and HPC. The integration of CPUs and GPUs in heterogeneous environments has evolved significantly over the past five years. GPUs, once primarily used for gaming and Bitcoin mining, are now essential for accelerating computing tasks in AI. And what makes GPUs so popular is their ability to scale.
“Everything boils down to the number of cores you put in a system,” said Shivi Arora, director for ASIC IP Solutions at Alphawave Semi. “It depends if you’re going toward HPC data center, or you’re looking at a DPU/CPU kind of market. HPC and supercomputer are both aligning in the same direction. It’s the number of CPUs you can put on a system that defines what market you’re going to support.”
This kind of mix-and-match granularity has opened the door to hybrid systems, combining classical computing, supercomputing, and even quantum computing to meet the performance, reliability, and security needs of various applications.
“There’s the evolution of supercomputing, in general,” said Simon Rance, general manager and business unit leader for process and data management at Keysight Technologies. “But then you have quantum, as well, and quantum is starting to really gain momentum. In the high mathematical type of compute applications — things that have to do computations at a very rapid, aggressive rate — we’re seeing more of the supercomputing going into quantum. That’s the realm where quantum computing is really strong now. When it’s processing information from various sources, whether for AI, for example, to try and make sense of what it’s trying to process in real-time, that’s where we’re seeing that natural evolution of supercomputing.”
This exacerbates some familiar challenges, though. “When we look at what supercomputers could do five years ago versus today, the advancements have been astonishing,” said C.T. Rusert, worldwide leader for high performance computing at IBM. “We have supercomputers that can model calculations at unprecedented speeds at exascale levels, whereas five years ago we couldn’t do that. With that has come challenges. As we become a society that’s more conscious of our energy and efficiency, with these supercomputers, how do we make them more energy efficient?”
Those challenges now cross into both computing domains, with AI creating an insatiable demand for more horsepower to train multimodal models and solve massive and complex compute problems. “This concept of an AI factory, of consuming and producing tokens, is a tremendously compute-intensive study,” said Rob Knoth, group director for strategy and new ventures at Cadence. “This is spreading like wildfire, driving a change throughout all of our ecosystem in terms of what people consider a supercomputer, to what is considered an acceptable amount of compute in consumer devices, to the size of computation you’re going to have in your car, to what is going to be in a humanoid robot or a drone. There is a huge amount of compute that’s required, and each one is requiring different things in terms of the power computation, the thermal envelope it’s going to fit in, the connection to the power grid. How long can it be walking around or flying around without needing to get recharged? It’s fascinating, beautiful, terrifying, and inspiring how this word ‘supercomputer’ is being changed because of AI, and how it changes AI. The size of the supercomputers is what’s allowing us to make these new frontier models, to make these multimodal models, to be able to start talking about physical AI, talking about the ramifications of what it takes to create a serviceable humanoid robot, and how that’s different from a chip going in a car, or a chip that’s going to go in a new data center.”
Key enablers
At the epicenter of this convergence are technological advancements such as high-bandwidth memory, high-bandwidth communication within and between different dies, and chiplet-based solutions that can massively scale. All of these are essential for meeting the demands of AI, which requires substantial computational power for training multimodal models and performing inference tasks.
“At the annual Supercomputing Conference, what has been going on for the last five to seven years is the topic of convergence,” said Steven Woo, fellow and distinguished inventor at Rambus. “At the highest level, if you look at the top supercomputers in the Top 500 list, it is dominated by machines that have not only traditional CPUs, like from Intel or AMD, but also have lots of graphics or AI engines from NVIDIA or AMD. If you look at these specialized AI clusters, at a high level, it’s not really different. As for the ratio of how many AI engines there are to traditional CPUs, that will change depending on the makeup of the supercomputer or the AI cluster. But if you’re talking about the 30,000-foot view, they’re very similar. Then you start realizing a lot of these benchmarks that people run in the supercomputing world can run pretty well on these AI super clusters, and vice versa, so it is causing a lot more discussion about this convergence. Does there need to be a separate class of machines that only serve the supercomputing market? And, at the same time, does AI become so fundamental that these two are merging together?”
This convergence also presents challenges. Energy efficiency and sustainability are major concerns because supercomputers consume huge amounts of power. Cooling systems and advanced packaging techniques are necessary to manage the thermal envelope and ensure efficient power delivery. Additionally, data movement has become more expensive than computation, necessitating new approaches to minimize data transfer and improve overall system efficiency.
Many of the technology drivers in AI find their way to supercomputing, and vice versa. “If you look at the supercomputing programs, those are mostly driven by nations,” Woo said. “The U.S. has programs that are run roughly every 10 years. About every five years or so, there’s a new supercomputer that comes out. So, five years is spent studying and thinking about prototypes and things, and five years is spent on execution to build the machine. The three biggest supercomputing programs include one sponsored by the U.S., Japan has always sponsored a very big one, and then China has its own program. The last time the U.S. did one, it was called the Exascale program. The U.S. traditionally has said the next bar is going to be 1,000 times more performance than the previous machine, and it was called Exascale. Also, the U.S. government works with industry, and provides a lot of investment money for both academia and industry to develop new technologies, and then those technologies find their way into the supercomputers. And they find their way into commercial products, as well.”
AI is helping to narrow the performance gap between supercomputers and HPC, as well. “NVIDIA’s Grace Blackwell came out last year, and Rubin is coming out this year, so you see this one-year progression with these amazing increases in performance. Both are incredibly important technology drivers, but AI seems to be on a faster development cycle right now. The goals between machines aren’t necessarily as lofty as in the supercomputing program, which is to be 1,000 times better. It’s hard to do that on a year-to-year basis in AI, but they do make tremendous strides every generation.”
The challenge of data movement
An additional pressure when it comes to supercomputing advances is data movement. “It’s been well understood for more than a decade that data movement was a big problem. The Exascale program had done a number of studies, and there were some great presentations about, if you just follow the technology development curves, you see that data movement is becoming more expensive than computation,” Woo said. “At the time there were projections, and very well thought-out and very well-articulated studies, which concluded this is going to be the problem. There are a couple of ways to deal with that. You either bring components closer together, or you find ways to build what people now call superchips.”
In the past, the limiting factor was the reticle. “The chip size could only be so big. But now they’re finding ways to exceed that by stitching together multiple reticle-size die, and they’re now right next to each other, so if you look at it from five feet away it looks like one big chip, and they’re connected on a substrate,” Woo explained. “All that’s been enabled by advanced packaging and things that the industry has been working on, based on things like HBM. There’s a virtuous interaction that’s going on in AI, HPC, and supercomputing, where the physics doesn’t change, the problems are big, and they have their own little differences, but data movement has been shown to be one of the biggest problems. You can logically say, ‘Let’s not move the data very far,’ but then that introduces other challenges the industry has to solve, like thermal. How do you deal with thermal? We know liquid cooling is destined to become mainstream in the next couple of years. The other challenge is power delivery. How do I get all the watts, current, and voltage into this relatively small area? We’re not used to doing that. It’s not that we can’t do it. It’s more like finding the economical ways. And how do you do it in a very manufacturable way?”
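A rough roofline-style estimate illustrates why data movement, rather than raw FLOPS, sets the ceiling for many kernels. The peak compute and bandwidth figures in the sketch below are placeholder assumptions, not measurements of any real accelerator.

```python
# Roofline-style back-of-envelope sketch of the data-movement problem.
# PEAK_FLOPS and PEAK_BW are placeholder assumptions, not real hardware specs.
PEAK_FLOPS = 50e12        # assumed peak compute, FLOP/s
PEAK_BW    = 2e12         # assumed memory bandwidth, bytes/s

def roofline_time(flops, bytes_moved):
    # A kernel can go no faster than the slower of its compute time
    # and its data-movement time.
    return max(flops / PEAK_FLOPS, bytes_moved / PEAK_BW)

n = 8192
# Dense FP64 matmul: 2*n^3 FLOPs over ~3*n^2*8 bytes -> high arithmetic intensity
t_matmul = roofline_time(2 * n**3, 3 * n * n * 8)
# FP64 vector add over n^2 elements: 1 FLOP per 24 bytes -> very low intensity
t_vadd = roofline_time(n * n, 3 * n * n * 8)

print(f"matmul lower bound:     {t_matmul:.4f} s  (compute-limited at these assumptions)")
print(f"vector-add lower bound: {t_vadd:.6f} s  (data-movement-limited at these assumptions)")
```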
All of this raises some complex partitioning challenges, as well, because distance affects time to results. “We’ve got so much processing compute power now, but we’ve got the latency problem between the processors, and between the processing and displaying or giving the results back in the real time that we demand,” said Keysight’s Rance. “That’s part of where we’ve evolved from supercomputing. It wasn’t just a supercomputer computing something. It’s the sharing of the information and bringing it back to then make a decision within a millisecond.”
Accuracy now an issue
AI adds another issue. Unlike traditional computation, AI is probabilistic. Results are based on distributions, which are not always completely accurate. This is not acceptable in supercomputing.
“It needs a different amount of accuracy,” Arteris’ Stevens said. “In scientific computing, typically you use double-precision 64-bit, occasionally 32-bit. But these AI things may be using only 8 or 16 bits. OpenAI is obviously AI rather than traditional supercomputer type applications, but there are requirements to run code that goes back for years. A lot of this lately is AI training. The things I was involved with in the past were more about trying to run old Fortran code from the ’60s at good performance. Today, the most efficient machines are ones with accelerators, because generally speaking, the more specialized the hardware is, the more efficient it is. The more general-purpose it is, the less efficient it is. GPUs are only suited to certain things. If some code is written in Fortran, it wouldn’t be easy to get it done. And even if they do, although they support IEEE floating point, they don’t necessarily support all of the different modes and corner cases that a normal computer would. So they’re okay for certain classes of problems, but not necessarily all classes. What we see now is probably more and more specialization, especially in AI. You’ve seen that already, with people focused more on one particular problem rather than more general-purpose computing. That makes it more efficient.”
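A small numpy sketch makes the precision gap concrete. It uses arbitrary data and a deliberately naive 16-bit accumulation, so it illustrates the effect Stevens describes rather than modeling any real AI or HPC workload.

```python
# Sketch of the precision gap: the same reduction computed in float64
# and accumulated naively in float16 diverges noticeably.
# Data and vector length are arbitrary choices for illustration.
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(10_000)                       # values in [0, 1)

sum64 = x.sum(dtype=np.float64)              # double precision, as in most scientific codes

acc16 = np.float16(0.0)
for v in x.astype(np.float16):
    acc16 = np.float16(acc16 + v)            # every partial sum is rounded to 16 bits

rel_err = abs(float(acc16) - sum64) / sum64
print(f"float64 sum: {sum64:.3f}")
print(f"float16 sum: {float(acc16):.3f}  (relative error ~{rel_err:.1%})")
```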
More than technology
Beyond the technical aspects, the term “supercomputer” holds significant cultural and inspirational value. It represents the bleeding edge of technology and serves as a beacon for the next generation of engineers and scientists.
“A supercomputer isn’t just about engineering,” said Cadence’s Knoth. “At the Supercomputing Conference, there are lots of people who are going to tell you the exact scientific definition of ‘supercomputer,’ but I say it doesn’t matter. The word ‘supercomputer’ is a more important word for science communication than it is for science. It carries power with it, because it’s changed over time. There are pictures of ENIAC in a room, and then people pull their phone out of their pocket and say, ‘I’ve got that here.’ So, to me, the word supercomputer is more important in a cultural and inspirational context than it is in a technical context. Supercomputers are what help inspire the next generation of engineers. They are a term that helps democratize what we do to help other people understand what’s going on in engineering. Supercomputers shine a light on what is the bleeding edge. Where are we going? Why are we going? What are the really cool problems we’re solving? Rather than a lot of the stuff that’s right in front of you, they’re trailblazers.”
The role of energy efficiency and sustainability
As supercomputing and HPC systems continue to evolve, energy efficiency and sustainability have become critical considerations. The computational power of these systems comes at the cost of enormous amounts of energy.
To address these concerns, researchers and engineers are developing new technologies and approaches to improve the energy efficiency of supercomputing and HPC systems. This includes the use of advanced cooling systems to manage the thermal envelope and reduce energy consumption. Additionally, efforts are being made to optimize the design and architecture of these systems to minimize power consumption and improve overall efficiency.
Many see the biggest challenge for HPC and supercomputing to be energy and power consumption. “Taking the worst-case example, the Stargate system that was announced by Microsoft, OpenAI, and Softbank is going to need 5 gigawatts,” Arteris’ Stevens said. “That’s bigger than any nuclear power station in the U.K. or the U.S., although there are some that big in the world. In other countries, typical nuclear reactors are about 1 or 1.5 gigawatts, so Stargate will need three of them. A nuclear power station takes at least 10 years to build. By that time are they still building the same thing? Things move pretty quickly in our industry, so you can imagine building a power station for it. What you targeted may not be what you end up doing by the end of 10 years. One of the biggest challenges is power consumption. The top supercomputers right now take about 30 megawatts of power, some of them even more. I was in a study for the Fugaku supercomputer almost 15 years ago. At that time, the assumption was the limit was 10 megawatts. But now we have systems taking three times that at 30 megawatts, and they have a plan to build a gigawatt type power plant. So, power efficiency is going to be really important. The limit to computing performance is actually energy consumption, and that’s something that hasn’t been really considered.”
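The scale gap is easy to put in numbers. The back-of-envelope sketch below reuses only the figures quoted above (roughly 5 gigawatts for Stargate, 1 to 1.5 gigawatts per reactor, and about 30 megawatts for a top supercomputer); the annual-energy figure is a simple unit conversion, not an independent estimate.

```python
# Back-of-envelope arithmetic using only the figures quoted above.
STARGATE_GW    = 5.0       # quoted Stargate power demand
REACTOR_GW     = 1.5       # upper end of the typical reactor figure quoted
TOP_SUPER_MW   = 30.0      # quoted power draw of a top supercomputer today
HOURS_PER_YEAR = 8760

reactors_needed   = STARGATE_GW / REACTOR_GW
stargate_vs_super = (STARGATE_GW * 1000) / TOP_SUPER_MW

print(f"Stargate ~= {reactors_needed:.1f} large reactors")
print(f"Stargate ~= {stargate_vs_super:.0f}x a 30 MW supercomputer")
print(f"A 30 MW system uses ~{TOP_SUPER_MW * HOURS_PER_YEAR / 1000:.0f} GWh per year")
```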
Putting the pieces together differently
Supercomputers paved the way for heterogeneous integration on a massive scale. The chiplet concept brings that approach down to the package level.
“We’re putting all these different things inside of a package now,” said Sue Hung Fung, principal product line manager for chiplets at Alphawave Semi. “That’s just one big monolithic die that’s disaggregated. Then we’re putting all that into a package, which is a system-in-package, and we’re building these things out for AI/ML because we’re seeing this huge driver of lots more data in the data center, and doing the LLM trainings for AI and doing inference. Depending on what type of cores we put inside of the compute, that’s the kind of performance we can get out of it. That would be specific for that application use case, depending on the type of core, depending on how many cores you use.”
Is that a supercomputer, or is that a high-performance computer? Or is it something in-between? The answer isn’t always obvious, and it’s becoming less so as the amount of computing in a given amount of time continues to accelerate.
Related Reading
Power Delivery Challenged By Data Center Architectures
More powerful servers are required to crunch more data, but getting power to those servers is becoming complicated.
HBM Options Increase As AI Demand Soars
But manufacturing reliable 3D DRAM stacks with good yield is complex and costly.
Memory Wall Problem Grows With LLMs
Hardware advances have addressed some of the problems. The next step may hinge on the algorithms.