AI Pushes High-End Mobile From SoCs To Multi-Die

Smart phone architectures look very different at the high end than in midrange and low-end devices.

Advanced packaging is becoming a key differentiator for the high end of the mobile phone market, enabling higher performance, more flexibility, and faster time to market than systems on chip.

Monolithic SoCs likely will remain the technology of choice for low-end and midrange mobile devices because of their form factor, proven record, and lower cost. But multi-die assemblies provide more flexibility, which is essential for AI inferencing and keeping up to date with rapid changes in AI models and communications standards. Ultimately, OEMs and chipmakers must decide the best way to accommodate changes in a design cycle and which market segments to target.

“SoC vendors not tied to a handset maker have to go after the IoT SoC lower-end capability with AI, and this one is for sure monolithic,” said Hezi Saar, executive director of product management for mobile, automotive and consumer IP at Synopsys, and chair of the MIPI Alliance. “If they need to go after the mid-tier for mobile, that’s higher capabilities than the IoT. It’s also likely a monolithic [SoC], with the potential option to add to it with multi-die. When you go higher end, it becomes apparent that you cannot just do monolithic. You need the ability to do a multi-die in order to accommodate the changes that will happen and the fast time to market, because that’s really where they make most of their money.”

In other words, the target market determines the architecture. “We see this big trend toward multi-die 3D, and mobile is adopting that, but at a more gradual pace than NVIDIA or AMD with their HPC chips, which have gone whole-hog on 3D and 2.5D with these gigantic 12-chip systems,” said Marc Swinnen, director of product marketing at Ansys. “Low-end mobile can’t do that. It’s largely a cost issue. They’ve had to really focus on getting as much as possible into a small form factor, into a single chip, with low power and high speed.”

Monolithic SoCs contain all the components required to operate a system on a single piece of silicon, and may include embedded microcontrollers with one or more processor cores; a memory system, such as RAM or ROM; external interfaces such as cable ports (USB, HDMI); wireless communication (WiFi, Bluetooth); a graphics processing unit (GPU); and other components, such as analog/digital converters, voltage regulators, and an internal interface bus, according to Infineon.

Despite their compact size — and often because of it — monolithic SoCs are extremely efficient, and they frequently outperform more complex systems on a per-processor basis. The distances signals need to travel are short, the power needed to drive those signals is lower, and heat can be removed with a simple heat sink. Many IoT SoC vendors have a monolithic strategy because it saves their customers packaging and integration costs.

“It’s always better to have things on a single die, although it’s difficult for us to do it,” said Ananda Roy, senior product manager, low power Edge AI at Synaptics. “It gives us a competitive advantage, because some of our IoT competitors put two dies in one package, stack them up, or put them side by side, and call it a single-chip solution. But in reality those are just two different chips in one package. We have consciously tried to move toward a single-die solution because, from our customers’ perspective, it’s a lot easier to integrate, a lot easier to design into their hardware systems. We basically build multiple technologies on a single die.”


Fig. 1: An embedded IoT SoC. Source: Synaptics

At the high end of the mobile market, it’s a different story. There, multiple chiplets are used to boost performance, and more interconnects are used to reduce resistance and capacitance. “In such cases, a compute engine is ‘mirrored’ and connected, via a high-performance horizontal die-to-die interface and advanced packaging technology, to scale the compute processing power,” said Mick Posner, senior product group director in Cadence’s Compute Solutions Group. “This technically could be expanded to scale the processing of die vertically in a 3D-IC stack, enabling far higher interconnect bandwidth.”

Multi-die assemblies also allow for greater diversity in compute elements, which can include a combination of CPUs and GPUs, as well as highly specialized accelerators. “3D stacking is not limited to the same processing units,” Posner said. “An AI or memory accelerator unit could be part of the stack, creating highly efficient, domain-specific application engines. Utilizing advanced 3.5D packaging would enable another die to be connected horizontally, as well, using a more traditional die-to-die interconnect such as UCIe. The other dies would not need to be in the same technology node as the processing nodes. Integration of various nodes enables tradeoffs between performance and cost, while selecting a node most suitable for the application function or for supply chain resilience.”

In the first two decades of the millennium, the mobile market drove much of the leading-edge technology. But with the diminishing benefits of planar scaling in the finFET era, the inability to scale SRAM, and rising demand for massive compute power in the cloud, systems companies shifted from monolithic SoCs to 2.5D systems with multiple dies connected through an interposer. While the mobile market is still at the leading edge of process scaling, the high end of the mobile market has expanded beyond that to multi-chip assemblies — although it’s not clear whether mobile devices will adopt 3D-ICs, because they require some type of advanced cooling system, which is impractical in a mobile device today.

“2.5D is very fast, very effective, ultra-short reach, so very efficient power,” said Synopsys’ Saar. “[The dies can be] fabricated on a different process. This one can be a 2nm — the base die — and the AI accelerator can be something else. They have flexibility.”


Fig. 2: Monolithic SoC vs. multi-die. Source: Synopsys

High-end mobile is pushing to the gate-all-around (GAA) 2nm manufacturing process to enable high performance, but it is expensive and has a lengthy production time. “GAA takes X months to get back from the fab,” said Saar. “You need to compress all of this, and this is the biggest challenge. You are taping out something that in the past was production-worthy. This time you know you will need to spin at least one more time, and maybe while you spin it there will be another evolution of the spec. I thought I needed 7 billion parameters. Now I need 14 billion parameters, because the use case in phones has changed. And in the future, I don’t know what that’s going to be, but they need to have that in mind when they put those features in. That’s why multi-die seems to be the right answer for the flexibility, uncertainty, and continuous evolution of specifications, and for the risk mitigation on the market side that you must take.”

Each mobile phone vendor can decide how it will implement AI depending on how many markets it wants to capture, Saar noted. “You can have an AI accelerator on-chip. It can be in a separate chip. It can be a dedicated one. It could be a couple of dedicated AI accelerators. It depends on the horsepower that you want. Let’s imagine I want to have a base die that goes into a feature phone. I’m adding an AI accelerator die, and that’s 3D connectivity between the two. And now I’m adding one more on the side for, let’s say, I/O extension, because I want to go to a multimedia market. Now I need more display capabilities. I need eDP (embedded DisplayPort). The SoC vendor can sell the base die — standalone, monolithic — to that feature phone market. They can add the accelerator. Now it’s a smart phone configuration, and they can add the other one on the side. Then it becomes a consumer device, super robotics, or PC, and they can play with all these configurations so they can go attack different markets.”
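Saar’s mix-and-match scenario can be sketched as a toy model. Everything below — the die names, process nodes, and the mapping from configuration to market tier — is an illustrative assumption, not any vendor’s actual product plan:

```python
from dataclasses import dataclass, field

@dataclass
class Die:
    name: str
    node: str  # process node, e.g. "2nm" or "7nm" (illustrative)

@dataclass
class Package:
    base: Die                                    # the standalone, monolithic base die
    stacked: list = field(default_factory=list)  # 3D-stacked dies (e.g. AI accelerator)
    lateral: list = field(default_factory=list)  # dies added on the side (e.g. I/O extension)

    def market(self):
        # Toy mapping from configuration to target segment
        if not self.stacked and not self.lateral:
            return "feature phone"
        if self.stacked and not self.lateral:
            return "smart phone"
        return "consumer device / robotics / PC"

base = Die("base SoC", "2nm")
ai = Die("AI accelerator", "5nm")
io = Die("I/O extension", "7nm")

print(Package(base).market())                              # feature phone
print(Package(base, stacked=[ai]).market())                # smart phone
print(Package(base, stacked=[ai], lateral=[io]).market())  # consumer device / robotics / PC
```

The point of the sketch is that one base-die tape-out fans out into several products by composition, rather than by respinning silicon for each tier.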


Fig. 3: A 3D-IC for the data center (or future high-end mobile) with the AI accelerator on top. Source: Synopsys

By putting the AI accelerator on the second die, the vendor can get better performance because it’s optimized while still using the same base. “Now, instead of hundreds of millions of dollars of spinning silicon again and again, it’s more stable,” Saar said.

Another reason to go multi-die is to separate analog and digital functions. For example, Synaptics’ touch controller for foldable mobile OLED displays can distinguish an intentional touch from merely holding the device, pocket dialing, water droplets, or sweat. “Our chip has an analog die and a digital die, and the analog is directly connected to the sensors, and the digital die processes all that information,” said Sam Toba, director of product marketing at Synaptics. “Within that digital die, we have an MCU core, and previously we had an in-house custom MCU core, which does have a lot of advantages. But once you get to these foldables, the amount of information that needs to be processed becomes much, much higher, and for that we decided to go with RISC-V. SiFive’s E7 is a very powerful MCU core that is good for high levels of processing, and our vector co-processor sits just outside it.”

AI/ML algorithms then can determine the environment and detect real finger touching. “Our chip connects to the touch sensor, looks at all the signals, gets the analog into the analog die, and then processes that on the digital die,” said Toba. “That digital die includes the E7, the Hydra, all the algorithms, and memory. Once the chip determines that touch is meaningful, intentional, then it would report to the host SoC.”

Memory and communications complications
Like AI, memory also is changing and can vary with different markets. If an SoC vendor goes after all markets, it has a couple of ways to do it, said Saar. “They can do monolithic. However, how would they accommodate the multiple spins of silicon? They have LPDDR6 right now, which has been defined, but it will keep on moving. UFS 5.0 is defined right now, but it will keep moving. So would they spin another 2nm silicon? Or would they limit that to something else?”

There also is a wide range of networks to consider. Mobile phone chips need to be flexible enough to support new 5G/6G protocols while continuing to support older technology. “Supporting additional bandwidths in a single system adds complexity to the data processing and means a lot of power consumption, so you have to implement it very efficiently,” said Andy Heinig, head of the efficient electronics department at Fraunhofer IIS/EAS. “Otherwise, a mobile device will run out of battery in a very short time on the one side. And you also have to remove the heat on the other side. You have these multi-physics requirements, and you need very efficient accelerators, very efficient implementation of DSPs, data processing, and so on. That’s the reason why everybody is talking more and more about application-specific processors.”

In leading-edge designs, much of this involves chiplets and heterogeneous integration. And in the analog/mixed signal world of smart phones, this can help offset some of the added costs of multi-die assemblies. This approach allows for “flexibility in picking the best process node for the IP — especially for SerDes I/O, RF, and analog IP that do not need to be on the ‘core’ process node,” according to a Cadence white paper.


Fig. 4: A decomposed SoC. Source: Cadence

Power, battery, and thermal considerations
At the high end of the mobile market, vendors are rushing to enable AI. “The iPhone 15 and 16 have AI hardware that’s been added to the onboard processing, and a lot of the smarts and hardware are being put in at the silicon level into those chips,” said Ron Squiers, solution networking specialist at Siemens Digital Industries Software. “Other companies like NVIDIA are building the GPUs. AMD is building Zen 5 [CPUs], which act as an orchestrator for the AI hardware that’s on the platform. Amazon is developing its Trainium training and inferencing chips, so the hyperscalers are doing it, as well as the mobile developers.”

While GPUs always will be needed in mobile for graphics processing, the latest versions can handle AI workloads equally well. For example, in its E-Series GPU, Imagination Technologies vastly changed how it scheduled and executed workloads in the ALU pipelines (see figure 5, below).

“It used to have a very complex and very deep pipeline with many pipeline stages, and a long pipeline delay,” said Kristof Beets, vice president of technology insights at Imagination. “We were feeding that consistently from a very big register storage, a very big SRAM that is 0.5 megabytes in size in those GPUs — so a very high amount of very closely coupled large memory. The problem is if you’re constantly fetching a lot of data from it on every cycle, then pushing it down this pipeline, and also writing out results on every cycle, that’s a lot of power consumption.”


Fig. 5: Burst processors reduce data movement within the GPU. Source: Imagination

The new design uses a much more lightweight pipeline, with only two pipeline stages, and it re-uses a lot more data locally. “Rather than constantly accessing the really big SRAM, we’re going to try and re-use data we have nearby already. That can be previous results, or data that is in the pipeline next to us, because if you look at a lot of AI cases, you’re very often shuffling and rippling data through a whole array of processing operations, taking data from neighboring pipelines.”
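The power argument can be made concrete with a back-of-the-envelope model. The per-access energy figures below are invented placeholders for illustration, not Imagination’s numbers — the only claim carried over from the text is that a large shared SRAM access costs far more than reusing a value already held locally:

```python
# Toy energy model: every operand fetched from the big shared SRAM costs far
# more energy than reusing a value already in a neighboring pipeline register.
SRAM_PJ = 10.0   # assumed picojoules per large-SRAM access (placeholder)
LOCAL_PJ = 0.5   # assumed picojoules per local register reuse (placeholder)

def pipeline_energy(ops, reuse_fraction):
    """Energy (pJ) for `ops` operand reads when `reuse_fraction` of them
    are served from nearby registers instead of the big SRAM."""
    sram_reads = ops * (1.0 - reuse_fraction)
    local_reads = ops * reuse_fraction
    return sram_reads * SRAM_PJ + local_reads * LOCAL_PJ

deep_pipeline = pipeline_energy(1_000_000, reuse_fraction=0.0)  # fetch everything
burst_style   = pipeline_energy(1_000_000, reuse_fraction=0.8)  # reuse most operands
print(burst_style / deep_pipeline)  # 0.24: most of the fetch energy is gone
```

Under these assumed numbers, serving 80% of operand reads locally cuts data-movement energy by roughly 4x, which is the shape of the saving the lightweight pipeline is after.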

The resulting frames per second per watt efficiency gains can translate into longer battery life in phones. “It may impact the operating costs, but one of the other interesting things we can do in mobile is turn that extra power saving into a higher clock frequency and higher performance, because we can stay within the same power and thermal budget,” said Beets.
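As a rough sketch of Beets’ point about converting a power saving into clock speed: assume dynamic power scales as C·V²·f, and that voltage tracks frequency roughly linearly along the DVFS curve, so power goes roughly as f³. The 40% saving used below is an assumed example, not a measured figure:

```python
# Back-of-the-envelope DVFS sketch (assumed scaling, illustrative numbers):
# dynamic power ~ C * V^2 * f, and V rises roughly linearly with f on the
# DVFS curve, so P ~ f^3 and frequency headroom goes as the cube root.
def freq_headroom(power_saving):
    """Frequency multiplier available when a design saves `power_saving`
    (as a fraction of power) and spends that budget on clock speed instead."""
    remaining = 1.0 - power_saving
    return (1.0 / remaining) ** (1.0 / 3.0)

boost = freq_headroom(0.40)  # suppose the new pipeline saves 40% power
print(f"{(boost - 1) * 100:.1f}% higher clock in the same power budget")
```

Under this assumed f³ scaling, a 40% power saving buys roughly an 18-19% clock increase at the same power and thermal budget, which is the tradeoff Beets describes.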

No matter how a designer achieves better performance, power remains a key concern. “Everybody’s interested in power these days, even the data center people, but mobile has a much longer traditional business, and they’re battery-operated, so they are much more focused on low power,” said Ansys’ Swinnen.

In addition to battery life per day, phone makers must consider battery lifespan. Every aspect of the phone has an impact, including the SIM card. To that end, Infineon developed a tiny 28nm eSIM, which requires much less energy than a traditional SIM card. eSIMs allow users to easily switch between service providers, while manufacturers gain more design flexibility because physical access is not needed.

Conclusion
Mobile phone vendors take different approaches to chip design based on which price tier they are targeting and which AI functions and communications standards they want to enable, either now or in the future.

Design decisions often come down to business reasons, noted Synopsys’ Saar. “It’s like when you ask why a specific standard is catching on versus one that is probably technically superior. There are many reasons why, and it doesn’t matter right now if it is this or that. If one vendor controls the whole vertical chain, they don’t have to use a standard off-the-shelf virtual production (VP) camera interface or whatever storage interface. They can create their own, even if it’s inferior. In their mind they’re getting whatever level of benefit, maybe in higher level integration and operational excellence.”

Meanwhile, many new entrants to the market are carving their own path in this highly competitive market segment. “They used to just do handsets. Now they’re also doing the SoC,” said Saar. “For them, it’s a different story. They can optimize it differently. They don’t have to go broad because they only care about their handset. They only care about their use case. Some of them have an AI position in the overall markets, not just mobile. We’re getting into corporate strategy or world strategy that’s definitely beyond hardware. Maybe hybrid does make sense for them, because I want the phone to connect to my AI engine on the cloud because now I have differentiation. You’re buying my phone, you’re connecting to my cloud, you’re connecting to my email. The general SoCs don’t have that. They’re selling hardware.”

Related Reading
Chip Architectures Becoming Much More Complex With Chiplets
Options for how to build systems increase, but so do integration issues.
One Chip Vs. Many Chiplets
Challenges and options vary widely depending on markets, workloads, and economics.
Using Real Workloads To Assess Thermal Impacts
A number of methods and tools can be used to determine how a device will react under thermal constraints.


