AI Chips: NoC Interconnect IP Solves Three Design Challenges

Regular topologies, large chips, and huge bandwidths are key considerations for AI-centric chips in the data center.

New network-on-chip (NoC) interconnect IP is now available for artificial intelligence (AI) systems-on-chip (SoCs). Arteris IP has launched the fourth generation of its FlexNoC interconnect IP with a new optional AI package. The new NoC interconnect technology solves many data flow problems in today’s AI designs, and its features address the requirements of the next generation of AI chips that hardware-accelerate neural network and machine learning processing. The interconnect is a crucial technology for AI chips, which often comprise tens or hundreds of parallel processors, because optimizing data flow between processing elements and memories is key to system-level efficiency.

The new NoC technology benefits emerging AI chip architectures in three main ways: automatically generating regular topologies, effectively managing the data flows of large chips with long wires, and enabling large on- and off-chip bandwidths.

1. Streamlining data flow
For chips that reside in the data center, AI SoC designers often prefer regular topologies, such as rings, meshes, or tori, because they frequently implement multiple instances of the same type of hardware accelerator in a grid or ring arrangement. This technique is called homogeneous parallel processing. Using multiple copies of the same hardware accelerator in a defined grid or ring helps ensure predictable data flow, reduces hardware accelerator R&D cost, and can help guarantee design scalability over time.
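To make the idea of a regular topology concrete, here is a minimal Python sketch (illustrative only, not an Arteris tool or API) that builds the routers and neighbor links of a small 2D mesh, the kind of structure a topology generator produces for a homogeneous array of accelerators:

```python
from dataclasses import dataclass, field

@dataclass
class Router:
    """One switch node in the mesh; an accelerator tile attaches at (x, y)."""
    x: int
    y: int
    links: list = field(default_factory=list)  # coordinates of neighboring routers

def mesh(cols, rows):
    """Build a cols x rows 2D mesh: each router links to its N/S/E/W neighbors."""
    grid = {(x, y): Router(x, y) for x in range(cols) for y in range(rows)}
    for (x, y), router in grid.items():
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if (nx, ny) in grid:
                router.links.append((nx, ny))
    return grid

# A 4x4 grid of identical accelerator tiles, one router per tile.
topology = mesh(4, 4)
print(len(topology), "routers,",
      sum(len(r.links) for r in topology.values()) // 2, "links")
# -> 16 routers, 24 links
```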

The FlexNoC 4 AI package enables designers to generate these topologies automatically and facilitates editing of the generated topologies. Additionally, the optional physically aware topology engine can display on-chip interconnect elements on top of the chip floorplan, which lets SoC designers see what the automation has created in context with the other IP blocks on the floorplan and gives them control over editing the generated topologies.

In addition to editing generated topologies, engineers can edit individual routers at each node of the topology either one at a time or all together at topology generation time.

2. Overcoming the large chip hurdle
Large AI chips stretch wires, clock trees, and power domains across significant die area, which creates timing closure and routing congestion problems. To address these challenges, the AI package offers two new technologies: source synchronous communications and VC-Link virtual channels.

When a power domain is stretched across large areas of the chip, clock skew becomes a problem because a branch of the clock tree must span long distances. One way to address this issue is to make communications along these long links asynchronous, but that causes a large area increase due to buffering. A better approach is source synchronous communication, in which the clock signal is carried across the long distance in parallel with the data. The clock signal is retransmitted at each pipeline stage along the data path, with a single asynchronous crossing once the path enters the final clock domain. This approach consumes far less die area than a fully asynchronous solution.


Figure 1: Tools such as source synchronous communications and virtual channels avoid timing closure issues by managing long cross-chip paths effectively.
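As a rough behavioral model of the source synchronous idea (not RTL and not the Arteris implementation; the names here are assumptions for illustration), the sketch below forwards a clock edge together with each data word through a chain of pipeline stages, with a single asynchronous FIFO crossing at the end of the path:

```python
from collections import deque

class SourceSyncStage:
    """A pipeline register on the long path: it captures (clock_edge, data)
    from the previous stage and forwards both together, so data is always
    sampled by the clock that traveled with it rather than by a far-away
    branch of the global clock tree."""
    def __init__(self):
        self.reg = None
    def tick(self, incoming):            # incoming = (edge_id, data) or None
        outgoing, self.reg = self.reg, incoming
        return outgoing

class AsyncFifo:
    """The single clock-domain crossing at the end of the path; the
    receiver drains it on its own clock."""
    def __init__(self):
        self.q = deque()
    def push(self, item):
        if item is not None:
            self.q.append(item)
    def pop(self):
        return self.q.popleft() if self.q else None

# Three stages spanning the long route, one asynchronous crossing at the end.
stages = [SourceSyncStage() for _ in range(3)]
fifo = AsyncFifo()
for edge, word in enumerate([0xA, 0xB, 0xC, None, None, None]):
    item = (edge, word) if word is not None else None
    for stage in stages:                 # the clock edge travels with the data
        item = stage.tick(item)
    fifo.push(item)                      # only this final hop is asynchronous
print([fifo.pop() for _ in range(3)])    # -> [(0, 10), (1, 11), (2, 12)]
```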

To address the issue of wire routing congestion within obstructed areas of the floorplan, virtual channels allow a smaller set of wires to be shared by multiple channels of communication while maintaining quality of service (QoS) and non-blocking communications. The technology is designed so that implementers can apply it only where necessary, saving the die area consumed by the buffers that any virtual channel implementation requires.
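A minimal sketch of the virtual channel concept, using assumed names rather than the actual VC-Link implementation: several logical channels time-share one set of physical wires, each with its own buffering and credits, so a stalled channel cannot block the others:

```python
from collections import deque

class VirtualChannelLink:
    """One physical link shared by several virtual channels (VCs). Each VC
    has its own queue and credit count; arbitration is round-robin over
    VCs that both have a flit waiting and have credit at the receiver."""
    def __init__(self, num_vcs, credits_per_vc):
        self.queues = [deque() for _ in range(num_vcs)]
        self.credits = [credits_per_vc] * num_vcs
        self.rr = 0
    def send(self, vc, flit):
        self.queues[vc].append(flit)
    def return_credit(self, vc):
        # The receiver freed a buffer slot for this VC.
        self.credits[vc] += 1
    def transfer(self):
        """Move at most one flit across the shared wires this cycle."""
        for i in range(len(self.queues)):
            vc = (self.rr + i) % len(self.queues)
            if self.queues[vc] and self.credits[vc] > 0:
                self.credits[vc] -= 1
                self.rr = vc + 1
                return vc, self.queues[vc].popleft()
        return None                      # nothing eligible this cycle

link = VirtualChannelLink(num_vcs=2, credits_per_vc=1)
link.send(0, "bulk flit 1")
link.send(0, "bulk flit 2")              # will stall: VC0's single credit is never returned
link.send(1, "latency-critical flit")
print(link.transfer())                   # -> (0, 'bulk flit 1')  consumes VC0's only credit
print(link.transfer())                   # -> (1, 'latency-critical flit')  VC0 stalls, VC1 is not blocked
```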

3. Managing the bandwidth conundrum
Bandwidth is also a challenge for AI chip designers. The interconnect resolves the challenges of on-chip data flows and off-chip memory access by providing ultra-wide data paths, intelligent multicast, and second-generation high-bandwidth memory (HBM2) multichannel memory support.

First, the new interconnect technology supports data paths up to 2,048 bits wide and uses silicon-proven data rate adaptation technology to automatically multiplex or demultiplex communications as required. In particular, 2,048-bit communication ensures that nearly any processing element can be fed with the required bandwidth, avoiding data starvation.
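The width adaptation can be pictured as serializing a wide word into narrower beats at one boundary and reassembling it at the other; the widths and helper names below are assumptions chosen purely for illustration:

```python
def demux(word: int, wide_bits: int = 2048, narrow_bits: int = 512) -> list[int]:
    """Split one wide word into narrow beats, least-significant beat first."""
    mask = (1 << narrow_bits) - 1
    return [(word >> (i * narrow_bits)) & mask for i in range(wide_bits // narrow_bits)]

def mux(beats: list[int], narrow_bits: int = 512) -> int:
    """Reassemble narrow beats back into the original wide word."""
    return sum(beat << (i * narrow_bits) for i, beat in enumerate(beats))

wide_word = (1 << 2047) | 0xDEADBEEF      # a 2,048-bit payload
beats = demux(wide_word)                  # four 512-bit beats on a narrower link
assert mux(beats) == wide_word
print(len(beats), "beats")                # -> 4 beats
```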

Second, multicast communication allows one data write to go to many targets simultaneously. This capability is important at certain times during neural network processing, such as when weights are updated or new image maps are transmitted. The key to implementing multicast efficiently is to avoid over-utilizing available network bandwidth by broadcasting data from as close as possible to the network targets. Arteris FlexNoC implements this intelligent multicast technology in a way that is highly efficient in both bandwidth and die area.
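The bandwidth argument behind intelligent multicast can be shown with a small hop-count model (the routes below are made up for illustration): repeated unicasts send a separate copy of the write along the full route to each target, while multicast sends the data once over every shared link and replicates it only where routes diverge:

```python
# Each route is the list of links a write traverses to reach one accelerator.
routes = {
    "acc0": ["A-B", "B-C", "C-D"],
    "acc1": ["A-B", "B-C", "C-E"],
    "acc2": ["A-B", "B-F"],
}

# One full copy per target vs. one traversal per unique link.
unicast_hops = sum(len(route) for route in routes.values())
multicast_hops = len({link for route in routes.values() for link in route})

print(unicast_hops, "link traversals with repeated unicasts")   # -> 8
print(multicast_hops, "link traversals with multicast")         # -> 5
```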

Finally, the new AI package integrates efficiently with HBM2 multichannel memory controllers, supporting 8- or 16-channel interleaving. Integrated reorder buffering accumulates target responses to ensure that initiators receive responses in order despite the out-of-order, interleaved communication.


Figure 2: The FlexNoC 4 AI Package integrates HBM2 memory via 8 to 16 network interface units (NIUs) to facilitate traffic aggregation and data width conversions.
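A simplified model of the reorder buffering described above, with names and structure assumed for illustration: read responses may return out of order from the interleaved channels, but they are released to the initiator strictly in the order the requests were issued:

```python
class ReorderBuffer:
    """Holds out-of-order read responses until they can be returned
    in the order the initiator issued the requests."""
    def __init__(self):
        self.issued = []        # request IDs in issue order
        self.done = {}          # request ID -> response data
    def issue(self, req_id):
        self.issued.append(req_id)
    def complete(self, req_id, data):
        """A memory channel returned data; release whatever is now in order."""
        self.done[req_id] = data
        released = []
        while self.issued and self.issued[0] in self.done:
            released.append(self.done.pop(self.issued.pop(0)))
        return released

rob = ReorderBuffer()
for req in ("rd0", "rd1", "rd2"):        # reads interleaved across HBM2 channels
    rob.issue(req)
print(rob.complete("rd1", "B"))          # -> []         (rd0 still outstanding)
print(rob.complete("rd0", "A"))          # -> ['A', 'B'] (in-order release)
print(rob.complete("rd2", "C"))          # -> ['C']
```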

New interconnect for AI hardware processing
Neural network hardware acceleration is occurring today in chips that specifically target AI applications, but this capability will ultimately become part of subsystems on many types of SoCs. As these designs become larger and more bandwidth-hungry, new interconnect technology is required to ensure data flow efficiency, which keeps these hardware accelerators running at optimal efficiency while addressing physical and timing closure issues. FlexNoC 4 and the accompanying AI package were designed to solve these issues now and in the future.


