NoC Development – Make Or Buy?

Significant effort is involved in developing coherent NoCs from scratch.

In the selection and qualification process for semiconductor IP, design teams often consider the cost of in-house development. Network-on-Chip (NoC) IP is no different. In “When Does My SoC Design Need A NoC?” Michael Frank and I argued that most of today’s designs – even less complex ones – can benefit from NoCs. In the blog “Balancing Memory And Coherence: Navigating Modern Chip Architectures,” I discussed the complexity that coherency adds to on-chip interconnect. After I described some of the steps of NoC development based on what ChatGPT 3.5 recommended in “Shortening Network-On-Chip Development Schedules Using Physical Awareness,” it’s time to look in more detail at the effort that design teams would have to invest to develop coherent NoCs from scratch.

ChatGPT, here we go again!

The prompt “Tell me how to develop an optimized network-on-chip for semiconductor design, considering the aspects of cache coherency” gives an excellent starting point in ChatGPT 4.0.

Understanding Protocols: First, one needs to understand cache coherency protocols. The recommendation is to study existing protocols before selecting one. Specifically, understand existing cache coherency protocols like MESI (Modified, Exclusive, Shared, Invalid), MOESI (Modified, Owned, Exclusive, Shared, Invalid), and directory-based protocols. Analyze their strengths and weaknesses in terms of scalability, latency, and bandwidth requirements. Then, choose a protocol that aligns with your performance goals and the scale of your NoC. Directory-based protocols are often preferred for larger-scale systems due to their scalability.

ChatGPT’s recommendation for the first step is a good start. I previously discussed the complexity of specific protocols like AMBA AXI, APB, ACE, CHI, OCP, CXL, and TileLink in “Design Complexity In The Golden Age Of Semiconductors.” One must read several thousand pages of documentation to understand the options here. And – by the way – these are orthogonal to the MESI/MOESI commentary from ChatGPT above, as these are implementation choices. In a practical scenario, many of these aspects depend on the building blocks the design team wants to license, like processors from the Arm, RISC-V, Arc, Tensilica, CEVA, and other ecosystems, as well as the protocol support in design IP blocks (think PCIe, UCIe, LPDDR) and accelerators for AI/ML.
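To make the MESI states mentioned above concrete, here is a minimal Python sketch of how one cache line moves between states as different caches read and write it. It is an illustration under simplifying assumptions (a single line, no bus or network timing, no write-back modeling); the function names `local_read` and `local_write` are hypothetical and not taken from any specification.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


def local_read(states, cache_id):
    """Apply a read by `cache_id`; `states` maps cache id -> State for one line."""
    if states[cache_id] is not State.INVALID:
        return  # read hit: no state change
    holders = [c for c in states
               if c != cache_id and states[c] is not State.INVALID]
    for c in holders:
        # A Modified holder would write back first; M and E holders drop to Shared.
        states[c] = State.SHARED
    # No other valid copy: the reader gets the line Exclusive, otherwise Shared.
    states[cache_id] = State.SHARED if holders else State.EXCLUSIVE


def local_write(states, cache_id):
    """Apply a write by `cache_id`: every other copy is invalidated."""
    for c in states:
        if c != cache_id:
            states[c] = State.INVALID
    states[cache_id] = State.MODIFIED
```

Every transition in this toy model that touches another cache corresponds to at least one message a coherent NoC has to carry, which is why the protocol choice and the interconnect design cannot be separated.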

NoC Architecture Design: Second, ChatGPT recommends focusing on NoC architecture design. Decide on the NoC topology (e.g., mesh, torus, tree, or custom designs) based on the expected traffic pattern and the scalability requirements. Each topology has its specific advantages, as my colleague Andy Nightingale recently explained here. Furthermore, teams must design efficient routers to handle the chosen cache coherency protocol with minimal latency, implementing features like virtual channels to avoid deadlock and increase throughput. The final part of this step involves optimizing the network for bandwidth and latency by tuning the buffer sizes, employing efficient routing algorithms, and optimizing link widths and speeds.
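As a small illustration of how topology and routing interact, the following Python sketch implements dimension-order (XY) routing on a 2D mesh and compares hop counts between a mesh and a torus. The grid coordinates and function names are assumptions made for this example; real routers also have to deal with buffering, arbitration, and virtual channels, none of which is modeled here.

```python
def xy_route(src, dst):
    """Dimension-order (XY) routing on a 2D mesh: travel along X first, then Y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


def mesh_hops(src, dst):
    """Minimal hop count on a mesh (Manhattan distance)."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def torus_hops(src, dst, width, height):
    """Minimal hop count on a torus, where each dimension wraps around."""
    dx = abs(src[0] - dst[0])
    dy = abs(src[1] - dst[1])
    return min(dx, width - dx) + min(dy, height - dy)
```

For a 4x4 grid, `mesh_hops((0, 0), (3, 3))` is 6, while `torus_hops((0, 0), (3, 3), 4, 4)` is 2 because of the wraparound links. The shorter paths come at the price of longer physical wires, which is exactly the kind of trade-off this step has to weigh.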

Cache Coherency Mechanism Integration: Next up, ChatGPT recommends integrating the actual mechanisms of cache coherency. Integrating the cache coherency mechanism with the NoC involves efficient propagation of coherency messages (e.g., invalidate, update) across the network with minimal latency. Designing an efficient directory structure for directory-based protocols, one that scales with the network and minimizes coherency traffic, requires careful consideration of the placement of directories and the granularity of coherence (e.g., block-level vs. cache-line level).

By the way, for my query, it leaves out the option to handle coherency fully in software.
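For the hardware directory case, the bookkeeping behind those coherency messages can be sketched in a few lines of Python. This is a toy model under stated assumptions: one central directory, a full sharer set per line, and hypothetical message names (`"invalidate"`, `"downgrade"`) that do not correspond to any particular protocol specification.

```python
class Directory:
    """Toy directory: tracks which caches hold each line and who owns it."""

    def __init__(self):
        self.sharers = {}  # line address -> set of cache ids with a valid copy
        self.owner = {}    # line address -> cache id holding the line for writing

    def read(self, cache_id, addr):
        """Register a reader; return the messages the NoC must deliver."""
        messages = []
        owner = self.owner.get(addr)
        if owner is not None and owner != cache_id:
            # The current writer must give up exclusive ownership first.
            messages.append(("downgrade", owner))
            del self.owner[addr]
        self.sharers.setdefault(addr, set()).add(cache_id)
        return messages

    def write(self, cache_id, addr):
        """Grant write ownership; every other sharer receives an invalidate."""
        holders = self.sharers.get(addr, set())
        messages = [("invalidate", c) for c in sorted(holders) if c != cache_id]
        self.sharers[addr] = {cache_id}
        self.owner[addr] = cache_id
        return messages
```

The number of messages returned by `write` grows with the number of sharers, which is why directory placement and coherence granularity have such a direct effect on network traffic.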

Simulation and Analysis: At this point, ChatGPT correctly recommends using simulation tools to model your NoC design and evaluate its performance under various workloads. Tools like Gem5, NS-3, or custom simulators can be helpful. I would add SystemC models to the arsenal of tools design teams working on this from scratch could use. Teams need to analyze key performance metrics such as latency, throughput, and energy consumption and pay special attention to the performance of the cache coherency mechanisms.

The last bit is indeed critical because, for coherent interconnects, the cost of a cache miss is drastically different from that of a cache hit.
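The textbook average memory access time formula (hit time plus miss rate times miss penalty) shows why. The Python sketch below uses purely illustrative cycle counts that are assumptions for this example, not measurements from any design.

```python
def average_access_time(hit_cycles, miss_rate, miss_penalty_cycles):
    """Average memory access time: hit time + miss rate * miss penalty."""
    return hit_cycles + miss_rate * miss_penalty_cycles


# Hypothetical numbers: a 4-cycle hit and a 200-cycle miss that has to
# cross the interconnect to reach another cache or main memory.
low_miss = average_access_time(4, 0.05, 200)
high_miss = average_access_time(4, 0.10, 200)
```

With these assumed numbers, doubling the miss rate from 5% to 10% raises the average from roughly 14 to roughly 24 cycles, so the miss path through the NoC dominates even though most accesses hit.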

Optimization and Scaling: This recommendation includes implementing adaptive routing algorithms and dynamic power management techniques to optimize performance and energy efficiency under varying workloads and ensuring the design can scale by adding more cores. This might involve modular design principles or hierarchical NoC structures.
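A minimal Python sketch of one adaptive routing idea: among the next hops that move a packet closer to its destination on a 2D mesh, pick the least congested one. The `occupancy` map (router coordinate to buffer fill level) and the function name are hypothetical, and the sketch ignores deadlock avoidance, which a real adaptive router needs to address, for example with virtual channels or turn restrictions.

```python
def adaptive_next_hop(cur, dst, occupancy):
    """Choose the least-occupied neighbor among minimal-path candidates."""
    candidates = []
    if dst[0] != cur[0]:
        step = 1 if dst[0] > cur[0] else -1
        candidates.append((cur[0] + step, cur[1]))  # productive hop in X
    if dst[1] != cur[1]:
        step = 1 if dst[1] > cur[1] else -1
        candidates.append((cur[0], cur[1] + step))  # productive hop in Y
    if not candidates:
        return cur  # already at the destination
    # Ties resolve to the first candidate, i.e. the X direction.
    return min(candidates, key=lambda node: occupancy.get(node, 0))
```

For instance, `adaptive_next_hop((0, 0), (2, 2), {(1, 0): 5, (0, 1): 1})` returns `(0, 1)`, steering the packet away from the busier router at `(1, 0)`.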

Correct. But, in all practicality, by this point in the project a lot of time would have passed without a single line of RTL being written. Management will ask, “What’s up here?” So, in reality, some RTL coding has already happened by now. Iterations happen fast. Engineers will quickly blame marketing for iterative feature change requests like adding or removing interfaces or changing user bits, Quality of Service (QoS) requirements, address maps, safety needs, buffering, probes, interrupts, modules, etc. All of these can cause significant changes to the RTL. And often, teams have not yet even considered that the floorplan can cause further issues because of interface locations, blockages, and fences.

Prototyping and Testing: Next, the recommendation is to use FPGA-based prototyping to validate your NoC design in hardware and to test the NoC in the context of the entire system, including the processor cores, memory hierarchy, and peripherals, to identify any issues with the cache coherency mechanism or other components.

True. Emulation and FPGA-based prototyping have become standard steps for most complex designs today. Cache coherency in particular, exercised in the context of the overall system and its software, requires very long test sequences.

Iterative Design and Feedback: The last recommendation is to use feedback from the simulation, prototyping, and testing phases to refine the NoC design iteratively and benchmark the final design using standard benchmark suites relevant to your target application domain to ensure that it meets the desired performance and efficiency goals.

The cost of “make”

Short of hiring a team of architects with relevant NoC development experience, the first five steps of Understanding Protocols, NoC Architecture Design, Cache Coherency Mechanism Integration, Simulation and Analysis, and Optimization and Scaling will take significant learning time, and writing the implementation specs is far from trivial.

Then, teams will spend most of the effort on RTL development and verification. Just imagine writing RTL protocol adapters for AMBA CHI-E, CHI-B, ACE, ACE-Lite, and AXI – connecting tens of IP blocks coherently – to address coherent and IO-coherent use models. Even if you can reuse VIP from EDA vendors to check protocol correctness, the effort is significant just for unit verification, as you will run thousands of tests.

For the actual interconnect, whether you use a heterogeneous, ring, or mesh topology, the development effort is significant. The logic that deals with directories to enable cache coherency can be complicated. And any change request requires, of course, re-coding!

Finally, when integrating everything in the system context, validating the integration, including bring-up in emulation and the associated debug, consumes another chunk of effort.

Our customers tell us that when it is all said and done, they would easily spend over 50 person-years just on coherent NoC development for complex designs.

Network-on-chip automation for productivity and configurability.

Automation potential: What to expect from coherent NoC IP

There is a lot of automation potential in the seven steps above!

  • The various relevant protocols can be captured in a library of protocol converters, reducing the need to internalize and implement all the protocols the reused IP blocks speak. Ideally, they would already be pre-validated with popular IP blocks from leading IP vendors – think providers of Arm and RISC-V ISAs and vendors for interface blocks like LPDDR, PCIe, UCIe, etc., or graphics and AI/ML accelerators.
  • Graphical user interfaces and scripting APIs increase productivity in developing NoC architectures and topologies.
  • Like protocol converters, reusable blocks for directory and cache coherency management can increase development productivity. Their verification is especially critical, so ideally, your vendor has pre-verified them using VIP from EDA vendors and pre-validated the system integration with the ecosystem (think processor providers).
  • The refinement loop is probably the most critical one to optimize. Refinement can be deadly in manual scenarios. Besides reusable building blocks, you should look for configuration tools that automatically create performance models for architectural analysis, export new RTL configurations, and connect directly to digital implementation flows.

The verdict: Make or buy?

The illustration above summarizes some of the automation potential for NoCs. Saving more than 50 person-years is an attractive option compared to developing NoC IP from scratch. Check out what Arteris does in this domain with its Ncore Cache Coherent Interconnect IP and FlexNoC 5 Interconnect IP for non-coherent designs.


