Can Machine Learning Chips Help Develop Better Tools With Machine Learning?

New AI chips require an extreme level of architectural complexity, making routing a challenge.


As we continue to be bombarded with AI- and machine learning-themed presentations at industry conferences, an ex-colleague told me that he is sick of seeing an outline of the human head with a processor in place of the brain. If you are a chip architect trying to build one of these data-centric architecture chips for machine learning or AI (as opposed to the compute-centric chips, which you probably architected in your previous life about three or four years ago), you’re dealing with a level of architectural complexity that’s in uncharted territory: defining new logic and memory interactions and verifying complex logical functionality that hasn’t been seen in traditional CPUs and GPUs before. An oversimplification like a graphic of a chip inside a human head can certainly give you the urge to pull your hair out, ironically making your side profile look like the outline of that head with an AI chip inside.

Relatively speaking, the story for back-end physical design engineers is somewhat different. Some of the challenges for these new designs (performance push, power saving, and timing closure) stay the same, but the intensity of some aspects of physical design increases sharply. These days a typical so-called processor for “machine learning” or “neural processing” has a lot of data that needs to be collected, stored, processed and passed. This involves designing large register banks that surround small chunks of logic. These designs consist primarily of large storage buffers for matrix multiplication operations, accumulators, and large numbers of MAC (multiply-accumulate) units, which have to be arrayed thousands of times, putting extra stress on floorplanning tools.
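
To make the MAC-heavy structure concrete, here is a minimal Python sketch of a matrix multiplication expressed purely as repeated multiply-accumulate steps. The function names `mac` and `matmul_with_macs` are hypothetical and chosen for illustration. Real hardware runs many of these MAC operations in parallel across an array of units rather than in nested loops.

```python
def mac(acc, a, b):
    """One multiply-accumulate step: returns acc + a * b."""
    return acc + a * b


def matmul_with_macs(A, B):
    """Multiply matrices A (rows x inner) and B (inner x cols) using only MAC steps.

    Each output element needs `inner` MAC operations, so the total count is
    rows * cols * inner. This is why such designs replicate MAC units so heavily.
    """
    rows, inner, cols = len(A), len(B), len(B[0])
    C = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0  # the accumulator register for this output element
            for k in range(inner):
                acc = mac(acc, A[i][k], B[k][j])
            C[i][j] = acc
    return C


result = matmul_with_macs([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

Even this 2x2 example takes eight MAC steps. Scaling to the matrix sizes used in neural networks shows why thousands of arrayed units, and the buffers feeding them, come to dominate the floorplan.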

If these repeated structures are not defined optimally within the correct hierarchy, they can take up a significant amount of area, not to mention add to the complexity of abutment-related hierarchical flows. Overall, the ratio of logic elements to memory macros in these types of designs has changed drastically. There are hundreds of memories and macros that have to be placed manually or through a semi-automated process of trial and error across different combinations.

This manual, iterative work puts schedule pressure on the floorplanning aspects of the design, and among other things, routing track availability is one of the challenges that often comes up. Once the standard cells are placed based on slack, proximity, and pin access, routing becomes a huge concern. Most of the large data buses have to pass through a small area, which can cause severe congestion while closing the designs. Routers start taking detours, causing an unnecessary increase in wirelength, which directly translates to an increase in switching power. In addition, a lack of pre-planning and pin misalignment of these large buses with feed-throughs and repeater bank assignments can cause painful, manual ECOs for timing closure.
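
The link between detours and power can be sketched with the standard dynamic power relation, P = activity × C × Vdd² × f, where wire capacitance C is assumed to scale linearly with wirelength. The function name and all numeric values below (capacitance per micron, supply voltage, frequency, activity factor) are illustrative assumptions, not figures from any particular process or design. Some texts include a factor of 1/2 in this formula, which does not affect the ratio computed here.

```python
def switching_power(wirelength_um, cap_per_um_f, vdd_v, freq_hz, activity):
    """Dynamic switching power of a wire in watts: activity * C * Vdd^2 * f.

    Assumes wire capacitance grows linearly with wirelength, a first-order
    simplification that ignores coupling and via effects.
    """
    c_wire = wirelength_um * cap_per_um_f
    return activity * c_wire * vdd_v ** 2 * freq_hz


# Hypothetical numbers: a 1000 um direct route vs. a 1300 um detoured route.
direct = switching_power(1000, 0.2e-15, 0.8, 1e9, 0.1)
detoured = switching_power(1300, 0.2e-15, 0.8, 1e9, 0.1)
power_increase = detoured / direct  # equals the wirelength ratio under these assumptions
```

Under these assumptions a 30 percent detour costs roughly 30 percent more switching power on that net, and a wide data bus multiplies the penalty by the number of bits.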

Incremental but innovative changes are happening in physical implementation tools to mitigate some of these challenges. Advanced hierarchical master-clone flows help with managing hierarchies of arrays of repeated blocks. Congestion problems can be managed by continuous congestion monitoring during the implementation flow with corrective action at each step. With a lot of effort up-front through bus pre-planning, length-matching, or resistance-matching, routers can create long, clean routes for buses. Careful time budgeting at the top level can avoid last-minute surprises for timing closure, but a lot more work has to be done up-front.
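
As a rough illustration of what congestion monitoring involves, the sketch below divides the die into a grid of routing cells and flags any cell where routing demand exceeds track capacity. This is a simplified model under assumed inputs, not how any specific EDA tool implements it. The function name `congestion_hotspots`, the grid values, and the threshold are all hypothetical.

```python
def congestion_hotspots(demand, capacity, threshold=1.0):
    """Return (row, col, utilization) for each grid cell above the threshold.

    demand[r][c]   : routing tracks requested in the cell (assumed input)
    capacity[r][c] : routing tracks available in the cell (assumed input)
    """
    hotspots = []
    for r, (d_row, c_row) in enumerate(zip(demand, capacity)):
        for c, (d, cap) in enumerate(zip(d_row, c_row)):
            if cap > 0:
                util = d / cap
            else:
                # No tracks available: any demand at all is an overflow.
                util = float("inf") if d > 0 else 0.0
            if util > threshold:
                hotspots.append((r, c, util))
    return hotspots


# Hypothetical 2x2 grid: two cells ask for more tracks than exist.
demand = [[8, 12], [5, 20]]
capacity = [[10, 10], [10, 16]]
hotspots = congestion_hotspots(demand, capacity)
```

Running a check like this after each implementation step, rather than only after detailed routing, is what lets corrective action (spreading cells, re-planning a bus) happen while it is still cheap.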

Frankly, physical design engineers have seen these kinds of challenges before. As more such designs go through the full cycle, more learning is happening among design teams, CAD teams and EDA companies. It’s similar to “training” a model on a dataset. This brings about a very profound question: as more machine learning architecture chips tape out, can a training set be built to learn from the first few designs that are taped out by brute force to generate recipes for pushing performance and reducing power for the next generation of designs?

In other words, just like fast processors were used to build faster processors, can machine learning algorithms be used to build better AI chips that are faster and more power efficient? That would be a disruptive innovation that could change the way EDA tools work with design houses. If that happens, I don’t see the pictures with a chip inside the human head going away anytime soon. In fact, that picture will evolve to show that the chip inside the human head will now have a picture of a human head with a chip inside it! For all the movie fans out there, welcome to The Matrix!
