AI Architectures Must Change

Using the Von Neumann architecture for artificial intelligence applications is inefficient. What will replace it?

Using existing architectures to solve machine learning and artificial intelligence problems is becoming impractical. The total energy consumed by AI is rising significantly, and CPUs and GPUs increasingly look like the wrong tools for the job.

Several roundtables have concluded the best opportunity for significant change happens when there is no legacy IP. Most designs have evolved over time in an incremental manner, and while this provides the safe path forward, it does not provide optimal solutions. When something new comes along, there is an opportunity to take a fresh look at things and come up with a better direction than mainstream technology would suggest. That was the subject of a recent panel of researchers who questioned whether CMOS is the best foundation technology on which to build AI applications.

An Chen, on assignment from IBM to serve as the executive director of the Nanoelectronics Research Initiative (NRI), framed the discussion. “Emerging technologies have been the subject of research for many years, and that includes looking for alternatives to CMOS, especially because of the power and scaling problems that it is facing. After many years of research, it is pretty much concluded that we haven’t found anything that is better for building logic. Today, AI is the focus of many researchers, and AI does introduce some new ways of thinking and new designs, and they have different technology products. So will emerging devices for AI have a better opportunity to surpass CMOS?”

AI today
Most machine learning and artificial intelligence applications today use the Von Neumann architecture. “This has a memory to store the weights and data, and the CPU does all of the computation,” explains Meng-Fan (Marvin) Chang, professor in the department of electrical engineering at National Tsing Hua University. “A lot of the data movement is through the bus. Today, they also use GPUs for deep learning that includes convolutions. One of the major problems is that they generally create intermediate data to implement the inference. The data movement, especially off-chip, causes a lot of penalty in energy and latency. That is a bottleneck.”

Fig 1. Architectures for AI. Source: Meng-Fan Chang, NTHU.

What is required is to take the processing closer to the memory. “The compute-in-memory concept has been proposed by architecture people for many years,” says Chang. “There are several SRAM and non-volatile memory (NVM) designs that have been trying to use this concept to implement it. Ideally, if these can be tuned, then a lot of energy consumption can be saved by removing the data movement between CPU and memory. That is the ideal.”

But we don’t have compute in memory today. “We still have AI 1.0 that uses a Von Neumann architecture because there is no mature silicon available that implements processing in memory,” laments Chang. “The only way to even use 3D TSVs is to provide high-bandwidth memory (HBM) combined with a GPU to solve the memory bandwidth issue. But still, it is an energy and latency bottleneck.”

Would processing in memory be enough to solve the power wastage? “The human brain is about a hundred billion neurons with about 10^15 synapses,” says Hsien-Hsin (Sean) Lee, deputy director at TSMC. “Consider IBM TrueNorth.” TrueNorth is a many-core processor developed by IBM in 2014. It has 4,096 cores, with each one having 256 programmable simulated neurons. “Suppose we can scale it to try and mimic the size of the human brain. We have 5 orders of magnitude difference. But if I directly scale the numbers and multiply TrueNorth, which consumes 65mW, by this order, then we are talking about a machine with 65kW versus the human brain, which consumes 25W. We have to reduce it a few orders of magnitude.”
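The scaling argument can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted above and assumes that power scales linearly with neuron count, which is a simplification. Under that assumption the result lands in the kilowatt range, below the quoted figure, so the quoted number presumably rests on different scaling assumptions. Either way, the gap to the brain is at least two orders of magnitude.

```python
import math

# Figures quoted in the text above.
CORES = 4096
NEURONS_PER_CORE = 256
TRUENORTH_POWER_W = 0.065   # 65 mW
BRAIN_NEURONS = 1e11        # "about a hundred billion neurons"
BRAIN_POWER_W = 25.0        # brain power as quoted by Lee

# Total simulated neurons on one TrueNorth chip.
truenorth_neurons = CORES * NEURONS_PER_CORE

# How many chips' worth of neurons a brain-sized system needs.
scale = BRAIN_NEURONS / truenorth_neurons
orders = math.log10(scale)

# Assumption: power grows linearly with neuron count.
scaled_power_w = TRUENORTH_POWER_W * scale
gap = scaled_power_w / BRAIN_POWER_W

print(f"TrueNorth neurons: {truenorth_neurons:,}")
print(f"Scale factor: {scale:.3g} (~{orders:.1f} orders of magnitude)")
print(f"Scaled power: {scaled_power_w / 1000:.1f} kW, {gap:.0f}x the brain")
```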

Lee provides another way to visualize the opportunity. “The most efficient supercomputer today is the Green500 from Japan that manages 17Gflops per watt, which is about 1 flop per 59 picoJ.” The Green500 site states that the ZettaScaler-2.2 system installed at the Advanced Center for Computing and Communication, RIKEN, Japan was remeasured and achieved 18.4 gigaflops/watt during its 858 teraflops Linpack performance run. “Now, compare that to Landauer’s principle which tells you that, at room temperature, the minimum switching energy per transistor is around 2.75 zeptoJ. Again, this is orders of magnitude differences. 59 picoJ is approximately 10^-11 versus the theoretical minimum of approximately 10^-21. We have a lot of room to play with.”
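Both numbers in this comparison can be reproduced directly. The sketch below converts 17 Gflops per watt into joules per flop and evaluates the Landauer bound kT ln 2. The 300 K temperature is an assumption; it gives roughly 2.9 zeptojoules, close to the quoted 2.75, which corresponds to a slightly cooler room. A flop involves many bit operations, so the comparison is only an order-of-magnitude guide.

```python
import math

K_B = 1.380649e-23  # Boltzmann constant in J/K (exact SI value)

def landauer_limit_joules(temperature_k):
    """Minimum energy to erase one bit: k_B * T * ln(2)."""
    return K_B * temperature_k * math.log(2)

def joules_per_flop(gflops_per_watt):
    """Energy per floating-point operation at a given efficiency."""
    return 1.0 / (gflops_per_watt * 1e9)

e_flop = joules_per_flop(17.0)         # efficiency figure quoted in the text
e_min = landauer_limit_joules(300.0)   # assumption: room temperature = 300 K
gap_orders = math.log10(e_flop / e_min)

print(f"Energy per flop: {e_flop * 1e12:.1f} pJ")
print(f"Landauer limit:  {e_min * 1e21:.2f} zJ")
print(f"Gap: about {gap_orders:.0f} orders of magnitude")
```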

Is it fair to compare these computers against the brain? “If you look at recent successes in deep learning, we find that in most cases when we look at human versus machine chronicles, the machine has been winning for the last several years,” says Kaushik Roy, distinguished professor of electrical and computer engineering at Purdue University. “In 1997 we had Deep Blue beating Kasparov, in 2011, IBM Watson playing Jeopardy, and in 2016, AlphaGo went against Lee Sedol and won. Those are great achievements. But the question is, at what cost? Those machines were in the range of 200 to 300kW. The human brain does it in around 20W. So there is a huge efficiency gap. Where will new innovations come from?”

At the heart of most machine learning and AI applications are some very simple computations that are performed on a vast scale. “If I look at a very simple-minded neural network, I would do a weighted summation followed by a threshold operation,” explains Roy. “You can do this in a crossbar, which can be of different types. It could be a spin device or resistive RAMs. In that case we would have input voltage and subsequent conductance associated with each of the crosspoints. What you get at the output is the summation of the voltage times conductance. That is a current. Then you can have similar looking devices that do the thresholding operation. You can think of having an architecture that is a bunch of these nodes connected together to do computation.”

Fig 2. Major components of a neural network. Source: Kaushik Roy, Purdue.
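The crossbar operation Roy describes can be modeled in a few lines. This is a minimal numerical sketch, not a device model: each column current is the sum of input voltage times crosspoint conductance, followed by a threshold. The function names, the voltages, the conductances, and the threshold current are all hypothetical values chosen for illustration.

```python
def crossbar_currents(voltages, conductances):
    """Column currents I_j = sum_i V_i * G[i][j].

    Ohm's law gives each crosspoint current; Kirchhoff's current law
    sums them along each column wire.
    """
    n_cols = len(conductances[0])
    return [
        sum(v * row[j] for v, row in zip(voltages, conductances))
        for j in range(n_cols)
    ]

def threshold(currents, i_th):
    """Fire (1) when a column current exceeds the threshold, else 0."""
    return [1 if i > i_th else 0 for i in currents]

# Hypothetical example: 3 input rows, 2 output columns.
voltages = [0.2, 0.0, 0.1]      # volts applied to the rows
conductances = [                # siemens at each crosspoint (the weights)
    [1e-6, 5e-6],
    [2e-6, 1e-6],
    [4e-6, 1e-6],
]

currents = crossbar_currents(voltages, conductances)
outputs = threshold(currents, i_th=8e-7)
print(currents, outputs)
```

In hardware the summation happens in the wires themselves, so the loop here stands in for physics that occurs in one step.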

New memories
Most of the potential architectures revolve around emerging non-volatile memory architectures. “What are the most important characteristics?” asks Geoffrey Burr, principal RSM at IBM Research. “I would place my bet on non-volatile analog resistive memory, such as phase-change, memristors, etc. The idea is that these things can perform multiply-accumulates for fully connected neural network layers in a single timestep. What would otherwise take a million clocks on a series of processors, you can do that in the analog domain using the underlying physics at the location of the data. That has enough seriously interesting aspects to it in time and energy that it might go someplace.”

Fig 3. Emerging Memory Technologies. Source: Meng-Fan Chang, NTHU.
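To see where a figure like “a million clocks” can come from, count the multiply-accumulates (MACs) in one fully connected layer. The sketch below assumes a hypothetical 1000 x 1000 layer and a hypothetical digital engine that retires a fixed number of MACs per clock. The single analog step is the idealized case Burr describes and ignores conversion and readout overhead.

```python
def sequential_mac_steps(n_inputs, n_outputs, macs_per_clock=1):
    """Clock cycles needed to compute a fully connected layer
    when only `macs_per_clock` multiply-accumulates finish per cycle."""
    total_macs = n_inputs * n_outputs
    return -(-total_macs // macs_per_clock)  # ceiling division

# Hypothetical layer size, chosen so the serial count is one million.
steps_serial = sequential_mac_steps(1000, 1000)
steps_parallel = sequential_mac_steps(1000, 1000, macs_per_clock=256)
analog_steps = 1  # idealized: the whole array responds in one timestep

print(steps_serial, steps_parallel, analog_steps)
```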

Chang agrees. “PCM, STT are coming strong. These three types of memory are all good candidates to achieve in-memory computing. They can also do some basic logic. Some of the memories do have problems with endurance so you cannot use that for training, but you can for inference.”

But it may not even be necessary to migrate to these new memories. “People are talking about using SRAM to do exactly the same thing,” adds Lee. “They are doing analog computing using SRAM. The only downside is that SRAM is a little big – 6T or 8T. So it is not necessary that we use the emerging technologies to perform analog compute.”

A migration to analog computing also implies that accuracy of calculation is not a paramount requirement. “AI is about specialization, classification and prediction,” Lee says. “All they do is make decisions, but that could be rough. Accuracy-wise, we can tolerate some inaccuracy. We need to determine which computations are error-tolerant. Then you can apply some techniques to reduce power or make the computation faster. Probabilistic CMOS has been worked on since 2003. This involves lowering the supply voltage until you may have some errors, but the amount is tolerable. People today are already using approximate computing techniques, such as quantization. Instead of 32-bit floating point, you use 8-bit integers. Analog computing is another possibility that has already been mentioned.”
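Quantization is the easiest of these techniques to show concretely. The sketch below uses symmetric linear quantization, one common scheme among several and not necessarily the one Lee has in mind. It maps floating-point weights to 8-bit integers and measures the error introduced. The function names and weight values are hypothetical.

```python
def quantize_int8(values):
    """Symmetric linear quantization to the int8 range [-127, 127].

    Returns the integer codes and the scale needed to recover
    approximate floating-point values.
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point values."""
    return [c * scale for c in codes]

# Hypothetical weights for illustration.
weights = [0.82, -1.27, 0.003, 0.5]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(codes, scale, max_error)
```

The worst-case rounding error is half the scale, which is the inaccuracy the network has to tolerate in exchange for smaller storage and cheaper arithmetic.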

Getting out of the Lab
Moving technology from the lab into the mainstream can be difficult. “Sometimes you have to look at alternatives,” says Burr. “When 2D flash went up to the wall, 3D flash no longer looked to be quite as hard. If we keep seeing improvement in existing technology that provides 2X here and another 2X there, then analog, in-memory computing will get pushed out. But if the next improvement is marginal, then analog memory starts to look a lot more attractive. As researchers we have to be ready when that opportunity comes around.”

Burr says that while the notion of analog resistive memory is in some fabs, they tend to be dedicated to making memory chips. “If this was in the fab and you could check off the box to, say, add phase-change memory between metal 3 and 4, this would be a lot easier. What we need to do as a community is to convince someone that it makes sense to do that.”

Economics often gets in the way, especially for memory devices, but Burr says that will not be the case. “One advantage that we have is that this will not be a memory product. It will not be something with very small margins. It is not commodity. Instead you are competing with GPUs. They sell for 70X the cost of the DRAM that is on them, so it is clearly not a memory product. And yet you would have costs that are not that much different from memory. While that sounds great, when you are making $1B, $2B, $10B decisions, the costs and the business case have to be clear. We must have impressive hardware prototypes to get over that hurdle.”

Replacing CMOS
While processing in memory would bring about impressive gains, more is required. Can a technology other than CMOS help? “When we consider a move from low power CMOS to tunnel FETs we are talking about 1 to 2 orders of magnitude reduction of energy,” says Lee. “Another possibility is 3D ICs. This is about reducing the wire lengths using TSV. That reduces both power and latency. If we look at the data center infrastructure, we see them also removing the metal wire and replacing it with optical interconnect.”

Fig 4. Performance and power consumption of devices. Source: Hsien-Hsin Lee – TSMC.

While there are gains to be made by moving to a different technology, those gains may not be worth it. “It will be difficult to replace CMOS, but some of the devices discussed can augment the CMOS technology to do in-memory computing,” says Roy. “CMOS can support in-memory computation. CMOS can do a dot product in the memory itself in an analog fashion, probably in an 8T cell. So, can I really have an architecture that would provide a huge benefit to CMOS? If I do it right, CMOS can give me hundreds to thousands of times improvement in energy consumption. But it will take time.”

What is clear is that CMOS is not going to be replaced. “Emerging technologies will not overtake, nor will an emerging technology show up that is not on a CMOS substrate,” concludes Burr.

3 comments

Michael Mingliang Liu says:

C’mon, leave that ole CMOS alone! Get on a new BUS..

Kevin Cameron says:

Actually, the premise may be incorrect. Circuit simulation (akin to Verilog and Fast-Spice) is a similar problem to processing neural networks, and I came up with how to do that with standard processors a while ago:

http://parallel.cc/cgi-bin/bfx.cgi/WT-2018/WT-2018.html
(https://youtu.be/Bh5axlxIUvM)

For lower power you can use asynchronous versions (ARM has been done) and die-stack with memory.

Note: the above fix for AI performance has a bunch of other benefits since it can run all your old code too: faster, better security, transparent scaling across machines…

Either way, you probably need better CAD tools than the current RTL centric set to get the job done.

The inefficiency of the current artificial intelligence architecture. | WendelNeves says:

Current artificial intelligence solutions are built on the Von Neumann architecture, in which a block of memory stores the information and a CPU does all the calculations, now with the aid of GPUs as well. This movement of information incurs a large penalty in energy use and latency, and it is becoming a bottleneck for these solutions, as detailed in Brian Bailey’s article at Semiengineering.com.
