Overview of NMAX Neural Inferencing

At Hot Chips 2018, Microsoft showed the attached slide in its Brainwave presentation: the ideal is to achieve high hardware utilization at low batch size. Existing architectures don’t do this: they reach high utilization only at high batch sizes, which means high latency. NMAX’s architecture loads weights quickly, achieving almost the same high utilization at batch=1 as at large batch sizes...
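
The link between weight-loading time and utilization at small batch sizes can be illustrated with a toy model. The sketch below is not taken from the Brainwave or NMAX material: it assumes that a layer's weights are loaded once per batch, that loading does not overlap with compute, and that every time value is a made-up unit chosen only for illustration. The function name `utilization` and its parameters are hypothetical.

```python
def utilization(batch, weight_load_time, compute_time_per_item):
    """Toy model: fraction of time the compute units do useful work.

    Assumes weights are loaded once per batch and that loading
    does not overlap with compute (a simplifying assumption).
    """
    busy = batch * compute_time_per_item
    return busy / (weight_load_time + busy)


# Hypothetical time units, not measurements of any real hardware.
batches = (1, 9, 99)

# Slow weight loading: utilization is high only at large batch sizes.
slow_load = [utilization(b, weight_load_time=9.0, compute_time_per_item=1.0)
             for b in batches]

# Fast weight loading: utilization is already high at batch=1.
fast_load = [utilization(b, weight_load_time=0.25, compute_time_per_item=1.0)
             for b in batches]

for b, s, f in zip(batches, slow_load, fast_load):
    print(f"batch={b:3d}  slow-load={s:.2f}  fast-load={f:.2f}")
```

Under these assumptions, shrinking the weight-loading time is what closes the gap between batch=1 and large-batch utilization; raising the batch size only hides the loading cost at the expense of latency.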