Scaling llama.cpp On Neoverse N2: Solving Cross-NUMA Performance Issues


This blog post explains the cross-NUMA memory access issue that occurs when you run llama.cpp in Neoverse. It also introduces a proof-of-concept patch that addresses this issue and can provide up to a 55% performance increase for text generation when you run the llama3_Q4_0 model on the ZhuFeng Neoverse system. Cross-NUMA memory access problem In llama.cpp, performance drops when the number o... » read more

LLM Inference On CPUs (Intel)


A technical paper titled “Efficient LLM Inference on CPUs” was published by researchers at Intel. Abstract: "Large language models (LLMs) have demonstrated remarkable performance and tremendous potential across a wide range of tasks. However, deploying these models has been challenging due to the astronomical amount of model parameters, which requires a demand for large memory capacity an... » read more