Shrinking LLMs With Self-Compression


Language models are becoming ever larger, making on-device inference slow and energy-intensive. A direct and surprisingly effective remedy is to prune complete channels whose contribution to the task is negligible. Our earlier work introduced a training-time procedure – Self-Compression [1, 4] – that lets back-propagation decide the bit-width of every channel, so unhelpful ones fade away. T...
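
To make the idea concrete, below is a minimal sketch (not the authors' released code) of what a layer with learnable per-channel bit-widths might look like in PyTorch. The class name `SelfCompressedLinear`, the helper `bit_penalty`, and the `gamma` weighting of the size term are illustrative assumptions; the key ingredients are a per-channel bit-width and exponent trained by back-propagation, a straight-through estimator for the rounding step, and a penalty on the average bit-width added to the task loss so that unhelpful channels can be driven to zero bits and removed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfCompressedLinear(nn.Module):
    """Linear layer whose weights are quantized with a learnable per-channel bit-width."""

    def __init__(self, in_features, out_features, init_bits=8.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        # One learnable bit-width b and exponent e per output channel.
        self.b = nn.Parameter(torch.full((out_features, 1), init_bits))
        self.e = nn.Parameter(torch.zeros(out_features, 1))

    def _quantize(self, w):
        b = torch.clamp(self.b, min=0.0)        # bit-width cannot go negative
        scale = 2.0 ** -self.e
        lo = -(2.0 ** (b - 1))                  # lowest representable code
        hi = 2.0 ** (b - 1) - 1                 # highest representable code
        q = torch.clamp(w * scale, lo, hi)
        # Straight-through estimator: round in the forward pass,
        # let gradients pass through unchanged in the backward pass.
        q = q + (torch.round(q) - q).detach()
        return q / scale

    def forward(self, x):
        return F.linear(x, self._quantize(self.weight))

    def bit_penalty(self):
        # Mean bits per weight in this layer; channels that the optimizer
        # drives to 0 bits contribute nothing and can be pruned outright.
        return torch.clamp(self.b, min=0.0).mean()


# Training sketch (hypothetical): add the size term to the task loss so that
# back-propagation decides how many bits, if any, each channel deserves.
# loss = task_loss + gamma * sum(m.bit_penalty() for m in model.modules()
#                                if isinstance(m, SelfCompressedLinear))
```

In this sketch the trade-off between accuracy and model size is set entirely by the single weight `gamma` on the bit penalty: a larger value prunes more channels and shrinks the network further, at some cost in task performance.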