Adventures in Model Quantization and GPU performance

John Leimgruber, AI Plumbers: San Francisco Edition

On October 25th, we got together in SF to discuss “What’s missing in an open-source full-stack AI platform?”

The AI Plumbers Unconference: San Francisco Edition is an open-source meetup for builders of low-level AI systems to dive into the plumbing of modern AI, from modern data infrastructure to AI accelerators.

Watch the #AIPlumbers presentation by John Leimgruber, Community LLM Quantizer. Also known as ubergarm, he is one of the best-known and most productive quantizers in the AI world. I’m not joking about productivity: he has about 30TB of quants on Hugging Face, and they even started to limit his uploads (all fixed now, no worries, there will be more!). I wish there were people with the job title “quantizer” on LinkedIn so we would at least know how many there are, but I’m not so sure all of them even have LinkedIn. Anyway, I don’t know too many, but I sure know the great ones!

Btw, if you haven’t seen it, go watch Iwan Kawrakow’s talk from #FOSDEM 2025!

But also do watch John’s talk to learn what it takes and how to start the journey into #quantization, how to benchmark different quantizations, and what metrics to use: what is speed? What is inside the quant? Is fp8 a data type or a quantization type? So if you don’t have time to download all of the #ggml GitHub discussions and PRs and grep through them to figure it all out, listen to somebody who did.

And in true #AIPlumbers tradition, it doesn’t stop at SW optimizations: different #HW backends, memory usage optimizations, thermal considerations and more!

Key moments from the talk:

00:00 – 04:30 Personal Background and Journey

04:31 – 07:30 Quant Cooking Quick-Start

07:31 – 10:17 Benchmarking Quantization “Quality”

10:18 – 12:39 Quant comparison for “Quality”

12:40 – 16:15 Benchmarking Quantization “Speed”

16:16 – 18:20 LLM Tensors

18:21 – 20:38 llama-quantize --help

20:39 – 21:57 MXFP-4 Quantization with 4-bit blocks

21:58 – 27:27 GPU Tuning

John’s presentation is here!
