Overhauling vision support in llama.cpp and llama-server

Xuan-Son Nguyen (Engineer @Hugging Face), AI Plumbers Conference: 2nd edition

On July 15, in Berlin, we got together for the second edition of the AI Plumbers Conference, an open-source meetup for low-level AI builders to dive deep into the plumbing of modern AI, from cutting-edge data infrastructure to AI accelerators. Take a look at how it went!

Community choices, a perfectionist overhaul, and locally run demos: Xuan-Son Nguyen demonstrates how vision support was added to llama.cpp for multimodal use cases, the obstacles along the way, and the clever hacks. There is still work to do, so go try it on Hugging Face Spaces (of course) or locally, and contribute to llama.cpp!
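To try it locally, here is a minimal sketch of querying a llama-server instance with an image, assuming the server was started with a vision-capable GGUF model and its --mmproj projector file; the model file names, port, and image path below are placeholders, and the request uses the OpenAI-compatible chat completions endpoint that llama-server exposes.

```python
# Minimal sketch: send an image to a locally running llama-server.
# Assumes the server was started with a vision-capable model, e.g.:
#   llama-server -m model.gguf --mmproj mmproj.gguf --port 8080
# (file names and port are placeholders)
import base64
import requests

# Encode a local image as a base64 data URI (path is a placeholder).
with open("example.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

payload = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                },
            ],
        }
    ],
}

# llama-server exposes an OpenAI-compatible chat completions endpoint.
resp = requests.post("http://localhost:8080/v1/chat/completions", json=payload)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```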

Key moments from the talk:

0:55 — Demo: running llama-server locally with a Qwen 3B omni model, with image and audio input

3:21 — Introduction: who is Xuan-Son Nguyen

4:10 — A little bit of history - how multimodal models work

6:10 — History - adding and removing multimodal (LLaVA) support in llama.cpp

9:21 — History - what caused the problems in the llava.cpp / clip.cpp implementation

10:45 — How to fix it?

12:08 — Enter libmtmd

13:12 — libmtmd architecture

16:50 — libmtmd: a minimal, simple, well-documented API (adding audio support didn't require an API change!)

17:30 — LM Studio is one of the earliest adopters of libmtmd

18:10 — Demo: mtmd-cli

19:14 — Bringing this work to llama-server (some of the functionality)

21:55 — llama-server WebUI

23:58 — Viral demo - try it!

25:10 — TODO

The presentation slides are available here:

AI Plumbers 2nd Edition
1.77MB ∙ PDF file
Download
