
NVIDIA DYNAMO: Serving LLMs at AI-Factory Scale

Anish Maddipoti (Product Manager) and Rohan Varma (AI Developer), NVIDIA, at AI Plumbers: San Francisco Edition

On October 25th in SF, we got together to discuss “What’s missing in an open-source full-stack AI platform?”

The AI Plumbers Unconference: San Francisco Edition is an open-source meetup for builders of low-level AI systems to dive into the plumbing of modern AI, from data infrastructure to AI accelerators.

Watch the #AIPlumbers presentation by the NVIDIA team on Dynamo: a deep dive into production environments for inference at scale, where both compute and memory demands are growing exponentially.

Disaggregated serving, intelligent scheduling, multi-tier memory management, KV-routing, and high-availability mechanics — all designed to push inference efficiency to the maximum.
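To make the KV-routing idea concrete, here is a minimal sketch of a cache-aware router, not Dynamo’s actual implementation: each request goes to the worker whose cached prefix blocks overlap most with the incoming prompt, traded off against that worker’s current load. The block size, scoring weights, and Worker structure are illustrative assumptions.

```python
# Minimal sketch of KV-cache-aware routing (illustrative, not Dynamo's API).
from dataclasses import dataclass, field

BLOCK = 64  # tokens per KV block (assumed granularity)

@dataclass
class Worker:
    name: str
    active_requests: int = 0
    # Hashes of cumulative prompt-prefix blocks held in this worker's KV cache.
    cached_blocks: set[int] = field(default_factory=set)

def block_hashes(tokens: list[int]) -> list[int]:
    """Hash each cumulative block of the prompt so that prompts sharing a
    prefix produce identical leading hashes."""
    hashes, h = [], 0
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        h = hash((h, tuple(tokens[i:i + BLOCK])))
        hashes.append(h)
    return hashes

def route(workers: list[Worker], tokens: list[int],
          overlap_weight: float = 1.0, load_weight: float = 0.5) -> Worker:
    """Pick the worker with the best (cache overlap - load) score."""
    prefix = block_hashes(tokens)

    def score(w: Worker) -> float:
        # Count how many leading blocks this worker already has cached.
        overlap = 0
        for h in prefix:
            if h not in w.cached_blocks:
                break
            overlap += 1
        return overlap_weight * overlap - load_weight * w.active_requests

    best = max(workers, key=score)
    best.active_requests += 1
    best.cached_blocks.update(prefix)  # assume the new prompt gets cached there
    return best
```

A production router also has to handle cache eviction, block expiry, and failures; this sketch deliberately ignores all of that to show just the routing trade-off.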

This #AIPlumbers talk showcased production-grade engineering: from offline performance configurators that find optimal cluster layouts, to dynamic K8s scheduling that understands physical GPU topology, to coordinated multi-GPU serving. Lots of clever tricks for handling compute-bound vs. memory-bound workloads, the kind I’d heard people discuss before only in theory; here they were shown in practice. And it’s all #opensource.
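A quick way to see the compute-bound vs. memory-bound split is a roofline-style arithmetic-intensity estimate: prefill processes thousands of prompt tokens per pass over the weights and lands above the machine’s ridge point, while decode generates one token per request and lands below it. The hardware numbers below are rough H100-class assumptions, not measured figures.

```python
# Back-of-the-envelope roofline check: prefill vs. decode.
PEAK_FLOPS = 1.0e15        # ~1 PFLOP/s dense FP16 (assumed)
PEAK_BW    = 3.35e12       # ~3.35 TB/s HBM bandwidth (assumed)
MACHINE_BALANCE = PEAK_FLOPS / PEAK_BW   # FLOPs per byte at the ridge point

def phase_intensity(batch_tokens: int, bytes_per_param: int = 2) -> float:
    """Arithmetic intensity of a forward pass, in FLOPs per byte.

    Each token does ~2 FLOPs per parameter, while the weights
    (bytes_per_param bytes each) are read once per pass regardless of
    batch size, so intensity grows with tokens processed together."""
    return 2 * batch_tokens / bytes_per_param

# Prefill: thousands of prompt tokens per pass -> high intensity.
prefill = phase_intensity(batch_tokens=4096)
# Decode: one new token per request; even a batch of 32 stays low.
decode = phase_intensity(batch_tokens=32)

for name, ai in [("prefill", prefill), ("decode", decode)]:
    bound = "compute-bound" if ai > MACHINE_BALANCE else "memory-bound"
    print(f"{name}: {ai:.0f} FLOPs/byte vs ridge {MACHINE_BALANCE:.0f} -> {bound}")
```

With these assumed numbers the ridge point is about 300 FLOPs/byte, so prefill (~4096 FLOPs/byte) saturates compute while decode (~32 FLOPs/byte) is starved by memory bandwidth, which is exactly why disaggregating the two phases onto differently configured workers pays off.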

And we really hope to hear more from the Dynamo team at #FOSDEM26 - don’t miss it!

Key moments from the talk:

00:00 – 01:02 — Dynamo: Inference at Scale

01:03 – 02:49 — Inference Compute Requirements Scaling Exponentially

02:50 – 05:59 — Dynamo: A Systematic Approach to AI Inference at Scale

06:00 – 08:54 — Memory Management

08:55 – 12:19 — KV Router

12:20 – 15:00 — Production-Grade Serving with Dynamo

15:01 – 16:33 — Offline Perf Configurator

16:34 – 18:39 — Offline Perf Optimizer

18:40 – 26:00 — Topology-Optimized Dynamic K8s Scheduling

26:01 – 29:22 — Fault Tolerance

29:23 – 32:32 — How Dynamo Works
