Distributed Inference 101: Disaggregated Serving with NVIDIA Dynamo

NVIDIA Developer · March 18, 2025

Video Description

Disaggregated serving enables developers to serve large language models (LLMs) at maximum throughput for a given latency requirement by separating the prefill and decode phases of LLM inference and executing them independently on separate GPUs. In this video, we:

- Demonstrate how to harness the power of disaggregated serving
- Introduce more advanced features offered by NVIDIA Dynamo, such as auto-discovery and conditional disaggregation

Explore and download → https://github.com/ai-dynamo/dynamo

#Inference #datacenter #AI #disaggregatedserving
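As a rough mental model of what the video covers, the following is a minimal, self-contained Python sketch of the prefill/decode split and a "conditional disaggregation" routing policy. Every name in it (KVCache, PrefillWorker, DecodeWorker, ConditionalRouter, the 512-token threshold) is a hypothetical illustration, not Dynamo's actual API; the real interfaces live in the repository linked above.

```python
# Conceptual sketch only -- NOT the NVIDIA Dynamo API. Illustrates splitting
# prefill and decode across dedicated workers (GPUs), plus a hypothetical
# router that disaggregates only when a prompt is long enough to justify
# the cost of transferring the KV cache between workers.

from dataclasses import dataclass, field


@dataclass
class KVCache:
    """Stands in for the key/value attention cache produced by prefill."""
    tokens: list[int]
    layers: dict = field(default_factory=dict)  # per-layer K/V tensors in a real engine


class PrefillWorker:
    """Runs the compute-bound prefill phase (prompt processing) on its own GPU."""

    def run(self, prompt_tokens: list[int]) -> KVCache:
        # A real engine would execute one forward pass over the whole prompt
        # here and ship the resulting KV cache to a decode worker.
        return KVCache(tokens=list(prompt_tokens))


class DecodeWorker:
    """Runs the memory-bound decode phase (token-by-token generation)."""

    def run(self, kv: KVCache, max_new_tokens: int) -> list[int]:
        generated = []
        for _ in range(max_new_tokens):
            # A real engine would do one forward step reusing the KV cache.
            next_token = 0  # placeholder token id
            kv.tokens.append(next_token)
            generated.append(next_token)
        return generated


class ConditionalRouter:
    """Hypothetical policy: pay the KV-transfer cost only for long prompts,
    where a dedicated prefill GPU pays off; short prompts stay local."""

    def __init__(self, prefill: PrefillWorker, decode: DecodeWorker,
                 threshold: int = 512):
        self.prefill, self.decode, self.threshold = prefill, decode, threshold

    def generate(self, prompt_tokens: list[int],
                 max_new_tokens: int = 32) -> list[int]:
        if len(prompt_tokens) >= self.threshold:
            kv = self.prefill.run(prompt_tokens)      # disaggregated: remote prefill GPU
        else:
            kv = KVCache(tokens=list(prompt_tokens))  # aggregated: prefill on decode GPU
        return self.decode.run(kv, max_new_tokens)


router = ConditionalRouter(PrefillWorker(), DecodeWorker())
print(len(router.generate(list(range(1024)))))  # long prompt -> disaggregated path
```

The design point the sketch tries to capture: prefill and decode stress GPUs differently (compute-bound vs. memory-bandwidth-bound), so running each phase on hardware sized for it, and disaggregating only when the prompt warrants it, is what lets a deployment hit maximum throughput within its latency budget.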
