Distributed Training with PyTorch: complete tutorial with cloud infrastructure and code
Video Description
A complete tutorial on how to train a model on multiple GPUs or multiple servers. I first describe the difference between Data Parallelism and Model Parallelism. Later, I explain the concept of gradient accumulation (including all the maths behind it). Then we get to the practical tutorial: first we create a cluster on Paperspace with two servers (each having two GPUs), and then we train a model in a distributed manner on the cluster. We will explore the collective communication primitives Broadcast, Reduce, and All-Reduce, and the algorithms behind them. I also provide a template for integrating DistributedDataParallel into your existing training loop. In the last part of the video we review advanced topics, like bucketing and computation-communication overlap during backpropagation.

Code: https://github.com/hkproj/pytorch-transformer-distributed
PDF slides: https://github.com/hkproj/pytorch-transformer-distributed/blob/main/notes/Slides.pdf

Chapters
00:00:00 - Introduction
00:02:43 - What is distributed training?
00:04:44 - Data Parallelism vs Model Parallelism
00:06:25 - Gradient accumulation
00:19:38 - Distributed Data Parallel
00:26:24 - Collective Communication Primitives
00:28:39 - Broadcast operator
00:30:28 - Reduce operator
00:32:39 - All-Reduce
00:33:20 - Failover
00:36:14 - Creating the cluster (Paperspace)
00:49:00 - Distributed Training with TorchRun
00:54:57 - LOCAL RANK vs GLOBAL RANK
00:56:05 - Code walkthrough
01:06:47 - No_Sync context
01:08:48 - Computation-Communication overlap
01:10:50 - Bucketing
01:12:11 - Conclusion
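The gradient accumulation idea the video derives can be checked numerically: since the gradient of a mean loss is the mean of per-sample gradients, summing gradients over micro-batches and stepping once is equivalent to one large batch. The toy model below (a one-parameter linear fit with hand-computed gradients) is my own illustration of that maths, not code from the linked repository.

```python
# Illustration of gradient accumulation (an assumption of this write-up,
# not code from the video's repo): for y = w * x with squared-error loss,
# the mean-batch gradient equals the mean of per-sample gradients, so
# accumulating over micro-batches and dividing once gives the same result.

def grad(w, x, y):
    # d/dw (w*x - y)^2 = 2 * (w*x - y) * x
    return 2 * (w * x - y) * x

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

# Full-batch gradient (batch size 4).
full = sum(grad(w, x, y) for x, y in zip(xs, ys)) / len(xs)

# Gradient accumulation: two micro-batches of size 2, one "step" at the end.
samples = list(zip(xs, ys))
acc = 0.0
for micro_batch in (samples[:2], samples[2:]):
    acc += sum(grad(w, x, y) for x, y in micro_batch)
acc /= len(xs)

print(full, acc)  # the two gradients match
```

In a real PyTorch loop this corresponds to calling `loss.backward()` on each micro-batch (gradients accumulate in `.grad`) and calling `optimizer.step()` only every N micro-batches.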
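The collective communication primitives covered in the video can be sketched in a few lines. This is a toy single-process simulation of my own devising (not the `torch.distributed` API): Reduce combines values onto one rank, Broadcast copies one rank's value to all ranks, and All-Reduce leaves every rank holding the combined result, which is how DDP averages gradients across GPUs.

```python
# Toy simulation of collective primitives; a list index plays the role of a
# rank. Function names and structure are my own illustration, not the
# torch.distributed API.

def reduce_sum(values, root=0):
    # Every rank sends its value to root; only root holds the sum.
    out = [None] * len(values)
    out[root] = sum(values)
    return out

def broadcast(values, root=0):
    # Root's value is copied to every rank.
    return [values[root]] * len(values)

def all_reduce(values):
    # One way to realize All-Reduce: Reduce onto a root, then Broadcast.
    return broadcast(reduce_sum(values))

per_rank_grads = [1.0, 2.0, 3.0, 4.0]  # one gradient value per GPU
print(all_reduce(per_rank_grads))      # every rank now holds the sum
```

Real implementations (e.g. NCCL's ring all-reduce) avoid the root bottleneck by pipelining chunks around a ring, but the end state on every rank is the same.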
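The LOCAL_RANK vs GLOBAL_RANK distinction from the video boils down to one formula, shown below for the cluster built in the tutorial (two servers with two GPUs each). The helper function is my own illustration; in practice `torchrun` computes these values and exposes them via the `LOCAL_RANK` and `RANK` environment variables.

```python
# LOCAL_RANK indexes GPUs within a single node; the global RANK indexes
# processes across the whole cluster. The function name is illustrative,
# not a torchrun API.

def global_rank(node_rank, local_rank, gpus_per_node):
    return node_rank * gpus_per_node + local_rank

# Cluster from the video: two Paperspace servers, two GPUs each.
for node in range(2):
    for gpu in range(2):
        print(f"node {node}, LOCAL_RANK {gpu} -> "
              f"RANK {global_rank(node, gpu, gpus_per_node=2)}")
# Global ranks come out as 0,1 on node 0 and 2,3 on node 1.
```

The local rank is what you pass to `torch.cuda.set_device(...)`; the global rank is what identifies a process cluster-wide (e.g. rank 0 typically does the logging and checkpointing).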