DeepSeek's GRPO (Group Relative Policy Optimization) | Reinforcement Learning for LLMs

Julia Turc • March 19, 2025

Video Description

In this video, I break down DeepSeek's Group Relative Policy Optimization (GRPO) from first principles, without assuming prior knowledge of Reinforcement Learning. By the end, you’ll understand the core RL building blocks that led to GRPO, including:

🔵 Policy Gradient Methods
🔵 The REINFORCE Algorithm
🔵 Actor-Critic Models
🔵 PPO (Proximal Policy Optimization)
🔵 GRPO (Group Relative Policy Optimization)

Papers:
GRPO paper (DeepSeekMath): https://arxiv.org/pdf/2402.03300
DeepSeek-R1 paper: https://arxiv.org/pdf/2501.12948
PPO paper: https://arxiv.org/pdf/1707.06347
GAE paper: https://arxiv.org/pdf/1506.02438
TRPO paper: https://arxiv.org/pdf/1502.05477
Mother of all RL books (Barto & Sutton): http://incompleteideas.net/book/RLboo...

00:00 Intro
00:53 Where GRPO fits within the LLM training pipeline
04:17 RL fundamentals for LLMs
08:25 Policy Gradient Methods & REINFORCE
11:58 Reward baselines & Actor-Critic Methods
14:10 GRPO
21:42 Wrap-up: PPO vs GRPO
22:32 Research papers are like Instagram
