Proximal Policy Optimization (PPO) for LLMs Explained Intuitively

Julia Turc • March 7, 2025

Video Description

In this video, I break down Proximal Policy Optimization (PPO) from first principles, without assuming prior knowledge of Reinforcement Learning. By the end, you’ll understand the core RL building blocks that led to PPO, including:

🔵 Policy Gradient
🔵 Actor-Critic Models
🔵 The Value Function
🔵 The Generalized Advantage Estimate

In the LLM world, PPO was used to train reasoning models like OpenAI's o1/o3, and presumably Claude 3.7, Grok 3, etc. It’s the backbone of Reinforcement Learning from Human Feedback (RLHF), which helps align AI models with human preferences, and of Reinforcement Learning with Verifiable Rewards (RLVR), which gives LLMs reasoning abilities.

Papers:
- PPO paper: https://arxiv.org/pdf/1707.06347
- GAE paper: https://arxiv.org/pdf/1506.02438
- TRPO paper: https://arxiv.org/pdf/1502.05477

Well-written blogposts:
- https://danieltakeshi.github.io/2017/04/02/notes-on-the-generalized-advantage-estimation-paper/
- https://huggingface.co/blog/NormalUhr/rlhf-pipeline
- https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/

Implementations:
- (Original) OpenAI Baselines: https://github.com/openai/baselines/blob/ea25b9e8b234e6ee1bca43083f8f3cf974143998/baselines/ppo2
- Hugging Face: https://github.com/huggingface/trl/blob/main/trl/trainer/ppo_trainer.py
- Hugging Face docs: https://huggingface.co/docs/trl/main/en/ppo_trainer

Mother of all RL books (Barto & Sutton): http://incompleteideas.net/book/RLbook2020.pdf

00:00 Intro
01:21 RL for LLMs
05:53 Policy Gradient
09:23 The Value Function
12:14 Generalized Advantage Estimate
17:17 End-to-end Training Algorithm
18:23 Importance Sampling
20:02 PPO Clipping
21:36 Outro

Special thanks to Anish Tondwalkar for discussing some of these concepts with me.

Note: At 21:10, A_t should have been inside the min. Thanks @t.w.7065 for catching this.
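To make the correction about A_t concrete, here is a minimal sketch of the clipped surrogate objective as written in the PPO paper: min(r_t * A_t, clip(r_t, 1 - eps, 1 + eps) * A_t), with the advantage inside the min. It is not taken from the video or from any of the linked implementations. The function name `ppo_clip_objective`, its argument names, and the default `clip_eps=0.2` are assumptions made for illustration, and it runs on plain Python lists rather than tensors.

```python
import math


def ppo_clip_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (a quantity to maximize).

    logp_new / logp_old: per-token log-probabilities of the sampled actions
    under the current policy and under the policy that generated the data.
    advantages: per-token advantage estimates A_t (for example from GAE).
    All names here are hypothetical and chosen for this sketch.
    """
    terms = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        # Importance-sampling ratio r_t = pi_new(a_t | s_t) / pi_old(a_t | s_t).
        ratio = math.exp(lp_new - lp_old)
        # Keep the ratio within [1 - eps, 1 + eps].
        clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # A_t multiplies BOTH candidates, so it sits inside the min.
        terms.append(min(ratio * adv, clipped_ratio * adv))
    return sum(terms) / len(terms)
```

Keeping A_t inside the min matters because the sign of the advantage decides which branch is smaller. With a ratio of 1.5 and eps = 0.2, a positive advantage selects the clipped term (1.2 * A_t), while a negative advantage selects the unclipped term (1.5 * A_t), so the penalty for making a bad action more likely is not capped.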

