Proximal Policy Optimization (PPO) for LLMs Explained Intuitively
Video Description
In this video, I break down Proximal Policy Optimization (PPO) from first principles, without assuming prior knowledge of Reinforcement Learning. By the end, you’ll understand the core RL building blocks that led to PPO, including:
🔵 Policy Gradient
🔵 Actor-Critic Models
🔵 The Value Function
🔵 The Generalized Advantage Estimate

In the LLM world, PPO was used to train reasoning models like OpenAI's o1/o3, and presumably Claude 3.7, Grok 3, etc. It’s the backbone of Reinforcement Learning from Human Feedback (RLHF), which helps align AI models with human preferences, and of Reinforcement Learning with Verifiable Rewards (RLVR), which gives LLMs reasoning abilities.

Papers:
- PPO paper: https://arxiv.org/pdf/1707.06347
- GAE paper: https://arxiv.org/pdf/1506.02438
- TRPO paper: https://arxiv.org/pdf/1502.05477

Well-written blogposts:
- https://danieltakeshi.github.io/2017/04/02/notes-on-the-generalized-advantage-estimation-paper/
- https://huggingface.co/blog/NormalUhr/rlhf-pipeline
- https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/

Implementations:
- (Original) OpenAI Baselines: https://github.com/openai/baselines/blob/ea25b9e8b234e6ee1bca43083f8f3cf974143998/baselines/ppo2
- Hugging Face: https://github.com/huggingface/trl/blob/main/trl/trainer/ppo_trainer.py
- Hugging Face docs: https://huggingface.co/docs/trl/main/en/ppo_trainer

Mother of all RL books (Sutton & Barto): http://incompleteideas.net/book/RLbook2020.pdf

00:00 Intro
01:21 RL for LLMs
05:53 Policy Gradient
09:23 The Value Function
12:14 Generalized Advantage Estimate
17:17 End-to-end Training Algorithm
18:23 Importance Sampling
20:02 PPO Clipping
21:36 Outro

Special thanks to Anish Tondwalkar for discussing some of these concepts with me.

Note: At 21:10, A_t should have been inside the min. Thanks @t.w.7065 for catching this.
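
To make the correction at 21:10 concrete, here is a minimal sketch of the per-token clipped objective as it appears in the PPO paper linked above: min(r_t * A_t, clip(r_t, 1 - eps, 1 + eps) * A_t), where r_t is the probability ratio between the new and old policy. The function name `ppo_clip_objective`, its argument names, and the default `eps=0.2` are illustrative choices for this sketch and are not taken from the video.

```python
import math

def ppo_clip_objective(logp_new, logp_old, advantage, eps=0.2):
    """Clipped surrogate for one token (a quantity to maximize).

    logp_new / logp_old: log-probabilities of the sampled token under
    the current and the old policy. eps is the clipping range
    (0.2 is an assumed default for this sketch).
    """
    # Importance-sampling ratio r_t = pi_new(a|s) / pi_old(a|s)
    ratio = math.exp(logp_new - logp_old)
    # Keep the ratio inside [1 - eps, 1 + eps]
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    # A_t multiplies BOTH terms, so it sits inside the min
    return min(ratio * advantage, clipped_ratio * advantage)
```

Because the advantage is inside the min, the sign of A_t decides which term wins: with a positive advantage the objective stops growing once the ratio exceeds 1 + eps, while with a negative advantage a ratio above 1 + eps is still penalized in full.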