Reinforcement Learning from Human Feedback explained with math derivations and the PyTorch code.

Umar Jamil • February 27, 2024

Video Description

In this video, I will explain Reinforcement Learning from Human Feedback (RLHF), which is used to align models such as ChatGPT, among others. I will start by introducing how Language Models work and what we mean by AI alignment.

In the second part of the video, I will derive the Policy Gradient Optimization algorithm from first principles and explain the problems with the gradient calculation. I will describe the techniques used to reduce the variance of the estimator (by introducing the baseline) and how Off-Policy learning can make the training tractable. I will also describe how to build the reward model and explain its loss function.

To calculate the gradient of the policy, we need the log probabilities of the state-action pairs (the trajectories), the value function, the rewards, and the advantage terms (through Generalized Advantage Estimation). I will explain every step visually.

After explaining Policy Gradient Optimization, I will introduce the Proximal Policy Optimization (PPO) algorithm and its loss function, explaining all the details, including the loss of the value head and the entropy. In the last part of the video, I go through the implementation of RLHF/PPO, explaining the entire process line by line. For every mathematical formula, I will always give a visual intuition to help those who lack the mathematical background.

PPO paper: Schulman, J., Wolski, F., Dhariwal, P., Radford, A. and Klimov, O., 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
- https://arxiv.org/abs/1707.06347

InstructGPT paper: Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A. and Schulman, J., 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35, pp.27730-27744.
- https://arxiv.org/abs/2203.02155

Generalized Advantage Estimation paper: Schulman, J., Moritz, P., Levine, S., Jordan, M. and Abbeel, P., 2015. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438.
- https://arxiv.org/abs/1506.02438

Slides PDF and commented code: https://github.com/hkproj/rlhf-ppo

Chapters
00:00:00 - Introduction
00:03:52 - Intro to Language Models
00:05:53 - AI Alignment
00:06:48 - Intro to RL
00:09:44 - RL for Language Models
00:11:01 - Reward model
00:20:39 - Trajectories (RL)
00:29:33 - Trajectories (Language Models)
00:31:29 - Policy Gradient Optimization
00:41:36 - REINFORCE algorithm
00:44:08 - REINFORCE algorithm (Language Models)
00:45:15 - Calculating the log probabilities
00:49:15 - Calculating the rewards
00:50:42 - Problems with Gradient Policy Optimization: variance
00:56:00 - Rewards to go
00:59:19 - Baseline
01:02:49 - Value function estimation
01:04:30 - Advantage function
01:10:54 - Generalized Advantage Estimation
01:19:50 - Advantage function (Language Models)
01:21:59 - Problems with Gradient Policy Optimization: sampling
01:24:08 - Importance Sampling
01:27:56 - Off-Policy Learning
01:33:02 - Proximal Policy Optimization (loss)
01:40:59 - Reward hacking (KL divergence)
01:43:56 - Code walkthrough
02:13:26 - Conclusion
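
As a rough illustration of three quantities the description mentions (the reward model loss, Generalized Advantage Estimation, and the PPO clipped loss), here is a minimal sketch in plain Python. It is not the code from the video or from the linked repository. The function names, the default hyperparameters (gamma, lam, clip_eps), and the convention that the value after the final step is 0 are assumptions made for illustration. The reward model loss is written in the pairwise form used in the InstructGPT paper, and the value-head and entropy terms of the full PPO loss are left out.

```python
import math


def reward_model_loss(r_chosen, r_rejected):
    """Pairwise preference loss: -log(sigmoid(r_chosen - r_rejected)).

    The loss is small when the reward model scores the preferred
    answer higher than the rejected one.
    """
    diff = r_chosen - r_rejected
    # -log(sigmoid(diff)) == softplus(-diff), written in a stable form.
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.

    rewards[t] and values[t] refer to step t. The value after the
    final step is assumed to be 0 (the trajectory has ended).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the next step's estimate.
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        # One-step temporal-difference error.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of the TD errors from step t onward.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def ppo_clipped_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """Clipped surrogate objective of PPO, returned as a loss to minimize."""
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logprobs, old_logprobs, advantages):
        # Probability ratio between the current and the sampling policy.
        ratio = math.exp(new_lp - old_lp)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    return -total / len(advantages)


if __name__ == "__main__":
    # Toy trajectory: a single reward arrives at the last step.
    adv = gae_advantages([0.0, 0.0, 1.0], [0.5, 0.5, 0.5], gamma=1.0, lam=1.0)
    print("advantages:", adv)
    print("ppo loss:", ppo_clipped_loss([-1.0] * 3, [-1.0] * 3, adv))
    print("reward model loss:", reward_model_loss(2.0, 0.5))
```

In a language model setting, each step would correspond to one generated token, and the log probabilities would come from the policy being trained and from the frozen copy that sampled the trajectory. With gamma and lam both set to 1, the advantage reduces to the reward-to-go minus the value estimate, which connects it to the baseline idea covered earlier in the video.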
