Coding a Multimodal (Vision) Language Model from scratch in PyTorch with full explanation
Video Description
Full coding of a Multimodal (Vision) Language Model from scratch using only Python and PyTorch. We will be coding the PaliGemma Vision Language Model from scratch while explaining all the concepts behind it:

- Transformer model (Embeddings, Positional Encoding, Multi-Head Attention, Feed Forward Layer, Logits, Softmax)
- Vision Transformer model
- Contrastive learning (CLIP, SigLip)
- Numerical stability of the Softmax and the Cross Entropy Loss
- Rotary Positional Embedding
- Multi-Head Attention
- Grouped Query Attention
- Normalization layers (Batch, Layer and RMS)
- KV-Cache (prefilling and token generation)
- Attention masks (causal and non-causal)
- Weight tying
- Top-P Sampling and Temperature

and much more! All the topics will be explained using materials developed by me. For the Multi-Head Attention I have also drawn all the tensor operations that we perform in the code, so that we have a visual representation of what happens under the hood.

Repository with code and notes: https://github.com/hkproj/pytorch-paligemma

Prerequisites:
1) Transformer explained: https://www.youtube.com/watch?v=bCz4OMemCcA

🚀🚀 Join Writer 🚀🚀
Writer is the full-stack generative AI platform for enterprises. We make it easy for organizations to deploy AI apps and workflows that deliver impactful ROI. We train our own models and we are looking for amazing researchers to join us! Did I already say we have plenty of GPUs? https://writer.com/company/careers/

Chapters
00:00:00 - Introduction
00:05:52 - Contrastive Learning and CLIP
00:16:50 - Numerical stability of the Softmax
00:23:00 - SigLip
00:26:30 - Why a Contrastive Vision Encoder?
00:29:13 - Vision Transformer
00:35:38 - Coding SigLip
00:54:25 - Batch Normalization, Layer Normalization
01:05:28 - Coding SigLip (Encoder)
01:16:12 - Coding SigLip (FFN)
01:20:45 - Multi-Head Attention (Coding + Explanation)
02:15:40 - Coding SigLip
02:18:30 - PaliGemma Architecture review
02:21:19 - PaliGemma input processor
02:40:56 - Coding Gemma
02:43:44 - Weight tying
02:46:20 - Coding Gemma
03:08:54 - KV-Cache (Explanation)
03:33:35 - Coding Gemma
03:52:05 - Image features projection
03:53:17 - Coding Gemma
04:02:45 - RMS Normalization
04:09:50 - Gemma Decoder Layer
04:12:44 - Gemma FFN (MLP)
04:16:02 - Multi-Head Attention (Coding)
04:18:30 - Grouped Query Attention
04:38:35 - Multi-Head Attention (Coding)
04:43:26 - KV-Cache (Coding)
04:47:44 - Multi-Head Attention (Coding)
04:56:00 - Rotary Positional Embedding
05:23:40 - Inference code
05:32:50 - Top-P Sampling
05:40:40 - Inference code
05:43:40 - Conclusion
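One of the listed topics is the numerical stability of the softmax. The standard trick is to subtract the maximum logit before exponentiating, since softmax(x) == softmax(x - c) for any constant c. A minimal pure-Python sketch (the video itself uses PyTorch, where `torch.softmax` applies the same trick internally; the function name here is illustrative):

```python
import math

def stable_softmax(xs):
    # Shift by the max logit so the largest exponent is exp(0) = 1.
    # Naively computing exp(1000.0) would overflow a float64.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Large logits that would overflow without the max-subtraction trick:
probs = stable_softmax([1000.0, 1001.0, 1002.0])
```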
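The video also covers Top-P (nucleus) sampling with temperature. As a rough pure-Python sketch of the idea, assuming the usual formulation (scale logits by temperature, softmax, keep the smallest set of tokens whose cumulative probability reaches p, renormalize, then sample); the function name is illustrative, not from the video's repository:

```python
import math
import random

def top_p_sample(logits, p=0.9, temperature=1.0, rng=random):
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the smallest set of highest-probability tokens with mass >= p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    # Renormalize over the kept tokens and draw one of them.
    kept_mass = sum(probs[i] for i in kept)
    r = rng.random() * kept_mass
    acc = 0.0
    for i in kept:
        acc += probs[i]
        if r <= acc:
            return i
    return kept[-1]
```

With a low temperature or a small p, the nucleus shrinks toward the single most likely token; with p=1.0 it degenerates to plain temperature sampling.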