Mistral / Mixtral Explained: Sliding Window Attention, Sparse Mixture of Experts, Rolling Buffer
Video Description
In this video I introduce all the innovations in the Mistral 7B and Mixtral 8x7B models: Sliding Window Attention, the KV-Cache with Rolling Buffer, Pre-Fill and Chunking, and the Sparse Mixture of Experts (SMoE). I also guide you through the most difficult parts of the code: Model Sharding and the use of the xformers library to compute attention for multiple prompts packed into a single sequence. In particular, I show the attention computed using BlockDiagonalCausalMask, BlockDiagonalMask and BlockDiagonalCausalWithOffsetPaddedKeysMask (see the code sketch below). I also show why Sliding Window Attention lets a token "attend" to tokens outside the attention window, by linking it to the concept of the Receptive Field, typical of Convolutional Neural Networks (CNNs), and I prove it mathematically (a short numerical sketch follows below). When introducing Model Sharding, I also talk about Pipeline Parallelism, because the official Mistral repository refers to micro-batching.

A copy of the Mistral code, commented and annotated by me (especially the most difficult parts): https://github.com/hkproj/mistral-src-commented
Slides (PDF) and Python notebooks: https://github.com/hkproj/mistral-llm-notes
Prerequisite for watching this video: https://www.youtube.com/watch?v=bCz4OMemCcA

Other material for better understanding Mistral:
Grouped Query Attention, Rotary Positional Encodings, RMS Normalization: https://youtu.be/Mn_9W1nCFLo
Gradient Accumulation: https://youtu.be/toUSzwR0EV8

Chapters
00:00:00 - Introduction
00:02:09 - Transformer vs Mistral
00:05:35 - Mistral 7B vs Mixtral 8x7B
00:08:25 - Sliding Window Attention
00:33:44 - KV-Cache with Rolling Buffer Cache
00:49:27 - Pre-Fill and Chunking
00:57:00 - Sparse Mixture of Experts (SMoE)
01:04:22 - Model Sharding
01:06:14 - Pipeline Parallelism
01:11:11 - xformers (block attention)
01:24:07 - Conclusion
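
As a quick back-of-the-envelope check of the receptive-field claim: with a sliding window of size W, each layer lets a token pull information from at most W earlier positions, so after L layers information can have travelled roughly W·L positions back, even though no single attention matrix ever spans more than W tokens. The snippet below is an illustrative sketch (not code from the video); the window size 4096 and the 32 layers are the Mistral 7B values reported in the paper.

```python
# Sketch: how far back information can flow with sliding window attention.
WINDOW = 4096   # each layer attends to at most WINDOW previous tokens (Mistral 7B value)
LAYERS = 32     # number of transformer layers in Mistral 7B

# At layer k, a token's hidden state can depend on tokens up to k * WINDOW
# positions behind it: the "receptive field", exactly as in stacked CNN layers.
for layer in (1, 2, LAYERS):
    reach = layer * WINDOW
    print(f"after layer {layer:2d}: a token can indirectly attend ~{reach} positions back")

# after layer 32 the theoretical span is 32 * 4096 = 131072 positions,
# even though each individual attention matrix only spans 4096 tokens.
```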
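
And a minimal sketch of the packed-prompt attention with a block-diagonal causal mask in xformers: several prompts are concatenated into one long sequence, and the mask keeps each prompt from attending to the others. The prompt lengths, head counts and window size below are made-up illustrative values, a CUDA device with fp16 tensors is assumed, and the make_local_attention step is how a sliding window is typically layered on top of the block-diagonal mask, not necessarily the exact reference code.

```python
import torch
from xformers.ops import memory_efficient_attention
from xformers.ops.fmha.attn_bias import BlockDiagonalCausalMask

# Three prompts of different (hypothetical) lengths, packed back-to-back.
seqlens = [5, 3, 7]
total, n_heads, head_dim = sum(seqlens), 8, 64

# memory_efficient_attention expects (batch, tokens, heads, head_dim);
# with block-diagonal masks the batch dimension must be 1 and all prompts
# are packed along the token dimension.
q = torch.randn(1, total, n_heads, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Block-diagonal causal mask: token i of prompt p attends only to tokens
# 0..i of the same prompt p, never to tokens of the other packed prompts.
mask = BlockDiagonalCausalMask.from_seqlens(seqlens)

# Optionally restrict each block further to a sliding window (Mistral-style).
mask = mask.make_local_attention(4)

out = memory_efficient_attention(q, k, v, attn_bias=mask)
print(out.shape)  # torch.Size([1, 15, 8, 64])
```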