Quantizing LLMs - How & Why (8-Bit, 4-Bit, GGUF & More)

Adam Lucek · November 18, 2024


Video Description

Quantizing models for maximum efficiency gains!

Resources:
- Model Quantized: https://huggingface.co/AdamLucek/Orpo-Llama-3.2-1B-15k
- Quantization Colab Notebook: https://colab.research.google.com/drive/1NlHlHU-fdubXcuZ08eb7zpaidF7388r6?usp=sharing
- HF 8-Bit Blog: https://huggingface.co/blog/hf-bitsandbytes-integration
- HF 4-Bit Blog: https://huggingface.co/blog/4bit-transformers-bitsandbytes
- GGUF Overview: https://huggingface.co/docs/hub/gguf
- Llama.cpp: https://github.com/ggerganov/llama.cpp/tree/master
- GGUF Model Made in Video: https://huggingface.co/AdamLucek/Orpo-Llama-3.2-1B-15k-Q4_K_M-GGUF
- Maxime Labonne Quantization Blog: https://mlabonne.github.io/blog/posts/Introduction_to_Weight_Quantization.html

Chapters:
00:00 - What Is Quantization?
02:19 - How Are Weights Stored?
03:22 - What Is Binary?
06:26 - What Are Floating Point Numbers?
10:38 - What Data Types Are Used for LLMs?
12:02 - Does Quantization Negatively Affect LLMs?
15:08 - Code: Quantizing with BitsAndBytes
17:34 - Code: Comparing Quantized Layers
18:36 - Code: Comparing Text Generation
21:57 - Code: GGUF Quantization Overview
23:41 - Code: Quantizing with Llama.cpp
25:44 - Final Thoughts on Quantization

#ai #coding #deeplearning
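The core idea behind 8-bit quantization (covered in the Maxime Labonne blog linked above) is to rescale floating-point weights into the int8 range [-127, 127] and store only the integers plus one scale factor. A minimal plain-Python sketch of absmax quantization, using illustrative weight values (not taken from the video):

```python
def absmax_quantize(weights):
    """Quantize float weights to int8 via absmax scaling.

    The scale maps the largest-magnitude weight to 127,
    the top of the symmetric int8 range.
    """
    scale = 127 / max(abs(w) for w in weights)
    q = [round(w * scale) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map the stored int8 values back to approximate floats."""
    return [v / scale for v in q]


# Illustrative weights; a real layer holds millions of these.
weights = [0.5, -1.2, 0.03, 2.4]
q, scale = absmax_quantize(weights)
approx = dequantize(q, scale)
# Rounding moves each value by at most half a quantization
# step, so every recovered weight is within 1 / (2 * scale)
# of the original.
```

Real libraries like bitsandbytes apply this per block of weights rather than per tensor, which keeps a single outlier from inflating the scale for everything else.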