High-Quality Text Data Curation for Fine-Tuning LLMs Using Synthetic Data Generation Pipelines
Video Description
In this step-by-step tutorial, you will learn how to curate high-quality text data for fine-tuning LLMs using synthetic data generation (SDG) pipelines in NVIDIA NeMo Curator. NeMo Curator improves generative AI model accuracy by processing text, image, and video data at scale for training and customization. It also provides prebuilt pipelines for generating synthetic data to customize and evaluate generative AI systems.

This video walks you through installing NeMo Curator, downloading and loading a sample dataset from Hugging Face, and augmenting this dataset with high-quality data generated using the SDG pipeline. You’ll see practical demonstrations of using NeMo Curator built-in functions to identify, filter, and remove URLs, Unicode characters, and duplicate data based on semantic meaning. Synthetic data is then generated, and low-quality data is removed using scores generated with the reward model, ensuring that the dataset is of high quality and ready for fine-tuning generative AI models.

📥 Access the tutorial: https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/peft-curation-with-sdg
📖 Learn more about NeMo Curator: https://developer.nvidia.com/nemo-curator
⭐️ Don’t forget to star the NeMo Curator GitHub repository to receive regular updates on newly released features and tutorials and to contribute your code to the repository: https://github.com/NVIDIA/NeMo-Curator

00:00 - Introduction
01:17 - Prerequisites
01:33 - Diving Into the Code
01:56 - Run the Data Curation Pipeline
02:25 - Filtering and Cleaning
03:13 - Synthetic Data Generation
04:25 - Results
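
To make the filtering and cleaning stage described above more concrete, here is a minimal sketch in plain Python using only the standard library. It is not the NeMo Curator API and not the code from the tutorial: the function names (clean_text, drop_duplicates, filter_by_score, curate), the record layout with "text" and "score" keys, and the 0.5 threshold are all assumptions made for illustration. Exact-match deduplication stands in for the semantic deduplication shown in the video, and the scores are assumed to have been assigned already by some reward model.

```python
import re
import unicodedata

# Rough URL matcher; an assumption for illustration, not NeMo Curator's pattern.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")


def clean_text(text: str) -> str:
    """Remove URLs, normalize Unicode, and collapse whitespace."""
    text = URL_PATTERN.sub("", text)
    text = unicodedata.normalize("NFKC", text)
    # Drop control and format characters (for example zero-width spaces),
    # but keep newlines and tabs so they can be collapsed as whitespace.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) not in ("Cc", "Cf")
    )
    return re.sub(r"\s+", " ", text).strip()


def drop_duplicates(records: list[dict]) -> list[dict]:
    """Keep the first record for each case-insensitive exact text match.

    This is a simple stand-in for semantic deduplication, which would
    compare embeddings rather than raw strings.
    """
    seen = set()
    unique = []
    for record in records:
        key = record["text"].casefold()
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def filter_by_score(records: list[dict], threshold: float) -> list[dict]:
    """Keep records whose (hypothetical) reward-model score meets the threshold."""
    return [r for r in records if r["score"] >= threshold]


def curate(records: list[dict], threshold: float = 0.5) -> list[dict]:
    """Clean, deduplicate, then filter by score, in that order."""
    cleaned = [{"text": clean_text(r["text"]), "score": r["score"]} for r in records]
    return filter_by_score(drop_duplicates(cleaned), threshold)


if __name__ == "__main__":
    sample = [
        {"text": "Visit https://example.com now\u200b!", "score": 0.9},
        {"text": "visit now!", "score": 0.8},
        {"text": "Low quality", "score": 0.1},
    ]
    for record in curate(sample):
        print(record)
```

Cleaning runs before deduplication so that records differing only by a URL or an invisible character collapse into one entry, and score filtering runs last so the threshold is applied to the smaller, deduplicated set.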