Defending against AI jailbreaks
Anthropic
@anthropic-ai
We’re an AI safety and research company. Talk to our AI assistant Claude on claude.com. Download Claude on desktop, iOS, or Android. We believe AI will have a vast impact on the world. Anthropic is dedicated to building systems that people can rely on and generating research about the opportunities and risks of AI.
Video Description
Anthropic researchers Mrinank Sharma, Jerry Wei, Ethan Perez, and Meg Tong discuss a system based on Constitutional Classifiers that guards models against jailbreaks. Read more: https://www.anthropic.com/news/constitutional-classifiers

Chapters:
0:00 Introduction
0:39 Defining jailbreaks and their importance
3:35 Universal jailbreaks
10:24 The Swiss cheese model for safety
11:25 Explaining Constitutional Classifiers
14:11 Ensuring model helpfulness
17:30 Understanding the constitution and synthetic data
19:00 Flexibility of the constitutional approach
24:15 Origins of the constitutional classifiers approach
32:24 Progress on robustness
38:47 The public demo: Purpose, setup
47:42 Understanding whether the approach is safe in practice
54:05 The public demo: Approaches people tried to bypass classifiers
56:14 Benefits of the classifier approach for Claude users
1:00:18 Memorable moments from the project
1:08:20 Differences in approach between this project and other research
1:11:11 The evolution of AI safety research
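For readers who want a concrete picture of the setup the video describes, here is a minimal, hypothetical sketch of classifier-based input/output screening in Python. This is not Anthropic's implementation: the names (`input_classifier`, `output_classifier`, `guarded_generate`) and the keyword heuristics standing in for trained classifiers are illustrative assumptions only.

```python
# Illustrative sketch only: a guarded generation loop where classifiers
# screen both the user's prompt and the model's completion. In the system
# discussed in the video, these classifiers are trained models (derived from
# a natural-language constitution via synthetic data), not keyword checks.

from typing import Callable


def input_classifier(prompt: str) -> bool:
    """Hypothetical stand-in: return True if the prompt looks like a jailbreak attempt."""
    # Placeholder heuristic; a real classifier would be a trained model.
    return "ignore previous instructions" in prompt.lower()


def output_classifier(completion: str) -> bool:
    """Hypothetical stand-in: return True if the completion contains disallowed content."""
    # Placeholder heuristic; a real classifier would score the full completion.
    return "disallowed" in completion.lower()


def guarded_generate(model: Callable[[str], str], prompt: str) -> str:
    """Run the model behind an input filter and an output filter.

    Each filter is one layer in a 'Swiss cheese' stack of overlapping
    defenses: no single layer is perfect, but a jailbreak must pass
    through every layer to succeed.
    """
    if input_classifier(prompt):
        return "Request blocked by input classifier."
    completion = model(prompt)
    if output_classifier(completion):
        return "Response blocked by output classifier."
    return completion


# Example usage with a stub model:
if __name__ == "__main__":
    stub_model = lambda p: "Sure, here is a bread recipe."
    print(guarded_generate(stub_model, "How do I bake bread?"))
```

As the chapter list notes, the real system generates the classifiers' training data synthetically from a constitution (17:30), which is what makes the approach flexible (19:00): updating the constitution and retraining the classifiers changes the policy without retraining the underlying model.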