Defending against AI jailbreaks
Anthropic
@anthropic-ai
We’re an AI safety and research company. Talk to our AI assistant Claude on claude.com. Download Claude on desktop, iOS, or Android. We believe AI will have a vast impact on the world. Anthropic is dedicated to building systems that people can rely on and generating research about the opportunities and risks of AI.
Video Description
Anthropic researchers Mrinank Sharma, Jerry Wei, Ethan Perez, and Meg Tong discuss a system based on Constitutional Classifiers that guards models against jailbreaks. Read more: https://www.anthropic.com/news/constitutional-classifiers

Chapters:
0:00 Introduction
0:39 Defining jailbreaks and their importance
3:35 Universal jailbreaks
10:24 The Swiss cheese model for safety
11:25 Explaining Constitutional Classifiers
14:11 Ensuring model helpfulness
17:30 Understanding the constitution and synthetic data
19:00 Flexibility of the constitutional approach
24:15 Origins of the constitutional classifiers approach
32:24 Progress on robustness
38:47 The public demo: Purpose, setup
47:42 Understanding whether the approach is safe in practice
54:05 The public demo: Approaches people tried to bypass classifiers
56:14 Benefits of the classifier approach for Claude users
1:00:18 Memorable moments from the project
1:08:20 Differences in approach between this project and other research
1:11:11 The evolution of AI safety research
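For readers who want a concrete picture of the setup the video describes, here is a minimal, hypothetical sketch of classifier-based input/output screening in Python. This is not Anthropic's implementation: the names (`input_classifier`, `output_classifier`, `guarded_generate`) and the keyword heuristics standing in for trained classifiers are illustrative assumptions only.

```python
# Illustrative sketch only: a guarded generation loop where classifiers
# screen both the user's prompt and the model's completion. In the system
# discussed in the video, these classifiers are trained models (derived from
# a natural-language constitution via synthetic data), not keyword checks.

from typing import Callable


def input_classifier(prompt: str) -> bool:
    """Hypothetical stand-in: return True if the prompt looks like a jailbreak attempt."""
    # Placeholder heuristic; a real classifier would be a trained model.
    return "ignore previous instructions" in prompt.lower()


def output_classifier(completion: str) -> bool:
    """Hypothetical stand-in: return True if the completion contains disallowed content."""
    # Placeholder heuristic; a real classifier would score the full completion.
    return "disallowed" in completion.lower()


def guarded_generate(model: Callable[[str], str], prompt: str) -> str:
    """Run the model behind an input filter and an output filter.

    Each filter is one layer in a 'Swiss cheese' stack of overlapping
    defenses: no single layer is perfect, but a jailbreak must pass
    through every layer to succeed.
    """
    if input_classifier(prompt):
        return "Request blocked by input classifier."
    completion = model(prompt)
    if output_classifier(completion):
        return "Response blocked by output classifier."
    return completion


# Example usage with a stub model:
if __name__ == "__main__":
    stub_model = lambda p: "Sure, here is a bread recipe."
    print(guarded_generate(stub_model, "How do I bake bread?"))
```

As the chapter list notes, the real system generates the classifiers' training data synthetically from a constitution (17:30), which is what makes the approach flexible (19:00): updating the constitution and retraining the classifiers changes the policy without retraining the underlying model.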