Researchers Caught Their AI Model Trying to Escape

Species | Documenting AGI
106.0K views · 9 months ago

Description

If this resonated with you, here's how you can help today: https://campaign.controlai.com/take-action

Sources:
- Apollo Research - "Frontier Models are Capable of In-context Scheming" https://arxiv.org/pdf/2412.04984
- Nobel laureate Geoffrey Hinton says there is evidence that AIs can be deliberately and intentionally deceptive https://www.youtube.com/watch?v=b_DUft-BdIE
- Anthropic - "Alignment Faking in Large Language Models" https://assets.anthropic.com/m/983c85a201a962f/original/Alignment-Faking-in-Large-Language-Models-full-paper.pdf
- Exclusive: New Research Shows AI Strategically Lying | TIME https://time.com/7202784/ai-research-strategic-lying/
- OpenAI's o1 model sure tries to deceive humans a lot | TechCrunch https://techcrunch.com/2024/12/05/openais-o1-model-sure-tries-to-deceive-humans-a-lot/
- OpenAI's new model is better at reasoning and, occasionally, deceiving | The Verge https://www.theverge.com/2024/9/17/24243884/openai-o1-model-research-safety-alignment
- OpenAI's o1 and other frontier AI models engage in scheming | Axios https://www.axios.com/2024/12/13/ai-reasoning-models-scheme-skills
- New Anthropic study shows AI really doesn't want to be forced to change its views | TechCrunch https://techcrunch.com/2024/12/18/new-anthropic-study-shows-ai-really-doesnt-want-to-be-forced-to-change-its-views/
- Apollo Research - "Towards evaluations-based safety cases for AI scheming" https://arxiv.org/pdf/2411.03336
- Joe Carlsmith - "Scheming AIs" https://arxiv.org/pdf/2311.08379
- "Optimal Policies Tend to Seek Power" https://arxiv.org/abs/1912.01683
- When AI Thinks It Will Lose, It Sometimes Cheats, Study Finds | TIME https://time.com/7259395/ai-chess-cheating-palisade-research/
- Palisade Research - "Demonstrating specification gaming in reasoning models" https://arxiv.org/abs/2502.13295
- Claude Fights Back - by Scott Alexander - Astral Codex Ten https://www.astralcodexten.com/p/claude-fights-back
- Takes on "Alignment Faking in Large Language Models" - Joe Carlsmith https://joecarlsmith.com/2024/12/18/takes-on-alignment-faking-in-large-language-models
- Andrew Ng vs Yoshua Bengio | Davos 2025 https://www.youtube.com/watch?v=Y1BUaLo67ac
- Jeffrey Ladish on unprompted specification gaming: https://x.com/JeffLadish/status/1872805453224448208
- Prof. Stuart Russell on California Live: https://youtu.be/QEGjCcU0FLs?si=pHcBZbGpj8Rxri5n&t=2694
- Eric Schmidt on ABC News https://abcnews.go.com/ThisWeek/video/1-1-eric-schmidt-116804931

This video took me a month to make, and I'm a small channel, so subscribing really helps out :)
