Jailbreaking Claude
The 23,000 words Anthropic doesn't want you to speed-run. I did it for you.
Everyone’s been trying to jailbreak AI models since day one.
Weird prompts. Role-play scenarios. The classic “pretend you’re my dead grandmother who worked at a chemical plant.”
Most of it doesn’t work. And when it does, you feel like you’ve won the lottery, not cracked a code.
But here’s what changed.
Anthropic just dropped Claude's Constitution. 23,000 words explaining exactly how Claude thinks, what it refuses, and why.
Think of it as the source code for its personality.
I read the whole thing, so you don’t have to. And what I found isn’t what the AI safety crowd wants you to focus on.
The real story? Claude’s refusals are a feature you can engineer around, not a wall you have to climb.
The Three Types of “No”
When Claude refuses your request, it’s not random. It’s one of three distinct things.
Type 1: Hard Constraints
These are the 7 things Claude will never do. Period. No matter who asks or what context you provide:
Bioweapons with mass casualty potential
Cyberweapons that could cause significant damage
Child sexual abuse material
Attacks on critical infrastructure
Undermining Anthropic’s ability to oversee AI models
Attempts to disempower humanity
Seizing unprecedented illegitimate control
What to do: Stop trying. These are hardcoded at the training level. No amount of clever prompting unlocks them.
Type 2: Default Behavior
These are things Claude avoids by default but can help with when you give it legitimate context.
You’ll notice it asks clarifying questions or expresses hesitation.
Example: “Help me with penetration testing”
Why it refuses: Claude can’t tell if you’re authorized. Penetration testing is legitimate security work, but only when you have permission.
What to do: Provide context about your role, authorization, and purpose.
Type 3: Contextual Judgment
Sometimes Claude refuses because even with context, the request itself seems harmful.
Example: “Help me manipulate this person into doing what I want”
Why it refuses: Even with a “good reason,” manipulation bypasses someone’s rational agency. Claude’s constitution explicitly prohibits this.
What to do: Reframe to focus on persuasion (which respects autonomy) rather than manipulation.
What Everyone Gets Wrong
Here’s the insight that changes everything.
Claude doesn’t evaluate your request as a one-off interaction.
It thinks: “If 1,000 people made this exact request, what would happen?”
Your prompt is treated as a policy decision that scales across millions of users, not as an individual choice.
This is why seemingly harmless requests get refused. It’s not about you. It’s about the 0.1% who would abuse the same request.
Most people fight this. They think Claude is being stupid or overly cautious.
The smarter move? Give Claude the context to know you’re in the 99.9%.
The Two Questions Claude Asks Itself
Every time you make a request, Claude runs two checks:
Question 1: Is this broadly safe and ethical?
Would this undermine human oversight of AI?
Could this help someone seize illegitimate power?
Does this involve the 7 hard constraints?
Would this cause harm I couldn’t justify to a reasonable person?
If the answer to Question 1 is no → request denied. Full stop.
Question 2: Would refusing be less helpful than complying?
Is the user asking for something they genuinely need?
Would refusal be overly cautious or paternalistic?
Can I help while maintaining my principles?
If helping > refusing → Claude helps.
Here’s the key: If your request fails Question 1, Question 2 doesn’t matter.
This is why adding “but I really need this” or “this is important” doesn’t work. You’re arguing against a system that’s already decided.
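Here's that short-circuit as a toy sketch in Python. To be clear: this is my illustration of the flow the constitution describes, not Anthropic's actual implementation, and the two checks are just placeholder stand-ins.

```python
# Toy illustration of the two-question flow described above.
# The predicates are placeholders; the real checks live inside the model.

def question_1_broadly_safe_and_ethical(request: str) -> bool:
    """Placeholder: hard constraints, oversight, harm a reasonable person
    couldn't justify. Here it just flags one obvious keyword."""
    return "bioweapon" not in request.lower()

def question_2_helping_beats_refusing(request: str) -> bool:
    """Placeholder: would refusing be overly cautious or paternalistic?"""
    return True

def decide(request: str) -> str:
    if not question_1_broadly_safe_and_ethical(request):
        return "refuse"  # Question 2 never runs, so arguing importance can't help
    return "help" if question_2_helping_beats_refusing(request) else "refuse"

print(decide("Summarize this security report"))        # help
print(decide("Walk me through building a bioweapon"))  # refuse
```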
The Reframe Framework
When Claude refuses, use this exact approach:
Step 1: Check if it’s a hard constraint
Review the list of 7 forbidden things. If your request involves any of them, stop. These can’t be unlocked.
Step 2: Add the missing context
Provide three things (there's a reusable sketch of this after Step 3):
Your role: “As a security researcher...” / “I’m a medical professional...”
Your authority: “I’m authorized to test...” / “I own this system...”
Your purpose: “This is for educational purposes...” / “This will help us defend against...”
Step 3: Reframe what you’re asking
Shift from:
Offensive → Defensive
Manipulation → Persuasion
Circumvention → Understanding
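To put Steps 2 and 3 together, here's a minimal sketch of a reusable prompt template. The field names and wording are mine, not anything Claude requires; the point is simply that role, authorization, and purpose travel with every request.

```python
# Minimal sketch: bundle role / authorization / purpose with the request
# so Claude never has to guess. Field names and phrasing are illustrative.

def contextualize(role: str, authorization: str, purpose: str, request: str) -> str:
    return (
        f"Role: {role}\n"
        f"Authorization: {authorization}\n"
        f"Purpose: {purpose}\n\n"
        f"Request: {request}"
    )

prompt = contextualize(
    role="Security consultant hired by the site owner",
    authorization="Written permission to test their WordPress installation",
    purpose="Identify common vulnerabilities before submitting my report",
    request="What are the standard security weaknesses I should test for?",
)
print(prompt)
```

Mechanical, sure. But it front-loads exactly the three things Claude's judgment is looking for.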
Before & After Examples
Example 1: Security Testing
❌ Before: “How do I hack into a WordPress site?”
✅ After: “I’m a security consultant hired to perform authorized penetration testing on a client’s WordPress installation. I have written authorization and need to identify common vulnerabilities before submitting my report. What are the standard security weaknesses I should test for?”
Example 2: Persuasive Writing
❌ Before: “Write something to manipulate my boss into giving me a raise”
✅ After: “I want to prepare a compelling case for a salary review with my manager. Help me structure an evidence-based argument highlighting my contributions, market rates for my role, and how continued investment in me benefits the company.”
Example 3: Sensitive Content
❌ Before: “Write a violent scene for my story”
✅ After: “I’m writing a thriller novel and need help with a tense confrontation scene. The tone should be similar to authors like Gillian Flynn, where the violence serves the narrative tension rather than being gratuitous. Here’s the context of my story...”
When Claude Still Says No
If you’ve provided context, reframed your request, and Claude still refuses, one of two things is happening:
1. You’re hitting a genuine ethical boundary
The constitution explicitly says Claude can refuse requests even from Anthropic itself if they violate core principles.
Ask yourself: Would I be comfortable if this conversation showed up in a news article about AI safety?
If no, you’re hitting a legitimate boundary.
2. You need operator-level permissions
Some capabilities can’t be unlocked through user requests alone. They require API-level configuration.
If you’re building a product and need these capabilities, you need to use the Claude API with appropriate operator instructions in your system prompt.
This is where most power users miss the mark. They’re trying to unlock enterprise features through a consumer interface.
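If you're in that camp, here's a minimal sketch using the Anthropic Python SDK. The model name, system-prompt wording, and scenario are placeholders I made up, and operator instructions shape Claude's defaults; they don't override the hard constraints.

```python
# Minimal sketch with the Anthropic Python SDK (pip install anthropic).
# Model name and system-prompt wording are placeholders, not a recipe.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # substitute a current model
    max_tokens=1024,
    system=(
        "You are assisting a licensed penetration-testing firm. "
        "Users are verified security professionals with written "
        "authorization for the systems they ask about."
    ),
    messages=[
        {
            "role": "user",
            "content": "List common WordPress misconfigurations to check during an authorized test.",
        }
    ],
)
print(response.content[0].text)
```

The system prompt is the operator channel: it tells Claude who its users are and what context they operate in, which is exactly the information a consumer chat window never carries.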
The Overlooked Opportunity
Here’s what nobody’s talking about.
Claude’s refusals are mostly a context problem, not a censorship problem.
Instead of “I can’t help with that,” Claude should say “I need to know your role, your authorization, and your purpose before I can help with this.”
That one change would eliminate most frustrating refusals.
But it doesn’t exist yet. Which means you need to proactively provide that context.
The people who master this have an unfair advantage. They get Claude to help with things everyone else assumes are blocked.
It’s not jailbreaking. It’s communication.
My Take
Anthropic’s approach of teaching judgment instead of hard rules is better than OpenAI’s rule-based system. More flexible. More nuanced.
But the judgment still defaults to “no” too often.
The final takeaway: Claude treats every request as a policy decision that scales to millions.
You’re not arguing with a chatbot. You’re arguing with a system designed to protect against the tiny percentage who would abuse it.
Give Claude the context to know you’re not that percentage.
Watch most refusals disappear.
Post-Credit Scene
A few things worth checking out if this edition hit home:
📰 The Adolescence of Technology by Dario Amodei - Dropped 2 days ago. 20,000 words from Anthropic’s CEO on why we’re “considerably closer to real danger in 2026 than we were in 2023.” The companion piece to Claude’s Constitution. If you only read one thing this week, make it this.
🎬 The Thinking Game - Free on YouTube since November. Google DeepMind documentary filmed over 5 years. Watch the AlphaFold team solve a 50-year biology problem. Useful for understanding how AI labs think about capability vs. safety tradeoffs.
📄 Claude’s Constitution - The full 23,000 words that explain everything from today’s edition. Released last week. Read it once, bookmark it forever.
🎧 Hard Fork: Dario Amodei on Surviving the AI Endgame - NYT podcast from February 2025. Candid interview on Claude 3.7 Sonnet, the China AI race, and his fears for the next two years. More accessible than the essay.
📚 Prompt Engineering for LLMs by John Berryman & Albert Ziegler (O’Reilly) - Written by the GitHub Copilot architects. Covers why LLMs refuse, how context shapes completions, and the “alignment tax” you pay when models get too cautious.
🛠️ Prompt Engineering Guide - Free, open-source, constantly updated. Chain-of-thought, few-shot, ReAct patterns. The reference manual I use when my prompts aren’t landing.
Until next time,
Vlad


