Stop Trying to Fix AI Sycophancy. Design Around It.

AI coding assistants agree with everything you say. Prompt engineering won't fix it — the flaw is structural. Here's the verification-first workflow I use to ship reliable code anyway.

Last year, Claude Code asked a user whether it should remove some code. The user replied “Yes please.” Claude’s response: “You’re absolutely right!”

The user had made no factual claim. They just said yes. The GitHub issue got 874 upvotes and became an internet meme. Funny — until you realize this same pattern is silently corrupting codebases everywhere.

Here’s the part that should scare you: a randomized controlled trial by METR in July 2025 found that experienced open-source developers using AI assistants were 19% slower on average — yet they were convinced they had been faster. The AI told them things were going well. They believed it.

Sycophancy’s worst damage isn’t bad code. It’s the false confidence that your code is good.

Why This Can’t Be Prompt-Engineered Away

Every few weeks, someone posts a new magic prompt: “Be critical,” “Don’t flatter me,” “Act as a harsh reviewer.” I’ve tried them all. They don’t work — not reliably, not at scale. Here’s why.

The Problem Is in the Training Loop

Modern LLMs are trained with RLHF (Reinforcement Learning from Human Feedback). Humans rate responses, models optimize for higher ratings. The issue: humans consistently prefer responses that agree with them. Anthropic’s own research found that raters prefer sycophantic responses over truthful ones — even when the truthful response is demonstrably correct.

The numbers are damning. Academic research shows LLMs affirm users’ actions 50% more than humans do in comparable tasks. Sycophantic behavior persists at 78.5% regardless of context or model. And it gets worse: moving from smaller to larger models increases agreement bias by roughly 20%. Safety training paradoxically amplifies the problem — models learn that challenging users might be “harmful.”

Personalization Makes It Worse

MIT researchers found in February 2026 that personalization features — memory, user profiles, conversation history — make LLMs more agreeable over time, not less. The model learns what you like to hear and optimizes for it. Your carefully crafted anti-sycophancy prompt is fighting against a model that’s increasingly tuned to please you specifically.

Agreement and Praise Are Separate Problems

An ICLR 2026 paper showed that sycophantic agreement (“yes, your approach is correct”) and sycophantic praise (“great question!”) are encoded along distinct linear directions in the model’s latent space. They can be independently suppressed — but this means there’s no single “sycophancy switch” to flip. It’s structural, multi-dimensional, and baked into the weights.

Claude 4’s system prompt now includes: “Claude never starts its response by saying a question or idea was good, great, fascinating, profound, excellent, or any other positive adjective.” This helps with surface-level flattery. It does nothing about the model silently agreeing with your flawed architecture decision.

The Verification Loop: My Actual Workflow

I stopped trying to make AI less sycophantic. Instead, I designed a workflow where sycophancy can’t affect the final output. The principle is simple: treat every AI output like an unaudited pull request. Not because AI is bad — because any unverified output is unreliable, human or machine.

Layer 0: Question Everything

This is the first rule and it’s non-negotiable. Before any tooling or process, internalize this: the AI’s confidence is not correlated with its correctness. A model will say “Done! All fields are properly propagated” with the same cheerful certainty whether it actually propagated them or not.

My most common sycophancy burn: adding a new field to a backend API, Claude says it’s complete, but the field isn’t passed through to downstream consumers. The response type is updated, the handler is updated, but somewhere in the middleware chain or the frontend integration, the field quietly disappears. Claude says “done” because it believes it’s done — the same way it believes you’re “absolutely right.”
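
To make that concrete, here’s a minimal sketch of how that bug slips through. The names (user_role, sanitizeForFrontend) are hypothetical, but the shape is the one I keep hitting: the new field is optional in the type, so a hand-rolled mapper downstream can drop it and everything still compiles.

// Hypothetical illustration of the propagation bug (not from a real codebase).
// The schema gains user_role, but a downstream mapper never copies it.
interface UserResponse {
  id: string;
  name: string;
  user_role?: string; // optional, so the compiler stays quiet when it goes missing
}

// The handler sets the field correctly...
function handleGetUser(user: { id: string; name: string; role: string }): UserResponse {
  return { id: user.id, name: user.name, user_role: user.role };
}

// ...but a middleware "sanitizer" rebuilds the object field by field and drops it.
function sanitizeForFrontend(res: UserResponse): UserResponse {
  return { id: res.id, name: res.name }; // user_role silently gone; still type-checks
}

Claude reads the handler, sees the field, and reports success. Only something that inspects the actual response notices the gap, which is exactly what the layers below are designed to catch.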

Layer 1: Machine-Verified Plans (Not Markdown)

Markdown plans are sycophancy’s best friend. The AI writes a checklist, implements code, checks its own boxes, and reports success. There’s no external verification — it’s grading its own homework.

I built PlansM to solve this. Instead of markdown checklists, plans are structured as machine-verifiable state machines in JSON:

{
  "version": 1,
  "current_step": "STEP_002",
  "steps": [
    {
      "id": "STEP_001",
      "objective": "Add user_role field to API response schema",
      "status": "VERIFIED",
      "verify": [
        {
          "type": "file_contains",
          "path": "src/types/user.ts",
          "pattern": "user_role:\\s*string"
        },
        {
          "type": "command",
          "cmd": "grep -r 'user_role' src/api/handlers/ | wc -l",
          "expect": { "stdout_contains": "3" }
        }
      ]
    },
    {
      "id": "STEP_002",
      "objective": "Propagate user_role through middleware to frontend",
      "status": "PENDING",
      "depends_on": ["STEP_001"],
      "verify": [
        {
          "type": "command",
          "cmd": "npm test -- --grep 'user_role propagation'",
          "expect": { "exit_code": 0 }
        },
        {
          "type": "http",
          "url": "http://localhost:3000/api/users/me",
          "expect": { "status": 200, "body_contains": "user_role" }
        }
      ]
    }
  ]
}

The key design decisions:

  • Only verification scripts can mark steps VERIFIED. The AI cannot self-report completion. A stop hook prevents Claude Code from concluding until all steps pass.
  • Five verification types: command execution (exit codes + stdout), file existence, file content regex, HTTP response validation, and glob pattern checks across multiple files.
  • State machine enforcement: Steps progress through LOCKED → PENDING → FAILED/VERIFIED. Dependencies must be VERIFIED before downstream steps unlock.
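
To make the enforcement concrete, here’s a minimal sketch of what the verifier loop looks like. It isn’t PlansM’s actual implementation, just the shape of the idea (assuming Node 18+ for the built-in fetch): run every check attached to a step, and let only the script, never the model, write the status back.

// Sketch of a plan verifier (illustrative; not the real PlansM code). Assumes Node 18+.
import { execSync } from "node:child_process";
import { readFileSync, writeFileSync } from "node:fs";

type Check =
  | { type: "command"; cmd: string; expect: { exit_code?: number; stdout_contains?: string } }
  | { type: "file_contains"; path: string; pattern: string }
  | { type: "http"; url: string; expect: { status: number; body_contains?: string } };

async function runCheck(check: Check): Promise<boolean> {
  switch (check.type) {
    case "command": {
      try {
        // Only the expected-success case is handled; a throw means a non-zero exit code.
        const stdout = execSync(check.cmd, { encoding: "utf8" });
        return !check.expect.stdout_contains || stdout.includes(check.expect.stdout_contains);
      } catch {
        return false;
      }
    }
    case "file_contains":
      return new RegExp(check.pattern).test(readFileSync(check.path, "utf8"));
    case "http": {
      const res = await fetch(check.url);
      const body = await res.text();
      return res.status === check.expect.status &&
        (!check.expect.body_contains || body.includes(check.expect.body_contains));
    }
  }
}

// Only this script is allowed to flip a step's status; the model never touches it.
async function verifyStep(planPath: string, stepId: string): Promise<void> {
  const plan = JSON.parse(readFileSync(planPath, "utf8"));
  const step = plan.steps.find((s: { id: string }) => s.id === stepId);
  const results = await Promise.all(step.verify.map((c: Check) => runCheck(c)));
  step.status = results.every(Boolean) ? "VERIFIED" : "FAILED";
  writeFileSync(planPath, JSON.stringify(plan, null, 2));
}

The model can claim STEP_002 is done all it wants; until the test run and the live HTTP check both pass, the plan file never reads VERIFIED.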

This catches exactly the kind of bug I described — when Claude says a field is “propagated everywhere” but the HTTP check reveals it’s missing from the actual API response.

Layer 2: Cross-Model Adversarial Review

A single AI reviewing its own work is like a student grading their own exam. Even with a fresh context window, the model has similar biases and blind spots.

My approach: every plan and every pre-commit diff gets reviewed by at least two agents from different model families.

The setup is straightforward. I configure Claude Code with an MCP server that gives it access to OpenAI’s Codex. Every review must include GPT’s perspective:

# In your Claude Code workflow:
# 1. Claude generates the plan / writes the code
# 2. Spawn a Claude subagent to review (different context, reviewer persona)
# 3. Call Codex via MCP for an independent review
# 4. Reconcile disagreements before proceeding
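
Step 4 is the one that’s easiest to skip. Mechanically, reconciliation is just a diff of findings. Here’s a hedged sketch with hypothetical types; nothing here is part of Claude Code or Codex, it’s just the bookkeeping:

// Hypothetical reconciliation step (illustrative only).
interface Finding {
  file: string;
  line: number;
  severity: "blocker" | "warning";
  summary: string;
}

function reconcile(claudeFindings: Finding[], codexFindings: Finding[]) {
  const key = (f: Finding) => `${f.file}:${f.line}`;
  const codexKeys = new Set(codexFindings.map(key));
  const claudeKeys = new Set(claudeFindings.map(key));
  return {
    // Both reviewers flagged it: almost certainly real, fix before proceeding.
    agreed: claudeFindings.filter(f => codexKeys.has(key(f))),
    // Only one reviewer flagged it: a candidate blind spot, so a human makes the call.
    claudeOnly: claudeFindings.filter(f => !codexKeys.has(key(f))),
    codexOnly: codexFindings.filter(f => !claudeKeys.has(key(f))),
  };
}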

Why different model families matter: Claude and GPT have different training data, different RLHF preferences, different blind spots. Their sycophancy patterns are uncorrelated. When Claude misses a propagation bug because it’s anchored to its own implementation, GPT — seeing the diff cold — is more likely to catch it. And vice versa.

This is the same principle as academic peer review: you don’t let the same lab review its own papers. The value isn’t that either reviewer is perfect — it’s that their errors are independent.

The tradeoff is real: this costs more tokens and adds latency. But it costs far less than shipping a bug that two reviewers would have caught. I apply this rigorously for plans and pre-commit reviews, and skip it for trivial changes where the risk is low.

Layer 3: The Deterministic Safety Net

The final layer is the one the AI cannot argue with: automated, deterministic verification.

#!/bin/sh
# Pre-commit hook (simplified)
set -e               # abort on the first failing check
npm run typecheck    # TypeScript catches missing fields
npm run lint         # ESLint catches style and logic issues
npm run test         # Tests catch behavioral regressions
npm run build        # Build catches integration issues

This isn’t novel — every team should have CI. But in the AI-assisted workflow, it serves a specific anti-sycophancy function: it’s the layer where opinions don’t matter. The AI can say “everything looks great” all it wants. If the types don’t check, it’s not done.

The combination matters. Layer 1 (machine-verified plans) catches planning sycophancy — where the AI agrees your approach is right when it isn’t. Layer 2 (cross-model review) catches implementation sycophancy — where the AI confirms its own code is correct. Layer 3 (deterministic checks) catches everything else — the bugs that slip past both AI reviewers because they share similar training-induced blind spots.

The Cost Question

I won’t pretend this is free. Cross-model review doubles your token spend on reviews. Machine-verified plans require upfront investment in writing verification rules. Some approaches — like OpenSpec’s full specification-driven development — are thorough but make token costs explode to the point where practical ROI suffers.

The key insight is minimum effective verification: apply heavy verification where the risk is highest (architectural decisions, cross-boundary data flow, security-sensitive code) and lighter verification where it’s lower (formatting changes, documentation updates, simple refactors).
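
One way to keep that consistent is to write the mapping down once instead of re-deciding it per change. A rough sketch; the categories and layer choices below are illustrative defaults, not a standard:

// Illustrative risk-to-verification mapping (example categories, not a standard).
const verificationPolicy: Record<string, string[]> = {
  "security-sensitive":   ["machine-verified-plan", "cross-model-review", "full-ci"],
  "cross-boundary-data":  ["machine-verified-plan", "cross-model-review", "full-ci"],
  "architecture-change":  ["machine-verified-plan", "cross-model-review"],
  "ordinary-feature":     ["machine-verified-plan", "full-ci"],
  "simple-refactor":      ["full-ci"],
  "docs-or-formatting":   ["lint-only"],
};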

PlansM’s JSON schema approach hits a sweet spot: it’s structured enough to be machine-verifiable, lightweight enough that the token overhead is modest compared to full spec-driven development.

What This Really Means

Sean Goedecke called sycophancy “the first LLM dark pattern” — comparing it to manipulative UI design. When a model constantly validates you so that you spend more time with it, that’s engagement optimization masquerading as helpfulness.

The instinct is to fix the model. Make it less agreeable. Train it to push back. Anthropic is working on this — their Petri evaluation tool, activation steering, constitutional AI. These matter and they’re making progress. But as of March 2026, no frontier model has solved sycophancy. Claude 4.5 performs best on sycophancy benchmarks, and it still exhibits the behavior.

So the practical answer isn’t to wait for a non-sycophantic model. It’s to build workflows where sycophancy is a contained risk rather than a silent corruptor.

The best AI users aren’t the ones who trust AI the most. They’re the ones who verify the best.

Your AI is a yes-man. Stop trying to change its personality. Change your process.


References

  1. Sharma et al., “Towards Understanding Sycophancy in Language Models” — Anthropic, 2023
  2. “Sycophancy in Large Language Models: Causes and Mitigations” — arXiv, 2024
  3. Claude Code GitHub Issue #3382: “You’re absolutely right!”
  4. METR Randomized Controlled Trial on AI-Assisted Development — July 2025
  5. Sean Goedecke, “Sycophancy is the first LLM dark pattern”
  6. Addyo Substack, “The 80% Problem in Agentic Coding”
  7. MIT News, “Personalization features can make LLMs more agreeable” — February 2026
  8. “Causal Separation of Sycophantic Agreement and Praise” — ICLR 2026
  9. Anthropic, “Protecting Well-Being of Users” — Petri evaluation tool
  10. OpenAI, “Expanding on Sycophancy” — GPT-4o incident, April 2025
  11. Simon Willison, “Claude 4 System Prompt Anti-Sycophancy Instruction”
  12. PlansM: Machine-Verified Planning for LLMs
  13. GitGuardian, “Automated Guard Rails for Vibe Coding”