The value of AI code review, and the conversation that led to the retro-notes skill

Overview

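This is the log of a conversation in which I questioned myself about the value of AI code review. It started from one question: in a development flow built on /x-wt-teams and /big-plan, is there any point in having a cheap model (GPT-4.1 via GitHub Copilot) do the reviews?

Along the way it branched into topics like “implementer bias”, “choose the reviewer by failure cost”, “two cheap reviewers ≠ one strong reviewer”, and “optimize loop count, not loop quality”. It ended with creating a new /retro-notes skill and inserting a step into /big-plan that reads lessons automatically.

I wanted to keep the flow of the conversation as it happened, so here it is. The log is left in English.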

The development screen of the bezier pen tool. This is the task that took a full day and started the conversation.

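:::note Note

The skills introduced in this article are published in the claude-resources repository. The repository is updated frequently, so the links in this article are pinned to specific commit hashes and may differ from the latest version.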

:::

Conversation log

Takazudo

Q: I’m using orchestration flow /x-wt-teams, and planning skill /big-plan. for this kind of dev and planning, I always add some reviews for agents’ works. but recently, I noticed that I can use GitHub Copilot with lower price model, GPT-4.1. It’s free! so I use this cheap free model for non-heavy devs. But I have a question. is this reviewing really worthy? From my view, it seems that GPT-4.1 is making some good feedback sometimes. but I use codex too. if I launch both reviews, GPT-4.1 says “OK! well designed”, but Codex raises problems. this sometimes happen. I’m thinking difficult dev is better to have Opus or rich models. For example, the recent troubled work was, implement bezier style drawing pen tool + zoom. It took almost 1 day with many trial-and-error. And finally, it resolved. I used slash-ultrareview feature of Claude Code. But I’m not sure slash-ultrareview had really meaning. So totally… how should we developer think “review?”, if we are better to use Opus, there might be possibility we don’t need review? Does low cost models’ review have any meaning?

Claude

My honest take, from how these models actually behave:

Cheap models (GPT-4.1) are a lint pass, not a review. They catch obvious stuff — naming, dead code, missing null checks, surface-level smells. When they say “well designed,” that mostly means “I didn’t find anything obvious.” It’s not a signal that the design is good. So treat their PASS as ~no information, but their FAIL as worth reading.

Stronger models (Codex/Opus) actually reason about the change. They catch race conditions, wrong abstractions, edge cases the implementer didn’t think of. When GPT-4.1 says OK and Codex raises issues, Codex is usually right (not always — strong models also over-engineer and invent problems). That asymmetry is exactly what you observed.

For hard work like bezier + zoom, post-hoc review is the wrong leverage point. A 1-day trial-and-error session means the approach was uncertain, not that the final code has bugs. /ultrareview after it works catches polish issues, but it can’t tell you “your coordinate transform is conceptually wrong” once you’ve already wrestled it into working. The leverage for that kind of task is upstream: design discussion, a small prototype, or a 2nd-opinion on the plan (codex-2nd, gco-2nd) before committing a day to it.

So how I’d think about it:

| Situation | Worth reviewing with |
| --- | --- |
| Mechanical change, small diff | GPT-4.1 only (or skip) |
| Normal feature, medium risk | GPT-4.1 + Codex; trust Codex on disagreement |
| Hard/novel problem | Review the plan with strong model before coding; light review after |
| Already debugged for a day | Review has low marginal value: you already found the bugs the hard way |

Opus-quality implementation reduces but doesn’t eliminate the need for review — even Opus has blind spots about its own code (confirmation bias on “I just wrote this, it must be right”). A second pass with fresh context is the real value, regardless of model. The model choice just determines what depth of issue gets surfaced.

TL;DR: cheap review = ~free noise filter, keep it. Strong review = real signal, use selectively. For hard problems, move the review earlier (plan stage), not later.

Takazudo

Well, so you mean, the review itself has meaning because the agent has “bias” after the dev work, clean context eye will have some meaning even if GPT-4.1, old model, right?

Claude

Yes, exactly that.

The implementer has been swimming in the problem for hours — they’re attached to the approach they made work. A reviewer with clean context doesn’t carry that attachment, so they read the code as “what does this actually say” instead of “what I meant it to say.” That’s the structural value, and it’s mostly independent of model strength.

Model strength changes what depth of issue gets caught, not whether fresh-eye review has meaning:

GPT-4.1 fresh eyes → catches “this variable name is misleading,” “this branch is unreachable,” “you forgot to handle empty array” — things the tired implementer literally stopped seeing.
Codex/Opus fresh eyes → catches the above plus “this abstraction is wrong,” “this will deadlock under concurrent X.”

So even an “old, cheap” reviewer earns its keep, because a lot of post-implementation bugs are not deep — they’re things the author’s brain auto-completed past. GPT-4.1 doesn’t have that auto-complete bias for your code.

The one caveat: a clean-context PASS from a weak model is still weak evidence. Fresh eyes help you find things; they don’t certify absence. So “GPT-4.1 said OK” ≠ “the code is good,” but “GPT-4.1 found these 3 things” is real value at near-zero cost.
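A minimal sketch of that asymmetry, assuming each reviewer's output has already been reduced to a list of findings. The `Review` type and the `triage` function are hypothetical names for illustration, not part of any tool mentioned in this conversation. A PASS adds nothing to the result, while every finding is surfaced no matter which model raised it.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Review:
    reviewer: str                                      # e.g. "gpt-4.1" or "codex"
    findings: List[str] = field(default_factory=list)  # empty list means PASS


def triage(reviews: List[Review]) -> Tuple[List[Tuple[str, str]], bool]:
    """Return (findings worth reading, whether the reviewers disagreed)."""
    # Every finding is worth reading; a PASS contributes nothing by itself.
    to_read = [(r.reviewer, f) for r in reviews for f in r.findings]
    # Disagreement: at least one reviewer passed while another flagged something.
    someone_passed = any(not r.findings for r in reviews)
    disagreement = someone_passed and bool(to_read)
    return to_read, disagreement


reviews = [
    Review("gpt-4.1"),                        # "OK! well designed"
    Review("codex", ["race in undo stack"]),  # hypothetical finding
]
to_read, disagreement = triage(reviews)
```

Here `to_read` holds only the Codex finding and `disagreement` is true, which is a prompt to go and read the flagged issue rather than a verdict on who is right.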

Takazudo

Even if the rich model like Opus, is it difficult to notice that any time? For example, if the model gets better and better, some day we will not need to have reviews?

Claude

Short answer: no, review won’t go away — but the kind of review will shift.

Here’s why fresh-eye review is structural, not a model-quality problem:

1. The bias is from being in the loop, not from being dumb. Even a perfect implementer who just spent 200 turns making bezier+zoom work has a mental model that says “this works, I tested it.” A reviewer — same model, fresh context — doesn’t have that prior. They read what’s actually on screen. Better models reduce implementation bugs, but they don’t remove the asymmetry between “I built this” and “I’m seeing this for the first time.”

2. Review catches a different class of issue than implementation. Implementation answers “does this work for the cases I thought of?” Review answers “what didn’t you think of?” That second question is genuinely harder to ask yourself in the same session — your context is already shaped by the cases you considered. A stronger model just means each pass thinks of more cases, but the gap between “cases I considered while building” and “cases that exist” stays.

3. The reviewer’s value scales with how much the implementer had to commit to. The harder the problem, the more the implementer locked into one approach early. Bezier+zoom is a great example — by hour 6, you’re not going to rewrite the coordinate system, you’re going to patch it. A fresh reviewer can say “this whole transform should be inverted at the input boundary, not threaded through every function.” Opus building it has the same blind spot you did.

4. Even humans with infinite skill still do code review. Senior engineers review each other not because they can’t write good code, but because (a) fresh eyes catch what tired eyes miss, and (b) a second perspective is cheap insurance against expensive mistakes. There’s no model strength where “two perspectives” stops being more than one.

What will change as models get stronger:

| Today | Likely future |
| --- | --- |
| Review catches bugs | Review mostly catches design/architecture drift |
| Multiple model reviews to triangulate | One strong reviewer is enough |
| Review after implementation | Review shifts upstream (plan review > code review) |
| Human spot-checks AI review | AI review is trusted, human spot-checks the spec |

So the leverage moves earlier and the reviewer needs less guidance, but the act of “someone/something with clean context reads this” stays valuable as long as implementation involves commitment to a path. That’s a property of how building works, not of model IQ.

The day review becomes optional is the day implementation stops requiring choices — which is roughly never, because every problem worth solving has multiple valid approaches and the implementer picks one.

Takazudo

So as the effective reviewing with AIs, we need to choose model for each topic, right? light weight devs, fixes -> GPT-4.1, medium -> Sonnet, clearly difficult -> multiple Opus… like?

Claude

Yes, that’s the right shape — match reviewer strength to failure cost, not to dev difficulty. Slight refinement:

| Change type | Reviewer | Why |
| --- | --- | --- |
| Typo, rename, copy tweak, dependency bump | Skip review | Cost of bug < cost of review time |
| Small fix, isolated function, mechanical refactor | GPT-4.1 (gcoc-review) | Free lint pass, catches silly mistakes |
| Normal feature, medium blast radius | Codex or Sonnet (single pass) | Real reasoning, one perspective is enough |
| Hard / novel / touches core / hard to revert | Codex + Opus (two perspectives) | Triangulate; disagreement itself is signal |
| Security, migrations, money, auth, data loss risk | Multiple strong + human | Failure cost dominates everything |

A few things worth separating from “model choice”:

1. Difficulty ≠ failure cost. Bezier+zoom was hard to build but if it breaks, you fix it tomorrow. A 3-line auth check is trivial to build but if it’s wrong, users get into other accounts. Pick reviewer by what happens when the bug ships, not by how long it took to write.

2. Move strong reviewers upstream for hard work. For the bezier-style task, an Opus review of the plan before you start (codex-2nd / gco-2nd) is worth more than two Opus reviews of the finished code. Once you’ve spent a day, no reviewer can tell you “wrong approach” usefully — you won’t redo it.

3. Two cheap reviewers ≠ one strong reviewer. Stacking GPT-4.1 + GPT-4.1 doesn’t triangulate, it just doubles the same blind spots. If you want a second perspective, the second one needs to actually be different (different model family, or different prompt focus like “security only” / “performance only”).

4. Disagreement is the signal, not the verdict. When GPT-4.1 says OK and Codex flags issues, the value isn’t “Codex wins” — it’s “there’s something here worth your 2 minutes of attention.” Read the flagged issue and decide; don’t auto-accept either side.

5. Set a default, escalate on smell. Make Codex (or Sonnet) your default reviewer. Escalate to Opus / multi-reviewer only when: (a) you notice you’re uncertain, (b) the diff touches something load-bearing, or (c) the first reviewer flagged something you can’t immediately resolve. Otherwise you’ll over-spend Opus on routine work and under-spend it on the rare thing that matters.

So your tiering is right. The thing I’d add: tier by “what breaks if this is wrong,” not by “how hard was this to write,” and put your strongest model at the plan stage for genuinely hard problems rather than piling it on at the end.
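The tiering can be written down as a small lookup keyed by failure cost rather than by difficulty. This is only an illustrative sketch: the `FailureCost` levels, the reviewer labels, and the "escalate one tier when uncertain" rule are assumptions layered on the table above, not an existing configuration format.

```python
from enum import IntEnum
from typing import List


class FailureCost(IntEnum):
    TRIVIAL = 0   # typo, rename, copy tweak, dependency bump
    LOW = 1       # small fix, isolated function, mechanical refactor
    MEDIUM = 2    # normal feature, medium blast radius
    HIGH = 3      # hard / novel / touches core / hard to revert
    CRITICAL = 4  # security, migrations, money, auth, data loss risk


# Hypothetical reviewer labels; "codex" stands in for "Codex or Sonnet".
REVIEWERS = {
    FailureCost.TRIVIAL: [],
    FailureCost.LOW: ["gpt-4.1"],
    FailureCost.MEDIUM: ["codex"],
    FailureCost.HIGH: ["codex", "opus"],
    FailureCost.CRITICAL: ["codex", "opus", "human"],
}


def pick_reviewers(cost: FailureCost, uncertain: bool = False) -> List[str]:
    """Pick reviewers by what breaks if the change is wrong, not by effort spent."""
    # Assumed escalation rule: bump one tier when the author feels uncertain.
    if uncertain and cost < FailureCost.CRITICAL:
        cost = FailureCost(cost + 1)
    return REVIEWERS[cost]


bezier_tool = pick_reviewers(FailureCost.MEDIUM)    # hard to build, cheap to fix
auth_check = pick_reviewers(FailureCost.CRITICAL)   # trivial to build, costly if wrong
```

The two calls at the end mirror the contrast in point 1: the day-long bezier work lands in a lower tier than the 3-line auth check.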

Takazudo

Interesting, I feel it sounds like who I assign on my company, senior dev or someone. About the bezier problem. look /big-plan, I use this almost every dev (not very tiny ones). I talked my plan, made research, and did review with rich models, but we went wrong way on that time. For those cases, how should we think about it? I’m thinking that this depends on if I have such experience or not, and if I can give that context or not. For example, I have well organized CSS architecture context as /css-wisdom, Tauri related info as /tauri-wisdom. Is this a right way? It’s the giving context as skill, way. I feel that all what I can do is just feeding you text. and skill is the must-read info for you. so I’m relying on skills for almost anything.

Claude

Your intuition is correct, and you’re already doing the right thing. Let me sharpen the picture:

Why plan + rich review still went wrong on bezier

Plan review with Opus only catches “is this plan internally consistent and reasonable-sounding.” It cannot catch “is this the right approach for this specific domain” if neither you nor the model has built it before. Bezier + zoom + interactive editing has a bunch of non-obvious traps (coordinate space ordering, control point hit testing under transforms, anti-aliasing at zoom boundaries) that you only discover by writing the code and watching it misbehave. No reviewer — human or AI — surfaces those from a plan, because the plan doesn’t yet contain the wrong abstraction. The wrong abstraction is invented during implementation.

So the failure wasn’t “review missed it.” The failure was “this category of problem is unreviewable from a plan.” Different problem.

Skills as context = right direction, with one nuance

Your mental model — “all I can do is feed text, and skills are must-read text” — is correct. Skills are crystallized experience. /css-wisdom and /tauri-wisdom work because:

The domain has established patterns someone already learned the hard way
Those patterns are transferable — they apply to most CSS/Tauri tasks
You (or someone) already paid the cost of discovery

The nuance: skills only help on the second encounter, not the first. /css-wisdom didn’t exist before you (or the community) had hit those CSS pitfalls enough to extract them. For the bezier work, no /bezier-wisdom existed because you hadn’t done it before. That’s why it took a day. That day was the cost of building the wisdom, not a failure.

Two categories of “hard” — they need different treatment

| Category | Example | What helps |
| --- | --- | --- |
| Hard because of accumulated domain knowledge | “How do I structure this CSS layout?” | Skill / wisdom file. AI + skill = senior dev with the team’s playbook |
| Hard because the constraints emerge from building | Bezier pen + zoom, novel interaction, perf tuning a specific path | No skill helps on first attempt. Plan to prototype-and-throw-away, not plan-and-implement |

For the second category, the leverage isn’t more review or better plan — it’s shorter feedback loops: build the smallest thing that exhibits the problem, see what breaks, then plan the real thing. The bezier work probably would have gone faster as “spend 2 hours on a throwaway prototype to find the coordinate-space gotchas, then plan, then build” rather than plan → build → debug for a day.

The meta-loop you’re already running

You’re doing the right thing structurally:

Hit a hard problem → struggle → solve it
Extract lessons → put them in a skill
Next time, AI + skill = the problem is now category 1 instead of category 2

The bezier project’s real output is two things: the working tool, and a /bezier-wisdom (or /canvas-zoom-wisdom) skill that didn’t exist before. If you didn’t write that skill, you’ll pay the day again next time. Writing the skill is how you convert one-time pain into permanent leverage.

So the answer to your question: yes, skills are the right primary lever, because they’re how you turn “context only I have” into “context the AI has too.” The limit isn’t the approach — it’s that skills can only encode what someone already learned. For genuinely first-of-its-kind work, the move is to prototype fast (cheap discovery) and then write the skill after, so the next person (including future-you) starts in category 1.

Concrete suggestion for your workflow: after any dev that took meaningfully longer than expected, ask yourself “is there a skill-shaped lesson here?” If yes, spend 15 minutes writing it. That’s the highest-leverage 15 minutes in your whole workflow — it compounds across every future project.

Takazudo

Interesting, so even if the rich model, you recognize, it’s always uncertain what happens in the dev, so trial -> learn, find problems -> rethink -> next trial: this is the way to go for non-experienced territory, right?

Claude

Exactly right. And the important reframe: that loop isn’t a sign something went wrong — it’s the actual work, for any problem in unexplored territory.

A few sharpenings worth keeping in mind:

The loop is fundamental, not a fallback. “Plan → execute → done” is what work looks like after the territory is mapped. The first time through, the only honest shape is trial → discover → rethink → retry. Even an infinitely smart model can’t skip this, because the information needed to plan correctly literally doesn’t exist until you’ve poked the problem. This is true for senior engineers, research scientists, and would be true for an AGI. It’s a property of the problem, not the solver.

Make each loop cheap, not rare. The trap is trying to make the first attempt right (long planning, big reviews, careful implementation). That just makes each loop expensive, so you do fewer of them, so you learn slower. Better: make the first attempt fast and disposable. Bezier in a 50-line prototype, no zoom, no UI polish — just “does my coordinate transform survive the basic case.” Throw it away. Loop again with what you learned.

The slogan: optimize loop count, not loop quality. Five rough loops beats one careful loop on novel problems.
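As a rough idea of what such a throwaway probe could look like, here is a sketch that checks only whether a click survives the trip from screen space to world space and back. The `Viewport` class and its fields are assumptions made for illustration; this is not the code from the actual bezier project.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class Viewport:
    zoom: float = 1.0    # screen pixels per world unit
    pan_x: float = 0.0   # screen-space position of the world origin
    pan_y: float = 0.0

    def screen_to_world(self, sx: float, sy: float) -> Point:
        """Invert the view transform once, at the input boundary."""
        return ((sx - self.pan_x) / self.zoom, (sy - self.pan_y) / self.zoom)

    def world_to_screen(self, wx: float, wy: float) -> Point:
        """Apply the view transform once, at the drawing boundary."""
        return (wx * self.zoom + self.pan_x, wy * self.zoom + self.pan_y)


def add_anchor(points: List[Point], view: Viewport, click_x: float, click_y: float) -> None:
    """Store anchors in world space so path logic never has to know about zoom."""
    points.append(view.screen_to_world(click_x, click_y))


view = Viewport(zoom=2.0, pan_x=100.0, pan_y=50.0)
points: List[Point] = []
add_anchor(points, view, 300.0, 250.0)
round_trip = view.world_to_screen(*points[0])  # should land back on the click
```

If `round_trip` does not equal the original click, the coordinate model is wrong, and that is learned in minutes rather than in hour 6 of production code.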

Know which loop you’re on.

Loop 1-2: you’re discovering constraints. Don’t write production code. Don’t review. Don’t plan thoroughly. Just probe.
Loop 3-4: you’ve seen the real shape. Now plan properly, now write real code, now review.
Loop 5+: if you’re still here, the problem is bigger than you thought — stop and reconsider scope, don’t just keep grinding.

The bezier work probably blurred these — production-quality code and review starting at loop 1, when you were still discovering what the problem even was. That’s the most common way hard projects burn time.
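The loop stages can be reduced to a tiny decision helper. The phase names `probe`, `build`, and `rescope` and both function names are hypothetical labels for the three bullets above; the thresholds simply copy the loop numbers given there.

```python
def phase_for_loop(loop_number: int) -> str:
    """Map a 1-based attempt counter to the phase described in the text."""
    if loop_number < 1:
        raise ValueError("loop_number starts at 1")
    if loop_number <= 2:
        return "probe"    # discover constraints: throwaway code, no review
    if loop_number <= 4:
        return "build"    # plan properly, write real code, review
    return "rescope"      # problem is bigger than expected: reconsider scope


def should_review(loop_number: int) -> bool:
    """Heavy review only pays off once the real shape of the problem is known."""
    return phase_for_loop(loop_number) == "build"
```

Under this sketch, `should_review(1)` is false, which is the point of the paragraph above: review starting at loop 1 is spent on code that is about to be thrown away.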

Capture the map after, not during. While looping, don’t try to write the wisdom skill yet — you don’t know what’s signal vs. noise. After it works, look back: “what did I waste time on that I’d skip next time?” That’s the skill. Writing wisdom mid-struggle usually crystallizes the wrong lessons (the ones from your most recent frustration, not the deepest ones).

Models help most on loops 3+, not loop 1. Loop 1 is “what does this domain even feel like” — a model can’t shortcut that for you, because you need to feel it. Loops 3+ are “given what I now know, what’s the right structure” — that’s where Opus, plan review, code review all earn their cost. Stacking heavy review on loop 1 is like asking a senior engineer to code-review your scratch paper.

So yes: trial → learn → rethink → retry is the way for unexplored territory. The skill is recognizing you’re in unexplored territory early enough to set the loop up cheaply, instead of treating loop 1 as if it were loop 4. Most expensive dev sessions are loop-1 work disguised as loop-4 work.

Takazudo

Awesome. so how about this idea? we have /big-plan for planning, it makes plan with epic style. and after that dev, we might reach to “fail”. For that case, we’ll launch /fail-notes. on this skill, we check what we did, and what happened, how original idea went wrong. And we’ll save it as /l-fail-notes-{topic-name} as project scope skill. Then we’ll do next /big-plan. but then we have fail notes. This sounds that we can go this trial loop way. What do you think?

Claude

This is a strong idea — you’re basically building AI-readable postmortems, which is exactly the missing piece in the loop we discussed. Worth doing. A few refinements that will make it pay off more:

1. Reframe “fail” → “what this attempt taught us”

Most dev isn’t binary success/failure. The bezier project “succeeded” but burned a day on a wrong approach first. Pure-failure framing will make you skip writing notes for the most valuable category: things that eventually worked but went sideways first. Call it /retro-notes or /lessons-capture. Trigger condition: “did this take meaningfully longer than expected, or did the approach change mid-way?” — not “did it fail.”

2. Capture after resolution, not mid-struggle

If you write notes while still bleeding, you’ll record symptoms (“zoom was buggy”) instead of root causes (“we threaded the transform through every function instead of inverting at the input boundary”). The valuable lesson is only visible after the fix, looking backward. So gate /retro-notes behind “the thing now works.”

3. Structure the note for the next planner, not for memory

The reader is a future /big-plan run, not future-you nostalgically reviewing. That changes the shape. Useful template:

# Topic: bezier pen + zoom

## What we set out to do
[1-2 sentences]

## Approach we tried first
[the wrong path — be specific about the abstraction, not the symptom]

## Why it went wrong
[the actual root cause — coordinate space inverted in wrong layer, etc.]

## What worked instead
[the right abstraction]

## Watch for next time
- [concrete trap, e.g. "if you see X, you're in the wrong coordinate space"]
- [concrete trap]

## Would-skip-if-redoing
[things that wasted time and are now provably unnecessary]

The “Watch for next time” section is the highest-leverage part — it’s what /big-plan will actually use. Two bullet points there beat two pages of narrative.

4. Granularity: one growing file per problem-area, not one per attempt

If every attempt creates a new /l-fail-notes-bezier-2026-04, /l-fail-notes-bezier-zoom, /l-fail-notes-pen-tool, you’ll fragment and the next /big-plan won’t know which to read. Better: one /l-lessons-canvas-tools (or /l-lessons-{project-area}) that you append to and periodically refactor. Skills are most useful when they’re a single coherent document, not a pile of dated logs.
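The "one growing file per area" idea might look like the sketch below. It assumes lessons live at `.claude/skills/l-lessons-{area}/SKILL.md` inside the project, the location mentioned later in this conversation. The function name `append_lesson` is hypothetical, and a real skill file may need extra metadata that is not modeled here.

```python
from pathlib import Path


def append_lesson(project_root: str, area: str, note_markdown: str) -> Path:
    """Append one retro note to the single lessons file for a problem area."""
    skill_file = (
        Path(project_root) / ".claude" / "skills" / f"l-lessons-{area}" / "SKILL.md"
    )
    # Create the area directory on first use instead of one directory per attempt.
    skill_file.parent.mkdir(parents=True, exist_ok=True)
    if not skill_file.exists():
        skill_file.write_text(f"# Lessons: {area}\n", encoding="utf-8")
    # Later attempts in the same area extend the same document.
    with skill_file.open("a", encoding="utf-8") as f:
        f.write("\n" + note_markdown.strip() + "\n")
    return skill_file
```

Calling it twice with the area `canvas-tools` produces one file with two notes, so a later planning run has exactly one place to read.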

5. Make /big-plan actually read them

The loop only closes if /big-plan auto-discovers relevant lessons. Two options:

/big-plan’s prompt explicitly checks for /l-lessons-* skills matching the topic and reads them before planning
Or: at the start of /big-plan, you manually mention “check /l-lessons-canvas-tools” — fine, but easy to forget

The first is more reliable. Worth a small tweak to /big-plan to do “check for matching project-scope lesson skills” as step zero.
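A step-zero lookup could be sketched as follows. How /big-plan actually matches lessons to a topic is not specified in this conversation, so the keyword-in-directory-name matching is purely an assumption, as are the function name `find_lessons` and the `.claude/skills` location.

```python
from pathlib import Path
from typing import Iterable, List


def find_lessons(project_root: str, topic_keywords: Iterable[str]) -> List[Path]:
    """Return lesson files whose area name contains any of the topic keywords."""
    skills_dir = Path(project_root) / ".claude" / "skills"
    if not skills_dir.is_dir():
        return []  # no lessons yet: planning proceeds without them
    keywords = [k.lower() for k in topic_keywords]
    matches = []
    for skill_file in sorted(skills_dir.glob("l-lessons-*/SKILL.md")):
        # Directory name is "l-lessons-{area}"; strip the prefix to get the area.
        area = skill_file.parent.name[len("l-lessons-"):].lower()
        if any(k in area for k in keywords):
            matches.append(skill_file)
    return matches
```

Returning an empty list instead of raising keeps the lookup from ever blocking planning when a project has no lessons yet.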

6. Pruning matters more than you think

After 6 months you’ll have 15 lesson files, half of them about deprecated approaches. Skill content rots. Plan for a /lessons-refactor pass every few months: merge overlapping notes, delete advice for code that no longer exists, promote repeated lessons into more general wisdom (e.g. several canvas-related notes might converge into a /l-coordinate-spaces-wisdom that applies broadly).

7. The hidden benefit: it forces you to articulate the lesson

Even if the AI never reads the note, the act of writing “here’s what I’d do differently” cements the lesson for you. Half the value is the writing, not the storage. That’s why the structured template matters — freeform “ugh that was hard” notes don’t extract the lesson; structured prompts do.

Putting it all together, your loop becomes:

/big-plan (reads existing /l-lessons-* for the area)
  ↓
attempt N
  ↓
resolved? → /retro-notes (appends to /l-lessons-{area})
  ↓
next /big-plan starts smarter

That’s a genuine learning system, not just a planning system. Each project ends with the codebase + the team’s wisdom both upgraded. Over a year, the wisdom delta probably matters more than any single project’s code delta.

One last thing: the loop also tells you when to stop building skills. If /retro-notes runs and you write “nothing surprising, plan matched reality,” that’s signal that the area is now well-mapped — future /big-plan runs in this area can be lighter, faster, less reviewed. The retros aren’t just for capturing failure; they’re for measuring how mapped the territory is.

Takazudo

Ah your words are great. I said as you say, exactly “zoom was buggy”. Now I recognize what was wrong for me. I recognized it as just a “fail”. so what I did there was just random-result-loop. /skill-creator /skill-tweaker all right, try to tweak those existing skills, and make /retro-notes

Claude

Great realization — that’s the exact reframe. Let me kick off both: /skill-creator to draft /retro-notes, and we can sequence /skill-tweaker after to wire /big-plan to read the lesson files.

(/skill-creator runs, scaffolds the new skill, writes SKILL.md. Then /skill-tweaker runs against /big-plan to insert a lessons-reading step.)

Both skills are in place. Quick recap:

/retro-notes (new global skill at $HOME/.claude/skills/retro-notes/SKILL.md)

Captures lessons from a just-resolved dev attempt into .claude/skills/l-lessons-{area}/SKILL.md
Pre-flight gate: only runs after the work is wrapped up — won’t crystallize symptoms while you’re still struggling
Forces the symptom → root-cause distinction (the “zoom was buggy” → “transform threaded through every function” example is baked in)
Appends to one growing file per problem-area, not one file per attempt

/big-plan (tweaked, existing global skill)

New Step 1c: Read project lessons added between issue-fetching and codebase exploration
Auto-discovers any matching l-lessons-* skills, reads them, and shapes sub-tasks accordingly
“Watch for next time” / “Would-skip-if-redoing” sections drive plan adjustments
Skips silently when no lessons exist — never blocks planning
Description and Key Principles updated to surface the integration

The loop now closes: hard dev → /retro-notes → lessons file → next /big-plan reads it automatically → next attempt starts smarter. No manual remembering required.

One thing to note for the bezier work: since you mentioned it just resolved, that’s a perfect first candidate. Inside the bezier project repo, run /retro-notes and it will scaffold .claude/skills/l-lessons-canvas-tools/ (or whatever area name fits) with the structured template — you fill in the root-cause section while it’s fresh.

Afterword

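What this exchange made clear is that the value of review comes from the freshness of the context, not from how smart the model is. The implementer has been immersed in the problem for hours, so they read their own code as “what I meant it to say”. Bringing in clean-context eyes (even GPT-4.1) exposes the parts that the author’s brain auto-completed past. It is a structural point.

The other point: for hard problems, placing a strong model upstream (at the plan stage) works better than piling on more review. Beyond that, genuinely new territory should run as a loop of “prototype → learn → throw away → plan for real” rather than “plan → implement → review”. What that loop yields gets captured by retro-notes and read automatically by the next /big-plan. The conversation connected all of this in one continuous line.

I had written off the bezier work as a “failure” in my own mind, but what I had actually been doing was a random-result-loop. That was the realization this time.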