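The Value of AI Code Review, and the Conversation That Led to a Retrospective Skill

Overview

I asked an AI whether having AI do code review actually means anything, and partway through it turned into a surprisingly good conversation, so here is the log.

As usual this is high-context, but: for most things I have /big-plan create a batch of issues in one go, then have /x-wt-teams orchestrate the whole implementation. I usually have about four things running in parallel and mostly just watch, with reviews sprinkled into that flow here and there.

In an article I wrote a little while ago, I noticed on closer inspection that GPT-4.1 on Copilot seemed to be free. I have been using it since, it really does appear to be free, and so I now throw pretty much everything at Copilot GPT-4.1 for review.

My feel for it was: things that really don't matter go to GPT-4.1 to save tokens, and everything else goes to Codex or Opus. I was doing this on intuition, but then I started wondering whether having a cheap model review anything is meaningful in the first place, so I asked a bunch of questions.

The bezier pen tool development screen

Note

The skills introduced in this article are published in the claude-resources repository. The repository is updated frequently, so the links in this article are pinned to specific commit hashes and may differ from the latest version.

Conversation log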

Takazudo

Q: I’m using an orchestration flow, /x-wt-teams, and a planning skill, /big-plan. For this kind of dev and planning, I always add some reviews of the agents’ work. Recently I noticed that I can use GitHub Copilot with a lower-priced model, GPT-4.1. It’s free! So I use this cheap, free model for non-heavy dev work. But I have a question: is this reviewing really worth it? From my view, GPT-4.1 sometimes gives good feedback. But I use Codex too, and if I launch both reviews, sometimes GPT-4.1 says “OK! well designed” while Codex raises problems. I’m thinking difficult dev work is better off with Opus or other rich models. For example, the recent troublesome work was implementing a bezier-style drawing pen tool + zoom. It took almost 1 day with a lot of trial and error before it was finally resolved. I used the /ultrareview feature of Claude Code, but I’m not sure /ultrareview really meant anything. So overall, how should we developers think about “review”? If we are better off using Opus, is it possible we don’t need review at all? Does a low-cost model’s review have any meaning?

Claude Code

My honest take, from how these models actually behave:

Cheap models (GPT-4.1) are a lint pass, not a review. They catch obvious stuff — naming, dead code, missing null checks, surface-level smells. When they say “well designed,” that mostly means “I didn’t find anything obvious.” It’s not a signal that the design is good. So treat their PASS as ~no information, but their FAIL as worth reading.

Stronger models (Codex/Opus) actually reason about the change. They catch race conditions, wrong abstractions, edge cases the implementer didn’t think of. When GPT-4.1 says OK and Codex raises issues, Codex is usually right (not always — strong models also over-engineer and invent problems). That asymmetry is exactly what you observed.

For hard work like bezier + zoom, post-hoc review is the wrong leverage point. A 1-day trial-and-error session means the approach was uncertain, not that the final code has bugs. /ultrareview after it works catches polish issues, but it can’t tell you “your coordinate transform is conceptually wrong” once you’ve already wrestled it into working. The leverage for that kind of task is upstream: design discussion, a small prototype, or a 2nd-opinion on the plan (codex-2nd, gco-2nd) before committing a day to it.

So how I’d think about it:

Situation	Worth reviewing with
Mechanical change, small diff	GPT-4.1 only (or skip)
Normal feature, medium risk	GPT-4.1 + Codex; trust Codex on disagreement
Hard/novel problem	Review the plan with strong model before coding; light review after
Already debugged for a day	Review has low marginal value — you already found the bugs the hard way

Opus-quality implementation reduces but doesn’t eliminate the need for review — even Opus has blind spots about its own code (confirmation bias on “I just wrote this, it must be right”). A second pass with fresh context is the real value, regardless of model. The model choice just determines what depth of issue gets surfaced.

TL;DR: cheap review = ~free noise filter, keep it. Strong review = real signal, use selectively. For hard problems, move the review earlier (plan stage), not later.

Takazudo

So you mean the review itself has meaning because the agent has a “bias” after the dev work, and a clean-context eye has some value even if it is GPT-4.1, an old model, right?

Claude Code

Yes, exactly that.

The implementer has been swimming in the problem for hours — they’re attached to the approach they made work. A reviewer with clean context doesn’t carry that attachment, so they read the code as “what does this actually say” instead of “what I meant it to say.” That’s the structural value, and it’s mostly independent of model strength.

Model strength changes what depth of issue gets caught, not whether fresh-eye review has meaning:

GPT-4.1 fresh eyes → catches “this variable name is misleading,” “this branch is unreachable,” “you forgot to handle empty array” — things the tired implementer literally stopped seeing.
Codex/Opus fresh eyes → catches the above plus “this abstraction is wrong,” “this will deadlock under concurrent X.”

So even an “old, cheap” reviewer earns its keep, because a lot of post-implementation bugs are not deep — they’re things the author’s brain auto-completed past. GPT-4.1 doesn’t have that auto-complete bias for your code.

The one caveat: a clean-context PASS from a weak model is still weak evidence. Fresh eyes help you find things; they don’t certify absence. So “GPT-4.1 said OK” ≠ “the code is good,” but “GPT-4.1 found these 3 things” is real value at near-zero cost.
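
A minimal sketch of that asymmetry, assuming review results arrive as a reviewer name plus a list of findings. The `Review` type, the `strong` flag, and the status labels are hypothetical names chosen for illustration, not part of any tool mentioned here.

```python
from dataclasses import dataclass, field


@dataclass
class Review:
    reviewer: str                      # e.g. "gpt-4.1", "codex"
    strong: bool                       # hypothetical flag for Codex/Opus-class models
    findings: list[str] = field(default_factory=list)


def triage(reviews: list[Review]) -> dict:
    """Combine reviews: findings always count, a PASS counts only from a strong model."""
    findings = [(r.reviewer, f) for r in reviews for f in r.findings]
    if findings:
        status = "needs-attention"     # read each finding and decide yourself
    elif any(r.strong for r in reviews):
        status = "some-confidence"     # a strong reviewer looked and found nothing
    else:
        status = "no-information"      # a weak PASS does not certify absence
    return {"status": status, "findings": findings}
```

With only a GPT-4.1 PASS this returns `no-information`; as soon as any reviewer reports a finding, the status becomes `needs-attention` regardless of which model raised it.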

Takazudo

Even for a rich model like Opus, is it hard to notice those things every time? For example, if models keep getting better and better, will we someday not need reviews at all?

Claude Code

Short answer: no, review won’t go away — but the kind of review will shift.

Here’s why fresh-eye review is structural, not a model-quality problem:

1. The bias is from being in the loop, not from being dumb. Even a perfect implementer who just spent 200 turns making bezier+zoom work has a mental model that says “this works, I tested it.” A reviewer — same model, fresh context — doesn’t have that prior. They read what’s actually on screen. Better models reduce implementation bugs, but they don’t remove the asymmetry between “I built this” and “I’m seeing this for the first time.”

2. Review catches a different class of issue than implementation. Implementation answers “does this work for the cases I thought of?” Review answers “what didn’t you think of?” That second question is genuinely harder to ask yourself in the same session — your context is already shaped by the cases you considered. A stronger model just means each pass thinks of more cases, but the gap between “cases I considered while building” and “cases that exist” stays.

3. The reviewer’s value scales with how much the implementer had to commit to. The harder the problem, the more the implementer locked into one approach early. Bezier+zoom is a great example — by hour 6, you’re not going to rewrite the coordinate system, you’re going to patch it. A fresh reviewer can say “this whole transform should be inverted at the input boundary, not threaded through every function.” Opus building it has the same blind spot you did.

4. Even humans with infinite skill still do code review. Senior engineers review each other not because they can’t write good code, but because (a) fresh eyes catch what tired eyes miss, and (b) a second perspective is cheap insurance against expensive mistakes. There’s no model strength where “two perspectives” stops being more than one.

What will change as models get stronger:

Today	Likely future
Review catches bugs	Review mostly catches design/architecture drift
Multiple model reviews to triangulate	One strong reviewer is enough
Review after implementation	Review shifts upstream (plan review > code review)
Human spot-checks AI review	AI review is trusted, human spot-checks the spec

So the leverage moves earlier and the reviewer needs less guidance, but the act of “someone/something with clean context reads this” stays valuable as long as implementation involves commitment to a path. That’s a property of how building works, not of model IQ.

The day review becomes optional is the day implementation stops requiring choices — which is roughly never, because every problem worth solving has multiple valid approaches and the implementer picks one.

Takazudo

So for effective reviewing with AIs, we need to choose a model for each topic, right? Lightweight dev work and fixes -> GPT-4.1, medium -> Sonnet, clearly difficult -> multiple Opus… something like that?

Claude Code

Yes, that’s the right shape — match reviewer strength to failure cost, not to dev difficulty. Slight refinement:

Change type	Reviewer	Why
Typo, rename, copy tweak, dependency bump	Skip review	Cost of bug < cost of review time
Small fix, isolated function, mechanical refactor	GPT-4.1 (gcoc-review)	Free lint pass, catches silly mistakes
Normal feature, medium blast radius	Codex or Sonnet (single pass)	Real reasoning, one perspective is enough
Hard / novel / touches core / hard to revert	Codex + Opus (two perspectives)	Triangulate; disagreement itself is signal
Security, migrations, money, auth, data loss risk	Multiple strong + human	Failure cost dominates everything

A few things worth separating from “model choice”:

1. Difficulty ≠ failure cost. Bezier+zoom was hard to build but if it breaks, you fix it tomorrow. A 3-line auth check is trivial to build but if it’s wrong, users get into other accounts. Pick reviewer by what happens when the bug ships, not by how long it took to write.

2. Move strong reviewers upstream for hard work. For the bezier-style task, an Opus review of the plan before you start (codex-2nd / gco-2nd) is worth more than two Opus reviews of the finished code. Once you’ve spent a day, no reviewer can tell you “wrong approach” usefully — you won’t redo it.

3. Two cheap reviewers ≠ one strong reviewer. Stacking GPT-4.1 + GPT-4.1 doesn’t triangulate, it just doubles the same blind spots. If you want a second perspective, the second one needs to actually be different (different model family, or different prompt focus like “security only” / “performance only”).

4. Disagreement is the signal, not the verdict. When GPT-4.1 says OK and Codex flags issues, the value isn’t “Codex wins” — it’s “there’s something here worth your 2 minutes of attention.” Read the flagged issue and decide; don’t auto-accept either side.

5. Set a default, escalate on smell. Make Codex (or Sonnet) your default reviewer. Escalate to Opus / multi-reviewer only when: (a) you notice you’re uncertain, (b) the diff touches something load-bearing, or (c) the first reviewer flagged something you can’t immediately resolve. Otherwise you’ll over-spend Opus on routine work and under-spend it on the rare thing that matters.

So your tiering is right. The thing I’d add: tier by “what breaks if this is wrong,” not by “how hard was this to write,” and put your strongest model at the plan stage for genuinely hard problems rather than piling it on at the end.
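
The tiering above could be sketched as a small lookup with an escalation rule. This is only an illustration under stated assumptions: the `Change` categories, the reviewer labels, and the exact lists per tier are made up here to mirror the table, and Codex is used as the default reviewer as suggested in point 5.

```python
from enum import Enum


class Change(Enum):
    TRIVIAL = "typo, rename, copy tweak, dependency bump"
    SMALL = "small fix, isolated function, mechanical refactor"
    NORMAL = "normal feature, medium blast radius"
    HARD = "hard, novel, touches core, hard to revert"
    CRITICAL = "security, migrations, money, auth, data loss risk"


# Hypothetical reviewer labels; tiers follow failure cost, not dev difficulty.
REVIEWERS = {
    Change.TRIVIAL: [],
    Change.SMALL: ["gpt-4.1"],
    Change.NORMAL: ["codex"],
    Change.HARD: ["codex", "opus"],
    Change.CRITICAL: ["codex", "opus", "human"],
}


def pick_reviewers(change, uncertain=False, load_bearing=False, unresolved_flag=False):
    """Return the default reviewers for a change, escalating on smell."""
    escalate = uncertain or load_bearing or unresolved_flag
    if escalate and change in (Change.TRIVIAL, Change.SMALL, Change.NORMAL):
        # Two different perspectives, so that disagreement becomes a signal.
        return list(REVIEWERS[Change.HARD])
    return list(REVIEWERS[change])
```

For example, `pick_reviewers(Change.NORMAL)` stays with a single Codex pass, while `pick_reviewers(Change.NORMAL, load_bearing=True)` escalates to Codex plus Opus.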

Takazudo

Interesting, it sounds like deciding who to assign at my company, a senior dev or someone else. About the bezier problem: look at /big-plan. I use it for almost every dev task (except very tiny ones). I talked through my plan, did research, and had it reviewed with rich models, but we still went the wrong way that time. How should we think about cases like that? I’m thinking it depends on whether I have that kind of experience or not, and whether I can provide that context or not. For example, I have well-organized CSS architecture context as /css-wisdom, and Tauri-related info as /tauri-wisdom. Is this the right way, giving context as skills? I feel that all I can do is feed you text, and a skill is must-read info for you, so I’m relying on skills for almost everything.

Claude Code

Your intuition is correct, and you’re already doing the right thing. Let me sharpen the picture:

Why plan + rich review still went wrong on bezier

Plan review with Opus only catches “is this plan internally consistent and reasonable-sounding.” It cannot catch “is this the right approach for this specific domain” if neither you nor the model has built it before. Bezier + zoom + interactive editing has a bunch of non-obvious traps (coordinate space ordering, control point hit testing under transforms, anti-aliasing at zoom boundaries) that you only discover by writing the code and watching it misbehave. No reviewer — human or AI — surfaces those from a plan, because the plan doesn’t yet contain the wrong abstraction. The wrong abstraction is invented during implementation.

So the failure wasn’t “review missed it.” The failure was “this category of problem is unreviewable from a plan.” Different problem.

Skills as context = right direction, with one nuance

Your mental model — “all I can do is feed text, and skills are must-read text” — is correct. Skills are crystallized experience. /css-wisdom and /tauri-wisdom work because:

The domain has established patterns someone already learned the hard way
Those patterns are transferable — they apply to most CSS/Tauri tasks
You (or someone) already paid the cost of discovery

The nuance: skills only help on the second encounter, not the first. /css-wisdom didn’t exist before you (or the community) had hit those CSS pitfalls enough to extract them. For the bezier work, no /bezier-wisdom existed because you hadn’t done it before. That’s why it took a day. That day was the cost of building the wisdom, not a failure.

Two categories of “hard” — they need different treatment

Category	Example	What helps
Hard because of accumulated domain knowledge	“How do I structure this CSS layout?”	Skill / wisdom file. AI + skill = senior dev with the team’s playbook
Hard because the constraints emerge from building	Bezier pen + zoom, novel interaction, perf tuning a specific path	No skill helps on first attempt. Plan to prototype-and-throw-away, not plan-and-implement

For the second category, the leverage isn’t more review or better plan — it’s shorter feedback loops: build the smallest thing that exhibits the problem, see what breaks, then plan the real thing. The bezier work probably would have gone faster as “spend 2 hours on a throwaway prototype to find the coordinate-space gotchas, then plan, then build” rather than plan → build → debug for a day.

The meta-loop you’re already running

You’re doing the right thing structurally:

Hit a hard problem → struggle → solve it
Extract lessons → put them in a skill
Next time, AI + skill = the problem is now category 1 instead of category 2

The bezier project’s real output is two things: the working tool, and a /bezier-wisdom (or /canvas-zoom-wisdom) skill that didn’t exist before. If you didn’t write that skill, you’ll pay the day again next time. Writing the skill is how you convert one-time pain into permanent leverage.

So the answer to your question: yes, skills are the right primary lever, because they’re how you turn “context only I have” into “context the AI has too.” The limit isn’t the approach — it’s that skills can only encode what someone already learned. For genuinely first-of-its-kind work, the move is to prototype fast (cheap discovery) and then write the skill after, so the next person (including future-you) starts in category 1.

Concrete suggestion for your workflow: after any dev that took meaningfully longer than expected, ask yourself “is there a skill-shaped lesson here?” If yes, spend 15 minutes writing it. That’s the highest-leverage 15 minutes in your whole workflow — it compounds across every future project.

Takazudo

Interesting. So even with a rich model, you recognize that what will happen during dev is always uncertain, so trial -> learn and find problems -> rethink -> next trial: this is the way to go in territory I have no experience in, right?

Claude Code

Exactly right. And the important reframe: that loop isn’t a sign something went wrong — it’s the actual work, for any problem in unexplored territory.

A few sharpenings worth keeping in mind:

The loop is fundamental, not a fallback. “Plan → execute → done” is what work looks like after the territory is mapped. The first time through, the only honest shape is trial → discover → rethink → retry. Even an infinitely smart model can’t skip this, because the information needed to plan correctly literally doesn’t exist until you’ve poked the problem. This is true for senior engineers, research scientists, and would be true for an AGI. It’s a property of the problem, not the solver.

Make each loop cheap, not rare. The trap is trying to make the first attempt right (long planning, big reviews, careful implementation). That just makes each loop expensive, so you do fewer of them, so you learn slower. Better: make the first attempt fast and disposable. Bezier in a 50-line prototype, no zoom, no UI polish — just “does my coordinate transform survive the basic case.” Throw it away. Loop again with what you learned.

The slogan: optimize loop count, not loop quality. Five rough loops beats one careful loop on novel problems.
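
As an illustration of how small such a throwaway probe can be, here is a sketch that checks only whether a world/screen transform survives the basic case. It assumes a uniform zoom plus a pan offset; the `View` class and its method names are hypothetical and are not taken from the actual pen tool.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class View:
    zoom: float = 1.0
    pan_x: float = 0.0
    pan_y: float = 0.0

    def to_screen(self, x, y):
        """World -> screen: scale first, then translate."""
        return (x * self.zoom + self.pan_x, y * self.zoom + self.pan_y)

    def to_world(self, sx, sy):
        """Screen -> world: the exact inverse, applied once at the input boundary."""
        return ((sx - self.pan_x) / self.zoom, (sy - self.pan_y) / self.zoom)

    def zoom_at(self, sx, sy, factor):
        """Zoom by `factor`, keeping the world point under (sx, sy) fixed on screen."""
        wx, wy = self.to_world(sx, sy)
        z = self.zoom * factor
        return View(z, sx - wx * z, sy - wy * z)


def cubic_bezier(p0, p1, p2, p3, t):
    """Evaluate a cubic bezier in world space at parameter t in [0, 1]."""
    u = 1.0 - t
    return tuple(
        u**3 * a + 3 * u * u * t * b + 3 * u * t * t * c + t**3 * d
        for a, b, c, d in zip(p0, p1, p2, p3)
    )


def close(p, q):
    """Compare two points with a floating-point tolerance."""
    return all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(p, q))
```

The probe is then a handful of assertions: a point pushed through `to_screen` and back through `to_world` should come back unchanged, and `zoom_at` should leave the point under the cursor where it was. If those fail, the coordinate model is wrong before any UI exists.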

Know which loop you’re on.

Loop 1-2: you’re discovering constraints. Don’t write production code. Don’t review. Don’t plan thoroughly. Just probe.
Loop 3-4: you’ve seen the real shape. Now plan properly, now write real code, now review.
Loop 5+: if you’re still here, the problem is bigger than you thought — stop and reconsider scope, don’t just keep grinding.

The bezier work probably blurred these — production-quality code and review starting at loop 1, when you were still discovering what the problem even was. That’s the most common way hard projects burn time.

Capture the map after, not during. While looping, don’t try to write the wisdom skill yet — you don’t know what’s signal vs. noise. After it works, look back: “what did I waste time on that I’d skip next time?” That’s the skill. Writing wisdom mid-struggle usually crystallizes the wrong lessons (the ones from your most recent frustration, not the deepest ones).

Models help most on loops 3+, not loop 1. Loop 1 is “what does this domain even feel like” — a model can’t shortcut that for you, because you need to feel it. Loops 3+ are “given what I now know, what’s the right structure” — that’s where Opus, plan review, code review all earn their cost. Stacking heavy review on loop 1 is like asking a senior engineer to code-review your scratch paper.

So yes: trial → learn → rethink → retry is the way for unexplored territory. The skill is recognizing you’re in unexplored territory early enough to set the loop up cheaply, instead of treating loop 1 as if it were loop 4. Most expensive dev sessions are loop-1 work disguised as loop-4 work.

Takazudo

Awesome. So how about this idea? We have /big-plan for planning; it makes an epic-style plan. After that dev, we might end up at a “fail”. In that case we launch /fail-notes. In this skill, we check what we did, what happened, and how the original idea went wrong. We save it as /l-fail-notes-{topic-name}, a project-scope skill. Then we do the next /big-plan, but this time we have the fail notes. It sounds like this lets us work in that trial-loop way. What do you think?

Claude Code

This is a strong idea — you’re basically building AI-readable postmortems, which is exactly the missing piece in the loop we discussed. Worth doing. A few refinements that will make it pay off more:

1. Reframe “fail” → “what this attempt taught us”

Most dev isn’t binary success/failure. The bezier project “succeeded” but burned a day on a wrong approach first. Pure-failure framing will make you skip writing notes for the most valuable category: things that eventually worked but went sideways first. Call it /retro-notes or /lessons-capture. Trigger condition: “did this take meaningfully longer than expected, or did the approach change mid-way?” — not “did it fail.”

2. Capture after resolution, not mid-struggle

If you write notes while still bleeding, you’ll record symptoms (“zoom was buggy”) instead of root causes (“we threaded the transform through every function instead of inverting at the input boundary”). The valuable lesson is only visible after the fix, looking backward. So gate /retro-notes behind “the thing now works.”

3. Structure the note for the next planner, not for memory

The reader is a future /big-plan run, not future-you nostalgically reviewing. That changes the shape. Useful template:

# Topic: bezier pen + zoom

## What we set out to do
[1-2 sentences]

## Approach we tried first
[the wrong path — be specific about the abstraction, not the symptom]

## Why it went wrong
[the actual root cause — coordinate space inverted in wrong layer, etc.]

## What worked instead
[the right abstraction]

## Watch for next time
- [concrete trap, e.g. "if you see X, you're in the wrong coordinate space"]
- [concrete trap]

## Would-skip-if-redoing
[things that wasted time and are now provably unnecessary]

The “Watch for next time” section is the highest-leverage part — it’s what /big-plan will actually use. Two bullet points there beat two pages of narrative.
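
One way to keep notes in that shape is to render them from structured fields instead of writing them freeform. The sketch below is an assumption about how that might look; `RetroNote` and its field names are hypothetical and simply mirror the template headings.

```python
from dataclasses import dataclass, field


@dataclass
class RetroNote:
    topic: str
    goal: str
    first_approach: str
    root_cause: str
    what_worked: str
    watch_for: list[str] = field(default_factory=list)
    would_skip: str = ""

    def render(self) -> str:
        """Render the note as markdown using the template's section headings."""
        traps = "\n".join(f"- {trap}" for trap in self.watch_for)
        sections = [
            f"# Topic: {self.topic}",
            f"## What we set out to do\n{self.goal}",
            f"## Approach we tried first\n{self.first_approach}",
            f"## Why it went wrong\n{self.root_cause}",
            f"## What worked instead\n{self.what_worked}",
            f"## Watch for next time\n{traps}",
            f"## Would-skip-if-redoing\n{self.would_skip}",
        ]
        return "\n\n".join(sections) + "\n"
```

Keeping `watch_for` as a list nudges the note toward short, concrete traps rather than narrative.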

4. Granularity: one growing file per problem-area, not one per attempt

If every attempt creates a new /l-fail-notes-bezier-2026-04, /l-fail-notes-bezier-zoom, /l-fail-notes-pen-tool, you’ll fragment and the next /big-plan won’t know which to read. Better: one /l-lessons-canvas-tools (or /l-lessons-{project-area}) that you append to and periodically refactor. Skills are most useful when they’re a single coherent document, not a pile of dated logs.

5. Make /big-plan actually read them

The loop only closes if /big-plan auto-discovers relevant lessons. Two options:

/big-plan’s prompt explicitly checks for /l-lessons-* skills matching the topic and reads them before planning
Or: at the start of /big-plan, you manually mention “check /l-lessons-canvas-tools” — fine, but easy to forget

The first is more reliable. Worth a small tweak to /big-plan to do “check for matching project-scope lesson skills” as step zero.
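
A rough sketch of what that "step zero" lookup might do, written as plain Python rather than as the skill itself. It assumes project-scope skills live in `.claude/skills/<name>/SKILL.md` and that lesson skills use an `l-lessons-` prefix followed by hyphen-separated area words; both the layout and the matching rule are assumptions for illustration, and `find_lesson_skills` is a hypothetical helper.

```python
from pathlib import Path


def find_lesson_skills(project_root, topic_words,
                       skills_dir=".claude/skills", prefix="l-lessons-"):
    """Return SKILL.md paths of lesson skills whose area name shares a word with the topic."""
    root = Path(project_root) / skills_dir
    if not root.is_dir():
        return []
    words = {w.lower() for w in topic_words}
    matches = []
    for skill_file in sorted(root.glob(f"{prefix}*/SKILL.md")):
        # "l-lessons-canvas-tools" -> {"canvas", "tools"}
        area_words = set(skill_file.parent.name[len(prefix):].lower().split("-"))
        if words & area_words:
            matches.append(skill_file)
    return matches
```

A planner would read whatever this returns before drafting the plan; an empty list simply means no lessons have been recorded for the area yet.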

6. Pruning matters more than you think

After 6 months you’ll have 15 lesson files, half of them about deprecated approaches. Skill content rots. Plan for a /lessons-refactor pass every few months: merge overlapping notes, delete advice for code that no longer exists, promote repeated lessons into more general wisdom (e.g. several canvas-related notes might converge into a /l-coordinate-spaces-wisdom that applies broadly).

7. The hidden benefit: it forces you to articulate the lesson

Even if the AI never reads the note, the act of writing “here’s what I’d do differently” cements the lesson for you. Half the value is the writing, not the storage. That’s why the structured template matters — freeform “ugh that was hard” notes don’t extract the lesson; structured prompts do.

Putting it all together, your loop becomes:

/big-plan (reads existing /l-lessons-* for the area)
  ↓
attempt N
  ↓
resolved? → /retro-notes (appends to /l-lessons-{area})
  ↓
next /big-plan starts smarter

That’s a genuine learning system, not just a planning system. Each project ends with the codebase + the team’s wisdom both upgraded. Over a year, the wisdom delta probably matters more than any single project’s code delta.

One last thing: the loop also tells you when to stop building skills. If /retro-notes runs and you write “nothing surprising, plan matched reality,” that’s signal that the area is now well-mapped — future /big-plan runs in this area can be lighter, faster, less reviewed. The retros aren’t just for capturing failure; they’re for measuring how mapped the territory is.

Takazudo

Ah, that’s well put. I said exactly what you describe: “zoom was buggy”. Now I see what I got wrong. I treated it as just a “fail”, so all I was doing was running a random-result loop. /skill-creator /skill-tweaker All right, tweak the existing skills and create /retro-notes.

Claude Code

Great realization — that’s the exact reframe. Let me kick off both: /skill-creator to draft /retro-notes, and we can sequence /skill-tweaker after to wire /big-plan to read the lesson files.

(/skill-creator runs, scaffolds the new skill, writes SKILL.md. Then /skill-tweaker runs against /big-plan to insert a lessons-reading step.)

Both skills are in place. Quick recap:

/retro-notes (new global skill at $HOME/.claude/skills/retro-notes/SKILL.md)

- Captures lessons from a just-resolved dev attempt into .claude/skills/l-lessons-{area}/SKILL.md
- Pre-flight gate: only runs after the work is wrapped up, so it won’t crystallize symptoms while you’re still struggling
- Forces the symptom → root-cause distinction (the “zoom was buggy” → “transform threaded through every function” example is baked in)
- Appends to one growing file per problem-area, not one file per attempt
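The actual bezier code is not shown in this log, so the following is only a generic illustration of the root cause named above: converting pointer input to world coordinates once, at the input boundary, instead of threading the zoom transform through every function. All names (`View`, `nearest_anchor`, `on_pointer_down`) and the 8-pixel hit radius are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class View:
    """Zoom and pan of the canvas: screen = world * zoom + pan."""
    zoom: float
    pan_x: float
    pan_y: float

    def screen_to_world(self, sx, sy):
        return ((sx - self.pan_x) / self.zoom, (sy - self.pan_y) / self.zoom)

    def world_to_screen(self, wx, wy):
        return (wx * self.zoom + self.pan_x, wy * self.zoom + self.pan_y)


def nearest_anchor(anchors, point, max_dist):
    """Pure world-space logic: knows nothing about zoom or pan."""
    best_index, best_dist = None, max_dist
    for i, (ax, ay) in enumerate(anchors):
        d = math.hypot(ax - point[0], ay - point[1])
        if d <= best_dist:
            best_index, best_dist = i, d
    return best_index


def on_pointer_down(view, anchors, sx, sy, hit_radius_px=8.0):
    # Input boundary: convert the pointer and the pixel radius exactly once.
    point = view.screen_to_world(sx, sy)
    return nearest_anchor(anchors, point, hit_radius_px / view.zoom)
```

With this shape, only `on_pointer_down` and the drawing code ever see screen coordinates. Everything in between works in world space and stays correct at any zoom level.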

/big-plan (tweaked, existing global skill)

- New “Step 1c: Read project lessons”, inserted between issue fetching and codebase exploration
- Auto-discovers any matching l-lessons-* skills, reads them, and shapes sub-tasks accordingly
- The “Watch for next time” and “Would-skip-if-redoing” sections drive plan adjustments
- Skips silently when no lessons exist, so it never blocks planning
- Description and Key Principles updated to mention the integration

The loop now closes: hard dev → /retro-notes → lessons file → next /big-plan reads it automatically → next attempt starts smarter. No manual remembering required.

One thing to note for the bezier work: since you mentioned it was just resolved, it’s a perfect first candidate. Inside the bezier project repo, run /retro-notes and it will scaffold .claude/skills/l-lessons-canvas-tools/ (or whatever area name fits) with the structured template. You then fill in the root-cause section while it’s fresh.

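Summary

So, what do you think? Honestly, I had never thought this deeply about review. And at the time I really did say “zoom was buggy” and kept making the agent redo the implementation.

Whenever a model gets a version bump, the hype crowd on social media makes a fuss, and whenever the harness side degrades a little, they start complaining all over again. I used to watch that and think, “Come on, is AI a gacha to you? It isn’t.” It turns out I was treating it as a gacha too...

And in the course of the conversation, I had described that implementation going badly as a “failure”, and being told that it wasn’t a failure honestly moved me quite a bit.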