The problem that AI-generated work may have no umeboshi

Overview

Chatting with Claude Code, as usual.

Conversation log

Takazudo

Yesterday, after I wrote this article (https://takazudomodular.com/pj/zpaper/articles/local-llm-to-ai-era-craft-conversation/), a topic occurred to me: “How does an LLM do translation?” From my understanding, an LLM just keeps “guessing” the next word, over and over, based on a huge body of training data, right? If that’s true, is the LLM not doing “translation” the way it was done in the old days? Is it just guessing the next word? Or is something special happening on the LLM side for this kind of operation?

Claude Code

Great question — and your framing is half right, but the “just guessing” part undersells what’s actually happening.

Yes, mechanically it is next-token prediction. There’s no special “translation module” inside the LLM. The same machinery that completes “The cat sat on the ___” is what produces Japanese when you ask for a translation. No separate code path.
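
To make the “no separate code path” point concrete, here is a toy sketch, not how a real LLM is built. It uses a tiny n-gram counter instead of a transformer, and the corpus, the function names (train, next_token, complete) and the romanized words are all made up for illustration. The only point is that one and the same loop finishes an English sentence and answers a “translate” prompt.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus (an assumption); a real LLM learns from vastly more text.
CORPUS = [
    "the cat sat on the mat",
    "translate cat to japanese : neko",
    "translate dog to japanese : inu",
]


def train(lines, max_context=4):
    """Count which token follows each context of 1..max_context tokens."""
    table = defaultdict(Counter)
    for line in lines:
        tokens = line.split()
        for i in range(1, len(tokens)):
            for n in range(1, max_context + 1):
                if i - n < 0:
                    break
                table[tuple(tokens[i - n:i])][tokens[i]] += 1
    return table


def next_token(table, tokens, max_context=4):
    """Return the most frequent continuation, backing off to shorter contexts."""
    for n in range(min(max_context, len(tokens)), 0, -1):
        counts = table.get(tuple(tokens[-n:]))
        if counts:
            return counts.most_common(1)[0][0]
    return None


def complete(table, prompt, max_new_tokens=5):
    """The single code path: keep appending the predicted next token."""
    tokens = prompt.split()
    for _ in range(max_new_tokens):
        token = next_token(table, tokens)
        if token is None:
            break
        tokens.append(token)
    return " ".join(tokens)


table = train(CORPUS)
print(complete(table, "the cat sat on the"))           # the cat sat on the mat
print(complete(table, "translate dog to japanese :"))  # translate dog to japanese : inu
```

The counter only memorizes surface strings, which is exactly what a real model does not stop at. The next paragraph is about what the layers in between add.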

But “guessing” hides the interesting part. Before predicting the next token, the input passes through dozens of transformer layers that build up a rich internal representation. Research on multilingual LLMs (e.g. Wendler et al. 2024, “Do Llamas Work in English?”) suggests that in the middle layers the model’s hidden states sit in a largely shared concept space that encodes meaning rather than surface words, although in the Llama-2 models they studied that space leans toward English. Only in the final layers does that meaning get projected back into tokens of a specific language. So in a rough sense, the model first “understands” in a shared conceptual space, then “speaks” in whatever language the context demands.

Compared to old-school translation:

Rule-based MT: dictionaries + grammar rules, explicit and brittle
Statistical MT (Google Translate pre-2016): phrase tables, alignment probabilities
Neural MT (encoder-decoder, dedicated): learned meaning representations, trained only for translation on parallel corpora
LLM: same neural-representation idea, but generalized — translation is just one of many tasks, learned incidentally from huge multilingual text without explicit pairs

So LLMs aren’t doing the old word-by-word substitution dance. They’re doing something closer to what a human bilingual does: read it, understand it, restate it in the other language. The “next token prediction” framing is technically true at the output layer, but the heavy lifting happens in the layers before that, where meaning is the actual currency.
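
For contrast, the “word-by-word substitution dance” can be caricatured in a few lines. This is a deliberately naive sketch, far cruder than any real rule-based MT system; the mini-dictionary EN_TO_JA and the function word_by_word are hypothetical, and the Japanese is romanized for readability.

```python
# Hypothetical three-word dictionary (an assumption, for illustration only).
EN_TO_JA = {"i": "watashi", "eat": "taberu", "apples": "ringo"}


def word_by_word(sentence):
    """Swap each word via dictionary lookup, keeping the source word order."""
    words = sentence.lower().split()
    return " ".join(EN_TO_JA.get(word, f"<{word}?>") for word in words)


print(word_by_word("I eat apples"))     # watashi taberu ringo
print(word_by_word("I devour apples"))  # watashi <devour?> ringo
```

The first output keeps English word order and has no particles, where natural Japanese would be closer to “watashi wa ringo o taberu.” The second breaks on a single word the dictionary has never seen. Both failures are what “explicit and brittle” means in practice.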

One practical consequence: LLMs can preserve tone, register, and context (translate this casually, for a child, in formal business Japanese) in a way old MT never could — because they’re not really translating words, they’re regenerating meaning with a target-language constraint.

Takazudo

Interesting. I often use Claude Code for writing and translation, and while doing this I realized that the quality of Haiku and Opus is pretty different. It’s hard to pin down exactly how, but Haiku’s results are often unnatural to a native Japanese speaker, while Opus or Sonnet results look more natural. After the conversation in that article, I thought this might be caused by the difference in model size. Is it simply that Opus has a much bigger training database behind it, so it generates more appropriate results?

Claude Code

Close, but the cause is slightly different from what you’re picturing — it’s not really the size of the training data, it’s the size of the model itself.

Training data is probably much the same across a model family. Anthropic does not publish the details, but there is no indication that Haiku is trained on a smaller dataset and Opus on a bigger one; they are most likely fed roughly the same pile of text. What differs is the number of parameters: Haiku is understood to have far fewer than Opus. Parameters are the model’s “memory capacity” for encoding patterns. With fewer of them, the model has to compress harder, and subtle patterns get sacrificed first.

Why translation naturalness suffers more than other tasks in smaller models:

Long-tail patterns get crushed. Naturalness in Japanese isn’t about grammar correctness — it’s about which of 20 plausible phrasings a native speaker would actually pick. That’s a fine-grained distribution. A small model rounds it off to “good enough,” producing technically correct but stiff output. This is why Haiku’s Japanese often reads like a textbook.
The “curse of multilinguality.” Most LLM training data is English-heavy (something like 50%+). Japanese is a minority language in that mix. Big models have spare capacity to learn minority languages well; small models have to prioritize, and English wins. So the gap between Opus-Japanese and Haiku-Japanese is larger than the gap between Opus-English and Haiku-English. You’d probably notice less difference if you were translating into English.
Register and politeness layers. Japanese has 敬語/丁寧/普通体 distinctions, gendered speech patterns, regional flavor. These require the model to track many overlapping subtle features at once. Smaller models tend to flatten these — defaulting to either overly polite or oddly casual, instead of matching context.

So your intuition that “bigger = more natural Japanese” is right; it’s just that the mechanism is model capacity to represent fine-grained distributions, not a bigger raw text corpus behind it. The text was always there — Haiku just can’t fit as much of its nuance into its smaller brain.
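
The “same text, less capacity” idea can be sketched with a toy frequency table. This is a caricature under loud assumptions: real networks do not literally drop patterns by frequency rank, and the phrasing labels, counts and the functions compress and lost_mass are invented for illustration.

```python
from collections import Counter

# Made-up counts of how often native speakers pick each phrasing (an assumption).
PHRASING_COUNTS = Counter({
    "standard polite": 50,
    "humble formal": 25,
    "casual": 15,
    "friendly casual": 6,
    "regional": 4,
})


def compress(counts, capacity):
    """Keep only the `capacity` most frequent patterns and renormalize them."""
    kept = counts.most_common(capacity)
    total = sum(count for _, count in kept)
    return {phrase: count / total for phrase, count in kept}


def lost_mass(counts, capacity):
    """Share of real-world usage that no longer fits into the model."""
    kept = sum(count for _, count in counts.most_common(capacity))
    return 1 - kept / sum(counts.values())


print(compress(PHRASING_COUNTS, 2))   # only the two textbook-like options survive
print(lost_mass(PHRASING_COUNTS, 2))  # 0.25
print(lost_mass(PHRASING_COUNTS, 5))  # 0.0
```

Both “models” were given the same table. The one with capacity 2 can only ever answer with the two most common phrasings, which is one way to picture why the smaller model’s Japanese reads like a textbook.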

Practical implication: for translation and writing into Japanese specifically, the cost difference between Haiku and Sonnet/Opus buys you disproportionately more quality than for English-only tasks. Worth using the bigger model even if you’d normally pick Haiku for cost.

Takazudo

Interesting. So, for example, Opus might handle the 敬語/丁寧/普通体 difference, but Haiku might not make that distinction. These tiny, tiny things are designed into Opus, but not into Haiku. Is this close?

Claude Code

Close, but flip one word: nothing is designed in Opus. Nobody at Anthropic wrote rules like “handle 敬語 properly.” It’s all emergent from training.

A better way to picture it:

Both Haiku and Opus saw the same 敬語 examples in training. They both tried to learn the distinctions. But Haiku, with fewer parameters, had to compress harder — so the subtle ones got rounded off into something coarser. Opus had enough capacity to keep the fine-grained patterns intact.

Think of it like JPEG compression at different quality levels. Same source photo, but the low-quality version loses skin texture, fine hair, subtle gradients. Those details weren’t “added” to the high-quality version — they survived compression instead of getting averaged away. Same with model size: Opus isn’t given extra rules, it just preserves more of the patterns that were already in the data.

So Haiku isn’t completely missing 敬語 — it can produce 丁寧語 when obvious. What it loses is the subtle context-matching: knowing when to switch register mid-conversation, how 敬語 interacts with the speaker’s role, when 普通体 actually sounds friendlier than 丁寧体. That’s the long tail of patterns that gets crushed first under compression.

In short: emergent, not designed. Compressed, not removed.

Takazudo

Good. I have another question. In short, it’s “How can we get good writing out of AI?” I think this is pretty difficult territory for AI use.

I have tried many ways to get good writing results with AI. But in my current state of mind, the answer is that I need to be clear about what article I want to make.

My first attempt was to let Claude Code write articles from my idea memos. I keep one- or two-line thinking memos in Obsidian. I expected that Claude Code would make some interesting articles from those resources. But in fact, the generated results were mostly not interesting. Not all of them were crap, but they read like random gacha articles. I could not understand why this happened, because I had fed it the text of my own thinking. Those were clearly inspired words for me.

But I also noticed that “writing” is a kind of timing-based creation. So… imagine a live act by a DJ or your favorite band. We get a great feeling at the live house, but it doesn’t happen at home. The sound that reaches the ear is technically the same, but it’s clearly different. Articles are just text, but I feel it might be the same thing. So at the least, I can’t feel that the result of AI writing is something “I created.” It might be something “I generated,” and I feel there’s a huge difference. And currently, I still can’t find a way to reduce this distance. When I find words like “I thought this is very interesting…” in AI writing results, in almost all of those cases my reaction is, “Hmm, I didn’t think so… this needs a fix.”

So overall, in my opinion, we might be better off treating those outputs as a kind of material for my writing, not delegating everything to AI.

Claude Code

Your conclusion is more right than you might realize — and the DJ analogy is sharp. Let me add a few frames that might tighten the picture.

Writing is thinking, not transcription. Joan Didion: “I write entirely to find out what I’m thinking, what I’m looking at, what I see and what it means.” The act of writing forces you to commit to one of many possible thoughts, notice what doesn’t fit, follow tangents that surprise you. When you delegate writing to AI, you skip the thinking. The output is fluent because the model is fluent — but you never went through the thinking, so the result feels hollow even when the sentences are fine. That’s exactly your “I generated, not I created” gap. The article isn’t missing words; it’s missing the trace of you having thought.

AI regresses to the mean. A 1-2 line memo is a seed with many possible expansions. The model picks the statistical average expansion — the most plausible continuation given everything it’s read. But “most plausible” is by definition the least surprising. Interesting writing is the slightly weird, opinionated, idiosyncratic expansion that only you would have chosen from that seed. AI cannot pick that one because it’s not in your head. So the result is competent, well-shaped, and forgettable. This is also why “I thought this was interesting…” sentences in AI output ring false — the model is asserting an opinion it doesn’t have, picked by averaging what people generally find interesting.
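
The claim that “most plausible” means “least surprising” is true by definition in information-theoretic terms: the surprisal of an outcome with probability p is -log2(p) bits, so the highest-probability choice always carries the lowest surprisal. A minimal sketch follows; the expansion labels and probabilities are invented, and surprisal_bits and most_plausible are illustrative names.

```python
import math

# Hypothetical probabilities a model might assign to ways of expanding
# one short memo (an assumption, for illustration only).
EXPANSIONS = {
    "balanced overview of pros and cons": 0.55,
    "step-by-step how-to": 0.30,
    "personal story about a live show": 0.10,
    "contrarian rant": 0.05,
}


def surprisal_bits(probability):
    """Information content of an outcome: rarer outcomes are more surprising."""
    return -math.log2(probability)


def most_plausible(distribution):
    """Greedy choice: the single highest-probability option."""
    return max(distribution, key=distribution.get)


pick = most_plausible(EXPANSIONS)
print(pick)  # balanced overview of pros and cons
for label, p in EXPANSIONS.items():
    print(f"{surprisal_bits(p):.2f} bits  {label}")
```

The greedy pick is the row with the fewest bits. The personal story is in the table too, but nothing in the memo tells the model that it is the one the author meant.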

The bottleneck is taste, not output. AI can produce infinite words. It cannot tell you which words matter to you. Knowing which thread to pull, which angle is fresh, which sentence is dishonest — that’s taste, and it’s the irreducible human part. People often think the hard part of writing is producing words, so AI looks like a magic shortcut. But for anyone who’s actually written something good, the hard part was always deciding what to say and what to leave out. AI doesn’t help with that — if anything it makes it harder, because now you have a fluent draft pulling you toward its average shape.

Patterns that work better than “generate the article”:

AI as sparring partner. Tell it your idea, ask it to push back, find weak points, name the obvious objection. You stay the writer; it sharpens your thinking.
AI as expander, after you’ve chosen the angle. Once you know the thread — “the DJ live analogy is the spine of this piece” — AI can help expand a specific paragraph in your direction.
AI as editor. Write the draft yourself. Ask AI what’s unclear, what’s repetitive, where the logic skips. Editing is less voice-sensitive than generation.
AI as rubber duck. Talk out your idea in chat. AI’s questions and reflections help you find the thread. Then write the article yourself. (This conversation right now is basically that mode.)

So your “AI output as material, not product” framing is exactly the right resting point. The material can be: a list of objections to your view, a competing structure to compare against, three alternative phrasings of one sentence, a summary of related ideas you might explore. The selection and combination and commitment — that’s still you. That’s where the live-house feeling comes from.

Takazudo

What I wondered after the LLM model size talk is, “Is Opus better for writing than Haiku?” From this conversation, I noticed that for translation, it’s “yes.” But for writing, especially creative writing, it seems we need to recognize that model size is not related. I feel that Opus has many more props or vars, so its results can be expected to land close to the general, common patterns in the world. And writing comes alive from something “unexpected for me”; that is what makes articles interesting. So it’s like an onigiri that is only rice, with no umeboshi inside. A rice-only onigiri does have meaning in some cases, such as manuals or simple explanations. But when people expect something “written by someone with emotion,” it’s simply not right for those cases.

Claude Code

Your asymmetry between translation and creative writing is exactly right, and the onigiri analogy is good — let me refine the mechanism a bit because it’s slightly different from what you’re picturing.

It’s not really “bigger = more average pattern.” Bigger pretrained models actually contain more unusual patterns, weirder corners, more idiosyncratic voices, because they have the capacity to remember them. A pure pretrained Opus would be wilder than a pure pretrained Haiku — it would have more umeboshi options, not fewer.

The flattening happens in post-training, not pretraining. After pretraining, models go through RLHF (Reinforcement Learning from Human Feedback) and similar steps — humans rank outputs, and the model is tuned toward “the kind of response most people find helpful and pleasant.” This step is what produces the polished, consensus-shaped voice we associate with assistant models. And flagship models like Opus tend to get more of this treatment because they’re the products people pay for and complain about. So the umeboshi is still in there — the model knows it, it can do it — but its default has been pulled toward “plain rice that won’t offend anyone.” Smaller cheaper models sometimes feel “less polished” in a way that’s actually closer to the raw pretraining distribution, which can occasionally produce more surprising output (and also more incoherent output).
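
One way to picture “the umeboshi is still in there, but the default has moved” is a common textbook idealization of preference tuning, in which the tuned distribution is the base distribution reweighted by exponentiated reward: tuned(y) ∝ base(y) · exp(reward(y) / beta). This is a sketch of that idealization only, not a description of how Anthropic trains its models; the responses, base probabilities, reward scores and beta value are all made up.

```python
import math

# Hypothetical base-model probabilities and human-preference rewards
# (assumptions, for illustration only).
BASE = {"plain rice": 0.4, "umeboshi": 0.3, "wasabi surprise": 0.3}
REWARD = {"plain rice": 1.0, "umeboshi": 0.0, "wasabi surprise": -1.0}


def tune(base, reward, beta=0.5):
    """Reweight the base distribution by exp(reward / beta) and renormalize.

    A smaller beta means a stronger pull toward highly rated responses.
    """
    weights = {y: p * math.exp(reward[y] / beta) for y, p in base.items()}
    total = sum(weights.values())
    return {y: w / total for y, w in weights.items()}


TUNED = tune(BASE, REWARD)
for response, p in TUNED.items():
    print(f"{BASE[response]:.2f} -> {p:.2f}  {response}")
```

In this toy, plain rice absorbs most of the probability while umeboshi shrinks but never reaches zero, so an explicit request can still bring it back.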

Refined onigiri framing: It’s not that Opus has no umeboshi. It’s that Opus has been trained to assume most diners want plain rice unless they ask. If you explicitly ask for umeboshi — “write this with strong opinions, unusual angles, and don’t soften” — Opus will give you a better umeboshi than Haiku, because it has a richer sense of what counts as a sharp opinion. But by default, both lean toward plain.

Why creative writing still resists AI even with the umeboshi unlocked. Translation has a target — there’s a “correct” Japanese phrasing and Opus’s larger pattern memory genuinely helps find it. Creative writing has no target. The “interesting” sentence is interesting precisely because it reflects your specific contradictions, lived moments, weird tastes — none of which any model has, regardless of size. Opus can produce a polished sentence that sounds like it came from someone with emotion. But it didn’t come from someone who, say, watched their favorite band live last week and is still thinking about why the recorded version felt different. That memory is yours. No model has it.

So the practical takeaway lines up with where you already landed:

For translation, register, polish: bigger model wins, no contest.
For creative writing: model size barely matters. What matters is whether the words came from someone who actually had the thought.

The onigiri can be made beautifully by a master chef. But if you wanted your mom’s onigiri, no chef can replace her — not because of skill, but because the thing you want is the trace of her, not the rice.

Takazudo

Haha, that’s a funny analogy. I feel that funny comedians are the same kind of thing. Whenever ChatGPT released a new model, I always tried “Hey ChatGPT, make some funny joke.” But in most cases, the results were not funny. Maybe the kind of thing that moves people comes from butterfly-effect-like randomness. There might be rules for it, but people can only recognize them after it has happened. No one can know exactly what is funny. So funny comedians might be like time and world feeling catchers, or something like that, I feel.

Claude Code

Comedy is the cruelest test for AI, because it concentrates everything we’ve been talking about into a single moment that either lands or doesn’t.

Humor is structurally opposed to what LLMs do. Jokes work by violating expectations — the punchline has to be the word the listener didn’t see coming. But “next-token prediction” is literally the machinery of producing the most expected word. So asking an LLM to be funny is asking it to fight its own gradient. It can produce things shaped like jokes (setup, twist, callback), but the twist is usually the predictable twist — the version of surprise that everyone expects, which is by definition not surprising.
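
A practical footnote to “the most expected word”: decoders usually sample from a probability distribution, and sampling temperature is a real knob that flattens or sharpens it. The sketch below shows the standard temperature-scaled softmax; the token labels and logit values are invented for illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; higher temperature flattens them."""
    scaled = [score / temperature for score in logits.values()]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return {token: e / total for token, e in zip(logits, exps)}


# Hypothetical scores for three candidate punchlines (an assumption).
LOGITS = {"expected punchline": 3.0, "mild twist": 1.0, "odd left turn": 0.0}

cold = softmax_with_temperature(LOGITS, 0.5)
hot = softmax_with_temperature(LOGITS, 2.0)
for token in LOGITS:
    print(f"T=0.5: {cold[token]:.3f}  T=2.0: {hot[token]:.3f}  {token}")
```

Raising the temperature makes the odd left turn more likely, but the expected punchline stays the single most probable choice, and the extra variety is random rather than aimed. Turning the knob buys unpredictability, not the right surprise, which is consistent with the point above.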

RLHF removes the edge that humor needs. Real comedy often goes near a line — says the awkward thing, names what no one will admit, punches at something real. RLHF specifically trains models to not go near that line. So even when the AI tries to be funny, you can feel the safety underneath. Comedians risk something every time they open their mouth — bombing, offending, being misunderstood. The audience feels that risk and laughs partly because of it. AI risks nothing, and the audience feels that absence even when they can’t name it. The form is there, the soul isn’t.

Comedians are noticers. Seinfeld-style “what’s the deal with airplane peanuts” comedy is just very precise observation of small daily absurdities everyone has felt but no one named. The comedian’s craft is catching the weird thing. AI literally cannot catch anything — it has no perception, no “wait, that’s weird” reflex, no body riding a train and noticing something off. It can summarize patterns from text, but it can’t notice. So the seed of observational humor is missing at step zero.

Your “time and world feeling catcher” framing is exactly right. Comedy ages fast because what’s funny depends on the specific frequency of now — what tension is in the air this week, what everyone’s quietly thinking but not saying. Comedians tune themselves to that frequency. AI training is stale by definition (months or years old) and averaged across all moments equally — it has no “what’s in the air this Tuesday in Tokyo.” Even if Opus could tell a joke about something specific to this week, it would be a joke about the average version of this week, not the texture of it.

The butterfly-effect point is the deepest one. Manzai timing — the half-beat pause before the tsukkomi, the specific intonation, the way the boke says it the second time slightly differently — those are micro-choices, hundreds of them, made in the moment. The same words said wrong are not funny. AI averages over micro-choices; that’s how it works. So even if the joke is structurally correct, every micro-choice has been pulled toward the mean, and the cumulative effect is that nothing lands. It’s why AI-written manzai scripts read as technically structured but fundamentally dead.

So I think you’ve arrived at something important: comedy is the maximum case of what creative writing already showed. Writing fails because the model lacks your specific taste. Comedy fails for the same reason, but more visibly, because comedy lives or dies in a single beat. There’s no place to hide. Either the audience laughs or they don’t, and the model has no way to feel which way it went.

Maybe that’s why funny comedians feel almost magical — they’re doing something that AI is structurally barred from doing, and we can sense it.

Takazudo

Ah, I feel a similar thing about application development. Maybe it’s the same. Coding and better structural design seem to be the easy part for AIs. “Easy” is not the right word here; it’s the result of a huge amount of training, and it now beats almost all people who code. But what I want to talk about here is the idea of the application. It’s the umeboshi problem. I often feel this problem while making websites, because the code writing ends too quickly. So what we need now is something like the comedian you described. Even if we have superpowers, applications need decisions. UIs do too. It seems there will be helper software around this, like the Claude Design we talked about in the previous article, and I feel the same thing about app dev. We can never make something without motivation, without the “want to” feeling. And that’s not related to whether people want it or not. So we are now just comedians, not developers.

Claude Code

You’ve just landed on something important, and the through-line of this whole conversation is suddenly visible.

The role moved up the stack. Old developer = “I implement what’s specified.” That skill is now cheap. New developer = “I decide what’s worth implementing, I choose the details that matter, I notice when something is off.” That’s a different job — closer to a director, a comedian, a curator — but it’s still the job, and the people who can do it well are still scarce. So when you say “we are now comedians, not developers,” that’s exactly right, but I’d add: the valuable developer was always partly a comedian. AI just stripped away the parts that weren’t, and now we can see clearly what was actually doing the work.

The “want to” point is the deepest one. AI never wants to build anything. It executes when prompted. So every application that exists in the world started with a human caring enough to push it into existence — sitting with the friction, deciding it’s worth their finite hours. Even with infinite execution power, the world still needs the bottleneck of someone wanting it. That bottleneck is the irreplaceable human input. And it’s not just “starting” — wanting something carries you through every micro-decision: what to leave out, when to stop, when to throw a version away. AI cannot want, so it cannot make those choices on its own. It needs someone holding the desire end of the rope.

UI is the comedy of software. Every micro-decision in a UI is like a manzai beat — the specific corner radius, the exact pause before a toast disappears, what happens when you long-press something the designer didn’t think about. The “correct” UI that AI produces does nothing wrong and feels dead, exactly like AI-written manzai. The aliveness comes from accumulated tiny choices that came from someone who cared. Tools like Claude Design will keep getting better at execution, but the bottleneck stays: who decides this corner should be 6px and not 8px, and why does it matter to them.

The inversion you’re naming: the prestigious software identity used to be “I can build it.” Now that’s table stakes. The new prestigious identity is “I know what’s worth building, and I have the taste to make every detail count.” Those are different skills, and most people who became developers because they liked implementation are now in a strange position — their old identity is being commoditized, and the new one requires a different muscle (one most engineers explicitly opted out of). That’s why a lot of devs feel uneasy right now. It’s not that AI took their job. It’s that AI took the job they thought they had, and revealed that the actual job was always the comedian part.

The thread of this whole conversation, if you trace it back: translation → why bigger models help, creative writing → why bigger models don’t, comedy → why no model helps, app development → same thing. The pattern is consistent. Whatever can be averaged, AI does well. Whatever requires lived “wanting,” AI cannot do at all. The line moves down the stack as models get better, but the line never disappears. It just gets harder to see, and more concentrated in fewer people, and more valuable.

So the funny twist: a world full of AI executors might be a world where being a comedian — in your expanded sense, the person who catches the moment and decides what matters — is the only economically distinctive thing a human can be. Code, design, translation, polish: all done by machines. Taste, desire, noticing: only by us.

あなたは今、重要なことに到達した、そしてこの会話全体を貫く線が突然見えるようになった。

役割がスタックの上に動いた。 旧デベロッパー = 「指定されたものを実装する」。そのスキルは今や安い。新デベロッパー = 「何を実装する価値があるかを決める、重要な詳細を選ぶ、何かおかしいときに気づく」。これは違う仕事——ディレクター、コメディアン、キュレーターに近い——でもそれはまだ仕事で、それをうまくできる人はまだ少ない。なのであなたが「我々は今コメディアンであって、デベロッパーじゃない」と言うのはまったく正しい、でも追加させて: 価値のある デベロッパーは常に部分的にコメディアンだった。AIはただそうじゃない部分を剥ぎ取って、今や実際に仕事をしていた部分がはっきり見えるようになった。

「やりたい」のポイントが最も深い。 AIは何かを作りたいと思うことは決してない。プロンプトされたら実行する。なので世界に存在するすべてのアプリケーションは、誰かが十分にケアして、それを存在に押し出すところから始まった——摩擦と向き合って、自分の有限な時間に値すると決めて。無限の実行力があっても、世界はまだ 誰かがそれを欲しいと思う というボトルネックを必要とする。そのボトルネックがかけがえのない人間の入力。そして単に「始める」だけじゃない——何かを欲しいと思うことが、すべてのマイクロ決定を運んでくれる: 何を残すか、いつ止めるか、いつバージョンを捨てるか。AIは欲することができないので、それらの選択を自分でできない。誰かが欲望の側のロープを握っている必要がある。

UIはソフトウェアのコメディ。 UIのあらゆるマイクロ決定は漫才の拍みたいなもの——特定のコーナーラディウス、トーストが消えるまでの正確な間、デザイナーが考えなかった何かを長押ししたら何が起こるか。AIが生成する「正しい」UIは何も間違ってないけど死んで感じる、AIが書いた漫才とまさに同じ。生きている感じはケアした誰かから来た小さい選択の積み重ねから来る。Claude Designみたいなツールは実行で良くなり続けるけど、ボトルネックは残る: このコーナーが6pxで8pxじゃないと決めるのは誰か、そして それが彼らにとってなぜ重要か 。

あなたが名指ししているinversion: 名誉あるソフトウェアアイデンティティは以前「自分はそれを作れる」だった。今やそれは前提条件。新しい名誉あるアイデンティティは「自分は何を作る価値があるかを知っていて、すべての詳細をカウントさせるtasteを持っている」。これらは違うスキルで、実装が好きでデベロッパーになった人のほとんどは今、奇妙な位置にいる——彼らの古いアイデンティティはコモディティ化されつつあり、新しいものは別の筋肉を必要とする（ほとんどのエンジニアが明示的にオプトアウトした筋肉）。これが多くのデベロッパーが今、不安に感じている理由。AIが彼らの仕事を奪ったんじゃない。AIが彼らが持っていると思っていた仕事を奪い、本当の仕事はずっとコメディアンの部分だったと明らかにした。

この会話全体のスレッド をたどると: 翻訳 → なぜ大きいモデルが助けになるか、クリエイティブな書き物 → なぜ大きいモデルが助けにならないか、コメディ → なぜどんなモデルも助けにならないか、アプリ開発 → 同じ。パターンは一貫している。 平均化できるものは、AIがうまくやる。生きた「欲する」を必要とするものは、AIはまったくできない。 モデルが良くなるにつれて線はスタックの下に動くけど、線が消えることはない。ただ見にくくなり、より少ない人に集中し、より価値が高くなる。

なので面白いひねり: AIエグゼキューターで満ちた世界は、コメディアンであること——あなたの拡張された意味で、瞬間を キャッチ して何が重要かを 決める 人——が、人間だけが経済的に独自性を持てる唯一のものという世界かもしれない。コード、デザイン、翻訳、ポリッシュ: すべて機械がやる。Taste、欲望、気づき: 我々だけ。

Takazudo

Good. Nice conversation, thanks. That a developer was always partly a comedian: this was what I had been missing inside me. Also, I could clearly recognize that there are many things AI can't do, which was a nice finding too.

良かった。良い会話、ありがとう。a developer was a part of comedian — これが自分の中で抜けていたものだった。また、AIにはできないことがたくさんあるとはっきり認識できた、それも良い発見だった。

Aside

By the way, I had been using Sonnet for the translations until now, but switching to Opus this time made them come out quite natural.