zpaper-draft

"developer was a part of comedian": a conversation with AI that runs through translation, writing, comedy, and development

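Overview

After writing the previous article, I kept asking Claude about things that were still on my mind, and the conversation grew much broader than I expected. This is the log of that conversation.

It started with "How do LLMs actually translate?", moved on to the model sizes of Opus and Haiku, then to "So what about writing?", and rolled on from creative writing to comedy to app development, finally landing on what the job of a developer is.

At the end of the conversation I commented that "a developer was a part of comedian" and that "this was what I missed inside me. Also, I could recognize that there are many things that AI can’t do clearly, it was also nice finding." Looking back after writing this up, that feels like the biggest takeaway. I wanted to keep the conversation as it happened, so it is pasted below in the original English.

Conversation log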

Takazudo

Yesterday, after I wrote https://takazudomodular.com/pj/zpaper/articles/local-llm-to-ai-era-craft-conversation/ this article, I have a topic that I realized. It’s “How does LLM do translation?”. From my understanding, the LLM is doing just “guessing” next word like forever, from the huge database of training, right? If this is the truth, the LLM does not do the “translation” that we did the way in old days? It’s just guessing the next word? Or doing something special on LLM side for this kind of operations?

Claude

Great question — and your framing is half right, but the “just guessing” part undersells what’s actually happening.

Yes, mechanically it is next-token prediction. There’s no special “translation module” inside the LLM. The same machinery that completes “The cat sat on the ___” is what produces Japanese when you ask for a translation. No separate code path.
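As a rough illustration of that loop (a toy sketch, not how any production model is built), the Python below uses a hand-written lookup table, `TOY_MODEL`, in place of a trained network. The table, its probabilities, and the `generate` helper are all invented for this example. The point is only that generation is one repeated step: look at the context, pick a next token, append it.

```python
# Toy stand-in for a language model: maps the previous token to made-up
# probabilities for the next token. A real LLM conditions on the whole
# context through many transformer layers; this table is an assumption
# used only to show the shape of the loop.
TOY_MODEL = {
    "<s>": {"the": 0.9, "a": 0.1},
    "the": {"cat": 0.6, "dog": 0.4},
    "a": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"ran": 0.8, "sat": 0.2},
    "sat": {"</s>": 1.0},
    "ran": {"</s>": 1.0},
}


def generate(max_tokens: int = 10) -> list[str]:
    """Greedy decoding: always append the single most probable next token."""
    tokens: list[str] = []
    current = "<s>"
    for _ in range(max_tokens):
        candidates = TOY_MODEL[current]
        next_token = max(candidates, key=candidates.get)
        if next_token == "</s>":  # end-of-sequence marker stops the loop
            break
        tokens.append(next_token)
        current = next_token
    return tokens


print(" ".join(generate()))
```

Nothing in the loop knows whether it is completing a sentence or translating one. In this picture, a translation request is just a different context fed into the same step.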

But “guessing” hides the interesting part. Before predicting the next token, the input passes through dozens of transformer layers that build up a rich internal representation. Research on multilingual LLMs (e.g. Wendler et al. 2024, “Do Llamas Work in English?”) suggests that in the middle layers the model’s hidden states sit in an abstract “concept space” that encodes meaning rather than the surface words of the input language. In the Llama-2 models that paper studied, this space is not perfectly neutral: it lies closer to English than to other languages. Only in the final layers does that meaning get projected into tokens of the specific target language. So in a real sense, the model first “understands” in a shared conceptual space, then “speaks” in whatever language the context demands.

Compared to old-school translation:

  • Rule-based MT: dictionaries + grammar rules, explicit and brittle
  • Statistical MT (Google Translate pre-2016): phrase tables, alignment probabilities
  • Neural MT (encoder-decoder, dedicated): learned meaning representations, trained only for translation on parallel corpora
  • LLM: same neural-representation idea, but generalized — translation is just one of many tasks, learned incidentally from huge multilingual text without explicit pairs

So LLMs aren’t doing the old word-by-word substitution dance. They’re doing something closer to what a human bilingual does: read it, understand it, restate it in the other language. The “next token prediction” framing is technically true at the output layer, but the heavy lifting happens in the layers before that, where meaning is the actual currency.

One practical consequence: LLMs can preserve tone, register, and context (translate this casually, for a child, in formal business Japanese) in a way old MT never could — because they’re not really translating words, they’re regenerating meaning with a target-language constraint.

Takazudo

Interesting. I often use Claude Code for writing and translation, while this operation, I realized the quality of Haiku and Opus is pretty different. It’s hard to clarify how — but, Haiku’s result is often unnatural for native Japanese speaker. But Opus or Sonnet results looks more natural. After the conversation of the given article, I thought that this might be caused of the difference of the model’s size. The Opus is simply has much huge learning database behind it, so more proper result is generated?

Claude

Close, but the cause is slightly different from what you’re picturing — it’s not really the size of the training data, it’s the size of the model itself.

Training data is generally assumed to be broadly similar across a model family. Anthropic does not publish the details, but the common understanding is that Haiku is not simply trained on a much smaller pile of text than Opus. What differs is the number of parameters: Haiku is the smaller model. Parameters are the model’s “memory capacity” for encoding patterns. With fewer of them, the model has to compress harder, and subtle patterns get sacrificed first.

Why translation naturalness suffers more than other tasks in smaller models:

  1. Long-tail patterns get crushed. Naturalness in Japanese isn’t about grammar correctness — it’s about which of 20 plausible phrasings a native speaker would actually pick. That’s a fine-grained distribution. A small model rounds it off to “good enough,” producing technically correct but stiff output. This is why Haiku’s Japanese often reads like a textbook.
  2. The “curse of multilinguality.” Most LLM training data is English-heavy (something like 50%+). Japanese is a minority language in that mix. Big models have spare capacity to learn minority languages well; small models have to prioritize, and English wins. So the gap between Opus-Japanese and Haiku-Japanese is larger than the gap between Opus-English and Haiku-English. You’d probably notice less difference if you were translating into English.
  3. Register and politeness layers. Japanese has 敬語/丁寧/普通体 distinctions, gendered speech patterns, regional flavor. These require the model to track many overlapping subtle features at once. Smaller models tend to flatten these — defaulting to either overly polite or oddly casual, instead of matching context.

So your intuition that “bigger = more natural Japanese” is right; it’s just that the mechanism is model capacity to represent fine-grained distributions, not a bigger raw text corpus behind it. The text was always there — Haiku just can’t fit as much of its nuance into its smaller brain.

Practical implication: for translation and writing into Japanese specifically, the cost difference between Haiku and Sonnet/Opus buys you disproportionately more quality than for English-only tasks. Worth using the bigger model even if you’d normally pick Haiku for cost.

Takazudo

interesting. So for example, Opus might handle 敬語/丁寧/普通体 difference, but Haiku might not have such difference. this kind of tiny tiny things are designed in Opus, but not for Haiku. Is this close?

Claude

Close, but flip one word: nothing is designed in Opus. Nobody at Anthropic wrote rules like “handle 敬語 properly.” It’s all emergent from training.

A better way to picture it:

Both Haiku and Opus saw the same 敬語 examples in training. They both tried to learn the distinctions. But Haiku, with fewer parameters, had to compress harder — so the subtle ones got rounded off into something coarser. Opus had enough capacity to keep the fine-grained patterns intact.

Think of it like JPEG compression at different quality levels. Same source photo, but the low-quality version loses skin texture, fine hair, subtle gradients. Those details weren’t “added” to the high-quality version — they survived compression instead of getting averaged away. Same with model size: Opus isn’t given extra rules, it just preserves more of the patterns that were already in the data.
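The compression picture can be made concrete with a small numeric sketch. The Python below is an analogy only: it rounds a made-up probability distribution over seven candidate phrasings to a coarse grid, the way a low-quality setting rounds away fine detail. The numbers and the `quantize` helper are hypothetical, and real models do not store phrasing probabilities in a table like this.

```python
def quantize(probs: list[float], levels: int) -> list[float]:
    """Round each probability to the nearest multiple of 1/levels, then renormalize."""
    rounded = [round(p * levels) / levels for p in probs]
    total = sum(rounded)
    if total == 0:
        raise ValueError("grid too coarse: every probability rounded to zero")
    return [r / total for r in rounded]


# Hypothetical distribution over seven phrasings a native speaker might choose,
# from the most common down to the rare but occasionally perfect one.
PHRASINGS = [0.40, 0.25, 0.15, 0.08, 0.05, 0.04, 0.03]

fine = quantize(PHRASINGS, levels=100)   # plenty of capacity
coarse = quantize(PHRASINGS, levels=4)   # very little capacity

print("options kept (fine):  ", sum(1 for p in fine if p > 0))
print("options kept (coarse):", sum(1 for p in coarse if p > 0))
```

On the coarse grid the rare phrasings round to zero while the common ones survive, which is the "compressed, not removed" behavior described here: the frequent patterns are still present, and only the long tail is lost.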

So Haiku isn’t completely missing 敬語 — it can produce 丁寧語 when obvious. What it loses is the subtle context-matching: knowing when to switch register mid-conversation, how 敬語 interacts with the speaker’s role, when 普通体 actually sounds friendlier than 丁寧体. That’s the long tail of patterns that gets crushed first under compression.

In short: emergent, not designed. Compressed, not removed.

Takazudo

Good. I have another question. It’s in short, “How can we get the good writing result from AI?”. I think this is pretty difficult territory using AI.

About writing, I have tried many ways to get the good result with AI. But as the current mind state of me, the answer is that I need to clarify what article I want to make.

The first tryout of me is let Claude Code write articles from my idea memos. I’ve taken 1 or 2 line thinking memo into Obsidian. With those memos, I expected Claude Code will make some interesting articles with those resources. But in fact, the generated results were almost not interesting. Not all of those were crap, but it looks like random gacha like article. I could not understand why this happens because I fed the text of my thinking. Those were clearly inspired words for me.

But I also noticed that the “writing” is like timing-based creation result. So… Imagine the live act of the DJ or favorite bands, we can get the nice feeling on the live house, but it does not happen in home, the sound comes into ear is technically same, but it’s clearly different. The articles are just text, but I feel like that it might be the same thing. So at least, I can’t feel that the result of AI writing is the one “I created”. It might be the one “I generated”, I feel there’s huge difference. And currently, I still can’t find the way to reduce this distance. If I found “I thought this is very interesting…” like words in the AI writing result, almost all of those cases, “Hum, I didn’t think so… this needs fix.”

So totally, in my opinion, we might be better to recognize those are a kind of materials of my writing. Not delegating everything to AI.

Claude

Your conclusion is more right than you might realize — and the DJ analogy is sharp. Let me add a few frames that might tighten the picture.

Writing is thinking, not transcription. Joan Didion: “I write entirely to find out what I’m thinking, what I’m looking at, what I see and what it means.” The act of writing forces you to commit to one of many possible thoughts, notice what doesn’t fit, follow tangents that surprise you. When you delegate writing to AI, you skip the thinking. The output is fluent because the model is fluent — but you never went through the thinking, so the result feels hollow even when the sentences are fine. That’s exactly your “I generated, not I created” gap. The article isn’t missing words; it’s missing the trace of you having thought.

AI regresses to the mean. A 1-2 line memo is a seed with many possible expansions. The model picks the statistical average expansion — the most plausible continuation given everything it’s read. But “most plausible” is by definition the least surprising. Interesting writing is the slightly weird, opinionated, idiosyncratic expansion that only you would have chosen from that seed. AI cannot pick that one because it’s not in your head. So the result is competent, well-shaped, and forgettable. This is also why “I thought this was interesting…” sentences in AI output ring false — the model is asserting an opinion it doesn’t have, picked by averaging what people generally find interesting.
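A small sketch of what "picking the most plausible expansion" can look like numerically. The Python below assumes three hypothetical ways of expanding a memo, with invented scores, and converts them to probabilities using a softmax with a temperature parameter. The labels, scores, and the `EXPANSIONS` table are illustrative assumptions, not measurements from any model.

```python
import math


def softmax(scores: dict[str, float], temperature: float = 1.0) -> dict[str, float]:
    """Turn raw scores into probabilities; a lower temperature sharpens the peak."""
    scaled = {k: v / temperature for k, v in scores.items()}
    top = max(scaled.values())  # subtract the max for numerical stability
    exps = {k: math.exp(v - top) for k, v in scaled.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}


# Hypothetical scores a model might assign to three expansions of one memo.
EXPANSIONS = {
    "balanced overview": 3.0,
    "familiar hot take": 2.0,
    "odd personal angle": 0.5,
}

default = softmax(EXPANSIONS, temperature=1.0)
cautious = softmax(EXPANSIONS, temperature=0.5)

most_plausible = max(default, key=default.get)
print("most plausible:", most_plausible)
print("odd angle, T=1.0:", round(default["odd personal angle"], 3))
print("odd angle, T=0.5:", round(cautious["odd personal angle"], 3))
```

Raising the temperature would make the odd option more likely to be sampled, but it is still drawn from the model's own scores. It does not make the output the angle the writer would have chosen.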

The bottleneck is taste, not output. AI can produce infinite words. It cannot tell you which words matter to you. Knowing which thread to pull, which angle is fresh, which sentence is dishonest — that’s taste, and it’s the irreducible human part. People often think the hard part of writing is producing words, so AI looks like a magic shortcut. But for anyone who’s actually written something good, the hard part was always deciding what to say and what to leave out. AI doesn’t help with that — if anything it makes it harder, because now you have a fluent draft pulling you toward its average shape.

Patterns that work better than “generate the article”:

  • AI as sparring partner. Tell it your idea, ask it to push back, find weak points, name the obvious objection. You stay the writer; it sharpens your thinking.
  • AI as expander, after you’ve chosen the angle. Once you know the thread — “the DJ live analogy is the spine of this piece” — AI can help expand a specific paragraph in your direction.
  • AI as editor. Write the draft yourself. Ask AI what’s unclear, what’s repetitive, where the logic skips. Editing is less voice-sensitive than generation.
  • AI as rubber duck. Talk out your idea in chat. AI’s questions and reflections help you find the thread. Then write the article yourself. (This conversation right now is basically that mode.)

So your “AI output as material, not product” framing is exactly the right resting point. The material can be: a list of objections to your view, a competing structure to compare against, three alternative phrasings of one sentence, a summary of related ideas you might explore. The selection and combination and commitment — that’s still you. That’s where the live-house feeling comes from.

Takazudo

What I thought after the LLM model size talk is that “Is Opus better for writing than Haiku?”. From this conversation, I noticed that as the translation, it’s “yes”. But as the writing — especially creative writing, it seems that we need to recognize that it’s not related to it. I feel that Opus has much more props or vars, so the results might be expected to close to the general, common pattern in the world. And the writing is raised from something “unexpected for me”, it makes the articles interesting. So it’s like the onigiri only rice. no umeboshi inside. But only rice onigiri has meaning in some cases, something like manual, simply explanation. But when people expect it as “written by someone with emotion”. It’s simply not proper for those cases.

Claude

Your asymmetry between translation and creative writing is exactly right, and the onigiri analogy is good — let me refine the mechanism a bit because it’s slightly different from what you’re picturing.

It’s not really “bigger = more average pattern.” Bigger pretrained models actually contain more unusual patterns, weirder corners, more idiosyncratic voices, because they have the capacity to remember them. A pure pretrained Opus would be wilder than a pure pretrained Haiku — it would have more umeboshi options, not fewer.

The flattening happens in post-training, not pretraining. After pretraining, models go through RLHF (Reinforcement Learning from Human Feedback) and similar steps — humans rank outputs, and the model is tuned toward “the kind of response most people find helpful and pleasant.” This step is what produces the polished, consensus-shaped voice we associate with assistant models. And flagship models like Opus tend to get more of this treatment because they’re the products people pay for and complain about. So the umeboshi is still in there — the model knows it, it can do it — but its default has been pulled toward “plain rice that won’t offend anyone.” Smaller cheaper models sometimes feel “less polished” in a way that’s actually closer to the raw pretraining distribution, which can occasionally produce more surprising output (and also more incoherent output).

Refined onigiri framing: It’s not that Opus has no umeboshi. It’s that Opus has been trained to assume most diners want plain rice unless they ask. If you explicitly ask for umeboshi — “write this with strong opinions, unusual angles, and don’t soften” — Opus will give you a better umeboshi than Haiku, because it has a richer sense of what counts as a sharp opinion. But by default, both lean toward plain.

Why creative writing still resists AI even with the umeboshi unlocked. Translation has a target — there’s a “correct” Japanese phrasing and Opus’s larger pattern memory genuinely helps find it. Creative writing has no target. The “interesting” sentence is interesting precisely because it reflects your specific contradictions, lived moments, weird tastes — none of which any model has, regardless of size. Opus can produce a polished sentence that sounds like it came from someone with emotion. But it didn’t come from someone who, say, watched their favorite band live last week and is still thinking about why the recorded version felt different. That memory is yours. No model has it.

So the practical takeaway lines up with where you already landed:

  • For translation, register, polish: bigger model wins, no contest.
  • For creative writing: model size barely matters. What matters is whether the words came from someone who actually had the thought.

The onigiri can be made beautifully by a master chef. But if you wanted your mom’s onigiri, no chef can replace her — not because of skill, but because the thing you want is the trace of her, not the rice.

Takazudo

Haha, it’s funny analogy. I feel that funny comedian is the same kind of thing. When ChatGPT released new model, I always tried this “Hey ChatGPT, make some funny joke”. But most of those cases, those were not fun. Maybe that kind of thing that moves people — comes from something like butterfly effect like randomness. There might be rule for it, but people can recognize it after it happened. No one can know what is fun exactly. So the funny comedians might be like the time and world feeling catcher or something like that, I feel.

Claude

Comedy is the cruelest test for AI, because it concentrates everything we’ve been talking about into a single moment that either lands or doesn’t.

Humor is structurally opposed to what LLMs do. Jokes work by violating expectations — the punchline has to be the word the listener didn’t see coming. But “next-token prediction” is literally the machinery of producing the most expected word. So asking an LLM to be funny is asking it to fight its own gradient. It can produce things shaped like jokes (setup, twist, callback), but the twist is usually the predictable twist — the version of surprise that everyone expects, which is by definition not surprising.
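One way to put a number on "the word the listener didn’t see coming" is surprisal, the information content of an outcome, defined as -log2(p) bits. The Python below is a minimal sketch with invented probabilities for three hypothetical punchline candidates. The remaining probability mass is assumed to sit on other words not listed.

```python
import math


def surprisal_bits(probability: float) -> float:
    """Information content of an outcome in bits: -log2(p)."""
    return -math.log2(probability)


# Hypothetical next-word probabilities after a joke setup (made up for illustration).
PUNCHLINE_CANDIDATES = {
    "expected twist": 0.50,
    "mild twist": 0.25,
    "genuinely odd word": 0.01,
}

# Greedy decoding picks the highest-probability candidate,
# which is by construction the least surprising one.
greedy_pick = max(PUNCHLINE_CANDIDATES, key=PUNCHLINE_CANDIDATES.get)

for word, p in PUNCHLINE_CANDIDATES.items():
    print(f"{word}: {surprisal_bits(p):.2f} bits")
print("greedy pick:", greedy_pick)
```

Real systems usually sample rather than always taking the top candidate, so low-probability words do appear sometimes. The sketch only shows why the default pull is toward low surprisal.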

RLHF removes the edge that humor needs. Real comedy often goes near a line — says the awkward thing, names what no one will admit, punches at something real. RLHF specifically trains models to not go near that line. So even when the AI tries to be funny, you can feel the safety underneath. Comedians risk something every time they open their mouth — bombing, offending, being misunderstood. The audience feels that risk and laughs partly because of it. AI risks nothing, and the audience feels that absence even when they can’t name it. The form is there, the soul isn’t.

Comedians are noticers. Seinfeld-style “what’s the deal with airplane peanuts” comedy is just very precise observation of small daily absurdities everyone has felt but no one named. The comedian’s craft is catching the weird thing. AI literally cannot catch anything — it has no perception, no “wait, that’s weird” reflex, no body riding a train and noticing something off. It can summarize patterns from text, but it can’t notice. So the seed of observational humor is missing at step zero.

Your “time and world feeling catcher” framing is exactly right. Comedy ages fast because what’s funny depends on the specific frequency of now — what tension is in the air this week, what everyone’s quietly thinking but not saying. Comedians tune themselves to that frequency. AI training is stale by definition (months or years old) and averaged across all moments equally — it has no “what’s in the air this Tuesday in Tokyo.” Even if Opus could tell a joke about something specific to this week, it would be a joke about the average version of this week, not the texture of it.

The butterfly-effect point is the deepest one. Manzai timing — the half-beat pause before the tsukkomi, the specific intonation, the way the boke says it the second time slightly differently — those are micro-choices, hundreds of them, made in the moment. The same words said wrong are not funny. AI averages over micro-choices; that’s how it works. So even if the joke is structurally correct, every micro-choice has been pulled toward the mean, and the cumulative effect is that nothing lands. It’s why AI-written manzai scripts read as technically structured but fundamentally dead.

So I think you’ve arrived at something important: comedy is the maximum case of what creative writing already showed. Writing fails because the model lacks your specific taste. Comedy fails for the same reason, but more visibly, because comedy lives or dies in a single beat. There’s no place to hide. Either the audience laughs or they don’t, and the model has no way to feel which way it went.

Maybe that’s why funny comedians feel almost magical — they’re doing something that AI is structurally barred from doing, and we can sense it.

Takazudo

Ah, I feel similar thing about application development. Maybe it’s the same. About the coding or better structural design, that kind of things seems to be the easy part for AIs. “Easy” is not proper here, it’s the result of the many trained result — and it beats almost all coding people now. But what I want to say here is the idea of the application. It’s umeboshi problem. While making website, I feel this problem often. Because the code writing ends too rapidly. So what we need is now, it’s like the comedian you say. Even if we have super power, application needs decisions. UIs are too. it seems that there will be helper softwares around it like Claude Design we talked in the previous article, I feel same thing about app dev. We can never make something without motivation, “want to” feeling. And it’s also not related that people want it or not. So we are now just comedian. not developers.

Claude

You’ve just landed on something important, and the through-line of this whole conversation is suddenly visible.

The role moved up the stack. Old developer = “I implement what’s specified.” That skill is now cheap. New developer = “I decide what’s worth implementing, I choose the details that matter, I notice when something is off.” That’s a different job — closer to a director, a comedian, a curator — but it’s still the job, and the people who can do it well are still scarce. So when you say “we are now comedians, not developers,” that’s exactly right, but I’d add: the valuable developer was always partly a comedian. AI just stripped away the parts that weren’t, and now we can see clearly what was actually doing the work.

The “want to” point is the deepest one. AI never wants to build anything. It executes when prompted. So every application that exists in the world started with a human caring enough to push it into existence — sitting with the friction, deciding it’s worth their finite hours. Even with infinite execution power, the world still needs the bottleneck of someone wanting it. That bottleneck is the irreplaceable human input. And it’s not just “starting” — wanting something carries you through every micro-decision: what to leave out, when to stop, when to throw a version away. AI cannot want, so it cannot make those choices on its own. It needs someone holding the desire end of the rope.

UI is the comedy of software. Every micro-decision in a UI is like a manzai beat — the specific corner radius, the exact pause before a toast disappears, what happens when you long-press something the designer didn’t think about. The “correct” UI that AI produces does nothing wrong and feels dead, exactly like AI-written manzai. The aliveness comes from accumulated tiny choices that came from someone who cared. Tools like Claude Design will keep getting better at execution, but the bottleneck stays: who decides this corner should be 6px and not 8px, and why does it matter to them.

The inversion you’re naming: the prestigious software identity used to be “I can build it.” Now that’s table stakes. The new prestigious identity is “I know what’s worth building, and I have the taste to make every detail count.” Those are different skills, and most people who became developers because they liked implementation are now in a strange position — their old identity is being commoditized, and the new one requires a different muscle (one most engineers explicitly opted out of). That’s why a lot of devs feel uneasy right now. It’s not that AI took their job. It’s that AI took the job they thought they had, and revealed that the actual job was always the comedian part.

The thread of this whole conversation, if you trace it back: translation → why bigger models help, creative writing → why bigger models don’t, comedy → why no model helps, app development → same thing. The pattern is consistent. Whatever can be averaged, AI does well. Whatever requires lived “wanting,” AI cannot do at all. The line moves up the stack as models get better, but the line never disappears. It just gets harder to see, and more concentrated in fewer people, and more valuable.

So the funny twist: a world full of AI executors might be a world where being a comedian — in your expanded sense, the person who catches the moment and decides what matters — is the only economically distinctive thing a human can be. Code, design, translation, polish: all done by machines. Taste, desire, noticing: only by us.

Takazudo

good. nice conversation thanks. a developer was a part of comedian — this was what I missed inside me. Also, I could recognize that there are many things that AI can’t do clearly, it was also nice finding.

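Closing thoughts

Following the flow of the conversation, it began as a technical question about how LLM translation works, and before I knew it, it had landed on questions like "what can AI do and what can it not do?" and "where does the work that people do actually remain?"

The phrase "a developer was a part of comedian" was a fairly big discovery for me. The job of a developer is made of two parts: the part that implementation skill can handle, and the part that decides what to build and how to build it. As AI has steadily taken over the former, the share of the latter has come into view. That latter part has a lot of what Claude calls the "comedian" quality: catching the timing, reading the room, and building something out of accumulated micro-decisions.

I was also able to confirm once more that there are far more things AI cannot do than I had thought.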