zpaper-draft


A conversation on eliminating merge conflicts in sorted JSON with a custom git merge driver

Overview

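The Takazudo Modular repository has a JSON file called metadata-db.json, keyed by image slug (about 7.8MB, ~1900 entries, sorted by slug, one record per line). Almost every PR touches this huge file whenever a new product image is added, so conflicts happen all the time simply because the lines are adjacent, even though each PR is only adding a different slug. In issue #1608 the conclusion was that "eliminating these non-semantic conflicts with a custom git merge driver looks like the right approach". After drafting an implementation plan with /big-plan, I kept asking Claude Code "so how does this actually work?", and this is that conversation. I'm leaving it in English as-is for learning purposes.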

Conversation log

Claude (proposal)

Proposal — please review before I create issues

Plan log: /Users/takazudo/cclogs/zmod/20260502_181229-big-plan-metadata-db-merge-driver.md

Source: issue #1608 — metadata-db.json: reduce merge-conflict pain across WIP PRs

Proposed impl-title: Metadata DB Merge Driver

⚠️ Parent branch is NOT main

You invoked /big-plan while on topic/skill-prevent-blurhash-skip. The new base branch will be created off that branch, and the eventual epic-PR will target topic/skill-prevent-blurhash-skip, not main (nested-base pattern). If you actually meant main, switch branches and re-run.

  • Suggested base branch: base/metadata-db-merge-driver (parent: topic/skill-prevent-blurhash-skip)

Sub-tasks (6 total, all parallel-safe; contract values frozen in plan)

| # | Title | Mode | Model | Notes |
|---|-------|------|-------|-------|
| 1 | Implement scripts/merge-metadata-db.mjs (+ shared scripts/lib/metadata-db-serializer.mjs) | subagents | opus | Deep equality, quoted placeholders, byte-exact only on clean merges |
| 2 | Add .gitattributes registering metadata-db.json merge=metadata-db-jsonl | subagents | sonnet | Single line, new file |
| 3 | Wire git config registration into scripts/setup-local.sh | subagents | sonnet | Quoted "%O" "%A" "%B" "%P" |
| 4 | Vitest unit tests (11 cases incl. format-snapshot + key-reorder) | subagents | opus | __og not __ogonly |
| 5 | Docs in CLAUDE.md + scripts/CLAUDE.md | subagents | sonnet | Server-side merge caveat documented |
| 6 | Validation guard rail: scripts/validate-metadata-db.mjs + check:metadata-db wired into pnpm check and _prepare | subagents | opus | Added during review (both reviewers flagged) |

Key reviewer findings applied

Both /codex-2nd and /gcoc-2nd independently flagged the critical risk that conflict markers in metadata-db.json would break runtime parsing AND the Vite virtual-module build inject (astro.config.ts:36-40 does export default ${content} against the raw file). I added Sub-task 6 as defense-in-depth.

Codex also caught:

  • Test case 10 was wrong — __ogonly slugs are skipped by the builder and never appear in metadata-db.json (replaced with __og).
  • Driver command needs quoted placeholders for path-with-spaces safety.
  • Byte-exact output requirement only makes sense for clean merges, not conflicted output.
  • Serializer should be lifted into a shared helper to prevent future drift.

Takazudo

Oh sorry, I expected this to be invoked from main. Confirm the diff and update the plan. The original problem was that metadata-db.json gets touched in many PRs, which causes trouble. How does this plan resolve that? I thought it might be better to treat a Cloudflare-hosted online DB as the latest source of truth (basically we only keep adding image sources). What do you think about that? Is this plan just a polish of the current way?

Claude

Plan updated — parent branch is now main everywhere. The diff between your current branch and main (4 unrelated skill-hardening files) is independent of this work, so branching off main is clean.

Now — your architectural question. This is genuinely substantive and worth thinking through honestly rather than just defending the plan.

Is this plan just polish of the current way? Yes — and that’s the right call here, but not for the obvious reason.

Re-framing the actual problem. The pain isn’t really “git’s merge algorithm is bad on this file.” It’s “feature PRs are forced to touch a high-traffic file because adding MDX content requires committing the matching metadata-db.json slug in the same PR.” That’s a workflow shape, not a storage shape.

There are three paths to relief, in increasing order of cost:

Path A — Discipline: feature PRs never touch metadata-db.json.

Only /l-metadata-update PRs edit it (they already create their own metadata-update-YYMMDD-HHMM branches — see commit c990f2ec8, PR #1710). Your recent commit 26de970e8 [claude] Harden product-pipeline skills against silent metadata-db skip is already moving the project toward this. If feature PRs never touch the file, the conflict problem mostly evaporates without any code change. The remaining conflicts are only between two simultaneous /l-metadata-update PRs, which are rare.

Path B — Merge driver (this plan). Cheap insurance for the residual cases Path A doesn’t fully cover (two metadata-update PRs racing, a stale feature PR with new images, force-merging an old PR). Small, surgical, no arch change, no operational burden, the file stays in git with full history.

Path C — D1 / online DB. Move metadata to D1, build queries the Worker. The conflict problem doesn’t fully go away — you’d still need a committed snapshot for build-time fallback (the photos epic has exactly this), and that snapshot is itself in git, so PRs that bump the snapshot can still collide. To genuinely escape the conflict, the snapshot would need to be regenerated by a scheduled Action, never by feature PRs — which is essentially Path A again, just enforced by tooling instead of discipline.

Real costs of going to D1 for products

  • Build-time Cloudflare dependency. The photos epic accepts this because photos have a true runtime producer (mobile upload). Products don’t — your only “writer” is convimgs:upload running locally. Adding a Worker hop here is solving a problem you don’t have.
  • Operational burden. Another Worker, more secrets, more deployment flow, more failure modes (Worker outage during Netlify build).
  • Loss of git log metadata-db.json. Today you can see exactly when a slug’s hash changed and bisect to it. With D1, that history moves to a database table you can’t git log.
  • The issue itself already rejects this — and the rejection reasoning still holds.

My recommendation

Ship Path B (this plan) — it’s the right insurance layer for residual conflicts, and the new Sub-task 6 (validation guard rail) protects the worst-case failure mode that both reviewers flagged.

The bigger leverage was already started in your topic/skill-prevent-blurhash-skip branch — that’s Path A discipline, and once it’s solid, conflicts on metadata-db.json should mostly disappear. The merge driver then handles the leftovers.

Don’t go to D1 just for this. Wait until there’s a separate justification (admin UI for products, multi-user editing, a runtime producer) — at that point the cost is amortized.

Takazudo

I want to know the details of this plan. Does it mean metadata-db.json gets handled automatically by a script? How is that possible?

Claude

Good question. “Handled automatically” is exactly the right word, but with a specific scope. Let me walk through what actually happens, with a concrete scenario.

What a git merge driver is

Git already has a default merge algorithm — it’s a line-by-line text merger. That’s the thing producing the bogus conflicts on metadata-db.json today: it sees two branches modifying lines that happen to be adjacent in sort order and flags them as conflicting, even though they’re inserting different slugs.

A custom merge driver is a command (in this plan, a Node script) that git invokes instead of its default text merger, but only for files you explicitly opt in via .gitattributes. The script gets the three versions of the file (common ancestor, your branch, the incoming branch), does the merge however it wants, writes the result back, and reports the outcome through its exit code: 0 tells git “I handled it” and git treats the file as cleanly merged, while a non-zero exit tells git a conflict remains.

What happens today (the pain)

Imagine:

  • main has slugs aaa and zzz in metadata-db.json.
  • Branch A adds slug bbb (lands between aaa and zzz in sorted order).
  • Branch B adds slug ccc (lands in the same gap, right next to where bbb would go).
  • Branch A’s PR merges to main first.
  • You’re on Branch B and run git pull origin main (or rebase).
  • Git’s default text merger sees that both branches inserted a line at the same spot (between aaa and zzz) and flags a conflict on metadata-db.json.
  • You open the file, see <<<<<<< HEAD markers around bbb and ccc, and have to hand-edit it to keep both, in sorted order.
  • The conflict was bogus — you added different slugs, no actual semantic disagreement.

What happens after this plan ships

Same setup, you run git pull origin main on Branch B:

  1. Git compares the three versions and sees metadata-db.json differs.

  2. Git looks up .gitattributes and sees the line metadata-db.json merge=metadata-db-jsonl.

  3. Git looks up your .git/config (registered by pnpm setup-local) and finds the driver command.

  4. Git writes the three versions to temp files and invokes:

    node scripts/merge-metadata-db.mjs "<base>" "<ours>" "<theirs>" "metadata-db.json"
  5. The script:

    • Reads all three as JSON.
    • Sees base = {aaa, zzz}, ours = {aaa, ccc, zzz}, theirs = {aaa, bbb, zzz}.
    • For slug aaa: same in all three → keep.
    • For slug bbb: absent in base, absent in ours, added by theirs → take theirs.
    • For slug ccc: absent in base, added by ours, absent in theirs → take ours.
    • For slug zzz: same in all three → keep.
    • No real conflict — both sides added different keys.
    • Writes {aaa, bbb, ccc, zzz} (sorted, same one-record-per-line format) to the “ours” temp file.
    • Exits 0.
  6. Git sees exit 0, treats the file as cleanly merged. You don’t see a conflict at all. git pull finishes silently.

The “automatic” part is steps 2-6 — it’s just git pull from your perspective. The script runs invisibly because git invokes it.
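
The per-slug walk in step 5 can be sketched as a small three-way merge over the parsed objects. This is only a minimal sketch, not the plan's actual script: `mergeBySlug` and `sameValue` are hypothetical names, and comparing records through `JSON.stringify` assumes both sides emit a record's keys in the same order.

```javascript
// Hypothetical helper: compares two records by their JSON text.
// Assumption: keys inside a record are emitted in the same order on both sides.
function sameValue(a, b) {
  return JSON.stringify(a) === JSON.stringify(b);
}

// Three-way merge of slug -> record maps. `undefined` means "absent on that side".
function mergeBySlug(base, ours, theirs) {
  const merged = {};
  const conflicts = [];
  const slugs = new Set([...Object.keys(base), ...Object.keys(ours), ...Object.keys(theirs)]);
  for (const slug of [...slugs].sort()) {
    const b = base[slug];
    const o = ours[slug];
    const t = theirs[slug];
    let winner;
    if (sameValue(o, t)) winner = o;          // both sides agree (or both removed it)
    else if (sameValue(o, b)) winner = t;     // only theirs changed it
    else if (sameValue(t, b)) winner = o;     // only ours changed it
    else { conflicts.push(slug); continue; }  // both changed it, differently
    if (winner !== undefined) merged[slug] = winner;
  }
  return { merged, conflicts };
}

const result = mergeBySlug(
  { aaa: { w: 1 }, zzz: { w: 2 } },                 // base
  { aaa: { w: 1 }, ccc: { w: 3 }, zzz: { w: 2 } },  // ours (Branch B)
  { aaa: { w: 1 }, bbb: { w: 4 }, zzz: { w: 2 } },  // theirs (main)
);
console.log(Object.keys(result.merged)); // [ 'aaa', 'bbb', 'ccc', 'zzz' ]
console.log(result.conflicts);           // []
```

Returning the conflicting slugs separately, instead of throwing, would let the caller merge everything else and mark only those records.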

When the script does signal a real conflict

If both branches modified the same slug to different values (rare for products: slugs are owned by whoever adds them, and once committed their hash/blurhash don’t change), the script detects that, writes the file with <<<<<<< / ======= / >>>>>>> markers around the disagreeing slug only, and exits non-zero so git reports the conflict. You resolve that one slug by hand; the rest of the file is already merged.

The setup chain

For this to work, three things must be in place, covered by sub-tasks 1, 2 and 3 of the plan:

| Where | Who registers it | What it does |
|-------|------------------|--------------|
| .gitattributes (committed) | Sub-task 2 | Tells git “use the driver named metadata-db-jsonl for this file” |
| .git/config (per-machine, NOT committed) | pnpm setup-local (Sub-task 3) | Maps the driver name to the Node command |
| scripts/merge-metadata-db.mjs | Sub-task 1 | The actual merge logic |

Without all three, git falls back to its default text merger — i.e. today’s behavior. No regression for contributors who skip setup.
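
The two registration pieces can be written down as plain data. In this sketch the `merge.<name>.name` / `merge.<name>.driver` config keys and the `%O %A %B %P` placeholders are git's documented merge-driver interface, and the driver name and script path come from the plan. The helper name `driverRegistration` and the description string are hypothetical, and the real scripts/setup-local.sh may do this differently.

```javascript
// Hypothetical helper: describes what must exist for git to find the driver.
function driverRegistration(driverName, scriptPath, targetFile) {
  return {
    // One line in the committed .gitattributes file.
    gitattributesLine: `${targetFile} merge=${driverName}`,
    // Argument lists for `git config` (written to the per-machine .git/config).
    gitConfigArgs: [
      ['config', `merge.${driverName}.name`, 'Key-aware merge for metadata-db.json'],
      // %O = common ancestor, %A = ours (the result is written here),
      // %B = theirs, %P = path of the file inside the repository.
      ['config', `merge.${driverName}.driver`, `node ${scriptPath} "%O" "%A" "%B" "%P"`],
    ],
  };
}

const reg = driverRegistration('metadata-db-jsonl', 'scripts/merge-metadata-db.mjs', 'metadata-db.json');
console.log(reg.gitattributesLine);
for (const args of reg.gitConfigArgs) console.log('git', ...args);
```

Each entry in `gitConfigArgs` could be handed to `child_process.execFileSync('git', args)` from a setup script. Keeping the placeholders quoted is what protects temp-file paths that contain spaces.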

What it does NOT do

This is important — the “automatic” only goes so far:

  1. GitHub’s “Merge pull request” button does NOT run this script. GitHub does merges server-side using its own merger, which doesn’t know about your custom drivers. So if your PR has conflicts when you go to merge it on github.com, the conflict still appears there. The fix: pull main locally, do the rebase locally (driver runs, conflicts resolve), push, then GitHub’s merge becomes trivial.
  2. It doesn’t change metadata-db.json outside of merges. The build pipeline (pnpm build:metadata) still regenerates the file from static/images/p/<slug>/metadata.json exactly as it does today. The runtime loader (load-metadata-database.ts) still reads the file with JSON.parse exactly as it does today. The file shape is identical. We just added a smarter merge step.
  3. It doesn’t prevent you from committing a half-merged file if you somehow bypass the driver (e.g., by deleting the file and git add-ing a hand-edited version with conflict markers still in it). That’s the failure mode Sub-task 6 (the validation guard rail) was added to catch — it runs in pnpm check and pnpm dev preflight and refuses to proceed if metadata-db.json contains conflict markers or invalid JSON.
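
The guard rail from point 3 could look roughly like the check below. It is a sketch under assumptions: `findMetadataDbProblems` is a hypothetical name, and the real scripts/validate-metadata-db.mjs may use different rules and messages.

```javascript
// Hypothetical check: returns a list of human-readable problems (empty = OK).
function findMetadataDbProblems(text) {
  const problems = [];
  // Conflict markers are 7 marker characters at the start of a line.
  // The bare "=======" form must be the whole line so normal content never matches.
  const markerPattern = /^(<{7}( |$)|={7}$|>{7}( |$))/;
  text.split('\n').forEach((line, index) => {
    if (markerPattern.test(line)) problems.push(`line ${index + 1}: conflict marker`);
  });
  try {
    const parsed = JSON.parse(text);
    if (parsed === null || typeof parsed !== 'object' || Array.isArray(parsed)) {
      problems.push('top-level value is not a JSON object');
    }
  } catch (error) {
    problems.push(`invalid JSON: ${error.message}`);
  }
  return problems;
}

console.log(findMetadataDbProblems('{\n"aaa": {"w": 1}\n}\n')); // []
```

A check script would read metadata-db.json, call this function, print the problems, and exit non-zero when the list is not empty, which is enough to stop `pnpm check`.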

So to your question — “is metadata-db.json handled automatically?”

Yes, in the specific sense that:

  • After pnpm setup-local, every local git merge / git rebase / git pull that touches metadata-db.json runs through the smart merger automatically. You stop seeing bogus conflicts.

No, in the senses that:

  • The file is still a regular committed JSON file. It’s still regenerated by pnpm build:metadata from local image processing. It’s still read at runtime by the build. We did not move it to D1, replace it with a database, or change the workflow that produces it.
  • GitHub’s server-side merge UI doesn’t get the smart merger. If a conflict surfaces there, you still have to pull locally, rebase (where the driver kicks in), and push.

In short: it’s a smarter merge step layered on top of the existing file. The file’s lifecycle is unchanged. Only the merge moment is improved.

Takazudo

Let’s say we are on branch A targeting main. We add an image on A, run convimgs, and commit. When we merge main, what happens? Does main automatically update it? Where is the metadata JSON stored?

Claude

Great concrete scenario — let me walk through exactly what happens, with actual file states at each step.

Setup

Three commits back, main’s metadata-db.json looks like:

{
"aaa": {...metadata for slug aaa...},
"zzz": {...metadata for slug zzz...}
}

(In reality there are ~1900 entries, but the principle is the same.)

You start branch A from this state.

Step 1 — On branch A, add the image and process it

# You drop foo.heic into /imgs/ (Dropbox)
pnpm convimgs:upload     # processes foo.heic → static/images/p/foo/{webp,mercari.png,metadata.json}, uploads to R2
pnpm build:metadata      # regenerates metadata-db.json from all the per-slug metadata.json files
git add metadata-db.json static/images/p/foo
git commit -m "[content] Add image foo"
git push -u origin A

After this, branch A’s metadata-db.json is:

{
"aaa": {...},
"foo": {...metadata for new slug foo...},   ← new
"zzz": {...}
}

main has not changed. Only branch A’s commit knows about foo.

Step 2 — Meanwhile, someone else’s PR with image bar lands on main

After their merge, main’s metadata-db.json is:

{
"aaa": {...},
"bar": {...metadata for slug bar...},   ← landed via someone else's PR
"zzz": {...}
}

Now there’s drift: A has foo, main has bar, neither knows about the other. This is the moment of pain today.

Step 3 — On branch A, pull main into A (the merge moment)

git checkout A
git pull origin main      # or: git fetch origin && git rebase origin/main

Git sees metadata-db.json differs across base / ours / theirs. It checks .gitattributes, finds metadata-db.json merge=metadata-db-jsonl, looks up the registered driver, and invokes:

node scripts/merge-metadata-db.mjs <tmpfile-base> <tmpfile-ours> <tmpfile-theirs> metadata-db.json

Git has written three temp files:

  • base = state at the common ancestor (what main looked like when you branched off): {aaa, zzz}
  • ours = your current branch’s version (A): {aaa, foo, zzz}
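  • theirs = main’s version: {aaa, bar, zzz}

(These labels hold for git pull / git merge. With git rebase the roles are swapped: ours is main and theirs is the commit from A being replayed. For this scenario the merged result is the same.)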

The script reads all three and walks every slug:

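| slug | base | ours (A) | theirs (main) | decision |
|------|------|----------|---------------|----------|
| aaa | present | present (same) | present (same) | keep (no change) |
| bar | absent | absent | present | take theirs (only main added it) |
| foo | absent | present | absent | take ours (only A added it) |
| zzz | present | present (same) | present (same) | keep (no change) |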

No slug was modified by both sides differently → no real conflict. Script writes the merged result to the <tmpfile-ours> path (which is what git reads back) and exits 0.

After this, branch A’s working-tree metadata-db.json is now:

{
"aaa": {...},
"bar": {...},   ← came from main, brought in by the merge
"foo": {...},   ← yours, preserved
"zzz": {...}
}

Git treats the file as cleanly merged. git pull finishes with no conflict prompt. You see no merge editor, no <<<<<<<, nothing to hand-resolve. From your perspective, it just worked.

You then git push origin A. The remote A branch now contains the merged file.

Step 4 — GitHub PR merge button

You click “Merge pull request” on GitHub. GitHub’s server-side merge runs, and here’s the important caveat: GitHub’s server-side merge does NOT run the custom driver. But because Step 3 already produced a clean working tree on A and you pushed it, A is now strictly ahead of main (it contains everything in main plus your foo commit plus the merge commit). Server-side, the merge is trivial: main has no commits that A lacks, so there’s nothing for the driver to resolve, because A’s tree already has the union.

After merge, main’s metadata-db.json is:

{
"aaa": {...},
"bar": {...},
"foo": {...},
"zzz": {...}
}

main is automatically updated through your PR. The driver did its work locally on A in Step 3; GitHub just accepted the result.

Where is metadata-db.json actually stored?

There’s exactly one file: metadata-db.json at the root of the repo, committed in git. Every branch has its own committed copy — same as any other tracked file.

  • main’s copy is what’s in main’s HEAD commit.
  • A’s copy is what’s in A’s HEAD commit.
  • Your local working tree has whichever branch you’ve checked out.

The merge driver doesn’t introduce any new storage location. It just changes how git computes “what does the merged file look like” when it has to combine two branches’ versions.

What if you skip Step 3 and try to merge directly on GitHub?

Today (without the driver) and after this plan ships:

  • If A’s metadata-db.json and main’s metadata-db.json have textually-conflicting changes, GitHub shows the conflict in its UI. You can’t click “Merge pull request” — you have to resolve it.
  • The driver only helps locally. So the workflow is: when you’re notified of the conflict (either by trying git pull locally, or by GitHub’s UI), do git pull origin main locally. The driver runs, resolves the bogus conflict invisibly, you push, the GitHub UI conflict goes away.

So main doesn’t automatically update — your local merge is what produces the clean result, and your push delivers it. But for any reasonable workflow where you keep your branch up-to-date with main before merging (which is the standard practice), the driver makes that update step invisible: no more <<<<<<< markers to hand-edit.

TL;DR for your scenario

  • A adds image, runs convimgs:upload + build:metadata, commits — A’s metadata-db.json has foo.
  • Main meanwhile gains bar from another PR.
  • You git pull origin main on A — driver runs, A’s local file becomes {aaa, bar, foo, zzz} with no human intervention.
  • You push A, click merge on GitHub — main becomes {aaa, bar, foo, zzz}.
  • The file lives where it always has — root of the repo, in git, exactly the same shape.

Takazudo

With ordinary merging, Step 2 and Step 3 would produce a conflict, and we would then choose the order by hand. How does this merge helper deal with that? Does it decide the order automatically?

Claude

Sharp question — this is exactly the part where the custom driver fundamentally beats the default text merger.

Short answer

There is no “order to decide” — because the script doesn’t see the file as a sequence of lines. It sees it as a key-value map ({slug: metadata}). Order in the file is purely a convention (alphabetical by slug) maintained so diffs are friendly. The build pipeline, the runtime loader, nothing cares about line order — they all just JSON.parse it into an object.

So the script doesn’t decide which side’s order wins. It builds the union of keys, picks a value per key, and re-sorts the whole result alphabetically. The output is always in canonical order regardless of what order the inputs were in.

Concrete example — adjacent insertion

Base (main when you branched):

{
"apple": {...},
"cherry": {...}
}

You’re on Branch A and add image banana, run convimgs:upload + build:metadata, commit. A’s file:

{
"apple": {...},
"banana": {...},   ← line 3, inserted between apple and cherry
"cherry": {...}
}

Meanwhile, someone else’s PR landed on main adding blueberry. Main’s file:

{
"apple": {...},
"blueberry": {...},   ← line 3, inserted between apple and cherry
"cherry": {...}
}

Both inserts land at line 3 of the file. This is the exact case the default text merger hates.

What the default text merger does

It compares text line-by-line:

line 1: {              ← same in both
line 2: "apple": {...},  ← same in both
line 3: "banana": {...},   (in ours)   vs   "blueberry": {...},   (in theirs)
line 4: "cherry": {...}  ← same in both
line 5: }              ← same in both

Line 3 differs. Conflict. You get:

{
"apple": {...},
<<<<<<< HEAD
"banana": {...},
=======
"blueberry": {...},
>>>>>>> origin/main
"cherry": {...}
}

Now you have to manually:

  1. Recognize this isn’t actually a conflict — both should be in the file.
  2. Decide their order (banana before blueberry, alphabetically — both should appear).
  3. Hand-edit the file to keep both, in the right sort position.
  4. git add and continue the merge.

That manual judgment about order — banana goes before blueberry — is the exact pain point you’re describing.

What the custom driver does

The driver doesn’t read the file line by line. It does:

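import { readFileSync, writeFileSync } from 'node:fs';
import { isDeepStrictEqual } from 'node:util';

// git passes the temp-file paths as arguments: base (%O), ours (%A), theirs (%B)
const [baseFile, oursFile, theirsFile] = process.argv.slice(2);

const base    = JSON.parse(readFileSync(baseFile, 'utf8'));    // {apple, cherry}
const ours    = JSON.parse(readFileSync(oursFile, 'utf8'));    // {apple, banana, cherry}
const theirs  = JSON.parse(readFileSync(theirsFile, 'utf8'));  // {apple, blueberry, cherry}

const allKeys = new Set([...Object.keys(base), ...Object.keys(ours), ...Object.keys(theirs)]);
// allKeys = {apple, banana, blueberry, cherry}

// simplified stand-in for the real decision logic: if ours is unchanged from
// base, take theirs; otherwise take ours (the same-key conflict case is omitted)
const decideValue = (b, o, t) => (isDeepStrictEqual(o, b) ? t : o);

const merged = {};
for (const key of allKeys) {
  // per-key decision:
  //   apple: same in all three   → keep
  //   banana: only in ours        → take ours
  //   blueberry: only in theirs   → take theirs
  //   cherry: same in all three   → keep
  const value = decideValue(base[key], ours[key], theirs[key]);
  if (value !== undefined) merged[key] = value;  // undefined = deleted, so leave it out
}

// re-emit sorted, exactly the way build:metadata does:
const sortedKeys = Object.keys(merged).sort();
const output = `{\n${sortedKeys.map(k => `${JSON.stringify(k)}: ${JSON.stringify(merged[k])}`).join(',\n')}\n}\n`;
writeFileSync(oursFile, output);  // git reads the result back from the "ours" path
process.exit(0);                  // 0 = cleanly merged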

Output:

{
"apple": {...},
"banana": {...},
"blueberry": {...},
"cherry": {...}
}

Order is automatic and deterministic — always alphabetical by slug. The script doesn’t have to “pick a winner” between A and main because the question doesn’t apply: both keys belong in the file, and they each have exactly one alphabetical position.

Why the question of “order” never comes up

Three reasons stack on top of each other:

  1. The file’s authoritative shape is {key: value}. Order isn’t part of its semantic content. pnpm build:metadata regenerates it sorted; the runtime loader does JSON.parse and accesses entries by key. No code anywhere depends on banana appearing before cherry instead of after.
  2. Both sides sort by the same rule. Branch A produced its file via pnpm build:metadata (alphabetical sort). Main produced its file the same way. The driver also sorts the same way. There is no disagreement to resolve.
  3. The driver works by key-set union, not by line position. Adjacent insertions at the same line number look identical to insertions at opposite ends of the file — both just become “ours added X, theirs added Y, both go in the output, sorted.”
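
The three reasons above boil down to one property: the emitted text depends only on the set of keys and their values, never on the order they arrived in. A minimal sketch of such a canonical serializer follows. `serializeMetadataDb` is a hypothetical name, and it is an assumption that plain `Array.prototype.sort()` ordering (by UTF-16 code units) matches what build:metadata uses.

```javascript
// Hypothetical serializer: sorted slugs, one record per line.
// Assumption: default sort() order is the same order the real build script uses.
function serializeMetadataDb(db) {
  const lines = Object.keys(db)
    .sort()
    .map((slug) => `${JSON.stringify(slug)}: ${JSON.stringify(db[slug])}`);
  return `{\n${lines.join(',\n')}\n}\n`;
}

// Same records, different insertion order on each side.
const fromA = serializeMetadataDb({ cherry: { w: 3 }, apple: { w: 1 }, banana: { w: 2 } });
const fromMain = serializeMetadataDb({ banana: { w: 2 }, cherry: { w: 3 }, apple: { w: 1 } });
console.log(fromA === fromMain); // true
```

Sharing one serializer between the build step and the merge driver is what keeps their output from drifting apart, which is the reason the review asked for a shared helper.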

This is exactly why the issue’s recommendation #1 was the merge driver, not “force everyone to use a different sort order” or “shard by prefix” — the bug isn’t the order, it’s that the default merger doesn’t understand the file is a key-value map.

When the script does have to make a choice

There is exactly one scenario where the script can’t auto-decide:

Both branches modified the same key to different values.

Example: A re-processed image apple and got a new blurhash (because the source image changed); main meanwhile updated apple’s width field via some other path. Both ours and theirs differ from base, and differ from each other.

In that case the script doesn’t try to merge the values within a single record (that would be reckless). It writes the file with <<<<<<< markers around just that one slug’s record, exits with code 1, and you hand-resolve only that slug. The other 1899 slugs in the file are still cleanly merged automatically.

In practice this is rare for products because each slug’s metadata is owned by whoever generated it, and re-processing an image isn’t a normal workflow.
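
For the one case that does need a human, the output could be produced roughly as below. This is a sketch under assumptions: the conversation only says that markers go around the disagreeing record, so the exact marker layout, the `ours` / `theirs` labels, and the name `renderWithConflicts` are all hypothetical. It also assumes both sides still have the record (no deletion).

```javascript
// Hypothetical renderer. `entries` is a slug-sorted list of either
//   { slug, value }          a cleanly merged record
//   { slug, ours, theirs }   a record both sides changed differently
function renderWithConflicts(entries) {
  const record = (slug, value) => `${JSON.stringify(slug)}: ${JSON.stringify(value)}`;
  const body = [];
  let hasConflict = false;
  entries.forEach((entry, index) => {
    const comma = index < entries.length - 1 ? ',' : '';
    if ('value' in entry) {
      body.push(record(entry.slug, entry.value) + comma);
    } else {
      hasConflict = true;
      body.push('<<<<<<< ours');
      body.push(record(entry.slug, entry.ours) + comma);
      body.push('=======');
      body.push(record(entry.slug, entry.theirs) + comma);
      body.push('>>>>>>> theirs');
    }
  });
  // A non-zero exit code is how a merge driver tells git that conflicts remain.
  return { text: `{\n${body.join('\n')}\n}\n`, exitCode: hasConflict ? 1 : 0 };
}

const out = renderWithConflicts([
  { slug: 'apple', ours: { blurhash: 'A' }, theirs: { blurhash: 'B' } },
  { slug: 'cherry', value: { blurhash: 'C' } },
]);
console.log(out.text);
console.log(out.exitCode); // 1
```

Because each side of the conflict is a complete line, resolving it means deleting the markers and one of the two lines, and the file is valid JSON again.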

TL;DR

  • The default text merger asks “which order wins?” because it’s looking at lines and doesn’t know they’re keys.
  • The custom driver doesn’t ask that question. It builds a key-value union, sorts the result, and emits. Order falls out of the sort automatically — same rule both branches and build:metadata already use.
  • The only thing requiring your judgment is when two branches edit the same key to different values, which is rare and handled by leaving conflict markers around just that one record.

So yes: ordering is decided automatically, and deterministically, by alphabetical sort — not by picking a side.

Takazudo

I see. So in short, this merges the result of script processing rather than doing a plain text diff merge, right? So it doesn’t care about line numbers or other small details that only matter when the file is viewed as text, right?

Claude

Exactly right. Two-line summary:

  1. It treats the file as data, not as text. The script does JSON.parse(base), JSON.parse(ours), JSON.parse(theirs) — so it reasons about slugs and their metadata records, not lines and characters. The merge happens at the data level, then the result is re-serialized into the standard sorted shape.
  2. Anything purely textual stops being a conflict. Line numbers, whitespace differences, where a record landed in the file, even hypothetically the order in which keys appear inside a single record’s JSON — none of that matters. As long as the parsed data agrees per slug, the driver produces a clean merge.

The only thing the script does care about is per-slug semantic disagreement — i.e. “did both branches change slug X to two different values.” That’s the one case it surfaces as a real conflict. Everything else (adjacent insertions, sort drift, formatting) becomes invisible noise that the script silently normalizes.

So you can think of the driver as: “Forget that this is a text file. Treat it as the JSON object it really is. Merge the object. Re-emit it in the canonical shape.”
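
The point about key order inside a single record comes down to comparing parsed values instead of text. A minimal sketch of such a comparison is below. `recordsEqual` is a hypothetical name, written by hand here to stay self-contained. For plain JSON data, Node's `util.isDeepStrictEqual` should give the same answers.

```javascript
// Hypothetical deep comparison for parsed JSON values.
// Object key order is ignored; array order is significant.
function recordsEqual(a, b) {
  if (a === b) return true;
  if (a === null || b === null || typeof a !== 'object' || typeof b !== 'object') return false;
  if (Array.isArray(a) !== Array.isArray(b)) return false;
  const keysA = Object.keys(a);
  const keysB = Object.keys(b);
  if (keysA.length !== keysB.length) return false;
  return keysA.every(
    (key) => Object.prototype.hasOwnProperty.call(b, key) && recordsEqual(a[key], b[key]),
  );
}

console.log(recordsEqual({ width: 800, blurhash: 'L6' }, { blurhash: 'L6', width: 800 })); // true
console.log(recordsEqual({ width: 800 }, { width: 801 }));                                 // false
```

With a comparison like this, a record whose keys were merely reordered counts as unchanged, so it never shows up as a conflict.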