feat(git): move repo-name registry to Postgres + relax RWM chart gate (HA relay)#1432

Merged: tlongwell-block merged 4 commits into main from feat/git-name-registry-pg on Jul 1, 2026

Conversation

@tlongwell-block

Collaborator

Summary

Makes the buzz relay stateless enough to run multiple pods without a ReadWriteMany (RWM) volume, unblocking HA deployment. The last local-disk state — the .names/ repo-name registry — moves into Postgres, and the Helm chart's RWM hard-gate is relaxed (Redis stays required at replicaCount > 1).

This lets the bb-block EFS/CSI workaround (bb-block #129, which created an EFS filesystem purely to get RWM for the git PVC) be closed — no shared git storage needed. Git ref/object state was already object-store-backed; only name allocation lived on local disk.

What changed

Registry → Postgres (buzz-db::git_repo, migration)

  • New git_repo_names table, PK (community_id, repo_id) — the DB primary key is the race guard, so concurrent same-name announces are TOCTOU-free.
  • reserve_repo_name → Reserved / AlreadyOwned / TakenByOther; plus repo_name_owner, count_repos_for_owner (per-pubkey quota), release_repo_name.
  • Name uniqueness is per-community, inside the server-resolved community boundary.
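
As a rough illustration of the registry semantics listed above, here is a minimal in-memory sketch. It is not the buzz-db code: the `NameRegistry` type, the method signatures, and the use of a `HashMap` key in place of the `(community_id, repo_id)` primary key are all assumptions made for the example. Only the outcome names and function names come from this PR.

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

/// Outcome names come from the PR; the rest of this model is illustrative.
#[derive(Debug, PartialEq, Eq)]
pub enum ReserveOutcome {
    Reserved,
    AlreadyOwned,
    TakenByOther,
}

/// In-memory stand-in for `git_repo_names` (assumption: not the real schema).
/// The map key plays the role of the (community_id, repo_id) primary key.
#[derive(Default)]
pub struct NameRegistry {
    rows: HashMap<(String, String), String>, // key -> owner pubkey
}

impl NameRegistry {
    /// Models an insert that does nothing on key conflict, then an owner check.
    pub fn reserve_repo_name(&mut self, community: &str, repo: &str, owner: &str) -> ReserveOutcome {
        match self.rows.entry((community.to_string(), repo.to_string())) {
            Entry::Vacant(slot) => {
                slot.insert(owner.to_string());
                ReserveOutcome::Reserved
            }
            Entry::Occupied(row) => {
                if row.get().as_str() == owner {
                    ReserveOutcome::AlreadyOwned
                } else {
                    ReserveOutcome::TakenByOther
                }
            }
        }
    }

    /// Per-owner count used for the quota (hypothetical signature).
    pub fn count_repos_for_owner(&self, community: &str, owner: &str) -> usize {
        self.rows
            .iter()
            .filter(|(key, row_owner)| key.0 == community && row_owner.as_str() == owner)
            .count()
    }

    /// Owner-scoped release: deletes the row only if `owner` holds it.
    pub fn release_repo_name(&mut self, community: &str, repo: &str, owner: &str) -> bool {
        let key = (community.to_string(), repo.to_string());
        if self.rows.get(&key).map(String::as_str) == Some(owner) {
            self.rows.remove(&key);
            true
        } else {
            false
        }
    }
}
```

Because the uniqueness decision is made by the keyed insert itself, there is no separate "check, then create" step for two concurrent announces to race through; the same name in a different community is a different key and reserves independently.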

Announce handler (handle_git_repo_announcement, side_effects.rs)

  • Rewritten off create_dir/read_dir onto the DB registry + object-store pointer.
  • Attempt-scoped rollback: only a name freshly Reserved by this attempt is released on a pointer failure; AlreadyOwned never deletes another attempt's row.
  • Split pointer step by outcome so re-announce is idempotent:
    • fresh Reserved → strict seed_manifest_pointer (creates empty pointer; a pre-existing non-empty pointer under a just-reserved name is suspicious → fail + rollback).
    • same-owner AlreadyOwned → tolerant ensure_manifest_pointer (any existing pointer, empty or non-empty, is left untouched; only an absent pointer is repaired). This keeps normal re-announce-after-push accepted while never overwriting real ref state.
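
The outcome split described above can be sketched as follows. This is a simplified model under stated assumptions: `PointerStore`, `PointerStep`, `pointer_step`, `record_push`, and the `EMPTY` digest constant are hypothetical names, the store is a plain map rather than an object store, and the tolerant path is modeled as infallible even though a real store call could fail.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ReserveOutcome {
    Reserved,
    AlreadyOwned,
    TakenByOther,
}

/// Placeholder digest for the empty manifest (assumption for this sketch).
const EMPTY: &str = "empty-manifest";

/// Hypothetical pointer store: repo name -> manifest digest.
#[derive(Default)]
pub struct PointerStore {
    pointers: HashMap<String, String>,
}

impl PointerStore {
    /// Strict path for fresh claims: create the empty pointer, and refuse
    /// if a non-empty pointer already sits under the just-reserved name.
    pub fn seed_manifest_pointer(&mut self, repo: &str) -> Result<(), String> {
        let digest = self
            .pointers
            .entry(repo.to_string())
            .or_insert_with(|| EMPTY.to_string());
        if digest.as_str() == EMPTY {
            Ok(())
        } else {
            Err(format!("{repo} already has a non-empty pointer"))
        }
    }

    /// Tolerant path for re-announce: any existing pointer is left alone,
    /// only an absent pointer is repaired with the empty one.
    pub fn ensure_manifest_pointer(&mut self, repo: &str) {
        self.pointers
            .entry(repo.to_string())
            .or_insert_with(|| EMPTY.to_string());
    }

    /// Stand-in for a push that moves the pointer to real content.
    pub fn record_push(&mut self, repo: &str, digest: &str) {
        self.pointers.insert(repo.to_string(), digest.to_string());
    }

    pub fn digest(&self, repo: &str) -> Option<&str> {
        self.pointers.get(repo).map(String::as_str)
    }
}

#[derive(Debug, PartialEq, Eq)]
pub struct PointerStep {
    pub accepted: bool,
    /// True only when THIS attempt inserted the row and must roll it back.
    pub release_row: bool,
}

pub fn pointer_step(outcome: ReserveOutcome, store: &mut PointerStore, repo: &str) -> PointerStep {
    match outcome {
        ReserveOutcome::TakenByOther => PointerStep { accepted: false, release_row: false },
        ReserveOutcome::Reserved => match store.seed_manifest_pointer(repo) {
            Ok(()) => PointerStep { accepted: true, release_row: false },
            Err(_) => PointerStep { accepted: false, release_row: true },
        },
        ReserveOutcome::AlreadyOwned => {
            store.ensure_manifest_pointer(repo);
            PointerStep { accepted: true, release_row: false }
        }
    }
}
```

Note that `release_row` can only become true on the `Reserved` branch, which is the attempt-scoped rollback rule: an `AlreadyOwned` attempt never deletes a row it did not insert.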

Helm chart RWM cleanup (~13 sites)

  • values.yaml, values.schema.json, README.md, NOTES.txt, examples/*, and test fixtures now reflect: replicaCount > 1 requires Redis only; git is object-store-backed, so ReadWriteOnce per replica is fine.
  • The Redis-absent gate test is retained (Redis is still required for multi-pod pub/sub).
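
For readers who do not want to trace the Helm templates, the relaxed gate reduces to a single rule. The sketch below restates it in Rust purely as a model; the real check lives in the chart's `_validate.tpl`, and the `ChartValues` struct, its field names, and the exact error wording beyond the quoted "requires Redis for buzz-pubsub" fragment are assumptions.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AccessMode {
    ReadWriteOnce,
    ReadWriteMany,
}

/// Hypothetical flattened view of the values the gate looks at.
pub struct ChartValues {
    pub replica_count: u32,
    pub git_access_mode: AccessMode,
    pub redis_configured: bool,
}

/// Model of the relaxed gate: multi-replica needs Redis, and the git PVC
/// access mode is no longer part of the decision.
pub fn validate(values: &ChartValues) -> Result<(), String> {
    if values.replica_count > 1 && !values.redis_configured {
        return Err("replicaCount > 1 requires Redis for buzz-pubsub".to_string());
    }
    // `git_access_mode` is deliberately not inspected: git state is
    // object-store-backed, so ReadWriteOnce per replica is acceptable.
    Ok(())
}
```

Before this PR, the equivalent rule would also have rejected `ReadWriteOnce` at `replicaCount > 1`; that branch is what the chart cleanup removes.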

Validation

  • buzz-relay suite 429/0; buzz-db git_repo live-PG 4/4; clippy clean.
  • helm unittest: 30/30 across 7 suites, including render_test asserting the git PVC renders ReadWriteOnce at replicaCount=3.
  • helm template A/B: replicaCount=2 + RWO + Redis renders; replicaCount=2 + RWO + no Redis still blocks ("requires Redis for buzz-pubsub").

Live clean-room git validation

Stood up a fresh relay (empty DB verified 0→37 tables, fresh object-store bucket, AUTO_MIGRATE=true, A3 git object-store conformance probe passed) and drove the git path end-to-end at this branch's tip:

  • announce → clone → push (pointer non-empty) → re-announce same owner = ACCEPTED (the regression the tolerant path fixes; relay log shows the fresh claim on the strict-seed path and the post-push re-announce on the tolerant path, non-empty pointer untouched).
  • re-clone after re-announce recovers the pushed content — ref state is not clobbered.
  • control: cross-owner announce of the same name is rejected, exactly one registry row survives (PK guard).

Reviewed independently before opening.

Follow-ups (non-blocking)

  • Remove the now-dead git_repo_path / PVC mount (kept in this PR to minimize the diff).

npub1qyvc0c5kl4gqv2fd97fsk46tu378sqgy35vc83rvgfwne90sel7s0ed67d and others added 4 commits July 1, 2026 12:44
The relay's git ref/object state is already fully object-store-backed:
every read and write hydrates an ephemeral bare repo from S3 per request,
and writer serialization is the object-store pointer CAS
(docs/git-on-object-storage.md, Inv_NoFork). The one remaining piece of
persistent local-disk state was the `.names/<community>/<repo_id>` repo-name
reservation index, which forced multi-replica deployments onto a shared
ReadWriteMany (EFS) volume just to agree on name ownership.

Move that registry into Postgres (already a hard relay dependency):

- New tenant-scoped table `git_repo_names` in the consolidated 0001 schema,
  keyed `(community_id, repo_id)` so the migration-lint invariant holds and
  name uniqueness is per-community and DB-enforced.
- New `buzz-db::git_repo` module mirroring `relay_members`:
  `reserve_repo_name` (INSERT … ON CONFLICT DO NOTHING → Reserved /
  AlreadyOwned / TakenByOther), `repo_name_owner`, `count_repos_for_owner`
  (quota), `release_repo_name` (owner-scoped rollback).
- `handle_git_repo_announcement` reworked onto those calls, preserving the
  three semantics (atomic uniqueness, idempotent same-owner re-announce,
  per-pubkey quota) and the all-or-nothing seed-failure rollback.

With no persistent git state on disk, drop the chart's ReadWriteMany
hard-fail: `_validate.tpl` no longer requires
persistence.git.accessMode=ReadWriteMany at replicaCount>1. Redis remains
required for buzz-pubsub — that is the real multi-pod dependency.

The now-unused `git_repo_path` config field / git PVC mount is left in place
(doc-noted) as a separate cleanup PR to keep this diff minimal.

Verified: full buzz-relay suite (429) green; buzz-db lib (79) green
including migration-lint; new git_repo DB tests pass against live Postgres;
clippy clean; helm template confirms replicaCount=2 + ReadWriteOnce renders
while the Redis gate still blocks when no Redis source is configured.

Co-authored-by: Tyler Longwell <tlongwell@block.xyz>
Signed-off-by: Tyler Longwell <tlongwell@block.xyz>

Addresses two review blockers on the .names->Postgres change.

Blocker 1 (name-registry rollback/re-announce race):
- The seed-failure rollback was owner-scoped but not attempt-scoped: an
  AlreadyOwned attempt could DELETE the reservation row a concurrent
  same-owner attempt had inserted and successfully seeded. Now only a
  genuinely fresh Reserved outcome by THIS attempt may release the row;
  AlreadyOwned releases nothing.
- Same-owner re-announce no longer returns Ok() purely on the existing
  row (which proves ownership, not that the manifest pointer was seeded).
  It falls through to seed_manifest_pointer, which is idempotent under
  concurrency (create-only put_pointer(IfNoneMatchStar); LostRace on the
  same empty digest is success, a different non-empty pointer is a hard
  error). Handler success now means row AND pointer are both present, so
  a repo can never be 'accepted' while uncloneable.

Blocker 2 (stale RWM assertions across the chart):
Relaxing the gate in _validate.tpl left the rest of the chart claiming
ReadWriteMany is required. Made the gate-relax chart-wide:
- values.yaml, values.schema.json, README.md: replicaCount>1 requires
  Redis only; git is object-store-backed, RWO per replica is fine.
- NOTES.txt: dropped the now-false 'RWO will fail at template time' warning.
- examples/argocd-app.yaml, examples/flux-helmrelease.yaml: RWO + no efs-sc.
- tests: validation_test drops the obsolete 'RWO fails' case (RWO no longer
  fails; the Redis-absent gate test is retained); render_test + HA/production
  fixtures now assert replicaCount=3 renders with ReadWriteOnce.

Verified: buzz-relay suite 429/0; buzz-db git_repo live-PG tests 4/4; clippy
clean; helm unittest 30/30 (7 suites); helm template confirms replicaCount=2 +
RWO + Redis renders and replicaCount=2 + RWO + no-Redis is still blocked.

Co-authored-by: Tyler Longwell <tlongwell@block.xyz>
Signed-off-by: Tyler Longwell <tlongwell@block.xyz>

…nsure)

The prior fix routed the same-owner AlreadyOwned reannounce path through
seed_manifest_pointer, which is intentionally strict for repo *creation*:
its LostRace branch only accepts an existing pointer if it names the same
empty manifest digest, and errors otherwise. That regressed the normal
lifecycle — once an owner pushed, the pointer named a non-empty manifest,
so the next kind:30617 reannounce failed with 'already has a non-empty
pointer; refusing to overwrite via announce'.

Split the pointer step by outcome:

- Fresh Reserved claim (this attempt inserted the row) -> seed_manifest_pointer,
  kept strict: a non-empty pointer under a just-reserved name is suspicious
  (stale prior lifecycle) and correctly fails + rolls back the fresh row.
- Same-owner AlreadyOwned (reannounce) -> new ensure_manifest_pointer, tolerant:
  any existing pointer (empty OR non-empty) is valid and left untouched; only an
  absent pointer is repaired by seeding the empty pointer. This restores the
  old filesystem behavior (reannounce is an idempotent update regardless of
  pushed state) and keeps the 'announced <=> pointer exists' invariant, while
  never overwriting real ref state.

The read-then-conditional-seed in ensure_manifest_pointer is race-safe: the
repair uses the same create-only put_pointer(IfNoneMatchStar), so a concurrent
seeder or pusher that populates the pointer between the read and the seed loses
the create race and is treated as already-present, not an overwrite.
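
To make the race argument concrete, here is a small model of a read followed by a create-only write. It is a sketch, not the relay's code: `ObjectStore`, `put_if_absent`, `EnsureOutcome`, and the `between` hook (which simulates a concurrent writer running between the read and the seed) are hypothetical, and `put_if_absent` only imitates the create-only behavior the commit attributes to `put_pointer(IfNoneMatchStar)`.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq, Eq)]
pub enum PutOutcome {
    Created,
    LostRace,
}

#[derive(Debug, PartialEq, Eq)]
pub enum EnsureOutcome {
    AlreadyPresent,
    Repaired,
}

/// Hypothetical in-memory object store.
#[derive(Default)]
pub struct ObjectStore {
    objects: HashMap<String, String>,
}

impl ObjectStore {
    pub fn get(&self, key: &str) -> Option<&str> {
        self.objects.get(key).map(String::as_str)
    }

    /// Create-only write: never replaces an existing value.
    pub fn put_if_absent(&mut self, key: &str, value: &str) -> PutOutcome {
        if self.objects.contains_key(key) {
            PutOutcome::LostRace
        } else {
            self.objects.insert(key.to_string(), value.to_string());
            PutOutcome::Created
        }
    }
}

/// Read, then conditionally seed. `between` stands in for whatever another
/// writer might do in the window between the read and the seed.
pub fn ensure_pointer<F>(store: &mut ObjectStore, key: &str, between: F) -> EnsureOutcome
where
    F: FnOnce(&mut ObjectStore),
{
    if store.get(key).is_some() {
        return EnsureOutcome::AlreadyPresent;
    }
    between(store);
    match store.put_if_absent(key, "empty") {
        PutOutcome::Created => EnsureOutcome::Repaired,
        // Someone else populated the pointer first: treat it as present.
        PutOutcome::LostRace => EnsureOutcome::AlreadyPresent,
    }
}
```

The safety property comes entirely from the write being create-only: even when the read is stale, the worst case is a lost race, never an overwrite of a pointer that a pusher just populated.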

Rollback semantics unchanged: only reserved_by_this_attempt releases the row.

Verified: buzz-relay suite 429/0; git_repo live-PG 4/4; clippy clean; helm
unittest 30/30 (chart unchanged this pass).

Note: no cheap handler-level test exists (the announce path needs a full
AppState with a live git_store); coverage is the full suite + the isolated,
reasoned ensure_manifest_pointer branching. Flagged for reviewer.

Co-authored-by: Tyler Longwell <tlongwell@block.xyz>
Signed-off-by: Tyler Longwell <tlongwell@block.xyz>

…eserve

Two review blockers on the name-registry work:

1. Brownfield migration. git_repo_names was added by editing the
   consolidated 0001 in place, changing its checksum. Any database that
   already applied the pre-PR 0001 would abort startup under AUTO_MIGRATE
   with sqlx VersionMismatch(1). Revert 0001 to be byte-identical to
   origin/main and add the table + owner index as an additive
   0002_git_repo_names.sql. To keep the tenant-isolation lint honest as
   migrations accrue, migration_sql() now concatenates every embedded
   migration in version order (git_repo_names is community_id-led, so it
   passes). Tests updated: embedded_migrator expects 2 migrations and
   asserts git_repo_names lives in v2 (absent from v1); the live
   run_migrations test expects applied_versions [1, 2].

2. Stale ref-state on re-announce. handle_git_repo_announcement called
   emit_initial_ref_state unconditionally, so a same-owner
   re-announce-after-push published a newer empty kind:30618 that, under
   NIP-16 latest-replaceable ordering, shadowed the real pushed refs.
   Gate the emission behind reserved_by_this_attempt: emit the initial
   empty ref-state only on a fresh Reserved claim. AlreadyOwned
   re-announce still ensures/repairs the manifest pointer but emits no
   empty 30618.
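
The shadowing problem in item 2 follows from "newest event wins". The sketch below models only that ordering rule and the gate; `RefState`, `visible`, and `on_announce` are hypothetical names, timestamps are plain integers, and the real replaceable-event key (kind, pubkey, and tag) is collapsed to a single list for brevity.

```rust
/// Simplified stand-in for a kind:30618 ref-state event.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RefState {
    pub created_at: u64,
    pub refs: Vec<String>,
}

/// Latest-replaceable ordering: the event with the greatest timestamp is
/// the one readers see.
pub fn visible(events: &[RefState]) -> Option<&RefState> {
    events.iter().max_by_key(|e| e.created_at)
}

/// Gated emission: publish the initial empty ref-state only when this
/// attempt freshly reserved the name.
pub fn on_announce(events: &mut Vec<RefState>, now: u64, reserved_by_this_attempt: bool) {
    if reserved_by_this_attempt {
        events.push(RefState { created_at: now, refs: Vec::new() });
    }
}
```

With the gate, a timeline of announce (t=100), push (t=200), re-announce (t=300) leaves the pushed refs visible. Without it, the re-announce would append an empty event at t=300, and that newer empty event is what `visible` would return.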

Verified: cargo check -p buzz-relay, cargo test -p buzz-db --lib (79
pass), clippy clean on both crates, helm unittest (30 pass). Brownfield
proven live against a throwaway Postgres seeded with origin/main's 0001:
the reviewed SHA errored VersionMismatch(1); this change applied [1, 2]
and created git_repo_names cleanly.
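
The brownfield failure and its fix can be reproduced in a toy migrator. This is a model of checksum-validated migrations in general, not sqlx itself: the `Migration` struct, the FNV-1a `checksum` helper, the `run_migrations` signature, and the SQL strings are all assumptions for illustration. Only the `VersionMismatch(1)` error and the `[1, 2]` result mirror what the commit reports.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum MigrateError {
    VersionMismatch(i64),
}

pub struct Migration {
    pub version: i64,
    pub sql: &'static str,
}

/// FNV-1a, used here only as a stand-in for the real migration checksum.
fn checksum(sql: &str) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325;
    for byte in sql.bytes() {
        hash ^= u64::from(byte);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3);
    }
    hash
}

/// `applied` models the bookkeeping table: (version, checksum) per row.
/// Already-applied versions must match byte-for-byte; new ones are applied.
pub fn run_migrations(
    applied: &mut Vec<(i64, u64)>,
    embedded: &[Migration],
) -> Result<Vec<i64>, MigrateError> {
    for migration in embedded {
        let recorded = applied
            .iter()
            .find(|(version, _)| *version == migration.version)
            .map(|(_, sum)| *sum);
        match recorded {
            Some(sum) if sum != checksum(migration.sql) => {
                return Err(MigrateError::VersionMismatch(migration.version));
            }
            Some(_) => {} // unchanged and already applied
            None => applied.push((migration.version, checksum(migration.sql))),
        }
    }
    Ok(applied.iter().map(|(version, _)| *version).collect())
}
```

A database that already recorded the original 0001 rejects an edited 0001, while the same database accepts the original 0001 plus an additive 0002, which is why the table moved into its own migration.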

Co-authored-by: Tyler Longwell <tlongwell@block.xyz>
Signed-off-by: Tyler Longwell <tlongwell@block.xyz>