catalog: allocate cluster replica ids durably, single-source#37178
catalog: allocate cluster replica ids durably, single-source#37178antiguru wants to merge 2 commits into
Conversation
Replica ids were allocated in-apply by insert_cluster_replica, reading the IdAlloc counter from the snapshot the coordinator opened the transaction with. Cluster and item ids are instead allocated out-of-band by the durable allocator before the transaction, and the op carries the explicit id. That asymmetry breaks any caller that pre-allocates a replica id out-of-band: the coordinator does not apply IdAlloc updates to its in-memory catalog, so a later in-apply allocation reuses the id and a DuplicateReplica panic surfaces on the next bootstrap. Make replica ids durable-allocated and single-source like cluster ids. Op::CreateClusterReplica now carries a required replica_id; the apply path always uses insert_cluster_replica_with_id and never allocates. The runtime construction sites pre-allocate via the durable allocator, choosing user or system ids from the owning cluster. The catalog-open builtin-replica migration keeps a transaction-level system allocation, which is single-source inside the open transaction. The in-apply insert_cluster_replica and the unused transaction-level user allocator are removed. DB-147 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01FUorxx5m6B8TT2VAek2STo
antiguru
left a comment
There was a problem hiding this comment.
Reviewed at ebce5cc. The change is correct and faithful to the existing cluster/item-id allocation pattern. Summary of what I verified and where I'd push.
Correctness — holds up
- Core fix is sound. Moving replica-id allocation out-of-band (durable allocator, before the txn) and making
Op::CreateClusterReplicacarry a requiredreplica_idremoves the in-applyIdAllocread that the coordinator never reflects in its in-memory catalog. That's the actual source of theDuplicateReplicabootstrap panic, so this closes it at the right layer. - alter id accounting is exact.
needed_replica_idsequals what each branch consumes: config_changed → new_rf (both the None drop-all-recreate and the For/UntilReady pending branches create new_rf); pure scale-up → new_rf - rf; scale-down/no-op → 0. Allocation is gated on > 0, the iterator type matches in both arms, and the.expect("pre-allocated enough replica ids")is genuinely unreachable. At worst you over-allocate zero. Good. - Validation-before-allocation preserved everywhere. Both create paths and alter keep the eager max_replicas_per_cluster check (database-issues#6046) ahead of allocation, so a rejected DDL allocates nothing.
- zip_eq/into_element invariants hold (allocate n, zip with n; allocate 1, take 1).
- Builtin migration is behavior-preserving. Forcing
allocate_system_replica_idmatches the oldinsert_cluster_replicadispatch, since builtin replicas are always on system clusters. - No dangling callers.
insert_cluster_replicaandallocate_user_replica_idare gone with no references left in mz-adapter/mz-catalog;insert_cluster_replica_with_idis correctly bumped topubfor the cross-crate call. The twoddl.rsOp::CreateClusterReplicamatches use.., so the new field doesn't touch them.
One caveat: my clone is blobless, so I couldn't grep the entire tree. The crates that actually touch Transaction directly are clean, but confirm no test-only caller elsewhere references the two removed pub fns before un-drafting.
Things to weigh
- Extra durable commit per replica-creating DDL. Allocation was previously free (in-apply, same txn); now every CREATE CLUSTER / CREATE REPLICA / config-changing ALTER does an additional
allocate_idcommit beforecatalog_transact. It's batched (one commit for N replicas), and it matches how cluster/item ids already work, so it's consistent — but it does add a round-trip to the latency of these statements. Acceptable given the correctness goal; just flagging it's a real, if small, regression in DDL latency. - Id-space gaps on a failed/rejected txn. If
catalog_transactfails after allocation, the counter has already advanced and those ids are burned. Harmless (opaque monotonic counters) and identical to existing cluster/item semantics, but it's now true for replicas too. Worth a one-line acknowledgement in the commit message rather than a fix. - id_ts fetched at top of alter vs. inline in the create paths. The comment justifies the early fetch (borrow conflict), and on the single-threaded coord nothing advances the write ts between fetch and use, so it's safe. Minor style inconsistency only.
Gap I'd push on
No regression test. The failure being fixed is a bootstrap panic, and the triggering caller (a pre-allocated replica id colliding with an in-apply one) is exactly what the dependent stack introduces. The diff relies on existing mz-catalog tests + the cluster suite, neither of which asserts that an out-of-band-allocated replica id survives a bootstrap without collision. I understand the real caller doesn't exist yet, so a direct end-to-end test is awkward — but a mz-catalog unit test asserting that Op::CreateClusterReplica with an explicit id never re-allocates (and that the durable counter advances past it) would lock in the invariant this PR establishes and is cheap to add now. Given the severity of the failure mode, I'd want that before merge rather than after the stack lands.
Nothing here blocks the direction — the design is the right one. The test gap is the only item I'd resolve before un-drafting.
Assert that inserting a replica with an explicit id via insert_cluster_replica_with_id does not consume the IdAlloc counter, and that the durable allocator advances independently so a later allocation returns the direct successor and never collides with an explicitly inserted id. This locks in the invariant the parent commit establishes, which is the source of the DuplicateReplica bootstrap panic the dependent stack would otherwise hit. DB-147 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01FUorxx5m6B8TT2VAek2STo
aljoscha
left a comment
There was a problem hiding this comment.
Overall: This is correct and well scoped, and the direction is the right one. Single-sourcing replica id allocation removes a real coexistence hazard (DB-147) and brings replicas in line with how cluster and item ids already work. Nothing blocking. A couple of notes below, mostly operational and one small efficiency suggestion.
Correctness (looks good)
I traced the id accounting and timestamp handling and both hold up:
- The allocate-then-consume counts in
sequence_alter_cluster_managed_to_managedmatch exactly per branch (config-changed recreatesnew_replication_factor, pure scale-up creates the delta, scale-down and no-op create none), so the.expect("pre-allocated enough replica ids")cannot fire and no ids are over-allocated. - The allocation runs after the eager
max_replicas_per_clustervalidation, so a rejected alter allocates nothing durably. id_tsis safe to capture up front: the only code between capturing it andallocate_replica_idsis synchronous validation, and the coordinator holds&mut selfthroughout, so the catalog upper cannot advance underneath andcommit_ts >= upperholds. This matches the existingsequence_create_clusterpattern.- The two
ddl.rsmatch sites are unaffected because they use.., which is why the diff correctly leaves them alone. - The DB-147 fix is sound: nothing reads the stale in-memory
IdAllocanymore, andremove_orphaned_replicaskeys off the durable counter viaget_next_{user,system}_replica_id, which stays a correct high-watermark for one-at-a-time out-of-band allocation.
Operational note (the main thing to be aware of)
This converts replica creation from one durable catalog write (the id was allocated in-apply inside the create transaction) into two: the allocate_replica_ids transaction plus the create transaction, each with its own timestamp-oracle round-trip. So CREATE CLUSTER (with replicas), CREATE CLUSTER REPLICA, ALTER CLUSTER scale-up/reconfigure, and automated cluster scheduling turn-on each gain one durable write and one oracle round-trip. This is exactly the cost clusters and items already pay, so it is consistent, and it is batched (N replicas in one statement is one allocation write, not N). Worth being explicit about in the PR description since it is a small but real increase in catalog write traffic for replica DDL and for scheduled-cluster flapping. If this ever shows up as a cost, the established lever is batching through the IdPool (as was done for user item ids), not folding allocation back into the transaction. No action needed here, just flagging.
A side effect to note: replica ids are no longer guaranteed contiguous. On a crash or a rejected create transaction between the allocation commit and the create commit, the allocated id is consumed but unused, leaving gaps in u/s replica ids in mz_cluster_replicas and audit logs. This is harmless and matches cluster/item behavior, but operators may notice it.
Strong suggestion: fetch id_ts lazily in the alter path
In sequence_alter_cluster_managed_to_managed, id_ts = self.get_catalog_write_ts().await is fetched unconditionally at function entry, but when needed_replica_ids == 0 (scale-down, no-op, and importantly every automated scheduling turn-off) nothing is allocated, so that oracle round-trip is wasted on a recurring path. The comment explains it is hoisted for borrow-checker reasons against the cluster borrow. Consider extracting the cluster data you need and dropping the borrow, then fetching id_ts only on the needed_replica_ids > 0 path, so turn-off and no-op alters do not hit the oracle. Minor, but it is on the scheduling hot path.
Tests
The durable-layer regression test is good and targets the right invariant (explicit insert consumes zero allocator ids, allocator advances independently). The coordinator paths are covered by the cluster mzcompose suite. One gap worth a sentence in the description: there is no end-to-end test reproducing the original DuplicateReplica-on-bootstrap symptom. That is acceptable for pre-work, and the colliding pre-allocation it guards against lands with the scoped-flags stack, so an integration regression will likely come naturally there.
Nits
- The three near-identical "Pre-allocate replica ids out-of-band..." comments are a touch repetitive across the file. Could collapse to one canonical explanation referenced from the others, but fine as is.
Vec::<ReplicaId>::new().into_iter()reads slightly oddly as the empty-iterator branch. Equivalent tovec![].into_iter(), keep whichever you prefer.
aljoscha
left a comment
There was a problem hiding this comment.
I think the change is good!
Motivation
Closes https://linear.app/materializeinc/issue/DB-147
Cluster replica ids are allocated in-apply by
insert_cluster_replica, reading theIdAlloccounter from the snapshot the coordinator opened the transaction with.Cluster and item ids are instead allocated out-of-band by the durable allocator before the transaction, and the op carries the explicit id.
That asymmetry breaks any caller that pre-allocates a replica id out-of-band: the coordinator does not apply
IdAllocupdates to its in-memory catalog, so a later in-apply allocation reuses the id and aDuplicateReplicapanic surfaces on the next bootstrap.Description
Make replica ids durable-allocated and single-source like cluster ids.
Op::CreateClusterReplicacarries a requiredreplica_id; the apply path always usesinsert_cluster_replica_with_idand never allocates.Runtime construction sites pre-allocate via the durable allocator, choosing user or system ids from the owning cluster.
The catalog-open builtin-replica migration keeps a transaction-level system allocation, which is single-source inside the open transaction with no coordinator.
The in-apply
insert_cluster_replicaand the unused transaction-level user allocator are removed.This is pre-work for the scoped feature flags stack (#36959, #37158), which needs to pre-allocate a replica id to fold replica-scoped data into the create transaction.
Verification
cargo test -p mz-catalogand theclustermzcompose suite.