catalog: allocate cluster replica ids durably, single-source by antiguru · Pull Request #37178 · MaterializeInc/materialize

antiguru · 2026-06-22T09:32:53Z

Motivation

Closes https://linear.app/materializeinc/issue/DB-147

Cluster replica ids are allocated in-apply by insert_cluster_replica, reading the IdAlloc counter from the snapshot the coordinator opened the transaction with.
Cluster and item ids are instead allocated out-of-band by the durable allocator before the transaction, and the op carries the explicit id.
That asymmetry breaks any caller that pre-allocates a replica id out-of-band: the coordinator does not apply IdAlloc updates to its in-memory catalog, so a later in-apply allocation reuses the id and a DuplicateReplica panic surfaces on the next bootstrap.

Description

Make replica ids durable-allocated and single-source like cluster ids.
Op::CreateClusterReplica carries a required replica_id; the apply path always uses insert_cluster_replica_with_id and never allocates.
Runtime construction sites pre-allocate via the durable allocator, choosing user or system ids from the owning cluster.
The catalog-open builtin-replica migration keeps a transaction-level system allocation, which is single-source inside the open transaction with no coordinator.
The in-apply insert_cluster_replica and the unused transaction-level user allocator are removed.

This is pre-work for the scoped feature flags stack (#36959, #37158), which needs to pre-allocate a replica id to fold replica-scoped data into the create transaction.

Verification

cargo test -p mz-catalog and the cluster mzcompose suite.

Replica ids were allocated in-apply by insert_cluster_replica, reading the IdAlloc counter from the snapshot the coordinator opened the transaction with. Cluster and item ids are instead allocated out-of-band by the durable allocator before the transaction, and the op carries the explicit id. That asymmetry breaks any caller that pre-allocates a replica id out-of-band: the coordinator does not apply IdAlloc updates to its in-memory catalog, so a later in-apply allocation reuses the id and a DuplicateReplica panic surfaces on the next bootstrap. Make replica ids durable-allocated and single-source like cluster ids. Op::CreateClusterReplica now carries a required replica_id; the apply path always uses insert_cluster_replica_with_id and never allocates. The runtime construction sites pre-allocate via the durable allocator, choosing user or system ids from the owning cluster. The catalog-open builtin-replica migration keeps a transaction-level system allocation, which is single-source inside the open transaction. The in-apply insert_cluster_replica and the unused transaction-level user allocator are removed. DB-147 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01FUorxx5m6B8TT2VAek2STo

antiguru

Reviewed at ebce5cc. The change is correct and faithful to the existing cluster/item-id allocation pattern. Summary of what I verified and where I'd push.

Correctness — holds up

Core fix is sound. Moving replica-id allocation out-of-band (durable allocator, before the txn) and making Op::CreateClusterReplica carry a required replica_id removes the in-apply IdAlloc read that the coordinator never reflects in its in-memory catalog. That's the actual source of the DuplicateReplica bootstrap panic, so this closes it at the right layer.
alter id accounting is exact. needed_replica_ids equals what each branch consumes: config_changed → new_rf (both the None drop-all-recreate and the For/UntilReady pending branches create new_rf); pure scale-up → new_rf - rf; scale-down/no-op → 0. Allocation is gated on > 0, the iterator type matches in both arms, and the .expect("pre-allocated enough replica ids") is genuinely unreachable. At worst you over-allocate zero. Good.
Validation-before-allocation preserved everywhere. Both create paths and alter keep the eager max_replicas_per_cluster check (database-issues#6046) ahead of allocation, so a rejected DDL allocates nothing.
zip_eq/into_element invariants hold (allocate n, zip with n; allocate 1, take 1).
Builtin migration is behavior-preserving. Forcing allocate_system_replica_id matches the old insert_cluster_replica dispatch, since builtin replicas are always on system clusters.
No dangling callers. insert_cluster_replica and allocate_user_replica_id are gone with no references left in mz-adapter/mz-catalog; insert_cluster_replica_with_id is correctly bumped to pub for the cross-crate call. The two ddl.rs Op::CreateClusterReplica matches use .., so the new field doesn't touch them.

One caveat: my clone is blobless, so I couldn't grep the entire tree. The crates that actually touch Transaction directly are clean, but confirm no test-only caller elsewhere references the two removed pub fns before un-drafting.

Things to weigh

Extra durable commit per replica-creating DDL. Allocation was previously free (in-apply, same txn); now every CREATE CLUSTER / CREATE REPLICA / config-changing ALTER does an additional allocate_id commit before catalog_transact. It's batched (one commit for N replicas), and it matches how cluster/item ids already work, so it's consistent — but it does add a round-trip to the latency of these statements. Acceptable given the correctness goal; just flagging it's a real, if small, regression in DDL latency.
Id-space gaps on a failed/rejected txn. If catalog_transact fails after allocation, the counter has already advanced and those ids are burned. Harmless (opaque monotonic counters) and identical to existing cluster/item semantics, but it's now true for replicas too. Worth a one-line acknowledgement in the commit message rather than a fix.
id_ts fetched at top of alter vs. inline in the create paths. The comment justifies the early fetch (borrow conflict), and on the single-threaded coord nothing advances the write ts between fetch and use, so it's safe. Minor style inconsistency only.

Gap I'd push on

No regression test. The failure being fixed is a bootstrap panic, and the triggering caller (a pre-allocated replica id colliding with an in-apply one) is exactly what the dependent stack introduces. The diff relies on existing mz-catalog tests + the cluster suite, neither of which asserts that an out-of-band-allocated replica id survives a bootstrap without collision. I understand the real caller doesn't exist yet, so a direct end-to-end test is awkward — but a mz-catalog unit test asserting that Op::CreateClusterReplica with an explicit id never re-allocates (and that the durable counter advances past it) would lock in the invariant this PR establishes and is cheap to add now. Given the severity of the failure mode, I'd want that before merge rather than after the stack lands.

Nothing here blocks the direction — the design is the right one. The test gap is the only item I'd resolve before un-drafting.

Assert that inserting a replica with an explicit id via insert_cluster_replica_with_id does not consume the IdAlloc counter, and that the durable allocator advances independently so a later allocation returns the direct successor and never collides with an explicitly inserted id. This locks in the invariant the parent commit establishes, which is the source of the DuplicateReplica bootstrap panic the dependent stack would otherwise hit. DB-147 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01FUorxx5m6B8TT2VAek2STo

aljoscha

Overall: This is correct and well scoped, and the direction is the right one. Single-sourcing replica id allocation removes a real coexistence hazard (DB-147) and brings replicas in line with how cluster and item ids already work. Nothing blocking. A couple of notes below, mostly operational and one small efficiency suggestion.

Correctness (looks good)

I traced the id accounting and timestamp handling and both hold up:

The allocate-then-consume counts in sequence_alter_cluster_managed_to_managed match exactly per branch (config-changed recreates new_replication_factor, pure scale-up creates the delta, scale-down and no-op create none), so the .expect("pre-allocated enough replica ids") cannot fire and no ids are over-allocated.
The allocation runs after the eager max_replicas_per_cluster validation, so a rejected alter allocates nothing durably.
id_ts is safe to capture up front: the only code between capturing it and allocate_replica_ids is synchronous validation, and the coordinator holds &mut self throughout, so the catalog upper cannot advance underneath and commit_ts >= upper holds. This matches the existing sequence_create_cluster pattern.
The two ddl.rs match sites are unaffected because they use .., which is why the diff correctly leaves them alone.
The DB-147 fix is sound: nothing reads the stale in-memory IdAlloc anymore, and remove_orphaned_replicas keys off the durable counter via get_next_{user,system}_replica_id, which stays a correct high-watermark for one-at-a-time out-of-band allocation.

Operational note (the main thing to be aware of)

This converts replica creation from one durable catalog write (the id was allocated in-apply inside the create transaction) into two: the allocate_replica_ids transaction plus the create transaction, each with its own timestamp-oracle round-trip. So CREATE CLUSTER (with replicas), CREATE CLUSTER REPLICA, ALTER CLUSTER scale-up/reconfigure, and automated cluster scheduling turn-on each gain one durable write and one oracle round-trip. This is exactly the cost clusters and items already pay, so it is consistent, and it is batched (N replicas in one statement is one allocation write, not N). Worth being explicit about in the PR description since it is a small but real increase in catalog write traffic for replica DDL and for scheduled-cluster flapping. If this ever shows up as a cost, the established lever is batching through the IdPool (as was done for user item ids), not folding allocation back into the transaction. No action needed here, just flagging.

A side effect to note: replica ids are no longer guaranteed contiguous. On a crash or a rejected create transaction between the allocation commit and the create commit, the allocated id is consumed but unused, leaving gaps in u/s replica ids in mz_cluster_replicas and audit logs. This is harmless and matches cluster/item behavior, but operators may notice it.

Strong suggestion: fetch `id_ts` lazily in the alter path

In sequence_alter_cluster_managed_to_managed, id_ts = self.get_catalog_write_ts().await is fetched unconditionally at function entry, but when needed_replica_ids == 0 (scale-down, no-op, and importantly every automated scheduling turn-off) nothing is allocated, so that oracle round-trip is wasted on a recurring path. The comment explains it is hoisted for borrow-checker reasons against the cluster borrow. Consider extracting the cluster data you need and dropping the borrow, then fetching id_ts only on the needed_replica_ids > 0 path, so turn-off and no-op alters do not hit the oracle. Minor, but it is on the scheduling hot path.

Tests

The durable-layer regression test is good and targets the right invariant (explicit insert consumes zero allocator ids, allocator advances independently). The coordinator paths are covered by the cluster mzcompose suite. One gap worth a sentence in the description: there is no end-to-end test reproducing the original DuplicateReplica-on-bootstrap symptom. That is acceptable for pre-work, and the colliding pre-allocation it guards against lands with the scoped-flags stack, so an integration regression will likely come naturally there.

Nits

The three near-identical "Pre-allocate replica ids out-of-band..." comments are a touch repetitive across the file. Could collapse to one canonical explanation referenced from the others, but fine as is.
Vec::<ReplicaId>::new().into_iter() reads slightly oddly as the empty-iterator branch. Equivalent to vec![].into_iter(), keep whichever you prefer.

aljoscha

I think the change is good!

antiguru commented Jun 22, 2026

View reviewed changes

antiguru marked this pull request as ready for review June 22, 2026 14:44

antiguru requested a review from a team as a code owner June 22, 2026 14:44

antiguru requested a review from aljoscha June 22, 2026 14:44

aljoscha reviewed Jun 22, 2026

View reviewed changes

aljoscha self-requested a review June 22, 2026 16:01

aljoscha approved these changes Jun 22, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

catalog: allocate cluster replica ids durably, single-source#37178

catalog: allocate cluster replica ids durably, single-source#37178
antiguru wants to merge 2 commits into
MaterializeInc:mainfrom
antiguru:moritz/db-147-durable-replica-id

antiguru commented Jun 22, 2026

Uh oh!

antiguru left a comment

Uh oh!

aljoscha left a comment

Uh oh!

aljoscha left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

antiguru commented Jun 22, 2026

Motivation

Description

Verification

Uh oh!

antiguru left a comment

Choose a reason for hiding this comment

Correctness — holds up

Things to weigh

Gap I'd push on

Uh oh!

aljoscha left a comment

Choose a reason for hiding this comment

Correctness (looks good)

Operational note (the main thing to be aware of)

Strong suggestion: fetch id_ts lazily in the alter path

Tests

Nits

Uh oh!

aljoscha left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Strong suggestion: fetch `id_ts` lazily in the alter path