Scoped feature flags: resolution and gating by antiguru · Pull Request #36959 · MaterializeInc/materialize

antiguru · 2026-06-10T14:18:23Z

Motivation

The resolution layer for scoped feature flags. The foundation (#37079), the
durable persistence and introspection (#37080), and the catalog-implications
foundation (#37151, which derives the replica-scoped controller push from the
committed catalog diff) have all merged. Create-time render-frozen correctness
is completed in the follow-up #37158. Design:
doc/developer/design/20260609_scoped_feature_flags.md (#36947).

Behind the enable_scoped_system_parameters dyncfg (off by default), this PR
consumes the durable scoped overrides and resolves them at the per-scope
boundaries.

Description

The sync loop evaluates the cluster (replica-free) and replica (size, size
family, owning cluster) LaunchDarkly contexts and records an override only when
the scoped value differs from the environment-wide value, compared in
canonical encoding so bool spellings (on/off vs true/false) match. LD
multi-context evaluation cannot reveal which context produced a value, so the
differs-from-env test, not the variation_detail reason, is the recording
rule. The evaluation is skipped while the gate is off, and the gate is read
from the sync working copy so a disabled environment takes no catalog snapshot
per tick.
Resolution boundaries: cluster-coherent overrides feed plan-time
OptimizerFeatureOverrides (with a lenient bool decode so on/off cannot
panic the optimizer). Replica-local overrides reach the compute controller
through the catalog implication from adapter: derive replica-scoped override push from catalog implications #37151.
Create-time: a post-transact resolve folds a freshly created object's overrides
into the working copy so it need not wait for the next sync tick. The follow-up
Scoped feature flags: create-time evaluation #37158 moves this into the create transaction so render-frozen replica flags are
set before create_replica.

Verification

Canonical bool bridging, lenient optimizer decode, and frontend context unit
tests, plus the test/launchdarkly mzcompose workflow extended with
cluster/replica/durability/gating/create-time cases, validated end-to-end
against a live LaunchDarkly project.

🤖 Generated with Claude Code

https://claude.ai/code/session_01FUorxx5m6B8TT2VAek2STo

def-

Per-cluster LD feature flags in combination with a cluster-targeted optimizer flag crash:
Running cargo test -p mz-repr --lib bool_override_decodes_lenient_spellings with this diff:

diff --git a/src/repr/src/optimize.rs b/src/repr/src/optimize.rs
index 75d4a56c71..d0d3fcef05 100644
--- a/src/repr/src/optimize.rs
+++ b/src/repr/src/optimize.rs
@@ -189,3 +189,21 @@ macro_rules! impl_optimizer_feature_type {
 // Implement `OptimizerFeatureType` for all types used in the
 // `optimizer_feature_flags!(...)`  call above.
 impl_optimizer_feature_type![bool, usize];
+
+#[cfg(test)]
+mod tests {
+    use std::collections::BTreeMap;
+
+    use super::OptimizerFeatureOverrides;
+
+    #[mz_ore::test]
+    fn bool_override_decodes_lenient_spellings() {
+        for (stored, want) in [("true", true), ("false", false), ("on", true), ("off", false)] {
+            let map = BTreeMap::from([("enable_eager_delta_joins".to_string(), stored.to_string())]);
+            assert_eq!(
+                OptimizerFeatureOverrides::from(map).enable_eager_delta_joins,
+                Some(want),
+            );
+        }
+    }
+}

panics:

thread 'optimize::tests::bool_override_decodes_lenient_spellings' (4146015) panicked at src/repr/src/optimize.rs:191:1:
called `Result::unwrap()` on an `Err` value: ParseBoolError
stack backtrace:
   0: __rustc::rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::result::unwrap_failed
   3: <bool as mz_repr::optimize::OptimizerFeatureType>::decode
   4: <mz_repr::optimize::OptimizerFeatureOverrides as core::convert::From<alloc::collections::btree::map::BTreeMap<alloc::string::String, alloc::string::String>>>::from
   5: mz_repr::optimize::tests::bool_override_decodes_lenient_spellings::test_impl
   6: mz_repr::optimize::tests::bool_override_decodes_lenient_spellings
   7: mz_repr::optimize::tests::bool_override_decodes_lenient_spellings::{closure#0}
   8: <mz_repr::optimize::tests::bool_override_decodes_lenient_spellings::{closure#0} as core::ops::function::FnOnce<()>>::call_once

antiguru · 2026-06-15T17:26:31Z

Good catch — fixed in 8ea690964b (folded into the catalog/adapter commit). <bool as OptimizerFeatureType>::decode now accepts the canonical on/off spellings in addition to true/false, so a cluster-coherent override served from LaunchDarkly as on/off no longer panics the optimizer at plan time. Added your bool_override_decodes_lenient_spellings test (passes).

DAlperin

LGTM, modulo one fix

DAlperin · 2026-06-16T19:04:06Z

+            let differs = match params.canonicalize(param_name, &value) {
+                Some(canonical) => canonical != base,
+                // Unparseable for this var: fall back to a raw comparison.
+                None => value != base,


There is a risk here. If LD gives us a badly typed response for a bool value (lets say maybe) canconicalize will return false maybe != off so we store maybe but then bool::decode will panic for every query plan I think. Probably better to just do None here.

…plumbing (#37079) ### Motivation First of a three-PR stack splitting #36959 (scoped feature flags) for review: **1/3 (this) → 2/3 #37080 → 3/3 #36959**. Design: doc/developer/design/20260609_scoped_feature_flags.md (#36947). This PR is the additive foundation: it introduces the vocabulary the later PRs build on, with no behavior change on its own. ### Description * `ParameterScope` (`Environment` / `Cluster` / `Replica`), declared on system vars and dyncfg `Config`s and carried through to the synced system vars. The declaration is the single source of truth for which contexts get evaluated. * The size-family taxonomy: `ReplicaAllocation::family` plus a size-map `family` field, with a `cc` / `legacy` fallback for sizes that don't set one. * The compute controller's per-replica dyncfg override layer (`update_replica_dyncfg_overrides` + per-replica command specialization), inert until the adapter wires it in 3/3. Nothing consumes the scope or the override layer yet, so this is a no-op. ### Verification `test_replica_allocation_family` covers the size→family fallback; the rest is exercised by the later PRs in the stack. Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

aljoscha

Reviewed the top commit only (d77eb1bc, the resolution layer), per the stack note.

Overall this is a solid, cohesive change and the correctness reasoning holds up well: the differs-from-env recording rule, the canonical bool bridging so on/off vs true/false don't register as spurious overrides, the lenient optimizer decode guarded by canonicalize so a stored value can't panic the optimizer, the create-time resolution ordered before render-freeze, and the revert-via-full-compute_config-re-push are all sound. The end-to-end LD test covers the interesting paths (differs-from-env guard, recreate-by-name, bidirectional reconcile, gating, durability, create-time). No blocking correctness issues found.

A few things to clean up, mostly around our writing/comment conventions:

Strong suggestions

Em-dashes and semicolons in comments. Our style is to avoid em-dashes and sentence-structuring semicolons everywhere, including code comments. This commit adds ~22 lines with em-dashes and a similar number of structuring semicolons (heaviest in coord.rs and config/frontend.rs). Please do a pass replacing them with full stops and commas. Inline suggestions below mark representative spots.
Doc comments that narrate / duplicate the body. evaluate_scoped_overrides is the clearest case: the differs-from-env-vs-variation_detail rationale is spelled out almost verbatim in both the doc comment and the inline comment at the decision point. Our convention is for the doc comment to state the contract and for the reasoning to live inline at the decision point. Trim the doc comment (suggestion inline).
scoped_ld_ctx is a dead public wrapper. It only forwards to the private ld_ctx, and the real callers (pull_cluster_overrides / pull_replica_overrides) already call ld_ctx directly. Its only users are the unit tests, which live in the same module and can call ld_ctx too. Suggest dropping both the wrapper and its pub use re-export.

Nits

push_replica_dyncfg_overrides: the "compute push reaches storage via the shared persist ConfigSet" subtlety is genuinely worth recording, so keep it, but the doc comment is long and could be tightened (and is part of the em-dash pass).
Consider a unit-level test for the evaluate_scoped_overrides differs-from-env decision itself. The canonicalize and decode units are good, but the core recording decision is currently only exercised by the live-LD mzcompose workflow. Not blocking given the LD client dependency.

aljoscha · 2026-06-18T09:40:53Z

+    /// Evaluates each of `param_names` against `ctx`, returning only the values
+    /// that *differ* from the environment-wide value held in `params`. Shared by
+    /// the cluster and replica passes; the returned map is therefore sparse.
+    ///
+    /// The difference — not the `variation_detail` reason — is the signal that
+    /// the scope context produced a genuine opinion. The reason cannot say which
+    /// context kind's clause matched (an env-level rollout rule and a
+    /// cluster-specific rule both report `RULE_MATCH`), and `FALLTHROUGH` serves
+    /// the env-wide value to every object; either would record a row for every
+    /// object (dense) and let a coarse env-wide value override a finer manual
+    /// `CREATE CLUSTER ... FEATURES` pin. Comparing against the env-wide baseline
+    /// captures exactly "this scope context changed the answer". A silent LD
+    /// (flag absent / off / error / failed prerequisite) resolves to the `base`
+    /// default and is dropped. See the scoped feature flags design, §Resolution.


The differs-from-env rationale here is duplicated almost verbatim by the inline // Record iff ... comment at the recording decision below. Per our convention the doc comment states the contract and the reasoning lives inline at the decision point, so this paragraph can go. (It also carries several em-dashes/semicolons.) Suggested trim:

Suggested change

/// Evaluates each of `param_names` against `ctx`, returning only the values

/// that *differ* from the environment-wide value held in `params`. Shared by

/// the cluster and replica passes; the returned map is therefore sparse.

///

/// The difference — not the `variation_detail` reason — is the signal that

/// the scope context produced a genuine opinion. The reason cannot say which

/// context kind's clause matched (an env-level rollout rule and a

/// cluster-specific rule both report `RULE_MATCH`), and `FALLTHROUGH` serves

/// the env-wide value to every object; either would record a row for every

/// object (dense) and let a coarse env-wide value override a finer manual

/// `CREATE CLUSTER ... FEATURES` pin. Comparing against the env-wide baseline

/// captures exactly "this scope context changed the answer". A silent LD

/// (flag absent / off / error / failed prerequisite) resolves to the `base`

/// default and is dropped. See the scoped feature flags design, §Resolution.

/// Evaluates each of `param_names` against `ctx`, returning only the values

/// that differ from the environment-wide value held in `params`. Shared by

/// the cluster and replica passes, so the returned map is sparse.

///

/// We record on the differs-from-env test, not the `variation_detail`

/// reason. The inline comment at the recording decision explains why.

Applied your suggested trim — the doc now states the contract and points at the inline comment for the differs-from-env reasoning.

Confirmed. The doc now states the contract and defers to the inline comment for the reasoning. Thanks.

aljoscha · 2026-06-18T09:40:53Z

+///
+/// The environment-wide pass passes `None` for both. This is the single entry
+/// point the sync loop uses to evaluate each scoped pass.
+pub fn scoped_ld_ctx(


scoped_ld_ctx only forwards to the private ld_ctx, and the real callers (pull_cluster_overrides / pull_replica_overrides) already call ld_ctx directly. Its only users are the unit tests in this module, which can call ld_ctx directly too. Suggest dropping this wrapper and the pub use frontend::{... scoped_ld_ctx} re-export in config.rs.

Dropped the scoped_ld_ctx wrapper and its config.rs re-export; the unit tests call the private ld_ctx directly (same module), and its doc moved onto ld_ctx. Real callers were already on ld_ctx.

Confirmed. The wrapper and the re-export are gone, the tests call ld_ctx directly, and the stray [scoped_ld_ctx] doc reference was updated too. Thanks.

aljoscha · 2026-06-18T09:40:53Z

+    /// Resolves the replica-local scoped overrides from the catalog working copy
+    /// into the compute controller's per-replica dyncfg layer, then re-pushes
+    /// the environment-wide compute configuration so existing replicas observe
+    /// the new values. Called after reconcile and on bootstrap.
+    ///
+    /// Only the *compute* controller's per-replica dyncfg layer is pushed here,
+    /// but on `clusterd` that also reaches storage for the configs that matter.
+    /// Compute and storage share one process, and the compute worker's
+    /// `handle_update_configuration` applies the pushed dyncfg updates both to
+    /// compute's own worker `ConfigSet` *and* to the shared persist client
+    /// `ConfigSet` (`persist_clients.cfg()`), which the co-located storage server
+    /// reads from the same `Arc`. So persist-backed and process-global
+    /// replica-local configs — the design's examples: the persist pager, LZ4,
+    /// persist client tuning, `lgalloc` — take effect on storage as well, even
+    /// though the push goes through the compute controller. The only thing this
+    /// would miss is a future `Replica`-scoped config realized *solely* in the
+    /// storage worker's own `ConfigSet` (applied only via the storage
+    /// controller's `UpdateConfiguration`); none exists today.


nit: the "compute push reaches storage via the shared persist ConfigSet" subtlety is worth recording, so keep it, but this doc comment is long and stacks several em-dashes/semicolons. Consider tightening it and folding the example list into prose as part of the style pass.

Tightened: folded the example list into prose, cut the em-dashes/semicolons, kept the shared-persist-ConfigSet storage-reach subtlety.

Confirmed. Reads well now and keeps the shared-persist-ConfigSet storage-reach subtlety. Thanks.

aljoscha · 2026-06-18T11:26:33Z

Re-reviewed the responses. The three inline fixes all look good, including the follow-on cleanup of the stray scoped_ld_ctx doc reference.

One item from the review summary is still open: the broader em-dash and structuring-semicolon pass across the new comments. The coord.rs doc comment I flagged inline is clean now, but the rest of the new comments still carry them. By a quick count there are about 19 added lines with em-dashes and about 17 comment lines with structuring semicolons. Examples: the // Record iff ... block in config/frontend.rs, the scoped_frontend field doc and the resolve_scoped_for_new_objects / push_new_replica_scoped_override docs in coord.rs, and the ReplicaScopeContext::size_family and canonicalize docs. It is a mechanical pass, flagging only because it is a project-wide convention. Happy to list the exact lines if that helps.

antiguru · 2026-06-18T11:45:02Z

Done the em-dash and structuring-semicolon pass over the new comments. Removed the em-dashes (reworded into separate sentences or commas) and the structuring semicolons (split into sentences) across config/frontend.rs, config/params.rs, config/sync.rs, command.rs, coord.rs, and test/launchdarkly/mzcompose.py — including the // Record iff ... block, the scoped_frontend field doc, the resolve_scoped_for_new_objects / push_new_replica_scoped_override docs, and the ReplicaScopeContext::size_family / canonicalize docs you called out. The remaining em-dashes in client.rs / coord/introspection.rs / frontend_peek.rs are pre-existing lines this PR does not touch, so I left them.

On the non-blocking suggestion: I agree a direct unit test of the differs-from-env decision would be nice, but evaluate_scoped_overrides takes a live ld::Client with no seam to inject canned variations, so it would need a fake-client shim that does not exist yet. The decision's two halves are unit-covered (canonicalize bridging and the lenient decode), and the end-to-end behavior is exercised by the test/launchdarkly workflow. Happy to add the shim as a follow-up if you think it's worth the surface.

aljoscha · 2026-06-18T11:45:45Z

Did a focused performance pass on the hot paths and continual triggers. Summary: the read/plan path is fine, the costs are all on the write side and only when the feature is enabled.

Plan / peek path: no concern. cluster_scoped_optimizer_overrides() is added to every optimizer-config build (peek, MV, index, subscribe, introspection, fast-path peek). With no override for the target cluster (always the case while gated off) it is one miss on an empty BTreeMap plus a default 19-field OptimizerFeatureOverrides and 19 failed override_from branches. No allocation, no locking. Good.

push_new_replica_scoped_override evaluates and discards all cluster overrides (every CREATE ... REPLICA, when on). It calls evaluate_scoped_parameters(.., None, Some(&replicas)) with cluster_filter = None, so it runs an LD pass over every cluster's cluster-coherent flags and then uses only evaluated.replica. The cluster-create path passes an empty set for the mirror case. Suggest passing Some(&BTreeSet::new()) as the cluster filter here, so a replica creation does not run and throw away an O(#clusters x #cluster-params) LD evaluation.

push_new_replica_scoped_override re-pushes the full env-wide compute config unconditionally (every CREATE ... REPLICA, when on). push_replica_dyncfg_overrides_from always calls update_replica_dyncfg_overrides then update_configuration(compute_config). update_configuration clones the full dyncfg ComputeParameters per instance and broadcasts to every replica (controller.rs:603), the same cost as an env-wide ALTER SYSTEM push. It fires even when the new replica has no override (the common case), and it runs synchronously on the coordinator thread. Creating a managed cluster with N replicas triggers N such re-pushes. Suggest an early-out when the new replica contributes nothing, e.g. if evaluated.replica.is_empty() { return; }.

That early-out also removes the common case of a subtle transient: replica create-time overrides are not written back to the catalog working copy (in-memory/controller only, by design), so merged is rebuilt from the stale working copy. If replica A was given a create-time override and replica B is then created before the next reconcile, the push for B rebuilds from the working copy that does not contain A, momentarily clearing A's override at the controller until the next tick. Bounded by the 1s reconcile window and harmless for render-frozen flags, but worth being aware of for replica-local configs that take effect on update.

Continual trigger: per-tick O(fleet) evaluation when enabled. With the feature on, sync_scoped_params runs every tick (1s in prod) and rebuilds eval contexts for every cluster and replica (several String allocations each) and runs an LD variation per object per scoped param, regardless of whether anything changed. This is on the dedicated sync-loop task so it does not block the coordinator, but it is continual O(#clusters + #replicas) CPU and allocation churn every second. Fine for small fleets, worth keeping in mind for large ones. The coordinator-side per-tick cost is just the CatalogSnapshot Arc-clone plus the UpdateScopedSystemParameters equality early-out, both negligible.

Minor: SynchronizedParameters::new(catalog.system_config().clone()) on the create-cluster and create-replica paths deep-clones the whole SystemVars (~450 vars) and builds a ~450-element BTreeSet. Only on those DDL paths and only when enabled, so acceptable, just noting it is not a cheap call.

Net: nothing on the read path, and the write-path costs are gated off by default. The two worth acting on are the discarded cluster evaluation and the unconditional full compute re-push on replica creation.

antiguru · 2026-06-18T11:51:21Z

Thanks for the perf pass. Acted on the two write-path items in push_new_replica_scoped_override:

Discarded cluster evaluation: now passes Some(&BTreeSet::new()) as the cluster filter, so a CREATE ... REPLICA runs only the replica-local pass and no longer evaluates+throws away the O(#clusters x #cluster-params) cluster-coherent LD pass.
Unconditional full compute re-push: added an early-out — if evaluated.replica.is_empty() { return; } — before the merge and push. The new replica having no override is the common case, so we skip the per-instance update_configuration broadcast entirely. As you noted, this also removes the common-case transient where rebuilding the pushed map from the working copy (which does not carry other replicas not-yet-reconciled create-time overrides) momentarily clears them at the controller. The rare has-override case keeps the current behavior, with that transient bounded by the 1s reconcile window.

On the rest, agreed and no change:

Plan/peek path: as you found, gated-off is an empty-map miss + a default-overrides no-op. Nothing to do.
Per-tick O(fleet) evaluation when enabled: real, but on the dedicated sync-loop task and only when the feature is on. Noted for large fleets; if it bites we can diff-and-skip unchanged objects or widen the tick.
SynchronizedParameters::new(... .clone()) on the DDL paths: only on create-cluster/create-replica and only when enabled. Acceptable, noted.

aljoscha · 2026-06-18T12:26:31Z

Follow-up on the per-tick note, on whether this is new work or just more of what the sync loop already did. It is net-new, and it runs unconditionally, so worth calling out explicitly.

Before this feature, the sync loop did two things per tick: backend.pull to fetch current values from the coordinator, and frontend.pull to evaluate each synced param against the single env-wide LD context. That env-wide pass is O(#synced params) against one context, and it only does the downstream backend.push when a value actually changed. No catalog snapshot, no per-object work.

This PR adds, on every tick:

an unconditional catalog_snapshot().await, and
when enabled, evaluate_scoped_parameters which builds eval contexts for every cluster and replica (O(#clusters + #replicas), several String allocations each) and runs an LD variation per object per scoped param (O(#objects x #scoped-params)), followed by an update_scoped_system_parameters().await.

So this is not the env-wide pass getting a bit bigger. The object factor is entirely additional. And unlike the env-wide pass, which gates its expensive work on "did anything change", the scoped pass runs the full fleet evaluation every tick regardless. The coordinator early-outs on the equality check in reconcile_scoped_system_parameters, but the sync-task-side evaluation does not, so we redo the whole O(#objects x #scoped-params) LD pass and the per-object context allocations once per second whether or not anything changed.

It is on the dedicated sync-loop task so it does not block the coordinator, and it is gated off by default. But it is continual extra work proportional to fleet size, so worth a deliberate look before this goes on by default, especially for large environments. A cheap guard would be to skip the pass when neither the set of live objects nor the relevant LD state has changed since the last tick, though detecting the latter is the hard part.

antiguru · 2026-06-18T13:20:25Z

Agreed, the object factor is net-new and the scoped pass runs every tick regardless of change. Two parts here:

The off-path should not regress, since a patch release to undo it would be painful. Fixed in 210c04f: the scoped reconcile now reads the gate from the sync working copy (params) instead of taking a catalog_snapshot().await every tick, so a disabled environment (the default) does no per-tick coordinator round-trip. A small maybe_dirty flag still does the one-time clear of any leftover overrides on startup and on an enable->disable transition, then goes quiet.

The on-path full-fleet evaluation is the part worth optimizing, and since this will be on by default we will want it. I would rather do that in a follow-up than widen this PR: a cheap object-set/env-change guard is unsafe on its own (a scoped LD rule can change with neither the live-object set nor the env-wide context changing, which would silently freeze scoped overrides), so the real fix is either a dedicated slower scoped tick or hooking LD streaming changes. Tracking that separately.

#37151) ### Motivation Scoped feature flags need the per-replica controller push to be *derived from the committed catalog diff* rather than issued as a side effect in the sequencer. Deriving it removes the working-copy/controller divergence that lets a create-time replica override be wiped by a reconcile in the same DDL, and it makes the push fire for a follower `environmentd` that only replays the catalog diff. This is the pattern the catalog-implications framework (#29673, `TODO(aljoscha)`) exists to provide, which until now ignored config-type changes. ### Description `parse_state_update` no longer drops `ReplicaSystemConfiguration` updates; it emits a `ParsedStateUpdateKind::ReplicaSystemConfiguration`. When such an update is present in a transaction, `apply_catalog_implications` rebuilds the complete per-replica compute dyncfg layer from the catalog working copy and pushes it after clusters are created and before replicas are created, so a new replica's first configuration replays with its override. The same push runs once on bootstrap from the restored working copy. Because the push reads the working copy that the same transaction's catalog apply has already updated, the diff drives the push. The push mechanism (`push_replica_dyncfg_overrides`) lives here. Cluster-scoped overrides are read at plan time and have no controller effect, so they remain unparsed. This is the foundation under #36959, which adds the writers that drive the implication; the implication is dormant until that PR lands on top. ### Verification Adds a unit test asserting a `ReplicaSystemConfiguration` update is parsed rather than dropped. End-to-end controller behavior is covered by the LaunchDarkly mzcompose test in #36959. 🤖 Generated with [Claude Code](https://claude.com/claude-code) https://claude.ai/code/session_01FUorxx5m6B8TT2VAek2STo Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

aljoscha

Reviewed the resolution and gating layer locally against main. Overall the design is solid and the comments are unusually good. One blocking correctness item from the new prune_scope rework, plus a couple of suggestions. Main item is inline on catalog/transact.rs, the rest is below.

Blocking

See the inline comment on src/adapter/src/catalog/transact.rs: dropped clusters and replicas leak their scoped-config rows, and as a knock-on the periodic sync stops converging. Details and a suggested fix are there.

Strong suggestions

No Rust unit test covers the UpdateScopedSystemParameters diff and prune semantics. The only coverage is the test/launchdarkly mzcompose workflow, which needs live credentials and is not in standard PR CI. A catalog-level unit test exercising upsert, remove, prune_scope bounding, and the dropped-object case would have caught the blocking item and guards this boundary going forward. The other new logic (canonicalize, lenient decode, context building) is unit tested, which is great.
The cluster-scoped optimizer override is wired in ad hoc at around a dozen plan-path call sites via .override_from(&self.cluster_scoped_optimizer_overrides(cluster_id)) (peek, index, MV, subscribe, introspection, bootstrap, frontend_peek). It looks complete today, but a future plan path that builds a per-cluster OptimizerConfig will silently miss the scoped layer. Consider a single helper that assembles the per-cluster features (system config plus config.features() plus scoped overrides) so the precedence lives in one place. Partly pre-existing, since config.features() is already repeated, but this PR roughly doubles the override points.

Nits

bool::decode in src/repr/src/optimize.rs still panics on truly invalid input, even though the doc says decoding must not crash the optimizer. It is effectively unreachable, since canonicalize gates what gets stored and the FEATURES clause is bool typed, so it is a defensive assertion rather than a real hazard. The wording just oversells the guarantee.
Replica create-time resolution is not done here. Only the cluster paths call resolve_scoped_for_new_objects, always with an empty replica set, so a freshly created replica's render-frozen flags use env-wide values until the next sync tick. The description documents this as deferred to #37158. Fine given the gate, just calling it out.

What is good

The differs-from-env recording rule with canonical bool bridging is the right call and is well explained and tested.
The gating works as intended: the loop reads the gate from params, and backend.pull seeds params from the catalog system vars every tick, so ALTER SYSTEM SET and additional_system_parameter_defaults are observed even though the gate is not an LD flag.
The prune_scope protection for concurrently created objects is sound, and the create-time merge correctly preserves existing overrides.
Create-time resolution does a local in-memory LD evaluation only, with no network round-trip on the coordinator loop, so it is cheap.

aljoscha · 2026-06-22T08:43:11Z

+                    if prune_cluster(&cluster_id)
+                        && !desired_cluster.contains(&(cluster_id, name.clone()))
+                    {
                        tx.remove_cluster_system_config(cluster_id, &name);


Blocking: this prune_scope change drops the lazy GC of orphaned rows for dropped objects, with a few knock-on effects.

The sync loop builds prune_scope from catalog.clusters(), so it only ever contains live objects. A dropped cluster or replica is never in prune_scope, so prune_cluster / prune_replica returns false for it and its rows are never removed. Nothing else GCs these rows either: remove_clusters / remove_cluster_replicas cascade only to cluster_replicas and introspection_sources, and the only callers of remove_*_system_config are right here. The previous code pruned any row not in the desired state, which lazily covered dropped objects (the old comment said exactly that).

Consequences:

Durable and introspection leak. mz_cluster_system_parameters / mz_replica_system_parameters are unfiltered projections of these collections, so orphan rows with a dangling cluster_id / replica_id accumulate until the feature is toggled off (the only full wipe).

The per-tick early return is defeated. reconcile_scoped_system_parameters skips the durable write when working_copy == scoped, but the sync loop's scoped is live-only while the working copy retains orphans, so they can never be equal once an orphan exists. The coordinator then runs a catalog_transact every tick that can never converge.

The recreate-then-clear case in test/launchdarkly/mzcompose.py should fail: after DROP plus CREATE ld_role, clearing the rules and asserting count(*) ... = 0 will see the orphan row from the dropped incarnation. I could not run it locally since it needs a live LD project.

Suggested fix, local to this handler: also prune rows whose owning object is not live. live_clusters / live_replicas are already computed just above, so something like (!live_clusters.contains(&cluster_id) || prune_cluster(&cluster_id)) && !desired_cluster.contains(...) (and the replica analogue). That restores dropped-object GC while keeping the concurrent-create protection that prune_scope was added for. Orphan pruning is always safe since ids are never reused.

aljoscha · 2026-06-22T08:43:11Z

+            c.testdrive(
+                "\n".join(
+                    [
+                        f"> SELECT count(*) FROM mz_internal.mz_cluster_system_parameters WHERE name = '{OPTIMIZER_PARAM}'",


This count(*) = 0 assertion is where the orphan-GC regression surfaces. After the DROP plus CREATE of ld_role above, the dropped incarnation's row is not pruned by the reconcile, since its id is no longer in prune_scope, so this count should be 1, not 0. See the comment on src/adapter/src/catalog/transact.rs. Worth re-running against the live LD project once the prune fix is in.

Replica ids were allocated in-apply by insert_cluster_replica, reading the IdAlloc counter from the snapshot the coordinator opened the transaction with. Cluster and item ids are instead allocated out-of-band by the durable allocator before the transaction, and the op carries the explicit id. That asymmetry breaks any caller that pre-allocates a replica id out-of-band: the coordinator does not apply IdAlloc updates to its in-memory catalog, so a later in-apply allocation reuses the id and a DuplicateReplica panic surfaces on the next bootstrap. Make replica ids durable-allocated and single-source like cluster ids. Op::CreateClusterReplica now carries a required replica_id; the apply path always uses insert_cluster_replica_with_id and never allocates. The runtime construction sites pre-allocate via the durable allocator, choosing user or system ids from the owning cluster. The catalog-open builtin-replica migration keeps a transaction-level system allocation, which is single-source inside the open transaction. The in-apply insert_cluster_replica and the unused transaction-level user allocator are removed. DB-147 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01FUorxx5m6B8TT2VAek2STo

…tion The resolution layer that consumes the durable scoped overrides, behind the `enable_scoped_system_parameters` dyncfg (off by default). * The sync loop evaluates the `cluster` and `replica` LaunchDarkly contexts and records an override only when the scoped value differs from the environment- wide value, comparing in canonical encoding so bool spellings match. The whole evaluation is skipped while the gate is off, and the working copy cleared. * Resolution boundaries: cluster-coherent overrides feed plan-time `OptimizerFeatureOverrides` (with a lenient bool decode so `on`/`off` cannot panic the optimizer); replica-local overrides feed the compute controller's per-replica dyncfg push. * Create-time evaluation resolves a freshly created cluster or replica through a shared `SystemParameterFrontend` so new objects get their overrides without waiting for the next sync tick. * Tests: canonical bool bridging, lenient optimizer decode, frontend context construction, the LaunchDarkly end-to-end cases, and workload-harness registration of the new gate. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

aljoscha · 2026-06-22T11:31:07Z

Thanks for the quick turnaround. The rebased revision (926f501) resolves the blocking item: the prune loop now reclaims orphaned rows for dropped objects (!live_clusters.contains(..) || prune_cluster(..)), and the new test_transact_update_scoped_system_parameters_prune covers upsert, bounded prune, stale in-scope prune, and the orphan case. That also clears my note on the recreate-then-clear assertion in the launchdarkly test. (My earlier inline comments are now outdated against this push.)

Remaining items, all non-blocking:

The cluster-scoped optimizer override is still applied ad hoc at roughly a dozen plan-path call sites. Not a problem today, but a future plan path that builds a per-cluster OptimizerConfig can silently miss the scoped layer. A single per-cluster feature-assembly helper would keep the precedence in one place. Fine as a follow-up.
bool::decode in repr/optimize.rs still panics on truly invalid input even though the doc says it must not crash the optimizer. It is effectively unreachable since canonicalize gates what gets stored, so this is just a wording nit.
Replica create-time resolution remains deferred to Scoped feature flags: create-time evaluation #37158, as documented.

LGTM once you are happy with those. Nice work on the orphan-prune test in particular.

antiguru force-pushed the claude/great-brahmagupta-ybi2yf branch 2 times, most recently from e2c7200 to 59a43c8 Compare June 15, 2026 11:08

antiguru marked this pull request as ready for review June 15, 2026 14:22

antiguru requested review from a team as code owners June 15, 2026 14:22

antiguru commented Jun 15, 2026

View reviewed changes

Comment thread src/adapter/src/catalog/apply.rs Outdated

Comment thread src/adapter/src/config/frontend.rs Outdated

antiguru force-pushed the claude/great-brahmagupta-ybi2yf branch from 877ff6a to cc7f88c Compare June 15, 2026 15:26

def- requested changes Jun 15, 2026

View reviewed changes

antiguru force-pushed the claude/great-brahmagupta-ybi2yf branch from cc7f88c to 75f521c Compare June 15, 2026 17:26

This was referenced Jun 16, 2026

Scoped feature flags 1/3: scope declaration and per-replica override plumbing #37079

Merged

Scoped feature flags 2/3: durable persistence and introspection (inert) #37080

Merged

antiguru force-pushed the claude/great-brahmagupta-ybi2yf branch from 75f521c to 43c4d79 Compare June 16, 2026 17:19

antiguru changed the title ~~Scoped feature flags: per-cluster and per-replica LaunchDarkly overrides~~ Scoped feature flags 3/3: resolution, gating, and create-time evaluation Jun 16, 2026

DAlperin approved these changes Jun 16, 2026

View reviewed changes

antiguru force-pushed the claude/great-brahmagupta-ybi2yf branch from 43c4d79 to 7296095 Compare June 17, 2026 09:59

antiguru requested a review from def- June 17, 2026 11:25

antiguru force-pushed the claude/great-brahmagupta-ybi2yf branch 2 times, most recently from ea2b4e1 to 44cd10a Compare June 17, 2026 12:00

antiguru force-pushed the claude/great-brahmagupta-ybi2yf branch 4 times, most recently from 7561342 to d77eb1b Compare June 18, 2026 09:22

aljoscha reviewed Jun 18, 2026

View reviewed changes

antiguru force-pushed the claude/great-brahmagupta-ybi2yf branch from d77eb1b to c34249e Compare June 18, 2026 09:56

antiguru force-pushed the claude/great-brahmagupta-ybi2yf branch from c34249e to 6f6a849 Compare June 18, 2026 11:44

antiguru force-pushed the claude/great-brahmagupta-ybi2yf branch from 6f6a849 to fc13ba4 Compare June 18, 2026 11:51

antiguru force-pushed the claude/great-brahmagupta-ybi2yf branch from fc13ba4 to 210c04f Compare June 18, 2026 13:20

antiguru force-pushed the claude/great-brahmagupta-ybi2yf branch from 210c04f to d3df8f0 Compare June 19, 2026 09:17

antiguru mentioned this pull request Jun 19, 2026

adapter: derive replica-scoped override push from catalog implications #37151

Merged

antiguru marked this pull request as draft June 19, 2026 09:53

antiguru force-pushed the claude/great-brahmagupta-ybi2yf branch 2 times, most recently from edb3bb6 to d75b222 Compare June 19, 2026 12:51

antiguru mentioned this pull request Jun 19, 2026

Scoped feature flags: create-time evaluation #37158

Open

antiguru changed the title ~~Scoped feature flags 3/3: resolution, gating, and create-time evaluation~~ Scoped feature flags: resolution and gating Jun 19, 2026

antiguru marked this pull request as ready for review June 19, 2026 12:53

antiguru force-pushed the claude/great-brahmagupta-ybi2yf branch from d75b222 to db631b9 Compare June 19, 2026 13:28

antiguru force-pushed the claude/great-brahmagupta-ybi2yf branch 2 times, most recently from 96a390d to d790e90 Compare June 21, 2026 19:30

aljoscha reviewed Jun 22, 2026

View reviewed changes

antiguru mentioned this pull request Jun 22, 2026

catalog: allocate cluster replica ids durably, single-source #37178

Open

antiguru force-pushed the claude/great-brahmagupta-ybi2yf branch from d790e90 to 926f501 Compare June 22, 2026 10:09

Conversation

antiguru commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Description

Verification

Uh oh!

Uh oh!

Uh oh!

def- left a comment

Choose a reason for hiding this comment

Uh oh!

antiguru commented Jun 15, 2026

Uh oh!

DAlperin left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

aljoscha left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

aljoscha commented Jun 18, 2026

Uh oh!

antiguru commented Jun 18, 2026

Uh oh!

aljoscha commented Jun 18, 2026

Uh oh!

antiguru commented Jun 18, 2026

Uh oh!

aljoscha commented Jun 18, 2026

Uh oh!

antiguru commented Jun 18, 2026

Uh oh!

aljoscha left a comment

Choose a reason for hiding this comment

Blocking

Strong suggestions

Nits

What is good

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

aljoscha commented Jun 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

antiguru commented Jun 10, 2026 •

edited

Loading