doc: Add design for scoped feature flags (per-cluster and per-replica) by antiguru · Pull Request #36947 · MaterializeInc/materialize

antiguru · 2026-06-09T19:07:08Z

Motivation

This design document outlines the approach for implementing scoped feature flags in Materialize, enabling per-cluster and per-replica LaunchDarkly overrides. This addresses two concrete use cases:

Per-cluster optimizer flags: Different clusters (e.g., mz_catalog_server) should be able to run with different optimizer feature sets without affecting the rest of the environment.
Per-replica flags by size family: Different replica size families (legacy t-shirt sizes, D.1, etc.) should support different configurations (e.g., legacy sizes keep lgalloc, while D.1 enables the persist pager and LZ4 compression).

Currently, feature flags are evaluated once per environment with no way to target specific clusters or replicas.

Description

This document introduces a comprehensive design for scoped system parameters that extends the existing LaunchDarkly integration to support two new scope classes:

Cluster-coherent flags: Evaluated replica-free using a cluster context kind, applied at plan time via OptimizerFeatureOverrides. Ensures all replicas of a cluster run the same optimized plans.
Replica-local flags: Evaluated per-replica using a replica context kind (which includes cluster and size family attributes), resolved at the controller's per-replica dyncfg push.

Key design decisions:

Dual context kinds (cluster and replica) to enforce coherence boundaries — cluster-coherent flags cannot vary by replica.
Required scope declaration on every synced parameter (Environment, Cluster, or Replica), enabling documentation, validation, and efficient evaluation.
In-memory reconciliation from LD (not durable DDL) — scoped overrides are cached from continuous LD evaluation, avoiding recreate ambiguity and simplifying fallback behavior.
Dual id/name attributes on contexts to support both role-based predicates (survive recreate) and incarnation pins (die with the object).
Resolution at existing boundaries — replica-local flags resolve at the controller's dyncfg push; cluster-coherent flags feed into OptimizerFeatureOverrides at plan time.
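The scope declaration and the coherence boundary can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: `ParameterScope` and its variants come from the description above, while `ContextKind`, `context_kinds`, and the choice to pair each scoped evaluation with the environment context are assumptions made for the example.

```rust
/// Scope declared on every synced parameter (variant names from the description).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ParameterScope {
    Environment,
    Cluster,
    Replica,
}

/// Context kinds a parameter can be evaluated against (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ContextKind {
    Environment,
    Cluster,
    Replica,
}

impl ParameterScope {
    /// Context kinds evaluated for a parameter of this scope.
    /// Assumption: the env-wide base is always evaluated alongside the scope context.
    pub fn context_kinds(self) -> &'static [ContextKind] {
        match self {
            ParameterScope::Environment => &[ContextKind::Environment],
            // Cluster-coherent flags never see a replica context,
            // so they cannot vary by replica.
            ParameterScope::Cluster => &[ContextKind::Environment, ContextKind::Cluster],
            ParameterScope::Replica => &[ContextKind::Environment, ContextKind::Replica],
        }
    }
}
```

Because the set of contexts is derived from the declaration alone, a cluster-scoped parameter has no code path through which a replica attribute could influence it.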

The design includes a minimal viable prototype roadmap, worked examples, and discussion of alternatives and open questions (deferred operational tuning concerns).

Verification

This is a design document with no code changes to verify. The document is self-contained and ready for review and discussion before implementation begins.

https://claude.ai/code/session_01S9fiehWEbC4BEXEq8p7LP9

…overrides

Two context kinds (cluster-coherent vs replica-local), dual id/name attributes for role-vs-incarnation targeting, in-memory reconciled storage instead of ALTER SYSTEM FOR CLUSTER DDL, and server-side service-connection billing.

Cluster-scoped LD overrides manual CREATE CLUSTER FEATURES, consistent with LD overriding ALTER SYSTEM globally. Ordering: env-wide LD < manual FEATURES < cluster-scoped LD, decided per-feature via variation_detail reason.

…taxonomy

- ParameterScope (Environment/Cluster/Replica) declared at definition; drives evaluation, resolution, validation, and docs.
- Size family taxonomy sourced from ClusterReplicaSizeMap (new per-size field).
- Mark context-list growth and sync cadence as deferrable, non-blocking.

Every scoped evaluation keeps environment/organization/build in the multi-context alongside the scope context so rules can cross axes; do not duplicate env attributes onto cluster/replica contexts.

is_builtin is a clean invariant (System id / s-prefix) so the attribute is readable sugar; replica_size_family is a curated mapping that can't be safely derived via startsWith/endsWith, so the explicit attribute from the size map is required.

Scoped per-cluster/per-replica overrides are now persisted so they survive environmentd restart and LD unavailability (serving last-known values, falling back to env-wide only on a cold cache). Keyed by object id, sole writer is the sync loop, no user DDL; non-reused ids keep stale entries inert so GC is lazy hygiene, not a correctness concern. Matches how global flags already persist.

- Record a scoped row on difference-from-env-wide (not variation_detail reason): restores sparseness and makes the rule uniform across scopes. FALLTHROUGH is the env-wide value and RULE_MATCH can't identify the matching context kind, so the reason-based rule was both dense and incorrect.
- Storage is two flat collections (cluster_/replica_system_configurations), mirroring system_configurations, not one sum-typed key.
- Working copy rides in CatalogState/Arc<Catalog> so cluster overrides apply on fast-path peeks and bootstrap re-optimization, not just sequencing.
- Clarify GC happens on first reconcile after startup, not at startup.
- Note ReplicaAllocation::family() fallback to cc/legacy.
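The difference-from-env-wide rule can be illustrated with a small sketch. The function name `sparse_overrides` and the use of string object ids are assumptions for the example; the point is only that a row is kept when the scoped value differs from the env-wide value, which keeps the stored collection sparse.

```rust
use std::collections::BTreeMap;

/// Returns the rows to store for one parameter: only those scoped objects
/// whose resolved value differs from the env-wide value.
/// Hypothetical helper; object ids are modeled as plain strings.
pub fn sparse_overrides<V: PartialEq + Clone>(
    env_wide: &V,
    scoped: &BTreeMap<String, V>,
) -> BTreeMap<String, V> {
    scoped
        .iter()
        // Compare resolved, typed values rather than evaluation reasons.
        .filter(|(_, value)| *value != env_wide)
        .map(|(id, value)| (id.clone(), value.clone()))
        .collect()
}
```

With an env-wide value of `false` and scoped values `{"u1": true, "u2": false}`, only the `u1` row would be recorded.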

mgree · 2026-06-10T17:37:14Z

+   features per cluster from LD, e.g. enabling a feature on the catalog server
+   (or a specific user cluster) without affecting everyone else.
+
+2. **Per-replica overrides (replica-local flags), keyed by size family.** We are


How does this interact with self-managed, where (as I understand it) customers can create their own replica sizes?

mgree · 2026-06-10T17:49:10Z

+    /// Cluster-coherent: env-wide base + per-cluster overrides. Evaluated with
+    /// the `cluster` context (replica-free) and resolved at plan time via
+    /// `OptimizerFeatureOverrides`. e.g. optimizer features.
+    Cluster,


Is there some way to use this information to automate the OptimizerConfig/OptimizerFeatureOverrides? That would remove a lot of friction from making optimizer feature flags.

I think yes, optimizer feature flags are just cluster-scoped flags. With my proposal, we likely wouldn't need the overrides anymore. That said, I'm not very familiar with the optimizer overrides.

We also need the overrides for EXPLAIN, i.e., EXPLAIN WITH (<flag>) being able to tell us what the plan would look like if a flag were flipped.

…des)

- Cluster-scoped LD feeds OptimizerFeatureOverrides, does not replace it; the overrides mechanism is retained for per-statement EXPLAIN WITH, which is the most specific scope. Extend precedence accordingly.
- Self-managed: family() must always yield a sensible default string for operator-defined size names we don't control; mechanism is LD-driven and degrades to env-wide via the file frontend / no LD.
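Taken together with the earlier ordering (env-wide LD < manual FEATURES < cluster-scoped LD), the extended precedence could look roughly like the sketch below. `FeatureLayers` and `resolve` are hypothetical names for illustration and do not describe how `OptimizerFeatureOverrides` is actually structured.

```rust
/// Values one optimizer feature may take at each layer, least to most specific.
/// Hypothetical type for illustration only.
#[derive(Debug, Default, Clone, Copy)]
pub struct FeatureLayers {
    /// Env-wide LD value; always present.
    pub env_wide_ld: bool,
    /// Manual CREATE CLUSTER FEATURES, if set.
    pub manual_features: Option<bool>,
    /// Cluster-scoped LD override, if one is stored.
    pub cluster_scoped_ld: Option<bool>,
    /// Per-statement EXPLAIN WITH, the most specific scope.
    pub explain_with: Option<bool>,
}

impl FeatureLayers {
    /// The most specific layer that is set wins.
    pub fn resolve(&self) -> bool {
        self.explain_with
            .or(self.cluster_scoped_ld)
            .or(self.manual_features)
            .unwrap_or(self.env_wide_ld)
    }
}
```

Modeling each optional layer as an `Option` keeps "not set" distinct from "set to false", which matters when a more specific layer turns a feature off.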

Document the opt-in kill-switch (default false) checked at sync_scoped_params: off = today's env-wide-only resolution (overrides cleared once); on = cluster/replica evaluation with no deploy. dyncfg rather than feature_flags! since the latter's catalog rehydration contract doesn't apply to a runtime toggle.
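As a rough sketch of the gate's behavior on one sync tick, under the assumption that the loop can see how many scoped rows are currently stored (`ScopedSyncAction` and `scoped_sync_action` are hypothetical names, not identifiers from the PR):

```rust
/// What one sync tick does with scoped overrides (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ScopedSyncAction {
    /// Gate on: evaluate cluster and replica contexts.
    EvaluateScoped,
    /// Gate off with rows still stored: clear them once.
    ClearOverrides,
    /// Gate off and nothing stored: env-wide-only resolution, as today.
    EnvWideOnly,
}

/// Decides the action from the gate value and the number of stored scoped rows.
pub fn scoped_sync_action(gate_enabled: bool, stored_override_count: usize) -> ScopedSyncAction {
    match (gate_enabled, stored_override_count) {
        (true, _) => ScopedSyncAction::EvaluateScoped,
        (false, 0) => ScopedSyncAction::EnvWideOnly,
        (false, _) => ScopedSyncAction::ClearOverrides,
    }
}
```

After a `ClearOverrides` tick the stored count is zero, so later ticks with the gate off take the `EnvWideOnly` branch, which matches the "cleared once" wording.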

- State the plan-time vs live timing asymmetry: replica-local changes apply live; cluster-coherent (optimizer) changes apply only on subsequent (re)planning, not retroactively to installed dataflows.
- Specify differs-from-env-wide compares the resolved/typed value through the identical eval+parse path, not raw serialized forms.
- Make the size-family fallback fail safe: unknown sizes default to a neutral 'other' that matches no curated rule, not 'legacy'.
- Note the newly-created-object eventual-consistency window and its interaction with plan-time consumption.
- Introspection exposes raw stored override (effective-value is follow-up); gate-off clearing is conditional on the sync loop running; promote LD billing to a pre-launch Open Question; add builtin blast-radius guardrail.
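The fail-safe size-family fallback described above might be sketched as follows. The map shape, the function name `size_family`, and the size names in the usage note are assumptions for illustration; the real lookup lives on `ReplicaAllocation::family()` and the size map, whose exact types are not shown here.

```rust
use std::collections::BTreeMap;

/// Neutral family for sizes without a curated entry; intended to match no curated rule.
pub const NEUTRAL_FAMILY: &str = "other";

/// Looks up the curated family for a replica size.
/// Hypothetical model: the size map stores an optional `family` per size name.
pub fn size_family<'a>(size_map: &'a BTreeMap<String, Option<String>>, size: &str) -> &'a str {
    size_map
        .get(size)
        // A known size may still leave its family unset.
        .and_then(|family| family.as_deref())
        // Unknown or unset: fail safe rather than guessing 'legacy'.
        .unwrap_or(NEUTRAL_FAMILY)
}
```

An operator-defined size name that is absent from the map, or present without a `family`, resolves to `"other"`, so a rule written for a curated family cannot match it by accident.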

1. Relocate the enable_scoped_system_parameters feature-gate description to the top of Solution Proposal (kept the MVP step-0 companion bullet).
2. Add a Synchronous create-time resolution subsection to Resolution and revise the Storage lifecycle bullet: new objects resolve scoped overrides inline at creation (render-frozen flags make the tick window a correctness problem), via synchronous local variation at the two scope boundaries; the loop stays the authoritative writer with a sub-tick restart residual for replica-local persistence.
3. Clarify Resolution(a): only the compute controller's per-replica dyncfg layer is pushed, but it reaches storage on clusterd via the shared persist client ConfigSet.

DAlperin · 2026-06-15T15:57:33Z

+Scoped overrides are **persisted**, so they survive an `environmentd` restart and
+stay in effect while LaunchDarkly is slow to sync or unavailable. This is


Do they? I mean, this seems reasonable, but isn't this also the status quo? Right now, if LD is slow to sync, it would fall back to the default in code, and you might lose the environment-level setting from LD.

We store whatever we synced from LD in the catalog, so on restart we should come up with the same flags we had the last time we synced.

…plumbing (#37079)

### Motivation

First of a three-PR stack splitting #36959 (scoped feature flags) for review: **1/3 (this) → 2/3 #37080 → 3/3 #36959**. Design: doc/developer/design/20260609_scoped_feature_flags.md (#36947).

This PR is the additive foundation: it introduces the vocabulary the later PRs build on, with no behavior change on its own.

### Description

* `ParameterScope` (`Environment` / `Cluster` / `Replica`), declared on system vars and dyncfg `Config`s and carried through to the synced system vars. The declaration is the single source of truth for which contexts get evaluated.
* The size-family taxonomy: `ReplicaAllocation::family` plus a size-map `family` field, with a `cc` / `legacy` fallback for sizes that don't set one.
* The compute controller's per-replica dyncfg override layer (`update_replica_dyncfg_overrides` + per-replica command specialization), inert until the adapter wires it in 3/3.

Nothing consumes the scope or the override layer yet, so this is a no-op.

### Verification

`test_replica_allocation_family` covers the size→family fallback; the rest is exercised by the later PRs in the stack.

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

claude added 7 commits June 9, 2026 18:19

design: scoped feature flags for per-cluster and per-replica-size LD …

190a392

…overrides

design: revise scoped feature flags doc

b39e6ae


design: resolve manual FEATURES vs LD precedence

0d0dbe7


design: clarify LD contexts are siblings, not hierarchical

9be02f6


antiguru mentioned this pull request Jun 10, 2026

Scoped feature flags: resolution and gating #36959

Open

antiguru marked this pull request as ready for review June 10, 2026 16:40

antiguru requested review from aljoscha, def- and ggevay June 10, 2026 16:41

mgree reviewed Jun 10, 2026


claude added 5 commits June 11, 2026 14:57

design: flag LD billing confirmation as a pre-launch checkbox

09603fe

DAlperin reviewed Jun 15, 2026


This was referenced Jun 16, 2026

Scoped feature flags 1/3: scope declaration and per-replica override plumbing #37079

Merged

Scoped feature flags 2/3: durable persistence and introspection (inert) #37080

Merged

antiguru wants to merge 13 commits into main from claude/gifted-mccarthy-owiuqv

antiguru commented Jun 9, 2026
mgree Jun 10, 2026
mgree Jun 10, 2026
antiguru Jun 10, 2026
ggevay Jun 11, 2026
DAlperin Jun 15, 2026
antiguru Jun 15, 2026

5 participants
