simple-mitigation

A single Go binary that consumes the per-pod ContentionStream gRPC API (see Mitigation-interface.md), evaluates a CEL policy each tick, and fires one or more of three mitigation tiers:

Tier	Surface	Timescale	Actuator
`isolate`	cgroup v2 `cpu.max` on co-located aggressors	~100 ms	`pkg/actuators/isolate`
`vertical`	`pods/resize` subresource (cpu requests/limits)	~1 s	`pkg/actuators/vertical`
`horizontal`	`apps/v1.Deployment/scale` subresource	~10 s+	`pkg/actuators/horizontal`

The binary runs as a privileged DaemonSet, one instance per node. Each instance subscribes only to victim pods on its own node (field selector spec.nodeName=$NODE_NAME), so node-local mitigations are race-free without leader election. Horizontal scaling is coordinated K8s-natively through an idempotent /scale patch plus a mitigation/horizontal-last-scaled-at cooldown annotation on the Deployment.
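
The cooldown gate on the Deployment annotation can be sketched as follows. This is a minimal illustration, not the controller's code: the RFC 3339 timestamp format and the names `parseStamp` and `mayScale` are assumptions made for the example; only the annotation key comes from this README.

```go
package main

import (
	"fmt"
	"time"
)

// Annotation key from the README. The RFC 3339 value format is an
// assumption made for this sketch.
const lastScaledAtKey = "mitigation/horizontal-last-scaled-at"

// parseStamp parses an annotation value (assumed to be RFC 3339).
func parseStamp(raw string) (time.Time, error) {
	return time.Parse(time.RFC3339, raw)
}

// mayScale reports whether the cooldown recorded on the Deployment has
// elapsed at time now. A missing annotation is treated as "never scaled".
func mayScale(annotations map[string]string, now time.Time, cooldown time.Duration) (bool, error) {
	raw, ok := annotations[lastScaledAtKey]
	if !ok {
		return true, nil
	}
	last, err := parseStamp(raw)
	if err != nil {
		return false, fmt.Errorf("annotation %s=%q: %w", lastScaledAtKey, raw, err)
	}
	return now.Sub(last) >= cooldown, nil
}

func main() {
	ann := map[string]string{lastScaledAtKey: "2025-01-01T12:00:00Z"}
	now, _ := parseStamp("2025-01-01T12:00:40Z")
	ok, err := mayScale(ann, now, 30*time.Second)
	fmt.Println(ok, err) // 40 s elapsed against a 30 s cooldown
}
```

Because every node's instance reads the same annotation from the API server, the gate needs no controller-to-controller channel.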

See plan-v2-centralized.md for the full design.

Architecture

   victim pod (this node)             mitigation-controller (this node, DaemonSet)
   :7900 ──gRPC stream──▶  scoreclient ──▶ features (rolling window per pod)
                                                    ↓
                                            policy (CEL rules)
                                                    ↓
                                ┌──────────────┬──────────────┐
                                ▼              ▼              ▼
                            isolate         vertical      horizontal
                            (cpu.max)    (pods/resize) (deploy/scale)

Repo layout

proto/contention.proto                  vendored wire contract (3 spatial-horizon fields added)
gen/go/contentionpb/                    generated (gitignored) -- run `make proto`
pkg/targets/                            multi-victim config loader
pkg/scoreclient/                        gRPC subscriber w/ reconnect + multi-pod fan-in
pkg/podwatch/                           client-go informer (+ NewLocalNodeWatcher for the DaemonSet)
pkg/features/                           rolling window + spatial/temporal feature computation
pkg/policy/                             CEL env, YAML rule loader, fsnotify hot-reload, engine
pkg/cgroup/                             cgroup v2 path resolution + cpu.max read/write
pkg/actuators/                          shared interface + annotation key constants
pkg/actuators/isolate/                  throttles aggressor pods' cpu.max
pkg/actuators/vertical/                 patches pods/resize for the victim pod
pkg/actuators/horizontal/               patches deployments/scale for the victim Deployment
pkg/aggregator/                         pluggable Max / Mean / P90 (callable from rules)
pkg/thresholder/                        HI/LO + cooldown state machine (also exposed to CEL via `band`)
cmd/mitigation-controller/              the only binary
deploy/controller/                      DaemonSet, RBAC, ConfigMap (targets + policy)
deploy/victim-sample/                   sample search + profile Deployments

Build

Requires Go 1.23 and protoc. On Debian/Ubuntu:

sudo apt install protobuf-compiler
make deps         # installs protoc-gen-go + protoc-gen-go-grpc
make proto        # generates gen/go/contentionpb/*.pb.go
go mod tidy
make build        # equivalent to `go build ./...`
make test         # runs all unit tests

Build the container image:

make docker-controller

The Dockerfile runs make proto inside the build stage, so docker build works from a fresh clone.

Default policy (out of the box)

Three rules ship in deploy/controller/configmap.yaml, matching plan-v2-centralized.md Section 5 verbatim:

rules:
  - name: sharp_rising_spike
    when: "k_temporal > 0.3 || k_spatial > 0.3"
    fire:
      - kind: isolate
        params: { throttle_fraction: 0.5, aggressor_selector: "tier=batch" }
      - kind: vertical
        params: { scale_factor: 1.5 }
    cooldown: "30s"
    priority: 100

  - name: sustained_high_p50
    when: "p50_now > 0.5 && persistence_h >= 3 && duration_above_hi_ms >= 2000"
    fire:
      - kind: horizontal
        params: { delta: 1 }
    cooldown: "60s"
    priority: 50

  - name: clean_state
    when: "p50_now < 0.2 && k_temporal < 0 && tail_now < 0.5"
    fire:
      - kind: restore
        params: { tier: all }
    cooldown: "60s"
    priority: 10

restore is a meta-action: it fans out to every actuator's Restore(), which reads the mitigation/* annotations on the corresponding object and reverses the most recent action.
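
The fan-out behind restore might look like the sketch below. The `Actuator` interface here is a hypothetical stand-in for the shared interface in pkg/actuators (its real method set may differ), and continuing past a failed Restore() is an assumption of this example rather than documented behaviour.

```go
package main

import (
	"errors"
	"fmt"
)

// Actuator is a hypothetical stand-in for the shared interface in
// pkg/actuators; the real method set may differ.
type Actuator interface {
	Kind() string
	Restore() error
}

// restoreAll calls Restore on every actuator, even if an earlier one
// fails, and returns all failures joined into one error (nil if none).
func restoreAll(actuators []Actuator) error {
	var errs []error
	for _, a := range actuators {
		if err := a.Restore(); err != nil {
			errs = append(errs, fmt.Errorf("%s: %w", a.Kind(), err))
		}
	}
	return errors.Join(errs...)
}

// fakeActuator records whether Restore ran and can simulate a failure.
type fakeActuator struct {
	kind     string
	fail     bool
	restored bool
}

func (f *fakeActuator) Kind() string { return f.kind }

func (f *fakeActuator) Restore() error {
	f.restored = true
	if f.fail {
		return errors.New("restore failed")
	}
	return nil
}

func main() {
	acts := []Actuator{
		&fakeActuator{kind: "isolate"},
		&fakeActuator{kind: "vertical", fail: true},
		&fakeActuator{kind: "horizontal"},
	}
	fmt.Println(restoreAll(acts))
}
```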

CEL vocabulary

All feature fields are top-level identifiers (no wrapper object). They correspond to the fields of features.FeatureVector:

Identifier	Type	Meaning
`target`	string	victim service name
`pod`	string	victim pod name
`p50_now`, `tail_now`	double	latest p50_trend_pred / tail_trend_label
`p50_h`, `tail_h`	list(double)	multi-horizon arrays (empty under a single-horizon predictor)
`horizon_ms`	list(int)	parallel array of horizon offsets
`k_spatial`	double	least-squares slope of p50_h vs horizon_ms
`accel_spatial`	double	mean second-difference of p50_h
`p50_max_horizon_ms`	int	argmax horizon
`persistence_h`	int	count of p50_h entries >= HI_THRESHOLD
`k_temporal`	double	least-squares slope of p50 over the rolling window (per second)
`accel_temporal`	double	mean second-difference over the window
`variance`	double	sample variance over the window
`duration_above_hi_ms`	int	length of the most recent contiguous run above HI_THRESHOLD
`window_size`	int	samples currently in the rolling window
`has_spatial`	bool	`true` iff the latest event populated `p50_horizons`
`model_version`	string	latest event's model_version
`source_kind`	string	latest event's source_kind ("onnx" / "formula" / ...)
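
To make the slope and acceleration features concrete, here is a small sketch of a least-squares slope and a mean second-difference. It is not the code in pkg/features: the function names, the handling of degenerate inputs (returning 0), and the choice of x-axis units are assumptions for illustration.

```go
package main

import "fmt"

// slope returns the ordinary least-squares slope of y against x.
// It returns 0 for fewer than two points, mismatched lengths, or when
// x has no spread (an assumption; the real code may treat these differently).
func slope(x, y []float64) float64 {
	n := len(x)
	if n < 2 || n != len(y) {
		return 0
	}
	var sumX, sumY float64
	for i := range x {
		sumX += x[i]
		sumY += y[i]
	}
	meanX, meanY := sumX/float64(n), sumY/float64(n)
	var num, den float64
	for i := range x {
		dx := x[i] - meanX
		num += dx * (y[i] - meanY)
		den += dx * dx
	}
	if den == 0 {
		return 0
	}
	return num / den
}

// meanSecondDiff returns the mean of y[i+1] - 2*y[i] + y[i-1] over all
// interior points, or 0 when fewer than three samples are available.
func meanSecondDiff(y []float64) float64 {
	if len(y) < 3 {
		return 0
	}
	var sum float64
	for i := 1; i < len(y)-1; i++ {
		sum += y[i+1] - 2*y[i] + y[i-1]
	}
	return sum / float64(len(y)-2)
}

func main() {
	seconds := []float64{0, 1, 2, 3}
	p50 := []float64{0.1, 0.2, 0.3, 0.4}
	fmt.Println(slope(seconds, p50))                   // steadily rising series
	fmt.Println(meanSecondDiff([]float64{0, 1, 4, 9})) // convex series
}
```

A positive slope with a positive second-difference indicates a score that is both rising and accelerating, which is the shape the sharp_rising_spike rule is looking for.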

Two helper functions are registered:

band(score, lo, hi) string -> "up" / "down" / "stable"
count_at_least(list, threshold) int -> count of list entries >= threshold
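
As a rough illustration of what the two helpers compute, the following sketch mirrors their signatures in plain Go. The comparison rules inside `band` (at or above hi is "up", at or below lo is "down") are an assumption; the README only fixes the signature and the three return values, and the real helper is backed by the pkg/thresholder state machine.

```go
package main

import "fmt"

// band classifies a score against a LO/HI pair. The inclusive
// comparisons are an assumption made for this sketch.
func band(score, lo, hi float64) string {
	switch {
	case score >= hi:
		return "up"
	case score <= lo:
		return "down"
	default:
		return "stable"
	}
}

// countAtLeast returns how many entries of values are >= threshold.
func countAtLeast(values []float64, threshold float64) int {
	n := 0
	for _, v := range values {
		if v >= threshold {
			n++
		}
	}
	return n
}

func main() {
	fmt.Println(band(0.7, 0.2, 0.5))
	fmt.Println(countAtLeast([]float64{0.4, 0.5, 0.6, 0.7}, 0.5))
}
```

In a rule, the same helpers would be called on the top-level identifiers, for example `band(p50_now, 0.2, 0.5) == "up"` or `count_at_least(p50_h, 0.5) >= 3`.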

Authoring workflow

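1. Edit data.policy.yaml in the ConfigMap.
2. Apply: kubectl apply -f deploy/controller/configmap.yaml.
3. Wait for the kubelet to remount the volume, which can lag the apply by up to a minute or so depending on the kubelet sync period. fsnotify in pkg/policy/loader.go then triggers engine.Reload within ~1s of the remount. Look for policy reloaded in the controller logs.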

A typo in a CEL expression is rejected by engine.Reload and the previous rules stay live -- the controller never goes silent on a bad rule.
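
The "previous rules stay live" guarantee comes down to compiling first and swapping only on success. The sketch below shows that pattern with a placeholder validator in place of the CEL compiler; the `engine` type, `compile`, and `Rules` are hypothetical names and not the pkg/policy API.

```go
package main

import (
	"fmt"
	"strings"
	"sync/atomic"
)

// ruleSet stands in for a compiled policy. The real engine would hold
// compiled CEL programs instead of raw strings.
type ruleSet struct{ exprs []string }

// compile is a placeholder validator for this sketch: it rejects empty
// expressions and unbalanced parentheses. It is not a CEL parser.
func compile(exprs []string) (*ruleSet, error) {
	for _, e := range exprs {
		if strings.TrimSpace(e) == "" || strings.Count(e, "(") != strings.Count(e, ")") {
			return nil, fmt.Errorf("invalid expression %q", e)
		}
	}
	return &ruleSet{exprs: exprs}, nil
}

// engine keeps the live rule set behind an atomic pointer so readers
// never observe a half-loaded policy.
type engine struct{ live atomic.Pointer[ruleSet] }

// Reload swaps in the new rules only if every expression compiles.
// On error the previously loaded rules remain in effect.
func (e *engine) Reload(exprs []string) error {
	rs, err := compile(exprs)
	if err != nil {
		return err
	}
	e.live.Store(rs)
	return nil
}

// Rules returns the expressions currently in effect (nil before the first load).
func (e *engine) Rules() []string {
	if rs := e.live.Load(); rs != nil {
		return rs.exprs
	}
	return nil
}

func main() {
	var e engine
	fmt.Println(e.Reload([]string{"p50_now > 0.5"}))
	fmt.Println(e.Reload([]string{"band(p50_now, 0.2, 0.5"})) // rejected: missing ")"
	fmt.Println(e.Rules())                                    // still the first rule set
}
```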

Default thresholds (explicit)

Env var	Default	Meaning
`TICK_MS`	`100`	per-pod policy evaluation cadence
`STALE_MS`	`1500`	a snapshot older than this is treated as missing
`WINDOW_SIZE`	`20`	rolling-window samples (~2 s at 100 ms cadence)
`HI_THRESHOLD`	`0.5`	what counts as "elevated" for PersistenceH / DurationAboveHiMs
`MIN_CPU` / `MAX_CPU`	`200m` / `4`	vertical resize clamp
`HORIZONTAL_COOLDOWN_SEC`	`30`	cross-node Deployment scale gate
`TARGETS_CONFIG`	`/etc/mitigation/targets.yaml`	mounted from the ConfigMap
`POLICY_CONFIG`	`/etc/mitigation/policy.yaml`	same
`NODE_NAME`	(none)	required; injected via `fieldRef: spec.nodeName`
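
A possible way to load a few of these settings with their defaults is sketched below. The `config` struct and `load` function are hypothetical, only a subset of the variables is covered, and the lookup function is injected so the logic can be exercised without touching the process environment.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strconv"
)

// config holds a subset of the settings from the table above.
type config struct {
	TickMS      int
	StaleMS     int
	WindowSize  int
	HiThreshold float64
	NodeName    string
}

// load starts from the documented defaults and overrides them with any
// values the lookup function provides. NODE_NAME has no default.
func load(lookup func(string) (string, bool)) (config, error) {
	cfg := config{TickMS: 100, StaleMS: 1500, WindowSize: 20, HiThreshold: 0.5}
	ints := map[string]*int{
		"TICK_MS":     &cfg.TickMS,
		"STALE_MS":    &cfg.StaleMS,
		"WINDOW_SIZE": &cfg.WindowSize,
	}
	for key, dst := range ints {
		if raw, ok := lookup(key); ok {
			v, err := strconv.Atoi(raw)
			if err != nil {
				return config{}, fmt.Errorf("%s: %w", key, err)
			}
			*dst = v
		}
	}
	if raw, ok := lookup("HI_THRESHOLD"); ok {
		v, err := strconv.ParseFloat(raw, 64)
		if err != nil {
			return config{}, fmt.Errorf("HI_THRESHOLD: %w", err)
		}
		cfg.HiThreshold = v
	}
	name, ok := lookup("NODE_NAME")
	if !ok || name == "" {
		return config{}, errors.New("NODE_NAME is required")
	}
	cfg.NodeName = name
	return cfg, nil
}

func main() {
	cfg, err := load(os.LookupEnv)
	fmt.Println(cfg, err)
}
```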

Deploy

Prerequisites: K8s >= 1.35 (in-place pod resize GA; see https://kubernetes.io/blog/2025/12/19/kubernetes-v1-35-in-place-pod-resize-ga/), cgroup v2 on every node, and a cluster that honours the pod-security.kubernetes.io/enforce=privileged namespace label (see deploy/controller/namespace.yaml).

Sample victims

kubectl apply -f deploy/victim-sample/namespace.yaml
kubectl apply -f deploy/victim-sample/search.yaml
kubectl apply -f deploy/victim-sample/profile.yaml

Replace the placeholder image: REGISTRY/...:tag lines with your real images. Three fields matter for mitigations to work: the named score port 7900, resources.requests == resources.limits, and a resizePolicy entry for cpu with restartPolicy: NotRequired.

Mitigation controller

kubectl apply -f deploy/controller/namespace.yaml
kubectl apply -f deploy/controller/rbac.yaml
kubectl apply -f deploy/controller/configmap.yaml
kubectl apply -f deploy/controller/daemonset.yaml

Adding a victim service later is a single ConfigMap edit:

kubectl -n mitigation-system edit cm mitigation-controller-config
# Policy/targets reload via fsnotify within ~1s of the kubelet updating the mounted volume; no rollout needed.

Crash-safe state (annotations only)

Every action stamps annotations on its target before the actual write so Reconcile() at startup can find and complete an interrupted apply:

Target	Annotation keys
Aggressor Pod	`mitigation/cpu-max-original`, `mitigation/cpu-max-set-by-node`, `mitigation/cpu-max-set-at`
Victim Pod	`mitigation/cpu-limit-baseline`
Victim Deployment	`mitigation/horizontal-last-scaled-at`, `mitigation/horizontal-baseline-replicas`

No extra storage backend (etcd, Redis, the controller's own CRD) is needed; the API server is the source of truth.
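
The stamp-before-write ordering can be illustrated with an in-memory model. Everything here is a simplified sketch: the `pod` struct stands in for both the API object and the cgroup file, and `apply` / `restore` are hypothetical names. Only the annotation key and the ordering come from this README.

```go
package main

import (
	"errors"
	"fmt"
)

// Annotation key from the table above.
const originalKey = "mitigation/cpu-max-original"

var errSimulatedCrash = errors.New("simulated crash before write")

// pod models the two pieces of state involved: annotations on the API
// object and the content of the cgroup v2 cpu.max file.
type pod struct {
	annotations map[string]string
	cpuMax      string
}

// writer performs the actual cpu.max write; it is injected so a failure
// between the stamp and the write can be simulated.
type writer func(p *pod, value string) error

func writeCPUMax(p *pod, value string) error {
	p.cpuMax = value
	return nil
}

// apply records the original value first and only then writes the new
// one, so an interrupted apply always leaves a recoverable record.
func apply(p *pod, newMax string, write writer) error {
	if _, stamped := p.annotations[originalKey]; !stamped {
		p.annotations[originalKey] = p.cpuMax
	}
	return write(p, newMax)
}

// restore puts back the recorded original and clears the stamp.
func restore(p *pod, write writer) error {
	orig, stamped := p.annotations[originalKey]
	if !stamped {
		return nil // nothing was applied
	}
	if err := write(p, orig); err != nil {
		return err
	}
	delete(p.annotations, originalKey)
	return nil
}

func main() {
	p := &pod{annotations: map[string]string{}, cpuMax: "max 100000"}
	crash := func(*pod, string) error { return errSimulatedCrash }
	fmt.Println(apply(p, "50000 100000", crash), p.annotations) // stamp survives the failed write
	fmt.Println(restore(p, writeCPUMax), p.cpuMax, p.annotations)
}
```

If the write happened before the stamp, a crash between the two steps would leave a throttled pod with no record of its original limit.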

Smoke test the score API directly

This matches the path used during development and needs no controllers.

# terminal 1
kubectl -n hotelres port-forward pod/search-<id> 7900:7900

# terminal 2
grpcurl -plaintext -d '{}' localhost:7900 \
  gordion.contention.ContentionStream/Subscribe

You should see a stream of ScoreEvent JSON objects at roughly 10 Hz. The events include p50_horizons / tail_horizons / horizon_ms once the predictor side ships the matching change.

Observability

Logs are JSON, written with log/slog to stderr. Every action emits a single line with rule, kind, pod, node, applied, reason, before, and after, plus err on failure. There is no Prometheus exporter yet; that is deliberately out of scope.
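
A log line of that shape can be produced with the standard library alone. In this sketch the `action` struct, the message text "mitigation action", the string types for before/after, and logging failures at error level are all assumptions; only the field names come from this README.

```go
package main

import (
	"errors"
	"io"
	"log/slog"
	"os"
)

// action carries the fields listed above. The struct itself is a
// hypothetical shape for this sketch.
type action struct {
	Rule, Kind, Pod, Node string
	Applied               bool
	Reason, Before, After string
	Err                   error
}

// newLogger returns a JSON logger writing to w.
func newLogger(w io.Writer) *slog.Logger {
	return slog.New(slog.NewJSONHandler(w, nil))
}

// logAction emits exactly one line per action; err is added only on failure.
func logAction(l *slog.Logger, a action) {
	attrs := []any{
		"rule", a.Rule, "kind", a.Kind, "pod", a.Pod, "node", a.Node,
		"applied", a.Applied, "reason", a.Reason,
		"before", a.Before, "after", a.After,
	}
	if a.Err != nil {
		attrs = append(attrs, "err", a.Err.Error())
		l.Error("mitigation action", attrs...)
		return
	}
	l.Info("mitigation action", attrs...)
}

// memWriter is a small in-memory io.Writer so the output can be inspected.
type memWriter struct{ data []byte }

func (m *memWriter) Write(p []byte) (int, error) {
	m.data = append(m.data, p...)
	return len(p), nil
}

func main() {
	l := newLogger(os.Stderr)
	logAction(l, action{
		Rule: "sharp_rising_spike", Kind: "isolate", Pod: "batch-1", Node: "node-a",
		Applied: true, Reason: "k_temporal > 0.3", Before: "max 100000", After: "50000 100000",
	})
	logAction(l, action{
		Rule: "sharp_rising_spike", Kind: "isolate", Pod: "batch-2", Node: "node-a",
		Reason: "k_temporal > 0.3", Err: errors.New("write cpu.max: permission denied"),
	})
}
```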

Renaming the module

The module path is github.com/coding-workspace/simple-mitigation-1. To change it (e.g. to your real GitHub org):

OLD=github.com/coding-workspace/simple-mitigation-1
NEW=github.com/your-org/your-repo
# Rewrite import paths in place (GNU sed syntax; BSD/macOS sed needs `sed -i ''`)
grep -rl "$OLD" . --include="*.go" --include="*.proto" --include="Makefile" \
  | xargs sed -i "s|$OLD|$NEW|g"
go mod edit -module "$NEW"
make proto && go mod tidy
