refactor(orchestrator): prepare running ic-gateway as a side-car to the replica in cloud engines#10499
Conversation
There was a problem hiding this comment.
⚠️ Not ready to approve
MultipleProcessesManager::start_all does not actually stop ic-gateway when the launch gate is disabled, contradicting the intended behavior and potentially leaving the sidecar running unexpectedly.
Pull request overview
Refactors the orchestrator’s subnet-scoped process supervision from “replica-only + special-case ic-boundary” into a generic multi-process manager, and introduces ic-gateway as an (currently gated) sidecar process for CloudEngine nodes so public API traffic can be terminated locally on port 80 and proxied to the replica.
Changes:
- Replace replica-specific process management in the upgrade loop with
MultipleProcessesManager/ per-processProcessManager. - Add
ic-gatewayprocess definition + GuestOS packaging/env config; add a new (manual) system test verifying/api/v2/statusvia port 80 on CloudEngine nodes. - Rename orchestrator start-attempts metric to be per-process (
orchestrator_processes_start_attempts_total) and update affected tests.
File summaries
| File | Description |
|---|---|
| rs/tests/driver/src/driver/test_env_api.rs | Adjust metrics threshold matching to support prefix-based metric checks. |
| rs/tests/driver/src/driver/group.rs | Update default orchestrator metrics name and document prefix-based matching expectations. |
| rs/tests/consensus/subnet_splitting_test.rs | Update expected orchestrator metric name. |
| rs/tests/consensus/subnet_recovery/sr_nns_failover_nodes_test.rs | Update removed orchestrator metric name. |
| rs/tests/consensus/subnet_recovery/sr_app_same_nodes_with_chain_keys_test.rs | Update removed orchestrator metric name. |
| rs/tests/consensus/subnet_recovery/sr_app_same_nodes_test.rs | Update removed orchestrator metric name. |
| rs/tests/consensus/subnet_recovery/sr_app_same_nodes_enable_chain_keys_test.rs | Update removed orchestrator metric name. |
| rs/tests/consensus/subnet_recovery/sr_app_no_upgrade_with_chain_keys_test.rs | Update removed orchestrator metric name. |
| rs/tests/consensus/subnet_recovery/sr_app_no_upgrade_test.rs | Update removed orchestrator metric name. |
| rs/tests/consensus/subnet_recovery/sr_app_no_upgrade_provision_write_access_test.rs | Update removed orchestrator metric name. |
| rs/tests/consensus/subnet_recovery/sr_app_no_upgrade_local_test.rs | Update removed orchestrator metric name. |
| rs/tests/consensus/subnet_recovery/sr_app_no_upgrade_enable_chain_keys_test.rs | Update removed orchestrator metric name. |
| rs/tests/consensus/subnet_recovery/sr_app_large_no_upgrade_with_chain_keys_test.rs | Update removed orchestrator metric name. |
| rs/tests/consensus/subnet_recovery/sr_app_failover_nodes_with_chain_keys_test.rs | Update removed orchestrator metric name. |
| rs/tests/consensus/subnet_recovery/sr_app_failover_nodes_test.rs | Update removed orchestrator metric name. |
| rs/tests/consensus/subnet_recovery/sr_app_failover_nodes_enable_chain_keys_test.rs | Update removed orchestrator metric name. |
| rs/tests/consensus/replica_determinism_test.rs | Update expected orchestrator metric name. |
| rs/tests/consensus/orchestrator/node_reassignment_test.rs | Update expected orchestrator metric name. |
| rs/tests/consensus/orchestrator/cloud_engine_ic_gateway_test.rs | New system test validating CloudEngine health on port 80 through ic-gateway. |
| rs/tests/consensus/orchestrator/Cargo.toml | Add reqwest/serde_cbor deps and register new system-test binary. |
| rs/tests/consensus/orchestrator/BUILD.bazel | Add Bazel target for the new system test (tagged manual). |
| rs/tests/consensus/cup_explorer_test.rs | Update expected orchestrator metric name. |
| rs/orchestrator/src/upgrade.rs | Switch upgrade loop to start/stop a multi-process manager instead of replica-only management. |
| rs/orchestrator/src/registry_helper.rs | Add subnet-type lookup helper for process policy decisions. |
| rs/orchestrator/src/processes.rs | New module: process definitions + generic ProcessManager + MultipleProcessesManager including gated ic-gateway. |
| rs/orchestrator/src/process_manager.rs | Refactor to ProcessRunner and ensure child processes are spawned as their own process-group leaders. |
| rs/orchestrator/src/orchestrator.rs | Wire new process configs/managers; pass env file paths via args; integrate with dashboard/boundary management. |
| rs/orchestrator/src/metrics.rs | Replace replica-only start metric with per-process start/stop counters. |
| rs/orchestrator/src/lib.rs | Register the new processes module. |
| rs/orchestrator/src/error.rs | Remove now-unused domain-name-missing error variant. |
| rs/orchestrator/src/dashboard.rs | Display both replica and ic-gateway PIDs via MultipleProcessesManager. |
| rs/orchestrator/src/boundary_node.rs | Delegate ic-boundary lifecycle to IcBoundaryManager. |
| rs/orchestrator/src/args.rs | Add required args for boundary/gateway env files; make ic_binary_directory non-optional. |
| rs/ic_os/release/BUILD.bazel | Add ic-gateway to released/stripped binaries. |
| ic-os/guestos/envs/recovery/BUILD.bazel | Increase size thresholds to accommodate image growth. |
| ic-os/guestos/envs/prod/BUILD.bazel | Increase size thresholds to accommodate image growth. |
| ic-os/guestos/defs.bzl | Include ic-gateway binary in GuestOS image dependencies. |
| ic-os/components/guestos/share/ic-gateway.env | New env file configuring ic-gateway (port 80, proxy to 8080, domain). |
| ic-os/components/guestos/ic-replica.service | Add orchestrator flags for boundary/gateway env file paths. |
| ic-os/components/guestos.bzl | Install ic-gateway.env into /opt/ic/share. |
| Cargo.lock | Add lock entries for new test dependencies. |
| Cargo.Bazel.json.lock | Add ic-gateway as a generated binary crate. |
| bazel/rust.MODULE.bazel | Add crate annotation to generate ic-gateway binary via Bazel. |
Copilot's findings
- Files reviewed: 41/43 changed files
- Comments generated: 4
Note
Your feedback helps us improve the quality of this feature.
Please use 👍 or 👎 to tell us whether this assessment is correct.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
✅ No security or compliance issues detected. Reviewed everything up to 27c706c. Security Overview
Detected Code ChangesThe diff is too large to display a summary of code changes. |
|
✅ No security or compliance issues detected. Reviewed everything up to 27c706c. Security Overview
Detected Code ChangesThe diff is too large to display a summary of code changes. |
Cloud engines should be self-contained, but today their nodes rely on external routing to reach the replica. This PR prepares running
ic-gatewayas a side-car alongside the replica on each node, so the gateway terminates external traffic locally and proxies it to the replica.Launching
ic-gatewayis still gated behind a flag and only happens on cloud engines, so there's no behavioral change for now. To enable the functionality, it should be enough to flip that flag.Refactoring the orchestrator's process manager
The orchestrator's process management was written around a single process (the replica), with a parallel, partially-duplicated path for
ic-boundary. Adding a third managed process the same way would have meant copying that logic again.Instead, process management is now generic. The upgrade loop no longer knows what it's running — it just starts and stops a manager that owns the set of node processes. Adding a future process is a matter of describing how it's built and registering it; no changes to the upgrade loop.
Scope: this refactor is confined to the upgrade loop, i.e. processes whose lifetime is tied to a subnet's. Unassigned nodes and API BNs are unaffected.
Testing
A new system test
cloud_engine_ic_gateway_test(now stillmanualbecause the feature flag is disabled) hitsapi/v2/statuson port 80 of each node, confirmingic-gatewayis up and correctly proxying to the replica.Moreover, new unit tests were added to test the existing
ic-boundarylogic that restarts the process on domain name changes.Implementation details for reviewers
MultipleProcessesManagerand callsstart_all/stop_all. It's process-agnostic, except forstop_replicawhich is (arguably, see comment) still needed during recoveries.ic-boundary) moved intoprocesses.rs. EachProcessdeclares how it's built viabuild(&Self::Config, Self::Args), separating static config from dynamic arguments that can change across the orchestrator's lifetime, likely derived from the registry.ProcessManager<Process>centralizes the logging, metrics, andOrchestratorResulthandling that was previously duplicated.MultipleProcessesManagerholds several of these and decides what to start/stop (e.g.ic-gatewayonly on cloud engines).ic-boundaryis an exception as it has some additional logic when the node's domain changes. Keep that logic by introducingIcBoundaryManager, a wrapper overProcessManager<IcBoundaryProcess>exposingensure_ic_boundary_running_and_restarted_on_domain_change.To add a new process: define
NewProcessConfigandNewProcessimplementingProcess, add aProcessManager<NewProcess>field toMultipleProcessesManager, and wire it into the start/stop methods.