Skip to content

refactor(orchestrator): prepare running ic-gateway as a side-car to the replica in cloud engines#10499

Merged
pierugo-dfinity merged 43 commits into
masterfrom
pierugo/orchestrator/cloud-engine-ic-gateway
Jun 26, 2026
Merged

refactor(orchestrator): prepare running ic-gateway as a side-car to the replica in cloud engines#10499
pierugo-dfinity merged 43 commits into
masterfrom
pierugo/orchestrator/cloud-engine-ic-gateway

Conversation

@pierugo-dfinity

@pierugo-dfinity pierugo-dfinity commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Cloud engines should be self-contained, but today their nodes rely on external routing to reach the replica. This PR prepares running ic-gateway as a side-car alongside the replica on each node, so the gateway terminates external traffic locally and proxies it to the replica.
Launching ic-gateway is still gated behind a flag and only happens on cloud engines, so there's no behavioral change for now. To enable the functionality, it should be enough to flip that flag.

Refactoring the orchestrator's process manager

The orchestrator's process management was written around a single process (the replica), with a parallel, partially-duplicated path for ic-boundary. Adding a third managed process the same way would have meant copying that logic again.
Instead, process management is now generic. The upgrade loop no longer knows what it's running — it just starts and stops a manager that owns the set of node processes. Adding a future process is a matter of describing how it's built and registering it; no changes to the upgrade loop.

Scope: this refactor is confined to the upgrade loop, i.e. processes whose lifetime is tied to a subnet's. Unassigned nodes and API BNs are unaffected.

Testing

A new system test cloud_engine_ic_gateway_test (now still manual because the feature flag is disabled) hits api/v2/status on port 80 of each node, confirming ic-gateway is up and correctly proxying to the replica.
Moreover, new unit tests were added to test the existing ic-boundary logic that restarts the process on domain name changes.

Implementation details for reviewers
  • The upgrade loop now holds a MultipleProcessesManager and calls start_all/stop_all. It's process-agnostic, except for stop_replica which is (arguably, see comment) still needed during recoveries.
  • All per-process logic (including ic-boundary) moved into processes.rs. Each Process declares how it's built via build(&Self::Config, Self::Args), separating static config from dynamic arguments that can change across the orchestrator's lifetime, likely derived from the registry.
  • ProcessManager<Process> centralizes the logging, metrics, and OrchestratorResult handling that was previously duplicated. MultipleProcessesManager holds several of these and decides what to start/stop (e.g. ic-gateway only on cloud engines).
  • ic-boundary is an exception as it has some additional logic when the node's domain changes. Keep that logic by introducing IcBoundaryManager, a wrapper over ProcessManager<IcBoundaryProcess> exposing ensure_ic_boundary_running_and_restarted_on_domain_change.

To add a new process: define NewProcessConfig and NewProcess implementing Process, add a ProcessManager<NewProcess> field to MultipleProcessesManager, and wire it into the start/stop methods.

@pierugo-dfinity pierugo-dfinity added the CI_ALL_BAZEL_TARGETS Runs all bazel targets label Jun 17, 2026
@github-actions github-actions Bot added the feat label Jun 17, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Not ready to approve

MultipleProcessesManager::start_all does not actually stop ic-gateway when the launch gate is disabled, contradicting the intended behavior and potentially leaving the sidecar running unexpectedly.

Pull request overview

Refactors the orchestrator’s subnet-scoped process supervision from “replica-only + special-case ic-boundary” into a generic multi-process manager, and introduces ic-gateway as an (currently gated) sidecar process for CloudEngine nodes so public API traffic can be terminated locally on port 80 and proxied to the replica.

Changes:

  • Replace replica-specific process management in the upgrade loop with MultipleProcessesManager / per-process ProcessManager.
  • Add ic-gateway process definition + GuestOS packaging/env config; add a new (manual) system test verifying /api/v2/status via port 80 on CloudEngine nodes.
  • Rename orchestrator start-attempts metric to be per-process (orchestrator_processes_start_attempts_total) and update affected tests.
File summaries
File Description
rs/tests/driver/src/driver/test_env_api.rs Adjust metrics threshold matching to support prefix-based metric checks.
rs/tests/driver/src/driver/group.rs Update default orchestrator metrics name and document prefix-based matching expectations.
rs/tests/consensus/subnet_splitting_test.rs Update expected orchestrator metric name.
rs/tests/consensus/subnet_recovery/sr_nns_failover_nodes_test.rs Update removed orchestrator metric name.
rs/tests/consensus/subnet_recovery/sr_app_same_nodes_with_chain_keys_test.rs Update removed orchestrator metric name.
rs/tests/consensus/subnet_recovery/sr_app_same_nodes_test.rs Update removed orchestrator metric name.
rs/tests/consensus/subnet_recovery/sr_app_same_nodes_enable_chain_keys_test.rs Update removed orchestrator metric name.
rs/tests/consensus/subnet_recovery/sr_app_no_upgrade_with_chain_keys_test.rs Update removed orchestrator metric name.
rs/tests/consensus/subnet_recovery/sr_app_no_upgrade_test.rs Update removed orchestrator metric name.
rs/tests/consensus/subnet_recovery/sr_app_no_upgrade_provision_write_access_test.rs Update removed orchestrator metric name.
rs/tests/consensus/subnet_recovery/sr_app_no_upgrade_local_test.rs Update removed orchestrator metric name.
rs/tests/consensus/subnet_recovery/sr_app_no_upgrade_enable_chain_keys_test.rs Update removed orchestrator metric name.
rs/tests/consensus/subnet_recovery/sr_app_large_no_upgrade_with_chain_keys_test.rs Update removed orchestrator metric name.
rs/tests/consensus/subnet_recovery/sr_app_failover_nodes_with_chain_keys_test.rs Update removed orchestrator metric name.
rs/tests/consensus/subnet_recovery/sr_app_failover_nodes_test.rs Update removed orchestrator metric name.
rs/tests/consensus/subnet_recovery/sr_app_failover_nodes_enable_chain_keys_test.rs Update removed orchestrator metric name.
rs/tests/consensus/replica_determinism_test.rs Update expected orchestrator metric name.
rs/tests/consensus/orchestrator/node_reassignment_test.rs Update expected orchestrator metric name.
rs/tests/consensus/orchestrator/cloud_engine_ic_gateway_test.rs New system test validating CloudEngine health on port 80 through ic-gateway.
rs/tests/consensus/orchestrator/Cargo.toml Add reqwest/serde_cbor deps and register new system-test binary.
rs/tests/consensus/orchestrator/BUILD.bazel Add Bazel target for the new system test (tagged manual).
rs/tests/consensus/cup_explorer_test.rs Update expected orchestrator metric name.
rs/orchestrator/src/upgrade.rs Switch upgrade loop to start/stop a multi-process manager instead of replica-only management.
rs/orchestrator/src/registry_helper.rs Add subnet-type lookup helper for process policy decisions.
rs/orchestrator/src/processes.rs New module: process definitions + generic ProcessManager + MultipleProcessesManager including gated ic-gateway.
rs/orchestrator/src/process_manager.rs Refactor to ProcessRunner and ensure child processes are spawned as their own process-group leaders.
rs/orchestrator/src/orchestrator.rs Wire new process configs/managers; pass env file paths via args; integrate with dashboard/boundary management.
rs/orchestrator/src/metrics.rs Replace replica-only start metric with per-process start/stop counters.
rs/orchestrator/src/lib.rs Register the new processes module.
rs/orchestrator/src/error.rs Remove now-unused domain-name-missing error variant.
rs/orchestrator/src/dashboard.rs Display both replica and ic-gateway PIDs via MultipleProcessesManager.
rs/orchestrator/src/boundary_node.rs Delegate ic-boundary lifecycle to IcBoundaryManager.
rs/orchestrator/src/args.rs Add required args for boundary/gateway env files; make ic_binary_directory non-optional.
rs/ic_os/release/BUILD.bazel Add ic-gateway to released/stripped binaries.
ic-os/guestos/envs/recovery/BUILD.bazel Increase size thresholds to accommodate image growth.
ic-os/guestos/envs/prod/BUILD.bazel Increase size thresholds to accommodate image growth.
ic-os/guestos/defs.bzl Include ic-gateway binary in GuestOS image dependencies.
ic-os/components/guestos/share/ic-gateway.env New env file configuring ic-gateway (port 80, proxy to 8080, domain).
ic-os/components/guestos/ic-replica.service Add orchestrator flags for boundary/gateway env file paths.
ic-os/components/guestos.bzl Install ic-gateway.env into /opt/ic/share.
Cargo.lock Add lock entries for new test dependencies.
Cargo.Bazel.json.lock Add ic-gateway as a generated binary crate.
bazel/rust.MODULE.bazel Add crate annotation to generate ic-gateway binary via Bazel.

Copilot's findings

  • Files reviewed: 41/43 changed files
  • Comments generated: 4

Note

Your feedback helps us improve the quality of this feature.
Please use 👍 or 👎 to tell us whether this assessment is correct.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread rs/orchestrator/src/processes.rs
Comment thread rs/orchestrator/src/processes.rs
Comment thread rs/orchestrator/src/processes.rs Outdated
Comment thread rs/tests/driver/src/driver/test_env_api.rs Outdated
@zeropath-ai

zeropath-ai Bot commented Jun 22, 2026

Copy link
Copy Markdown

No security or compliance issues detected. Reviewed everything up to 27c706c.

Security Overview
Detected Code Changes

The diff is too large to display a summary of code changes.

@zeropath-ai

zeropath-ai Bot commented Jun 22, 2026

Copy link
Copy Markdown

No security or compliance issues detected. Reviewed everything up to 27c706c.

Security Overview
Detected Code Changes

The diff is too large to display a summary of code changes.

Comment thread rs/orchestrator/src/processes.rs
Comment thread rs/orchestrator/src/processes.rs Outdated
Comment thread rs/orchestrator/src/processes.rs Outdated
Comment thread rs/orchestrator/src/process_manager.rs
Comment thread rs/orchestrator/src/upgrade.rs
Comment thread rs/orchestrator/src/upgrade.rs
Comment thread rs/orchestrator/src/args.rs Outdated
Comment thread rs/orchestrator/src/processes.rs Outdated
Comment thread rs/tests/driver/src/driver/test_env_api.rs Outdated
Comment thread rs/orchestrator/src/process_manager.rs
Comment thread rs/orchestrator/src/upgrade.rs Outdated
@pierugo-dfinity pierugo-dfinity added this pull request to the merge queue Jun 26, 2026
Merged via the queue into master with commit 3b1a478 Jun 26, 2026
37 checks passed
@pierugo-dfinity pierugo-dfinity deleted the pierugo/orchestrator/cloud-engine-ic-gateway branch June 26, 2026 08:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants