Skip to content

feat: track resource stats across split, leaf and root search#6416

Open
fulmicoton wants to merge 4 commits into
mainfrom
paul.masurel/search_metrics
Open

feat: track resource stats across split, leaf and root search#6416
fulmicoton wants to merge 4 commits into
mainfrom
paul.masurel/search_metrics

Conversation

@fulmicoton
Copy link
Copy Markdown
Collaborator

@fulmicoton fulmicoton commented May 11, 2026

Description

Introduce a resource_stats payload on LeafSearchResponse and SearchResponse so callers can see how a query spent its budget across the split → leaf → root hierarchy.

Proto

Three new messages in quickwit-proto/protos/quickwit/search.proto:

  • SplitResourceStats — per-split counters (split_num_docs, matched_num_docs), input working set (input_memory_bytes), storage download (download_num_bytes, download_num_requests), and per-phase microsecond timings (permit wait, warmup, cpu-pool wait, cpu search).
  • LeafResourceStats — per-leaf counters: partial-result-cache hits, lambda dispatched / lambda-succeeded splits, the lambda-bottleneck signal, locally-executed splits; plus split_resources_worst (worst split by phase-sum), split_resources_sum (field-wise sum over local splits), and the leaf's wall_time_microsecs.
  • RootResourceStats — per-search aggregates: leaf_resources_worst, leaf_resources_sum (field-wise sums of every leaf), leaf-call counts (excluding and including retries), and failed-split count.

Plumbing

  • CountingStorage (new wrapper in quickwit-storage) records downloaded bytes / GET requests per request.
  • leaf_search_single_split assembles SplitResourceStats from the counters and per-phase Instants; it writes to either localexec_* or lambda_success_* depending on quickwit_common::is_running_in_lambda — a process-wide cached bool driven by the AWS_LAMBDA_FUNCTION_NAME env var the Lambda runtime sets.
  • run_offloaded_search_tasks captures the lambda dispatch totals (lambda_num_splits / lambda_num_docs) and forwards them via a new IncrementalCollector::add_lambda_totals. Lambda failure counting stays in the client because transport-level failures cannot be observed by the lambda server.
  • LeafSearchCache::put rewrites resource_stats to the cache-hit shape at write time, so cache reads need no per-read mutation.
  • multi_index_leaf_search stamps wall_time_microsecs on the merged LeafResourceStats.
  • lambda_bottleneck is determined inside single_doc_mapping_leaf_search via an AtomicBool race between the local and offload futures (last writer wins after tokio::join!).
  • ClusterClient::leaf_search takes an Arc<AtomicU64> so the root can count leaf attempts including retries.
  • search_partial_hits_phase now returns an Option<RootResourceStats> alongside the merged response; compute_root_resource_stats produces it from the per-leaf responses before they are merged. root_search_aux attaches it to SearchResponse.resource_stats.
  • The gRPC Scroll handler surfaces resource_stats when a scroll page triggered an inner search (pure cache-hit pages stay None).
  • add_leaf_stats is fully extensive: every numeric field is summed, including wall_time_microsecs and lambda_bottleneck (the latter becomes a count of leaves where lambda was the bottleneck once aggregated).

Lambda outcome metric

LAMBDA_METRICS.leaf_search_requests_total / leaf_search_duration_seconds now carry an outcome label whose value is success, timeout or other. Timeout is detected from SDK / EFS / SnapStart timeout errors.

How was this PR tested?

  • Unit tests on add_split_stats, add_leaf_stats, compute_root_resource_stats, and CountingStorage. The "sums every field" tests destructure on the proto types so a new field forces a compile error pointing at the test.
  • End-to-end gRPC root test test_root_search_populates_resource_stats exercising the same free function the RootSearch handler invokes, with mocked metastore and leaf services.
  • cargo nextest run -p quickwit-search passes locally.
  • cargo nextest run -p quickwit-storage --tests counting_storage passes locally.

@fulmicoton-dd fulmicoton-dd force-pushed the paul.masurel/search_metrics branch 6 times, most recently from 09dced5 to 494c14e Compare May 11, 2026 13:43
@fulmicoton fulmicoton requested a review from Copilot May 11, 2026 13:46
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR expands Quickwit’s search observability by introducing richer resource statistics across split/leaf/root search execution, including per-split storage download volume, and by refining Lambda invocation metrics.

Changes:

  • Add a CountingStorage wrapper to measure per-request downloaded bytes and GET request counts.
  • Replace/extend search “resource stats” in the search protobuf and propagate aggregated stats to root search responses.
  • Track additional timing/counters in leaf search (permit wait time, wall time, lambda bottleneck) and refine Lambda metrics labeling.

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
quickwit/quickwit-storage/src/lib.rs Exposes the new counting storage wrapper types from the storage crate.
quickwit/quickwit-storage/src/counting_storage.rs Adds CountingStorage + DownloadCounters to instrument storage reads.
quickwit/quickwit-serve/src/jaeger_api/rest_handler.rs Updates tests/struct construction for the new SearchResponse.resource_stats field.
quickwit/quickwit-search/src/service.rs Adds resource_stats to scroll responses (currently set to None).
quickwit/quickwit-search/src/scroll_context.rs Adapts scroll batching to the new (LeafSearchResponse, stats) return type.
quickwit/quickwit-search/src/root.rs Computes root-level resource stats and threads them into SearchResponse.
quickwit/quickwit-search/src/lib.rs Reworks resource-stats merging utilities for the new leaf/split stats model.
quickwit/quickwit-search/src/leaf.rs Collects per-split stats (downloads, timings), wall time, and lambda bottleneck signals.
quickwit/quickwit-search/src/leaf_cache.rs Overwrites cached responses’ stats to reflect cache-hit counters.
quickwit/quickwit-search/src/collector.rs Switches resource stat merging to the new leaf/split resource stats types.
quickwit/quickwit-search/src/cluster_client.rs Tracks leaf-search call counts including retries; merges updated stats on retry.
quickwit/quickwit-proto/src/codegen/quickwit/quickwit.search.rs Updates generated Rust types for the modified search protobuf messages.
quickwit/quickwit-proto/protos/quickwit/search.proto Introduces SplitResourceStats, LeafResourceStats, RootResourceStats; adds SearchResponse.resource_stats.
quickwit/quickwit-proto/build.rs Ensures search.proto changes retrigger codegen; adds clippy attribute for large enum variant.
quickwit/quickwit-lambda-client/src/metrics.rs Renames Lambda metrics label key from status to outcome.
quickwit/quickwit-lambda-client/src/invoker.rs Classifies Lambda invocation outcomes (`success
quickwit/quickwit-jaeger/src/lib.rs Updates tests/struct construction for the new SearchResponse.resource_stats field.
quickwit/quickwit-common/src/lib.rs Adds is_running_in_lambda() helper used by leaf search instrumentation.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread quickwit/quickwit-storage/src/counting_storage.rs
Comment thread quickwit/quickwit-search/src/leaf.rs
Comment thread quickwit/quickwit-search/src/leaf.rs
Comment thread quickwit/quickwit-search/src/root.rs
Comment thread quickwit/quickwit-search/src/root.rs Outdated
Comment thread quickwit/quickwit-lambda-client/src/metrics.rs
Comment thread quickwit/quickwit-search/src/scroll_context.rs Outdated
Comment thread quickwit/quickwit-search/src/service.rs Outdated
@fulmicoton-dd fulmicoton-dd force-pushed the paul.masurel/search_metrics branch 5 times, most recently from 5b8f494 to ef5a68c Compare May 12, 2026 07:57
@fulmicoton-dd fulmicoton-dd changed the title Paul.masurel/search metrics feat: track resource stats across split, leaf and root search May 12, 2026
@fulmicoton-dd fulmicoton-dd force-pushed the paul.masurel/search_metrics branch 2 times, most recently from d67eb01 to 7dd2fdc Compare May 12, 2026 09:21
Introduce a `resource_stats` payload on `LeafSearchResponse` and
`SearchResponse` so callers can see how a query spent its budget.

Proto: three new messages in `quickwit-proto/protos/quickwit/search.proto`.
- `SplitResourceStats`: per-split docs counted (`split_num_docs`,
  `matched_num_docs`), input working set (`input_memory_bytes`), storage
  download (`download_num_bytes`, `download_num_requests`), and per-phase
  microsecond timings (permit wait, warmup, cpu-pool wait, cpu search).
- `LeafResourceStats`: per-leaf counters — partial-result-cache hits,
  lambda dispatched / lambda-succeeded splits, lambda-bottleneck flag,
  locally-executed splits — plus `split_resources_worst` (the worst split
  by phase-sum), `split_resources_sum` (field-wise sum over local splits)
  and the leaf's `wall_time_microsecs`.
- `RootResourceStats`: per-search aggregates — `leaf_resources_worst`,
  `leaf_resources_sum` (field-wise sums of every leaf's stats),
  leaf-call counts including retries, and failed-split count.

Plumbing.
- `CountingStorage` in `quickwit-storage` wraps the request-scoped
  storage and records downloaded bytes / GET requests.
- `leaf_search_single_split` builds `SplitResourceStats` from the
  counters and per-phase `Instant`s. It writes to either
  `localexec_*` or `lambda_success_*` depending on
  `quickwit_common::is_running_in_lambda`, which is a process-wide
  cached bool driven by `AWS_LAMBDA_FUNCTION_NAME`.
- `run_offloaded_search_tasks` captures the lambda dispatch totals
  (`lambda_num_splits` / `lambda_num_docs`) and pushes them via a
  new `IncrementalCollector::add_lambda_totals` method. Lambda failure
  counting stays in the client because transport-level failures are
  invisible to the lambda server.
- `LeafSearchCache::put` rewrites `resource_stats` to the cache-hit
  shape at write time so reads do not need to mutate.
- `multi_index_leaf_search` stamps `wall_time_microsecs` on the merged
  `LeafResourceStats`.
- `lambda_bottleneck` is determined in `single_doc_mapping_leaf_search`
  via an `AtomicBool` race between the local and offload futures (last
  writer wins after `tokio::join!`).
- `ClusterClient::leaf_search` now takes an `Arc<AtomicU64>` so the
  root counts leaf attempts including retries.
- `search_partial_hits_phase` returns an `Option<RootResourceStats>`
  alongside the merged response; `compute_root_resource_stats`
  produces it from the per-leaf responses before they are merged.
  `root_search_aux` writes it onto `SearchResponse.resource_stats`.
- The gRPC `Scroll` handler surfaces `resource_stats` when a scroll
  page triggered an inner search (pure cache-hit pages stay `None`).
- `add_leaf_stats` is now fully extensive: every numeric field is
  summed, including `wall_time_microsecs` and `lambda_bottleneck`
  (the latter becomes a count of leaves where lambda was the
  bottleneck once aggregated).

Lambda outcome metric.
- `LAMBDA_METRICS.leaf_search_requests_total` /
  `leaf_search_duration_seconds` now carry an `outcome` label whose
  value is `success`, `timeout` or `other` (timeout is detected from
  SDK / EFS / SnapStart errors).

Tests.
- Unit tests on `add_split_stats`, `add_leaf_stats`,
  `compute_root_resource_stats`, and `CountingStorage`. The
  "sums every field" tests destructure on the proto types so a new
  field forces a compile error pointing at the test.
- End-to-end gRPC root test
  `test_root_search_populates_resource_stats` exercises the same
  function the `RootSearch` handler invokes, with mocked metastore
  and leaf services.
@fulmicoton-dd fulmicoton-dd force-pushed the paul.masurel/search_metrics branch from 7dd2fdc to 6287375 Compare May 12, 2026 11:16
@fulmicoton fulmicoton marked this pull request as ready for review May 12, 2026 11:17
@fulmicoton fulmicoton requested a review from a team as a code owner May 12, 2026 11:18
@fulmicoton fulmicoton requested a review from trinity-1686a May 12, 2026 11:18
Copy link
Copy Markdown
Contributor

@trinity-1686a trinity-1686a left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think nothing take into account the fetch_docs phase. It's not typically the most expensive part, but that's still a blindspot

Comment on lines +412 to +415
// The worst single-split contribution, ranked by
// `warmup + wait_for_cpu_pool + cpu_predicate + cpu_collection + cpu_harvest`
// (intentionally excludes `wait_for_search_permit`).
SplitResourceStats split_resources_worst = 10;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why include wait_for_cpu_pool but exclude wait_for_search_permit? Neither is the fault of the split, both are related to the whole node's load

Comment thread quickwit/quickwit-storage/src/counting_storage.rs Outdated
Comment thread quickwit/quickwit-search/src/leaf.rs Outdated
Comment thread quickwit/quickwit-search/src/leaf.rs Outdated
Comment thread quickwit/quickwit-proto/protos/quickwit/search.proto
Comment thread quickwit/quickwit-search/src/lib.rs
@fulmicoton-dd fulmicoton-dd force-pushed the paul.masurel/search_metrics branch from 89a27eb to 46a1934 Compare May 13, 2026 11:27
@fulmicoton-dd fulmicoton-dd force-pushed the paul.masurel/search_metrics branch from 46a1934 to 20759cc Compare May 13, 2026 11:29
@fulmicoton-dd fulmicoton-dd force-pushed the paul.masurel/search_metrics branch from 9f87ffb to 3401854 Compare May 13, 2026 14:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants