feat: track resource stats across split, leaf and root search by fulmicoton · Pull Request #6416 · quickwit-oss/quickwit

fulmicoton · 2026-05-11T11:01:38Z

Description

Introduce a resource_stats payload on LeafSearchResponse and SearchResponse so callers can see how a query spent its budget across the split → leaf → root hierarchy.

Proto

Three new messages in quickwit-proto/protos/quickwit/search.proto:

SplitResourceStats — per-split counters (split_num_docs, matched_num_docs), input working set (input_memory_bytes), storage download (download_num_bytes, download_num_requests), and per-phase microsecond timings (permit wait, warmup, cpu-pool wait, cpu search).
LeafResourceStats — per-leaf counters: partial-result-cache hits, lambda dispatched / lambda-succeeded splits, the lambda-bottleneck signal, locally-executed splits; plus split_resources_worst (worst split by phase-sum), split_resources_sum (field-wise sum over local splits), and the leaf's wall_time_microsecs.
RootResourceStats — per-search aggregates: leaf_resources_worst, leaf_resources_sum (field-wise sums of every leaf), leaf-call counts (excluding and including retries), and failed-split count.

Plumbing

CountingStorage (new wrapper in quickwit-storage) records downloaded bytes / GET requests per request.
leaf_search_single_split assembles SplitResourceStats from the counters and per-phase Instants; it writes to either localexec_* or lambda_success_* depending on quickwit_common::is_running_in_lambda — a process-wide cached bool driven by the AWS_LAMBDA_FUNCTION_NAME env var the Lambda runtime sets.
run_offloaded_search_tasks captures the lambda dispatch totals (lambda_num_splits / lambda_num_docs) and forwards them via a new IncrementalCollector::add_lambda_totals. Lambda failure counting stays in the client because transport-level failures cannot be observed by the lambda server.
LeafSearchCache::put rewrites resource_stats to the cache-hit shape at write time, so cache reads need no per-read mutation.
multi_index_leaf_search stamps wall_time_microsecs on the merged LeafResourceStats.
lambda_bottleneck is determined inside single_doc_mapping_leaf_search via an AtomicBool race between the local and offload futures (last writer wins after tokio::join!).
ClusterClient::leaf_search takes an Arc<AtomicU64> so the root can count leaf attempts including retries.
search_partial_hits_phase now returns an Option<RootResourceStats> alongside the merged response; compute_root_resource_stats produces it from the per-leaf responses before they are merged. root_search_aux attaches it to SearchResponse.resource_stats.
The gRPC Scroll handler surfaces resource_stats when a scroll page triggered an inner search (pure cache-hit pages stay None).
add_leaf_stats is fully extensive: every numeric field is summed, including wall_time_microsecs and lambda_bottleneck (the latter becomes a count of leaves where lambda was the bottleneck once aggregated).

Lambda outcome metric

LAMBDA_METRICS.leaf_search_requests_total / leaf_search_duration_seconds now carry an outcome label whose value is success, timeout or other. Timeout is detected from SDK / EFS / SnapStart timeout errors.

How was this PR tested?

Unit tests on add_split_stats, add_leaf_stats, compute_root_resource_stats, and CountingStorage. The "sums every field" tests destructure on the proto types so a new field forces a compile error pointing at the test.
End-to-end gRPC root test test_root_search_populates_resource_stats exercising the same free function the RootSearch handler invokes, with mocked metastore and leaf services.
cargo nextest run -p quickwit-search passes locally.
cargo nextest run -p quickwit-storage --tests counting_storage passes locally.

Copilot

Pull request overview

This PR expands Quickwit’s search observability by introducing richer resource statistics across split/leaf/root search execution, including per-split storage download volume, and by refining Lambda invocation metrics.

Changes:

Add a CountingStorage wrapper to measure per-request downloaded bytes and GET request counts.
Replace/extend search “resource stats” in the search protobuf and propagate aggregated stats to root search responses.
Track additional timing/counters in leaf search (permit wait time, wall time, lambda bottleneck) and refine Lambda metrics labeling.

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 8 comments.

Show a summary per file

File	Description
`quickwit/quickwit-storage/src/lib.rs`	Exposes the new counting storage wrapper types from the storage crate.
`quickwit/quickwit-storage/src/counting_storage.rs`	Adds `CountingStorage` + `DownloadCounters` to instrument storage reads.
`quickwit/quickwit-serve/src/jaeger_api/rest_handler.rs`	Updates tests/struct construction for the new `SearchResponse.resource_stats` field.
`quickwit/quickwit-search/src/service.rs`	Adds `resource_stats` to scroll responses (currently set to `None`).
`quickwit/quickwit-search/src/scroll_context.rs`	Adapts scroll batching to the new `(LeafSearchResponse, stats)` return type.
`quickwit/quickwit-search/src/root.rs`	Computes root-level resource stats and threads them into `SearchResponse`.
`quickwit/quickwit-search/src/lib.rs`	Reworks resource-stats merging utilities for the new leaf/split stats model.
`quickwit/quickwit-search/src/leaf.rs`	Collects per-split stats (downloads, timings), wall time, and lambda bottleneck signals.
`quickwit/quickwit-search/src/leaf_cache.rs`	Overwrites cached responses’ stats to reflect cache-hit counters.
`quickwit/quickwit-search/src/collector.rs`	Switches resource stat merging to the new leaf/split resource stats types.
`quickwit/quickwit-search/src/cluster_client.rs`	Tracks leaf-search call counts including retries; merges updated stats on retry.
`quickwit/quickwit-proto/src/codegen/quickwit/quickwit.search.rs`	Updates generated Rust types for the modified search protobuf messages.
`quickwit/quickwit-proto/protos/quickwit/search.proto`	Introduces `SplitResourceStats`, `LeafResourceStats`, `RootResourceStats`; adds `SearchResponse.resource_stats`.
`quickwit/quickwit-proto/build.rs`	Ensures `search.proto` changes retrigger codegen; adds clippy attribute for large enum variant.
`quickwit/quickwit-lambda-client/src/metrics.rs`	Renames Lambda metrics label key from `status` to `outcome`.
`quickwit/quickwit-lambda-client/src/invoker.rs`	Classifies Lambda invocation outcomes (`success
`quickwit/quickwit-jaeger/src/lib.rs`	Updates tests/struct construction for the new `SearchResponse.resource_stats` field.
`quickwit/quickwit-common/src/lib.rs`	Adds `is_running_in_lambda()` helper used by leaf search instrumentation.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Introduce a `resource_stats` payload on `LeafSearchResponse` and `SearchResponse` so callers can see how a query spent its budget. Proto: three new messages in `quickwit-proto/protos/quickwit/search.proto`. - `SplitResourceStats`: per-split docs counted (`split_num_docs`, `matched_num_docs`), input working set (`input_memory_bytes`), storage download (`download_num_bytes`, `download_num_requests`), and per-phase microsecond timings (permit wait, warmup, cpu-pool wait, cpu search). - `LeafResourceStats`: per-leaf counters — partial-result-cache hits, lambda dispatched / lambda-succeeded splits, lambda-bottleneck flag, locally-executed splits — plus `split_resources_worst` (the worst split by phase-sum), `split_resources_sum` (field-wise sum over local splits) and the leaf's `wall_time_microsecs`. - `RootResourceStats`: per-search aggregates — `leaf_resources_worst`, `leaf_resources_sum` (field-wise sums of every leaf's stats), leaf-call counts including retries, and failed-split count. Plumbing. - `CountingStorage` in `quickwit-storage` wraps the request-scoped storage and records downloaded bytes / GET requests. - `leaf_search_single_split` builds `SplitResourceStats` from the counters and per-phase `Instant`s. It writes to either `localexec_*` or `lambda_success_*` depending on `quickwit_common::is_running_in_lambda`, which is a process-wide cached bool driven by `AWS_LAMBDA_FUNCTION_NAME`. - `run_offloaded_search_tasks` captures the lambda dispatch totals (`lambda_num_splits` / `lambda_num_docs`) and pushes them via a new `IncrementalCollector::add_lambda_totals` method. Lambda failure counting stays in the client because transport-level failures are invisible to the lambda server. - `LeafSearchCache::put` rewrites `resource_stats` to the cache-hit shape at write time so reads do not need to mutate. - `multi_index_leaf_search` stamps `wall_time_microsecs` on the merged `LeafResourceStats`. - `lambda_bottleneck` is determined in `single_doc_mapping_leaf_search` via an `AtomicBool` race between the local and offload futures (last writer wins after `tokio::join!`). - `ClusterClient::leaf_search` now takes an `Arc<AtomicU64>` so the root counts leaf attempts including retries. - `search_partial_hits_phase` returns an `Option<RootResourceStats>` alongside the merged response; `compute_root_resource_stats` produces it from the per-leaf responses before they are merged. `root_search_aux` writes it onto `SearchResponse.resource_stats`. - The gRPC `Scroll` handler surfaces `resource_stats` when a scroll page triggered an inner search (pure cache-hit pages stay `None`). - `add_leaf_stats` is now fully extensive: every numeric field is summed, including `wall_time_microsecs` and `lambda_bottleneck` (the latter becomes a count of leaves where lambda was the bottleneck once aggregated). Lambda outcome metric. - `LAMBDA_METRICS.leaf_search_requests_total` / `leaf_search_duration_seconds` now carry an `outcome` label whose value is `success`, `timeout` or `other` (timeout is detected from SDK / EFS / SnapStart errors). Tests. - Unit tests on `add_split_stats`, `add_leaf_stats`, `compute_root_resource_stats`, and `CountingStorage`. The "sums every field" tests destructure on the proto types so a new field forces a compile error pointing at the test. - End-to-end gRPC root test `test_root_search_populates_resource_stats` exercises the same function the `RootSearch` handler invokes, with mocked metastore and leaf services.

trinity-1686a

i think nothing take into account the fetch_docs phase. It's not typically the most expensive part, but that's still a blindspot

trinity-1686a · 2026-05-12T14:20:03Z

+    // The worst single-split contribution, ranked by
+    // `warmup + wait_for_cpu_pool + cpu_predicate + cpu_collection + cpu_harvest`
+    // (intentionally excludes `wait_for_search_permit`).
+    SplitResourceStats split_resources_worst = 10;


why include wait_for_cpu_pool but exclude wait_for_search_permit? Neither is the fault of the split, both are related to the whole node's load

fulmicoton-dd force-pushed the paul.masurel/search_metrics branch 6 times, most recently from 09dced5 to 494c14e Compare May 11, 2026 13:43

fulmicoton requested a review from Copilot May 11, 2026 13:46

Copilot started reviewing on behalf of fulmicoton May 11, 2026 13:47 View session

Copilot AI reviewed May 11, 2026

View reviewed changes

fulmicoton-dd force-pushed the paul.masurel/search_metrics branch 5 times, most recently from 5b8f494 to ef5a68c Compare May 12, 2026 07:57

fulmicoton-dd changed the title ~~Paul.masurel/search metrics~~ feat: track resource stats across split, leaf and root search May 12, 2026

fulmicoton-dd force-pushed the paul.masurel/search_metrics branch 2 times, most recently from d67eb01 to 7dd2fdc Compare May 12, 2026 09:21

fulmicoton-dd force-pushed the paul.masurel/search_metrics branch from 7dd2fdc to 6287375 Compare May 12, 2026 11:16

fulmicoton marked this pull request as ready for review May 12, 2026 11:17

fulmicoton requested a review from a team as a code owner May 12, 2026 11:18

fulmicoton requested a review from trinity-1686a May 12, 2026 11:18

trinity-1686a approved these changes May 12, 2026

View reviewed changes

fulmicoton-dd force-pushed the paul.masurel/search_metrics branch from 89a27eb to 46a1934 Compare May 13, 2026 11:27

CR comment

20759cc

fulmicoton-dd force-pushed the paul.masurel/search_metrics branch from 46a1934 to 20759cc Compare May 13, 2026 11:29

fulmicoton-dd added 2 commits May 13, 2026 15:26

added min wait time + keeping top 2 worsts.

9143691

CR comment

3401854

fulmicoton-dd force-pushed the paul.masurel/search_metrics branch from 9f87ffb to 3401854 Compare May 13, 2026 14:20

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: track resource stats across split, leaf and root search#6416

feat: track resource stats across split, leaf and root search#6416
fulmicoton wants to merge 4 commits into
mainfrom
paul.masurel/search_metrics

fulmicoton commented May 11, 2026 •

edited by fulmicoton-dd

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

trinity-1686a left a comment

Uh oh!

trinity-1686a May 12, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

fulmicoton commented May 11, 2026 • edited by fulmicoton-dd Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Proto

Plumbing

Lambda outcome metric

How was this PR tested?

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

trinity-1686a left a comment

Choose a reason for hiding this comment

Uh oh!

trinity-1686a May 12, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

fulmicoton commented May 11, 2026 •

edited by fulmicoton-dd

Loading