Integrate Stored Procedure and Query Plan request routing to Gateway V2 endpoint #47759

Open
jeet1995 wants to merge 69 commits into Azure:main from jeet1995:AzCosmos_GatewayV2_QueryPlanSupport

@jeet1995 (Member) commented Jan 21, 2026

Summary

This PR integrates Query Plan and Stored Procedure request routing into the Gateway V2 thin-client (proxy) path, adds the RNTBD protocol surface the proxy needs to generate a query plan, and deserializes the proxy-generated query plan back into the client query pipeline. It also fixes a thin-client query-plan error-status regression and a hybrid/full-text diagnostics gap, and ships an oracle-style E2E suite that validates thin-client parity against Direct TCP.

Production changes (azure-cosmos)

RNTBD protocol surface (Gateway V2 proxy)

  • RntbdConstants.java — new QueryPlan operation type (0x0042) and two new request headers: SupportedQueryFeatures (0x00FF, String) and QueryVersion (0x0100, SmallString). IDs match the server-side proxy (ADO PR 1982503).
  • RntbdRequestHeaders.java — populate the SupportedQueryFeatures and QueryVersion RNTBD tokens from their HTTP-header equivalents so the proxy can extract them from the RNTBD body.
  • RntbdRequestFrame.java — wire the new QueryPlan operation type onto the request frame.

Query-plan routing & deserialization

  • QueryPlanRetriever.java — advertise CountIf in SUPPORTED_QUERY_FEATURES; route getQueryPlanThroughGatewayAsync through thin-client mode and pass the partitionKeyDefinition; add a defense-in-depth guard that converts a non-2xx or malformed query-plan response into a clean 400 instead of a leaked exception.
  • PartitionedQueryExecutionInfo.java — thin-client deserialization overload that accepts the partition-key definition (and response timeline), used to construct the query pipeline from a proxy-generated query plan.
  • QueryInfo.java — getGroupByAliasToAggregateType() and CountIf handling.
  • DocumentQueryExecutionContextFactory.java, IDocumentQueryClient.java — thread the partition-key definition into query-plan retrieval.

Stored-procedure routing

  • RxDocumentClientImpl.java, RxDocumentServiceRequest.java, ThinClientStoreModel.java — route stored-procedure execution through the Gateway V2 thin-client path.

Bug fix — thin-client query-plan error status (statusCode 0 → 400)

  • RxGatewayStoreModel.validateOrThrow — a thin-client query-plan error frame (non-JSON, NUL-padded body) caused new CosmosError(body) to throw, which escaped before the intended throw dce and surfaced upstream as statusCode 0. It now falls back to a sanitized CosmosError, so the existing throw carries the real 400 with the server-provided substatus and message preserved. Strictly additive: 2xx responses and valid-JSON error bodies are byte-identical to before — only a previously-leaked exception path is corrected.
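
The fallback described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the SDK's code: `ParsedError`, `parseStrict`, and `sanitize` are hypothetical stand-ins for `CosmosError` and its parsing and non-parsing constructors, and the strict parser is simulated with a simple shape check rather than a real JSON parser.

```java
final class ErrorBodyFallbackSketch {

    /** Hypothetical stand-in for the SDK's CosmosError; holds only what the sketch needs. */
    static final class ParsedError {
        final int statusCode;
        final String message;

        ParsedError(int statusCode, String message) {
            this.statusCode = statusCode;
            this.message = message;
        }
    }

    /** Simulated strict parser: throws when the body is not shaped like a JSON object. */
    static String parseStrict(String body) {
        String trimmed = body.trim();
        if (!trimmed.startsWith("{") || !trimmed.endsWith("}")) {
            throw new IllegalArgumentException("error body is not JSON");
        }
        return trimmed;
    }

    /** Removes NUL padding and surrounding whitespace from a raw error body. */
    static String sanitize(String body) {
        return body.replace("\0", "").trim();
    }

    /** Builds the error without letting a parse failure hide the real status code. */
    static ParsedError toError(int statusCode, String body) {
        String message;
        try {
            message = parseStrict(body);
        } catch (IllegalArgumentException e) {
            // Narrow catch: only the parse failure is handled; the status code is kept as-is.
            message = sanitize(body);
        }
        return new ParsedError(statusCode, message);
    }

    public static void main(String[] args) {
        ParsedError error = toError(400, "Syntax error near 'SELEC'\0\0\0");
        System.out.println(error.statusCode + ": " + error.message);
    }
}
```

The point of the narrow catch is that the caller's existing status-carrying throw still runs; valid-JSON bodies take the same path as before.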

Diagnostics fix — hybrid / full-text

  • HybridSearchDocumentQueryExecutionContext.java — propagate the component-query client-side request statistics into the synthetic final FeedResponse, so endpoint diagnostics still show the core response path for RRF(...) / full-text queries.

Partition-key range routing

  • PartitionKeyInternalHelper.java — convert PartitionKeyInternal ranges into sorted EPK ranges for multi-range routing.

Test changes (azure-cosmos-tests)

The monolithic ThinClientE2ETest is replaced (−378) by a focused thin-client E2E suite (TestNG group thinclient, proxy :10250):

| Test class | Coverage |
| --- | --- |
| ThinClientQueryE2ETest | 84 oracle-style tests — query parity, endpoint diagnostics, full-text, hybrid, vector, continuation draining |
| ThinClientChangeFeedE2ETest | forFullRange(), forLogicalPartition(), incremental change feed |
| ThinClientPointOperationE2ETest | CRUD + Patch, bulk, batch |
| ThinClientStoredProcedureE2ETest | Stored procedure execute, no-PK error, PartitionKey.NONE |
| PartitionKeyInternalTest | Client-side conversion of PartitionKeyInternal ranges to sorted EPK ranges |
| QueryPlanRetrieverSupportedFeaturesTest | Advertised query-feature flags |
| ReadManyByPartitionKeyQueryPlanRoutingTest | readMany query-plan routing |
| GatewayReadConsistencyStrategySpyWireTest | Gateway read-consistency wire behavior |

Plus: ThinClientTestBase / TestSuiteBase thin-client helpers, pom.xml thinclient profile wiring, and THINCLIENT_TEST_MATRIX.md coverage matrix.

Query-test methodology

ThinClientQueryE2ETest runs each query shape against the same seeded data through two paths:

  1. Direct TCP — baseline path to backend partition replicas.
  2. Gateway V2 thin client — system-under-test path through the thin-client proxy.

Assertions: (1) thin-client diagnostics include a request to the :10250 proxy endpoint, (2) Direct and thin-client result counts match, (3) result contents match — preserving order for ORDER BY queries and compared as sorted sets otherwise.
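
As a rough illustration of assertion (3), the comparison rule could look like the sketch below. The class and method names (`ResultParitySketch`, `idsMatch`) are hypothetical and are not taken from ThinClientQueryE2ETest; the sketch only assumes that each path yields a list of document IDs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class ResultParitySketch {

    /**
     * Compares document IDs from the Direct (baseline) and thin-client paths.
     * ORDER BY queries must match position by position; all other queries are
     * compared after sorting, since cross-partition order is not guaranteed.
     */
    static boolean idsMatch(List<String> directIds, List<String> thinClientIds, boolean hasOrderBy) {
        if (hasOrderBy) {
            return directIds.equals(thinClientIds);
        }
        List<String> sortedDirect = new ArrayList<>(directIds);
        List<String> sortedThinClient = new ArrayList<>(thinClientIds);
        Collections.sort(sortedDirect);
        Collections.sort(sortedThinClient);
        return sortedDirect.equals(sortedThinClient);
    }

    public static void main(String[] args) {
        List<String> direct = Arrays.asList("a", "b", "c");
        List<String> thinClient = Arrays.asList("c", "a", "b");
        System.out.println("unordered match: " + idsMatch(direct, thinClient, false));
        System.out.println("ordered match:   " + idsMatch(direct, thinClient, true));
    }
}
```

Sorting copies rather than building sets keeps duplicates visible, so a duplicated ID on one path still fails the comparison.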

Assertion hardening (F1–F5)

  • F1 — strict 400 + unconditional :10250 endpoint assertion on invalid query (locks the statusCode 0 → 400 fix above; no longer tolerates 0).
  • F2 — ordered-vs-sorted-set document-ID comparison (cross-partition order is only asserted for ORDER BY).
  • F3 — document-ID set equality and duplicate detection across drained continuation pages.
  • F4 — 1e-6 numeric tolerance for scalar and GROUP BY aggregates (avoids SUM/AVG float-formatting false mismatches).
  • F5 — validated vector / full-text / full-text-score-ranking / hybrid parity end-to-end (no proxy capability gap).
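
To make F4 concrete, a tolerance-based aggregate comparison might be sketched as below. This assumes an absolute tolerance of 1e-6; the PR text does not say whether the tolerance is absolute or relative, and `AggregateToleranceSketch`, `closeEnough`, and `groupsMatch` are hypothetical names used only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

final class AggregateToleranceSketch {

    /** Assumed absolute tolerance; the source only states the value 1e-6. */
    static final double TOLERANCE = 1e-6;

    /** Scalar aggregates (SUM, AVG, ...) match when they differ by at most TOLERANCE. */
    static boolean closeEnough(double expected, double actual) {
        return Math.abs(expected - actual) <= TOLERANCE;
    }

    /** GROUP BY aggregates match when the group keys are equal and each value is close enough. */
    static boolean groupsMatch(Map<String, Double> expected, Map<String, Double> actual) {
        if (!expected.keySet().equals(actual.keySet())) {
            return false;
        }
        for (Map.Entry<String, Double> entry : expected.entrySet()) {
            if (!closeEnough(entry.getValue(), actual.get(entry.getKey()))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, Double> direct = new HashMap<>();
        direct.put("books", 0.1 + 0.2);
        Map<String, Double> thinClient = new HashMap<>();
        thinClient.put("books", 0.3);
        System.out.println("groups match: " + groupsMatch(direct, thinClient));
    }
}
```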

Query coverage validated

  • filters and projections
  • ORDER BY, DISTINCT, TOP, OFFSET / LIMIT
  • aggregates and GROUP BY
  • JOIN, EXISTS, LIKE, BETWEEN
  • string, math, type, array, and conditional functions
  • vector search, full-text ranking, hybrid RRF(...) queries
  • multi-range EPK routing
  • continuation-token draining (lowered page size)

Query feature header validation

QueryPlanRetrieverSupportedFeaturesTest verifies Java now advertises CountIf while intentionally not advertising:

  • ListAndSetAggregate — Java does not yet implement MAKELIST / MAKESET aggregation.
  • HybridSearchSkipOrderByRewrite — currently fails Java thin-client hybrid validation against staging with a backend 400 / SC1001 syntax error.

All SDK Contribution checklist:

  • The pull request does not introduce [breaking changes]
  • CHANGELOG is updated for new features, bug fixes or other significant changes.
  • I have read the contribution guidelines.

Testing Guidelines

  • Pull request includes test coverage for the included changes.

@jeet1995 (Member, Author)

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@jeet1995 changed the title from "Az cosmos gateway v2 query plan support" to "[Gateway V2 / DO NOT MERGE]: Integrate Stored Procedure and Query Plan request routing to a Gateway V2 endpoint." on Jan 29, 2026

@jeet1995 (Member, Author)

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@jeet1995 (Member, Author)

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@jeet1995 changed the title from "[Gateway V2 / DO NOT MERGE]: Integrate Stored Procedure and Query Plan request routing to a Gateway V2 endpoint." to "[Gateway V2][DO NOT MERGE]: Integrate Stored Procedure and Query Plan request routing to a Gateway V2 endpoint." on Jan 30, 2026

@xinlian12 (Member) left a comment

LGTM, thanks

@FabianMeiswinkel (Member) left a comment

LGTM - Thanks

@xinlian12 (Member)

Review complete (01:52)

Posted 3 inline comment(s).

Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage

@xinlian12 (Member)

Review complete (17:17)

Posted 2 inline comment(s).

Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

@xinlian12 (Member)

Review complete (51:55)

Posted 1 inline comment(s).

Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage

jeet1995 and others added 2 commits June 12, 2026 22:06
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

@jeet1995 (Member, Author)

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

jeet1995 and others added 2 commits June 14, 2026 00:24
readManyByPartitionKeys validates any caller-supplied custom query by
fetching a query plan and asserting it is single-partition and
non-hybrid. Until now that validation called fetchQueryPlanForValidation
with no DocumentCollection, so the plan request was pinned to Gateway V1
(useGatewayMode = (partitionKeyDefinition == null)).

Thread the container's DocumentCollection from
RxDocumentClientImpl.validateCustomQueryForReadManyByPartitionKeys ->
DocumentQueryExecutionContextFactory.fetchQueryPlanForValidation so
QueryPlanRetriever has the PartitionKeyDefinition needed to convert
PartitionKeyInternal-formatted queryRanges from the thin client (Gateway
V2) proxy into the EPK-hex Range<String> entries the query pipeline
consumes. With this wiring, the validation query plan goes to the thin
client when the client is configured for it, and remains on Gateway V1
otherwise. No behavior change on the non-thin-client path.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add ReadManyByPartitionKeyQueryPlanRoutingTest unit tests that pin the useGatewayMode gate in QueryPlanRetriever: gateway mode when DocumentCollection is null and partitioned mode when a PartitionKeyDefinition is present.

- Add three readManyByPartitionKeys E2E tests to ThinClientQueryE2ETest that exercise the validation QueryPlan path through Direct TCP (baseline) and Gateway V2 (thin client), covering no-custom-query, projection+filter, and parameterized variants. Each thin-client diagnostics page is asserted to use the :10250 endpoint.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

@jeet1995 (Member, Author)

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

QueryPlan requests intentionally carry no RCS/CL headers (matches the V1
HTTP behavior). When the V2 thin-client routes the QueryPlan precursor
through the same :10250 endpoint as the data query, the spy must skip
the QueryPlan frame so the assertion checks the actual data-query frame.

This mirrors the IS_QUERY_PLAN_REQUEST filter on the V1 path.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

@jeet1995 (Member, Author)

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

jeet1995 and others added 3 commits June 16, 2026 15:23
The thin-client/Gateway-V2 proxy returns a non-2xx response with a raw,
non-JSON, NUL-padded error body for invalid-syntax queries. In
RxGatewayStoreModel.validateOrThrow, new CosmosError(body) attempted to
parse that body as JSON and threw IllegalArgumentException, which escaped
the method before the existing status-carrying throw could run. Upstream
then wrapped it as statusCode 0.

Wrap the CosmosError(body) construction in a narrow try/catch
(IllegalArgumentException) and fall back to the non-parsing
CosmosError(errorCode, message) constructor with a sanitized body. The
existing throw now fires with the correct status (400) and the proxy
error text is preserved.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
F1: strict 400 + unconditional thin-client endpoint check on invalid query, locking the statusCode-0->400 fix.
F2: ordered-vs-sorted-set document-ID comparison (ORDER BY sequence-compared, others set-compared).
F3: ID-set equality + no-duplicate check across drained continuation pages.
F4: numeric-tolerance (1e-6) comparison for scalar and GROUP BY aggregates to avoid float-formatting false mismatches.
F5: validated vector/full-text/hybrid queries match Direct vs thin-client end-to-end through the proxy.

Validated live via -Pthinclient: 84 tests, 0 failures.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Rewrite THINCLIENT_TEST_MATRIX.md as a reviewable test-design specification reverse-engineered from the committed ThinClientQueryE2ETest code (84 tests). Documents the differential-testing oracle (Direct :443 baseline vs thin client :10250 SUT), the data-model fixture, every assertion contract (endpoint provenance, ordered-vs-unordered ID equality, scalar/GROUP BY tolerance), the full 84-test matrix, the F1-F5 hardened special cases (continuation draining, invalid-query 400, vector/FTS/hybrid ranking, readMany validation path), advertised-feature coverage, and known gaps (CountIf/DCount/MultipleOrderBy) for reviewer sign-off.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

@jeet1995 (Member, Author)

@sdkReviewAgent

@jeet1995 changed the title from "[WIP]: Integrate Stored Procedure and Query Plan request routing to Gateway V2 endpoint" to "Integrate Stored Procedure and Query Plan request routing to Gateway V2 endpoint" on Jun 16, 2026

@jeet1995 (Member, Author)

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Executes actionable review feedback on ThinClientQueryE2ETest plus two
harness fixes surfaced during live validation against thin-client-multi-region-ci.

ThinClientQueryE2ETest:
- Strict ordering: ORDER BY results validated with isStrictlyOrdered across
  the board (stricter-by-default, per reviewer guidance) instead of set parity.
- Add testMultipleOrderBy() with a composite-index container to cover the
  MultipleOrderBy query feature.
- Add testDCount() using the canonical Cosmos DCount idiom
  (COUNT over a DISTINCT VALUE subquery); SQL-standard COUNT(DISTINCT ...) is
  not valid Cosmos SQL grammar.
- Reword multi-EPK-range comment/javadoc to reflect emulator/backend reality
  (multiple metadata ranges served by a single backend partition; SDK routing
  and query pipeline still exercised).

Harness fixes:
- TestSuiteBase: add wait-and-poll utility waitForCollectionToBeAvailableToRead
  (predicate on NotFound/substatus 1013) to deflake "Collection is not yet
  available for read" on freshly created containers; update call sites in
  OrderbyDocumentQueryTest, NonStreamingOrderByQueryVectorSearchTest,
  QueryValidationTests, ReadFeedCollectionsTest.
- SinglePartitionDocumentQueryTest: make the processMessage Mockito assertion
  mode-aware (thin-client routes query plan + query => times(2)).

Validated live against thin-client-multi-region-ci: 86/86 green.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

@jeet1995 (Member, Author)

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

5 participants