
feat(storage): Enable full object checksum PR 3/3: integrate full-object checksum in AsyncMultiRangeDownloader#17263

Merged: chandra-siri merged 20 commits into main from feat/downloader-checksum-integration on Jun 4, 2026
@chandra-siri chandra-siri commented May 26, 2026

1. Overview of the Solution

This solution implements end-to-end full-object checksum validation in AsyncMultiRangeDownloader for the asynchronous Google Cloud Storage Python client library. Because asynchronous multiplexed downloads of non-contiguous ranges run concurrently over a single bidirectional gRPC connection, the downloader incrementally computes a rolling checksum as bytes arrive and validates it against the server's authoritative object checksum once the download completes.

The technical approach consists of three coordinated layers:

  • _AsyncReadObjectStream (Stream Ingestion): Safely extracts the authoritative server checksum (full_obj_server_crc32c) and finalization status (is_finalized) from the object metadata received in the first data payload response of the stream.
  • _ReadResumptionStrategy & _DownloadState (Verification Logic): Keeps an isolated, persistent rolling checksum in each _DownloadState object so that calculations do not bleed across concurrent multiplexed ranges. Crucially, the rolling hash is updated only after a buffer write succeeds, which prevents state corruption during retry reconnects. If the checksums do not match on completion, a DataCorruption exception is raised.
  • AsyncMultiRangeDownloader (Orchestration & Cleanup): Detects candidate full-object ranges (e.g., (0, 0) or (0, persisted_size)), propagates checksum settings to the resumption strategy, and guarantees robust cleanup (closing the stream immediately and unregistering IDs) if data corruption or write errors occur.
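
The verification layer described above can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the code in this PR: `RangeState`, `DataCorruptionError`, and `crc32c_update` are hypothetical names, and the pure-Python CRC32C stands in for whatever accelerated implementation the library actually uses. The point it shows is the ordering: the rolling checksum advances only after the buffer write has succeeded.

```python
import io


def _make_crc32c_table():
    # Table for the reflected Castagnoli polynomial used by CRC32C.
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _make_crc32c_table()


def crc32c_update(crc, data):
    """Extend a running CRC32C value with more bytes (start from 0)."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


class DataCorruptionError(Exception):
    """Hypothetical stand-in for the library's DataCorruption exception."""


class RangeState:
    """Hypothetical per-range state; one instance per multiplexed range."""

    def __init__(self, buffer, expected_crc32c=None):
        self.buffer = buffer
        self.expected_crc32c = expected_crc32c  # None means "do not validate"
        self.rolling_crc32c = 0
        self.bytes_written = 0

    def on_chunk(self, data):
        # Write first. If the write raises, the checksum is left untouched,
        # so a retry can resend the same bytes without double counting.
        self.buffer.write(data)
        self.rolling_crc32c = crc32c_update(self.rolling_crc32c, data)
        self.bytes_written += len(data)

    def on_complete(self):
        if self.expected_crc32c is None:
            return
        if self.rolling_crc32c != self.expected_crc32c:
            raise DataCorruptionError(
                f"expected {self.expected_crc32c:#010x}, "
                f"computed {self.rolling_crc32c:#010x}"
            )


if __name__ == "__main__":
    # 0xE3069283 is the standard CRC32C check value for b"123456789".
    state = RangeState(io.BytesIO(), expected_crc32c=0xE3069283)
    for chunk in (b"1234", b"56789"):
        state.on_chunk(chunk)
    state.on_complete()  # no exception: checksums match
```

Because each `RangeState` owns its own running value, two ranges served over the same connection cannot contaminate each other's checksum in this sketch.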

2. What This PR Specifically Does

This PR implements Step 3: Downloader Orchestration & End-to-End Integration/System Tests of the solution:

  • Relocates raise_if_no_fast_crc32c() validation to the execution phase (download_ranges()) instead of construction time.
  • Propagates stream details (is_finalized, full_obj_server_crc32c) to the resumption state dictionary.
  • Detects implicit full-object downloads ((0, 0)) or explicit full-object downloads ((0, persisted_size)) post-open(), and flags them for validation.
  • Implements the cleanup guarantee in download_ranges(): wraps execution in a try...finally block that closes the stream immediately and unregisters multiplexer range IDs when a DataCorruption exception is raised.
  • Adds integration tests in test_async_multi_range_downloader.py and extensive end-to-end system tests in test_zonal.py checking finalized, unfinalized (appendable), explicit, implicit, and bypassed range downloads against live GCS buckets.

@chandra-siri chandra-siri requested a review from a team as a code owner May 26, 2026 20:54

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces checksum validation for full object downloads in the AsyncMultiRangeDownloader by tracking rolling checksums and handling DataCorruption exceptions. It also defers the CRC32C C-extension check to the download phase. The review feedback points out an optimization opportunity in the is_full_object_read heuristic to prevent unnecessary CRC32C computations when checksums are disabled or when the object is unfinalized.
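
One way to read the suggested optimization is as a guard evaluated before any rolling CRC32C work starts. The function below is a hypothetical sketch, not the heuristic in this PR: the name `should_track_full_object_checksum` and its parameters are assumptions, and it combines the full-object range shapes named in the description, (0, 0) and (0, persisted_size), with the two early exits the review mentions.

```python
from typing import Optional


def should_track_full_object_checksum(
    offset: int,
    length: int,
    persisted_size: int,
    *,
    checksum_enabled: bool,
    is_finalized: bool,
    server_crc32c: Optional[int],
) -> bool:
    """Return True only when a rolling checksum could actually be validated."""
    # Early exits: skip the per-chunk CRC32C cost when nothing would be
    # compared at the end of the download.
    if not checksum_enabled or not is_finalized or server_crc32c is None:
        return False
    if offset != 0:
        return False
    # length == 0 means "read to the end"; otherwise the range must cover
    # exactly the persisted size of the object.
    return length == 0 or length == persisted_size


if __name__ == "__main__":
    print(
        should_track_full_object_checksum(
            0, 0, 1024, checksum_enabled=True, is_finalized=True, server_crc32c=123
        )
    )
```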

@chandra-siri chandra-siri changed the title feat(storage): integrate full-object checksum in AsyncMultiRangeDownloader feat(storage): full object checksum: integrate full-object checksum in AsyncMultiRangeDownloader May 26, 2026

Do we have a system test that fails if the validation is skipped or applied incorrectly?
If not, consider adding a test that simulates a validation failure to verify that the corruption error is bubbled up the stack.
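
A test of that shape might look like the sketch below. It does not touch the real downloader or a live bucket: `fake_stream`, `download`, `caller`, and `DataCorruptionError` are hypothetical, and `zlib.crc32` (CRC-32, not CRC32C) is used purely as a standard-library stand-in for the rolling checksum. It feeds one corrupted chunk through a fake stream and checks that the error reaches the outermost caller.

```python
import asyncio
import zlib


class DataCorruptionError(Exception):
    """Hypothetical stand-in for the library's DataCorruption exception."""


async def fake_stream(chunks):
    # Yields chunks the way a streaming response might, one at a time.
    for chunk in chunks:
        await asyncio.sleep(0)
        yield chunk


async def download(chunks, server_checksum):
    rolling = 0
    received = bytearray()
    async for chunk in fake_stream(chunks):
        received.extend(chunk)
        rolling = zlib.crc32(chunk, rolling)  # stand-in for CRC32C
    if rolling != server_checksum:
        raise DataCorruptionError(
            f"expected {server_checksum:#010x}, computed {rolling:#010x}"
        )
    return bytes(received)


async def caller(chunks, server_checksum):
    # Outer layer with no try/except, so errors must bubble through it.
    return await download(chunks, server_checksum)


def test_corruption_bubbles_up():
    good_checksum = zlib.crc32(b"hello world")

    # Sanity check: intact data passes validation.
    assert asyncio.run(caller([b"hello ", b"world"], good_checksum)) == b"hello world"

    # One flipped byte in the second chunk must surface as an exception.
    try:
        asyncio.run(caller([b"hello ", b"w0rld"], good_checksum))
    except DataCorruptionError:
        return
    raise AssertionError("corruption was not reported to the caller")


if __name__ == "__main__":
    test_corruption_bubbles_up()
    print("ok")
```

A test like this fails in both directions: if validation is skipped, the corrupted download returns normally and the final assertion fires; if validation is applied to intact data incorrectly, the sanity check fails first.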

Base automatically changed from feat/full-object-rolling-checksum to main June 4, 2026 11:22
…sum-integration

# Conflicts:
#	packages/google-cloud-storage/google/cloud/storage/asyncio/retry/reads_resumption_strategy.py
#	packages/google-cloud-storage/tests/unit/asyncio/retry/test_reads_resumption_strategy.py
@chandra-siri chandra-siri enabled auto-merge (squash) June 4, 2026 13:11
@chandra-siri chandra-siri changed the title feat(storage): full object checksum: integrate full-object checksum in AsyncMultiRangeDownloader feat(storage): Enable full object checksum PR 3/3: integrate full-object checksum in AsyncMultiRangeDownloader Jun 4, 2026
@chandra-siri chandra-siri merged commit b6a85e4 into main Jun 4, 2026
32 checks passed
@chandra-siri chandra-siri deleted the feat/downloader-checksum-integration branch June 4, 2026 13:33
sofisl added a commit that referenced this pull request Jun 11, 2026
PR created by the Librarian CLI to initialize a release. Merging this PR
will auto trigger a release.

Librarian Version: v0.19.0
Language Image:
us-central1-docker.pkg.dev/cloud-sdk-librarian-prod/images-prod/python-librarian-generator@sha256:234b9d1f2ddb057ed7ac6a38db0bf8163d839c65c6cf88ade52530cddebce59e
<details><summary>gapic-generator: v1.35.0</summary>

##
[v1.35.0](gapic-generator-v1.34.1...gapic-generator-v1.35.0)
(2026-06-11)

### Features

* setup.py matches prerelease versions (#17370)
([25b857e](25b857e1))

### Bug Fixes

* require protobuf 6.33.5 to address CVE-2026-0994 (#17349)
([6642263](66422636))

</details>


<details><summary>google-auth: v2.54.0</summary>

##
[v2.54.0](google-auth-v2.53.0...google-auth-v2.54.0)
(2026-06-11)

### Features

* implement regional access boundary support for standalone JWT and
async service accounts (#17025)
([35af616](35af6168))

### Bug Fixes

* configure mTLS for impersonated credentials (#17404)
([57269d5](57269d56))

* fail-fast on missing ECP config file to avoid 30s hang (#17377)
([e096127](e0961270))

* Rename the 'seed' argument for setting an initial regional
access boundary for clarity (#17186)
([e5c8cf9](e5c8cf92))

* update incorrect urls in setup.py to point at monorepo vs splitrepo
(#17237)
([eaed04b](eaed04ba))

</details>


<details><summary>google-cloud-alloydb: v0.11.0</summary>

##
[v0.11.0](google-cloud-alloydb-v0.10.0...google-cloud-alloydb-v0.11.0)
(2026-06-11)

### Features

* update API sources and regenerate (#17413)
([59fe7cf](59fe7cf8))

</details>


<details><summary>google-cloud-biglake: v0.5.0</summary>

##
[v0.5.0](google-cloud-biglake-v0.4.0...google-cloud-biglake-v0.5.0)
(2026-06-11)

### Features

* update API sources and regenerate (#17431)
([2e75c78](2e75c78c))

</details>


<details><summary>google-cloud-ces: v0.7.0</summary>

##
[v0.7.0](google-cloud-ces-v0.6.0...google-cloud-ces-v0.7.0)
(2026-06-11)

### Features

* update API sources and regenerate (#17413)
([59fe7cf](59fe7cf8))

</details>


<details><summary>google-cloud-confidentialcomputing: v0.11.0</summary>

##
[v0.11.0](google-cloud-confidentialcomputing-v0.10.0...google-cloud-confidentialcomputing-v0.11.0)
(2026-06-11)

### Features

* update API sources and regenerate (#17413)
([59fe7cf](59fe7cf8))

</details>


<details><summary>google-cloud-modelarmor: v0.7.0</summary>

##
[v0.7.0](google-cloud-modelarmor-v0.6.0...google-cloud-modelarmor-v0.7.0)
(2026-06-11)

### Features

* update API sources and regenerate (#17413)
([59fe7cf](59fe7cf8))

</details>


<details><summary>google-cloud-network-services: v0.10.0</summary>

##
[v0.10.0](google-cloud-network-services-v0.9.0...google-cloud-network-services-v0.10.0)
(2026-06-11)

### Features

* update API sources and regenerate (#17431)
([2e75c78](2e75c78c))

</details>


<details><summary>google-cloud-oracledatabase: v0.6.0</summary>

##
[v0.6.0](google-cloud-oracledatabase-v0.5.0...google-cloud-oracledatabase-v0.6.0)
(2026-06-11)

### Features

* update API sources and regenerate (#17413)
([59fe7cf](59fe7cf8))

</details>


<details><summary>google-cloud-spanner: v3.68.0</summary>

##
[v3.68.0](google-cloud-spanner-v3.67.0...google-cloud-spanner-v3.68.0)
(2026-06-11)

### Features

* add asynchronous code snippets and minor cleanup changes (#17337)
([d6aaf61](d6aaf610))

### Performance Improvements

* optimize query result decoding (#17375)
([3f70b2f](3f70b2ff))

</details>


<details><summary>google-cloud-storage: v3.12.0</summary>

##
[v3.12.0](google-cloud-storage-v3.11.0...google-cloud-storage-v3.12.0)
(2026-06-11)

### Features

* full object checksum: implement rolling checksum and verification in
reads resumption strategy (#17262)
([2361ba6](2361ba6e))

* Enable full object checksum PR 1/3 : parse finalize_time and server
crc32c in async object stream (#17261)
([72c7a27](72c7a272))

* full object checksum: integrate full-object checksum in
AsyncMultiRangeDownloader (#17263)
([b6a85e4](b6a85e49))

</details>


<details><summary>google-developer-knowledge: v0.1.0</summary>

##
[v0.1.0](google-developer-knowledge-v0.0.0...google-developer-knowledge-v0.1.0)
(2026-06-11)

### Features

* add google-developer-knowledge (#17417)
([ca02afc](ca02afce))

</details>
3 participants