fix: Forward storage_options to parquet metadata reads#353
Open
Mattijs De Paepe (mattijsdp) wants to merge 7 commits into
Open
fix: Forward storage_options to parquet metadata reads#353Mattijs De Paepe (mattijsdp) wants to merge 7 commits into
Mattijs De Paepe (mattijsdp) wants to merge 7 commits into
Conversation
Schema/collection/failure-info reads passed `storage_options` (and `credential_provider`) to the data read via `pl.read_parquet` / `pl.scan_parquet`, but the separate embedded-schema metadata read called `pl.read_parquet_metadata` with no options. Against non-AWS S3-compatible stores reached purely through `storage_options` (lakeFS, MinIO, R2, Tigris, …) the metadata read fell back to the default AWS credential chain and endpoint, breaking typed reads. Thread the storage-related options into all metadata reads in `_storage/parquet.py` via a small `_metadata_read_options` helper. Fixes Quantco#352 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
`read_parquet_metadata_schema` and `read_parquet_metadata_collection` read parquet metadata from a (possibly remote) source but accepted no options, so they could not reach non-AWS S3-compatible stores either. Accept and forward `**kwargs` (e.g. `storage_options`, `credential_provider`) to `pl.read_parquet_metadata`, matching `read_parquet`/`scan_parquet`. Add s3-marked regression tests covering both helpers. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…retries Address review feedback on the storage-options forwarding fix: - Match the file's docstring convention for the public metadata helpers: drop the enumerated `storage_options`/`credential_provider` note and use the same terse "passed directly to :meth:`polars.read_parquet_metadata`" wording as `read_parquet`/`scan_parquet`. - Forward `retries` alongside `storage_options`/`credential_provider` in `_metadata_read_options`, since `read_parquet_metadata` accepts it and it is storage-reaching. Clarify in the docstring why the call sites must filter the scan/read kwargs (the narrower `read_parquet_metadata` signature rejects options like `n_rows`/`columns`) instead of forwarding everything. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- Shrink `_metadata_read_options` to a one-line comment matching the other private helpers in the module (no oversized docstring). - Extract an `s3_storage_options` fixture (mirrors `s3_tmp_path`, but strips the AWS_* env vars so the store is reachable *only* via `storage_options`) and use it across the schema, collection and failure-info regression tests. - Add a failure-info regression test covering the `scan_failure_info` metadata read, and split the schema test so the typed read and the standalone `read_parquet_metadata_schema` helper are asserted independently. - Drop the end-to-end collection typed-read test: it cannot pass via `storage_options` alone because member discovery goes through fsspec (`url_to_fs`/`fs.exists`), which does not receive `storage_options` -- a separate limitation from the polars metadata read this PR fixes. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Documentation should stand on its own, so remove the issue-tracker links from the `s3_storage_options` fixture and the storage-options regression tests, and shrink their docstrings to a single line in line with the surrounding tests. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Contributor
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR ensures Parquet metadata reads (schema/collection/failure info) correctly forward storage_options so S3-backed files work even when credentials/endpoints are provided only via storage_options.
Changes:
- Forward metadata read options to
polars.read_parquet_metadataacross schema/collection/failure-info code paths. - Add an
s3_storage_optionsfixture that stripsAWS_*env vars to validate forwarding behavior. - Add S3-marked regression tests covering schema, collection, and failure-info metadata reads.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/schema/test_read_write_parquet.py | Adds S3 regression tests ensuring schema metadata reads forward storage_options. |
| tests/failure_info/test_storage.py | Adds S3 regression test ensuring failure-info metadata reads forward storage_options. |
| tests/conftest.py | Introduces s3_storage_options fixture that requires passing options explicitly. |
| tests/collection/test_storage.py | Adds S3 regression test ensuring collection metadata reads forward storage_options. |
| dataframely/schema.py | Allows forwarding kwargs (e.g., storage_options) to pl.read_parquet_metadata. |
| dataframely/collection/collection.py | Allows forwarding kwargs (e.g., storage_options) to pl.read_parquet_metadata. |
| dataframely/_storage/parquet.py | Forwards filtered read options to metadata reads and centralizes filtering logic. |
Comment on lines
+253
to
+261
| def _metadata_read_options(kwargs: dict[str, Any]) -> dict[str, Any]: | ||
| # Forward only the options that `read_parquet_metadata` accepts (it has a narrower | ||
| # signature than `read_parquet`/`scan_parquet`) so the metadata read reaches the | ||
| # same store as the data read. | ||
| return { | ||
| key: kwargs[key] | ||
| for key in ("storage_options", "credential_provider", "retries") | ||
| if key in kwargs | ||
| } |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #353 +/- ##
=========================================
Coverage 100.00% 100.00%
=========================================
Files 56 56
Lines 3458 3461 +3
=========================================
+ Hits 3458 3461 +3 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation
Fixes #352.
Schema.read_parquet/scan_parquet(plus collection and failure-info reads) failed against S3-compatible stores reached viastorage_options. The data read gotstorage_options, but the separate embedded-metadata read didn't — so polars fell back to the default AWS endpoint/credentials.Changes
_storage/parquet.py— forward the storage optionspl.read_parquet_metadataaccepts (storage_options,credential_provider,retries) to every metadata read, via a small_metadata_read_options(kwargs)helper. Only present keys are forwarded, so defaults are unchanged.read_parquet_metadata_schema(schema.py) andread_parquet_metadata_collection(collection/collection.py) now accept and forward**kwargs.Note: limited to polars parquet-metadata reads (covers the issue's single-file repro). A
storage_options-only collection read still fails earlier, at fsspec-based member discovery (url_to_fs/glob), which uses a different option vocabulary than polars and would need a per-backend translation I think.