Skip to content

fix: Forward storage_options to parquet metadata reads#353

Open
Mattijs De Paepe (mattijsdp) wants to merge 7 commits into
Quantco:mainfrom
mattijsdp:fix/parquet-metadata-storage-options
Open

fix: Forward storage_options to parquet metadata reads#353
Mattijs De Paepe (mattijsdp) wants to merge 7 commits into
Quantco:mainfrom
mattijsdp:fix/parquet-metadata-storage-options

Conversation

@mattijsdp
Copy link
Copy Markdown

@mattijsdp Mattijs De Paepe (mattijsdp) commented Jun 2, 2026

Motivation

Fixes #352.

Schema.read_parquet/scan_parquet (plus collection and failure-info reads) failed against S3-compatible stores reached via storage_options. The data read got storage_options, but the separate embedded-metadata read didn't — so polars fell back to the default AWS endpoint/credentials.

Changes

  • _storage/parquet.py — forward the storage options pl.read_parquet_metadata accepts (storage_options, credential_provider, retries) to every metadata read, via a small _metadata_read_options(kwargs) helper. Only present keys are forwarded, so defaults are unchanged.
  • read_parquet_metadata_schema (schema.py) and read_parquet_metadata_collection (collection/collection.py) now accept and forward **kwargs.

Note: limited to polars parquet-metadata reads (covers the issue's single-file repro). A storage_options-only collection read still fails earlier, at fsspec-based member discovery (url_to_fs/glob), which uses a different option vocabulary than polars and would need a per-backend translation I think.

Schema/collection/failure-info reads passed `storage_options` (and
`credential_provider`) to the data read via `pl.read_parquet` /
`pl.scan_parquet`, but the separate embedded-schema metadata read called
`pl.read_parquet_metadata` with no options. Against non-AWS S3-compatible
stores reached purely through `storage_options` (lakeFS, MinIO, R2, Tigris,
…) the metadata read fell back to the default AWS credential chain and
endpoint, breaking typed reads.

Thread the storage-related options into all metadata reads in
`_storage/parquet.py` via a small `_metadata_read_options` helper.

Fixes Quantco#352

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
`read_parquet_metadata_schema` and `read_parquet_metadata_collection` read
parquet metadata from a (possibly remote) source but accepted no options, so
they could not reach non-AWS S3-compatible stores either. Accept and forward
`**kwargs` (e.g. `storage_options`, `credential_provider`) to
`pl.read_parquet_metadata`, matching `read_parquet`/`scan_parquet`.

Add s3-marked regression tests covering both helpers.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…retries

Address review feedback on the storage-options forwarding fix:

- Match the file's docstring convention for the public metadata helpers: drop
  the enumerated `storage_options`/`credential_provider` note and use the same
  terse "passed directly to :meth:`polars.read_parquet_metadata`" wording as
  `read_parquet`/`scan_parquet`.
- Forward `retries` alongside `storage_options`/`credential_provider` in
  `_metadata_read_options`, since `read_parquet_metadata` accepts it and it is
  storage-reaching. Clarify in the docstring why the call sites must filter the
  scan/read kwargs (the narrower `read_parquet_metadata` signature rejects
  options like `n_rows`/`columns`) instead of forwarding everything.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- Shrink `_metadata_read_options` to a one-line comment matching the other
  private helpers in the module (no oversized docstring).
- Extract an `s3_storage_options` fixture (mirrors `s3_tmp_path`, but strips the
  AWS_* env vars so the store is reachable *only* via `storage_options`) and use
  it across the schema, collection and failure-info regression tests.
- Add a failure-info regression test covering the `scan_failure_info` metadata
  read, and split the schema test so the typed read and the standalone
  `read_parquet_metadata_schema` helper are asserted independently.
- Drop the end-to-end collection typed-read test: it cannot pass via
  `storage_options` alone because member discovery goes through fsspec
  (`url_to_fs`/`fs.exists`), which does not receive `storage_options` -- a
  separate limitation from the polars metadata read this PR fixes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Documentation should stand on its own, so remove the issue-tracker links from
the `s3_storage_options` fixture and the storage-options regression tests, and
shrink their docstrings to a single line in line with the surrounding tests.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR ensures Parquet metadata reads (schema/collection/failure info) correctly forward storage_options so S3-backed files work even when credentials/endpoints are provided only via storage_options.

Changes:

  • Forward metadata read options to polars.read_parquet_metadata across schema/collection/failure-info code paths.
  • Add an s3_storage_options fixture that strips AWS_* env vars to validate forwarding behavior.
  • Add S3-marked regression tests covering schema, collection, and failure-info metadata reads.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
tests/schema/test_read_write_parquet.py Adds S3 regression tests ensuring schema metadata reads forward storage_options.
tests/failure_info/test_storage.py Adds S3 regression test ensuring failure-info metadata reads forward storage_options.
tests/conftest.py Introduces s3_storage_options fixture that requires passing options explicitly.
tests/collection/test_storage.py Adds S3 regression test ensuring collection metadata reads forward storage_options.
dataframely/schema.py Allows forwarding kwargs (e.g., storage_options) to pl.read_parquet_metadata.
dataframely/collection/collection.py Allows forwarding kwargs (e.g., storage_options) to pl.read_parquet_metadata.
dataframely/_storage/parquet.py Forwards filtered read options to metadata reads and centralizes filtering logic.

Comment thread tests/failure_info/test_storage.py
Comment thread tests/failure_info/test_storage.py
Comment on lines +253 to +261
def _metadata_read_options(kwargs: dict[str, Any]) -> dict[str, Any]:
# Forward only the options that `read_parquet_metadata` accepts (it has a narrower
# signature than `read_parquet`/`scan_parquet`) so the metadata read reaches the
# same store as the data read.
return {
key: kwargs[key]
for key in ("storage_options", "credential_provider", "retries")
if key in kwargs
}
@mattijsdp Mattijs De Paepe (mattijsdp) marked this pull request as draft June 2, 2026 19:04
@mattijsdp Mattijs De Paepe (mattijsdp) changed the title Fix/parquet metadata storage options fix: Forward storage_options to parquet metadata reads Jun 2, 2026
@github-actions github-actions Bot added the fix label Jun 2, 2026
@mattijsdp Mattijs De Paepe (mattijsdp) marked this pull request as ready for review June 2, 2026 20:25
@codecov
Copy link
Copy Markdown

codecov Bot commented Jun 2, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 100.00%. Comparing base (50e18a8) to head (5df2685).

Additional details and impacted files
@@            Coverage Diff            @@
##              main      #353   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           56        56           
  Lines         3458      3461    +3     
=========================================
+ Hits          3458      3461    +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Parquet metadata read ignores storage_options, breaking typed reads from non-AWS S3 stores

2 participants