feat: support arrow pycapsule streams#3447
Open
abnobdoss wants to merge 2 commits into
Open
Conversation
|
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that's incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions. |
271f50d to
38e61ea
Compare
added 2 commits
June 30, 2026 21:28
Adopt the Arrow PyCapsule interface on both sides of the read/write boundary. append/overwrite accept any object implementing __arrow_c_stream__ (coerced to a streaming RecordBatchReader), and Table plus every scan expose __arrow_c_stream__ via BaseScan (not just DataScan) so any Arrow consumer can ingest them.
38e61ea to
6b1fbd8
Compare
Contributor
Author
|
Still active - happy to address any feedback. |
kylebarron
approved these changes
Jul 1, 2026
Comment on lines
+3064
to
+3082
|
|
||
| def _coerce_arrow_input(df: pa.Table | pa.RecordBatchReader | ArrowStreamExportable) -> pa.Table | pa.RecordBatchReader: | ||
| """Normalize Arrow write input to a pa.Table or pa.RecordBatchReader. | ||
|
|
||
| Native pyarrow inputs pass through unchanged; any object implementing the | ||
| Arrow PyCapsule stream interface (``__arrow_c_stream__``) is imported as a | ||
| streaming RecordBatchReader. | ||
| """ | ||
| if isinstance(df, (pa.Table, pa.RecordBatchReader)): | ||
| return df | ||
|
|
||
| # Any object implementing the Arrow PyCapsule stream interface. | ||
| if hasattr(df, "__arrow_c_stream__"): | ||
| return pa.RecordBatchReader.from_stream(df) | ||
|
|
||
| raise ValueError( | ||
| f"Expected pa.Table, pa.RecordBatchReader, or an object implementing the " | ||
| f"Arrow PyCapsule interface (__arrow_c_stream__), got: {df!r}" | ||
| ) |
Member
There was a problem hiding this comment.
This looks to be the core change in this PR and looks valid 👍
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #2680
Closes #1655
Rationale for this change
PyIceberg is coupled to PyArrow at its read/write boundary:
append/overwritereject anything that isn't apa.Table/pa.RecordBatchReader, and external Arrow consumers can't read a table/scan withoutto_arrow(). Users of other Arrow-native libraries (polars, arro3, nanoarrow, …) therefore have to convert to PyArrow explicitly.This PR adopts the Arrow PyCapsule interface on both sides:
append/overwriteaccept any object implementing__arrow_c_stream__, in addition to PyArrow types.TableandDataScanimplement__arrow_c_stream__, so they can be handed to any Arrow consumer.PyArrow inputs are unchanged; other producers are wrapped as a streaming
RecordBatchReader. PyArrow remains an internal write dependency, only the caller-side requirement is removed.Side effect: bin-packing now falls back to referenced buffer size for Arrow view types (e.g.
string_view) that PyArrow can't size vianbytes, since recent Polars exports produce them.Not in scope:
upsert/dynamic_partition_overwritestill require a materializedpa.Table(they do random access/joins, not streaming).append/overwriteon a partitioned table still raisesNotImplementedError, the same restriction aspa.RecordBatchReadertoday.pa.Tablewrites are unaffected either way.Are these changes tested?
Yes.
tests/table/test_arrow_capsule.py(no Docker) covers coercion,append/overwriteacross all input forms, the partitioned regression, and round-trips throughpa.table().tests/io/test_pyarrow.pycovers thestring_viewbin-packing fallback.Are there any user-facing changes?
Yes, additive and backwards compatible.
append/overwriteaccept Arrow PyCapsule producers;Table/DataScanimplement__arrow_c_stream__. No change for existing PyArrow inputs.