feat: support arrow pycapsule streams by abnobdoss · Pull Request #3447 · apache/iceberg-python

abnobdoss · 2026-05-31T20:45:15Z

Closes #2680
Closes #1655

Rationale for this change

PyIceberg is coupled to PyArrow at its read/write boundary: append/overwrite reject anything that isn't a pa.Table or pa.RecordBatchReader, and external Arrow consumers can't read a table or scan without first calling to_arrow(). Users of other Arrow-native libraries (polars, arro3, nanoarrow, …) therefore have to convert to PyArrow explicitly.

This PR adopts the Arrow PyCapsule interface on both sides:

Input (#2680, "Allow Arrow Capsule Interface"): append/overwrite accept any object implementing __arrow_c_stream__, in addition to PyArrow types.
Output (#1655, "[feature] Investigate integrations leveraging the PyCapsule protocol"): Table and DataScan implement __arrow_c_stream__, so they can be handed to any Arrow consumer.

import polars as pl

df = pl.DataFrame(table.scan())     # read: a scan is an Arrow producer
table.append(some_polars_frame)     # write: a polars/arro3/… frame is too

PyArrow inputs are unchanged; other producers are wrapped as a streaming RecordBatchReader. PyArrow remains an internal write dependency; only the caller-side requirement is removed.

Side effect: bin-packing now falls back to referenced buffer size for Arrow view types (e.g. string_view) that PyArrow can't size via nbytes, since recent Polars exports produce them.
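As a rough illustration of the fallback idea (not the code in this PR), the sketch below tries nbytes first and falls back to get_total_buffer_size(). Both are real attributes on PyArrow batches and tables, but the stub classes, the estimated_size helper, and the assumption that the failure surfaces as NotImplementedError are all hypothetical and chosen so the example runs without PyArrow installed.

```python
class PlainBatchStub:
    """Stand-in for a batch whose logical size is computable."""

    nbytes = 128

    def get_total_buffer_size(self) -> int:
        return 256


class ViewBatchStub:
    """Stand-in for a batch with view-typed columns (e.g. string_view)."""

    @property
    def nbytes(self) -> int:
        # Assumption: the sizing failure is raised as NotImplementedError.
        raise NotImplementedError("nbytes is not supported for view types")

    def get_total_buffer_size(self) -> int:
        return 4096


def estimated_size(batch) -> int:
    """Prefer nbytes; fall back to the size of the referenced buffers."""
    try:
        return batch.nbytes
    except NotImplementedError:
        # Referenced buffer size can overcount shared or sliced buffers,
        # so it is an upper-bound estimate rather than an exact size.
        return batch.get_total_buffer_size()


plain_size = estimated_size(PlainBatchStub())
view_size = estimated_size(ViewBatchStub())
```

For bin-packing, an overestimate of this kind only affects how batches are grouped into files, not the data that is written.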

Not in scope:

upsert/dynamic_partition_overwrite still require a materialized pa.Table (they do random access/joins, not streaming).
Passing a PyCapsule producer to append/overwrite on a partitioned table still raises NotImplementedError, the same restriction that applies to pa.RecordBatchReader today.
Materialized pa.Table writes are unaffected either way.

Are these changes tested?

Yes. tests/table/test_arrow_capsule.py (no Docker) covers coercion, append/overwrite across all input forms, the partitioned regression, and round-trips through pa.table(). tests/io/test_pyarrow.py covers the string_view bin-packing fallback.

Are there any user-facing changes?

Yes, additive and backwards compatible. append/overwrite accept Arrow PyCapsule producers; Table/DataScan implement __arrow_c_stream__. No change for existing PyArrow inputs.

github-actions · 2026-07-01T00:53:19Z

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that's incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

Adopt the Arrow PyCapsule interface on both sides of the read/write boundary. append/overwrite accept any object implementing __arrow_c_stream__ (coerced to a streaming RecordBatchReader), and Table plus every scan expose __arrow_c_stream__ via BaseScan (not just DataScan) so any Arrow consumer can ingest them.

abnobdoss · 2026-07-01T02:31:41Z

Still active - happy to address any feedback.

kylebarron · 2026-07-01T15:49:58Z

+
+def _coerce_arrow_input(df: pa.Table | pa.RecordBatchReader | ArrowStreamExportable) -> pa.Table | pa.RecordBatchReader:
+    """Normalize Arrow write input to a pa.Table or pa.RecordBatchReader.
+
+    Native pyarrow inputs pass through unchanged; any object implementing the
+    Arrow PyCapsule stream interface (``__arrow_c_stream__``) is imported as a
+    streaming RecordBatchReader.
+    """
+    if isinstance(df, (pa.Table, pa.RecordBatchReader)):
+        return df
+
+    # Any object implementing the Arrow PyCapsule stream interface.
+    if hasattr(df, "__arrow_c_stream__"):
+        return pa.RecordBatchReader.from_stream(df)
+
+    raise ValueError(
+        f"Expected pa.Table, pa.RecordBatchReader, or an object implementing the "
+        f"Arrow PyCapsule interface (__arrow_c_stream__), got: {df!r}"
+    )


This looks to be the core change in this PR and looks valid 👍

github-actions Bot added the stale label Jul 1, 2026

abnobdoss force-pushed the arrow-pycapsule-stream branch from 271f50d to 38e61ea Compare July 1, 2026 02:01

Abanoub Doss added 2 commits June 30, 2026 21:28

test: update arrow input error expectations

6b1fbd8

abnobdoss force-pushed the arrow-pycapsule-stream branch from 38e61ea to 6b1fbd8 Compare July 1, 2026 02:29

kylebarron approved these changes Jul 1, 2026

abnobdoss wants to merge 2 commits into apache:main from abnobdoss:arrow-pycapsule-stream
