Skip to content

feat: support arrow pycapsule streams#3447

Open
abnobdoss wants to merge 2 commits into
apache:mainfrom
abnobdoss:arrow-pycapsule-stream
Open

feat: support arrow pycapsule streams#3447
abnobdoss wants to merge 2 commits into
apache:mainfrom
abnobdoss:arrow-pycapsule-stream

Conversation

@abnobdoss

@abnobdoss abnobdoss commented May 31, 2026

Copy link
Copy Markdown
Contributor

Closes #2680
Closes #1655

Rationale for this change

PyIceberg is coupled to PyArrow at its read/write boundary: append / overwrite reject anything that isn't a pa.Table / pa.RecordBatchReader, and external Arrow consumers can't read a table/scan without to_arrow(). Users of other Arrow-native libraries (polars, arro3, nanoarrow, …) therefore have to convert to PyArrow explicitly.

This PR adopts the Arrow PyCapsule interface on both sides:

import polars as pl

df = pl.DataFrame(table.scan())     # read: a scan is an Arrow producer
table.append(some_polars_frame)     # write: a polars/arro3/… frame is too

PyArrow inputs are unchanged; other producers are wrapped as a streaming RecordBatchReader. PyArrow remains an internal write dependency, only the caller-side requirement is removed.

Side effect: bin-packing now falls back to referenced buffer size for Arrow view types (e.g. string_view) that PyArrow can't size via nbytes, since recent Polars exports produce them.

Not in scope:

  • upsert/dynamic_partition_overwrite still require a materialized pa.Table (they do random access/joins, not streaming).
  • A PyCapsule producer to append/overwrite on a partitioned table still raises NotImplementedError, the same restriction as pa.RecordBatchReader today.
  • Materialized pa.Table writes are unaffected either way.

Are these changes tested?

Yes. tests/table/test_arrow_capsule.py (no Docker) covers coercion, append/overwrite across all input forms, the partitioned regression, and round-trips through pa.table(). tests/io/test_pyarrow.py covers the string_view bin-packing fallback.

Are there any user-facing changes?

Yes, additive and backwards compatible. append/overwrite accept Arrow PyCapsule producers; Table/DataScan implement __arrow_c_stream__. No change for existing PyArrow inputs.

@github-actions

github-actions Bot commented Jul 1, 2026

Copy link
Copy Markdown

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that's incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions Bot added the stale label Jul 1, 2026
@abnobdoss abnobdoss force-pushed the arrow-pycapsule-stream branch from 271f50d to 38e61ea Compare July 1, 2026 02:01
Abanoub Doss added 2 commits June 30, 2026 21:28
Adopt the Arrow PyCapsule interface on both sides of the read/write
boundary. append/overwrite accept any object implementing
__arrow_c_stream__ (coerced to a streaming RecordBatchReader), and
Table plus every scan expose __arrow_c_stream__ via BaseScan (not just
DataScan) so any Arrow consumer can ingest them.
@abnobdoss abnobdoss force-pushed the arrow-pycapsule-stream branch from 38e61ea to 6b1fbd8 Compare July 1, 2026 02:29
@abnobdoss

Copy link
Copy Markdown
Contributor Author

Still active - happy to address any feedback.

Comment thread pyiceberg/io/pyarrow.py
Comment on lines +3064 to +3082

def _coerce_arrow_input(df: pa.Table | pa.RecordBatchReader | ArrowStreamExportable) -> pa.Table | pa.RecordBatchReader:
"""Normalize Arrow write input to a pa.Table or pa.RecordBatchReader.

Native pyarrow inputs pass through unchanged; any object implementing the
Arrow PyCapsule stream interface (``__arrow_c_stream__``) is imported as a
streaming RecordBatchReader.
"""
if isinstance(df, (pa.Table, pa.RecordBatchReader)):
return df

# Any object implementing the Arrow PyCapsule stream interface.
if hasattr(df, "__arrow_c_stream__"):
return pa.RecordBatchReader.from_stream(df)

raise ValueError(
f"Expected pa.Table, pa.RecordBatchReader, or an object implementing the "
f"Arrow PyCapsule interface (__arrow_c_stream__), got: {df!r}"
)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks to be the core change in this PR and looks valid 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Allow Arrow Capsule Interface [feature] Investigate integrations leveraging the PyCapsule protocol

2 participants