fix(upsert): bound file match filter to avoid segfault by abnobdoss · Pull Request #3420 · apache/iceberg-python

abnobdoss · 2026-05-27T00:58:58Z

Rationale for this change

This fixes the initial upsert scan-planning segfault / recursion failure caused by large composite-key match filters. It should also improve planning performance because the current exact PyIceberg expression tree is expensive to construct, bind, and evaluate for large source batches.

For composite join keys, the initial upsert scan currently builds one And(EqualTo(...), EqualTo(...)) predicate per unique source key and reduces them into a left-leaning Or tree. Large source batches therefore create deep recursive expression trees before the target scan starts; expression visitors then walk Or.left / Or.right recursively, so stack usage scales with the number of unique key combinations.

This change adds a bounded create_file_match_filter for the initial target file-pruning scan. It uses per-column min/max bounds, plus IsNull where needed, so the initial scan predicate size is tied to the number of join columns rather than the number of source rows. The exact create_match_filter is still used later for overwrite and insert filtering, so row-level correctness is unchanged.

The tradeoff is that the bounded file-pruning predicate is conservative: for composite keys it can scan cross-product false positives, which downstream exact matching culls before update/insert decisions. This can technically increase the number of files scanned, but the current failure mode makes upsert impossible for affected large composite-key inputs. The intent of this PR is to remove that crash first; if there is appetite to improve pruning further, a follow-up can use a more selective bounding-box or more sophisticated algorithm while preserving the bounded expression shape.

Match-filter timings below include expression construction, schema binding, PyArrow expression conversion, and applying the resulting Arrow filter to an in-memory table. These do not measure end-to-end upsert time, which depends on table layout, file pruning, IO, and downstream exact matching.

Benchmark script: https://gist.github.com/abnobdoss/824698ea74c29c02ead8da6410b89b6e

Current exact match-filter path:

Unique source keys	1 join column	2 join columns	3 join columns
100	1.5 ms	7.3 ms	11 ms
1k	4.7 ms	87 ms	147 ms
10k	33 ms	SIGSEGV	SIGSEGV
100k	650 ms	SIGSEGV	SIGSEGV
1m	6.59 s	>30 s	>30 s

Bounded file-match filter path after this PR:

Unique source keys	1 join column	2 join columns	3 join columns
100	1.0 ms	1.1 ms	1.2 ms
1k	1.0 ms	1.1 ms	1.2 ms
10k	1.1 ms	1.1 ms	1.3 ms
100k	1.3 ms	1.6 ms	1.9 ms
1m	3.7 ms	6.8 ms	9.0 ms

Follow-up work should address the remaining exact match-filter recursion paths used later during overwrite and insert filtering. Those predicates need exact semantics, so they should be fixed separately.

Are these changes tested?

Yes.

Added regression coverage for the large composite-key initial scan path, plus focused coverage for bounded file-match filter construction and the conservative false-positive culling path.

Ran:

uv run python -m pytest tests/table/test_upsert.py -k "large_composite_key_initial_scan or file_match_filter or create_match_filter"
uv run python -m pytest tests/table/test_upsert.py
commit hooks

The added test_upsert_large_composite_key_initial_scan_does_not_recurse regression test runs the large composite-key upsert in a subprocess so runtimes that would segfault report a pytest failure instead of terminating the pytest worker; the equivalent base-branch path exits 139 under Python 3.10, while this branch exits successfully.

Are there any user-facing changes?

No API changes. Upserts with large composite-key source batches should avoid the initial scan-planning recursion failure and may scan a conservative superset of candidate files before exact row matching.

fix(upsert): use bounded file match filter to avoid scan recursion

bbfc654

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(upsert): bound file match filter to avoid segfault#3420

fix(upsert): bound file match filter to avoid segfault#3420
abnobdoss wants to merge 1 commit into
apache:mainfrom
abnobdoss:fix/upsert-file-match-filter

abnobdoss commented May 27, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

abnobdoss commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rationale for this change

Are these changes tested?

Are there any user-facing changes?

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

abnobdoss commented May 27, 2026 •

edited

Loading