Skip to content

fix(upsert): bound file match filter to avoid segfault#3420

Open
abnobdoss wants to merge 1 commit into
apache:mainfrom
abnobdoss:fix/upsert-file-match-filter
Open

fix(upsert): bound file match filter to avoid segfault#3420
abnobdoss wants to merge 1 commit into
apache:mainfrom
abnobdoss:fix/upsert-file-match-filter

Conversation

@abnobdoss
Copy link
Copy Markdown

@abnobdoss abnobdoss commented May 27, 2026

Rationale for this change

This fixes the initial upsert scan-planning segfault / recursion failure caused by large composite-key match filters. It should also improve planning performance because the current exact PyIceberg expression tree is expensive to construct, bind, and evaluate for large source batches.

For composite join keys, the initial upsert scan currently builds one And(EqualTo(...), EqualTo(...)) predicate per unique source key and reduces them into a left-leaning Or tree. Large source batches therefore create deep recursive expression trees before the target scan starts; expression visitors then walk Or.left / Or.right recursively, so stack usage scales with the number of unique key combinations.

This change adds a bounded create_file_match_filter for the initial target file-pruning scan. It uses per-column min/max bounds, plus IsNull where needed, so the initial scan predicate size is tied to the number of join columns rather than the number of source rows. The exact create_match_filter is still used later for overwrite and insert filtering, so row-level correctness is unchanged.

The tradeoff is that the bounded file-pruning predicate is conservative: for composite keys it can scan cross-product false positives, which downstream exact matching culls before update/insert decisions. This can technically increase the number of files scanned, but the current failure mode makes upsert impossible for affected large composite-key inputs. The intent of this PR is to remove that crash first; if there is appetite to improve pruning further, a follow-up can use a more selective bounding-box or more sophisticated algorithm while preserving the bounded expression shape.

Match-filter timings below include expression construction, schema binding, PyArrow expression conversion, and applying the resulting Arrow filter to an in-memory table. These do not measure end-to-end upsert time, which depends on table layout, file pruning, IO, and downstream exact matching.

Benchmark script: https://gist.github.com/abnobdoss/824698ea74c29c02ead8da6410b89b6e

Current exact match-filter path:

Unique source keys 1 join column 2 join columns 3 join columns
100 1.5 ms 7.3 ms 11 ms
1k 4.7 ms 87 ms 147 ms
10k 33 ms SIGSEGV SIGSEGV
100k 650 ms SIGSEGV SIGSEGV
1m 6.59 s >30 s >30 s

Bounded file-match filter path after this PR:

Unique source keys 1 join column 2 join columns 3 join columns
100 1.0 ms 1.1 ms 1.2 ms
1k 1.0 ms 1.1 ms 1.2 ms
10k 1.1 ms 1.1 ms 1.3 ms
100k 1.3 ms 1.6 ms 1.9 ms
1m 3.7 ms 6.8 ms 9.0 ms

Follow-up work should address the remaining exact match-filter recursion paths used later during overwrite and insert filtering. Those predicates need exact semantics, so they should be fixed separately.

Are these changes tested?

Yes.

Added regression coverage for the large composite-key initial scan path, plus focused coverage for bounded file-match filter construction and the conservative false-positive culling path.

Ran:

  • uv run python -m pytest tests/table/test_upsert.py -k "large_composite_key_initial_scan or file_match_filter or create_match_filter"
  • uv run python -m pytest tests/table/test_upsert.py
  • commit hooks

The added test_upsert_large_composite_key_initial_scan_does_not_recurse regression test runs the large composite-key upsert in a subprocess so runtimes that would segfault report a pytest failure instead of terminating the pytest worker; the equivalent base-branch path exits 139 under Python 3.10, while this branch exits successfully.

Are there any user-facing changes?

No API changes. Upserts with large composite-key source batches should avoid the initial scan-planning recursion failure and may scan a conservative superset of candidate files before exact row matching.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant