feat: add array_sum scalar function#22542
Conversation
| } | ||
|
|
||
| // `slice` resets the logical offset to 0, so `i` below is 0-based within the slice. | ||
| let slice = values.slice(start, len); |
There was a problem hiding this comment.
Can probably get away without need a slice here, just manually iterate over the range start..end
| let mut sum = 0.0_f64; | ||
| let mut any_valid = false; | ||
| for i in 0..len { | ||
| if !slice.is_null(i) { |
There was a problem hiding this comment.
| if !slice.is_null(i) { | |
| if slice.is_valid(i) { |
…emantics Address Jefffrey's review on apache#22542: - Drop the per-row Float64Array::slice allocation; iterate start..end directly on values - Use values.is_valid(i) instead of !slice.is_null(i) Also pre-empt the empty-array question Jefffrey raised on the merged sibling apache#22703 (array_product): empty arrays now return NULL via the existing any_valid=false path, matching PostgreSQL, DuckDB list_sum, and SQL Standard SUM-of-empty-set. Updated SLT empty-array expectations and PR body accordingly. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
Thanks for the review. Addressed both asks: dropped the intermediate Also pre-emptively matched the empty-array semantics to the merged |
Adds `array_sum(array)` returning the sum of elements in a numeric array. Aliased as `list_sum`. Part of the per-function split sequence on tracking issue apache#21536, following the pattern of the already-merged PRs in this series (cosine_distance apache#21542, inner_product apache#21861, array_normalize apache#22013, array_scale apache#22466). Semantics: - NULL row in array -> NULL row out - NULL elements are skipped (SQL aggregate convention; matches PostgreSQL array_sum, DuckDB list_sum, Spark aggregate). A row whose every element is NULL yields NULL. - Empty array -> 0.0 (additive identity, matches SQL SUM over no rows conceptually, and DuckDB list_sum([]) = 0) Input is List/LargeList/FixedSizeList of any numeric type; elements are coerced to Float64. Output is Float64.
…emantics Address Jefffrey's review on apache#22542: - Drop the per-row Float64Array::slice allocation; iterate start..end directly on values - Use values.is_valid(i) instead of !slice.is_null(i) Also pre-empt the empty-array question Jefffrey raised on the merged sibling apache#22703 (array_product): empty arrays now return NULL via the existing any_valid=false path, matching PostgreSQL, DuckDB list_sum, and SQL Standard SUM-of-empty-set. Updated SLT empty-array expectations and PR body accordingly. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
After rebasing onto apache/main, the scalar_functions.md merge conflict was resolved by taking main's version. Re-ran dev/update_function_docs.sh to reinsert the array_sum entry (with NULL empty-array semantics). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Jefffrey
left a comment
There was a problem hiding this comment.
should be good once conflicts resolved
The round-2 refactor changed the empty-array behavior from 0.0 to NULL in the Rust code and SLT, but missed the #[user_doc] description macro which still claimed "Returns 0.0 for an empty array." That stale text propagated through dev/update_function_docs.sh into scalar_functions.md. Updated the docstring to match the actual behavior; re-ran the doc regen to refresh scalar_functions.md. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
Thanks for the approval. Just rebased onto |
## Which issue does this PR close? Partial of apache#21536 — `array_sum` (first of the array aggregates in the series). ## Rationale for this change Continues the per-function split sequence requested by @alamb on apache#21536. Four sibling PRs already merged: `cosine_distance` (apache#21542), `inner_product` (apache#21861), `array_normalize` (apache#22013), `array_scale` (apache#22466). `array_add` is in flight as apache#22459 by @SubhamSinghal. `array_sum` is the first of the three array-aggregate functions (sum, product, avg). Its semantics set the pattern for the other two aggregates. ## What changes are included in this PR? - New scalar UDF `array_sum(array)` in `datafusion/functions-nested/src/array_sum.rs` - Module wire-up + registration in `datafusion/functions-nested/src/lib.rs` - SLT tests at `datafusion/sqllogictest/test_files/array_sum.slt` - Auto-generated docs entry in `docs/source/user-guide/sql/scalar_functions.md` **Signature:** \`List/LargeList/FixedSizeList<numeric>\` in, \`Float64\` out (one scalar per row). Numeric inner types coerced to \`Float64\`. **NULL semantics — SQL aggregate convention (deliberate divergence from binary-op siblings):** - NULL row → NULL row out - NULL elements are **skipped**, matching PostgreSQL \`array_sum\`, DuckDB \`list_sum\`, Spark \`aggregate\`. Binary-op siblings (\`inner_product\`, \`array_normalize\`) null-row on NULL element because their per-element operation is undefined on NULL; aggregates conventionally skip NULLs in SQL. - All-NULL row → NULL out (matches \`SUM(...)\` over an all-NULL column) - **Empty array → NULL** (matches sibling `array_product` apache#22703, PostgreSQL, DuckDB `list_sum`, SQL Standard SUM-of-empty-set) **Alias:** \`list_sum\` (matches the precedent of \`array_normalize\`→\`list_normalize\`, \`array_scale\`→\`list_scale\`). ## Are these changes tested? Yes. SLT covers happy paths, empty arrays, NULL row, NULL elements (mix + all-NULL), all list variants (List/LargeList/FixedSizeList), numeric coercion (Float32/Int64/integer literals), multi-row composition, error paths, return type, and the \`list_sum\` alias. ## Are there any user-facing changes? Yes — new SQL scalar function \`array_sum(array)\` and its alias \`list_sum\`. --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Which issue does this PR close?
Partial of #21536 —
array_sum(first of the array aggregates in the series).Rationale for this change
Continues the per-function split sequence requested by @alamb on #21536. Four sibling PRs already merged:
cosine_distance(#21542),inner_product(#21861),array_normalize(#22013),array_scale(#22466).array_addis in flight as #22459 by @SubhamSinghal.array_sumis the first of the three array-aggregate functions (sum, product, avg). Its semantics set the pattern for the other two aggregates.What changes are included in this PR?
array_sum(array)indatafusion/functions-nested/src/array_sum.rsdatafusion/functions-nested/src/lib.rsdatafusion/sqllogictest/test_files/array_sum.sltdocs/source/user-guide/sql/scalar_functions.mdSignature: `List/LargeList/FixedSizeList` in, `Float64` out (one scalar per row). Numeric inner types coerced to `Float64`.
NULL semantics — SQL aggregate convention (deliberate divergence from binary-op siblings):
NULL row → NULL row out
NULL elements are skipped, matching PostgreSQL `array_sum`, DuckDB `list_sum`, Spark `aggregate`. Binary-op siblings (`inner_product`, `array_normalize`) null-row on NULL element because their per-element operation is undefined on NULL; aggregates conventionally skip NULLs in SQL.
All-NULL row → NULL out (matches `SUM(...)` over an all-NULL column)
Empty array → NULL (matches sibling
array_productAddarray_productUDF #22703, PostgreSQL, DuckDBlist_sum, SQL Standard SUM-of-empty-set)Alias: `list_sum` (matches the precedent of `array_normalize`→`list_normalize`, `array_scale`→`list_scale`).
Are these changes tested?
Yes. SLT covers happy paths, empty arrays, NULL row, NULL elements (mix + all-NULL), all list variants (List/LargeList/FixedSizeList), numeric coercion (Float32/Int64/integer literals), multi-row composition, error paths, return type, and the `list_sum` alias.
Are there any user-facing changes?
Yes — new SQL scalar function `array_sum(array)` and its alias `list_sum`.