Support different scales for decimal binary math functions by theirix · Pull Request #19874 · apache/datafusion

theirix · 2026-01-18T15:39:00Z

Which issue does this PR close?

Closes #19621 (Improve scale support for binary decimal operations).

Rationale for this change

A helper calculate_binary_math and UDFs relying on it could behave strangely if the scales of inputs and outputs are different. Original logic didn't fully handle it.

So let's introduce calculate_binary_math_decimal and calculate_binary_math_numeric functions with a proper handling of arguments of different scales and type casting for input and output arguments.

They supersede calculate_binary_math and calculate_binary_decimal_math because they have a slightly different functor signature that automatically passes the effective precision and scale (even if rescaled). The rest is compatible.

What changes are included in this PR?

New functions
Port existing UDFs to new functions

Are these changes tested?

Existing unit tests
SLTs

Are there any user-facing changes?

Older functions could be deprecated. Since they are part of the public interface of datafusion-functions, I just placed a comment without a full-fledged deprecate macro. Whether it should be used is up for discussion.

Introduce calculate_binary_math_decimal and calculate_binary_math_numeric functions with proper handling of arguments of different scales and type casting. They supersede calculate_binary_math and calculate_binary_decimal_math due to having a slightly different functor signature that automatically passes the effective precision and scale (even if rescaled).

theirix · 2026-01-18T16:54:18Z

Recent related changes: #18525 and #19384 . Epic #18889

Jefffrey · 2026-01-24T13:33:03Z

I'm having trouble understanding the rationale here; log, power and round at most have one decimal input, and only round preserves the decimal type whereas the others will return floats anyway. So all this handling for getting precision/scale of left/right inputs, adjusting the scale of the output decimal, seems unused?

Also for round we have a PR relating to altering precision/scale:

fix: increase ROUND decimal precision to prevent overflow truncation #19926

theirix · 2026-01-25T15:03:51Z

> I'm having trouble understanding the rationale here; log, power and round at most have one decimal input, and only round preserves the decimal type whereas the others will return floats anyway. So all this handling for getting precision/scale of left/right inputs, adjusting the scale of the output decimal, seems unused?

For most of these functions, I agree: only one argument is decimal, so we execute the decimal/non-decimal case, and the code that adjusts both scales is never exercised.

Overall, I can see the following benefits:

simplifying the caller code by delegating non-obvious logic to this helper
for the decimal/non-decimal case, it casts input and output types, so the parameter scale is not lost when operating on a native type
for decimal outputs, it removes the burden of setting the precision and scale on the output
for array cases, it is hard to manually write the code that scales to the output type (not always the default Decimal(38,10)), so cast_array_to does it for the caller
providing the effective scale and precision of a type to the user-provided kernel without capturing them from the arguments

So, symmetric functions like gcd/lcm (WIP), mod, div, etc benefit most from this PR.
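To make the last benefit concrete, here is a minimal sketch of a helper that rescales both operands to a common scale and hands the effective scale to the kernel. It is not the code from this PR: `Dec`, `binary_decimal`, and the kernel signature are hypothetical names chosen for illustration, and it works on single `i128` values rather than Arrow arrays.

```rust
/// A decimal stored as an unscaled integer plus a scale, so the logical
/// value is `unscaled / 10^scale`. Hypothetical type for illustration only.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Dec {
    unscaled: i128,
    scale: u32,
}

/// Hypothetical helper (not the DataFusion API): bring both operands to the
/// larger of the two scales, then call the kernel with that effective scale.
/// Returns `None` if rescaling overflows or the kernel itself fails.
fn binary_decimal<F>(left: Dec, right: Dec, kernel: F) -> Option<Dec>
where
    F: Fn(i128, i128, u32) -> Option<i128>,
{
    let scale = left.scale.max(right.scale);
    // `scale` is the maximum, so neither subtraction can underflow.
    let l = left
        .unscaled
        .checked_mul(10i128.checked_pow(scale - left.scale)?)?;
    let r = right
        .unscaled
        .checked_mul(10i128.checked_pow(scale - right.scale)?)?;
    let unscaled = kernel(l, r, scale)?;
    Some(Dec { unscaled, scale })
}
```

With this shape, a `mod` kernel is just `|l, r, _scale| l.checked_rem(r)`: for 1.50 (unscaled 150, scale 2) and 0.4 (unscaled 4, scale 1) the operands become 150 and 40 at scale 2, and the result is unscaled 30 at scale 2, that is 0.30. The kernel never has to inspect its arguments' types to learn the scale.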

At first glance, the log function can be greatly simplified by dropping the unscale_to_* calls in kernels and the extra casting.

For pow, it was originally Dec x Float -> Dec, but since the fallback to the float version was introduced in #19369 (I am still thinking about when it is necessary), it is less relevant.

> Also for round we have a PR relating to altering precision/scale:
> * [fix: increase ROUND decimal precision to prevent overflow truncation #19926](https://github.com/apache/datafusion/pull/19926)

Yes, I discovered it recently. I am wondering whether it could also be simplified using this PR, since it handles different input and output precisions and scales.

Jefffrey · 2026-01-29T09:10:45Z

I agree log can be greatly simplified, and it's something I'd love to get around to eventually (same for power), however I'm not sure if this is the way to do it; the code introduced here at a glance is not trivial and it doesn't solve any use case at hand (in that nothing is being simplified by this PR).

Perhaps it would be easier to see the use case if one of the functions which could benefit from this is refactored to use it in a way that has clear benefits over the existing calculate_binary_decimal_math? As this PR is, it looks like substituting the function being called, but not much else changes for the callers?

theirix · 2026-02-11T23:06:49Z

Hello @Jefffrey,

I agree, there aren't many functions that support both decimal types. Most likely, it will be only gcd and lcm, aside from user-defined functions. For other functions, the changes are mostly neutral.

The rationale for this PR is to support these symmetric functions for both decimal types and avoid scaling issues, as shown below.

The implementation of gcd (lcm can be done similarly) with this PR is shown here: f9ffc82...theirix:datafusion:gcd-decimal-new-api. It uses the new API and the new arrow num_traits support, so decimals are treated like any other numbers.

When I tried to implement it using the old function calculate_binary_decimal_math, I immediately encountered a problem: it doesn't take scale into account, so the result is incorrect. In short, select gcd(2::decimal(38, 0), 3::decimal(38, 0)); failed because the proper input decimal (3::Decimal) was cast to the raw arrow type Decimal128Type, which abruptly applied the default scale DECIMAL_DEFAULT_SCALE=10, so the value became 30000000000. This doesn't happen if the second argument is a non-decimal number (pow, log, etc). Logs:

```
[2026-02-09T22:17:42Z INFO datafusion_functions::math::gcd] invoke gcd with PrimitiveArray [ 2, ] and Scalar(Decimal128(Some(3),38,0))
[2026-02-09T22:17:42Z INFO datafusion_functions::utils] Calculating binary math with left PrimitiveArray [ 2, ] and right Scalar(Decimal128(Some(30000000000),38,10))
[2026-02-09T22:17:42Z INFO datafusion_functions::math::gcd] euclid_gcd_unsigned a=2 b=30000000000 -> Ok(2)
[2026-02-09T22:17:42Z INFO datafusion_functions::utils] Calculating binary math with result PrimitiveArray [ 2, ]
```

This implementation avoids this bug by properly analysing scale, so rescaling doesn't have to be handled in the UDF code. If you check the implementation closely, it uses rescaling only for the decimal-decimal case. Other cases have code almost identical to the old function.
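The arithmetic behind the failing query can be reproduced with plain integers. The sketch below is only an illustration: `euclid_gcd` and `rescale` are hypothetical stand-ins, not the functions from the PR branch, and they operate on the unscaled `i128` values that back a Decimal128.

```rust
/// Euclidean GCD on unscaled decimal values (plain integers).
/// Assumes inputs are not `i128::MIN`, whose absolute value overflows.
fn euclid_gcd(mut a: i128, mut b: i128) -> i128 {
    while b != 0 {
        let r = a % b;
        a = b;
        b = r;
    }
    a.abs()
}

/// Re-express an unscaled value at a larger scale. Returns `None` on
/// overflow or if the target scale is smaller than the source scale.
fn rescale(unscaled: i128, from_scale: u32, to_scale: u32) -> Option<i128> {
    let factor = 10i128.checked_pow(to_scale.checked_sub(from_scale)?)?;
    unscaled.checked_mul(factor)
}
```

Casting only the right operand from scale 0 to scale 10 gives `rescale(3, 0, 10) == Some(30_000_000_000)`, and `euclid_gcd(2, 30_000_000_000)` is 2, matching the log. When both operands stay at scale 0, `euclid_gcd(2, 3)` is 1, which is the expected answer.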

I can see two approaches for implementing decimal-decimal math functions:

1. (as in this PR). Fix the utility function to be scale-aware to prevent correctness issues. It has more code, but handles all cases properly with casting, scaling, and error checking. UDFs remain simple.
2. Handle it per-UDF. Move this logic to decimal-decimal UDFs (gcd, lcm). It keeps the utils function lean. At the same time, it adds more complex logic on the UDF side, for both of them. It may also lead to similar bugs and edge cases.

I favour the first approach for consistency, but I'm happy to refactor with the second approach if you find it more suitable.

github-actions · 2026-04-27T02:16:50Z

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

theirix · 2026-04-27T21:29:42Z

@Jefffrey, I would appreciate hearing your thoughts on this.

theirix · 2026-06-16T06:23:45Z

I decided to go with option (2), supporting different scales per UDF, since it has a limited scope of only two UDFs. This PR is overkill, so I am closing it.

Done in #22655

## Which issue does this PR close?

- Closes apache#19057.

## Rationale for this change

A binary gcd and lcm UDF in the datafusion-functions crate supports only Int64, but not Decimals. Adding missing support for decimals.

## What changes are included in this PR?

1. Updated gcd and lcm functions to add decimal support. The integer path is more performant and stays intact. For decimals, the Euclidean algorithm is used for GCD
2. Added coercion rules: casting to decimals if any argument is decimal; otherwise, stay with ints as before
3. Common functionality extracted to `common.rs` to avoid inter-UDF dependency
4. In order to use `calculate_binary_math` for Decimals, updated it to accept a target type instead of raw `Decimal128Type::DATA_TYPE` - it causes scaling issues for these UDFs, see apache#19621

A bit more on (4). The driving force is this failing example:

```sql
query R
select gcd(2::decimal(38, 0), 3::decimal(38, 0));
----
1
```

Previously in apache#19874, I suggested a more complicated solution to extend `calculate_binary_math`. However, it only affected gcd/lcm and could be considered overkill. This PR extends these functions with an extra parameter `cast_target` for `calculate_binary_decimal_math` to perform a proper cast to the actual type used, rather than to the default `Decimal128Type::DATA_TYPE` - it is much lighter.

## Are these changes tested?

- Added unit test for UDFs with decimals for array and scalar paths
- Added unit tests for the gcd/lcm math itself
- Added new SLT tests for decimals

## Are there any user-facing changes?

No
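As a rough illustration of the decimal math mentioned in item 1, the sketch below computes an LCM from a Euclidean GCD on unscaled `i128` values. It assumes both operands already share the same scale, and the function names are hypothetical; the merged implementation in #22655 may be structured differently.

```rust
use std::convert::TryFrom;

/// Euclidean GCD on magnitudes, so negative inputs need no special casing.
fn gcd(mut a: u128, mut b: u128) -> u128 {
    while b != 0 {
        let r = a % b;
        a = b;
        b = r;
    }
    a
}

/// LCM of two unscaled decimal values assumed to share one scale.
/// Divides by the GCD before multiplying to keep the intermediate small,
/// and returns `None` if the result does not fit in `i128`.
fn lcm(a: i128, b: i128) -> Option<i128> {
    let (ua, ub) = (a.unsigned_abs(), b.unsigned_abs());
    if ua == 0 || ub == 0 {
        return Some(0);
    }
    let product = (ua / gcd(ua, ub)).checked_mul(ub)?;
    i128::try_from(product).ok()
}
```

For example, `lcm(4, 6)` yields `Some(12)`, while `lcm(i128::MAX, 2)` yields `None` because the product no longer fits in the backing integer. How such an overflow is surfaced to the user (error or wider precision) is a design choice this sketch does not address.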

theirix added 3 commits January 18, 2026 14:58

Update power, log, round with new helpers

3c2cd10

Add more SLTs

31ade61

github-actions Bot added the sqllogictest (SQL Logic Tests (.slt)) and functions (Changes to functions implementation) labels Jan 18, 2026

theirix marked this pull request as ready for review January 18, 2026 16:40

Correct length for null arrays

f9ffc82

Jefffrey self-requested a review January 23, 2026 03:40

theirix mentioned this pull request Mar 12, 2026

Fix decimal log precision for non-power values #20433

Merged

github-actions Bot added the Stale (PR has not had any activity for some time) label Apr 27, 2026

Jefffrey removed the Stale (PR has not had any activity for some time) label Apr 27, 2026

theirix added 3 commits May 24, 2026 13:33

Merge branch 'main' into decimal-math-rework

06bb679

Merge fixes

47d7a6a

Unmerge submodules

c1e5a0e

theirix mentioned this pull request May 30, 2026

feat: decimal support for gcd and lcm #22655

Merged

theirix closed this Jun 16, 2026

theirix wants to merge 7 commits into apache:main from theirix:decimal-math-rework
