performance of saturating_mul can be improved by removing branches

While playing with `saturating_mul` to see if it can be made branchless and thus a const fn (sorry @RalfJung ;) ), I found that the performance of signed `saturating_mul` can actually be improved.

I tested two implementations:
 1. ```rust
    pub fn saturating_mul(a: i32, b: i32) -> i32 {
        a.saturating_mul(b)
    }
    ```
 2. ```rust
    #[inline]
    const fn cond_if_else(cond: bool, a: i32, b: i32) -> i32 {
        // If cond is false: not_mask is -1 == all ones
        // If cond is true: not_mask is 0 == all zeros
        let not_mask = (cond as i32).wrapping_sub(1);
        // If cond is false: (a & !all_ones) | (b & all_ones) == b
        // If cond is true: (a & !all_zeros) | (b & all_zeros) == a
        (a & !not_mask) | (b & not_mask)
    }
    
    pub const fn saturating_mul(a: i32, b: i32) -> i32 {
        let (val, overflow) = a.overflowing_mul(b);
        cond_if_else(
            !overflow,
            val,
            cond_if_else(
                (a < 0) != (b < 0),
                i32::min_value(),
                i32::max_value(),
            ),
        )
    }
    ```

The `cond_if_else` function acts kind of like C's ternary operator, including its constness, but works only for integers.

The second case seems to have a reciprocal throughput better by a factor of about 1.5 according to llvm-mca: https://godbolt.org/z/6PnCwB

For unsigned, the second case can become:
```rust
pub const fn saturating_mul(a: u32, b: u32) -> u32 {
    let (val, overflow) = a.overflowing_mul(b);
    cond_if_else(!overflow, val, u32::max_value())
}
```
In this case, there is no performance improvement and LLVM reduces this to the same IR, so the only advantage here would be constness. https://godbolt.org/z/0Fs0AD

I found some similar improvements for `saturating_sub`, even though currently the implementation uses `intrinsics::saturating_sub`; I found it a bit weird that the intrinsic performs worse. The reciprocal throughput can be improved by a factor of 1.13: https://godbolt.org/z/tcZgeE

My concern is that since the LLVM IR is different, even though llvm-mca indicates the difference is an improvement, there might be some case I don't know about where the performance regresses. Maybe there are other platforms where the saturating intrinsics result better performance? I think @nikic has a better understanding of these concerns.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

performance of saturating_mul can be improved by removing branches #65309

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

Uh oh!

performance of saturating_mul can be improved by removing branches #65309

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions