While playing with saturating_mul to see if it can be made branchless and thus a const fn (sorry @RalfJung ;) ), I found that the performance of signed saturating_mul can actually be improved.
I tested two implementations:
-
pub fn saturating_mul(a: i32, b: i32) -> i32 {
a.saturating_mul(b)
}
-
#[inline]
const fn cond_if_else(cond: bool, a: i32, b: i32) -> i32 {
// If cond is false: not_mask is -1 == all ones
// If cond is true: not_mask is 0 == all zeros
let not_mask = (cond as i32).wrapping_sub(1);
// If cond is false: (a & !all_ones) | (b & all_ones) == b
// If cond is true: (a & !all_zeros) | (b & all_zeros) == a
(a & !not_mask) | (b & not_mask)
}
pub const fn saturating_mul(a: i32, b: i32) -> i32 {
let (val, overflow) = a.overflowing_mul(b);
cond_if_else(
!overflow,
val,
cond_if_else(
(a < 0) != (b < 0),
i32::min_value(),
i32::max_value(),
),
)
}
The cond_if_else function acts kind of like C's ternary operator, including its constness, but works only for integers.
The second case seems to have a reciprocal throughput better by a factor of about 1.5 according to llvm-mca: https://godbolt.org/z/6PnCwB
For unsigned, the second case can become:
pub const fn saturating_mul(a: u32, b: u32) -> u32 {
let (val, overflow) = a.overflowing_mul(b);
cond_if_else(!overflow, val, u32::max_value())
}
In this case, there is no performance improvement and LLVM reduces this to the same IR, so the only advantage here would be constness. https://godbolt.org/z/0Fs0AD
I found some similar improvements for saturating_sub, even though currently the implementation uses intrinsics::saturating_sub; I found it a bit weird that the intrinsic performs worse. The reciprocal throughput can be improved by a factor of 1.13: https://godbolt.org/z/tcZgeE
My concern is that since the LLVM IR is different, even though llvm-mca indicates the difference is an improvement, there might be some case I don't know about where the performance regresses. Maybe there are other platforms where the saturating intrinsics result better performance? I think @nikic has a better understanding of these concerns.
While playing with
saturating_multo see if it can be made branchless and thus a const fn (sorry @RalfJung ;) ), I found that the performance of signedsaturating_mulcan actually be improved.I tested two implementations:
The
cond_if_elsefunction acts kind of like C's ternary operator, including its constness, but works only for integers.The second case seems to have a reciprocal throughput better by a factor of about 1.5 according to llvm-mca: https://godbolt.org/z/6PnCwB
For unsigned, the second case can become:
In this case, there is no performance improvement and LLVM reduces this to the same IR, so the only advantage here would be constness. https://godbolt.org/z/0Fs0AD
I found some similar improvements for
saturating_sub, even though currently the implementation usesintrinsics::saturating_sub; I found it a bit weird that the intrinsic performs worse. The reciprocal throughput can be improved by a factor of 1.13: https://godbolt.org/z/tcZgeEMy concern is that since the LLVM IR is different, even though llvm-mca indicates the difference is an improvement, there might be some case I don't know about where the performance regresses. Maybe there are other platforms where the saturating intrinsics result better performance? I think @nikic has a better understanding of these concerns.