refactor: remove `arrow-ord` dependency in `arrow-cast` by Weijun-H · Pull Request #8716 · apache/arrow-rs

Weijun-H · 2025-10-27T09:16:11Z

Which issue does this PR close?

Closes #8708 (Remove arrow-ord dependency in arrow-cast due to RunEndEncoded casting)

Rationale for this change

arrow-cast currently depends on arrow-ord only to use the generic partition kernel when casting arrays to RunEndEncoded. This is more dependency surface than needed for this path: REE encoding only needs to find boundaries between consecutive equal values.

This PR removes the arrow-ord dependency from arrow-cast and replaces the partition call with local run-boundary computation tailored to cast-to-REE.

What changes are included in this PR?

Removes arrow-ord from arrow-cast dependencies.
Replaces partition(&[cast_array]) with compute_run_boundaries in cast_to_run_end_encoded.
Adds specialized run-boundary scans for primitive, boolean, binary/string, fixed-size binary, and dictionary arrays.
Keeps a generic ArrayData fallback for less common types.
Uses downcast_primitive_array! to avoid repetitive primitive type dispatch.
Removes thin StringView / BinaryView wrappers that only called the generic fallback.
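
To illustrate what "finding boundaries between consecutive equal values" means, here is a minimal sketch on plain slices. It is not the PR's compute_run_boundaries: the names run_ends and run_starts are hypothetical, and Option<T> is used here only as a stand-in for nullable values (two adjacent nulls count as equal, so they fall in the same run).

```rust
/// Hypothetical helper (not from the PR): returns the exclusive end index of
/// each run of consecutive equal values.
fn run_ends<T: PartialEq>(values: &[T]) -> Vec<usize> {
    let mut ends = Vec::new();
    for i in 1..values.len() {
        // A run ends wherever a value differs from its predecessor.
        if values[i] != values[i - 1] {
            ends.push(i);
        }
    }
    // The last run always ends at the array length (if there is any data).
    if !values.is_empty() {
        ends.push(values.len());
    }
    ends
}

/// Hypothetical helper: the first index of each run, i.e. where the value
/// for that run can be read from.
fn run_starts(ends: &[usize]) -> Vec<usize> {
    let mut starts = Vec::with_capacity(ends.len());
    let mut start = 0;
    for &end in ends {
        starts.push(start);
        start = end;
    }
    starts
}
```

For the input [1, 1, 2, 2, 2, 3] the run ends are [2, 5, 6] and the run starts are [0, 2, 5], which are the two pieces an REE array needs: the run-end buffer and the indexes of the values to keep.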

Are these changes tested?

Yes

Are there any user-facing changes?

No. This is an internal dependency and implementation change.

vegarsti · 2025-10-27T11:40:42Z

Could you try running the benchmark in this PR #8710 and see what the difference is? I thought cast_array.slice would be doing a clone, but it's not, so this might be quite fast.

alamb · 2025-10-27T19:08:32Z

I just reviewed the benchmark in #8710 (Add benchmark for casting to RunEndEncoded (REE)) and I think it looks good to go. I'll merge it in and then run the benchmarks on this PR.

alamb

Thanks @Weijun-H and @vegarsti

alamb · 2025-10-27T19:10:06Z

+    values_indexes.push(0);
+    let mut current_data = array.slice(0, 1).to_data();
+    for idx in 1..array.len() {
+        let next_data = array.slice(idx, 1).to_data();


I think this is likely to be substantially slower than what partition does, but we can see what the benchmarks show

alamb · 2025-10-27T19:11:23Z

@@ -134,16 +134,8 @@ pub(crate) fn cast_to_run_end_encoded<K: RunEndIndexType>(
        ));
    }

-    // Partition the array to identify runs of consecutive equal values
-    let partitions = partition(&[Arc::clone(cast_array)])?;
-    let mut run_ends = Vec::new();


I looked briefly at a profile for this function -- I think we could make it substantially faster by reducing allocations with a pre-sized vector here (use partitions.count_ones() to know how many partitions are needed)

Oh, great idea!

Side note: how did you profile this? Did you use samply (it looks like it), build with cargo build --profile profiling, and run e.g. a unit test?

I used Instruments, which is part of Xcode on macOS -- it is pretty sweet, as it does whole-system profiling (fire it up, start recording, and it gathers the info for all processes)

Pushed your suggestion as a PR here: #8716, maybe you can run the benchmark on that too? 😇
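
The pre-sizing idea from this thread can be sketched without any arrow types. This is an assumption-laden model, not the code in arrow-cast: the boundary bitmap is represented as a plain &[bool] where boundaries[i] is true when value i differs from value i + 1, and run_ends_presized is a hypothetical name. Counting the set entries first gives the exact number of runs, so the output vector is allocated once instead of growing as it is filled.

```rust
/// Hypothetical helper: builds the run-end vector with a single allocation.
/// `len` is the number of values; `boundaries` has `len - 1` entries, and
/// `boundaries[i]` is true when value `i` differs from value `i + 1`.
fn run_ends_presized(len: usize, boundaries: &[bool]) -> Vec<usize> {
    if len == 0 {
        return Vec::new();
    }
    assert_eq!(boundaries.len(), len - 1);

    // Number of runs = number of boundaries + 1 (the final run).
    let run_count = boundaries.iter().filter(|&&b| b).count() + 1;

    // Allocate once, up front, instead of letting the vector grow.
    let mut run_ends = Vec::with_capacity(run_count);
    for (i, &is_boundary) in boundaries.iter().enumerate() {
        if is_boundary {
            run_ends.push(i + 1);
        }
    }
    run_ends.push(len);

    debug_assert_eq!(run_ends.len(), run_count);
    run_ends
}
```

For six values with boundaries [false, true, false, false, true], the count is 2 + 1 = 3 runs and the result is [2, 5, 6].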

alamb · 2025-10-27T19:24:59Z

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1017-gcp #18~24.04.1-Ubuntu SMP Tue Sep 23 17:51:44 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing 8708-remove-arrow-ord-in-arrow-cast (70b24d19012b3ba909e8c610ca84185de37278fd) to 62df32e diff
BENCH_NAME=cast_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench cast_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=8708-remove-arrow-ord-in-arrow-cast
Results will be posted here when complete

alamb · 2025-10-27T19:43:08Z

🤖: Benchmark completed

Details

group                                                              8708-remove-arrow-ord-in-arrow-cast     main
-----                                                              -----------------------------------     ----
cast binary view to string                                         1.00     68.6±0.30µs        ? ?/sec     1.07     73.4±0.31µs        ? ?/sec
cast binary view to string view                                    1.23    115.8±0.39µs        ? ?/sec     1.00     93.9±0.32µs        ? ?/sec
cast binary view to wide string                                    1.15     74.4±0.28µs        ? ?/sec     1.00     64.8±0.34µs        ? ?/sec
cast date32 to date64 512                                          1.00    293.1±0.84ns        ? ?/sec     1.03    301.6±1.53ns        ? ?/sec
cast date64 to date32 512                                          1.00    501.4±1.12ns        ? ?/sec     1.01    505.7±1.89ns        ? ?/sec
cast decimal128 to decimal128 512                                  1.01    610.2±0.91ns        ? ?/sec     1.00    604.3±1.45ns        ? ?/sec
cast decimal128 to decimal128 512 lower precision                  1.00      5.1±0.02µs        ? ?/sec     1.01      5.1±0.03µs        ? ?/sec
cast decimal128 to decimal128 512 with lower scale (infallible)    1.01      6.9±0.03µs        ? ?/sec     1.00      6.8±0.02µs        ? ?/sec
cast decimal128 to decimal128 512 with same scale                  1.07     81.1±0.12ns        ? ?/sec     1.00     75.8±0.16ns        ? ?/sec
cast decimal128 to decimal256 512                                  1.00      2.3±0.01µs        ? ?/sec     1.01      2.3±0.02µs        ? ?/sec
cast decimal256 to decimal128 512                                  1.00     48.6±0.14µs        ? ?/sec     1.00     48.5±0.34µs        ? ?/sec
cast decimal256 to decimal256 512                                  1.00     11.1±0.03µs        ? ?/sec     1.02     11.3±0.03µs        ? ?/sec
cast decimal256 to decimal256 512 with same scale                  1.09     82.2±0.20ns        ? ?/sec     1.00     75.7±0.20ns        ? ?/sec
cast decimal32 to decimal32 512                                    1.00      2.3±0.01µs        ? ?/sec     1.14      2.6±0.01µs        ? ?/sec
cast decimal32 to decimal32 512 lower precision                    1.00      2.8±0.03µs        ? ?/sec     1.02      2.8±0.01µs        ? ?/sec
cast decimal32 to decimal64 512                                    1.10    347.3±3.82ns        ? ?/sec     1.00    316.7±0.93ns        ? ?/sec
cast decimal64 to decimal32 512                                    1.01      3.0±0.01µs        ? ?/sec     1.00      3.0±0.01µs        ? ?/sec
cast decimal64 to decimal64 512                                    1.01    381.5±2.23ns        ? ?/sec     1.00    376.8±0.63ns        ? ?/sec
cast dict to string view                                           1.00     52.3±0.21µs        ? ?/sec     1.03     53.8±0.12µs        ? ?/sec
cast f32 to string 512                                             1.06     19.1±0.68µs        ? ?/sec     1.00     18.1±0.04µs        ? ?/sec
cast f64 to string 512                                             1.00     21.8±0.05µs        ? ?/sec     1.04     22.6±0.07µs        ? ?/sec
cast float32 to int32 512                                          1.00   1564.8±3.69ns        ? ?/sec     1.00   1560.3±4.08ns        ? ?/sec
cast float64 to float32 512                                        1.01   1088.4±3.42ns        ? ?/sec     1.00   1077.6±5.59ns        ? ?/sec
cast float64 to uint64 512                                         1.01   1769.7±5.39ns        ? ?/sec     1.00   1754.0±2.46ns        ? ?/sec
cast i64 to string 512                                             1.02     14.7±0.12µs        ? ?/sec     1.00     14.4±0.04µs        ? ?/sec
cast int32 to float32 512                                          1.02   1065.8±2.83ns        ? ?/sec     1.00   1047.8±4.28ns        ? ?/sec
cast int32 to float64 512                                          1.01   1071.1±4.97ns        ? ?/sec     1.00   1056.6±2.00ns        ? ?/sec
cast int32 to int32 512                                            1.01    201.1±1.01ns        ? ?/sec     1.00    198.7±0.45ns        ? ?/sec
cast int32 to int64 512                                            1.00   1084.5±1.52ns        ? ?/sec     1.08   1167.5±4.21ns        ? ?/sec
cast int32 to uint32 512                                           1.03   1517.6±5.18ns        ? ?/sec     1.00   1466.4±3.60ns        ? ?/sec
cast int64 to int32 512                                            1.00   1562.5±2.39ns        ? ?/sec     1.08   1684.9±3.44ns        ? ?/sec
cast no runs of int32s to ree<int32>                               18.65  1452.8±3.65µs        ? ?/sec     1.00     77.9±0.40µs        ? ?/sec
cast runs of 10 string to ree<int32>                               83.20  1357.7±3.84µs        ? ?/sec     1.00     16.3±0.08µs        ? ?/sec
cast runs of 1000 int32s to ree<int32>                             164.19  1339.8±2.22µs        ? ?/sec    1.00      8.2±0.04µs        ? ?/sec
cast string single run to ree<int32>                               57.34  1568.5±2.43µs        ? ?/sec     1.00     27.4±0.08µs        ? ?/sec
cast string to binary view 512                                     1.00      3.2±0.01µs        ? ?/sec     1.02      3.3±0.01µs        ? ?/sec
cast string view to binary view                                    1.00     97.8±0.20ns        ? ?/sec     1.00     97.3±0.20ns        ? ?/sec
cast string view to dict                                           1.00    173.5±0.35µs        ? ?/sec     1.04    180.1±0.30µs        ? ?/sec
cast string view to string                                         1.00     48.2±0.11µs        ? ?/sec     1.02     49.1±0.52µs        ? ?/sec
cast string view to wide string                                    1.00     48.4±0.16µs        ? ?/sec     1.07     51.8±0.22µs        ? ?/sec
cast time32s to time32ms 512                                       1.01    288.3±1.04ns        ? ?/sec     1.00    285.8±0.40ns        ? ?/sec
cast time32s to time64us 512                                       1.01    292.1±0.30ns        ? ?/sec     1.00    290.6±0.86ns        ? ?/sec
cast time64ns to time32s 512                                       1.00    503.3±4.33ns        ? ?/sec     1.01    507.9±0.76ns        ? ?/sec
cast timestamp_ms to i64 512                                       1.05    453.0±2.06ns        ? ?/sec     1.00    433.1±1.14ns        ? ?/sec
cast timestamp_ms to timestamp_ns 512                              1.06      2.3±0.02µs        ? ?/sec     1.00      2.2±0.00µs        ? ?/sec
cast timestamp_ns to timestamp_s 512                               1.01    200.6±1.05ns        ? ?/sec     1.00    197.7±0.35ns        ? ?/sec
cast utf8 to date32 512                                            1.00     11.4±0.02µs        ? ?/sec     1.00     11.4±0.04µs        ? ?/sec
cast utf8 to date64 512                                            1.08     46.3±0.08µs        ? ?/sec     1.00     42.8±0.16µs        ? ?/sec
cast utf8 to f32                                                   1.00     11.5±0.09µs        ? ?/sec     1.01     11.7±0.03µs        ? ?/sec
cast wide string to binary view 512                                1.00      5.6±0.01µs        ? ?/sec     1.00      5.6±0.01µs        ? ?/sec

alamb · 2025-10-27T19:48:03Z

cast no runs of int32s to ree<int32>      18.65   1452.8±3.65µs   ? ?/sec   1.00   77.9±0.40µs   ? ?/sec
cast runs of 10 string to ree<int32>      83.20   1357.7±3.84µs   ? ?/sec   1.00   16.3±0.08µs   ? ?/sec
cast runs of 1000 int32s to ree<int32>    164.19  1339.8±2.22µs   ? ?/sec   1.00    8.2±0.04µs   ? ?/sec
cast string single run to ree<int32>      57.34   1568.5±2.43µs   ? ?/sec   1.00   27.4±0.08µs   ? ?/sec

As @vegarsti predicted, this PR appears to be quite a bit slower than using partition

Weijun-H · 2025-10-28T08:22:26Z

FYI @vegarsti, @alamb: after several rounds of optimization, the current version is significantly faster than the previous one.

Type-specialized dispatch: compute_run_boundaries now routes each physical layout (boolean, primitive scalars, binary/string, etc.) to a dedicated helper, allowing most arrays to bypass the slow, generic ArrayData comparison path.
Chunked primitive scanning: The no-null primitive path uses scan_run_end, which compares 16 bytes at a time via u128 loads. When a chunk differs, it falls back to scalar iteration, reducing branches and bounds checks in the hot loop.
Targeted use of unsafe for performance: Tight loops leverage get_unchecked, from_raw_parts, and read_unaligned to eliminate redundant bounds and alignment checks. Each unsafe block includes detailed safety comments describing the invariants upheld.
Generic fallback: Less common types still rely on ArrayData equality but reuse the shared accumulator to produce consistent run and value outputs, without special-casing memory management.
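
The chunked scan described above can be approximated in safe Rust. This is only a sketch of the idea, not the PR's scan_run_end: it is fixed to i32, assumes no nulls, and packs four values into a u128 with copies rather than using read_unaligned or get_unchecked, so it shows the comparison strategy rather than the exact performance characteristics.

```rust
/// Sketch (not the PR's implementation): returns the exclusive end of the
/// run that begins at `start`, comparing four i32 values (16 bytes) at a time.
/// Panics if `start` is out of bounds.
fn scan_run_end(values: &[i32], start: usize) -> usize {
    let v = values[start];
    let mut i = start + 1;

    // Build a 128-bit pattern holding four copies of the run value.
    let mut pattern_bytes = [0u8; 16];
    for lane in pattern_bytes.chunks_exact_mut(4) {
        lane.copy_from_slice(&v.to_ne_bytes());
    }
    let pattern = u128::from_ne_bytes(pattern_bytes);

    // Fast path: one 128-bit comparison covers four elements.
    while i + 4 <= values.len() {
        let mut chunk_bytes = [0u8; 16];
        for (lane, x) in chunk_bytes.chunks_exact_mut(4).zip(&values[i..i + 4]) {
            lane.copy_from_slice(&x.to_ne_bytes());
        }
        if u128::from_ne_bytes(chunk_bytes) != pattern {
            break; // The run ends somewhere inside this chunk.
        }
        i += 4;
    }

    // Slow path: scalar comparison locates the exact end of the run
    // (also handles the tail shorter than four elements).
    while i < values.len() && values[i] == v {
        i += 1;
    }
    i
}
```

Calling it repeatedly, each time starting at the previous return value, walks the array run by run; long runs spend most of their time in the 16-byte fast path, while arrays with no runs drop to the scalar loop almost immediately.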

cast string single run to ree<int32>
                        time:   [23.143 µs 23.180 µs 23.224 µs]
                        change: [−8.5926% −6.6138% −5.2622%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  9 (9.00%) high severe

cast runs of 10 string to ree<int32>
                        time:   [4.4857 µs 4.4924 µs 4.4999 µs]
                        change: [−35.582% −32.807% −30.598%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe

cast runs of 1000 int32s to ree<int32>
                        time:   [1.9651 µs 1.9923 µs 2.0449 µs]
                        change: [−35.958% −34.582% −33.095%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

cast no runs of int32s to ree<int32>
                        time:   [27.745 µs 28.013 µs 28.291 µs]
                        change: [−27.957% −27.305% −26.645%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  14 (14.00%) high mild

alamb · 2025-10-28T19:33:30Z

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1017-gcp #18~24.04.1-Ubuntu SMP Tue Sep 23 17:51:44 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing 8708-remove-arrow-ord-in-arrow-cast (f9fc4fe4fe6d5195b69dd5bb6b7e454024883164) to 6c3e588 diff
BENCH_NAME=cast_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench cast_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=8708-remove-arrow-ord-in-arrow-cast
Results will be posted here when complete

alamb · 2025-10-28T19:51:10Z

🤖: Benchmark completed

Details

group                                                              8708-remove-arrow-ord-in-arrow-cast    main
-----                                                              -----------------------------------    ----
cast binary view to string                                         1.00     68.8±0.29µs        ? ?/sec    1.07     73.3±0.29µs        ? ?/sec
cast binary view to string view                                    1.24    115.9±0.41µs        ? ?/sec    1.00     93.4±0.16µs        ? ?/sec
cast binary view to wide string                                    1.14     73.8±0.32µs        ? ?/sec    1.00     64.9±0.27µs        ? ?/sec
cast date32 to date64 512                                          1.00    295.8±1.07ns        ? ?/sec    1.00    296.2±0.51ns        ? ?/sec
cast date64 to date32 512                                          1.03    512.0±4.08ns        ? ?/sec    1.00    499.2±0.86ns        ? ?/sec
cast decimal128 to decimal128 512                                  1.01    609.4±2.59ns        ? ?/sec    1.00    605.9±3.07ns        ? ?/sec
cast decimal128 to decimal128 512 lower precision                  1.02      5.2±0.03µs        ? ?/sec    1.00      5.1±0.01µs        ? ?/sec
cast decimal128 to decimal128 512 with lower scale (infallible)    1.00      6.8±0.09µs        ? ?/sec    1.00      6.8±0.01µs        ? ?/sec
cast decimal128 to decimal128 512 with same scale                  1.00     75.6±0.15ns        ? ?/sec    1.00     76.0±0.08ns        ? ?/sec
cast decimal128 to decimal256 512                                  1.01      2.3±0.00µs        ? ?/sec    1.00      2.3±0.01µs        ? ?/sec
cast decimal256 to decimal128 512                                  1.00     48.5±0.18µs        ? ?/sec    1.00     48.3±0.06µs        ? ?/sec
cast decimal256 to decimal256 512                                  1.00     11.5±0.08µs        ? ?/sec    1.05     12.1±0.08µs        ? ?/sec
cast decimal256 to decimal256 512 with same scale                  1.00     75.7±0.11ns        ? ?/sec    1.00     75.5±0.13ns        ? ?/sec
cast decimal32 to decimal32 512                                    1.00      2.3±0.03µs        ? ?/sec    1.13      2.6±0.02µs        ? ?/sec
cast decimal32 to decimal32 512 lower precision                    1.00      2.8±0.03µs        ? ?/sec    1.01      2.8±0.02µs        ? ?/sec
cast decimal32 to decimal64 512                                    1.10    347.8±2.21ns        ? ?/sec    1.00    316.6±0.56ns        ? ?/sec
cast decimal64 to decimal32 512                                    1.01      3.0±0.02µs        ? ?/sec    1.00      3.0±0.00µs        ? ?/sec
cast decimal64 to decimal64 512                                    1.01    381.6±5.70ns        ? ?/sec    1.00    376.0±0.49ns        ? ?/sec
cast dict to string view                                           1.00     52.5±0.10µs        ? ?/sec    1.02     53.8±0.09µs        ? ?/sec
cast f32 to string 512                                             1.03     18.7±0.04µs        ? ?/sec    1.00     18.3±0.04µs        ? ?/sec
cast f64 to string 512                                             1.00     21.4±0.12µs        ? ?/sec    1.06     22.6±0.12µs        ? ?/sec
cast float32 to int32 512                                          1.01   1577.3±2.44ns        ? ?/sec    1.00   1567.4±1.85ns        ? ?/sec
cast float64 to float32 512                                        1.02   1110.7±3.13ns        ? ?/sec    1.00   1091.8±1.88ns        ? ?/sec
cast float64 to uint64 512                                         1.02   1773.8±1.71ns        ? ?/sec    1.00   1742.5±3.28ns        ? ?/sec
cast i64 to string 512                                             1.00     14.4±0.04µs        ? ?/sec    1.02     14.7±0.13µs        ? ?/sec
cast int32 to float32 512                                          1.00   1015.5±1.26ns        ? ?/sec    1.04   1054.3±2.03ns        ? ?/sec
cast int32 to float64 512                                          1.03   1088.9±3.95ns        ? ?/sec    1.00   1053.8±1.81ns        ? ?/sec
cast int32 to int32 512                                            1.12    223.6±0.53ns        ? ?/sec    1.00    198.8±0.20ns        ? ?/sec
cast int32 to int64 512                                            1.00   1096.9±0.99ns        ? ?/sec    1.06   1167.2±2.57ns        ? ?/sec
cast int32 to uint32 512                                           1.05   1531.4±4.62ns        ? ?/sec    1.00   1464.9±1.52ns        ? ?/sec
cast int64 to int32 512                                            1.00  1568.7±34.86ns        ? ?/sec    1.08   1688.6±2.08ns        ? ?/sec
cast no runs of int32s to ree<int32>                               1.00     56.3±0.10µs        ? ?/sec    1.36     76.7±0.18µs        ? ?/sec
cast runs of 10 string to ree<int32>                               1.00      9.3±0.02µs        ? ?/sec    1.72     16.0±0.07µs        ? ?/sec
cast runs of 1000 int32s to ree<int32>                             1.00      3.9±0.01µs        ? ?/sec    2.13      8.2±0.02µs        ? ?/sec
cast string single run to ree<int32>                               1.00     23.8±0.08µs        ? ?/sec    1.15     27.4±0.08µs        ? ?/sec
cast string to binary view 512                                     1.00      3.3±0.00µs        ? ?/sec    1.00      3.3±0.00µs        ? ?/sec
cast string view to binary view                                    1.00     96.4±0.12ns        ? ?/sec    1.02     98.1±0.19ns        ? ?/sec
cast string view to dict                                           1.02    175.6±0.39µs        ? ?/sec    1.00    171.5±0.36µs        ? ?/sec
cast string view to string                                         1.00     48.4±0.10µs        ? ?/sec    1.01     48.9±0.08µs        ? ?/sec
cast string view to wide string                                    1.00     49.8±0.27µs        ? ?/sec    1.04     51.7±0.15µs        ? ?/sec
cast time32s to time32ms 512                                       1.02    290.8±0.91ns        ? ?/sec    1.00    285.2±0.44ns        ? ?/sec
cast time32s to time64us 512                                       1.04    302.1±0.60ns        ? ?/sec    1.00    289.8±0.55ns        ? ?/sec
cast time64ns to time32s 512                                       1.01    508.7±1.38ns        ? ?/sec    1.00    501.2±5.97ns        ? ?/sec
cast timestamp_ms to i64 512                                       1.00    434.9±1.91ns        ? ?/sec    1.01    440.4±5.99ns        ? ?/sec
cast timestamp_ms to timestamp_ns 512                              1.06      2.3±0.00µs        ? ?/sec    1.00      2.2±0.01µs        ? ?/sec
cast timestamp_ns to timestamp_s 512                               1.13    224.0±0.58ns        ? ?/sec    1.00    197.9±0.57ns        ? ?/sec
cast utf8 to date32 512                                            1.00     11.4±0.12µs        ? ?/sec    1.00     11.3±0.02µs        ? ?/sec
cast utf8 to date64 512                                            1.00     42.6±0.44µs        ? ?/sec    1.00     42.8±0.26µs        ? ?/sec
cast utf8 to f32                                                   1.01     11.8±0.03µs        ? ?/sec    1.00     11.7±0.06µs        ? ?/sec
cast wide string to binary view 512                                1.02      5.7±0.01µs        ? ?/sec    1.00      5.6±0.01µs        ? ?/sec

Related to #8707. Inspired by #8716 (comment), a follow up improvement to #8589: We already know what the length of the two vectors will be, so we can create them with that capacity.

Jefffrey

Are we still interested in getting this through? Could we run bot benchmarks again (I don't think I have the permission)

alamb · 2026-02-06T21:53:15Z

Are we still interested in getting this through? Could we run bot benchmarks again (I don't think I have the permission)

I added you here: alamb/datafusion-benchmarking@1f1e8b2

(though beware the benchmark machine is non ideal in that it is a shared VM and thus is prone to workload variations)

alamb · 2026-02-06T21:54:02Z

run benchmark cast_kernels

alamb · 2026-02-06T21:54:07Z

run benchmark cast_kernels

alamb-ghbot · 2026-02-06T21:54:18Z

🤖 ./gh_compare_arrow.sh gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing 8708-remove-arrow-ord-in-arrow-cast (d5747f5) to 7dbe58a diff
BENCH_NAME=cast_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench cast_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=8708-remove-arrow-ord-in-arrow-cast
Results will be posted here when complete

alamb · 2026-02-06T21:55:01Z

Are we still interested in getting this through? Could we run bot benchmarks again (I don't think I have the permission)

It seems to me like a good idea in theory -- I just didn't have the time to follow it through and complete a review 😢

alamb-ghbot · 2026-02-06T22:12:22Z

🤖: Benchmark completed

Details

group                                                              8708-remove-arrow-ord-in-arrow-cast    main
-----                                                              -----------------------------------    ----
cast binary view to string                                         1.04     73.8±0.69µs        ? ?/sec    1.00     70.9±1.06µs        ? ?/sec
cast binary view to string view                                    1.00    100.3±0.75µs        ? ?/sec    1.09    109.8±0.40µs        ? ?/sec
cast binary view to wide string                                    1.04     70.3±0.20µs        ? ?/sec    1.00     67.4±0.32µs        ? ?/sec
cast date32 to date64 512                                          1.03    302.6±0.99ns        ? ?/sec    1.00   293.6±10.27ns        ? ?/sec
cast date64 to date32 512                                          1.00    501.4±1.12ns        ? ?/sec    1.00    500.4±1.16ns        ? ?/sec
cast decimal128 to decimal128 512                                  1.00    613.8±1.45ns        ? ?/sec    1.00    615.3±9.85ns        ? ?/sec
cast decimal128 to decimal128 512 lower precision                  1.00      5.2±0.02µs        ? ?/sec    1.01      5.2±0.05µs        ? ?/sec
cast decimal128 to decimal128 512 with lower scale (infallible)    1.00      7.1±0.03µs        ? ?/sec    1.00      7.1±0.06µs        ? ?/sec
cast decimal128 to decimal128 512 with same scale                  1.00     79.0±0.42ns        ? ?/sec    1.04     82.5±1.62ns        ? ?/sec
cast decimal128 to decimal256 512                                  1.00      2.3±0.02µs        ? ?/sec    1.00      2.3±0.04µs        ? ?/sec
cast decimal256 to decimal128 512                                  1.01     48.6±0.56µs        ? ?/sec    1.00     48.2±0.32µs        ? ?/sec
cast decimal256 to decimal256 512                                  1.00     10.9±0.04µs        ? ?/sec    1.02     11.1±0.13µs        ? ?/sec
cast decimal256 to decimal256 512 with same scale                  1.00     79.6±0.88ns        ? ?/sec    1.03     82.3±1.51ns        ? ?/sec
cast decimal32 to decimal32 512                                    1.08      2.5±0.01µs        ? ?/sec    1.00      2.3±0.12µs        ? ?/sec
cast decimal32 to decimal32 512 lower precision                    1.04      2.8±0.01µs        ? ?/sec    1.00      2.7±0.03µs        ? ?/sec
cast decimal32 to decimal64 512                                    1.00    321.7±1.11ns        ? ?/sec    1.00    320.8±2.86ns        ? ?/sec
cast decimal64 to decimal32 512                                    1.00      2.8±0.01µs        ? ?/sec    1.19      3.4±0.03µs        ? ?/sec
cast decimal64 to decimal64 512                                    1.01    387.9±1.06ns        ? ?/sec    1.00    383.5±7.65ns        ? ?/sec
cast dict to string view                                           2.59    120.3±1.63µs        ? ?/sec    1.00     46.4±0.52µs        ? ?/sec
cast f32 to string 512                                             1.00     18.5±0.07µs        ? ?/sec    1.01     18.6±0.08µs        ? ?/sec
cast f64 to string 512                                             1.02     22.2±0.28µs        ? ?/sec    1.00     21.8±0.25µs        ? ?/sec
cast float32 to int32 512                                          1.00  1244.1±56.20ns        ? ?/sec    1.13  1409.8±46.72ns        ? ?/sec
cast float64 to float32 512                                        1.00   789.2±17.31ns        ? ?/sec    1.15   910.3±10.13ns        ? ?/sec
cast float64 to uint64 512                                         1.00  1452.1±11.45ns        ? ?/sec    1.08  1562.0±15.66ns        ? ?/sec
cast i64 to string 512                                             1.03     14.7±0.12µs        ? ?/sec    1.00     14.3±0.18µs        ? ?/sec
cast int32 to float32 512                                          1.00    712.2±2.87ns        ? ?/sec    1.21    863.4±6.57ns        ? ?/sec
cast int32 to float64 512                                          1.00    720.2±1.50ns        ? ?/sec    1.19    859.7±8.45ns        ? ?/sec
cast int32 to int32 512                                            1.01    181.2±1.68ns        ? ?/sec    1.00    179.9±1.89ns        ? ?/sec
cast int32 to int64 512                                            1.00    720.1±5.74ns        ? ?/sec    1.17    841.7±5.28ns        ? ?/sec
cast int32 to uint32 512                                           1.00   1274.7±3.15ns        ? ?/sec    1.10   1398.2±6.54ns        ? ?/sec
cast int64 to int32 512                                            1.00  1373.4±20.01ns        ? ?/sec    1.09   1497.8±9.40ns        ? ?/sec
cast no runs of int32s to ree<int32>                               1.00     60.5±0.15µs        ? ?/sec    1.44     86.9±2.52µs        ? ?/sec
cast runs of 10 string to ree<int32>                               1.00      9.7±0.04µs        ? ?/sec    1.63     15.8±0.11µs        ? ?/sec
cast runs of 1000 int32s to ree<int32>                             1.00      3.7±0.07µs        ? ?/sec    2.11      7.9±0.13µs        ? ?/sec
cast string single run to ree<int32>                               1.80     41.1±0.34µs        ? ?/sec    1.00     22.8±0.25µs        ? ?/sec
cast string to binary view 512                                     1.03      3.5±0.02µs        ? ?/sec    1.00      3.4±0.01µs        ? ?/sec
cast string view to binary view                                    1.00     79.4±0.57ns        ? ?/sec    1.01     80.4±0.76ns        ? ?/sec
cast string view to dict                                           1.00    213.8±2.40µs        ? ?/sec    1.03    220.1±4.39µs        ? ?/sec
cast string view to string                                         1.00     47.9±0.11µs        ? ?/sec    1.04     49.7±0.59µs        ? ?/sec
cast string view to wide string                                    1.00     49.9±0.20µs        ? ?/sec    1.01     50.2±1.04µs        ? ?/sec
cast time32s to time32ms 512                                       1.00    286.0±1.61ns        ? ?/sec    1.00    286.4±0.67ns        ? ?/sec
cast time32s to time64us 512                                       1.02    299.1±1.05ns        ? ?/sec    1.00    292.6±0.60ns        ? ?/sec
cast time64ns to time32s 512                                       1.00    507.1±1.47ns        ? ?/sec    1.00    506.2±1.21ns        ? ?/sec
cast timestamp_ms to i64 512                                       1.00    245.7±1.75ns        ? ?/sec    1.03    254.3±8.69ns        ? ?/sec
cast timestamp_ms to timestamp_ns 512                              1.00  1620.5±22.70ns        ? ?/sec    1.14   1848.7±9.64ns        ? ?/sec
cast timestamp_ns to timestamp_s 512                               1.00    177.9±1.55ns        ? ?/sec    1.01    178.8±1.79ns        ? ?/sec
cast utf8 to date32 512                                            1.00     11.3±0.03µs        ? ?/sec    1.01     11.5±0.09µs        ? ?/sec
cast utf8 to date64 512                                            1.08     46.6±0.43µs        ? ?/sec    1.00     43.0±0.38µs        ? ?/sec
cast utf8 to f32                                                   1.03     11.9±0.19µs        ? ?/sec    1.00     11.6±0.13µs        ? ?/sec
cast wide string to binary view 512                                1.02      6.2±0.16µs        ? ?/sec    1.00      6.1±0.01µs        ? ?/sec

alamb-ghbot · 2026-02-06T22:12:26Z

🤖 ./gh_compare_arrow.sh gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing 8708-remove-arrow-ord-in-arrow-cast (d5747f5) to 7dbe58a diff
BENCH_NAME=cast_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench cast_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=8708-remove-arrow-ord-in-arrow-cast
Results will be posted here when complete

alamb-ghbot · 2026-02-06T22:29:50Z

🤖: Benchmark completed

Details

group                                                              8708-remove-arrow-ord-in-arrow-cast    main
-----                                                              -----------------------------------    ----
cast binary view to string                                         1.05     73.9±0.54µs        ? ?/sec    1.00     70.3±0.19µs        ? ?/sec
cast binary view to string view                                    1.00    100.2±0.31µs        ? ?/sec    1.12    111.9±6.25µs        ? ?/sec
cast binary view to wide string                                    1.05     70.3±0.89µs        ? ?/sec    1.00     67.0±0.17µs        ? ?/sec
cast date32 to date64 512                                          1.00    293.3±0.97ns        ? ?/sec    1.02    300.2±1.67ns        ? ?/sec
cast date64 to date32 512                                          1.00    501.5±1.42ns        ? ?/sec    1.01    504.8±1.05ns        ? ?/sec
cast decimal128 to decimal128 512                                  1.08    655.4±6.85ns        ? ?/sec    1.00    608.2±1.96ns        ? ?/sec
cast decimal128 to decimal128 512 lower precision                  1.01      5.2±0.03µs        ? ?/sec    1.00      5.2±0.02µs        ? ?/sec
cast decimal128 to decimal128 512 with lower scale (infallible)    1.00      7.1±0.05µs        ? ?/sec    1.00      7.1±0.02µs        ? ?/sec
cast decimal128 to decimal128 512 with same scale                  1.00     79.2±0.29ns        ? ?/sec    1.01     79.8±0.31ns        ? ?/sec
cast decimal128 to decimal256 512                                  1.01      2.3±0.03µs        ? ?/sec    1.00      2.3±0.03µs        ? ?/sec
cast decimal256 to decimal128 512                                  1.01     48.4±0.22µs        ? ?/sec    1.00     48.2±0.33µs        ? ?/sec
cast decimal256 to decimal256 512                                  1.00     10.9±0.06µs        ? ?/sec    1.01     11.0±0.05µs        ? ?/sec
cast decimal256 to decimal256 512 with same scale                  1.00     79.5±0.60ns        ? ?/sec    1.00     79.5±0.30ns        ? ?/sec
cast decimal32 to decimal32 512                                    1.12      2.5±0.10µs        ? ?/sec    1.00      2.2±0.02µs        ? ?/sec
cast decimal32 to decimal32 512 lower precision                    1.06      2.9±0.23µs        ? ?/sec    1.00      2.7±0.03µs        ? ?/sec
cast decimal32 to decimal64 512                                    1.01    315.5±7.78ns        ? ?/sec    1.00    313.2±7.17ns        ? ?/sec
cast decimal64 to decimal32 512                                    1.00      2.8±0.01µs        ? ?/sec    1.19      3.3±0.03µs        ? ?/sec
cast decimal64 to decimal64 512                                    1.01    383.9±1.69ns        ? ?/sec    1.00    381.9±2.90ns        ? ?/sec
cast dict to string view                                           2.57    120.3±0.37µs        ? ?/sec    1.00     46.9±3.57µs        ? ?/sec
cast f32 to string 512                                             1.00     18.5±0.30µs        ? ?/sec    1.02     18.8±0.25µs        ? ?/sec
cast f64 to string 512                                             1.03     22.3±0.33µs        ? ?/sec    1.00     21.6±0.09µs        ? ?/sec
cast float32 to int32 512                                          1.00   1233.8±6.35ns        ? ?/sec    1.13   1390.3±5.00ns        ? ?/sec
cast float64 to float32 512                                        1.00    788.3±2.48ns        ? ?/sec    1.17    922.2±2.29ns        ? ?/sec
cast float64 to uint64 512                                         1.00   1449.8±2.39ns        ? ?/sec    1.10  1601.5±10.48ns        ? ?/sec
cast i64 to string 512                                             1.03     14.7±0.05µs        ? ?/sec    1.00     14.3±0.06µs        ? ?/sec
cast int32 to float32 512                                          1.00    713.3±2.65ns        ? ?/sec    1.20    858.0±2.08ns        ? ?/sec
cast int32 to float64 512                                          1.00   724.0±17.10ns        ? ?/sec    1.20    866.0±9.34ns        ? ?/sec
cast int32 to int32 512                                            1.00    180.6±3.75ns        ? ?/sec    1.00    180.9±3.29ns        ? ?/sec
cast int32 to int64 512                                            1.00    721.2±3.18ns        ? ?/sec    1.16    837.9±5.86ns        ? ?/sec
cast int32 to uint32 512                                           1.00  1264.2±17.36ns        ? ?/sec    1.10   1387.7±2.59ns        ? ?/sec
cast int64 to int32 512                                            1.00   1370.2±2.54ns        ? ?/sec    1.10   1504.7±5.17ns        ? ?/sec
cast no runs of int32s to ree<int32>                               1.00     60.6±0.14µs        ? ?/sec    1.43     86.8±1.62µs        ? ?/sec
cast runs of 10 string to ree<int32>                               1.00      9.7±0.10µs        ? ?/sec    1.63     15.8±0.18µs        ? ?/sec
cast runs of 1000 int32s to ree<int32>                             1.00      3.7±0.08µs        ? ?/sec    2.11      7.9±0.19µs        ? ?/sec
cast string single run to ree<int32>                               1.79     41.1±0.42µs        ? ?/sec    1.00     22.9±0.31µs        ? ?/sec
cast string to binary view 512                                     1.00      3.4±0.03µs        ? ?/sec    1.03      3.5±0.01µs        ? ?/sec
cast string view to binary view                                    1.00     79.6±0.40ns        ? ?/sec    1.01     80.4±2.57ns        ? ?/sec
cast string view to dict                                           1.00    213.0±1.89µs        ? ?/sec    1.02    217.1±3.28µs        ? ?/sec
cast string view to string                                         1.00     48.0±0.11µs        ? ?/sec    1.04     49.7±0.26µs        ? ?/sec
cast string view to wide string                                    1.00     50.0±1.83µs        ? ?/sec    1.00     49.9±0.18µs        ? ?/sec
cast time32s to time32ms 512                                       1.00    285.8±0.81ns        ? ?/sec    1.00    286.2±2.22ns        ? ?/sec
cast time32s to time64us 512                                       1.02    298.6±0.67ns        ? ?/sec    1.00    292.8±2.27ns        ? ?/sec
cast time64ns to time32s 512                                       1.00   509.0±13.38ns        ? ?/sec    1.00    508.0±4.13ns        ? ?/sec
cast timestamp_ms to i64 512                                       1.00    245.7±1.64ns        ? ?/sec    1.01    247.2±1.72ns        ? ?/sec
cast timestamp_ms to timestamp_ns 512                              1.00   1616.4±6.95ns        ? ?/sec    1.16  1872.2±50.57ns        ? ?/sec
cast timestamp_ns to timestamp_s 512                               1.00    180.3±4.98ns        ? ?/sec    1.00    180.3±3.89ns        ? ?/sec
cast utf8 to date32 512                                            1.00     11.3±0.09µs        ? ?/sec    1.01     11.4±0.02µs        ? ?/sec
cast utf8 to date64 512                                            1.00     46.6±0.72µs        ? ?/sec    1.03     48.1±0.78µs        ? ?/sec
cast utf8 to f32                                                   1.04     12.0±0.10µs        ? ?/sec    1.00     11.5±0.09µs        ? ?/sec
cast wide string to binary view 512                                1.00      6.0±0.02µs        ? ?/sec    1.02      6.1±0.01µs        ? ?/sec

Jefffrey · 2026-06-14T00:04:15Z

run benchmark cast_kernels

adriangbot · 2026-06-14T00:07:57Z

🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4700170237-566-qh7hc 6.12.68+ #1 SMP Sat May 2 07:49:07 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)

Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing 8708-remove-arrow-ord-in-arrow-cast (8a491f6) to 4fa8d2f (merge-base) diff
BENCH_NAME=cast_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench cast_kernels
BENCH_FILTER=
Results will be posted here when complete

Jefffrey · 2026-06-14T00:38:09Z

cc @Rich-T-kid this might be interesting to you with your recent REE work

also might be related to

Combine overlapping runs in REE (take kernel) #9865

as well

adriangbot · 2026-06-14T00:47:20Z

🤖 Arrow criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

Details

group                                                              8708-remove-arrow-ord-in-arrow-cast    main
-----                                                              -----------------------------------    ----
"cast decimal128 to float32"                                       1.00     27.4±0.01µs        ? ?/sec    1.00     27.4±0.02µs        ? ?/sec
"cast decimal128 to float64"                                       1.00     27.1±0.02µs        ? ?/sec    1.00     27.1±0.01µs        ? ?/sec
"cast decimal128 to int16"                                         1.01     54.0±0.76µs        ? ?/sec    1.00     53.2±0.78µs        ? ?/sec
"cast decimal128 to int32"                                         1.01     37.9±0.09µs        ? ?/sec    1.00     37.6±0.11µs        ? ?/sec
"cast decimal128 to int64"                                         1.00     37.0±0.05µs        ? ?/sec    1.00     37.1±0.06µs        ? ?/sec
"cast decimal128 to int8"                                          1.00     51.8±0.78µs        ? ?/sec    1.00     51.6±0.88µs        ? ?/sec
"cast decimal128 to uint16"                                        1.00     54.5±0.84µs        ? ?/sec    1.00     54.5±0.84µs        ? ?/sec
"cast decimal128 to uint32"                                        1.00     36.0±0.07µs        ? ?/sec    1.01     36.2±0.11µs        ? ?/sec
"cast decimal128 to uint64"                                        1.00     35.3±0.07µs        ? ?/sec    1.01     35.6±0.09µs        ? ?/sec
"cast decimal128 to uint8"                                         1.00     50.0±0.59µs        ? ?/sec    1.00     50.0±0.60µs        ? ?/sec
"cast decimal256 to float32"                                       1.00     70.4±0.04µs        ? ?/sec    1.00     70.5±0.04µs        ? ?/sec
"cast decimal256 to float64"                                       1.00     68.3±0.08µs        ? ?/sec    1.00     68.3±0.06µs        ? ?/sec
"cast decimal256 to int16"                                         1.01    167.1±0.75µs        ? ?/sec    1.00    165.9±0.66µs        ? ?/sec
"cast decimal256 to int32"                                         1.02    147.3±0.12µs        ? ?/sec    1.00    144.1±0.14µs        ? ?/sec
"cast decimal256 to int64"                                         1.01    144.5±1.00µs        ? ?/sec    1.00    142.4±0.66µs        ? ?/sec
"cast decimal256 to int8"                                          1.01    163.7±0.80µs        ? ?/sec    1.00    162.4±0.79µs        ? ?/sec
"cast decimal256 to uint16"                                        1.00    166.7±0.39µs        ? ?/sec    1.00    166.9±0.32µs        ? ?/sec
"cast decimal256 to uint32"                                        1.01    134.6±0.12µs        ? ?/sec    1.00    133.6±0.38µs        ? ?/sec
"cast decimal256 to uint64"                                        1.01    134.6±0.11µs        ? ?/sec    1.00    133.0±0.58µs        ? ?/sec
"cast decimal256 to uint8"                                         1.00    163.4±0.40µs        ? ?/sec    1.00    163.2±0.37µs        ? ?/sec
"cast decimal32 to float32"                                        1.00      6.8±0.00µs        ? ?/sec    1.00      6.8±0.00µs        ? ?/sec
"cast decimal32 to float64"                                        1.00      6.8±0.00µs        ? ?/sec    1.00      6.8±0.00µs        ? ?/sec
"cast decimal32 to int16"                                          1.00     26.0±0.71µs        ? ?/sec    1.01     26.2±0.59µs        ? ?/sec
"cast decimal32 to int32"                                          1.03     20.8±0.21µs        ? ?/sec    1.00     20.2±0.33µs        ? ?/sec
"cast decimal32 to int64"                                          1.00     20.2±0.13µs        ? ?/sec    1.01     20.4±0.15µs        ? ?/sec
"cast decimal32 to int8"                                           1.00     33.4±0.63µs        ? ?/sec    1.01     33.6±0.61µs        ? ?/sec
"cast decimal32 to uint16"                                         1.00     26.2±0.53µs        ? ?/sec    1.00     26.2±1.17µs        ? ?/sec
"cast decimal32 to uint32"                                         1.05     20.3±0.15µs        ? ?/sec    1.00     19.4±0.08µs        ? ?/sec
"cast decimal32 to uint64"                                         1.06     21.8±0.55µs        ? ?/sec    1.00     20.5±0.45µs        ? ?/sec
"cast decimal32 to uint8"                                          1.00     33.1±0.54µs        ? ?/sec    1.05     34.9±0.42µs        ? ?/sec
"cast decimal64 to float32"                                        1.00      6.8±0.00µs        ? ?/sec    1.00      6.8±0.00µs        ? ?/sec
"cast decimal64 to float64"                                        1.00      6.8±0.00µs        ? ?/sec    1.00      6.8±0.00µs        ? ?/sec
"cast decimal64 to int16"                                          1.00     33.9±0.42µs        ? ?/sec    1.01     34.1±0.28µs        ? ?/sec
"cast decimal64 to int32"                                          1.00     26.5±0.06µs        ? ?/sec    1.00     26.4±0.07µs        ? ?/sec
"cast decimal64 to int64"                                          1.00     26.1±0.06µs        ? ?/sec    1.00     26.0±0.09µs        ? ?/sec
"cast decimal64 to int8"                                           1.00     34.1±0.26µs        ? ?/sec    1.01     34.3±0.29µs        ? ?/sec
"cast decimal64 to uint16"                                         1.00     34.2±0.52µs        ? ?/sec    1.00     34.2±0.32µs        ? ?/sec
"cast decimal64 to uint32"                                         1.00     26.2±0.07µs        ? ?/sec    1.00     26.2±0.10µs        ? ?/sec
"cast decimal64 to uint64"                                         1.01     26.0±0.06µs        ? ?/sec    1.00     25.7±0.05µs        ? ?/sec
"cast decimal64 to uint8"                                          1.00     34.1±0.32µs        ? ?/sec    1.00     34.0±0.32µs        ? ?/sec
"cast float32 to decimal128(32, 3)"                                1.00     33.9±0.32µs        ? ?/sec    1.00     33.9±0.37µs        ? ?/sec
"cast float32 to decimal256(76, 4)"                                1.02    509.4±7.80µs        ? ?/sec    1.00    498.4±3.29µs        ? ?/sec
"cast float32 to decimal32(9, 2)"                                  1.00     20.6±0.98µs        ? ?/sec    1.03     21.2±1.36µs        ? ?/sec
"cast float32 to decimal64(18, 2"                                  1.01     22.1±0.76µs        ? ?/sec    1.00     21.9±0.71µs        ? ?/sec
"cast float64 to decimal128(32, 3)"                                1.00     32.3±0.50µs        ? ?/sec    1.00     32.3±0.50µs        ? ?/sec
"cast float64 to decimal256(76, 4)"                                1.02    507.3±8.16µs        ? ?/sec    1.00    495.5±4.10µs        ? ?/sec
"cast float64 to decimal32(9, 2)"                                  1.01     20.9±0.78µs        ? ?/sec    1.00     20.6±0.72µs        ? ?/sec
"cast float64 to decimal64(18, 2"                                  1.01     21.7±0.48µs        ? ?/sec    1.00     21.6±0.62µs        ? ?/sec
"cast invalid float32 to decimal128(32, 3)"                        1.00     22.4±1.03µs        ? ?/sec    1.00     22.3±0.78µs        ? ?/sec
"cast invalid float32 to decimal256(76, 4)"                        1.00     39.9±0.50µs        ? ?/sec    1.00     40.0±0.82µs        ? ?/sec
"cast invalid float32 to decimal32(9, 2)"                          1.03     20.9±1.92µs        ? ?/sec    1.00     20.3±1.48µs        ? ?/sec
"cast invalid float32 to decimal64(18, 2"                          1.11     25.7±2.08µs        ? ?/sec    1.00     23.1±0.87µs        ? ?/sec
"cast invalid float64 to decimal32(9, 2)"                          1.00     20.6±1.21µs        ? ?/sec    1.07     21.9±0.90µs        ? ?/sec
"cast invalid float64 to to decimal128(32, 3)"                     1.00     22.8±0.90µs        ? ?/sec    1.02     23.2±1.15µs        ? ?/sec
"cast invalid float64 to to decimal256(76, 4)"                     1.00     39.3±0.80µs        ? ?/sec    1.00     39.3±1.08µs        ? ?/sec
"cast invalid float64 to to decimal64(18, 2)"                      1.01     23.4±1.27µs        ? ?/sec    1.00     23.2±1.30µs        ? ?/sec
"cast invalid string to decimal128(38, 3)"                         1.00    714.4±2.18µs        ? ?/sec    1.00    711.9±0.89µs        ? ?/sec
"cast invalid string to decimal256(76, 4)"                         1.00    714.4±0.64µs        ? ?/sec    1.00    717.7±5.37µs        ? ?/sec
"cast invalid string to decimal32(9, 2)"                           1.01    683.8±0.94µs        ? ?/sec    1.00    680.2±0.97µs        ? ?/sec
"cast invalid string to decimal64(18, 2)"                          1.00    687.6±1.79µs        ? ?/sec    1.00    684.9±0.86µs        ? ?/sec
"cast string to decimal128(38, 3)"                                 1.00    640.0±0.64µs        ? ?/sec    1.01    646.6±0.48µs        ? ?/sec
"cast string to decimal256(76, 4)"                                 1.00    657.1±0.84µs        ? ?/sec    1.01    661.1±0.30µs        ? ?/sec
"cast string to decimal32(9, 2)"                                   1.00    786.2±1.10µs        ? ?/sec    1.01    791.4±0.65µs        ? ?/sec
"cast string to decimal64(18, 2)"                                  1.00    618.5±0.65µs        ? ?/sec    1.01    621.9±0.73µs        ? ?/sec
cast binary view to string                                         1.00     58.7±0.47µs        ? ?/sec    1.01     59.0±0.52µs        ? ?/sec
cast binary view to string view                                    1.02     65.0±0.35µs        ? ?/sec    1.00     63.5±0.33µs        ? ?/sec
cast binary view to wide string                                    1.00     58.5±0.33µs        ? ?/sec    1.01     58.9±0.44µs        ? ?/sec
cast date32 to date64 512                                          1.01    324.1±2.37ns        ? ?/sec    1.00    322.0±1.06ns        ? ?/sec
cast date64 to date32 512                                          1.00    404.7±2.26ns        ? ?/sec    1.02    412.7±0.73ns        ? ?/sec
cast decimal128 to decimal128 512                                  1.00      6.9±0.01µs        ? ?/sec    1.00      6.9±0.00µs        ? ?/sec
cast decimal128 to decimal128 512 lower precision                  1.07     14.6±0.04µs        ? ?/sec    1.00     13.7±0.24µs        ? ?/sec
cast decimal128 to decimal128 512 with lower scale (infallible)    1.00     45.9±0.09µs        ? ?/sec    1.00     45.9±0.06µs        ? ?/sec
cast decimal128 to decimal128 512 with same scale                  1.00     75.0±0.60ns        ? ?/sec    1.02     76.4±0.64ns        ? ?/sec
cast decimal128 to decimal256 512                                  1.00     26.2±0.03µs        ? ?/sec    1.00     26.2±0.03µs        ? ?/sec
cast decimal256 to decimal128 512                                  1.00    313.5±0.48µs        ? ?/sec    1.01    316.0±0.29µs        ? ?/sec
cast decimal256 to decimal256 512                                  1.00     81.9±0.11µs        ? ?/sec    1.00     82.1±0.08µs        ? ?/sec
cast decimal256 to decimal256 512 with same scale                  1.00     76.1±0.69ns        ? ?/sec    1.06     81.0±4.49ns        ? ?/sec
cast decimal32 to decimal32 512                                    1.00      8.6±0.02µs        ? ?/sec    1.00      8.6±0.01µs        ? ?/sec
cast decimal32 to decimal32 512 lower precision                    1.02     10.3±0.06µs        ? ?/sec    1.00     10.1±0.06µs        ? ?/sec
cast decimal32 to decimal64 512                                    1.01      3.4±0.00µs        ? ?/sec    1.00      3.3±0.00µs        ? ?/sec
cast decimal64 to decimal32 512                                    1.00     32.4±0.02µs        ? ?/sec    1.00     32.4±0.03µs        ? ?/sec
cast decimal64 to decimal64 512                                    1.25      3.6±0.01µs        ? ?/sec    1.00      2.9±0.01µs        ? ?/sec
cast dict to string view                                           1.01     40.7±1.88µs        ? ?/sec    1.00     40.3±1.59µs        ? ?/sec
cast f32 to string 512                                             1.00     11.8±0.05µs        ? ?/sec    1.00     11.7±0.04µs        ? ?/sec
cast f64 to string 512                                             1.00     15.3±0.04µs        ? ?/sec    1.00     15.2±0.04µs        ? ?/sec
cast float32 to int32 512                                          1.00   1345.9±5.92ns        ? ?/sec    1.03   1381.3±5.35ns        ? ?/sec
cast float64 to float32 512                                        1.00    717.5±1.35ns        ? ?/sec    1.03    736.1±1.82ns        ? ?/sec
cast float64 to uint64 512                                         1.00   1400.8±3.56ns        ? ?/sec    1.00  1394.0±10.91ns        ? ?/sec
cast i64 to string 512                                             1.01      8.9±0.04µs        ? ?/sec    1.00      8.8±0.03µs        ? ?/sec
cast int32 to float32 512                                          1.00    692.0±4.62ns        ? ?/sec    1.01    700.0±5.75ns        ? ?/sec
cast int32 to float64 512                                          1.00    702.8±2.22ns        ? ?/sec    1.00    706.1±3.31ns        ? ?/sec
cast int32 to int32 512                                            1.00    172.0±1.16ns        ? ?/sec    1.00    172.1±1.08ns        ? ?/sec
cast int32 to int64 512                                            1.00    688.7±4.06ns        ? ?/sec    1.04    716.1±3.83ns        ? ?/sec
cast int32 to uint32 512                                           1.00   1359.3±6.24ns        ? ?/sec    1.02   1383.9±5.58ns        ? ?/sec
cast int64 to int32 512                                            1.00   1452.7±1.21ns        ? ?/sec    1.04   1506.9±9.41ns        ? ?/sec
cast no runs of int32s to ree<int32>                               1.00     40.6±0.76µs        ? ?/sec    1.42     57.8±0.88µs        ? ?/sec
cast runs of 10 string to ree<int32>                               1.00      6.1±0.02µs        ? ?/sec    1.44      8.8±0.06µs        ? ?/sec
cast runs of 1000 int32s to ree<int32>                             1.00      2.4±0.01µs        ? ?/sec    1.43      3.4±0.01µs        ? ?/sec
cast string single run to ree<int32>                               1.08     29.6±0.16µs        ? ?/sec    1.00     27.4±0.02µs        ? ?/sec
cast string to binary view 512                                     1.00      2.3±0.02µs        ? ?/sec    1.01      2.3±0.02µs        ? ?/sec
cast string view to binary view                                    1.00     73.0±0.75ns        ? ?/sec    1.01     73.5±0.83ns        ? ?/sec
cast string view to dict                                           1.01    175.8±0.62µs        ? ?/sec    1.00    175.0±0.59µs        ? ?/sec
cast string view to string                                         1.00     45.5±2.47µs        ? ?/sec    1.03     46.7±2.22µs        ? ?/sec
cast string view to wide string                                    1.00     45.3±2.11µs        ? ?/sec    1.00     45.5±1.97µs        ? ?/sec
cast time32s to time32ms 512                                       1.00    141.6±2.05ns        ? ?/sec    1.06    149.6±0.43ns        ? ?/sec
cast time32s to time64us 512                                       1.01    324.6±2.26ns        ? ?/sec    1.00    322.5±0.83ns        ? ?/sec
cast time64ns to time32s 512                                       1.01    407.2±3.39ns        ? ?/sec    1.00    402.7±0.28ns        ? ?/sec
cast timestamp_ms to i64 512                                       1.00    250.9±0.82ns        ? ?/sec    1.01    252.4±3.17ns        ? ?/sec
cast timestamp_ms to timestamp_ns 512                              1.00   1829.1±2.48ns        ? ?/sec    1.00   1830.4±6.49ns        ? ?/sec
cast timestamp_ns to timestamp_s 512                               1.00    171.5±1.43ns        ? ?/sec    1.00    171.8±2.02ns        ? ?/sec
cast utf8 to date32 512                                            1.01      6.5±0.03µs        ? ?/sec    1.00      6.4±0.04µs        ? ?/sec
cast utf8 to date64 512                                            1.00     31.8±0.10µs        ? ?/sec    1.00     31.9±0.09µs        ? ?/sec
cast utf8 to f32                                                   1.01      5.6±0.03µs        ? ?/sec    1.00      5.6±0.03µs        ? ?/sec
cast wide string to binary view 512                                1.00      4.1±0.08µs        ? ?/sec    1.02      4.1±0.10µs        ? ?/sec

Resource Usage

base (merge-base)

Metric	Value
Wall time	1145.2s
Peak memory	16.7 MiB
Avg memory	15.2 MiB
CPU user	1139.8s
CPU sys	0.1s
Peak spill	0 B

branch

Metric	Value
Wall time	1140.2s
Peak memory	16.8 MiB
Avg memory	15.5 MiB
CPU user	1137.3s
CPU sys	0.1s
Peak spill	0 B

Jefffrey · 2026-06-14T00:59:28Z

+    // Safety: `T::Native` is guaranteed by `ArrowPrimitiveType` to have a plain-old-data layout,
+    // allowing the value to be viewed as raw bytes. We copy exactly `element_size` bytes, so the
+    // slice built from `current` stays within bounds.
+    unsafe {
+        let value_bytes =
+            std::slice::from_raw_parts(&current as *const T::Native as *const u8, element_size);
+        for chunk in pattern_bytes.chunks_mut(element_size) {
+            chunk.copy_from_slice(value_bytes);
+        }
+    }
+    let pattern = u128::from_ne_bytes(pattern_bytes);
+
+    while idx + elements_per_chunk <= len {
+        // SAFETY: pointer arithmetic stays within the backing slice; unaligned reads are allowed.
+        let chunk = unsafe { (values.as_ptr().add(idx) as *const u128).read_unaligned() };
+        if chunk != pattern {
+            for offset in 0..elements_per_chunk {
+                let value = unsafe { *values.get_unchecked(idx + offset) };
+                if value != current {
+                    return idx + offset;
+                }
+            }
+        }
+        idx += elements_per_chunk;
+    }


With the amount of unsafe here, it would be nice if we could ensure we run this code through miri (if we don't already) 🤔
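
One way to make a Miri run more useful is to compare the `unsafe` scan against a safe reference version. The sketch below is illustrative only: `run_end_u32` is a hypothetical name, it is fixed to `u32` rather than generic over `T::Native`, and it is not the PR's implementation. It uses the same idea as the diff (repeat the current value across a 16-byte pattern, compare one `u128` per chunk) but builds the chunk with `to_ne_bytes` instead of pointer casts.

```rust
/// Returns the index of the first element in `values[start..]` that differs
/// from `values[start]`, or `values.len()` if the run reaches the end.
///
/// Hypothetical helper for illustration; panics if `start >= values.len()`.
fn run_end_u32(values: &[u32], start: usize) -> usize {
    const LANES: usize = 4; // four u32 values fit in one u128
    let current = values[start];

    // Repeat the 4-byte value across all 16 bytes of the pattern.
    let mut pattern_bytes = [0u8; 16];
    for chunk in pattern_bytes.chunks_exact_mut(4) {
        chunk.copy_from_slice(&current.to_ne_bytes());
    }
    let pattern = u128::from_ne_bytes(pattern_bytes);

    let mut idx = start;
    while idx + LANES <= values.len() {
        // Build the same 16-byte view of the next four values without `unsafe`.
        let mut bytes = [0u8; 16];
        for (dst, v) in bytes.chunks_exact_mut(4).zip(&values[idx..idx + LANES]) {
            dst.copy_from_slice(&v.to_ne_bytes());
        }
        if u128::from_ne_bytes(bytes) != pattern {
            // At least one lane differs; the scalar loop below finds which.
            break;
        }
        idx += LANES;
    }

    // Scalar scan: handles the tail and pinpoints a mismatch inside a chunk.
    while idx < values.len() && values[idx] == current {
        idx += 1;
    }
    idx
}
```

A test could then assert that the safe and `unsafe` versions agree on the same inputs, which gives Miri concrete paths through the pointer arithmetic to check.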

Jefffrey · 2026-06-14T01:00:58Z

+fn ensure_capacity(vec: &mut Vec<usize>, total_len: usize) {
+    if vec.len() == vec.capacity() {
+        let remaining = total_len.saturating_sub(vec.len());
+        vec.reserve(remaining.max(1));
+    }
+}


One thing that bugs me about this function: total_len is constant, yet the function is called inside the loop, so the only thing that changes between calls is the len/capacity of the input vector. It either reserves up to total_len once, or subsequently keeps reserving just 1 (though I don't think that second case can actually happen).
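
As a rough sketch of one alternative, assuming amortized growth is acceptable here: size the vector once before the loop and let `Vec::push` grow it if the guess turns out too small. The function name `run_starts` and the `i32` element type are hypothetical; the `len / 64 + 2` guess is the one that appears in the diff.

```rust
/// Collects the start index of every run of equal values.
///
/// Hypothetical sketch: no per-iteration capacity check is needed because
/// `Vec::push` already grows the buffer geometrically when it fills up.
fn run_starts(values: &[i32]) -> Vec<usize> {
    if values.is_empty() {
        return Vec::new();
    }
    // One-time capacity guess made outside the loop.
    let mut starts = Vec::with_capacity(values.len() / 64 + 2);
    starts.push(0);
    for idx in 1..values.len() {
        if values[idx] != values[idx - 1] {
            starts.push(idx);
        }
    }
    starts
}
```

Whether this matches the performance of the explicit `ensure_capacity` approach would need to be checked with the `cast_kernels` benchmark, in particular the "no runs" case where every element starts a new run.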

Jefffrey · 2026-06-14T01:01:32Z

+        return runs;
+    }
+
+    let mut run_boundaries = Vec::with_capacity(len / 64 + 2);


I'm seeing Vec::with_capacity(len / 64 + 2) multiple times. Is this just a guesstimate, or is it based on something?

Jefffrey · 2026-06-14T01:02:24Z

+    if array.is_empty() {
+        return (Vec::new(), Vec::new());
+    }


Suggested change

    if array.is_empty() {
        return (Vec::new(), Vec::new());
    }
    if let Some(runs) = trivial_runs(array.len()) {
        return runs;
    }

This hoists the call to trivial_runs() out of each helper (e.g. runs_for_primitive, runs_for_boolean).

Jefffrey · 2026-06-14T02:59:17Z

+fn runs_for_binary<O: OffsetSizeTrait>(array: &GenericBinaryArray<O>) -> (Vec<usize>, Vec<usize>) {
+    let mut to_usize = |v: O| v.as_usize();
+    runs_for_binary_like(
+        array.len(),
+        array.null_count(),
+        array.value_offsets(),
+        array.value_data(),
+        |idx| array.is_valid(idx),
+        &mut to_usize,
+    )
+}
+
+fn runs_for_binary_like<T: Copy>(
+    len: usize,
+    null_count: usize,
+    offsets: &[T],
+    values: &[u8],
+    mut is_valid: impl FnMut(usize) -> bool,
+    to_usize: &mut impl FnMut(T) -> usize,
+) -> (Vec<usize>, Vec<usize>) {
+    if let Some(runs) = trivial_runs(len) {
+        return runs;
+    }
+


We can combine the binary+string methods like so:

fn runs_for_bytes<O: ByteArrayType>(array: &GenericByteArray<O>) -> (Vec<usize>, Vec<usize>) {
    let len = array.len();
    let null_count = array.null_count();
    let offsets = array.value_offsets();
    let values = array.value_data();
    // rest of runs_for_binary_like()

This makes it generic over GenericByteArray, which covers both string and binary arrays.

We can then use it like so, removing the need for runs_for_binary() and runs_for_string():

Utf8 => runs_for_bytes(array.as_string::<i32>()),
LargeUtf8 => runs_for_bytes(array.as_string::<i64>()),
Binary => runs_for_bytes(array.as_binary::<i32>()),
LargeBinary => runs_for_bytes(array.as_binary::<i64>()),
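
To show why one function can serve both strings and binary, here is a minimal, dependency-free sketch of the underlying scan. It works only on an offsets slice and a contiguous byte buffer, which is the layout both array kinds share. `byte_run_ends` is a hypothetical name, offsets are assumed to already be `usize`, nulls are ignored, and only run ends are returned, so this is a simplification rather than the PR's implementation.

```rust
/// Returns the exclusive end index of each run of consecutive equal elements
/// in a variable-length byte column. `offsets` has one more entry than there
/// are elements; element `i` is `values[offsets[i]..offsets[i + 1]]`.
fn byte_run_ends(offsets: &[usize], values: &[u8]) -> Vec<usize> {
    let len = offsets.len().saturating_sub(1);
    if len == 0 {
        return Vec::new();
    }
    let mut ends = Vec::new();
    for idx in 1..len {
        let prev = &values[offsets[idx - 1]..offsets[idx]];
        let curr = &values[offsets[idx]..offsets[idx + 1]];
        // Slice comparison checks length first, then bytes.
        if prev != curr {
            ends.push(idx);
        }
    }
    // The final run always ends at the element count.
    ends.push(len);
    ends
}
```

For example, the elements "ab", "ab", "c", "", "" are stored as the buffer `b"ababc"` with offsets `[0, 2, 4, 5, 5, 5]`. Whether the bytes are valid UTF-8 does not matter to the scan, which is why a single generic helper is enough.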

Weijun-H changed the title ~~refactor: remove dependency on arrow_ord~~ refactor: remove arrow-ord dependency in arrow-cast Oct 27, 2025

github-actions Bot added the arrow Changes to the arrow crate label Oct 27, 2025

Weijun-H force-pushed the 8708-remove-arrow-ord-in-arrow-cast branch from a6198b8 to c72af36 Compare October 27, 2025 17:02

alamb reviewed Oct 27, 2025

vegarsti mentioned this pull request Oct 27, 2025

perf: Use Vec::with_capacity in cast_to_run_end_encoded #8726

Merged

Weijun-H force-pushed the 8708-remove-arrow-ord-in-arrow-cast branch from 70b24d1 to fe208be Compare October 27, 2025 23:00

Weijun-H force-pushed the 8708-remove-arrow-ord-in-arrow-cast branch from f9fc4fe to 4fd7761 Compare October 30, 2025 13:32

Jefffrey reviewed Feb 6, 2026

Comment thread arrow-cast/src/cast/run_array.rs Outdated

alamb mentioned this pull request Apr 16, 2026

New crates for take and partition kernels (for use by REE) #9737

Open

Weijun-H added 3 commits April 26, 2026 12:24

refactor: remove dependency on arrow_ord

db1b59f

chore

e744d4b

chore: Added comments

f848a0e

Weijun-H force-pushed the 8708-remove-arrow-ord-in-arrow-cast branch from d5747f5 to f848a0e Compare April 26, 2026 04:24

Weijun-H force-pushed the 8708-remove-arrow-ord-in-arrow-cast branch from d66a2bc to b86ef14 Compare April 26, 2026 04:37

refactor: simplify run boundary computation and remove unused functions

8a491f6

Weijun-H force-pushed the 8708-remove-arrow-ord-in-arrow-cast branch from b86ef14 to 8a491f6 Compare April 26, 2026 04:39

Weijun-H requested review from Jefffrey and alamb April 26, 2026 05:18

Jefffrey reviewed Jun 14, 2026
