refactor: remove arrow-ord dependency in arrow-cast #8716

Open
Weijun-H wants to merge 4 commits into apache:main from Weijun-H:8708-remove-arrow-ord-in-arrow-cast

Conversation

@Weijun-H

@Weijun-H Weijun-H commented Oct 27, 2025

Member

Which issue does this PR close?

Rationale for this change

arrow-cast currently depends on arrow-ord only to use the generic partition kernel when casting arrays to RunEndEncoded. This is more dependency surface than needed for this path: REE encoding only needs to find boundaries between consecutive equal values.

This PR removes the arrow-ord dependency from arrow-cast and replaces the partition call with local run-boundary computation tailored to cast-to-REE.

What changes are included in this PR?

  • Removes arrow-ord from arrow-cast dependencies.
  • Replaces partition(&[cast_array]) with compute_run_boundaries in cast_to_run_end_encoded.
  • Adds specialized run-boundary scans for primitive, boolean, binary/string, fixed-size binary, and dictionary arrays.
  • Keeps a generic ArrayData fallback for less common types.
  • Uses downcast_primitive_array! to avoid repetitive primitive type dispatch.
  • Removes thin StringView / BinaryView wrappers that only called the generic fallback.
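
The boundary search described above can be illustrated with a minimal sketch. It is not the PR's `compute_run_boundaries`: the helper name `run_ends_of` is hypothetical, it works on a plain slice instead of an Arrow array, and it ignores nulls. It only shows that finding runs requires nothing more than comparing each element with its predecessor.

```rust
/// Hypothetical helper (not the PR's code): returns the exclusive end index
/// of every run of consecutive equal values in `values`.
fn run_ends_of<T: PartialEq>(values: &[T]) -> Vec<usize> {
    let mut run_ends = Vec::new();
    for idx in 1..values.len() {
        // A run ends wherever an element differs from the one before it.
        if values[idx] != values[idx - 1] {
            run_ends.push(idx);
        }
    }
    // The last run always ends at the array length (if there is any data).
    if !values.is_empty() {
        run_ends.push(values.len());
    }
    run_ends
}
```

For the input `[1, 1, 2, 2, 2, 3]` this yields run ends `[2, 5, 6]`, which is the shape of the `run_ends` child of a run-end encoded array.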

Are these changes tested?

Yes

Are there any user-facing changes?

No. This is an internal dependency and implementation change.

@Weijun-H Weijun-H changed the title from "refactor: remove dependency on arrow_ord" to "refactor: remove arrow-ord dependency in arrow-cast" Oct 27, 2025
@vegarsti

Contributor

Could you try running the benchmark in this PR #8710 and see what the difference is? I thought cast_array.slice would be doing a clone, but it's not, so this might be quite fast.

@github-actions github-actions Bot added the arrow Changes to the arrow crate label Oct 27, 2025
@Weijun-H Weijun-H force-pushed the 8708-remove-arrow-ord-in-arrow-cast branch from a6198b8 to c72af36 Compare October 27, 2025 17:02
@alamb

alamb commented Oct 27, 2025

Contributor

I just reviewed the benchmark in

and I think it looks good to go. I'll merge it in and then run the benchmarks on this PR

@alamb alamb left a comment

Contributor

Thanks @Weijun-H and @vegarsti

values_indexes.push(0);
let mut current_data = array.slice(0, 1).to_data();
for idx in 1..array.len() {
let next_data = array.slice(idx, 1).to_data();

Contributor

I think this is likely to be substantially slower than what partition does, but we can see what the benchmarks show

Comment thread arrow-cast/src/cast/run_array.rs Outdated
@@ -134,16 +134,8 @@ pub(crate) fn cast_to_run_end_encoded<K: RunEndIndexType>(
));
}

// Partition the array to identify runs of consecutive equal values
let partitions = partition(&[Arc::clone(cast_array)])?;
let mut run_ends = Vec::new();

Contributor

I looked briefly at a profile for this function -- I think we could make it substantially faster by reducing allocations with a pre-sized vector here (use partitions.count_ones() to know how many partitions are needed)

Screenshot 2025-10-27 at 3 05 53 PM
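
A rough sketch of the pre-sizing idea follows. It does not use the arrow-ord `Partitions` type; a plain `&[bool]` boundary mask stands in for it, and the function name `run_ends_from_mask` is hypothetical. The point is only that counting the boundaries first lets the output vector be allocated once.

```rust
/// Hypothetical stand-in for the partition result: `boundaries[i]` is true
/// when element `i + 1` differs from element `i`, so it has `len - 1` entries.
fn run_ends_from_mask(len: usize, boundaries: &[bool]) -> Vec<usize> {
    if len == 0 {
        return Vec::new();
    }
    assert_eq!(boundaries.len(), len - 1);

    // Count the runs up front so the vector never has to grow.
    let run_count = boundaries.iter().filter(|b| **b).count() + 1;
    let mut run_ends = Vec::with_capacity(run_count);

    for (idx, is_boundary) in boundaries.iter().enumerate() {
        if *is_boundary {
            // The run ends just before the element that differs.
            run_ends.push(idx + 1);
        }
    }
    // The final run ends at the array length.
    run_ends.push(len);
    run_ends
}
```

The extra counting pass is cheap compared with repeated reallocation when there are many runs, which is presumably why it showed up in the profile.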

Contributor

Oh, great idea!

Side note: how did you profile this? It looks like samply; did you build with cargo build --profile profiling and then run e.g. a unit test?

Contributor

I used Instruments, which is part of Xcode on macOS -- it is pretty sweet as it does whole-system profiling (fire it up, start recording, and it gathers the info for all processes)

Contributor

Pushed your suggestion as a PR here: #8716, maybe you can run the benchmark on that too? 😇

@alamb

alamb commented Oct 27, 2025

Contributor

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1017-gcp #18~24.04.1-Ubuntu SMP Tue Sep 23 17:51:44 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing 8708-remove-arrow-ord-in-arrow-cast (70b24d19012b3ba909e8c610ca84185de37278fd) to 62df32e diff
BENCH_NAME=cast_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench cast_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=8708-remove-arrow-ord-in-arrow-cast
Results will be posted here when complete

@alamb

alamb commented Oct 27, 2025

Contributor

🤖: Benchmark completed

Details

group                                                              8708-remove-arrow-ord-in-arrow-cast     main
-----                                                              -----------------------------------     ----
cast binary view to string                                         1.00     68.6±0.30µs        ? ?/sec     1.07     73.4±0.31µs        ? ?/sec
cast binary view to string view                                    1.23    115.8±0.39µs        ? ?/sec     1.00     93.9±0.32µs        ? ?/sec
cast binary view to wide string                                    1.15     74.4±0.28µs        ? ?/sec     1.00     64.8±0.34µs        ? ?/sec
cast date32 to date64 512                                          1.00    293.1±0.84ns        ? ?/sec     1.03    301.6±1.53ns        ? ?/sec
cast date64 to date32 512                                          1.00    501.4±1.12ns        ? ?/sec     1.01    505.7±1.89ns        ? ?/sec
cast decimal128 to decimal128 512                                  1.01    610.2±0.91ns        ? ?/sec     1.00    604.3±1.45ns        ? ?/sec
cast decimal128 to decimal128 512 lower precision                  1.00      5.1±0.02µs        ? ?/sec     1.01      5.1±0.03µs        ? ?/sec
cast decimal128 to decimal128 512 with lower scale (infallible)    1.01      6.9±0.03µs        ? ?/sec     1.00      6.8±0.02µs        ? ?/sec
cast decimal128 to decimal128 512 with same scale                  1.07     81.1±0.12ns        ? ?/sec     1.00     75.8±0.16ns        ? ?/sec
cast decimal128 to decimal256 512                                  1.00      2.3±0.01µs        ? ?/sec     1.01      2.3±0.02µs        ? ?/sec
cast decimal256 to decimal128 512                                  1.00     48.6±0.14µs        ? ?/sec     1.00     48.5±0.34µs        ? ?/sec
cast decimal256 to decimal256 512                                  1.00     11.1±0.03µs        ? ?/sec     1.02     11.3±0.03µs        ? ?/sec
cast decimal256 to decimal256 512 with same scale                  1.09     82.2±0.20ns        ? ?/sec     1.00     75.7±0.20ns        ? ?/sec
cast decimal32 to decimal32 512                                    1.00      2.3±0.01µs        ? ?/sec     1.14      2.6±0.01µs        ? ?/sec
cast decimal32 to decimal32 512 lower precision                    1.00      2.8±0.03µs        ? ?/sec     1.02      2.8±0.01µs        ? ?/sec
cast decimal32 to decimal64 512                                    1.10    347.3±3.82ns        ? ?/sec     1.00    316.7±0.93ns        ? ?/sec
cast decimal64 to decimal32 512                                    1.01      3.0±0.01µs        ? ?/sec     1.00      3.0±0.01µs        ? ?/sec
cast decimal64 to decimal64 512                                    1.01    381.5±2.23ns        ? ?/sec     1.00    376.8±0.63ns        ? ?/sec
cast dict to string view                                           1.00     52.3±0.21µs        ? ?/sec     1.03     53.8±0.12µs        ? ?/sec
cast f32 to string 512                                             1.06     19.1±0.68µs        ? ?/sec     1.00     18.1±0.04µs        ? ?/sec
cast f64 to string 512                                             1.00     21.8±0.05µs        ? ?/sec     1.04     22.6±0.07µs        ? ?/sec
cast float32 to int32 512                                          1.00   1564.8±3.69ns        ? ?/sec     1.00   1560.3±4.08ns        ? ?/sec
cast float64 to float32 512                                        1.01   1088.4±3.42ns        ? ?/sec     1.00   1077.6±5.59ns        ? ?/sec
cast float64 to uint64 512                                         1.01   1769.7±5.39ns        ? ?/sec     1.00   1754.0±2.46ns        ? ?/sec
cast i64 to string 512                                             1.02     14.7±0.12µs        ? ?/sec     1.00     14.4±0.04µs        ? ?/sec
cast int32 to float32 512                                          1.02   1065.8±2.83ns        ? ?/sec     1.00   1047.8±4.28ns        ? ?/sec
cast int32 to float64 512                                          1.01   1071.1±4.97ns        ? ?/sec     1.00   1056.6±2.00ns        ? ?/sec
cast int32 to int32 512                                            1.01    201.1±1.01ns        ? ?/sec     1.00    198.7±0.45ns        ? ?/sec
cast int32 to int64 512                                            1.00   1084.5±1.52ns        ? ?/sec     1.08   1167.5±4.21ns        ? ?/sec
cast int32 to uint32 512                                           1.03   1517.6±5.18ns        ? ?/sec     1.00   1466.4±3.60ns        ? ?/sec
cast int64 to int32 512                                            1.00   1562.5±2.39ns        ? ?/sec     1.08   1684.9±3.44ns        ? ?/sec
cast no runs of int32s to ree<int32>                               18.65  1452.8±3.65µs        ? ?/sec     1.00     77.9±0.40µs        ? ?/sec
cast runs of 10 string to ree<int32>                               83.20  1357.7±3.84µs        ? ?/sec     1.00     16.3±0.08µs        ? ?/sec
cast runs of 1000 int32s to ree<int32>                             164.19  1339.8±2.22µs        ? ?/sec    1.00      8.2±0.04µs        ? ?/sec
cast string single run to ree<int32>                               57.34  1568.5±2.43µs        ? ?/sec     1.00     27.4±0.08µs        ? ?/sec
cast string to binary view 512                                     1.00      3.2±0.01µs        ? ?/sec     1.02      3.3±0.01µs        ? ?/sec
cast string view to binary view                                    1.00     97.8±0.20ns        ? ?/sec     1.00     97.3±0.20ns        ? ?/sec
cast string view to dict                                           1.00    173.5±0.35µs        ? ?/sec     1.04    180.1±0.30µs        ? ?/sec
cast string view to string                                         1.00     48.2±0.11µs        ? ?/sec     1.02     49.1±0.52µs        ? ?/sec
cast string view to wide string                                    1.00     48.4±0.16µs        ? ?/sec     1.07     51.8±0.22µs        ? ?/sec
cast time32s to time32ms 512                                       1.01    288.3±1.04ns        ? ?/sec     1.00    285.8±0.40ns        ? ?/sec
cast time32s to time64us 512                                       1.01    292.1±0.30ns        ? ?/sec     1.00    290.6±0.86ns        ? ?/sec
cast time64ns to time32s 512                                       1.00    503.3±4.33ns        ? ?/sec     1.01    507.9±0.76ns        ? ?/sec
cast timestamp_ms to i64 512                                       1.05    453.0±2.06ns        ? ?/sec     1.00    433.1±1.14ns        ? ?/sec
cast timestamp_ms to timestamp_ns 512                              1.06      2.3±0.02µs        ? ?/sec     1.00      2.2±0.00µs        ? ?/sec
cast timestamp_ns to timestamp_s 512                               1.01    200.6±1.05ns        ? ?/sec     1.00    197.7±0.35ns        ? ?/sec
cast utf8 to date32 512                                            1.00     11.4±0.02µs        ? ?/sec     1.00     11.4±0.04µs        ? ?/sec
cast utf8 to date64 512                                            1.08     46.3±0.08µs        ? ?/sec     1.00     42.8±0.16µs        ? ?/sec
cast utf8 to f32                                                   1.00     11.5±0.09µs        ? ?/sec     1.01     11.7±0.03µs        ? ?/sec
cast wide string to binary view 512                                1.00      5.6±0.01µs        ? ?/sec     1.00      5.6±0.01µs        ? ?/sec

@alamb

alamb commented Oct 27, 2025

Contributor

cast no runs of int32s to ree<int32>      18.65   1452.8±3.65µs   ? ?/sec    1.00   77.9±0.40µs   ? ?/sec
cast runs of 10 string to ree<int32>      83.20   1357.7±3.84µs   ? ?/sec    1.00   16.3±0.08µs   ? ?/sec
cast runs of 1000 int32s to ree<int32>    164.19  1339.8±2.22µs   ? ?/sec    1.00    8.2±0.04µs   ? ?/sec
cast string single run to ree<int32>      57.34   1568.5±2.43µs   ? ?/sec    1.00   27.4±0.08µs   ? ?/sec

As @vegarsti predicted, this PR appears to be quite a bit slower than using partition
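
For readers unfamiliar with the table layout: the number in front of each timing is the time relative to the faster of the two branches, so 1.00 marks the winner. A minimal sketch of that arithmetic, with a hypothetical helper name and the rounded times copied from the rows above:

```rust
/// Hypothetical helper: time of `candidate` relative to `baseline`.
/// A result of 1.00 means both took equally long.
fn relative_time(candidate_us: f64, baseline_us: f64) -> f64 {
    candidate_us / baseline_us
}
```

`relative_time(1452.8, 77.9)` is roughly 18.65, matching the "no runs of int32s" row. Recomputing the other rows from the rounded times gives slightly different values than the table, presumably because the tool divides the unrounded measurements.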

@Weijun-H

Weijun-H commented Oct 28, 2025

Member Author

FYI @vegarsti, @alamb: after several rounds of optimization, the current version is significantly faster than the previous one.

  • Type-specialized dispatch:
    compute_run_boundaries now routes each physical layout (boolean, primitive scalars, binary/string, etc.) to a dedicated helper, allowing most arrays to bypass the slow, generic ArrayData comparison path.
  • Chunked primitive scanning:
    The no-null primitive path uses scan_run_end, which compares 16 bytes at a time via u128 loads. When a chunk differs, it falls back to scalar iteration—reducing branches and bounds checks in the hot loop.
  • Targeted use of unsafe for performance:
    Tight loops leverage get_unchecked, from_raw_parts, and read_unaligned to eliminate redundant bounds and alignment checks. Each unsafe block includes detailed safety comments describing the invariants upheld.
  • Generic fallback:
    Less common types still rely on ArrayData equality but reuse the shared accumulator to produce consistent run and value outputs—without special-casing memory management.
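
The chunked scan described in the list above can be sketched in safe Rust as follows. This is an illustration under stated assumptions, not the PR's code: it handles only `i32` values without nulls, the `pack` helper is hypothetical, and it builds each `u128` by copying bytes instead of the unchecked, unaligned loads the PR describes. The name `scan_run_end` is borrowed from the comment, but the signature here is a guess.

```rust
/// Hypothetical helper: packs four i32 values (16 bytes) into one u128 so
/// that a whole chunk can be compared with a single integer comparison.
fn pack(chunk: &[i32]) -> u128 {
    let mut bytes = [0u8; 16];
    for (dst, v) in bytes.chunks_exact_mut(4).zip(chunk) {
        dst.copy_from_slice(&v.to_ne_bytes());
    }
    u128::from_ne_bytes(bytes)
}

/// Returns the exclusive end of the run that starts at `start`.
fn scan_run_end(values: &[i32], start: usize) -> usize {
    let first = values[start];
    let pattern = pack(&[first; 4]);
    let mut idx = start + 1;
    // Fast path: skip 16 bytes at a time while the chunk equals the pattern.
    while idx + 4 <= values.len() && pack(&values[idx..idx + 4]) == pattern {
        idx += 4;
    }
    // Scalar fallback: locate the exact position inside the differing chunk
    // (or walk the tail that is shorter than one chunk).
    while idx < values.len() && values[idx] == first {
        idx += 1;
    }
    idx
}

/// Collects the run ends of the whole slice by scanning run after run.
fn run_ends(values: &[i32]) -> Vec<usize> {
    let mut ends = Vec::new();
    let mut start = 0;
    while start < values.len() {
        let end = scan_run_end(values, start);
        ends.push(end);
        start = end;
    }
    ends
}
```

The fast path pays off for long runs, where most chunks match the pattern; with no runs at all, every scan falls through to the scalar loop almost immediately.
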
cast string single run to ree<int32>
                        time:   [23.143 µs 23.180 µs 23.224 µs]
                        change: [−8.5926% −6.6138% −5.2622%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  9 (9.00%) high severe

cast runs of 10 string to ree<int32>
                        time:   [4.4857 µs 4.4924 µs 4.4999 µs]
                        change: [−35.582% −32.807% −30.598%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe

cast runs of 1000 int32s to ree<int32>
                        time:   [1.9651 µs 1.9923 µs 2.0449 µs]
                        change: [−35.958% −34.582% −33.095%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

cast no runs of int32s to ree<int32>
                        time:   [27.745 µs 28.013 µs 28.291 µs]
                        change: [−27.957% −27.305% −26.645%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  14 (14.00%) high mild

@alamb

alamb commented Oct 28, 2025

Contributor

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1017-gcp #18~24.04.1-Ubuntu SMP Tue Sep 23 17:51:44 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing 8708-remove-arrow-ord-in-arrow-cast (f9fc4fe4fe6d5195b69dd5bb6b7e454024883164) to 6c3e588 diff
BENCH_NAME=cast_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench cast_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=8708-remove-arrow-ord-in-arrow-cast
Results will be posted here when complete

@alamb

alamb commented Oct 28, 2025

Contributor

🤖: Benchmark completed

Details

group                                                              8708-remove-arrow-ord-in-arrow-cast    main
-----                                                              -----------------------------------    ----
cast binary view to string                                         1.00     68.8±0.29µs        ? ?/sec    1.07     73.3±0.29µs        ? ?/sec
cast binary view to string view                                    1.24    115.9±0.41µs        ? ?/sec    1.00     93.4±0.16µs        ? ?/sec
cast binary view to wide string                                    1.14     73.8±0.32µs        ? ?/sec    1.00     64.9±0.27µs        ? ?/sec
cast date32 to date64 512                                          1.00    295.8±1.07ns        ? ?/sec    1.00    296.2±0.51ns        ? ?/sec
cast date64 to date32 512                                          1.03    512.0±4.08ns        ? ?/sec    1.00    499.2±0.86ns        ? ?/sec
cast decimal128 to decimal128 512                                  1.01    609.4±2.59ns        ? ?/sec    1.00    605.9±3.07ns        ? ?/sec
cast decimal128 to decimal128 512 lower precision                  1.02      5.2±0.03µs        ? ?/sec    1.00      5.1±0.01µs        ? ?/sec
cast decimal128 to decimal128 512 with lower scale (infallible)    1.00      6.8±0.09µs        ? ?/sec    1.00      6.8±0.01µs        ? ?/sec
cast decimal128 to decimal128 512 with same scale                  1.00     75.6±0.15ns        ? ?/sec    1.00     76.0±0.08ns        ? ?/sec
cast decimal128 to decimal256 512                                  1.01      2.3±0.00µs        ? ?/sec    1.00      2.3±0.01µs        ? ?/sec
cast decimal256 to decimal128 512                                  1.00     48.5±0.18µs        ? ?/sec    1.00     48.3±0.06µs        ? ?/sec
cast decimal256 to decimal256 512                                  1.00     11.5±0.08µs        ? ?/sec    1.05     12.1±0.08µs        ? ?/sec
cast decimal256 to decimal256 512 with same scale                  1.00     75.7±0.11ns        ? ?/sec    1.00     75.5±0.13ns        ? ?/sec
cast decimal32 to decimal32 512                                    1.00      2.3±0.03µs        ? ?/sec    1.13      2.6±0.02µs        ? ?/sec
cast decimal32 to decimal32 512 lower precision                    1.00      2.8±0.03µs        ? ?/sec    1.01      2.8±0.02µs        ? ?/sec
cast decimal32 to decimal64 512                                    1.10    347.8±2.21ns        ? ?/sec    1.00    316.6±0.56ns        ? ?/sec
cast decimal64 to decimal32 512                                    1.01      3.0±0.02µs        ? ?/sec    1.00      3.0±0.00µs        ? ?/sec
cast decimal64 to decimal64 512                                    1.01    381.6±5.70ns        ? ?/sec    1.00    376.0±0.49ns        ? ?/sec
cast dict to string view                                           1.00     52.5±0.10µs        ? ?/sec    1.02     53.8±0.09µs        ? ?/sec
cast f32 to string 512                                             1.03     18.7±0.04µs        ? ?/sec    1.00     18.3±0.04µs        ? ?/sec
cast f64 to string 512                                             1.00     21.4±0.12µs        ? ?/sec    1.06     22.6±0.12µs        ? ?/sec
cast float32 to int32 512                                          1.01   1577.3±2.44ns        ? ?/sec    1.00   1567.4±1.85ns        ? ?/sec
cast float64 to float32 512                                        1.02   1110.7±3.13ns        ? ?/sec    1.00   1091.8±1.88ns        ? ?/sec
cast float64 to uint64 512                                         1.02   1773.8±1.71ns        ? ?/sec    1.00   1742.5±3.28ns        ? ?/sec
cast i64 to string 512                                             1.00     14.4±0.04µs        ? ?/sec    1.02     14.7±0.13µs        ? ?/sec
cast int32 to float32 512                                          1.00   1015.5±1.26ns        ? ?/sec    1.04   1054.3±2.03ns        ? ?/sec
cast int32 to float64 512                                          1.03   1088.9±3.95ns        ? ?/sec    1.00   1053.8±1.81ns        ? ?/sec
cast int32 to int32 512                                            1.12    223.6±0.53ns        ? ?/sec    1.00    198.8±0.20ns        ? ?/sec
cast int32 to int64 512                                            1.00   1096.9±0.99ns        ? ?/sec    1.06   1167.2±2.57ns        ? ?/sec
cast int32 to uint32 512                                           1.05   1531.4±4.62ns        ? ?/sec    1.00   1464.9±1.52ns        ? ?/sec
cast int64 to int32 512                                            1.00  1568.7±34.86ns        ? ?/sec    1.08   1688.6±2.08ns        ? ?/sec
cast no runs of int32s to ree<int32>                               1.00     56.3±0.10µs        ? ?/sec    1.36     76.7±0.18µs        ? ?/sec
cast runs of 10 string to ree<int32>                               1.00      9.3±0.02µs        ? ?/sec    1.72     16.0±0.07µs        ? ?/sec
cast runs of 1000 int32s to ree<int32>                             1.00      3.9±0.01µs        ? ?/sec    2.13      8.2±0.02µs        ? ?/sec
cast string single run to ree<int32>                               1.00     23.8±0.08µs        ? ?/sec    1.15     27.4±0.08µs        ? ?/sec
cast string to binary view 512                                     1.00      3.3±0.00µs        ? ?/sec    1.00      3.3±0.00µs        ? ?/sec
cast string view to binary view                                    1.00     96.4±0.12ns        ? ?/sec    1.02     98.1±0.19ns        ? ?/sec
cast string view to dict                                           1.02    175.6±0.39µs        ? ?/sec    1.00    171.5±0.36µs        ? ?/sec
cast string view to string                                         1.00     48.4±0.10µs        ? ?/sec    1.01     48.9±0.08µs        ? ?/sec
cast string view to wide string                                    1.00     49.8±0.27µs        ? ?/sec    1.04     51.7±0.15µs        ? ?/sec
cast time32s to time32ms 512                                       1.02    290.8±0.91ns        ? ?/sec    1.00    285.2±0.44ns        ? ?/sec
cast time32s to time64us 512                                       1.04    302.1±0.60ns        ? ?/sec    1.00    289.8±0.55ns        ? ?/sec
cast time64ns to time32s 512                                       1.01    508.7±1.38ns        ? ?/sec    1.00    501.2±5.97ns        ? ?/sec
cast timestamp_ms to i64 512                                       1.00    434.9±1.91ns        ? ?/sec    1.01    440.4±5.99ns        ? ?/sec
cast timestamp_ms to timestamp_ns 512                              1.06      2.3±0.00µs        ? ?/sec    1.00      2.2±0.01µs        ? ?/sec
cast timestamp_ns to timestamp_s 512                               1.13    224.0±0.58ns        ? ?/sec    1.00    197.9±0.57ns        ? ?/sec
cast utf8 to date32 512                                            1.00     11.4±0.12µs        ? ?/sec    1.00     11.3±0.02µs        ? ?/sec
cast utf8 to date64 512                                            1.00     42.6±0.44µs        ? ?/sec    1.00     42.8±0.26µs        ? ?/sec
cast utf8 to f32                                                   1.01     11.8±0.03µs        ? ?/sec    1.00     11.7±0.06µs        ? ?/sec
cast wide string to binary view 512                                1.02      5.7±0.01µs        ? ?/sec    1.00      5.6±0.01µs        ? ?/sec

alamb pushed a commit that referenced this pull request Oct 30, 2025
Related to #8707.

Inspired by
#8716 (comment), a
follow up improvement to #8589: We already know what the length of the
two vectors will be, so we can create them with that capacity.
@Weijun-H Weijun-H force-pushed the 8708-remove-arrow-ord-in-arrow-cast branch from f9fc4fe to 4fd7761 Compare October 30, 2025 13:32

@Jefffrey Jefffrey left a comment

Contributor

Are we still interested in getting this through? Could we run bot benchmarks again (I don't think I have the permission)

Comment thread arrow-cast/src/cast/run_array.rs Outdated
Comment thread arrow-cast/src/cast/run_array.rs Outdated
Comment thread arrow-cast/src/cast/run_array.rs Outdated
@alamb

alamb commented Feb 6, 2026

Contributor

Are we still interested in getting this through? Could we run bot benchmarks again (I don't think I have the permission)

I added you here: alamb/datafusion-benchmarking@1f1e8b2

(though beware the benchmark machine is non-ideal in that it is a shared VM and thus is prone to workload variations)

@alamb

alamb commented Feb 6, 2026

Contributor

run benchmark cast_kernels

1 similar comment
@alamb

alamb commented Feb 6, 2026

Contributor

run benchmark cast_kernels

@alamb-ghbot


🤖 ./gh_compare_arrow.sh gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing 8708-remove-arrow-ord-in-arrow-cast (d5747f5) to 7dbe58a diff
BENCH_NAME=cast_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench cast_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=8708-remove-arrow-ord-in-arrow-cast
Results will be posted here when complete

@alamb

alamb commented Feb 6, 2026

Contributor

Are we still interested in getting this through? Could we run bot benchmarks again (I don't think I have the permission)

It seems to me like a good idea in theory -- I just didn't have the time to follow it through and complete a review 😢

@alamb-ghbot


🤖: Benchmark completed

Details

group                                                              8708-remove-arrow-ord-in-arrow-cast    main
-----                                                              -----------------------------------    ----
cast binary view to string                                         1.04     73.8±0.69µs        ? ?/sec    1.00     70.9±1.06µs        ? ?/sec
cast binary view to string view                                    1.00    100.3±0.75µs        ? ?/sec    1.09    109.8±0.40µs        ? ?/sec
cast binary view to wide string                                    1.04     70.3±0.20µs        ? ?/sec    1.00     67.4±0.32µs        ? ?/sec
cast date32 to date64 512                                          1.03    302.6±0.99ns        ? ?/sec    1.00   293.6±10.27ns        ? ?/sec
cast date64 to date32 512                                          1.00    501.4±1.12ns        ? ?/sec    1.00    500.4±1.16ns        ? ?/sec
cast decimal128 to decimal128 512                                  1.00    613.8±1.45ns        ? ?/sec    1.00    615.3±9.85ns        ? ?/sec
cast decimal128 to decimal128 512 lower precision                  1.00      5.2±0.02µs        ? ?/sec    1.01      5.2±0.05µs        ? ?/sec
cast decimal128 to decimal128 512 with lower scale (infallible)    1.00      7.1±0.03µs        ? ?/sec    1.00      7.1±0.06µs        ? ?/sec
cast decimal128 to decimal128 512 with same scale                  1.00     79.0±0.42ns        ? ?/sec    1.04     82.5±1.62ns        ? ?/sec
cast decimal128 to decimal256 512                                  1.00      2.3±0.02µs        ? ?/sec    1.00      2.3±0.04µs        ? ?/sec
cast decimal256 to decimal128 512                                  1.01     48.6±0.56µs        ? ?/sec    1.00     48.2±0.32µs        ? ?/sec
cast decimal256 to decimal256 512                                  1.00     10.9±0.04µs        ? ?/sec    1.02     11.1±0.13µs        ? ?/sec
cast decimal256 to decimal256 512 with same scale                  1.00     79.6±0.88ns        ? ?/sec    1.03     82.3±1.51ns        ? ?/sec
cast decimal32 to decimal32 512                                    1.08      2.5±0.01µs        ? ?/sec    1.00      2.3±0.12µs        ? ?/sec
cast decimal32 to decimal32 512 lower precision                    1.04      2.8±0.01µs        ? ?/sec    1.00      2.7±0.03µs        ? ?/sec
cast decimal32 to decimal64 512                                    1.00    321.7±1.11ns        ? ?/sec    1.00    320.8±2.86ns        ? ?/sec
cast decimal64 to decimal32 512                                    1.00      2.8±0.01µs        ? ?/sec    1.19      3.4±0.03µs        ? ?/sec
cast decimal64 to decimal64 512                                    1.01    387.9±1.06ns        ? ?/sec    1.00    383.5±7.65ns        ? ?/sec
cast dict to string view                                           2.59    120.3±1.63µs        ? ?/sec    1.00     46.4±0.52µs        ? ?/sec
cast f32 to string 512                                             1.00     18.5±0.07µs        ? ?/sec    1.01     18.6±0.08µs        ? ?/sec
cast f64 to string 512                                             1.02     22.2±0.28µs        ? ?/sec    1.00     21.8±0.25µs        ? ?/sec
cast float32 to int32 512                                          1.00  1244.1±56.20ns        ? ?/sec    1.13  1409.8±46.72ns        ? ?/sec
cast float64 to float32 512                                        1.00   789.2±17.31ns        ? ?/sec    1.15   910.3±10.13ns        ? ?/sec
cast float64 to uint64 512                                         1.00  1452.1±11.45ns        ? ?/sec    1.08  1562.0±15.66ns        ? ?/sec
cast i64 to string 512                                             1.03     14.7±0.12µs        ? ?/sec    1.00     14.3±0.18µs        ? ?/sec
cast int32 to float32 512                                          1.00    712.2±2.87ns        ? ?/sec    1.21    863.4±6.57ns        ? ?/sec
cast int32 to float64 512                                          1.00    720.2±1.50ns        ? ?/sec    1.19    859.7±8.45ns        ? ?/sec
cast int32 to int32 512                                            1.01    181.2±1.68ns        ? ?/sec    1.00    179.9±1.89ns        ? ?/sec
cast int32 to int64 512                                            1.00    720.1±5.74ns        ? ?/sec    1.17    841.7±5.28ns        ? ?/sec
cast int32 to uint32 512                                           1.00   1274.7±3.15ns        ? ?/sec    1.10   1398.2±6.54ns        ? ?/sec
cast int64 to int32 512                                            1.00  1373.4±20.01ns        ? ?/sec    1.09   1497.8±9.40ns        ? ?/sec
cast no runs of int32s to ree<int32>                               1.00     60.5±0.15µs        ? ?/sec    1.44     86.9±2.52µs        ? ?/sec
cast runs of 10 string to ree<int32>                               1.00      9.7±0.04µs        ? ?/sec    1.63     15.8±0.11µs        ? ?/sec
cast runs of 1000 int32s to ree<int32>                             1.00      3.7±0.07µs        ? ?/sec    2.11      7.9±0.13µs        ? ?/sec
cast string single run to ree<int32>                               1.80     41.1±0.34µs        ? ?/sec    1.00     22.8±0.25µs        ? ?/sec
cast string to binary view 512                                     1.03      3.5±0.02µs        ? ?/sec    1.00      3.4±0.01µs        ? ?/sec
cast string view to binary view                                    1.00     79.4±0.57ns        ? ?/sec    1.01     80.4±0.76ns        ? ?/sec
cast string view to dict                                           1.00    213.8±2.40µs        ? ?/sec    1.03    220.1±4.39µs        ? ?/sec
cast string view to string                                         1.00     47.9±0.11µs        ? ?/sec    1.04     49.7±0.59µs        ? ?/sec
cast string view to wide string                                    1.00     49.9±0.20µs        ? ?/sec    1.01     50.2±1.04µs        ? ?/sec
cast time32s to time32ms 512                                       1.00    286.0±1.61ns        ? ?/sec    1.00    286.4±0.67ns        ? ?/sec
cast time32s to time64us 512                                       1.02    299.1±1.05ns        ? ?/sec    1.00    292.6±0.60ns        ? ?/sec
cast time64ns to time32s 512                                       1.00    507.1±1.47ns        ? ?/sec    1.00    506.2±1.21ns        ? ?/sec
cast timestamp_ms to i64 512                                       1.00    245.7±1.75ns        ? ?/sec    1.03    254.3±8.69ns        ? ?/sec
cast timestamp_ms to timestamp_ns 512                              1.00  1620.5±22.70ns        ? ?/sec    1.14   1848.7±9.64ns        ? ?/sec
cast timestamp_ns to timestamp_s 512                               1.00    177.9±1.55ns        ? ?/sec    1.01    178.8±1.79ns        ? ?/sec
cast utf8 to date32 512                                            1.00     11.3±0.03µs        ? ?/sec    1.01     11.5±0.09µs        ? ?/sec
cast utf8 to date64 512                                            1.08     46.6±0.43µs        ? ?/sec    1.00     43.0±0.38µs        ? ?/sec
cast utf8 to f32                                                   1.03     11.9±0.19µs        ? ?/sec    1.00     11.6±0.13µs        ? ?/sec
cast wide string to binary view 512                                1.02      6.2±0.16µs        ? ?/sec    1.00      6.1±0.01µs        ? ?/sec

@alamb-ghbot


🤖 ./gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing 8708-remove-arrow-ord-in-arrow-cast (d5747f5) to 7dbe58a diff
BENCH_NAME=cast_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench cast_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=8708-remove-arrow-ord-in-arrow-cast
Results will be posted here when complete

@alamb-ghbot


🤖: Benchmark completed

Details

group                                                              8708-remove-arrow-ord-in-arrow-cast    main
-----                                                              -----------------------------------    ----
cast binary view to string                                         1.05     73.9±0.54µs        ? ?/sec    1.00     70.3±0.19µs        ? ?/sec
cast binary view to string view                                    1.00    100.2±0.31µs        ? ?/sec    1.12    111.9±6.25µs        ? ?/sec
cast binary view to wide string                                    1.05     70.3±0.89µs        ? ?/sec    1.00     67.0±0.17µs        ? ?/sec
cast date32 to date64 512                                          1.00    293.3±0.97ns        ? ?/sec    1.02    300.2±1.67ns        ? ?/sec
cast date64 to date32 512                                          1.00    501.5±1.42ns        ? ?/sec    1.01    504.8±1.05ns        ? ?/sec
cast decimal128 to decimal128 512                                  1.08    655.4±6.85ns        ? ?/sec    1.00    608.2±1.96ns        ? ?/sec
cast decimal128 to decimal128 512 lower precision                  1.01      5.2±0.03µs        ? ?/sec    1.00      5.2±0.02µs        ? ?/sec
cast decimal128 to decimal128 512 with lower scale (infallible)    1.00      7.1±0.05µs        ? ?/sec    1.00      7.1±0.02µs        ? ?/sec
cast decimal128 to decimal128 512 with same scale                  1.00     79.2±0.29ns        ? ?/sec    1.01     79.8±0.31ns        ? ?/sec
cast decimal128 to decimal256 512                                  1.01      2.3±0.03µs        ? ?/sec    1.00      2.3±0.03µs        ? ?/sec
cast decimal256 to decimal128 512                                  1.01     48.4±0.22µs        ? ?/sec    1.00     48.2±0.33µs        ? ?/sec
cast decimal256 to decimal256 512                                  1.00     10.9±0.06µs        ? ?/sec    1.01     11.0±0.05µs        ? ?/sec
cast decimal256 to decimal256 512 with same scale                  1.00     79.5±0.60ns        ? ?/sec    1.00     79.5±0.30ns        ? ?/sec
cast decimal32 to decimal32 512                                    1.12      2.5±0.10µs        ? ?/sec    1.00      2.2±0.02µs        ? ?/sec
cast decimal32 to decimal32 512 lower precision                    1.06      2.9±0.23µs        ? ?/sec    1.00      2.7±0.03µs        ? ?/sec
cast decimal32 to decimal64 512                                    1.01    315.5±7.78ns        ? ?/sec    1.00    313.2±7.17ns        ? ?/sec
cast decimal64 to decimal32 512                                    1.00      2.8±0.01µs        ? ?/sec    1.19      3.3±0.03µs        ? ?/sec
cast decimal64 to decimal64 512                                    1.01    383.9±1.69ns        ? ?/sec    1.00    381.9±2.90ns        ? ?/sec
cast dict to string view                                           2.57    120.3±0.37µs        ? ?/sec    1.00     46.9±3.57µs        ? ?/sec
cast f32 to string 512                                             1.00     18.5±0.30µs        ? ?/sec    1.02     18.8±0.25µs        ? ?/sec
cast f64 to string 512                                             1.03     22.3±0.33µs        ? ?/sec    1.00     21.6±0.09µs        ? ?/sec
cast float32 to int32 512                                          1.00   1233.8±6.35ns        ? ?/sec    1.13   1390.3±5.00ns        ? ?/sec
cast float64 to float32 512                                        1.00    788.3±2.48ns        ? ?/sec    1.17    922.2±2.29ns        ? ?/sec
cast float64 to uint64 512                                         1.00   1449.8±2.39ns        ? ?/sec    1.10  1601.5±10.48ns        ? ?/sec
cast i64 to string 512                                             1.03     14.7±0.05µs        ? ?/sec    1.00     14.3±0.06µs        ? ?/sec
cast int32 to float32 512                                          1.00    713.3±2.65ns        ? ?/sec    1.20    858.0±2.08ns        ? ?/sec
cast int32 to float64 512                                          1.00   724.0±17.10ns        ? ?/sec    1.20    866.0±9.34ns        ? ?/sec
cast int32 to int32 512                                            1.00    180.6±3.75ns        ? ?/sec    1.00    180.9±3.29ns        ? ?/sec
cast int32 to int64 512                                            1.00    721.2±3.18ns        ? ?/sec    1.16    837.9±5.86ns        ? ?/sec
cast int32 to uint32 512                                           1.00  1264.2±17.36ns        ? ?/sec    1.10   1387.7±2.59ns        ? ?/sec
cast int64 to int32 512                                            1.00   1370.2±2.54ns        ? ?/sec    1.10   1504.7±5.17ns        ? ?/sec
cast no runs of int32s to ree<int32>                               1.00     60.6±0.14µs        ? ?/sec    1.43     86.8±1.62µs        ? ?/sec
cast runs of 10 string to ree<int32>                               1.00      9.7±0.10µs        ? ?/sec    1.63     15.8±0.18µs        ? ?/sec
cast runs of 1000 int32s to ree<int32>                             1.00      3.7±0.08µs        ? ?/sec    2.11      7.9±0.19µs        ? ?/sec
cast string single run to ree<int32>                               1.79     41.1±0.42µs        ? ?/sec    1.00     22.9±0.31µs        ? ?/sec
cast string to binary view 512                                     1.00      3.4±0.03µs        ? ?/sec    1.03      3.5±0.01µs        ? ?/sec
cast string view to binary view                                    1.00     79.6±0.40ns        ? ?/sec    1.01     80.4±2.57ns        ? ?/sec
cast string view to dict                                           1.00    213.0±1.89µs        ? ?/sec    1.02    217.1±3.28µs        ? ?/sec
cast string view to string                                         1.00     48.0±0.11µs        ? ?/sec    1.04     49.7±0.26µs        ? ?/sec
cast string view to wide string                                    1.00     50.0±1.83µs        ? ?/sec    1.00     49.9±0.18µs        ? ?/sec
cast time32s to time32ms 512                                       1.00    285.8±0.81ns        ? ?/sec    1.00    286.2±2.22ns        ? ?/sec
cast time32s to time64us 512                                       1.02    298.6±0.67ns        ? ?/sec    1.00    292.8±2.27ns        ? ?/sec
cast time64ns to time32s 512                                       1.00   509.0±13.38ns        ? ?/sec    1.00    508.0±4.13ns        ? ?/sec
cast timestamp_ms to i64 512                                       1.00    245.7±1.64ns        ? ?/sec    1.01    247.2±1.72ns        ? ?/sec
cast timestamp_ms to timestamp_ns 512                              1.00   1616.4±6.95ns        ? ?/sec    1.16  1872.2±50.57ns        ? ?/sec
cast timestamp_ns to timestamp_s 512                               1.00    180.3±4.98ns        ? ?/sec    1.00    180.3±3.89ns        ? ?/sec
cast utf8 to date32 512                                            1.00     11.3±0.09µs        ? ?/sec    1.01     11.4±0.02µs        ? ?/sec
cast utf8 to date64 512                                            1.00     46.6±0.72µs        ? ?/sec    1.03     48.1±0.78µs        ? ?/sec
cast utf8 to f32                                                   1.04     12.0±0.10µs        ? ?/sec    1.00     11.5±0.09µs        ? ?/sec
cast wide string to binary view 512                                1.00      6.0±0.02µs        ? ?/sec    1.02      6.1±0.01µs        ? ?/sec
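
A note on reading the table: the number in front of each timing seems to be that timing divided by the faster of the two in the row, so `1.00` marks the faster side. The sketch below reproduces this for the `cast string single run to ree<int32>` row, assuming that interpretation. The helper name `relative` is hypothetical, and ratios recomputed from the rounded timings shown in the table can differ slightly from the printed ones.

```rust
/// Hypothetical helper: ratio of each timing to the fastest timing in the row.
fn relative(times: &[f64]) -> Vec<f64> {
    // Find the smallest (fastest) timing in the row.
    let fastest = times.iter().copied().fold(f64::INFINITY, f64::min);
    // Express every timing as a multiple of the fastest one.
    times.iter().map(|t| t / fastest).collect()
}
```

For that row, `relative(&[41.1, 22.9])` yields roughly `1.79` for the branch and `1.00` for main, which matches the printed values.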

@Weijun-H Weijun-H force-pushed the 8708-remove-arrow-ord-in-arrow-cast branch from d5747f5 to f848a0e Compare April 26, 2026 04:24
@Weijun-H Weijun-H force-pushed the 8708-remove-arrow-ord-in-arrow-cast branch from d66a2bc to b86ef14 Compare April 26, 2026 04:37
@Weijun-H Weijun-H force-pushed the 8708-remove-arrow-ord-in-arrow-cast branch from b86ef14 to 8a491f6 Compare April 26, 2026 04:39
@Weijun-H Weijun-H requested review from Jefffrey and alamb April 26, 2026 05:18
@Jefffrey

Contributor

run benchmark cast_kernels

@adriangbot


🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4700170237-566-qh7hc 6.12.68+ #1 SMP Sat May 2 07:49:07 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing 8708-remove-arrow-ord-in-arrow-cast (8a491f6) to 4fa8d2f (merge-base) diff
BENCH_NAME=cast_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench cast_kernels
BENCH_FILTER=
Results will be posted here when complete



@Jefffrey

Contributor

cc @Rich-T-kid this might be interesting to you with your recent REE work

also might be related to

as well

@adriangbot


🤖 Arrow criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

Details

group                                                              8708-remove-arrow-ord-in-arrow-cast    main
-----                                                              -----------------------------------    ----
"cast decimal128 to float32"                                       1.00     27.4±0.01µs        ? ?/sec    1.00     27.4±0.02µs        ? ?/sec
"cast decimal128 to float64"                                       1.00     27.1±0.02µs        ? ?/sec    1.00     27.1±0.01µs        ? ?/sec
"cast decimal128 to int16"                                         1.01     54.0±0.76µs        ? ?/sec    1.00     53.2±0.78µs        ? ?/sec
"cast decimal128 to int32"                                         1.01     37.9±0.09µs        ? ?/sec    1.00     37.6±0.11µs        ? ?/sec
"cast decimal128 to int64"                                         1.00     37.0±0.05µs        ? ?/sec    1.00     37.1±0.06µs        ? ?/sec
"cast decimal128 to int8"                                          1.00     51.8±0.78µs        ? ?/sec    1.00     51.6±0.88µs        ? ?/sec
"cast decimal128 to uint16"                                        1.00     54.5±0.84µs        ? ?/sec    1.00     54.5±0.84µs        ? ?/sec
"cast decimal128 to uint32"                                        1.00     36.0±0.07µs        ? ?/sec    1.01     36.2±0.11µs        ? ?/sec
"cast decimal128 to uint64"                                        1.00     35.3±0.07µs        ? ?/sec    1.01     35.6±0.09µs        ? ?/sec
"cast decimal128 to uint8"                                         1.00     50.0±0.59µs        ? ?/sec    1.00     50.0±0.60µs        ? ?/sec
"cast decimal256 to float32"                                       1.00     70.4±0.04µs        ? ?/sec    1.00     70.5±0.04µs        ? ?/sec
"cast decimal256 to float64"                                       1.00     68.3±0.08µs        ? ?/sec    1.00     68.3±0.06µs        ? ?/sec
"cast decimal256 to int16"                                         1.01    167.1±0.75µs        ? ?/sec    1.00    165.9±0.66µs        ? ?/sec
"cast decimal256 to int32"                                         1.02    147.3±0.12µs        ? ?/sec    1.00    144.1±0.14µs        ? ?/sec
"cast decimal256 to int64"                                         1.01    144.5±1.00µs        ? ?/sec    1.00    142.4±0.66µs        ? ?/sec
"cast decimal256 to int8"                                          1.01    163.7±0.80µs        ? ?/sec    1.00    162.4±0.79µs        ? ?/sec
"cast decimal256 to uint16"                                        1.00    166.7±0.39µs        ? ?/sec    1.00    166.9±0.32µs        ? ?/sec
"cast decimal256 to uint32"                                        1.01    134.6±0.12µs        ? ?/sec    1.00    133.6±0.38µs        ? ?/sec
"cast decimal256 to uint64"                                        1.01    134.6±0.11µs        ? ?/sec    1.00    133.0±0.58µs        ? ?/sec
"cast decimal256 to uint8"                                         1.00    163.4±0.40µs        ? ?/sec    1.00    163.2±0.37µs        ? ?/sec
"cast decimal32 to float32"                                        1.00      6.8±0.00µs        ? ?/sec    1.00      6.8±0.00µs        ? ?/sec
"cast decimal32 to float64"                                        1.00      6.8±0.00µs        ? ?/sec    1.00      6.8±0.00µs        ? ?/sec
"cast decimal32 to int16"                                          1.00     26.0±0.71µs        ? ?/sec    1.01     26.2±0.59µs        ? ?/sec
"cast decimal32 to int32"                                          1.03     20.8±0.21µs        ? ?/sec    1.00     20.2±0.33µs        ? ?/sec
"cast decimal32 to int64"                                          1.00     20.2±0.13µs        ? ?/sec    1.01     20.4±0.15µs        ? ?/sec
"cast decimal32 to int8"                                           1.00     33.4±0.63µs        ? ?/sec    1.01     33.6±0.61µs        ? ?/sec
"cast decimal32 to uint16"                                         1.00     26.2±0.53µs        ? ?/sec    1.00     26.2±1.17µs        ? ?/sec
"cast decimal32 to uint32"                                         1.05     20.3±0.15µs        ? ?/sec    1.00     19.4±0.08µs        ? ?/sec
"cast decimal32 to uint64"                                         1.06     21.8±0.55µs        ? ?/sec    1.00     20.5±0.45µs        ? ?/sec
"cast decimal32 to uint8"                                          1.00     33.1±0.54µs        ? ?/sec    1.05     34.9±0.42µs        ? ?/sec
"cast decimal64 to float32"                                        1.00      6.8±0.00µs        ? ?/sec    1.00      6.8±0.00µs        ? ?/sec
"cast decimal64 to float64"                                        1.00      6.8±0.00µs        ? ?/sec    1.00      6.8±0.00µs        ? ?/sec
"cast decimal64 to int16"                                          1.00     33.9±0.42µs        ? ?/sec    1.01     34.1±0.28µs        ? ?/sec
"cast decimal64 to int32"                                          1.00     26.5±0.06µs        ? ?/sec    1.00     26.4±0.07µs        ? ?/sec
"cast decimal64 to int64"                                          1.00     26.1±0.06µs        ? ?/sec    1.00     26.0±0.09µs        ? ?/sec
"cast decimal64 to int8"                                           1.00     34.1±0.26µs        ? ?/sec    1.01     34.3±0.29µs        ? ?/sec
"cast decimal64 to uint16"                                         1.00     34.2±0.52µs        ? ?/sec    1.00     34.2±0.32µs        ? ?/sec
"cast decimal64 to uint32"                                         1.00     26.2±0.07µs        ? ?/sec    1.00     26.2±0.10µs        ? ?/sec
"cast decimal64 to uint64"                                         1.01     26.0±0.06µs        ? ?/sec    1.00     25.7±0.05µs        ? ?/sec
"cast decimal64 to uint8"                                          1.00     34.1±0.32µs        ? ?/sec    1.00     34.0±0.32µs        ? ?/sec
"cast float32 to decimal128(32, 3)"                                1.00     33.9±0.32µs        ? ?/sec    1.00     33.9±0.37µs        ? ?/sec
"cast float32 to decimal256(76, 4)"                                1.02    509.4±7.80µs        ? ?/sec    1.00    498.4±3.29µs        ? ?/sec
"cast float32 to decimal32(9, 2)"                                  1.00     20.6±0.98µs        ? ?/sec    1.03     21.2±1.36µs        ? ?/sec
"cast float32 to decimal64(18, 2"                                  1.01     22.1±0.76µs        ? ?/sec    1.00     21.9±0.71µs        ? ?/sec
"cast float64 to decimal128(32, 3)"                                1.00     32.3±0.50µs        ? ?/sec    1.00     32.3±0.50µs        ? ?/sec
"cast float64 to decimal256(76, 4)"                                1.02    507.3±8.16µs        ? ?/sec    1.00    495.5±4.10µs        ? ?/sec
"cast float64 to decimal32(9, 2)"                                  1.01     20.9±0.78µs        ? ?/sec    1.00     20.6±0.72µs        ? ?/sec
"cast float64 to decimal64(18, 2"                                  1.01     21.7±0.48µs        ? ?/sec    1.00     21.6±0.62µs        ? ?/sec
"cast invalid float32 to decimal128(32, 3)"                        1.00     22.4±1.03µs        ? ?/sec    1.00     22.3±0.78µs        ? ?/sec
"cast invalid float32 to decimal256(76, 4)"                        1.00     39.9±0.50µs        ? ?/sec    1.00     40.0±0.82µs        ? ?/sec
"cast invalid float32 to decimal32(9, 2)"                          1.03     20.9±1.92µs        ? ?/sec    1.00     20.3±1.48µs        ? ?/sec
"cast invalid float32 to decimal64(18, 2"                          1.11     25.7±2.08µs        ? ?/sec    1.00     23.1±0.87µs        ? ?/sec
"cast invalid float64 to decimal32(9, 2)"                          1.00     20.6±1.21µs        ? ?/sec    1.07     21.9±0.90µs        ? ?/sec
"cast invalid float64 to to decimal128(32, 3)"                     1.00     22.8±0.90µs        ? ?/sec    1.02     23.2±1.15µs        ? ?/sec
"cast invalid float64 to to decimal256(76, 4)"                     1.00     39.3±0.80µs        ? ?/sec    1.00     39.3±1.08µs        ? ?/sec
"cast invalid float64 to to decimal64(18, 2)"                      1.01     23.4±1.27µs        ? ?/sec    1.00     23.2±1.30µs        ? ?/sec
"cast invalid string to decimal128(38, 3)"                         1.00    714.4±2.18µs        ? ?/sec    1.00    711.9±0.89µs        ? ?/sec
"cast invalid string to decimal256(76, 4)"                         1.00    714.4±0.64µs        ? ?/sec    1.00    717.7±5.37µs        ? ?/sec
"cast invalid string to decimal32(9, 2)"                           1.01    683.8±0.94µs        ? ?/sec    1.00    680.2±0.97µs        ? ?/sec
"cast invalid string to decimal64(18, 2)"                          1.00    687.6±1.79µs        ? ?/sec    1.00    684.9±0.86µs        ? ?/sec
"cast string to decimal128(38, 3)"                                 1.00    640.0±0.64µs        ? ?/sec    1.01    646.6±0.48µs        ? ?/sec
"cast string to decimal256(76, 4)"                                 1.00    657.1±0.84µs        ? ?/sec    1.01    661.1±0.30µs        ? ?/sec
"cast string to decimal32(9, 2)"                                   1.00    786.2±1.10µs        ? ?/sec    1.01    791.4±0.65µs        ? ?/sec
"cast string to decimal64(18, 2)"                                  1.00    618.5±0.65µs        ? ?/sec    1.01    621.9±0.73µs        ? ?/sec
cast binary view to string                                         1.00     58.7±0.47µs        ? ?/sec    1.01     59.0±0.52µs        ? ?/sec
cast binary view to string view                                    1.02     65.0±0.35µs        ? ?/sec    1.00     63.5±0.33µs        ? ?/sec
cast binary view to wide string                                    1.00     58.5±0.33µs        ? ?/sec    1.01     58.9±0.44µs        ? ?/sec
cast date32 to date64 512                                          1.01    324.1±2.37ns        ? ?/sec    1.00    322.0±1.06ns        ? ?/sec
cast date64 to date32 512                                          1.00    404.7±2.26ns        ? ?/sec    1.02    412.7±0.73ns        ? ?/sec
cast decimal128 to decimal128 512                                  1.00      6.9±0.01µs        ? ?/sec    1.00      6.9±0.00µs        ? ?/sec
cast decimal128 to decimal128 512 lower precision                  1.07     14.6±0.04µs        ? ?/sec    1.00     13.7±0.24µs        ? ?/sec
cast decimal128 to decimal128 512 with lower scale (infallible)    1.00     45.9±0.09µs        ? ?/sec    1.00     45.9±0.06µs        ? ?/sec
cast decimal128 to decimal128 512 with same scale                  1.00     75.0±0.60ns        ? ?/sec    1.02     76.4±0.64ns        ? ?/sec
cast decimal128 to decimal256 512                                  1.00     26.2±0.03µs        ? ?/sec    1.00     26.2±0.03µs        ? ?/sec
cast decimal256 to decimal128 512                                  1.00    313.5±0.48µs        ? ?/sec    1.01    316.0±0.29µs        ? ?/sec
cast decimal256 to decimal256 512                                  1.00     81.9±0.11µs        ? ?/sec    1.00     82.1±0.08µs        ? ?/sec
cast decimal256 to decimal256 512 with same scale                  1.00     76.1±0.69ns        ? ?/sec    1.06     81.0±4.49ns        ? ?/sec
cast decimal32 to decimal32 512                                    1.00      8.6±0.02µs        ? ?/sec    1.00      8.6±0.01µs        ? ?/sec
cast decimal32 to decimal32 512 lower precision                    1.02     10.3±0.06µs        ? ?/sec    1.00     10.1±0.06µs        ? ?/sec
cast decimal32 to decimal64 512                                    1.01      3.4±0.00µs        ? ?/sec    1.00      3.3±0.00µs        ? ?/sec
cast decimal64 to decimal32 512                                    1.00     32.4±0.02µs        ? ?/sec    1.00     32.4±0.03µs        ? ?/sec
cast decimal64 to decimal64 512                                    1.25      3.6±0.01µs        ? ?/sec    1.00      2.9±0.01µs        ? ?/sec
cast dict to string view                                           1.01     40.7±1.88µs        ? ?/sec    1.00     40.3±1.59µs        ? ?/sec
cast f32 to string 512                                             1.00     11.8±0.05µs        ? ?/sec    1.00     11.7±0.04µs        ? ?/sec
cast f64 to string 512                                             1.00     15.3±0.04µs        ? ?/sec    1.00     15.2±0.04µs        ? ?/sec
cast float32 to int32 512                                          1.00   1345.9±5.92ns        ? ?/sec    1.03   1381.3±5.35ns        ? ?/sec
cast float64 to float32 512                                        1.00    717.5±1.35ns        ? ?/sec    1.03    736.1±1.82ns        ? ?/sec
cast float64 to uint64 512                                         1.00   1400.8±3.56ns        ? ?/sec    1.00  1394.0±10.91ns        ? ?/sec
cast i64 to string 512                                             1.01      8.9±0.04µs        ? ?/sec    1.00      8.8±0.03µs        ? ?/sec
cast int32 to float32 512                                          1.00    692.0±4.62ns        ? ?/sec    1.01    700.0±5.75ns        ? ?/sec
cast int32 to float64 512                                          1.00    702.8±2.22ns        ? ?/sec    1.00    706.1±3.31ns        ? ?/sec
cast int32 to int32 512                                            1.00    172.0±1.16ns        ? ?/sec    1.00    172.1±1.08ns        ? ?/sec
cast int32 to int64 512                                            1.00    688.7±4.06ns        ? ?/sec    1.04    716.1±3.83ns        ? ?/sec
cast int32 to uint32 512                                           1.00   1359.3±6.24ns        ? ?/sec    1.02   1383.9±5.58ns        ? ?/sec
cast int64 to int32 512                                            1.00   1452.7±1.21ns        ? ?/sec    1.04   1506.9±9.41ns        ? ?/sec
cast no runs of int32s to ree<int32>                               1.00     40.6±0.76µs        ? ?/sec    1.42     57.8±0.88µs        ? ?/sec
cast runs of 10 string to ree<int32>                               1.00      6.1±0.02µs        ? ?/sec    1.44      8.8±0.06µs        ? ?/sec
cast runs of 1000 int32s to ree<int32>                             1.00      2.4±0.01µs        ? ?/sec    1.43      3.4±0.01µs        ? ?/sec
cast string single run to ree<int32>                               1.08     29.6±0.16µs        ? ?/sec    1.00     27.4±0.02µs        ? ?/sec
cast string to binary view 512                                     1.00      2.3±0.02µs        ? ?/sec    1.01      2.3±0.02µs        ? ?/sec
cast string view to binary view                                    1.00     73.0±0.75ns        ? ?/sec    1.01     73.5±0.83ns        ? ?/sec
cast string view to dict                                           1.01    175.8±0.62µs        ? ?/sec    1.00    175.0±0.59µs        ? ?/sec
cast string view to string                                         1.00     45.5±2.47µs        ? ?/sec    1.03     46.7±2.22µs        ? ?/sec
cast string view to wide string                                    1.00     45.3±2.11µs        ? ?/sec    1.00     45.5±1.97µs        ? ?/sec
cast time32s to time32ms 512                                       1.00    141.6±2.05ns        ? ?/sec    1.06    149.6±0.43ns        ? ?/sec
cast time32s to time64us 512                                       1.01    324.6±2.26ns        ? ?/sec    1.00    322.5±0.83ns        ? ?/sec
cast time64ns to time32s 512                                       1.01    407.2±3.39ns        ? ?/sec    1.00    402.7±0.28ns        ? ?/sec
cast timestamp_ms to i64 512                                       1.00    250.9±0.82ns        ? ?/sec    1.01    252.4±3.17ns        ? ?/sec
cast timestamp_ms to timestamp_ns 512                              1.00   1829.1±2.48ns        ? ?/sec    1.00   1830.4±6.49ns        ? ?/sec
cast timestamp_ns to timestamp_s 512                               1.00    171.5±1.43ns        ? ?/sec    1.00    171.8±2.02ns        ? ?/sec
cast utf8 to date32 512                                            1.01      6.5±0.03µs        ? ?/sec    1.00      6.4±0.04µs        ? ?/sec
cast utf8 to date64 512                                            1.00     31.8±0.10µs        ? ?/sec    1.00     31.9±0.09µs        ? ?/sec
cast utf8 to f32                                                   1.01      5.6±0.03µs        ? ?/sec    1.00      5.6±0.03µs        ? ?/sec
cast wide string to binary view 512                                1.00      4.1±0.08µs        ? ?/sec    1.02      4.1±0.10µs        ? ?/sec

Resource Usage

base (merge-base)

Metric Value
Wall time 1145.2s
Peak memory 16.7 MiB
Avg memory 15.2 MiB
CPU user 1139.8s
CPU sys 0.1s
Peak spill 0 B

branch

Metric Value
Wall time 1140.2s
Peak memory 16.8 MiB
Avg memory 15.5 MiB
CPU user 1137.3s
CPU sys 0.1s
Peak spill 0 B


Comment on lines +526 to +550
// Safety: `T::Native` is guaranteed by `ArrowPrimitiveType` to have a plain-old-data layout,
// allowing the value to be viewed as raw bytes. We copy exactly `element_size` bytes, so the
// slice built from `current` stays within bounds.
unsafe {
    let value_bytes =
        std::slice::from_raw_parts(&current as *const T::Native as *const u8, element_size);
    for chunk in pattern_bytes.chunks_mut(element_size) {
        chunk.copy_from_slice(value_bytes);
    }
}
let pattern = u128::from_ne_bytes(pattern_bytes);

while idx + elements_per_chunk <= len {
    // SAFETY: pointer arithmetic stays within the backing slice; unaligned reads are allowed.
    let chunk = unsafe { (values.as_ptr().add(idx) as *const u128).read_unaligned() };
    if chunk != pattern {
        for offset in 0..elements_per_chunk {
            let value = unsafe { *values.get_unchecked(idx + offset) };
            if value != current {
                return idx + offset;
            }
        }
    }
    idx += elements_per_chunk;
}

Contributor

With the amount of unsafe here, it would be nice if we could ensure we run this code through miri (if we don't already) 🤔
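
One way to make a Miri run meaningful is to compare the unsafe fast path against a safe oracle. The following is only a sketch under assumptions: `run_end_reference` and `run_end_chunked` are hypothetical names, the element type is fixed to `u32`, and the chunked variant mimics the 16-byte comparison with safe slice operations rather than reproducing the PR's code. A test that checks the PR's function against such an oracle could then be run under Miri (for example with `cargo +nightly miri test`).

```rust
/// Hypothetical safe oracle: index of the first element at or after `start`
/// that differs from `values[start]`, or `values.len()` if the run reaches the end.
/// Panics if `start >= values.len()`.
fn run_end_reference<T: PartialEq + Copy>(values: &[T], start: usize) -> usize {
    let current = values[start];
    values[start..]
        .iter()
        .position(|v| *v != current)
        .map_or(values.len(), |offset| start + offset)
}

/// Hypothetical safe analogue of the chunked scan: compare four `u32` values
/// (16 bytes) at a time against a repeated pattern, then finish element by element.
fn run_end_chunked(values: &[u32], start: usize) -> usize {
    const LANES: usize = 4; // 16 bytes / size_of::<u32>()
    let current = values[start];
    let pattern = [current; LANES];
    let mut idx = start;

    let mut chunks = values[start..].chunks_exact(LANES);
    for chunk in chunks.by_ref() {
        if chunk != &pattern[..] {
            // At least one element differs, so `position` returns `Some`.
            let offset = chunk.iter().position(|v| *v != current).unwrap();
            return idx + offset;
        }
        idx += LANES;
    }
    // Tail shorter than one full chunk.
    for v in chunks.remainder() {
        if *v != current {
            return idx;
        }
        idx += 1;
    }
    idx
}
```

Checking both functions against each other for every start position of a small input exercises the chunk boundary and the tail handling, which is where an out-of-bounds read would most likely hide.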

Comment on lines +480 to +485
fn ensure_capacity(vec: &mut Vec<usize>, total_len: usize) {
    if vec.len() == vec.capacity() {
        let remaining = total_len.saturating_sub(vec.len());
        vec.reserve(remaining.max(1));
    }
}

Contributor

One thing that bugs me about this function is that total_len is constant, yet the function is called inside the loop, so the only thing that changes is the capacity/len of the input vector. It either reserves up to total_len once, or afterwards keeps reserving just 1 (though I don't think that case can actually happen).
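
A possible simplification, sketched here under assumptions, is to size the vector once before the loop and let `Vec::push` handle any further growth, since it already grows geometrically. The function name `run_ends` and the slice-based signature are hypothetical; only the `len / 64 + 2` starting guess is taken from the PR.

```rust
/// Hypothetical sketch: collect the exclusive end index of each run of equal
/// consecutive values, without a per-iteration capacity check.
fn run_ends<T: PartialEq>(values: &[T]) -> Vec<usize> {
    // Starting guess borrowed from the PR's heuristic. If it is too small,
    // `push` reallocates with amortized O(1) cost, so no helper is needed.
    let mut ends = Vec::with_capacity(values.len() / 64 + 2);
    for i in 1..values.len() {
        if values[i] != values[i - 1] {
            ends.push(i);
        }
    }
    // The final run always ends at `len`, unless the input is empty.
    if !values.is_empty() {
        ends.push(values.len());
    }
    ends
}
```

Whether this matches the performance of the explicit reserve would need to be confirmed with the benchmark in this PR.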

return runs;
}

let mut run_boundaries = Vec::with_capacity(len / 64 + 2);

Contributor

I'm seeing this Vec::with_capacity(len / 64 + 2) multiple times. Is this just a guesstimate, or is it based on something?

Comment on lines +176 to +178
if array.is_empty() {
    return (Vec::new(), Vec::new());
}

Contributor

Suggested change
- if array.is_empty() {
-     return (Vec::new(), Vec::new());
- }
+ if let Some(runs) = trivial_runs(array.len()) {
+     return runs;
+ }

This hoists the call to trivial_runs() out of each helper (e.g. runs_for_primitive, runs_for_boolean) and into the caller.
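
The hoisting could look roughly like the sketch below. Everything in it is an assumption made for illustration: the `Column` enum stands in for Arrow's array dispatch, `trivial_runs` is a guessed re-implementation (its real behavior for length 1 and the meaning of the returned tuple are not shown in this thread), and the tuple is assumed to hold run ends and the start index of each run.

```rust
type Runs = (Vec<usize>, Vec<usize>);

/// Hypothetical stand-in for the PR's `trivial_runs`: handles inputs that
/// need no scan. Assumed tuple layout: (run ends, start index of each run).
fn trivial_runs(len: usize) -> Option<Runs> {
    match len {
        0 => Some((Vec::new(), Vec::new())),
        1 => Some((vec![1], vec![0])),
        _ => None,
    }
}

/// Hypothetical stand-in for dispatching on the array's data type.
enum Column<'a> {
    Ints(&'a [i64]),
    Bools(&'a [bool]),
}

/// The trivial check happens once here, so the helpers can assume `len >= 2`.
fn compute_runs(column: &Column<'_>) -> Runs {
    let len = match column {
        Column::Ints(v) => v.len(),
        Column::Bools(v) => v.len(),
    };
    if let Some(runs) = trivial_runs(len) {
        return runs;
    }
    match column {
        Column::Ints(v) => runs_for_slice(v),
        Column::Bools(v) => runs_for_slice(v),
    }
}

/// Helper without its own trivial-case check; the caller guarantees `len >= 2`.
fn runs_for_slice<T: PartialEq>(values: &[T]) -> Runs {
    let mut run_ends = Vec::new();
    let mut starts = vec![0];
    for i in 1..values.len() {
        if values[i] != values[i - 1] {
            run_ends.push(i);
            starts.push(i);
        }
    }
    run_ends.push(values.len());
    (run_ends, starts)
}
```

The trade-off is that each helper now relies on a precondition enforced by its caller, so the helpers should stay private or document that requirement.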

Comment on lines +301 to +324
fn runs_for_binary<O: OffsetSizeTrait>(array: &GenericBinaryArray<O>) -> (Vec<usize>, Vec<usize>) {
    let mut to_usize = |v: O| v.as_usize();
    runs_for_binary_like(
        array.len(),
        array.null_count(),
        array.value_offsets(),
        array.value_data(),
        |idx| array.is_valid(idx),
        &mut to_usize,
    )
}

fn runs_for_binary_like<T: Copy>(
    len: usize,
    null_count: usize,
    offsets: &[T],
    values: &[u8],
    mut is_valid: impl FnMut(usize) -> bool,
    to_usize: &mut impl FnMut(T) -> usize,
) -> (Vec<usize>, Vec<usize>) {
    if let Some(runs) = trivial_runs(len) {
        return runs;
    }

Contributor

We can combine the binary+string methods like so:

fn runs_for_bytes<O: ByteArrayType>(array: &GenericByteArray<O>) -> (Vec<usize>, Vec<usize>) {
    let len = array.len();
    let null_count = array.null_count();
    let offsets = array.value_offsets();
    let values = array.value_data();

    // rest of runs_for_binary_like()
}

This makes it generic over GenericByteArray, which covers both string and binary arrays.

We can then use it like so, removing the need for runs_for_binary() and runs_for_string():

        Utf8 => runs_for_bytes(array.as_string::<i32>()),
        LargeUtf8 => runs_for_bytes(array.as_string::<i64>()),
        Binary => runs_for_bytes(array.as_binary::<i32>()),
        LargeBinary => runs_for_bytes(array.as_binary::<i64>()),
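
To illustrate why one generic function can serve both offset widths, here is a standard-library-only sketch. It is not the PR's code: the `Offset` trait is a hypothetical stand-in for Arrow's `OffsetSizeTrait`, the function works on raw offset and value buffers instead of a `GenericByteArray`, and null handling is left out entirely.

```rust
/// Hypothetical stand-in for Arrow's offset trait, kept minimal on purpose.
trait Offset: Copy {
    fn as_usize(self) -> usize;
}
impl Offset for i32 {
    fn as_usize(self) -> usize {
        self as usize
    }
}
impl Offset for i64 {
    fn as_usize(self) -> usize {
        self as usize
    }
}

/// Bytes of element `i`, delimited by `offsets[i]..offsets[i + 1]`.
fn value_at<'a, O: Offset>(offsets: &[O], values: &'a [u8], i: usize) -> &'a [u8] {
    &values[offsets[i].as_usize()..offsets[i + 1].as_usize()]
}

/// Exclusive end index of each run of equal consecutive byte values.
/// Assumes non-negative, monotonically increasing offsets and no nulls.
fn run_ends_for_bytes<O: Offset>(offsets: &[O], values: &[u8]) -> Vec<usize> {
    // An offsets buffer with n + 1 entries describes n elements.
    let len = offsets.len().saturating_sub(1);
    let mut ends = Vec::new();
    for i in 1..len {
        if value_at(offsets, values, i) != value_at(offsets, values, i - 1) {
            ends.push(i);
        }
    }
    if len > 0 {
        ends.push(len);
    }
    ends
}
```

The same body handles 32-bit and 64-bit offsets because only the conversion to `usize` depends on the offset type, which mirrors the motivation for a single `runs_for_bytes`.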

Labels

arrow Changes to the arrow crate

Development

Successfully merging this pull request may close these issues.

Remove arrow-ord dependency in arrow-cast due to RunEndEncoded casting

6 participants