refactor: remove arrow-ord dependency in arrow-cast#8716
Conversation
arrow_ordarrow-ord dependency in arrow-cast
|
Could you try running the benchmark in this PR #8710 and see what the difference is? I thought |
a6198b8 to
c72af36
Compare
|
I just reviewed the benchmark in and I think it looks good to go. I'll merge it in and then run the benchmarks on this PR |
| values_indexes.push(0); | ||
| let mut current_data = array.slice(0, 1).to_data(); | ||
| for idx in 1..array.len() { | ||
| let next_data = array.slice(idx, 1).to_data(); |
There was a problem hiding this comment.
I think this is likely to be substantially slower than what partition does, but we can see what the benchmarks show
| @@ -134,16 +134,8 @@ pub(crate) fn cast_to_run_end_encoded<K: RunEndIndexType>( | |||
| )); | |||
| } | |||
|
|
|||
| // Partition the array to identify runs of consecutive equal values | |||
| let partitions = partition(&[Arc::clone(cast_array)])?; | |||
| let mut run_ends = Vec::new(); | |||
There was a problem hiding this comment.
Oh, great idea!
Side note: How did you profile this, using samply (it looks like), cargo build --profile profiling, and ran e.g. a unit test?
There was a problem hiding this comment.
I used Instruments that was part of Mac XCode -- it is pretty sweet as it will do whole system profiling (fire it up and start recording and it gathers the info for all processes)
There was a problem hiding this comment.
Pushed your suggestion as a PR here: #8716, maybe you can run the benchmark on that too? 😇
|
🤖 |
|
🤖: Benchmark completed Details
|
As @vegarsti predicted, this PR appears to be quite a bit slower than using partition |
70b24d1 to
fe208be
Compare
|
FYI @vegarsti , @alamb After several rounds of optimization, the current version delivers significant improvements over the previous one.
|
|
🤖 |
|
🤖: Benchmark completed Details
|
Related to #8707. Inspired by #8716 (comment), a follow up improvement to #8589: We already know what the length of the two vectors will be, so we can create them with that capacity.
f9fc4fe to
4fd7761
Compare
Jefffrey
left a comment
There was a problem hiding this comment.
Are we still interested in getting this through? Could we run bot benchmarks again (I don't think I have the permission)
I added you here: alamb/datafusion-benchmarking@1f1e8b2 (though beware the benchmark machine is non ideal in that it is a shared VM and thus is prone to workload variations) |
|
run benchmark cast_kernels |
1 similar comment
|
run benchmark cast_kernels |
|
🤖 |
It seems to me like a good idea in theory -- I just didnt' have the time to follow it through and complete a review 😢 |
|
🤖: Benchmark completed Details
|
|
🤖 |
|
🤖: Benchmark completed Details
|
d5747f5 to
f848a0e
Compare
d66a2bc to
b86ef14
Compare
b86ef14 to
8a491f6
Compare
|
run benchmark cast_kernels |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing 8708-remove-arrow-ord-in-arrow-cast (8a491f6) to 4fa8d2f (merge-base) diff File an issue against this benchmark runner |
|
cc @Rich-T-kid this might be interesting to you with your recent REE work also might be related to as well |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
| // Safety: `T::Native` is guaranteed by `ArrowPrimitiveType` to have a plain-old-data layout, | ||
| // allowing the value to be viewed as raw bytes. We copy exactly `element_size` bytes, so the | ||
| // slice built from `current` stays within bounds. | ||
| unsafe { | ||
| let value_bytes = | ||
| std::slice::from_raw_parts(¤t as *const T::Native as *const u8, element_size); | ||
| for chunk in pattern_bytes.chunks_mut(element_size) { | ||
| chunk.copy_from_slice(value_bytes); | ||
| } | ||
| } | ||
| let pattern = u128::from_ne_bytes(pattern_bytes); | ||
|
|
||
| while idx + elements_per_chunk <= len { | ||
| // SAFETY: pointer arithmetic stays within the backing slice; unaligned reads are allowed. | ||
| let chunk = unsafe { (values.as_ptr().add(idx) as *const u128).read_unaligned() }; | ||
| if chunk != pattern { | ||
| for offset in 0..elements_per_chunk { | ||
| let value = unsafe { *values.get_unchecked(idx + offset) }; | ||
| if value != current { | ||
| return idx + offset; | ||
| } | ||
| } | ||
| } | ||
| idx += elements_per_chunk; | ||
| } |
There was a problem hiding this comment.
With the amount of unsafe here, it would be nice if we could ensure we run this code through miri (if we don't already) 🤔
| fn ensure_capacity(vec: &mut Vec<usize>, total_len: usize) { | ||
| if vec.len() == vec.capacity() { | ||
| let remaining = total_len.saturating_sub(vec.len()); | ||
| vec.reserve(remaining.max(1)); | ||
| } | ||
| } |
There was a problem hiding this comment.
one thing that bugs me about this function is that total_len is always constant, yet this function is called inside the loop; so the only thing changing is the capacity/len of the input vector, and it either reserves up to total_len once then subsequently keeps reserving just 1 (well I don't think this case can actually happen)
| return runs; | ||
| } | ||
|
|
||
| let mut run_boundaries = Vec::with_capacity(len / 64 + 2); |
There was a problem hiding this comment.
I'm seeing this Vec::with_capacity(len / 64 + 2) multiple times, is this just a guesstimate? or is it based on something?
| if array.is_empty() { | ||
| return (Vec::new(), Vec::new()); | ||
| } |
There was a problem hiding this comment.
| if array.is_empty() { | |
| return (Vec::new(), Vec::new()); | |
| } | |
| if let Some(runs) = trivial_runs(array.len()) { | |
| return runs; | |
| } |
hoisting up the call to trivial_runs() out of each helper (e.g. in runs_for_primitive, runs_for_boolean)
| fn runs_for_binary<O: OffsetSizeTrait>(array: &GenericBinaryArray<O>) -> (Vec<usize>, Vec<usize>) { | ||
| let mut to_usize = |v: O| v.as_usize(); | ||
| runs_for_binary_like( | ||
| array.len(), | ||
| array.null_count(), | ||
| array.value_offsets(), | ||
| array.value_data(), | ||
| |idx| array.is_valid(idx), | ||
| &mut to_usize, | ||
| ) | ||
| } | ||
|
|
||
| fn runs_for_binary_like<T: Copy>( | ||
| len: usize, | ||
| null_count: usize, | ||
| offsets: &[T], | ||
| values: &[u8], | ||
| mut is_valid: impl FnMut(usize) -> bool, | ||
| to_usize: &mut impl FnMut(T) -> usize, | ||
| ) -> (Vec<usize>, Vec<usize>) { | ||
| if let Some(runs) = trivial_runs(len) { | ||
| return runs; | ||
| } | ||
|
|
There was a problem hiding this comment.
We can combine the binary+string methods like so:
fn runs_for_bytes<O: ByteArrayType>(array: &GenericByteArray<O>) -> (Vec<usize>, Vec<usize>) {
let len = array.len();
let null_count = array.null_count();
let offsets = array.value_offsets();
let values = array.value_data();
// rest of runs_for_binary_like()- made it generic over
GenericByteArraywhich is both for strings and binary arrays
means we can then use like so, removing need for runs_for_binary() and runs_for_string()
Utf8 => runs_for_bytes(array.as_string::<i32>()),
LargeUtf8 => runs_for_bytes(array.as_string::<i64>()),
Binary => runs_for_bytes(array.as_binary::<i32>()),
LargeBinary => runs_for_bytes(array.as_binary::<i64>()),
Which issue does this PR close?
Rationale for this change
arrow-castcurrently depends onarrow-ordonly to use the genericpartitionkernel when casting arrays toRunEndEncoded. This is more dependency surface than needed for this path: REE encoding only needs to find boundaries between consecutive equal values.This PR removes the
arrow-orddependency fromarrow-castand replaces thepartitioncall with local run-boundary computation tailored to cast-to-REE.What changes are included in this PR?
arrow-ordfromarrow-castdependencies.partition(&[cast_array])withcompute_run_boundariesincast_to_run_end_encoded.ArrayDatafallback for less common types.downcast_primitive_array!to avoid repetitive primitive type dispatch.StringView/BinaryViewwrappers that only called the generic fallback.Are these changes tested?
Yes
Are there any user-facing changes?
No. This is an internal dependency and implementation change.