feat: support array_insert#1073
Conversation
|
As I was able to realize, |
|
@andygrove Sorry for tagging but I have questions about the ticket (
Thanks in advance! That is my first serious attempt to contribute to the project, so sorry If I'm annoying. |
+ fix tests for spark < 3.4
- added test for the negative index - added test for the legacy spark mode
|
Thanks, @SemyonSinchenko. I think it's fine to skip the test for Spark 3.3. I plan on reviewing this PR in more detail tomorrow, but it looks good from an initial read. |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1073 +/- ##
============================================
- Coverage 34.46% 34.27% -0.20%
Complexity 888 888
============================================
Files 113 113
Lines 43580 43355 -225
Branches 9658 9488 -170
============================================
- Hits 15021 14860 -161
- Misses 25507 25596 +89
+ Partials 3052 2899 -153 ☔ View full report in Codecov by Sentry. |
|
This PR is ready for the review. The only failed check failed due to the internal GHA error: GitHub Actions has encountered an internal error when running your job. |
Co-authored-by: Andy Grove <agrove@apache.org>
|
I am not sure if this should be done together with this PR but it would some add "free" tests. Spark introduced with 3.5 array_prepend which it implements with array_insert and starting 4.0 it also implements array_append with array_insert. If you want you can copy the array_append tests and replace array_append with array_prepend for Spark 3.5+ and enable the array_append tests for spark 4.0. You can also ignore this comment if you do not agree or if it leads to unrelated errors. |
- fixes; - tests; - comments in the code;
In one case there is a zero in index and test fails due to spark error
|
Thanks for comments and suggestions! This PR is ready for the review again. What were changed from the last round of the review:
These tests pointed my attention to the uncovered parts of the code that I fixed and I also added couple of additional tests to the At the moment this PR is tested in multiple ways:
So, it looks to me, that all the possible cases are covered and the behavior is the same like in spark. |
Done! |
| let src_element_type = match src_value.data_type() { | ||
| DataType::List(field) => field.data_type(), | ||
| DataType::LargeList(field) => field.data_type(), | ||
| data_type => { | ||
| return Err(DataFusionError::Internal(format!( | ||
| "Unexpected src array type in ArrayInsert: {:?}", | ||
| data_type | ||
| ))) | ||
| } |
There was a problem hiding this comment.
minor nit: this logic for extracting a list type is repeated a few times and could be factored out into a function
There was a problem hiding this comment.
@andygrove Thanks for the suggestion!
I moved a checking of the array type (and the exception logic) to the method:
pub fn array_type(&self, data_type: &DataType) -> DataFusionResult<DataType> {
match data_type {
DataType::List(field) => Ok(DataType::List(Arc::clone(field))),
DataType::LargeList(field) => Ok(DataType::LargeList(Arc::clone(field))),
data_type => {
return Err(DataFusionError::Internal(format!(
"Unexpected src array type in ArrayInsert: {:?}",
data_type
)))
}
}
}It allows at least to avoid returning the same error multiple time. Is it what you suggested? Or should I move this method to a helper function and refactor also GerArrayStructField to use such a function?
P.S. Sorry for the stupid question... But can you please explain to me why we always check both List and LargeList, while Apache Spark only supports i32 indexes for arrays (max length is Integer.MAX_VALUE - 15), which is the case of List to my understanding? All the code in the list.rs might become a bit simpler if we make it non-generic (it also makes implementation of other missing methods like array_zip simpler).
There was a problem hiding this comment.
That's a good question. I wonder if the existing code for LargeList is actually being tested. It would be interesting to try removing it and see if there are any regressions. It makes sense to only handle List if Spark only supports i32 indexes.
There was a problem hiding this comment.
The difference I think is that a LargeList can store more than Integer.MAX_VALUE entries in all rows in a single batch, so if you have multiple Spark rows all with the max num of rows supported, it wouldn't fit into an Arrow List array. That would probably need to be supported elsewhere, but it may be worth keeping the LargeList handling around in case that scenario is supported? And other DataFusion expressions might return a LargeList even if it doesn't come directly from Spark? Does the native Parquet reader ever use a LargeList?
There was a problem hiding this comment.
Thanks for the explanation!
You are right, I will close #1118 then
andygrove
left a comment
There was a problem hiding this comment.
LGTM. Thanks @SemyonSinchenko!
Which issue does this PR close?
Related to #1042
array_insert: SELECT array_insert(array(1, 2, 3, 4), 5, 5)Rationale for this change
As described in #1042
What changes are included in this PR?
QueryPlanSerde.scala: I added an additional case for the array insert;expr.proto: I added a new message for theArrayInsert;planner.rs: I added a case for the array_insert;list.rs:ArrayInsertstruct;PhysicalExpr,DisplayandPartialExprfor it;fn array_insertHow are these changes tested?
At the moment I added a simple tests for
fn array_insertand a test forQueryPlanSerde.