feat: Support basic Delta scans#3035
Conversation
| reqwest = { version = "0.12", default-features = false, features = ["rustls-tls-native-roots", "http2"] } | ||
| object_store_opendal = {version = "0.55.0", optional = true} | ||
| hdfs-sys = {version = "0.3", optional = true, features = ["hdfs_3_3"]} | ||
| hdfs-sys = {version = "0.3", optional = true, features = ["hdfs_3_3", "vendored"]} |
There was a problem hiding this comment.
With opendal enabled by default in #2929, I was getting failures finding libhdfs.so running unit tests on Linux. The hdrs dependency below for macos was purely activating the vendored feature of hdfs-sys, so I just enabled that globally here so you don't need libhdfs available for developing on Linux either
| <dependency> | ||
| <groupId>com.google.guava</groupId> | ||
| <artifactId>failureaccess</artifactId> | ||
| <version>1.0.3</version> | ||
| <scope>test</scope> | ||
| </dependency> |
There was a problem hiding this comment.
Without this I'm getting an error creating the Delta tables:
Cause: java.lang.ClassNotFoundException: com.google.common.util.concurrent.internal.InternalFutureFailureAccess
at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:641)
at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525)
at java.base/java.lang.ClassLoader.defineClass1(Native Method)
at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1017)
at java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150)
at java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:862)
at java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760)
at java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681)
at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639)
...
Not sure why, could use some help figuring out what is missing with Guava from where
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #3035 +/- ##
============================================
+ Coverage 56.12% 59.60% +3.47%
- Complexity 976 1379 +403
============================================
Files 119 168 +49
Lines 11743 15534 +3791
Branches 2251 2579 +328
============================================
+ Hits 6591 9259 +2668
- Misses 4012 4975 +963
- Partials 1140 1300 +160 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
Thanks, @Kimahriman. Please also add content to the documentation (either the user guide or the contributor guide) explaining this new feature. |
|
Added a snippet to the user guide. There's not much for the contributor guide since this doesn't start the process of trying to use the delta rust libraries. Still trying to figure out if/how to integrate those |
|
@Kimahriman any plans to continue work on this? We rely on delta tables and apache comet sounds like it might be very interesting option to speed up our queries without making a big transition off spark. |
| deltaMetadata.minReaderVersion match { | ||
| case 1 => true | ||
| case 2 => false | ||
| case 3 => |
There was a problem hiding this comment.
Could you add case _ to handle future versions
There was a problem hiding this comment.
Reader version 3 switched to named features instead of incremental versions, so there will never be another reader version. I can still handle more gracefully if you would like though
This should be good to go. I don't have immediate plans to address the reader features that will require custom integration like DVs and column mapping (though the latter would likely be somewhat straightforward to support) |
|
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days. |
Which issue does this PR close?
Related to #174, not full support so probably should keep that open (or open new tickets specifically for column mapping and deletion vectors)
Rationale for this change
Delta scans are simply built-in Parquet scans when column mapping and deletion vectors are not used, so the existing Comet parquet scans work fine for them. This adds support for these cases in the scan rule and adds some unit tests that could be used for future integration of delta-rs or delta-kernel-rs for full scan support.
What changes are included in this PR?
Updates the
CometScanRuleto use Comet parquet scans for Delta V1 scans of tables that don't use column mapping or deletion vectors. Adds a newDeltaReflectionclass for accessing the Delta properties to determine the enabled features.How are these changes tested?
Adds unit tests for different Delta scans and adds Delta as a test dependency.