Skip to content

feat: Support basic Delta scans#3035

Closed
Kimahriman wants to merge 12 commits into
apache:mainfrom
Kimahriman:delta-scan
Closed

feat: Support basic Delta scans#3035
Kimahriman wants to merge 12 commits into
apache:mainfrom
Kimahriman:delta-scan

Conversation

@Kimahriman

@Kimahriman Kimahriman commented Jan 4, 2026

Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Related to #174, not full support so probably should keep that open (or open new tickets specifically for column mapping and deletion vectors)

Rationale for this change

Delta scans are simply built-in Parquet scans when column mapping and deletion vectors are not used, so the existing Comet parquet scans work fine for them. This adds support for these cases in the scan rule and adds some unit tests that could be used for future integration of delta-rs or delta-kernel-rs for full scan support.

What changes are included in this PR?

Updates the CometScanRule to use Comet parquet scans for Delta V1 scans of tables that don't use column mapping or deletion vectors. Adds a new DeltaReflection class for accessing the Delta properties to determine the enabled features.

How are these changes tested?

Adds unit tests for different Delta scans and adds Delta as a test dependency.

Comment thread native/core/Cargo.toml
reqwest = { version = "0.12", default-features = false, features = ["rustls-tls-native-roots", "http2"] }
object_store_opendal = {version = "0.55.0", optional = true}
hdfs-sys = {version = "0.3", optional = true, features = ["hdfs_3_3"]}
hdfs-sys = {version = "0.3", optional = true, features = ["hdfs_3_3", "vendored"]}

@Kimahriman Kimahriman Jan 4, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With opendal enabled by default in #2929, I was getting failures finding libhdfs.so running unit tests on Linux. The hdrs dependency below for macos was purely activating the vendored feature of hdfs-sys, so I just enabled that globally here so you don't need libhdfs available for developing on Linux either

Comment thread spark/pom.xml
Comment on lines +115 to +120
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>failureaccess</artifactId>
<version>1.0.3</version>
<scope>test</scope>
</dependency>

@Kimahriman Kimahriman Jan 4, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without this I'm getting an error creating the Delta tables:

Cause: java.lang.ClassNotFoundException: com.google.common.util.concurrent.internal.InternalFutureFailureAccess
  at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:641)
  at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
  at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525)
  at java.base/java.lang.ClassLoader.defineClass1(Native Method)
  at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1017)
  at java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150)
  at java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:862)
  at java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760)
  at java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681)
  at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639)
  ...

Not sure why, could use some help figuring out what is missing with Guava from where

@codecov-commenter

codecov-commenter commented Jan 4, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 65.78947% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.60%. Comparing base (f09f8af) to head (aa51b22).
⚠️ Report is 832 commits behind head on main.

Files with missing lines Patch % Lines
...scala/org/apache/comet/delta/DeltaReflection.scala 66.66% 9 Missing ⚠️
...ala/org/apache/spark/sql/comet/CometScanExec.scala 63.63% 1 Missing and 3 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #3035      +/-   ##
============================================
+ Coverage     56.12%   59.60%   +3.47%     
- Complexity      976     1379     +403     
============================================
  Files           119      168      +49     
  Lines         11743    15534    +3791     
  Branches       2251     2579     +328     
============================================
+ Hits           6591     9259    +2668     
- Misses         4012     4975     +963     
- Partials       1140     1300     +160     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@andygrove

Copy link
Copy Markdown
Member

Thanks, @Kimahriman. Please also add content to the documentation (either the user guide or the contributor guide) explaining this new feature.

@Kimahriman

Copy link
Copy Markdown
Contributor Author

Added a snippet to the user guide. There's not much for the contributor guide since this doesn't start the process of trying to use the delta rust libraries. Still trying to figure out if/how to integrate those

@sebbegg

sebbegg commented Mar 2, 2026

Copy link
Copy Markdown

@Kimahriman any plans to continue work on this?

We rely on delta tables and apache comet sounds like it might be very interesting option to speed up our queries without making a big transition off spark.

Comment thread spark/src/main/scala/org/apache/comet/delta/DeltaReflection.scala
Comment thread spark/src/main/scala/org/apache/comet/delta/DeltaReflection.scala Outdated
deltaMetadata.minReaderVersion match {
case 1 => true
case 2 => false
case 3 =>

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add case _ to handle future versions

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reader version 3 switched to named features instead of incremental versions, so there will never be another reader version. I can still handle more gracefully if you would like though

@Kimahriman

Copy link
Copy Markdown
Contributor Author

@Kimahriman any plans to continue work on this?

We rely on delta tables and apache comet sounds like it might be very interesting option to speed up our queries without making a big transition off spark.

This should be good to go. I don't have immediate plans to address the reader features that will require custom integration like DVs and column mapping (though the latter would likely be somewhat straightforward to support)

@github-actions

github-actions Bot commented May 3, 2026

Copy link
Copy Markdown

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants