feat: Make supported hadoop filesystem schemes configurable#2272
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2272 +/- ##
============================================
+ Coverage 56.12% 58.00% +1.88%
- Complexity 976 1294 +318
============================================
Files 119 147 +28
Lines 11743 13388 +1645
Branches 2251 2377 +126
============================================
+ Hits 6591 7766 +1175
- Misses 4012 4360 +348
- Partials 1140 1262 +122 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
| } | ||
| .map_err(|e| ExecutionError::GeneralError(e.to_string()))?; | ||
| let (object_store, object_store_path): (Box<dyn ObjectStore>, Path) = | ||
| if is_hdfs_scheme(&url, object_store_configs) { |
There was a problem hiding this comment.
There is a little gotcha here when the scheme is s3a. In s3a's case, we replace s3a with s3 so that we can use the native object store implementation (
). If the user has s3a in the list of hdfs urls because they want to use the hadoop-aws implementation, then they will still end up with the native implementation.There was a problem hiding this comment.
Thanks, I made some adjustments, could you take another look?
| // Azure Data Lake Storage Gen2 secure configurations (can use both prefixes) | ||
| "abfss" -> Seq("fs.abfss.", "fs.abfs.")) | ||
|
|
||
| val COMET_LIBHDFS_SCHEMES_KEY = "fs.comet.libhdfs.schemes" |
There was a problem hiding this comment.
Should we make this a Comet conf (i.e add it in CometConf so it is automatically documented)?
There was a problem hiding this comment.
Thanks, moved to CometConf.
| conf(s"spark.hadoop.$COMET_LIBHDFS_SCHEMES_KEY") | ||
| .doc( | ||
| "Defines filesystem schemes (e.g., hdfs, webhdfs) that the native side accesses " + | ||
| "via libhdfs, separated by commas.") |
There was a problem hiding this comment.
nit: perhaps we can mention that this configuration is valid only if comet has been built with the hdfs feature flag enabled.
There was a problem hiding this comment.
Thanks, added this description
Which issue does this PR close?
Closes #2271.
Rationale for this change
Currently we prefer to use jvm-based libhdfs to implement native hdfs reader, which means we can support more hadoop file systems. But currently we hardcode to support only hdfs scheme, I want to make the supported hadoop file system schemes configurable.
What changes are included in this PR?
Make supported hadoop filesystem schemes configurable
How are these changes tested?
After patch #2244, the newly added test cases were successfully run