Fix S3 globbing for duckdbfs with httplib#6709
Conversation
bbba2b6 to
8fea279
Compare
Merging this PR will improve performance by 17.64%
Performance Changes
Comparing Footnotes
|
Signed-off-by: Mikhail Kot <mikhail@spiraldb.com>
8fea279 to
ccbe879
Compare
|
@claude review |
|
Claude finished @0ax1's task in 1m 20s —— View job Code Review
The fix is straightforward and correct. The root cause is well-identified: DuckDB's
|
Using vortex_filesystem=duckdb, both of httpfs backends are broken when using globs on S3 files: 1. httplib was broken because duckdb folder list API returns full paths for S3 files, and we assumed relative paths. This PR fixes this behavior 2. curl was broken because of duckdb/duckdb-httpfs#265 in 1.4. Up until upstream fixes this, globbing won't work. This issue is not present in 1.5 anymore Because of this bug, our benchmarks fail to run with curl backend, see vortex-data#6685 Signed-off-by: Mikhail Kot <mikhail@spiraldb.com>
DuckdbFS implementation for Vortex was introduced in #6198 as opt-out, but changed to opt-in in #6564 due to performance regressions. There were multiple issues (#6709, #6565 #6685) associated with it which differ from vortex's file system behaviour. It also requires additional dependencies CI which are a blocker for #8486 since MacOS runner doesn't bundle openssl for x86_64 on arm, and builds fail. As a long term goal, calling duckdb's blocking IO inside our event loop isn't the right abstraction. We want to allow duckdb to use its own IO outside vortex. Duckdb fs is also not maintaned actively so we're removing it Signed-off-by: Mikhail Kot <mikhail@spiraldb.com>
Using vortex_filesystem=duckdb, both of httpfs backends are broken
when using globs on S3 files:
S3 files, and we assumed relative paths. This PR fixes this behavior
in 1.4. Up until upstream fixes this, globbing won't work.
This issue is not present in 1.5 anymore
Because of this bug, our benchmarks fail to run with curl backend, see #6685