GH-3573: Fix VariantUtil string decoding to use explicit UTF-8 charset#3576
Conversation
There was a problem hiding this comment.
Pull request overview
This PR fixes Variant string decoding in VariantUtil to always use UTF-8, aligning the read path with the Variant encoding specification and the existing write path behavior. This prevents silent corruption of non-ASCII strings when the JVM default charset is not UTF-8 (e.g., JDK ≤ 17 with LC_ALL=C).
Changes:
- Update
VariantUtil.getString()to decode byte sequences usingStandardCharsets.UTF_8. - Update
VariantUtil.getMetadataKey()andVariantUtil.getMetadataMap()to decode dictionary/metadata strings usingStandardCharsets.UTF_8. - Add the required
StandardCharsetsimport.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
steveloughran
left a comment
There was a problem hiding this comment.
ooh, this is bad!
Looks like org.apache.iceberg.variants.VariantUtil uses UTF_8, but it's worth worrying about the others.
Have you considered adding a test which uses a non-ascii key in a variant and verify it then resolves?
|
Hi @steveloughran, thanks for looking into this PR! I've added 2 tests - one for getMetadataKey (via getFieldByKey call) and one for getMetadataMap (via creating ImmutableMetadata), should have more or less complete coverage now. |
steveloughran
left a comment
There was a problem hiding this comment.
LGTM, needs a review by a committer though
Fokko
left a comment
There was a problem hiding this comment.
Oof, great catch @mikhail-melnik
|
Thanks for the review @wgtmac, @steveloughran and @copilot |
Rationale for this change
VariantUtil.getString(),getMetadataKey(), andgetMetadataMap()usednew String(byte[], ...)without a charset, relying on the JVM platform default. On JDK <= 17 withLC_ALL=C(the build invocation recommended in the README), the platform default becomesUS-ASCII, causing multi-byte UTF-8 sequences in Variant string values and metadata keys to be decoded as multiple characters instead of one.The Variant binary encoding spec mandates UTF-8 for all string data. The write path already uses
StandardCharsets.UTF_8explicitly, this PR aligns the read path to match.What changes are included in this PR?
VariantUtil.java: addStandardCharsets.UTF_8to sixnew String(byte[], ...)call sites acrossgetString(),getMetadataKey(), andgetMetadataMap().Are these changes tested?
Yes, the existing
testParseUnicodeStringtest inTestVariantParseJsoncovers this. It was already written to catch this exact case but only failed when running underLC_ALL=C(i.e. on JDK <= 17 with the README-recommended build invocation). With this fix,LC_ALL=C ./mvnw test -pl parquet-variantpasses cleanly.Are there any user-facing changes?
No API changes. Behavior change only for JVMs running with a non-UTF-8 default charset, where string values containing non-ASCII characters were previously silently corrupted on read.
Closes #3573