Skip to content

GH-3573: Fix VariantUtil string decoding to use explicit UTF-8 charset#3576

Merged
Fokko merged 3 commits into
apache:masterfrom
mikhail-melnik:fix-variant-string-charset
May 27, 2026
Merged

GH-3573: Fix VariantUtil string decoding to use explicit UTF-8 charset#3576
Fokko merged 3 commits into
apache:masterfrom
mikhail-melnik:fix-variant-string-charset

Conversation

@mikhail-melnik
Copy link
Copy Markdown
Contributor

@mikhail-melnik mikhail-melnik commented May 22, 2026

Rationale for this change

VariantUtil.getString(), getMetadataKey(), and getMetadataMap() used new String(byte[], ...) without a charset, relying on the JVM platform default. On JDK <= 17 with LC_ALL=C (the build invocation recommended in the README), the platform default becomes US-ASCII, causing multi-byte UTF-8 sequences in Variant string values and metadata keys to be decoded as multiple characters instead of one.

The Variant binary encoding spec mandates UTF-8 for all string data. The write path already uses StandardCharsets.UTF_8 explicitly, this PR aligns the read path to match.

What changes are included in this PR?

VariantUtil.java: add StandardCharsets.UTF_8 to six new String(byte[], ...) call sites across getString(), getMetadataKey(), and getMetadataMap().

Are these changes tested?

Yes, the existing testParseUnicodeString test in TestVariantParseJson covers this. It was already written to catch this exact case but only failed when running under LC_ALL=C (i.e. on JDK <= 17 with the README-recommended build invocation). With this fix, LC_ALL=C ./mvnw test -pl parquet-variant passes cleanly.

Are there any user-facing changes?

No API changes. Behavior change only for JVMs running with a non-UTF-8 default charset, where string values containing non-ASCII characters were previously silently corrupted on read.

Closes #3573

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes Variant string decoding in VariantUtil to always use UTF-8, aligning the read path with the Variant encoding specification and the existing write path behavior. This prevents silent corruption of non-ASCII strings when the JVM default charset is not UTF-8 (e.g., JDK ≤ 17 with LC_ALL=C).

Changes:

  • Update VariantUtil.getString() to decode byte sequences using StandardCharsets.UTF_8.
  • Update VariantUtil.getMetadataKey() and VariantUtil.getMetadataMap() to decode dictionary/metadata strings using StandardCharsets.UTF_8.
  • Add the required StandardCharsets import.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown
Contributor

@steveloughran steveloughran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ooh, this is bad!

Looks like org.apache.iceberg.variants.VariantUtil uses UTF_8, but it's worth worrying about the others.

Have you considered adding a test which uses a non-ascii key in a variant and verify it then resolves?

@mikhail-melnik
Copy link
Copy Markdown
Contributor Author

mikhail-melnik commented May 26, 2026

Hi @steveloughran, thanks for looking into this PR! I've added 2 tests - one for getMetadataKey (via getFieldByKey call) and one for getMetadataMap (via creating ImmutableMetadata), should have more or less complete coverage now.

Copy link
Copy Markdown
Contributor

@steveloughran steveloughran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, needs a review by a committer though

@Fokko Fokko added this to the 1.18.0 milestone May 27, 2026
Copy link
Copy Markdown
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oof, great catch @mikhail-melnik

@Fokko Fokko merged commit b8f3330 into apache:master May 27, 2026
5 checks passed
@Fokko
Copy link
Copy Markdown
Contributor

Fokko commented May 27, 2026

Thanks for the review @wgtmac, @steveloughran and @copilot

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

VariantUtil decodes string bytes using platform default charset instead of UTF-8

5 participants