GH-3573: Fix VariantUtil string decoding to use explicit UTF-8 charset by mikhail-melnik · Pull Request #3576 · apache/parquet-java

mikhail-melnik · 2026-05-22T11:26:10Z

Rationale for this change

VariantUtil.getString(), getMetadataKey(), and getMetadataMap() used new String(byte[], ...) without a charset, relying on the JVM platform default. On JDK <= 17 with LC_ALL=C (the build invocation recommended in the README), the platform default becomes US-ASCII, so each byte of a multi-byte UTF-8 sequence in a Variant string value or metadata key is decoded as a separate replacement character instead of the bytes being decoded together as one character.
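
To make the failure mode concrete, here is a minimal sketch that decodes the same UTF-8 bytes twice: once with US-ASCII (standing in for the platform default under LC_ALL=C on older JDKs) and once with UTF-8. The class and method names are illustrative assumptions and are not part of parquet-java.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative only: CharsetDecodeDemo is a hypothetical class, not parquet-java code.
final class CharsetDecodeDemo {

  // Decodes the bytes with an explicitly supplied charset.
  static String decode(byte[] bytes, Charset charset) {
    return new String(bytes, charset);
  }

  public static void main(String[] args) {
    // "café" encoded as UTF-8: U+00E9 occupies two bytes, 0xC3 0xA9.
    byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

    String correct = decode(utf8, StandardCharsets.UTF_8);
    // US-ASCII cannot map bytes >= 0x80, so each such byte becomes U+FFFD.
    String corrupted = decode(utf8, StandardCharsets.US_ASCII);

    System.out.println(correct.length());          // 4
    System.out.println(corrupted.length());        // 5
    System.out.println(correct.equals(corrupted)); // false
  }
}
```

No exception is thrown in the US-ASCII case, which is why the corruption goes unnoticed until the decoded value is compared or displayed.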

The Variant binary encoding spec mandates UTF-8 for all string data. The write path already uses StandardCharsets.UTF_8 explicitly; this PR aligns the read path to match.

What changes are included in this PR?

VariantUtil.java: add StandardCharsets.UTF_8 to six new String(byte[], ...) call sites across getString(), getMetadataKey(), and getMetadataMap().
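
The shape of the change at each call site can be sketched as follows. This is not the actual VariantUtil code: readString and its parameters are hypothetical, and the only thing taken from the PR description is the addition of StandardCharsets.UTF_8 as the charset argument.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper illustrating the call-site pattern; not copied from VariantUtil.
final class ExplicitCharsetRead {

  // Before (platform dependent): new String(buf, pos, length)
  // After (always UTF-8, regardless of locale or JDK version):
  static String readString(byte[] buf, int pos, int length) {
    return new String(buf, pos, length, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    byte[] payload = "k\u00fc".getBytes(StandardCharsets.UTF_8); // 'ü' takes two bytes
    byte[] buf = new byte[payload.length + 2];
    System.arraycopy(payload, 0, buf, 1, payload.length); // string starts at offset 1

    System.out.println(readString(buf, 1, payload.length).equals("k\u00fc")); // true
  }
}
```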

Are these changes tested?

Yes, the existing testParseUnicodeString test in TestVariantParseJson covers this. It was already written to catch this exact case but only failed when running under LC_ALL=C (i.e. on JDK <= 17 with the README-recommended build invocation). With this fix, LC_ALL=C ./mvnw test -pl parquet-variant passes cleanly.

Are there any user-facing changes?

No API changes. Behavior change only for JVMs running with a non-UTF-8 default charset, where string values containing non-ASCII characters were previously silently corrupted on read.
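
One way to tell whether a given JVM falls into the affected group is to inspect Charset.defaultCharset(). The sketch below is an assumption about how such a check might look; the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical diagnostic; not part of parquet-java.
final class DefaultCharsetCheck {

  // True when charset-less decoding such as new String(byte[]) would not use UTF-8.
  static boolean decodesImplicitlyAsNonUtf8(Charset platformDefault) {
    return !StandardCharsets.UTF_8.equals(platformDefault);
  }

  public static void main(String[] args) {
    Charset platformDefault = Charset.defaultCharset();
    System.out.println(
        platformDefault + " affected=" + decodesImplicitlyAsNonUtf8(platformDefault));
  }
}
```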

Closes #3573

Copilot

Pull request overview

This PR fixes Variant string decoding in VariantUtil to always use UTF-8, aligning the read path with the Variant encoding specification and the existing write path behavior. This prevents silent corruption of non-ASCII strings when the JVM default charset is not UTF-8 (e.g., JDK ≤ 17 with LC_ALL=C).

Changes:

Update VariantUtil.getString() to decode byte sequences using StandardCharsets.UTF_8.
Update VariantUtil.getMetadataKey() and VariantUtil.getMetadataMap() to decode dictionary/metadata strings using StandardCharsets.UTF_8.
Add the required StandardCharsets import.

steveloughran

ooh, this is bad!

Looks like org.apache.iceberg.variants.VariantUtil uses UTF_8, but it's worth worrying about the others.

Have you considered adding a test that uses a non-ASCII key in a variant and verifies that it then resolves?

mikhail-melnik · 2026-05-26T23:29:04Z

Hi @steveloughran, thanks for looking into this PR! I've added two tests: one for getMetadataKey (via a getFieldByKey call) and one for getMetadataMap (via creating an ImmutableMetadata), so coverage should be more or less complete now.
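
Why a non-ASCII key is the right probe can be shown without the Variant API. The sketch below uses a hypothetical list of UTF-8 encoded keys as a stand-in for a metadata dictionary; it is not the test added in this PR and none of its names come from parquet-java.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a key dictionary; it does not use the parquet-java Variant API.
final class NonAsciiKeyLookup {

  // Decodes each stored key with readCharset and returns the index of the match, or -1.
  static int findKey(List<byte[]> encodedKeys, String wanted, Charset readCharset) {
    for (int i = 0; i < encodedKeys.size(); i++) {
      if (new String(encodedKeys.get(i), readCharset).equals(wanted)) {
        return i;
      }
    }
    return -1;
  }

  public static void main(String[] args) {
    String cyrillicKey = "\u043a\u043b\u044e\u0447"; // Cyrillic word for "key"
    List<byte[]> keys = new ArrayList<>();
    keys.add("id".getBytes(StandardCharsets.UTF_8));
    keys.add(cyrillicKey.getBytes(StandardCharsets.UTF_8));

    System.out.println(findKey(keys, cyrillicKey, StandardCharsets.UTF_8));   // 1
    System.out.println(findKey(keys, cyrillicKey, StandardCharsets.US_ASCII)); // -1
    System.out.println(findKey(keys, "id", StandardCharsets.US_ASCII));        // 0
  }
}
```

The ASCII-only key resolves under either charset, so a test restricted to ASCII keys would pass even with the wrong decoder.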

steveloughran

LGTM, needs a review by a committer though

Fokko

Oof, great catch @mikhail-melnik

Fokko · 2026-05-27T19:11:45Z

Thanks for the review @wgtmac, @steveloughran and @copilot

Fix VariantUtil string decoding to use explicit UTF-8 charset

8d00e4f

wgtmac requested a review from Copilot May 23, 2026 03:21

Copilot started reviewing on behalf of wgtmac May 23, 2026 03:21

Copilot AI reviewed May 23, 2026

Apply Spotless formatting

dd1b3da

steveloughran reviewed May 26, 2026

Add tests for non-ASCII string values, object keys and metadata map

e528cf6

wgtmac approved these changes May 27, 2026

steveloughran approved these changes May 27, 2026

Fokko added this to the 1.18.0 milestone May 27, 2026

Fokko approved these changes May 27, 2026

Fokko merged commit b8f3330 into apache:master May 27, 2026
5 checks passed

Fokko merged 3 commits into apache:master from mikhail-melnik:fix-variant-string-charset

5 participants
