PARQUET-134 patch - Support file write mode #100
Closed
masokan wants to merge 1 commit into
Member
Please create an enum for this.
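The diff line this comment refers to is not visible here. As a minimal sketch of what the request could look like, assuming the patch selects between failing on an existing file and overwriting it, an enum can replace a boolean or integer flag. The names `WriteModeSketch`, `Mode`, `CREATE`, `OVERWRITE`, and `mayWrite` are illustrative assumptions, not taken from the patch.

```java
/** Hypothetical sketch; names are illustrative, not taken from the patch. */
final class WriteModeSketch {

    /** How a writer should treat an output file that may already exist. */
    enum Mode {
        CREATE,    // refuse to write if the target already exists
        OVERWRITE  // replace the target if it already exists
    }

    /** Returns true if writing is allowed under the given mode. */
    static boolean mayWrite(Mode mode, boolean targetExists) {
        switch (mode) {
            case CREATE:
                return !targetExists;
            case OVERWRITE:
                return true;
            default:
                throw new IllegalArgumentException("Unknown mode: " + mode);
        }
    }
}
```

Compared with a boolean parameter, the enum makes call sites self-describing (`Mode.OVERWRITE` rather than `true`) and leaves room for further modes later.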
Member
Thanks for contributing. I made a comment above. Otherwise this looks good to me.
Contributor
This was merged as #111. Can someone close it?
Member
@masokan could you close this pull request as the change was merged as part of another one?
vinooganesh added a commit to vinooganesh/parquet-java that referenced this pull request May 17, 2026
Adds generateAlpFixturesAtMultipleVectorSizes to TestInterOpReadAlp.

For each of the four source files in parquet-testing PR apache#100 (alp_spotify1, alp_arade, alp_float_spotify1, alp_float_arade), reads every row, then re-encodes as Java ALP at both vectorSize=1024 and vectorSize=4096. Output goes to ALP_OUTPUT_DIR (default ${user.dir}/alp-java-generated/), producing 8 files total named alp_java_<stem>_vs{1024,4096}.parquet.

Each output is verified by reading back through the standard reader path and bit-comparing every value via doubleToRawLongBits / floatToRawIntBits, which catches NaN payload and signed-zero divergence, not just numerical equality.

Skips when ALP_TEST_DATA_DIR isn't set, so it stays inert in CI on machines without the source datasets.

To run:

    git clone --branch alpFloatingPointDataset \
        https://github.com/prtkgaur/parquet-testing.git
    ALP_TEST_DATA_DIR=path/to/parquet-testing/data \
        mvn -pl parquet-hadoop \
        -Dtest=TestInterOpReadAlp#generateAlpFixturesAtMultipleVectorSizes \
        test
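To illustrate why the commit compares raw bits rather than using `==`, here is a minimal sketch. The class name `BitExactSketch` and the helper `sameBits` are assumptions made for illustration; only `Double.doubleToRawLongBits` and `Float.floatToRawIntBits` come from the commit message, and this is not the test's actual code.

```java
/** Hypothetical helper; not the code used in TestInterOpReadAlp. */
final class BitExactSketch {

    /** True only if both doubles have identical 64-bit patterns. */
    static boolean sameBits(double a, double b) {
        return Double.doubleToRawLongBits(a) == Double.doubleToRawLongBits(b);
    }

    /** True only if both floats have identical 32-bit patterns. */
    static boolean sameBits(float a, float b) {
        return Float.floatToRawIntBits(a) == Float.floatToRawIntBits(b);
    }

    public static void main(String[] args) {
        // Numerical equality treats +0.0 and -0.0 as equal; the bit check does not.
        System.out.println(0.0 == -0.0);
        System.out.println(sameBits(0.0, -0.0));

        // NaN never equals itself numerically, but its bit pattern can be compared.
        System.out.println(Double.NaN == Double.NaN);
        System.out.println(sameBits(Double.NaN, Double.NaN));

        // Two NaNs that differ only in payload bits.
        double payloadNaN = Double.longBitsToDouble(0x7ff8000000000001L);
        System.out.println(sameBits(Double.NaN, payloadNaN));
    }
}
```

A comparison based on `==` would accept a decoder that turns -0.0 into 0.0 and would reject every correctly round-tripped NaN, which is the opposite of what a fixture verifier needs.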
vinooganesh added a commit to vinooganesh/parquet-java that referenced this pull request May 18, 2026
Extends generateAlpFixturesAtMultipleVectorSizes to vary writer page
version (PARQUET_1_0, PARQUET_2_0) as a third axis alongside dataset
and ALP vector size. Output grows from 8 → 16 files per run:
alp_java_<stem>_v{1,2}_vs{1024,4096}.parquet
Page version is orthogonal to ALP encoding — the page version
difference lives in the parquet protocol layer, not in the ALP
payload — but covering both axes makes the fixture set fully
symmetric for cross-language compatibility verification. C++/Rust/Go
readers can use the V1 and V2 variants to prove their decoders
handle Java-written ALP regardless of how the surrounding pages are
framed. Avoids an asymmetry where the existing PR apache#100 set has C++
at V1 and Java at V2 with no overlap.
All 16 outputs independently verified against the canonical
_expect.csv truth files from parquet-testing PR apache#100 (1.56M values,
0 mismatches).
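The three-axis naming scheme above can be sketched as a nested loop. This is an illustration under stated assumptions: the class and method names are hypothetical, and the exact `<stem>` strings are not given in the commit message, so the stems in the usage note below are guesses derived from the source file names.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the fixture naming pattern; not the generator's code. */
final class FixtureNamesSketch {

    /** Builds alp_java_<stem>_v{1,2}_vs{1024,4096}.parquet for every stem. */
    static List<String> fixtureNames(List<String> stems) {
        int[] pageVersions = {1, 2};      // PARQUET_1_0 and PARQUET_2_0
        int[] vectorSizes = {1024, 4096}; // ALP vector sizes
        List<String> names = new ArrayList<>();
        for (String stem : stems) {
            for (int version : pageVersions) {
                for (int vectorSize : vectorSizes) {
                    names.add("alp_java_" + stem + "_v" + version
                            + "_vs" + vectorSize + ".parquet");
                }
            }
        }
        return names;
    }
}
```

Calling `fixtureNames(Arrays.asList("spotify1", "arade", "float_spotify1", "float_arade"))` yields 4 × 2 × 2 = 16 names, matching the file count stated in the commit message.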
vinooganesh added a commit to vinooganesh/parquet-java that referenced this pull request May 18, 2026
Two new tests in TestInterOpReadAlp:

readAllFixtureFilesIndependently
    Opens every alp_java_*.parquet in ALP_OUTPUT_DIR and asserts each column chunk declares Encoding.ALP and decodes through the standard reader path without error. Separate from the generator's own round-trip verification so reader correctness surfaces as a distinct signal in CI when the fixtures are present. Skips cleanly when ALP_OUTPUT_DIR is empty so it stays inert in default CI environments.

generateAndVerifyCornerCaseFixture
    Writes a single small fixture file (alp_java_cornercases.parquet, ~60 KB) targeting the corner cases enumerated in parquet-testing issue apache#105: vectors with no exceptions, one exception per vector, all exceptions, NaN/Inf/-0.0, constant values (bit_width=0), multi-vector with differing exponents, and optional columns with nulls. Both f32 and f64 variants: 14 columns × 2048 rows total. Reads each column back and bit-exactly verifies every value against the expected pattern via doubleToRawLongBits / floatToRawIntBits.

The corner-case fixture is intended as a candidate file for parquet-testing PR apache#100 once naming/design is confirmed. Generating it also surfaced (and verified the fix for) a pre-existing reader bug where optional columns with nulls couldn't be decoded; see the preceding commit.
vinooganesh added a commit to vinooganesh/parquet-java that referenced this pull request May 24, 2026
The corner-case fixture (alp_java_cornercases.parquet) is synthetic: it isn't derived from any raw dataset in parquet-testing PR apache#100, so the existing alp_*_expect.csv files don't cover it. That left cross-language verifiers with no independent ground truth to check the parquet file against; they had to either trust the Java reader or duplicate the construction recipe in their own code.

writeCornerCaseCsvTruth now dumps the expected values straight from the construction recipe into alp_java_cornercases_expect.csv next to the parquet, every time the generator runs. The CSV uses the same format conventions as the existing _expect.csv files (comma-separated, header row, no quoting) plus two extensions:

• Empty field = null cell (for optional columns)
• Special values printed via Java's standard toString: "NaN", "Infinity", "-Infinity", "-0.0". These all parse via C++ std::stod / std::stof per the standard (case-insensitive, "inf" and "infinity" both accepted).

The Arrow C++ ALP decoder reads the parquet and compares against this CSV bit-exactly: 27306 non-null cells + 1366 null cells across 14 columns × 2048 rows, 0 mismatches.

This makes the corner-case fixture self-documenting and verifiable by any future cross-language tooling without rerunning the Java generator to discover what the expected values are.
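The two CSV extensions described above (empty field for null, special values via `toString`) can be sketched as follows. This is an illustrative sketch, not the body of writeCornerCaseCsvTruth: the class `CsvTruthSketch` and the methods `formatRow` and `parseRow` are assumed names, and only the double case is shown.

```java
import java.util.List;
import java.util.StringJoiner;

/** Hypothetical sketch of the CSV conventions; not the generator's actual code. */
final class CsvTruthSketch {

    /** Formats one row: null becomes an empty field, values use Double.toString. */
    static String formatRow(List<Double> cells) {
        StringJoiner row = new StringJoiner(",");
        for (Double cell : cells) {
            row.add(cell == null ? "" : Double.toString(cell));
        }
        return row.toString();
    }

    /** Parses one row back: an empty field becomes null. */
    static Double[] parseRow(String line) {
        // Limit -1 keeps trailing empty fields, so a null in the last column survives.
        String[] fields = line.split(",", -1);
        Double[] cells = new Double[fields.length];
        for (int i = 0; i < fields.length; i++) {
            cells[i] = fields[i].isEmpty() ? null : Double.parseDouble(fields[i]);
        }
        return cells;
    }
}
```

For example, a row holding 1.5, null, NaN, negative infinity, and -0.0 is formatted as `1.5,,NaN,-Infinity,-0.0`, and parsing that line restores the sign bit of -0.0. One limitation of this sketch is that `Double.toString` prints every NaN as "NaN", so a text format like this cannot distinguish NaN payloads.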
No description provided.