PARQUET-134 patch - Support file write mode by masokan · Pull Request #100 · apache/parquet-java

masokan · 2015-01-07T17:18:41Z

No description provided.

julienledem · 2015-01-30T00:39:35Z

Please create an enum for this.

julienledem · 2015-01-30T00:40:11Z

Thanks for contributing. I made a comment above. Otherwise this looks good to me.

rdblue · 2015-03-05T02:26:43Z

This was merged as #111. Can someone close it?

julienledem · 2015-03-09T17:45:50Z

@masokan could you close this pull request as the change was merged as part of another one?
Thank you.

Adds generateAlpFixturesAtMultipleVectorSizes to TestInterOpReadAlp.

For each of the four source files in parquet-testing PR apache#100 (alp_spotify1, alp_arade, alp_float_spotify1, alp_float_arade), reads every row, then re-encodes as Java ALP at both vectorSize=1024 and vectorSize=4096. Output goes to ALP_OUTPUT_DIR (default ${user.dir}/alp-java-generated/), producing 8 files total named alp_java_<stem>_vs{1024,4096}.parquet.

Each output is verified by reading back through the standard reader path and bit-comparing every value via doubleToRawLongBits / floatToRawIntBits, which catches NaN payload and signed-zero divergence, not just numerical equality.

Skips when ALP_TEST_DATA_DIR isn't set, so it stays inert in CI on machines without the source datasets.

To run:

    git clone --branch alpFloatingPointDataset \
        https://github.com/prtkgaur/parquet-testing.git
    ALP_TEST_DATA_DIR=path/to/parquet-testing/data \
        mvn -pl parquet-hadoop \
        -Dtest=TestInterOpReadAlp#generateAlpFixturesAtMultipleVectorSizes \
        test
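The bit-exact check described above can be illustrated with a minimal sketch. The class name BitExactCompare and its helper sameBits are hypothetical and are not taken from TestInterOpReadAlp; only Double.doubleToRawLongBits and Float.floatToRawIntBits come from the commit message. The sketch shows why raw-bit comparison is stricter than ==: it separates 0.0 from -0.0 and distinguishes NaNs with different payloads.

```java
public final class BitExactCompare {
    private BitExactCompare() {}

    /** Hypothetical helper: true only when both doubles share the same 64-bit pattern. */
    public static boolean sameBits(double expected, double actual) {
        return Double.doubleToRawLongBits(expected) == Double.doubleToRawLongBits(actual);
    }

    /** Hypothetical helper: true only when both floats share the same 32-bit pattern. */
    public static boolean sameBits(float expected, float actual) {
        return Float.floatToRawIntBits(expected) == Float.floatToRawIntBits(actual);
    }

    public static void main(String[] args) {
        // Signed zero: numerically equal, but the sign bit differs.
        System.out.println(0.0 == -0.0);          // true
        System.out.println(sameBits(0.0, -0.0));  // false

        // Two quiet NaNs with different payload bits.
        double nanA = Double.longBitsToDouble(0x7ff8000000000001L);
        double nanB = Double.longBitsToDouble(0x7ff8000000000002L);
        System.out.println(nanA == nanB);                     // false: NaN never equals anything
        System.out.println(Double.compare(nanA, nanB) == 0);  // true: compare() canonicalizes NaN
        System.out.println(sameBits(nanA, nanB));             // false: payloads differ
        System.out.println(sameBits(nanA, nanA));             // true
    }
}
```

Neither == nor Double.compare can express "same NaN payload", which is presumably why the commit uses the raw-bits accessors instead.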

Extends generateAlpFixturesAtMultipleVectorSizes to vary writer page version (PARQUET_1_0, PARQUET_2_0) as a third axis alongside dataset and ALP vector size. Output grows from 8 → 16 files per run:

    alp_java_<stem>_v{1,2}_vs{1024,4096}.parquet

Page version is orthogonal to ALP encoding (the page version difference lives in the parquet protocol layer, not in the ALP payload), but covering both axes makes the fixture set fully symmetric for cross-language compatibility verification. C++/Rust/Go readers can use the V1 and V2 variants to prove their decoders handle Java-written ALP regardless of how the surrounding pages are framed. Avoids an asymmetry where the existing PR apache#100 set has C++ at V1 and Java at V2 with no overlap.

All 16 outputs independently verified against the canonical _expect.csv truth files from parquet-testing PR apache#100 (1.56M values, 0 mismatches).
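As a rough illustration of the three-axis matrix above, the sketch below enumerates the 16 output names from the stated pattern. The class AlpFixtureMatrix is hypothetical, and the stem values are an assumption: the commit message does not say how <stem> is derived, so the source file names are used here with their "alp_" prefix dropped.

```java
import java.util.ArrayList;
import java.util.List;

public final class AlpFixtureMatrix {
    private AlpFixtureMatrix() {}

    // Assumption: <stem> is the source file name without its "alp_" prefix.
    static final String[] STEMS = {"spotify1", "arade", "float_spotify1", "float_arade"};
    // Page version tag: 1 for PARQUET_1_0, 2 for PARQUET_2_0.
    static final int[] PAGE_VERSIONS = {1, 2};
    static final int[] VECTOR_SIZES = {1024, 4096};

    /** Builds every name of the form alp_java_<stem>_v{1,2}_vs{1024,4096}.parquet. */
    public static List<String> fileNames() {
        List<String> names = new ArrayList<>();
        for (String stem : STEMS) {
            for (int pageVersion : PAGE_VERSIONS) {
                for (int vectorSize : VECTOR_SIZES) {
                    names.add("alp_java_" + stem + "_v" + pageVersion
                            + "_vs" + vectorSize + ".parquet");
                }
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // 4 datasets x 2 page versions x 2 vector sizes = 16 names.
        fileNames().forEach(System.out::println);
    }
}
```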

Two new tests in TestInterOpReadAlp:

readAllFixtureFilesIndependently
    Opens every alp_java_*.parquet in ALP_OUTPUT_DIR and asserts each column chunk declares Encoding.ALP and decodes through the standard reader path without error. Separate from the generator's own round-trip verification so reader correctness surfaces as a distinct signal in CI when the fixtures are present. Skips cleanly when ALP_OUTPUT_DIR is empty so it stays inert in default CI environments.

generateAndVerifyCornerCaseFixture
    Writes a single small fixture file (alp_java_cornercases.parquet, ~60 KB) targeting the corner cases enumerated in parquet-testing issue apache#105: vectors with no exceptions, one exception per vector, all exceptions, NaN/Inf/-0.0, constant values (bit_width=0), multi-vector with differing exponents, and optional columns with nulls. Both f32 and f64 variants: 14 columns × 2048 rows total. Reads each column back and bit-exactly verifies every value against the expected pattern via doubleToRawLongBits / floatToRawIntBits.

The corner-case fixture is intended as a candidate file for parquet-testing PR apache#100 once naming/design is confirmed. Generating it also surfaced (and verified the fix for) a pre-existing reader bug where optional columns with nulls couldn't be decoded; see the preceding commit.

The corner-case fixture (alp_java_cornercases.parquet) is synthetic: it isn't derived from any raw dataset in parquet-testing PR apache#100, so the existing alp_*_expect.csv files don't cover it. That left cross-language verifiers with no independent ground truth to check the parquet file against; they had to either trust the Java reader or duplicate the construction recipe in their own code.

writeCornerCaseCsvTruth now dumps the expected values straight from the construction recipe into alp_java_cornercases_expect.csv next to the parquet, every time the generator runs.

The CSV uses the same format conventions as the existing _expect.csv files (comma-separated, header row, no quoting) plus two extensions:

  • Empty field = null cell (for optional columns)
  • Special values printed via Java's standard toString: "NaN", "Infinity", "-Infinity", "-0.0". These all parse via C++ std::stod / std::stof per the standard (case-insensitive, "inf" and "infinity" both accepted).

The Arrow C++ ALP decoder reads the parquet and compares against this CSV bit-exactly: 27306 non-null cells + 1366 null cells across 14 columns × 2048 rows, 0 mismatches.

This makes the corner-case fixture self-documenting and verifiable by any future cross-language tooling without rerunning the Java generator to discover what the expected values are.
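The two CSV extensions described above can be sketched in a few lines of Java. The class ExpectCsvCells and its methods are hypothetical and do not reproduce writeCornerCaseCsvTruth; the sketch only assumes the stated conventions (comma-separated, no quoting, empty field for null, Double.toString for values).

```java
import java.util.ArrayList;
import java.util.List;

public final class ExpectCsvCells {
    private ExpectCsvCells() {}

    /** Null becomes an empty field; everything else uses Double.toString. */
    public static String formatCell(Double value) {
        return value == null ? "" : Double.toString(value);
    }

    /** Empty field becomes null; Double.parseDouble accepts "NaN", "Infinity", "-Infinity", "-0.0". */
    public static Double parseCell(String cell) {
        return cell.isEmpty() ? null : Double.valueOf(Double.parseDouble(cell));
    }

    public static String formatRow(List<Double> row) {
        List<String> cells = new ArrayList<>();
        for (Double value : row) {
            cells.add(formatCell(value));
        }
        return String.join(",", cells);
    }

    public static List<Double> parseRow(String line) {
        List<Double> row = new ArrayList<>();
        // Limit -1 keeps trailing empty fields, so a trailing null cell is not dropped.
        for (String cell : line.split(",", -1)) {
            row.add(parseCell(cell));
        }
        return row;
    }

    public static void main(String[] args) {
        List<Double> row = new ArrayList<>();
        row.add(1.5);
        row.add(null);
        row.add(-0.0);
        row.add(Double.NaN);
        System.out.println(formatRow(row)); // 1.5,,-0.0,NaN
    }
}
```

One limit of this text form: Double.toString prints every NaN as "NaN", so a CSV written this way preserves the sign of zero but cannot distinguish NaN payloads.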

PARQUET-134 patch - Support file write mode

6dc36c0

julienledem reviewed Jan 30, 2015
Comment thread parquet-hadoop/src/main/java/parquet/hadoop/ParquetFileWriter.java

julienledem Jan 30, 2015

Please create an enum for this.

julienledem mentioned this pull request Jan 30, 2015

https://github.com/apache/incubator-parquet-mr/pull/134 #87

Closed

asfgit closed this in 8f898da Jul 13, 2015

vinooganesh mentioned this pull request May 26, 2026

Parquet Java ALP Implementation #3397

Open

masokan wants to merge 1 commit into apache:master from masokan:master
