Skip to content

API, Parquet: Map geometry and geography to Parquet logical types#16765

Open
huan233usc wants to merge 19 commits into
apache:mainfrom
huan233usc:parquet-geo-schema
Open

API, Parquet: Map geometry and geography to Parquet logical types#16765
huan233usc wants to merge 19 commits into
apache:mainfrom
huan233usc:parquet-geo-schema

Conversation

@huan233usc

@huan233usc huan233usc commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Summary

Map Iceberg geometry and geography primitive types to and from Parquet's geometry / geography logical type annotations on a BINARY column, so the geo types survive a schema round-trip through ParquetSchemaUtil.convert in both directions.

  • TypeToMessageType: emit geometry / geography as BINARY annotated with LogicalTypeAnnotation.geometryType / geographyType, passing the resolved CRS and edge-interpolation algorithm through directly. Iceberg and Parquet use the same algorithm names (SPHERICAL, VINCENTY, THOMAS, ANDOYER, KARNEY), so the algorithm is mapped by name.
  • MessageTypeToType: read those annotations back into Types.GeometryType / Types.GeographyType. An unset crs / algorithm maps to the Iceberg default (the Parquet defaults are the same: OGC:CRS84 / SPHERICAL), and algorithm names are resolved with EdgeAlgorithm.fromName, the same conversion used when parsing geography type strings (Types.fromPrimitiveString).
  • Types.GeometryType / Types.GeographyType: serialize with explicit CRS / edge algorithm defaults (geometry(OGC:CRS84), geography(OGC:CRS84, spherical)) so table metadata preserves the concrete defaults being used, while equals / hashCode compare resolved defaults so default-equivalent forms remain semantically equal. No public signature changes.

This is the first step of plumbing the geo value path through Parquet. It is intentionally schema mapping only — the generic value read/write path (BaseParquetReaders / BaseParquetWriter) and the ParquetMetrics guard for geo columns are separate follow-ups, so this PR stays small and easy to review. It is purely additive: no behavior changes for non-geo types.

This is 1/N for #16650

Test plan

  • TestParquetSchemaUtil#testGeospatialTypeRoundTrip round-trips a schema with default-CRS geometry, an explicit-CRS geometry, default geography, and a geography per edge algorithm (all five, so the by-name mapping is exercised for every constant in both directions) through ParquetSchemaUtil.convert.
  • TestParquetSchemaUtil#testGeospatialAnnotationsWithOmittedParameters reads hand-built MessageTypes with unset / explicit / explicit-default CRS and algorithm — covering files written by engines that omit defaults — and confirms each maps to the expected Iceberg type and serializes with explicit defaults.
  • TestTypes#testGeospatialTypeDefaultNormalization covers equals() / hashCode() parity for the default-CRS and default-algorithm geography forms, that algorithm() still reports SPHERICAL, and that a non-default algorithm stays distinct; testGeospatialTypeToString covers explicit-default rendering.
  • ./gradlew :iceberg-api:check :iceberg-parquet:check — clean (tests, checkstyle, revapi, spotless).
  • ./gradlew :iceberg-core:test — full core suite green; no regressions in TestSchemaParser / TestSingleValueParser / TestGeospatialTable or anywhere geography types are serialized.

@huan233usc huan233usc marked this pull request as draft June 11, 2026 02:38
@huan233usc huan233usc force-pushed the parquet-geo-schema branch from e0c3e18 to f724354 Compare June 11, 2026 05:26
@github-actions github-actions Bot added the API label Jun 11, 2026
@huan233usc huan233usc changed the title Parquet: Map geometry and geography to Parquet logical types API, Parquet: Map geometry and geography to Parquet logical types Jun 11, 2026
@huan233usc huan233usc force-pushed the parquet-geo-schema branch from f724354 to f58e12b Compare June 11, 2026 05:36
@huan233usc huan233usc marked this pull request as ready for review June 11, 2026 05:39
Map Iceberg geometry and geography to and from Parquet's geometry /
geography logical type annotations on a BINARY column, passing the
resolved CRS and edge algorithm through directly (Iceberg and Parquet
use the same algorithm names; the read side resolves names with
EdgeAlgorithm.fromName, the same conversion used when parsing
geography type strings, and maps unset annotation parameters to the
Iceberg defaults).

To make the plain geography type round-trip through writers that omit
default parameters (an unset Parquet crs / algorithm defaults to
OGC:CRS84 / SPHERICAL), GeographyType now treats an explicit default
algorithm as equal to an omitted one: equals, hashCode, and toString
use the resolved getters crs() / algorithm() instead of the raw
nullable fields, matching how the CRS already resolves through its
getter.

Schema mapping only; the value read/write path and metrics handling
are follow-ups.

Co-authored-by: Isaac
@huan233usc huan233usc force-pushed the parquet-geo-schema branch from f58e12b to 57036d7 Compare June 11, 2026 05:53
@huan233usc

Copy link
Copy Markdown
Contributor Author

Hi @szehon-ho , can you help reviewing when you have a chance? Thank you very much!

@szehon-ho

Copy link
Copy Markdown
Member

A couple of clarifying notes (not blocking — just flagging for discussion):

1. Write path isn't blocked, only the read path.

Now that TypeToMessageType emits geometry/geography, ParquetSchemaUtil.convert succeeds where it previously threw, so the failure point moves into the value path and the two sides behave asymmetrically:

  • Read: BaseParquetReaders.primitive runs the logical-type visitor and .orElseThrow(...), so a geo column fails fast with UnsupportedOperationException.
  • Write: BaseParquetWriter.primitive runs its logical-type visitor, but when it returns empty it falls through to case BINARY -> ParquetValueWriters.byteBuffers(desc) rather than throwing. So a geo write no longer fails at writer setup — it silently degrades to the generic binary writer.

Since the value path is an explicit follow-up, it'd be safer to also reject GEOMETRY/GEOGRAPHY on the write side here, so we don't produce files the reader then refuses to read.

2. Serialization effect of the toString change (default algorithm dropped).

For reference, here's what gets written into table metadata before/after this PR. CRS is unchanged (the constructor already collapses the default to the bare form); only the algorithm changes.

CRS — no change:

User action Before After
Sets default explicitly (OGC:CRS84) "geography" "geography"
Doesn't set it "geography" "geography"

Algorithm — changed:

User action Before After
Sets default explicitly (spherical) "geography(OGC:CRS84, spherical)" "geography"
Doesn't set it "geography" "geography"

So explicitly setting spherical is no longer persisted — it now serializes identically to not setting it. Semantically fine (the spec defaults an unspecified algorithm to spherical), but re-serializing an existing table that stored "...,spherical)" will produce the shorter string (same type, different bytes). Worth calling out in the PR description.

Xin Huang added 2 commits June 17, 2026 23:13
Keep explicitly supplied default CRS and geography edge algorithm in type string serialization while continuing to compare geospatial types using resolved defaults.
Serialize geospatial types with explicit default CRS and edge algorithm while preserving semantic equality for default-equivalent forms.
@huan233usc

Copy link
Copy Markdown
Contributor Author

Updated again in ca4e1a689 to make the behavior fully explicit by default:

  • GeometryType.crs84() now serializes as geometry(OGC:CRS84)
  • GeographyType.crs84() now serializes as geography(OGC:CRS84, spherical)
  • bare input like geometry / geography is still accepted and resolves to those defaults
  • equals / hashCode still compare resolved defaults, so default-equivalent forms remain equal

This keeps Iceberg table metadata aligned with the concrete defaults written into Parquet annotations and avoids depending on future default interpretation.

Match Spotless formatting for the explicit geography default assertion.
Xin Huang added 5 commits June 23, 2026 20:04
The unsupported-transform messages now render geometry/geography types
with their explicit default CRS and edge algorithm, matching the
explicit-default serialization in this PR.
Replace the magic "geometry" / "geography" string literals with a private
NAME constant on each geo type, used by both the type-name lookup map and
toString(), now that toString() no longer renders the bare keyword.
Compare resolved geo getters with Objects.equals for null-safety and
consistency with the Objects.hash-based hashCode.
Fail fast when the generic Parquet writer encounters a geometry or
geography column instead of silently degrading to the binary writer; the
geospatial value path is a separate follow-up. Also drop the redundant
default fallbacks when reading geo annotations, since GeometryType.of /
GeographyType.of already resolve a null CRS / algorithm to the default.
Assert the generic Parquet data writer fails fast for geometry and
geography columns instead of silently degrading to the binary writer.

@szehon-ho szehon-ho left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One small consistency follow-up on the geo toString() changes (non-blocking). Overall the schema mapping and write-rejection look good.

Comment thread api/src/main/java/org/apache/iceberg/types/Types.java Outdated
Comment thread api/src/main/java/org/apache/iceberg/types/Types.java Outdated
@huan233usc

Copy link
Copy Markdown
Contributor Author

Thanks @szehon-ho! Applied both suggestions in 9432405GeometryType.toString() now formats from crs() and GeographyType.toString() from crs() / algorithm(), so toString() stays consistent with equals/hashCode and emits canonical CRS casing. The divergence you flagged (geometry(ogc:crs84) was .equals() to crs84() but serialized non-canonically) is fixed, and since SchemaParser writes toString() into metadata, the canonical form is now what gets persisted. Added assertions in testGeospatialTypeToString covering the casing case.

@huan233usc huan233usc requested a review from szehon-ho June 25, 2026 03:17
private GeometryType(String crs) {
Preconditions.checkArgument(crs == null || !crs.isEmpty(), "Invalid CRS: (empty string)");
this.crs = DEFAULT_CRS.equalsIgnoreCase(crs) ? null : crs;
this.crs = crs != null ? crs : DEFAULT_CRS;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the default-CRS match is case-insensitive (DEFAULT_CRS.equalsIgnoreCase(...)), a default column written by another engine/version in any casing — geometry, geometry(OGC:CRS84), or geometry(ogc:crs84) — all resolve to the same default-equal type. That's good behavior and worth keeping.

Minor suggestion: instead of resolving in crs() and routing equals/hashCode/toString through the getter, could we canonicalize once in the constructor and keep the stored field canonical?

private GeometryType(String crs) {
  Preconditions.checkArgument(crs == null || !crs.isEmpty(), "Invalid CRS: (empty string)");
  this.crs = crs == null || DEFAULT_CRS.equalsIgnoreCase(crs) ? DEFAULT_CRS : crs;
}

Then crs() is just return crs;, equals/hashCode compare the field directly, and toString becomes String.format("%s(%s)", NAME, crs) — no getter indirection and the explanatory comments drop out. Same idea for GeographyType (you already resolve the algorithm default in the constructor). Observable behavior of crs(), equals, hashCode, and toString is identical to the current approach for every input (null, OGC:CRS84, ogc:crs84, non-default), so the added tests still pass.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call — done in 6ec2544. Both GeometryType and GeographyType now canonicalize the default CRS (omitted or any casing) in the constructor, so crs() is a plain field getter and equals/hashCode/toString compare/format the stored field directly. The getter indirection and the explanatory comments are gone, and TestTypes (incl. the ogc:crs84 -> OGC:CRS84 cases) still passes since the observable behavior is identical.

Preconditions.checkArgument(crs == null || !crs.isEmpty(), "Invalid CRS: (empty string)");
this.crs = DEFAULT_CRS.equalsIgnoreCase(crs) ? null : crs;
// canonicalize the default CRS (any casing or omitted) so the stored field is comparable
this.crs = crs == null || DEFAULT_CRS.equalsIgnoreCase(crs) ? DEFAULT_CRS : crs;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

discussed offline, lets clean this part up :) and also make sure we are case preserving if the user provides some default that is not upper case

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done now we just has this.crs = crs, the pervious impl for normalizing is gone

Store the CRS and edge algorithm exactly as provided (null meaning the default) and resolve the default only in crs()/algorithm(), with equals/hashCode/toString comparing the resolved values. A non-canonical default casing such as ogc:crs84 is now preserved verbatim and treated as a distinct CRS rather than folded to the canonical OGC:CRS84.
@huan233usc huan233usc requested a review from szehon-ho June 25, 2026 19:54
Xin Huang and others added 7 commits June 25, 2026 12:57
crs84() now flows through the same pass-through constructor instead of duplicating the default field assignments.
Canonicalize only an omitted CRS/edge algorithm to the default in the constructor; a provided CRS keeps its original casing. equals compares the CRS with equalsIgnoreCase and hashCode hashes the upper-cased CRS so equal types stay hash-consistent. Any casing of a CRS (e.g. ogc:crs84) is therefore equal to the canonical default but still serializes with the casing the user provided.
Add Javadoc on GeometryType#crs() and GeographyType#crs() noting that CRS values are compared case-insensitively while each value keeps its original casing in toString() and serialized metadata.
Co-authored-by: Szehon Ho <szehon.apache@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants