API, Parquet: Map geometry and geography to Parquet logical types#16765
API, Parquet: Map geometry and geography to Parquet logical types#16765huan233usc wants to merge 19 commits into
Conversation
e0c3e18 to
f724354
Compare
f724354 to
f58e12b
Compare
Map Iceberg geometry and geography to and from Parquet's geometry / geography logical type annotations on a BINARY column, passing the resolved CRS and edge algorithm through directly (Iceberg and Parquet use the same algorithm names; the read side resolves names with EdgeAlgorithm.fromName, the same conversion used when parsing geography type strings, and maps unset annotation parameters to the Iceberg defaults). To make the plain geography type round-trip through writers that omit default parameters (an unset Parquet crs / algorithm defaults to OGC:CRS84 / SPHERICAL), GeographyType now treats an explicit default algorithm as equal to an omitted one: equals, hashCode, and toString use the resolved getters crs() / algorithm() instead of the raw nullable fields, matching how the CRS already resolves through its getter. Schema mapping only; the value read/write path and metrics handling are follow-ups. Co-authored-by: Isaac
f58e12b to
57036d7
Compare
|
Hi @szehon-ho , can you help reviewing when you have a chance? Thank you very much! |
|
A couple of clarifying notes (not blocking — just flagging for discussion): 1. Write path isn't blocked, only the read path. Now that
Since the value path is an explicit follow-up, it'd be safer to also reject 2. Serialization effect of the For reference, here's what gets written into table metadata before/after this PR. CRS is unchanged (the constructor already collapses the default to the bare form); only the algorithm changes. CRS — no change:
Algorithm — changed:
So explicitly setting |
Keep explicitly supplied default CRS and geography edge algorithm in type string serialization while continuing to compare geospatial types using resolved defaults.
Serialize geospatial types with explicit default CRS and edge algorithm while preserving semantic equality for default-equivalent forms.
|
Updated again in
This keeps Iceberg table metadata aligned with the concrete defaults written into Parquet annotations and avoids depending on future default interpretation. |
Match Spotless formatting for the explicit geography default assertion.
The unsupported-transform messages now render geometry/geography types with their explicit default CRS and edge algorithm, matching the explicit-default serialization in this PR.
Replace the magic "geometry" / "geography" string literals with a private NAME constant on each geo type, used by both the type-name lookup map and toString(), now that toString() no longer renders the bare keyword.
Compare resolved geo getters with Objects.equals for null-safety and consistency with the Objects.hash-based hashCode.
Fail fast when the generic Parquet writer encounters a geometry or geography column instead of silently degrading to the binary writer; the geospatial value path is a separate follow-up. Also drop the redundant default fallbacks when reading geo annotations, since GeometryType.of / GeographyType.of already resolve a null CRS / algorithm to the default.
Assert the generic Parquet data writer fails fast for geometry and geography columns instead of silently degrading to the binary writer.
szehon-ho
left a comment
There was a problem hiding this comment.
One small consistency follow-up on the geo toString() changes (non-blocking). Overall the schema mapping and write-rejection look good.
…consistent output
|
Thanks @szehon-ho! Applied both suggestions in 9432405 — |
| private GeometryType(String crs) { | ||
| Preconditions.checkArgument(crs == null || !crs.isEmpty(), "Invalid CRS: (empty string)"); | ||
| this.crs = DEFAULT_CRS.equalsIgnoreCase(crs) ? null : crs; | ||
| this.crs = crs != null ? crs : DEFAULT_CRS; |
There was a problem hiding this comment.
Since the default-CRS match is case-insensitive (DEFAULT_CRS.equalsIgnoreCase(...)), a default column written by another engine/version in any casing — geometry, geometry(OGC:CRS84), or geometry(ogc:crs84) — all resolve to the same default-equal type. That's good behavior and worth keeping.
Minor suggestion: instead of resolving in crs() and routing equals/hashCode/toString through the getter, could we canonicalize once in the constructor and keep the stored field canonical?
private GeometryType(String crs) {
Preconditions.checkArgument(crs == null || !crs.isEmpty(), "Invalid CRS: (empty string)");
this.crs = crs == null || DEFAULT_CRS.equalsIgnoreCase(crs) ? DEFAULT_CRS : crs;
}Then crs() is just return crs;, equals/hashCode compare the field directly, and toString becomes String.format("%s(%s)", NAME, crs) — no getter indirection and the explanatory comments drop out. Same idea for GeographyType (you already resolve the algorithm default in the constructor). Observable behavior of crs(), equals, hashCode, and toString is identical to the current approach for every input (null, OGC:CRS84, ogc:crs84, non-default), so the added tests still pass.
There was a problem hiding this comment.
Good call — done in 6ec2544. Both GeometryType and GeographyType now canonicalize the default CRS (omitted or any casing) in the constructor, so crs() is a plain field getter and equals/hashCode/toString compare/format the stored field directly. The getter indirection and the explanatory comments are gone, and TestTypes (incl. the ogc:crs84 -> OGC:CRS84 cases) still passes since the observable behavior is identical.
| Preconditions.checkArgument(crs == null || !crs.isEmpty(), "Invalid CRS: (empty string)"); | ||
| this.crs = DEFAULT_CRS.equalsIgnoreCase(crs) ? null : crs; | ||
| // canonicalize the default CRS (any casing or omitted) so the stored field is comparable | ||
| this.crs = crs == null || DEFAULT_CRS.equalsIgnoreCase(crs) ? DEFAULT_CRS : crs; |
There was a problem hiding this comment.
discussed offline, lets clean this part up :) and also make sure we are case preserving if the user provides some default that is not upper case
There was a problem hiding this comment.
Done now we just has this.crs = crs, the pervious impl for normalizing is gone
Store the CRS and edge algorithm exactly as provided (null meaning the default) and resolve the default only in crs()/algorithm(), with equals/hashCode/toString comparing the resolved values. A non-canonical default casing such as ogc:crs84 is now preserved verbatim and treated as a distinct CRS rather than folded to the canonical OGC:CRS84.
crs84() now flows through the same pass-through constructor instead of duplicating the default field assignments.
Canonicalize only an omitted CRS/edge algorithm to the default in the constructor; a provided CRS keeps its original casing. equals compares the CRS with equalsIgnoreCase and hashCode hashes the upper-cased CRS so equal types stay hash-consistent. Any casing of a CRS (e.g. ogc:crs84) is therefore equal to the canonical default but still serializes with the casing the user provided.
Add Javadoc on GeometryType#crs() and GeographyType#crs() noting that CRS values are compared case-insensitively while each value keeps its original casing in toString() and serialized metadata.
Co-authored-by: Szehon Ho <szehon.apache@gmail.com>
Summary
Map Iceberg
geometryandgeographyprimitive types to and from Parquet's geometry / geography logical type annotations on aBINARYcolumn, so the geo types survive a schema round-trip throughParquetSchemaUtil.convertin both directions.TypeToMessageType: emitgeometry/geographyasBINARYannotated withLogicalTypeAnnotation.geometryType/geographyType, passing the resolved CRS and edge-interpolation algorithm through directly. Iceberg and Parquet use the same algorithm names (SPHERICAL,VINCENTY,THOMAS,ANDOYER,KARNEY), so the algorithm is mapped by name.MessageTypeToType: read those annotations back intoTypes.GeometryType/Types.GeographyType. An unsetcrs/algorithmmaps to the Iceberg default (the Parquet defaults are the same:OGC:CRS84/SPHERICAL), and algorithm names are resolved withEdgeAlgorithm.fromName, the same conversion used when parsing geography type strings (Types.fromPrimitiveString).Types.GeometryType/Types.GeographyType: serialize with explicit CRS / edge algorithm defaults (geometry(OGC:CRS84),geography(OGC:CRS84, spherical)) so table metadata preserves the concrete defaults being used, whileequals/hashCodecompare resolved defaults so default-equivalent forms remain semantically equal. No public signature changes.This is the first step of plumbing the geo value path through Parquet. It is intentionally schema mapping only — the generic value read/write path (
BaseParquetReaders/BaseParquetWriter) and theParquetMetricsguard for geo columns are separate follow-ups, so this PR stays small and easy to review. It is purely additive: no behavior changes for non-geo types.This is 1/N for #16650
Test plan
TestParquetSchemaUtil#testGeospatialTypeRoundTripround-trips a schema with default-CRS geometry, an explicit-CRS geometry, default geography, and a geography per edge algorithm (all five, so the by-name mapping is exercised for every constant in both directions) throughParquetSchemaUtil.convert.TestParquetSchemaUtil#testGeospatialAnnotationsWithOmittedParametersreads hand-builtMessageTypes with unset / explicit / explicit-default CRS and algorithm — covering files written by engines that omit defaults — and confirms each maps to the expected Iceberg type and serializes with explicit defaults.TestTypes#testGeospatialTypeDefaultNormalizationcoversequals()/hashCode()parity for the default-CRS and default-algorithm geography forms, thatalgorithm()still reportsSPHERICAL, and that a non-default algorithm stays distinct;testGeospatialTypeToStringcovers explicit-default rendering../gradlew :iceberg-api:check :iceberg-parquet:check— clean (tests, checkstyle, revapi, spotless)../gradlew :iceberg-core:test— full core suite green; no regressions inTestSchemaParser/TestSingleValueParser/TestGeospatialTableor anywhere geography types are serialized.