
Schema v1 to Aardvark migrator #143

Merged
thatbudakguy merged 1 commit into main from v1-aardvark
Jan 29, 2024

Conversation

@thatbudakguy

Member
  • Handle elements without crosswalk (via lookup tables)
  • Support migrating collections in dct_isPartOf_sm
  • Convert single to multivalued fields where appropriate
  • Retain custom fields and remove deprecated fields

Closes #121
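
The four behaviors listed above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the method name `migrate_record`, both constants, and the v1 field names on the left of the crosswalk are assumptions chosen for the example, and only a small subset of fields is shown.

```ruby
# Hypothetical subset of a v1 => Aardvark crosswalk (illustrative only).
SKETCH_CROSSWALK = {
  'dc_title_s' => 'dct_title_s',
  'dc_description_s' => 'dct_description_sm',
  'dc_rights_s' => 'dct_accessRights_s',
  'dct_provenance_s' => 'schema_provider_s',
  'geoblacklight_version' => 'gbl_mdVersion_s'
}.freeze

# Hypothetical list of v1 fields with no place in the migrated record.
SKETCH_DEPRECATED = %w[dc_type_s layer_geom_type_s].freeze

def migrate_record(record)
  migrated = {}
  record.each do |key, value|
    # Remove deprecated fields.
    next if SKETCH_DEPRECATED.include?(key)

    # Rename via the crosswalk; unknown (custom) fields keep their name.
    new_key = SKETCH_CROSSWALK.fetch(key, key)

    # Wrap single values when the target field is multivalued.
    value = Array(value) if new_key.end_with?('_sm', '_im')
    migrated[new_key] = value
  end
  migrated['gbl_mdVersion_s'] = 'Aardvark'
  migrated
end
```

Custom fields such as `uw_notice_s` fall through `fetch` unchanged, which matches the "retain custom fields" item in the list.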

@thatbudakguy thatbudakguy force-pushed the v1-aardvark branch 2 times, most recently from e0b8659 to c051c55 Compare March 1, 2023 23:10
@thatbudakguy thatbudakguy marked this pull request as ready for review March 1, 2023 23:11
thatbudakguy added a commit that referenced this pull request Mar 27, 2023
This fixes the error about a pending spec without a reason. #143 will un-pend the test when it is merged.
@thatbudakguy thatbudakguy marked this pull request as draft March 30, 2023 22:14
@thatbudakguy

Member Author

Making this a draft again pending discussion of behavior for some fields; see OpenGeoMetadata/metadata-issues#50

@the-codetrane


The path in lib/geo_combine/geoblacklight.rb:16 changed to "https://raw.githubusercontent.com/OpenGeoMetadata/opengeometadata.github.io/main/docs/schema/geoblacklight-schema-#{GEOBLACKLIGHT_VERSION}.json"

@thatbudakguy thatbudakguy force-pushed the v1-aardvark branch 3 times, most recently from 687f017 to f30e6d1 Compare August 3, 2023 20:30
@thatbudakguy thatbudakguy marked this pull request as ready for review August 3, 2023 20:31
@the-codetrane


@thatbudakguy any chance we can get solr_geom to dcat_bbox added to this? This is otherwise working.
[screenshot]

@thatbudakguy

thatbudakguy commented Sep 13, 2023

Member Author

@the-codetrane thx for pointing that out; I added a step to handle dcat_bbox. This PR is now blocked by #162.
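
As an illustration of what a step like this might look like (the PR's actual implementation is not shown in this thread), the sketch below copies an already-migrated `locn_geometry` value into `dcat_bbox`. The method name `add_bbox` is hypothetical, and the choice to copy only `ENVELOPE(...)` values is an assumption based on `dcat_bbox` holding a bounding box.

```ruby
# Hypothetical helper: fill dcat_bbox from the migrated geometry field.
def add_bbox(record)
  geom = record['locn_geometry']
  # Assumption: only ENVELOPE(...) strings describe a bounding box,
  # so any other geometry value is left without a dcat_bbox.
  if geom.is_a?(String) && geom.start_with?('ENVELOPE(')
    record['dcat_bbox'] = geom
  end
  record
end
```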

@the-codetrane

the-codetrane commented Oct 12, 2023


@thatbudakguy I found another key that could be migrated: layer_geom_type_s to gbl_resourceType_sm. The crosswalk documentation lists them as a deprecated field and a new field respectively, but they appear to be related.

@thatbudakguy

Member Author

There's code in this PR to do that: we use a lookup table to map geometry types to resource types. It's only straightforward for a few cases, imo. Does it not work for you?
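
A lookup-table approach along these lines might look like the sketch below. Only the `Polygon` to `Polygon data` pair is confirmed by the output later in this thread; the other entries, the constant name, and the method name are assumptions for illustration.

```ruby
# Hypothetical lookup table: v1 geometry type => Aardvark resource type.
# Only 'Polygon' => 'Polygon data' is confirmed by output in this thread.
GEOM_TO_RESOURCE_TYPE = {
  'Point' => 'Point data',
  'Line' => 'Line data',
  'Polygon' => 'Polygon data',
  'Raster' => 'Raster data'
}.freeze

def convert_geom_type(record)
  # The v1 field is removed whether or not a mapping exists.
  geom_type = record.delete('layer_geom_type_s')
  resource_type = GEOM_TO_RESOURCE_TYPE[geom_type]
  # The target field is multivalued, so the value is wrapped in an array.
  record['gbl_resourceType_sm'] = [resource_type] if resource_type
  record
end
```

Geometry types with no clear counterpart simply produce no `gbl_resourceType_sm` value here, which is one possible reading of "only straightforward for a few cases".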

@the-codetrane


This is what comes out when I run the migrator on a GBL 1.0 schema record:

{
  "dct_description_sm": [
    "This polygon shapefile represents the 1964 County Boundaries for China. The layer includes population census data and was primarily based on the \"Historical Administrative Maps of the People's Republic of China,\" published by China Map Press, and some other yearly administrative maps. See the documentation for more information and a list of the layer variables."
  ],
  "dct_format_s": "Shapefile",
  "dct_identifier_sm": [
    "http://hdl.handle.net/2451/34626"
  ],
  "dct_language_sm": [
    "English"
  ],
  "dct_publisher_sm": [
    "Beijing Hua tong ren shi chang xin xi you xian ze ren gong si"
  ],
  "dc_relation_sm": [
    "http://sws.geonames.org/1814991/about/rdf"
  ],
  "dct_accessRights_s": "Restricted",
  "dct_subject_sm": [
    "Boundaries",
    "Demographic surveys",
    "Population"
  ],
  "dct_title_s": "1964 County Boundaries of China with Population Census Data",
  "dc_type_s": "Dataset",
  "dct_isPartOf_sm": [
    "Historical China County Population Census Data"
  ],
  "dct_issued_s": "2005",
  "schema_provider_s": "NYU",
  "dct_references_s": "{\"http://schema.org/url\":\"http://hdl.handle.net/2451/34626\",\"http://www.opengis.net/def/serviceType/ogc/wfs\":\"https://maps-restricted.geo.nyu.edu/geoserver/sdr/wfs\",\"http://www.opengis.net/def/serviceType/ogc/wms\":\"https://maps-restricted.geo.nyu.edu/geoserver/sdr/wms\",\"http://schema.org/downloadUrl\":\"https://archive.nyu.edu/retrieve/74851/nyu_2451_34626.zip\",\"http://lccn.loc.gov/sh85035852\":\"https://archive.nyu.edu/retrieve/74896/nyu_2451_34626_doc.zip\"}",
  "dct_spatial_sm": [
    "People's Republic of China, China"
  ],
  "dct_temporal_sm": [
    "1964"
  ],
  "gbl_mdVersion_s": "Aardvark",
  "layer_geom_type_s": "Polygon", // I'M GUESSING THIS IS SUPPOSED TO BE SOMETHING ELSE?
  "gbl_wxsIdentifier_s": "sdr:nyu_2451_34626",
  "gbl_mdModified_dt": "2016-11-10T15:51:38Z",
  "id": "nyu-2451-34626",
  "nyu_addl_dspace_s": "35559",
  "locn_geometry": "ENVELOPE(73.557693, 134.773911, 53.56086, 10.175472)",
  "gbl_indexYear_im": [
    1964
  ],
  "nyu_addl_format_sm": [
    "Shapefile"
  ],
  "_version_": 1779481613907787776,
  "timestamp": "2023-10-11T17:38:31.500Z"
}

@srappel

srappel commented Oct 12, 2023


"layer_geom_type_s": "Polygon", // I'M GUESSING THIS IS SUPPOSED TO BE SOMETHING ELSE?

I would expect "gbl_resourceType_sm": "Polygon Data" according to the controlled vocab

@thatbudakguy

Member Author

@the-codetrane can you share the record that you transformed to get that output?

@the-codetrane


@thatbudakguy My contract at NYU ended, so I'm outside the walled garden. @mnyrop should be able to help you with this.

@thatbudakguy

thatbudakguy commented Jan 19, 2024

Member Author

OK, I found the record. I ran it through the migrator myself and got:

{
  "dct_creator_sm": [],
  "dct_description_sm": [
    "This polygon shapefile represents the 1964 County Boundaries for China. The layer includes population census data and was primarily based on the \"Historical Administrative Maps of the People's Republic of China,\" published by China Map Press, and some other yearly administrative maps. See the documentation for more information and a list of the layer variables."
  ],
  "dct_format_s": "Shapefile",
  "dct_identifier_sm": ["http://hdl.handle.net/2451/34626"],
  "dct_language_sm": ["English"],
  "dct_publisher_sm": [
    "Beijing Hua tong ren shi chang xin xi you xian ze ren gong si"
  ],
  "dc_relation_sm": ["http://sws.geonames.org/1814991/about/rdf"],
  "dct_accessRights_s": "Restricted",
  "dct_subject_sm": ["Boundaries", "Demographic surveys", "Population"],
  "dct_title_s": "1964 County Boundaries of China with Population Census Data",
  "dct_issued_s": "2005",
  "schema_provider_s": "NYU",
  "dct_references_s": "{\"http://schema.org/url\":\"http://hdl.handle.net/2451/34626\",\"http://www.opengis.net/def/serviceType/ogc/wfs\":\"https://maps-restricted.geo.nyu.edu/geoserver/sdr/wfs\",\"http://www.opengis.net/def/serviceType/ogc/wms\":\"https://maps-restricted.geo.nyu.edu/geoserver/sdr/wms\",\"http://schema.org/downloadUrl\":\"https://archive.nyu.edu/retrieve/74851/nyu_2451_34626.zip\",\"http://lccn.loc.gov/sh85035852\":\"https://archive.nyu.edu/retrieve/74896/nyu_2451_34626_doc.zip\"}",
  "dct_spatial_sm": ["People's Republic of China, China"],
  "dct_temporal_sm": ["1964"],
  "gbl_mdVersion_s": "Aardvark",
  "gbl_wxsIdentifier_s": "sdr:nyu_2451_34626",
  "gbl_mdModified_dt": "2016-11-10T15:51:38Z",
  "id": "nyu-2451-34626",
  "nyu_addl_dspace_s": "35559",
  "dcat_bbox": "ENVELOPE(73.557693, 134.773911, 53.56086, 10.175472)",
  "gbl_indexYear_im": [1964],
  "gbl_resourceClass_s": ["Datasets"],
  "gbl_resourceType_s": ["Polygon data"]
}

It turned out there was just a typo; the new field is gbl_resourceType_sm (not gbl_resourceType_s), as it's multi-valued. Otherwise, the conversion works as expected (it outputs Polygon data and the original field is stripped).

I've corrected the mistake.

@karenmajewicz

Contributor

Resource Class is also multivalued: gbl_resourceClass_sm
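
Typos of this kind (an array stored under a single-valued field name) could be caught with a simple check of field-name suffix against value shape. The sketch below is not part of the PR; the method name is hypothetical, and it assumes the naming convention seen in this thread, where `_sm` and `_im` mark multivalued fields and `_s`, `_i`, and `_dt` mark single-valued ones.

```ruby
# Hypothetical check: list keys whose suffix disagrees with the value's shape.
def arity_mismatches(record)
  record.select do |key, value|
    multi = key.end_with?('_sm', '_im')
    single = key.end_with?('_s', '_i', '_dt')
    # Multivalued names need arrays; single-valued names must not hold arrays.
    (multi && !value.is_a?(Array)) || (single && value.is_a?(Array))
  end.keys
end
```

Run against the migrated record posted above, a check like this would flag both `gbl_resourceType_s` and `gbl_resourceClass_s`, since each holds an array under a single-valued name.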

@srappel srappel left a comment


I tested it out with this record and got the following output:

{"gbl_mdVersion_s":"Aardvark",
 "dct_identifier_sm":["930D4EA3-442E-4A28-AEC7-830F1A6CB5F8"],
 "dct_title_s":"Land Use Milwaukee County, WI 1963",
 "dct_description_sm":["This data layer represents land use for Milwaukee County, Wisconsin in 1963."],
 "dct_accessRights_s":"Public",
 "schema_provider_s":"UW-Madison Robinson Map Library",
 "gbl_wxsIdentifier_s":"",
 "id":"930D4EA3-442E-4A28-AEC7-830F1A6CB5F8",
 "gbl_mdModified_dt":"2022-01-22T20:12:43Z",
 "dct_format_s":"Shapefile",
 "dct_language_sm":["English"],
 "dct_creator_sm":["Southeastern Wisconsin Regional Planning Commission"],
 "dc_publisher_sm":[""],
 "dct_subject_sm":["Planning and Cadastral"],
 "dct_spatial_sm":[],
 "dct_issued_s":"",
 "dct_temporal_sm":["1963"],
 "gbl_indexYear_im":[1963],
 "dct_references_s":
  "{\"http://schema.org/downloadUrl\":\"https://gisdata.wisc.edu/public/Milwaukee_LandUse_1963.zip\",\"http://www.isotc211.org/schemas/2005/gmd/\":\"https://gisdata.wisc.edu/public/metadata/Milwaukee_LandUse_1963.xml\"}",
 "dcat_bbox":"ENVELOPE(-88.074273, -87.812986, 43.195098, 42.83888)",
 "uw_supplemental_s":"For more information: http://www.sewrpc.org/SEWRPC/LandUse.htm",
 "uw_notice_s":"",
 "gbl_resourceClass_s":["Datasets"],
 "gbl_resourceType_sm":["Polygon data"]}

I looked through it pretty carefully and don't see anything unusual. Note the local fields uw_notice_s and uw_supplemental_s both seem to have just come through as-is, which I assume is the default behavior.

@thatbudakguy thatbudakguy force-pushed the v1-aardvark branch 2 times, most recently from 9338a89 to 52db14d Compare January 29, 2024 22:34
Development

Successfully merging this pull request may close these issues.

Create class to convert from schema version 1 to Aardvark