Global Entity Matcher (GEM)

Data preparation guide

This guide provides detailed instructions for preparing your geospatial data for use with GEM, including code examples, validation techniques, and best practices.

Data format requirements

GEM requires input data in Apache Parquet format with a specific schema.

Required schema

Field        | Type              | Description                               | Example
id           | integer or string | Unique identifier for each road segment   | 5707295
is_navigable | boolean           | Whether the road is navigable by vehicles | true
geometry     | string            | Road geometry in WKT LineString format    | "LINESTRING (145.18156 -37.87340, 145.18092 -37.87356)"

Geometry format

The geometry field must contain valid Well-Known Text (WKT) LineString geometries:

LINESTRING (longitude1 latitude1, longitude2 latitude2, ...)

Examples:

LINESTRING (145.18156 -37.87340, 145.18092 -37.87356)
LINESTRING (4.8952 52.3702, 4.8960 52.3710, 4.8975 52.3725)
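As a quick sanity check before the fuller validation described later, a lightweight regex can confirm that a string at least has the LineString shape. This is only a format sketch (the helper name is ours); it does not check geometric validity the way a real WKT parser such as shapely would:

```python
import re

# Matches 'LINESTRING (x1 y1, x2 y2, ...)' with two or more coordinate pairs.
_PAIR = r'-?\d+(?:\.\d+)?\s+-?\d+(?:\.\d+)?'
_LINESTRING_RE = re.compile(
    r'^LINESTRING\s*\(\s*' + _PAIR + r'(?:\s*,\s*' + _PAIR + r')+\s*\)$',
    re.IGNORECASE,
)

def looks_like_linestring(text):
    """Cheap shape check; a real parser (e.g. shapely) is stricter."""
    return bool(_LINESTRING_RE.match(text.strip()))
```

Note that a single-point "LINESTRING (4.8952 52.3702)" is rejected, since a valid LineString needs at least two points.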

Creating Parquet files

Using Python (pandas + pyarrow)

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Create sample data
data = {
    'id': [1, 2, 3, 4, 5],
    'is_navigable': [True, True, False, True, True],
    'geometry': [
        'LINESTRING (4.8952 52.3702, 4.8960 52.3710)',
        'LINESTRING (4.8960 52.3710, 4.8975 52.3725)',
        'LINESTRING (4.8975 52.3725, 4.8990 52.3740)',
        'LINESTRING (4.8990 52.3740, 4.9005 52.3755)',
        'LINESTRING (4.9005 52.3755, 4.9020 52.3770)'
    ]
}

# Create DataFrame
df = pd.DataFrame(data)

# Define schema with correct types
schema = pa.schema([
    ('id', pa.int64()),
    ('is_navigable', pa.bool_()),
    ('geometry', pa.string())
])

# Convert to a PyArrow Table with the explicit schema
table = pa.Table.from_pandas(df, schema=schema)

# Write to Parquet
pq.write_table(table, 'my_road_data.parquet')
print(f"Created Parquet file with {len(df)} records")

Using Python (GeoPandas)

If your data is already in a geospatial format (Shapefile, GeoJSON, etc.):

import geopandas as gpd
import pandas as pd

# Read source data
gdf = gpd.read_file('roads.shp')

# Prepare for GEM
gem_data = pd.DataFrame({
    'id': range(1, len(gdf) + 1),                    # Generate unique IDs
    'is_navigable': gdf['navigable'].fillna(True),   # Default to True
    'geometry': gdf.geometry.apply(lambda g: g.wkt)  # Convert to WKT
})

# Filter to LineStrings only
gem_data = gem_data[gem_data['geometry'].str.startswith('LINESTRING')]

# Save as Parquet
gem_data.to_parquet('gem_input.parquet', index=False)
print(f"Exported {len(gem_data)} road segments")

Using PySpark

For large datasets:

from pyspark.sql import SparkSession
from pyspark.sql.types import LongType, BooleanType, StringType

# Initialize Spark
spark = SparkSession.builder.appName("GEM Data Prep").getOrCreate()

# Read your source data
source_df = spark.read.format("your_format").load("your_data")

# Transform to the GEM schema, casting to the required types
gem_df = source_df.select(
    source_df["road_id"].cast(LongType()).alias("id"),
    source_df["navigable"].cast(BooleanType()).alias("is_navigable"),
    source_df["wkt_geometry"].cast(StringType()).alias("geometry")
)

# Write as Parquet (note: Spark writes a directory of part files)
gem_df.write.mode("overwrite").parquet("gem_input.parquet")

Data validation

Always validate your data before uploading to GEM.

Python validation script

import pandas as pd
import re

def validate_gem_data(filepath):
    """Validate a Parquet file for GEM compatibility."""
    print(f"Validating: {filepath}")
    errors = []
    warnings = []

    # Read the file
    try:
        df = pd.read_parquet(filepath)
    except Exception as e:
        return [f"Cannot read file: {e}"], []
    print(f"Total records: {len(df)}")

    # Check required columns
    required_cols = ['id', 'is_navigable', 'geometry']
    missing_cols = [c for c in required_cols if c not in df.columns]
    if missing_cols:
        errors.append(f"Missing required columns: {missing_cols}")
        return errors, warnings

    # Check for null values
    for col in required_cols:
        null_count = df[col].isnull().sum()
        if null_count > 0:
            errors.append(f"Column '{col}' has {null_count} null values")

    # Check ID uniqueness
    duplicate_ids = df['id'].duplicated().sum()
    if duplicate_ids > 0:
        errors.append(f"Found {duplicate_ids} duplicate IDs")

    # Check data types (the schema allows integer or string IDs)
    if not (pd.api.types.is_integer_dtype(df['id']) or pd.api.types.is_string_dtype(df['id'])):
        errors.append(f"Column 'id' should be integer or string, got {df['id'].dtype}")
    if not pd.api.types.is_bool_dtype(df['is_navigable']):
        errors.append(f"Column 'is_navigable' should be boolean, got {df['is_navigable'].dtype}")

    # Validate geometries
    linestring_pattern = r'^LINESTRING\s*\([^)]+\)$'
    invalid_geom = 0
    for idx, geom in df['geometry'].items():
        if not isinstance(geom, str):
            invalid_geom += 1
        elif not re.match(linestring_pattern, geom.strip(), re.IGNORECASE):
            invalid_geom += 1
    if invalid_geom > 0:
        errors.append(f"Found {invalid_geom} invalid geometries (must be WKT LINESTRING)")

    # Check for empty geometries (na=False guards against non-string values)
    empty_geom = df['geometry'].str.contains(r'LINESTRING\s*\(\s*\)', case=False,
                                             regex=True, na=False).sum()
    if empty_geom > 0:
        warnings.append(f"Found {empty_geom} empty geometries")

    # Summary
    print(f"\nValidation Results:")
    print(f"  Errors: {len(errors)}")
    print(f"  Warnings: {len(warnings)}")
    if errors:
        print("\nErrors:")
        for e in errors:
            print(f"  ❌ {e}")
    if warnings:
        print("\nWarnings:")
        for w in warnings:
            print(f"  ⚠️ {w}")
    if not errors:
        print("\n✅ File is valid for GEM!")
    return errors, warnings

# Usage
errors, warnings = validate_gem_data('my_road_data.parquet')

Common data quality issues

Issue 1: Invalid geometry format

Problem: Geometries not in WKT LineString format.

Solution:

from shapely import wkt
from shapely.geometry import LineString

def fix_geometry(geom):
    """Keep only geometries that parse as WKT LineStrings."""
    try:
        # If it's already a valid WKT string
        parsed = wkt.loads(geom)
        if isinstance(parsed, LineString):
            return geom
        return None  # Not a LineString
    except Exception:
        return None  # Not parseable as WKT

df['geometry'] = df['geometry'].apply(fix_geometry)
df = df.dropna(subset=['geometry'])

Issue 2: Duplicate IDs

Problem: Multiple records share the same ID.

Solution:

# Option 1: Keep first occurrence
df = df.drop_duplicates(subset=['id'], keep='first')
# Option 2: Regenerate IDs
df['id'] = range(1, len(df) + 1)

Issue 3: Mixed geometry types

Problem: Dataset contains Points, Polygons, etc. alongside LineStrings.

Solution:

# Filter to LineStrings only
df = df[df['geometry'].str.upper().str.startswith('LINESTRING')]
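Before filtering, it can help to see what you are about to drop. A small sketch (illustrative data, assuming a pandas DataFrame with a string geometry column) that tallies the leading WKT keyword:

```python
import pandas as pd

# Example frame with mixed geometry types (illustrative data)
df = pd.DataFrame({'geometry': [
    'LINESTRING (0 0, 1 1)',
    'POINT (2 2)',
    'LINESTRING (1 1, 2 2)',
    'POLYGON ((0 0, 1 0, 1 1, 0 0))',
]})

# Tally the leading WKT keyword to see what filtering would remove
type_counts = (
    df['geometry']
    .str.extract(r'^\s*([A-Za-z]+)', expand=False)
    .str.upper()
    .value_counts()
)
print(type_counts)
```

If most of the dataset turns out to be Points or Polygons, that usually points at the wrong source layer rather than a filtering problem.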

Issue 4: Coordinate system issues

Problem: Coordinates in wrong order or projection.

Solution:

import geopandas as gpd

# Read and reproject to WGS84 (longitude/latitude, EPSG:4326)
gdf = gpd.read_file('roads.shp')
gdf = gdf.to_crs('EPSG:4326')

# Extract WKT from the reprojected geometries into the GEM DataFrame
df['geometry'] = gdf.geometry.apply(lambda g: g.wkt)
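A cheap way to catch a wrong projection is to range-check the coordinates: in WGS84, longitudes fit in ±180 and latitudes in ±90, whereas projected coordinates (e.g. UTM metres) do not. A stdlib sketch (helper name is ours; note that swapped lon/lat pairs can still slip through when both values are small):

```python
import re

_NUM = re.compile(r'-?\d+(?:\.\d+)?')

def coords_in_wgs84_range(wkt_line):
    """Heuristic: every 'lon lat' pair must satisfy |lon| <= 180 and |lat| <= 90.
    Failures usually mean a projected CRS; swapped axes can still pass."""
    nums = [float(n) for n in _NUM.findall(wkt_line)]
    pairs = zip(nums[0::2], nums[1::2])
    return all(abs(lon) <= 180 and abs(lat) <= 90 for lon, lat in pairs)
```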

Best practices

Before uploading

  1. Start small: Test with a subset (1,000-10,000 records) before processing the full dataset
  2. Validate thoroughly: Run the validation script on every file
  3. Check file size: Large files may take longer to upload; plan accordingly
  4. Use descriptive filenames: city_roads_2024_v1.parquet, not data.parquet

Data quality tips

  1. Clean geometries: Remove self-intersections and invalid geometries
  2. Ensure connectivity: Connected road networks match better than isolated segments
  3. Include all segments: Don’t filter out small roads—they help with context
  4. Accurate navigability: Set is_navigable correctly for better matching
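The connectivity tip can be checked cheaply before uploading: treat each segment's endpoints as graph nodes and count connected components with union-find. A stdlib sketch (helper name is ours; endpoints are matched on coordinates rounded to 6 decimals, roughly 0.1 m in WGS84):

```python
import re

def count_components(wkt_lines):
    """Count connected components of a road network by shared endpoints."""
    num = re.compile(r'-?\d+(?:\.\d+)?')
    parent = {}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for line in wkt_lines:
        nums = [round(float(n), 6) for n in num.findall(line)]
        start = (nums[0], nums[1])
        end = (nums[-2], nums[-1])
        union(start, end)

    return len({find(p) for p in parent})
```

A large number of components relative to the segment count suggests disconnected fragments, which tend to match worse than a connected network.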

File naming conventions

Recommended naming pattern:

{region}_{data_type}_{date}_{version}.parquet

Examples:

  • netherlands_roads_20240115_v1.parquet
  • california_highways_20240120_v2.parquet
  • tokyo_streets_20240118_final.parquet

Sample data

Here’s a minimal sample file you can use for testing:

import pandas as pd

# Sample Amsterdam road segments
sample_data = {
    'id': [1, 2, 3, 4, 5],
    'is_navigable': [True, True, True, True, False],
    'geometry': [
        'LINESTRING (4.8952 52.3702, 4.8960 52.3710)',
        'LINESTRING (4.8960 52.3710, 4.8975 52.3725)',
        'LINESTRING (4.8975 52.3725, 4.8990 52.3740)',
        'LINESTRING (4.8990 52.3740, 4.9005 52.3755)',
        'LINESTRING (4.9005 52.3755, 4.9020 52.3770)'
    ]
}

df = pd.DataFrame(sample_data)
df.to_parquet('sample_gem_input.parquet', index=False)
print("Sample file created: sample_gem_input.parquet")

Output data schema

Field      | Type              | Description
id         | string or integer | Your original road segment ID
gers       | string            | Matched GERS ID (UUID format)
confidence | integer           | Match confidence score (0-100)
lr_id      | string            | Linear reference: coordinates and GERS ID
lr_gers    | string            | Linear reference: distance range and original ID

Example:

{"id":"abc","gers":"550e8400-e29b-41d4-a716-446655440000","confidence":99,"lr_id":"52.0197-76.36744#550e8400-e29b-41d4-a716-446655440000","lr_gers":"0.0-100.0#abc"}

Next steps

Once your data is prepared and validated: