Data preparation guide
This guide provides detailed instructions for preparing your geospatial data for use with GEM, including code examples, validation techniques, and best practices.
Data format requirements
GEM requires input data in Apache Parquet format with a specific schema.
Required schema
| Field | Type | Description | Example |
|---|---|---|---|
| id | integer or string | Unique identifier for each road segment | 5707295 |
| is_navigable | boolean | Whether the road is navigable by vehicles | true |
| geometry | string | Road geometry in WKT LineString format | "LINESTRING (145.18 -37.87, 145.18 -37.87)" |
Geometry format
The geometry field must contain valid Well-Known Text (WKT) LineString geometries:
```
LINESTRING (longitude1 latitude1, longitude2 latitude2, ...)
```

Examples:

```
LINESTRING (145.18156 -37.87340, 145.18092 -37.87356)
LINESTRING (4.8952 52.3702, 4.8960 52.3710, 4.8975 52.3725)
```

Creating Parquet files
Using Python (pandas + pyarrow)
```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Create sample data
data = {
    'id': [1, 2, 3, 4, 5],
    'is_navigable': [True, True, False, True, True],
    'geometry': [
        'LINESTRING (4.8952 52.3702, 4.8960 52.3710)',
        'LINESTRING (4.8960 52.3710, 4.8975 52.3725)',
        'LINESTRING (4.8975 52.3725, 4.8990 52.3740)',
        'LINESTRING (4.8990 52.3740, 4.9005 52.3755)',
        'LINESTRING (4.9005 52.3755, 4.9020 52.3770)'
    ]
}

# Create DataFrame
df = pd.DataFrame(data)

# Define schema with correct types
schema = pa.schema([
    ('id', pa.int64()),
    ('is_navigable', pa.bool_()),
    ('geometry', pa.string())
])

# Convert to PyArrow Table with schema
table = pa.Table.from_pandas(df, schema=schema)

# Write to Parquet
pq.write_table(table, 'my_road_data.parquet')

print(f"Created Parquet file with {len(df)} records")
```

Using Python (GeoPandas)
If your data is already in a geospatial format (Shapefile, GeoJSON, etc.):
```python
import geopandas as gpd
import pandas as pd

# Read source data
gdf = gpd.read_file('roads.shp')

# Prepare for GEM
gem_data = pd.DataFrame({
    'id': range(1, len(gdf) + 1),                    # Generate unique IDs
    'is_navigable': gdf['navigable'].fillna(True),   # Default to True
    'geometry': gdf.geometry.apply(lambda g: g.wkt)  # Convert to WKT
})

# Filter to LineStrings only
gem_data = gem_data[gem_data['geometry'].str.startswith('LINESTRING')]

# Save as Parquet
gem_data.to_parquet('gem_input.parquet', index=False)

print(f"Exported {len(gem_data)} road segments")
```

Using PySpark
For large datasets:
```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType, BooleanType, StringType

# Initialize Spark
spark = SparkSession.builder.appName("GEM Data Prep").getOrCreate()

# Define schema
schema = StructType([
    StructField("id", LongType(), False),
    StructField("is_navigable", BooleanType(), False),
    StructField("geometry", StringType(), False)
])

# Read your source data
source_df = spark.read.format("your_format").load("your_data")

# Transform to GEM schema
gem_df = source_df.select(
    source_df["road_id"].alias("id"),
    source_df["navigable"].alias("is_navigable"),
    source_df["wkt_geometry"].alias("geometry")
)

# Write as Parquet
gem_df.write.parquet("gem_input.parquet")
```

Data validation
Always validate your data before uploading to GEM.
Python validation script
```python
import pandas as pd
import re

def validate_gem_data(filepath):
    """Validate a Parquet file for GEM compatibility."""
    print(f"Validating: {filepath}")
    errors = []
    warnings = []

    # Read the file
    try:
        df = pd.read_parquet(filepath)
    except Exception as e:
        return [f"Cannot read file: {e}"], []

    print(f"Total records: {len(df)}")

    # Check required columns
    required_cols = ['id', 'is_navigable', 'geometry']
    missing_cols = [c for c in required_cols if c not in df.columns]
    if missing_cols:
        errors.append(f"Missing required columns: {missing_cols}")
        return errors, warnings

    # Check for null values
    for col in required_cols:
        null_count = df[col].isnull().sum()
        if null_count > 0:
            errors.append(f"Column '{col}' has {null_count} null values")

    # Check ID uniqueness
    duplicate_ids = df['id'].duplicated().sum()
    if duplicate_ids > 0:
        errors.append(f"Found {duplicate_ids} duplicate IDs")

    # Check data types (the schema allows integer or string IDs)
    if not (pd.api.types.is_integer_dtype(df['id'])
            or pd.api.types.is_string_dtype(df['id'])):
        errors.append(f"Column 'id' should be integer or string, got {df['id'].dtype}")

    if not pd.api.types.is_bool_dtype(df['is_navigable']):
        errors.append(f"Column 'is_navigable' should be boolean, got {df['is_navigable'].dtype}")

    # Validate geometries
    linestring_pattern = r'^LINESTRING\s*\([^)]+\)$'
    invalid_geom = 0
    for idx, geom in df['geometry'].items():
        if not isinstance(geom, str):
            invalid_geom += 1
        elif not re.match(linestring_pattern, geom.strip(), re.IGNORECASE):
            invalid_geom += 1

    if invalid_geom > 0:
        errors.append(f"Found {invalid_geom} invalid geometries (must be WKT LINESTRING)")

    # Check for empty geometries
    empty_geom = df['geometry'].str.contains(r'LINESTRING\s*\(\s*\)', case=False, regex=True).sum()
    if empty_geom > 0:
        warnings.append(f"Found {empty_geom} empty geometries")

    # Summary
    print("\nValidation Results:")
    print(f"  Errors: {len(errors)}")
    print(f"  Warnings: {len(warnings)}")

    if errors:
        print("\nErrors:")
        for e in errors:
            print(f"  ❌ {e}")

    if warnings:
        print("\nWarnings:")
        for w in warnings:
            print(f"  ⚠️ {w}")

    if not errors:
        print("\n✅ File is valid for GEM!")

    return errors, warnings

# Usage
errors, warnings = validate_gem_data('my_road_data.parquet')
```

Common data quality issues
Issue 1: Invalid geometry format
Problem: Geometries not in WKT LineString format.
Solution:
```python
from shapely import wkt
from shapely.geometry import LineString

def fix_geometry(geom):
    """Return geom if it is valid WKT for a LineString, else None."""
    try:
        parsed = wkt.loads(geom)
        if isinstance(parsed, LineString):
            return geom
        return None  # Valid WKT, but not a LineString
    except Exception:
        return None  # Not parseable as WKT

df['geometry'] = df['geometry'].apply(fix_geometry)
df = df.dropna(subset=['geometry'])
```

Issue 2: Duplicate IDs
Problem: Multiple records share the same ID.
Solution:
```python
# Option 1: Keep first occurrence
df = df.drop_duplicates(subset=['id'], keep='first')

# Option 2: Regenerate IDs
df['id'] = range(1, len(df) + 1)
```

Issue 3: Mixed geometry types
Problem: Dataset contains Points, Polygons, etc. alongside LineStrings.
Solution:
```python
# Filter to LineStrings only
df = df[df['geometry'].str.upper().str.startswith('LINESTRING')]
```

Issue 4: Coordinate system issues
Problem: Coordinates in wrong order or projection.
Solution:
```python
import geopandas as gpd

# Read and reproject
gdf = gpd.read_file('roads.shp')
gdf = gdf.to_crs('EPSG:4326')  # Convert to WGS84

# Extract WKT
df['geometry'] = gdf.geometry.apply(lambda g: g.wkt)
```

Best practices
Before uploading
- Start small: Test with a subset (1,000-10,000 records) before processing the full dataset
- Validate thoroughly: Run the validation script on every file
- Check file size: Large files may take longer to upload; plan accordingly
- Use descriptive filenames: `city_roads_2024_v1.parquet`, not `data.parquet`
Data quality tips
- Clean geometries: Remove self-intersections and invalid geometries
- Ensure connectivity: Connected road networks match better than isolated segments
- Include all segments: Don’t filter out small roads—they help with context
- Accurate navigability: Set `is_navigable` correctly for better matching
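One cheap way to gauge the connectivity tip above is to count segments whose endpoints touch no other segment. A rough sketch that compares endpoint coordinates as exact text (an assumption for illustration: real data usually needs a snapping tolerance, which this does not implement):

```python
from collections import Counter

def isolated_segments(wkt_lines):
    """Count LINESTRINGs sharing no endpoint with any other segment."""
    endpoints = []
    for w in wkt_lines:
        # First and last coordinate pair inside the parentheses
        coords = w[w.index('(') + 1:w.rindex(')')].split(',')
        endpoints.append((coords[0].strip(), coords[-1].strip()))

    # Number of segment endpoints landing on each coordinate
    usage = Counter(p for pair in endpoints for p in pair)

    return sum(1 for a, b in endpoints if usage[a] == 1 and usage[b] == 1)

roads = [
    'LINESTRING (4.8952 52.3702, 4.8960 52.3710)',
    'LINESTRING (4.8960 52.3710, 4.8975 52.3725)',  # chained to the first
    'LINESTRING (5.1000 52.0000, 5.1010 52.0010)',  # isolated
]
print(isolated_segments(roads))  # → 1
```

A high isolated-segment count is a hint that coordinates were truncated or that connecting roads were filtered out.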
File naming conventions
Recommended naming pattern:
```
{region}_{data_type}_{date}_{version}.parquet
```

Examples:

```
netherlands_roads_20240115_v1.parquet
california_highways_20240120_v2.parquet
tokyo_streets_20240118_final.parquet
```
Sample data
Here’s a minimal sample file you can use for testing:
```python
import pandas as pd

# Sample Amsterdam road segments
sample_data = {
    'id': [1, 2, 3, 4, 5],
    'is_navigable': [True, True, True, True, False],
    'geometry': [
        'LINESTRING (4.8952 52.3702, 4.8960 52.3710)',
        'LINESTRING (4.8960 52.3710, 4.8975 52.3725)',
        'LINESTRING (4.8975 52.3725, 4.8990 52.3740)',
        'LINESTRING (4.8990 52.3740, 4.9005 52.3755)',
        'LINESTRING (4.9005 52.3755, 4.9020 52.3770)'
    ]
}

df = pd.DataFrame(sample_data)
df.to_parquet('sample_gem_input.parquet', index=False)
print("Sample file created: sample_gem_input.parquet")
```

Output data schema
| Field | Type | Description |
|---|---|---|
| id | string or integer | Your original road segment ID |
| gers | string | Matched GERS ID (UUID format) |
| confidence | integer | Match confidence score (0-100) |
| lr_id | string | Linear reference: coordinates and GERS ID |
| lr_gers | string | Linear reference: distance range and original ID |
Example:
{"id":"abc","gers":"550e8400-e29b-41d4-a716-446655440000","confidence":99,"lr_id":"52.0197-76.36744#550e8400-e29b-41d4-a716-446655440000","lr_gers":"0.0-100.0#abc"}Next steps
Once your data is prepared and validated:
- UI Workflow Guide - Upload through the web interface
- API Workflow Guide - Upload and manage data through the API
- Quick Reference - Command cheat sheet