Global Entity Matcher (GEM)

Data preparation guide

This guide provides detailed instructions for preparing your geospatial data for use with GEM, including code examples, validation techniques, and best practices.

Data format requirements

GEM requires input data in Apache Parquet format with a specific schema.

Required schema

Field        | Type              | Description                               | Example
id           | integer or string | Unique identifier for each road segment   | 5707295
is_navigable | boolean           | Whether the road is navigable by vehicles | true
geometry     | string            | Road geometry in WKT LineString format    | "LINESTRING (145.18156 -37.87340, 145.18092 -37.87356)"

Geometry format

The geometry field must contain valid Well-Known Text (WKT) LineString geometries:

LINESTRING (longitude1 latitude1, longitude2 latitude2, ...)

Examples:

LINESTRING (145.18156 -37.87340, 145.18092 -37.87356)
LINESTRING (4.8952 52.3702, 4.8960 52.3710, 4.8975 52.3725)
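As a quick sanity check before the fuller validation described later, a lightweight regex can confirm that a string at least has the LineString shape. This is only a format sketch (the helper name is ours); it does not check geometric validity the way a real WKT parser such as shapely would:

```python
import re

# Matches 'LINESTRING (x1 y1, x2 y2, ...)' with two or more coordinate pairs.
_PAIR = r'-?\d+(?:\.\d+)?\s+-?\d+(?:\.\d+)?'
_LINESTRING_RE = re.compile(
    r'^LINESTRING\s*\(\s*' + _PAIR + r'(?:\s*,\s*' + _PAIR + r')+\s*\)$',
    re.IGNORECASE,
)

def looks_like_linestring(text):
    """Cheap shape check; a real parser (e.g. shapely) is stricter."""
    return bool(_LINESTRING_RE.match(text.strip()))
```

Note that a single-point "LINESTRING (4.8952 52.3702)" is rejected, since a valid LineString needs at least two points.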

Creating Parquet files

Using Python (pandas + pyarrow)

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Create sample data
data = {
    'id': [1, 2, 3, 4, 5],
    'is_navigable': [True, True, False, True, True],
    'geometry': [
        'LINESTRING (4.8952 52.3702, 4.8960 52.3710)',
        'LINESTRING (4.8960 52.3710, 4.8975 52.3725)',
        'LINESTRING (4.8975 52.3725, 4.8990 52.3740)',
        'LINESTRING (4.8990 52.3740, 4.9005 52.3755)',
        'LINESTRING (4.9005 52.3755, 4.9020 52.3770)'
    ]
}

# Create DataFrame
df = pd.DataFrame(data)

# Define schema with correct types
schema = pa.schema([
    ('id', pa.int64()),
    ('is_navigable', pa.bool_()),
    ('geometry', pa.string())
])

# Convert to a PyArrow Table with the explicit schema
table = pa.Table.from_pandas(df, schema=schema)

# Write to Parquet
pq.write_table(table, 'my_road_data.parquet')
print(f"Created Parquet file with {len(df)} records")

Using Python (GeoPandas)

If your data is already in a geospatial format (Shapefile, GeoJSON, etc.):

import geopandas as gpd
import pandas as pd

# Read source data
gdf = gpd.read_file('roads.shp')

# Prepare for GEM
gem_data = pd.DataFrame({
    'id': range(1, len(gdf) + 1),                    # Generate unique IDs
    'is_navigable': gdf['navigable'].fillna(True),   # Default to True
    'geometry': gdf.geometry.apply(lambda g: g.wkt)  # Convert to WKT
})

# Filter to LineStrings only
gem_data = gem_data[gem_data['geometry'].str.startswith('LINESTRING')]

# Save as Parquet
gem_data.to_parquet('gem_input.parquet', index=False)
print(f"Exported {len(gem_data)} road segments")

Using PySpark

For large datasets:

from pyspark.sql import SparkSession
from pyspark.sql.types import LongType, BooleanType, StringType

# Initialize Spark
spark = SparkSession.builder.appName("GEM Data Prep").getOrCreate()

# Read your source data
source_df = spark.read.format("your_format").load("your_data")

# Transform to the GEM schema, casting to the required types
gem_df = source_df.select(
    source_df["road_id"].cast(LongType()).alias("id"),
    source_df["navigable"].cast(BooleanType()).alias("is_navigable"),
    source_df["wkt_geometry"].cast(StringType()).alias("geometry")
)

# Write as Parquet (note: Spark writes a directory of part files)
gem_df.write.mode("overwrite").parquet("gem_input.parquet")

Data validation

Always validate your data before uploading to GEM.

Python validation script

import pandas as pd
import re

def validate_gem_data(filepath):
    """Validate a Parquet file for GEM compatibility."""
    print(f"Validating: {filepath}")
    errors = []
    warnings = []

    # Read the file
    try:
        df = pd.read_parquet(filepath)
    except Exception as e:
        return [f"Cannot read file: {e}"], []
    print(f"Total records: {len(df)}")

    # Check required columns
    required_cols = ['id', 'is_navigable', 'geometry']
    missing_cols = [c for c in required_cols if c not in df.columns]
    if missing_cols:
        errors.append(f"Missing required columns: {missing_cols}")
        return errors, warnings

    # Check for null values
    for col in required_cols:
        null_count = df[col].isnull().sum()
        if null_count > 0:
            errors.append(f"Column '{col}' has {null_count} null values")

    # Check ID uniqueness
    duplicate_ids = df['id'].duplicated().sum()
    if duplicate_ids > 0:
        errors.append(f"Found {duplicate_ids} duplicate IDs")

    # Check data types (the schema allows integer or string IDs)
    if not (pd.api.types.is_integer_dtype(df['id']) or pd.api.types.is_string_dtype(df['id'])):
        errors.append(f"Column 'id' should be integer or string, got {df['id'].dtype}")
    if not pd.api.types.is_bool_dtype(df['is_navigable']):
        errors.append(f"Column 'is_navigable' should be boolean, got {df['is_navigable'].dtype}")

    # Validate geometries
    linestring_pattern = r'^LINESTRING\s*\([^)]+\)$'
    invalid_geom = 0
    for idx, geom in df['geometry'].items():
        if not isinstance(geom, str):
            invalid_geom += 1
        elif not re.match(linestring_pattern, geom.strip(), re.IGNORECASE):
            invalid_geom += 1
    if invalid_geom > 0:
        errors.append(f"Found {invalid_geom} invalid geometries (must be WKT LINESTRING)")

    # Check for empty geometries (na=False guards against non-string values)
    empty_geom = df['geometry'].str.contains(r'LINESTRING\s*\(\s*\)', case=False,
                                             regex=True, na=False).sum()
    if empty_geom > 0:
        warnings.append(f"Found {empty_geom} empty geometries")

    # Summary
    print(f"\nValidation Results:")
    print(f"  Errors: {len(errors)}")
    print(f"  Warnings: {len(warnings)}")
    if errors:
        print("\nErrors:")
        for e in errors:
            print(f"  ❌ {e}")
    if warnings:
        print("\nWarnings:")
        for w in warnings:
            print(f"  ⚠️ {w}")
    if not errors:
        print("\n✅ File is valid for GEM!")
    return errors, warnings

# Usage
errors, warnings = validate_gem_data('my_road_data.parquet')

Common data quality issues

Issue 1: Invalid geometry format

Problem: Geometries not in WKT LineString format.

Solution:

from shapely import wkt
from shapely.geometry import LineString

def fix_geometry(geom):
    """Keep only geometries that parse as WKT LineStrings."""
    try:
        # If it's already a valid WKT string
        parsed = wkt.loads(geom)
        if isinstance(parsed, LineString):
            return geom
        return None  # Not a LineString
    except Exception:
        return None  # Not parseable as WKT

df['geometry'] = df['geometry'].apply(fix_geometry)
df = df.dropna(subset=['geometry'])

Issue 2: Duplicate IDs

Problem: Multiple records share the same ID.

Solution:

# Option 1: Keep first occurrence
df = df.drop_duplicates(subset=['id'], keep='first')
# Option 2: Regenerate IDs
df['id'] = range(1, len(df) + 1)

Issue 3: Mixed geometry types

Problem: Dataset contains Points, Polygons, etc. alongside LineStrings.

Solution:

# Filter to LineStrings only
df = df[df['geometry'].str.upper().str.startswith('LINESTRING')]
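Before filtering, it can help to see what you are about to drop. A small sketch (illustrative data, assuming a pandas DataFrame with a string geometry column) that tallies the leading WKT keyword:

```python
import pandas as pd

# Example frame with mixed geometry types (illustrative data)
df = pd.DataFrame({'geometry': [
    'LINESTRING (0 0, 1 1)',
    'POINT (2 2)',
    'LINESTRING (1 1, 2 2)',
    'POLYGON ((0 0, 1 0, 1 1, 0 0))',
]})

# Tally the leading WKT keyword to see what filtering would remove
type_counts = (
    df['geometry']
    .str.extract(r'^\s*([A-Za-z]+)', expand=False)
    .str.upper()
    .value_counts()
)
print(type_counts)
```

If most of the dataset turns out to be Points or Polygons, that usually points at the wrong source layer rather than a filtering problem.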

Issue 4: Coordinate system issues

Problem: Coordinates in wrong order or projection.

Solution:

import geopandas as gpd

# Read and reproject to WGS84 (longitude/latitude, EPSG:4326)
gdf = gpd.read_file('roads.shp')
gdf = gdf.to_crs('EPSG:4326')

# Extract WKT from the reprojected geometries into the GEM DataFrame
df['geometry'] = gdf.geometry.apply(lambda g: g.wkt)
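A cheap way to catch a wrong projection is to range-check the coordinates: in WGS84, longitudes fit in ±180 and latitudes in ±90, whereas projected coordinates (e.g. UTM metres) do not. A stdlib sketch (helper name is ours; note that swapped lon/lat pairs can still slip through when both values are small):

```python
import re

_NUM = re.compile(r'-?\d+(?:\.\d+)?')

def coords_in_wgs84_range(wkt_line):
    """Heuristic: every 'lon lat' pair must satisfy |lon| <= 180 and |lat| <= 90.
    Failures usually mean a projected CRS; swapped axes can still pass."""
    nums = [float(n) for n in _NUM.findall(wkt_line)]
    pairs = zip(nums[0::2], nums[1::2])
    return all(abs(lon) <= 180 and abs(lat) <= 90 for lon, lat in pairs)
```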

Best practices

Before uploading

  1. Start small: Test with a subset (1,000-10,000 records) before processing the full dataset
  2. Validate thoroughly: Run the validation script on every file
  3. Check file size: Large files may take longer to upload; plan accordingly
  4. Use descriptive filenames: city_roads_2024_v1.parquet, not data.parquet

Data quality tips

  1. Clean geometries: Remove self-intersections and invalid geometries
  2. Ensure connectivity: Connected road networks match better than isolated segments
  3. Include all segments: Don’t filter out small roads—they help with context
  4. Accurate navigability: Set is_navigable correctly for better matching
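The connectivity tip can be checked cheaply before uploading: treat each segment's endpoints as graph nodes and count connected components with union-find. A stdlib sketch (helper name is ours; endpoints are matched on coordinates rounded to 6 decimals, roughly 0.1 m in WGS84):

```python
import re

def count_components(wkt_lines):
    """Count connected components of a road network by shared endpoints."""
    num = re.compile(r'-?\d+(?:\.\d+)?')
    parent = {}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for line in wkt_lines:
        nums = [round(float(n), 6) for n in num.findall(line)]
        start = (nums[0], nums[1])
        end = (nums[-2], nums[-1])
        union(start, end)

    return len({find(p) for p in parent})
```

A large number of components relative to the segment count suggests disconnected fragments, which tend to match worse than a connected network.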

File naming conventions

Recommended naming pattern:

{region}_{data_type}_{date}_{version}.parquet

Examples:

  • netherlands_roads_20240115_v1.parquet
  • california_highways_20240120_v2.parquet
  • tokyo_streets_20240118_final.parquet

Sample data

Here’s a minimal sample file you can use for testing:

import pandas as pd

# Sample Amsterdam road segments
sample_data = {
    'id': [1, 2, 3, 4, 5],
    'is_navigable': [True, True, True, True, False],
    'geometry': [
        'LINESTRING (4.8952 52.3702, 4.8960 52.3710)',
        'LINESTRING (4.8960 52.3710, 4.8975 52.3725)',
        'LINESTRING (4.8975 52.3725, 4.8990 52.3740)',
        'LINESTRING (4.8990 52.3740, 4.9005 52.3755)',
        'LINESTRING (4.9005 52.3755, 4.9020 52.3770)'
    ]
}

df = pd.DataFrame(sample_data)
df.to_parquet('sample_gem_input.parquet', index=False)
print("Sample file created: sample_gem_input.parquet")

Output data schema

Field      | Type              | Description
id         | string or integer | Your original road segment ID
gers       | string            | Matched GERS ID (UUID format)
confidence | integer           | Match confidence score (0-100)
lr_id      | string            | Linear reference: coordinates and GERS ID
lr_gers    | string            | Linear reference: distance range and original ID

Example:

{"id":"abc","gers":"550e8400-e29b-41d4-a716-446655440000","confidence":99,"lr_id":"52.0197-76.36744#550e8400-e29b-41d4-a716-446655440000","lr_gers":"0.0-100.0#abc"}

Next steps

Once your data is prepared and validated: