Data Contracts

This document specifies the exact input data requirements for each source type consumed by UK_SSPA v2, the structure of all output artifacts, and the rules governing null handling, duplicate merging, and missing optional inputs.

See also: docs/glossary.md for definitions of observation wells, control points, linesink, and coordinate space conventions.


1. Observation Wells Shapefile

Config key: data_sources.observation_wells

| Property | Requirement |
| --- | --- |
| Geometry type | Point. Non-Point geometries are accepted, but their centroid is used and no warning is emitted; prefer true Point geometry. |
| Required attribute column | The column named by water_level_col in config; must be numeric (float or int). |
| Null values | Rows with null/NaN geometry are assigned NaN coordinates; rows with null water_level_col values are passed through as NaN. Downstream kriging will fail if NaN values reach the model. |
| CRS | Must be consistent with all other inputs. The tool does not reproject. |
| Minimum records | Zero records are accepted (empty arrays are returned); however, kriging requires at least as many points as the number of drift terms plus one. |

Config structure:

"data_sources": {
  "observation_wells": {
    "path": "path/to/wells.shp",
    "water_level_col": "head_m"
  }
}

Loaded by: load_observation_wells()

Returns: Three 1-D np.ndarray arrays (wx, wy, wh) — x coordinates, y coordinates, and head values — all in raw space.

Exceptions raised:

| Condition | Exception |
| --- | --- |
| path key missing from config | KeyError |
| File does not exist at path | FileNotFoundError |
| water_level_col key missing from config | KeyError |
| water_level_col not found in shapefile columns | KeyError |
| Length mismatch between geometry and value arrays | ValueError |
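
Because NaN coordinates and NaN head values are passed through by the loader, a caller may want to filter them before kriging. The following is a minimal sketch of such a pre-check, not part of UK_SSPA itself; the function names drop_invalid_wells and check_minimum_points are hypothetical, and plain Python lists stand in for the np.ndarray values returned by load_observation_wells().

```python
import math


def drop_invalid_wells(wx, wy, wh):
    """Keep only records whose x, y, and head are all finite numbers."""
    kept = [
        (x, y, h)
        for x, y, h in zip(wx, wy, wh)
        if all(math.isfinite(v) for v in (x, y, h))
    ]
    if not kept:
        return [], [], []
    xs, ys, hs = zip(*kept)
    return list(xs), list(ys), list(hs)


def check_minimum_points(n_points, n_drift_terms):
    """Raise ValueError if there are fewer points than drift terms + 1."""
    required = n_drift_terms + 1
    if n_points < required:
        raise ValueError(
            f"kriging needs at least {required} points, got {n_points}"
        )
```

Running both checks right after loading turns a late kriging failure into an early, descriptive error.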

2. Linesink River Shapefile (for AEM Drift and/or Control Points)

Config key: data_sources.linesink_river

| Property | Requirement |
| --- | --- |
| Geometry type | LineString or MultiLineString |
| CRS | Must match observation wells. The tool does not reproject. |

2.1 Required Columns (for AEM Drift)

| Column | Config Key | Type | Notes |
| --- | --- | --- | --- |
| Group identifier | group_column | string | Groups line segments into named linesink elements; all segments sharing the same value are summed into one drift term |
| Linesink strength | strength_col | numeric (float) | Hydraulic resistance or strength value per segment; default column name is "resistance" |

2.2 Optional Columns (for Control Points)

| Column | Config Key | Type | Notes |
| --- | --- | --- | --- |
| Start elevation | control_points.z_start_col | numeric (float) | Water surface elevation at the start vertex of each line feature |
| End elevation | control_points.z_end_col | numeric (float) | Water surface elevation at the end vertex of each line feature |

Null handling for control points: If either z_start_col or z_end_col is NaN for a given feature, that entire feature is skipped (no control points are generated from it). A warning is logged: "Skipped N line features due to missing Z-values (NaN)." If the columns are missing entirely from the shapefile, a warning is logged and all features are skipped.

Shapefile column name truncation: Shapefiles truncate column names to 10 characters. load_line_features() automatically checks both the full name and the first 10 characters when looking up z_start_col and z_end_col.
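
The lookup rule above can be sketched as follows. This is an illustrative helper, not the code inside load_line_features(); the name resolve_column and its return convention (None when nothing matches) are assumptions.

```python
def resolve_column(columns, configured_name, max_len=10):
    """Return the matching column name, or None if neither form is present.

    Tries the full configured name first, then its first `max_len`
    characters (the shapefile field-name limit).
    """
    if configured_name in columns:
        return configured_name
    truncated = configured_name[:max_len]
    if truncated in columns:
        return truncated
    return None
```

For example, a config value of "elev_start_m" would match a shapefile field stored as "elev_start".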

2.3 Control Points Sub-Configuration

Control points are synthetic data points generated by sampling along line features at regular intervals. They are merged with observation wells before kriging.

| Config Key | Type | Default | Description |
| --- | --- | --- | --- |
| control_points.enabled | bool | true | If false, no control points are generated from this source (returns empty arrays) |
| control_points.spacing | float | 50.0 | Distance between sampled points along each line segment (in CRS units); must be > 0 |
| control_points.z_start_col | string | null | Column name for start elevation |
| control_points.z_end_col | string | null | Column name for end elevation |
| control_points.nugget_override | float | 0.0 | Nugget value assigned to all generated control points |
| control_points.avoid_vertices | bool | false | If true, sample points avoid the exact start/end vertices of each segment |
| control_points.perpendicular_offset | float | 0.0 | Offset distance applied perpendicular to the line direction at each sample point |

Elevation interpolation: The head value at each control point is linearly interpolated between z_start_col (at distance 0) and z_end_col (at the full segment length). This assumes line digitization direction matches the flow direction (start → end).
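
A minimal sketch of this interpolation, working only with distance along the line: the function name interpolate_heads is hypothetical, and the sampling positions (start vertex included, then every spacing units) are an assumption, since the exact placement used by load_line_features() also depends on avoid_vertices and perpendicular_offset.

```python
def interpolate_heads(length, z_start, z_end, spacing):
    """Return (distance, head) pairs sampled every `spacing` units along a line.

    Head varies linearly from z_start at distance 0 to z_end at `length`.
    """
    if spacing <= 0:
        raise ValueError("spacing must be > 0")
    samples = []
    n_steps = int(length // spacing)
    for i in range(n_steps + 1):
        d = i * spacing
        # Fraction of the way along the line; guard against zero-length lines.
        frac = d / length if length > 0 else 0.0
        samples.append((d, z_start + frac * (z_end - z_start)))
    return samples
```

With a 100-unit line falling from 10.0 to 8.0 and a spacing of 50.0, the midpoint sample receives a head of 9.0. If a line was digitized against the flow direction, swapping z_start and z_end in the attribute table gives the same heads at the same locations.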

Loaded by: load_line_features()

Returns: Four 1-D np.ndarray arrays (cp_x, cp_y, cp_h, cp_n) — x coordinates, y coordinates, head values, and per-point nugget values — all in raw space.

Exceptions raised:

| Condition | Exception |
| --- | --- |
| path key missing from config | KeyError |
| File does not exist at path | FileNotFoundError |
| control_points.spacing ≤ 0 | ValueError |

When control_points.enabled = false: Returns four empty arrays immediately without reading the file.


3. Data Preparation and Duplicate Handling

After loading, all sources are merged and cleaned by prepare_data().

3.1 Merging

Observation well arrays (wx, wy, wh) are concatenated with all control point source arrays (cx, cy, ch). The nugget array cp_n from control points is not passed to prepare_data() — it is handled separately in the pipeline.

3.2 Duplicate Removal

remove_duplicate_points() is called with min_dist = config["min_separation_distance"].

| Behavior | Detail |
| --- | --- |
| Algorithm | Pairwise distance clustering: for each unprocessed point, all points within min_dist are grouped into a cluster and replaced by their arithmetic mean (x, y, and h) |
| min_dist ≤ 0 | No merging is performed; all points are returned unchanged |
| min_dist not set | Defaults to 0.0 (no merging) |
| Empty input | Returns empty arrays immediately |
| Log output | "Removed duplicates: reduced N -> M points using min_dist=D" |

Important: Duplicate removal uses raw-space coordinates. Points that lie within min_dist of each other in raw space but represent different physical features (e.g., a well and a control point at the same location) will be merged into a single averaged point.
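
The clustering rule described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the source of remove_duplicate_points(): the name merge_close_points is hypothetical, "within min_dist" is taken to mean distance <= min_dist, and clusters are seeded in input order.

```python
import math


def merge_close_points(xs, ys, hs, min_dist=0.0):
    """Merge points closer than min_dist into their arithmetic mean."""
    n = len(xs)
    if n == 0 or min_dist <= 0:
        # Nothing to merge: return copies of the inputs unchanged.
        return list(xs), list(ys), list(hs)

    used = [False] * n
    out_x, out_y, out_h = [], [], []
    for i in range(n):
        if used[i]:
            continue
        # Gather every unprocessed point within min_dist of the seed point i.
        cluster = [
            j for j in range(i, n)
            if not used[j]
            and math.hypot(xs[j] - xs[i], ys[j] - ys[i]) <= min_dist
        ]
        for j in cluster:
            used[j] = True
        k = len(cluster)
        out_x.append(sum(xs[j] for j in cluster) / k)
        out_y.append(sum(ys[j] for j in cluster) / k)
        out_h.append(sum(hs[j] for j in cluster) / k)
    return out_x, out_y, out_h
```

Note that the averaged head carries no record of which source each point came from, which is why a well and a nearby control point can silently blend.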

3.3 What Happens When Optional Inputs Are Missing

| Scenario | Behavior |
| --- | --- |
| No linesink shapefile configured | control_points_list is empty; only observation wells are used |
| Linesink shapefile configured but control_points.enabled = false | Empty arrays returned; no control points added |
| Linesink shapefile has no valid features (all NaN elevations) | Warning logged; empty arrays returned; only observation wells used |
| Observation wells shapefile is empty (0 records) | Empty arrays returned; kriging will fail downstream |

4. Output Artifacts

4.1 Contour Shapefile

Triggered by: output.export_contours = true

Written by: export_contours()

| Property | Value |
| --- | --- |
| Geometry type | LineString (3D; Z coordinate = contour elevation value) |
| Output path | output.contour_output_path (default: "contours.shp") |
| Attribute columns | elevation (float): the contour level value |
| CRS | Inherited from the observation wells shapefile CRS (passed through from the loaded GeoDataFrame) |
| Contour interval | output.contour_interval (float, must be > 0; default: 1.0) |
| Directory creation | Output directory is created automatically if it does not exist |

Null/empty behavior: If Z_grid contains only NaN values, or if no contour levels fall within the grid range, no file is written and a warning is logged.

Contour level alignment: Levels are aligned to multiples of contour_interval starting from floor(z_min / interval) * interval.
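
As a worked illustration of the alignment rule, the sketch below generates levels from floor(z_min / interval) * interval upward. The function name contour_levels and the stopping condition (levels up to and including z_max) are assumptions; export_contours() may trim the list differently.

```python
import math


def contour_levels(z_min, z_max, interval):
    """Return contour levels aligned to multiples of `interval`."""
    if interval <= 0:
        raise ValueError("contour interval must be > 0")
    first = math.floor(z_min / interval) * interval
    levels = []
    k = 0
    # Multiply rather than accumulate to avoid compounding float error.
    while first + k * interval <= z_max:
        levels.append(first + k * interval)
        k += 1
    return levels
```

For a grid ranging from 10.3 to 12.1 with an interval of 0.5, the first level is 10.0. It lies just below the grid minimum, so it would produce no contour line, but it keeps every level on a round multiple of the interval.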

4.2 Auxiliary Points Shapefile

Triggered by: output.export_points = true

Written by: export_aux_points()

| Property | Value |
| --- | --- |
| Geometry type | Point |
| Output path | output.points_output_path (default: "observation_points.shp") |
| Attribute columns | x (float), y (float), h (float): coordinates and head values of all merged data points (observation wells + control points after duplicate removal) |
| CRS | Inherited from the observation wells shapefile CRS |

4.3 Map Figure

Triggered by: output.generate_map = true

| Property | Value |
| --- | --- |
| Type | matplotlib figure |
| Content | Two subplots: kriged head surface (left) and kriging standard deviation surface (right) |
| Observation wells | Plotted as scatter points on the head surface |
| Control points | Plotted as scatter points (distinct marker) if present |
| Display/save | Displayed interactively (no save path is configurable in the current implementation) |

4.4 Water-Level GeoTIFF Raster

Triggered by: output.export_water_level_tif = true

Written by: export_water_level_tif()

| Property | Value |
| --- | --- |
| Format | GeoTIFF-compatible raster (.tif) |
| Output path | output.water_level_tif_output_path (default: "output/water_levels.tif") |
| Cell values | Kriged water levels from Z_grid |
| Grid orientation | Uses raw-space prediction grid; rows are written north-to-south |
| NoData handling | NaN values are written as -9999 |
| Dependencies | Requires rasterio |
| Directory creation | Output directory is created automatically if it does not exist |

4.5 Water-Level ASCII Grid Raster

Triggered by: output.export_water_level_asc = true

Written by: export_water_level_ascii_grid()

| Property | Value |
| --- | --- |
| Format | Arc/Info ASCII Grid (.asc) |
| Output path | output.water_level_asc_output_path (default: "output/water_levels.asc") |
| Cell values | Kriged water levels from Z_grid |
| Header fields | ncols, nrows, xllcorner, yllcorner, cellsize, NODATA_value |
| Grid orientation | Uses raw-space prediction grid; rows are written north-to-south |
| NoData handling | NaN values are written as -9999 |
| Grid requirement | Grid must be regular and have square cells |
| Directory creation | Output directory is created automatically if it does not exist |
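
To make the file layout concrete, the sketch below renders a small grid in the ASCII Grid format with the header fields, row order, and NoData value listed above. It is not the body of export_water_level_ascii_grid(): the name ascii_grid_text is hypothetical, the input is assumed to be stored south-to-north (row 0 at y_min), and xllcorner/yllcorner are passed in directly because how the tool derives them from the grid nodes is not specified here.

```python
import math


def ascii_grid_text(rows_south_to_north, xllcorner, yllcorner, cellsize, nodata=-9999):
    """Render a regular, square-cell grid as Arc/Info ASCII Grid text."""
    nrows = len(rows_south_to_north)
    ncols = len(rows_south_to_north[0]) if nrows else 0
    lines = [
        f"ncols {ncols}",
        f"nrows {nrows}",
        f"xllcorner {xllcorner}",
        f"yllcorner {yllcorner}",
        f"cellsize {cellsize}",
        f"NODATA_value {nodata}",
    ]
    # The format lists the northernmost row first, so reverse the input order.
    for row in reversed(rows_south_to_north):
        cells = [str(nodata) if math.isnan(v) else str(v) for v in row]
        lines.append(" ".join(cells))
    return "\n".join(lines) + "\n"
```

Reading the result back in a GIS and checking that a known high-head area appears where expected is a quick way to confirm the row order.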

5. Grid Definition

The prediction grid is defined in raw space using the following config keys:

| Config Key | Type | Description |
| --- | --- | --- |
| grid.x_min | float | Minimum X coordinate of the grid extent |
| grid.x_max | float | Maximum X coordinate of the grid extent |
| grid.y_min | float | Minimum Y coordinate of the grid extent |
| grid.y_max | float | Maximum Y coordinate of the grid extent |
| grid.resolution | float | Grid cell size (spacing between prediction points) in CRS units |

Coordinate space note: The grid is defined in raw space. Internally, grid coordinates are transformed to model space for kriging prediction (when anisotropy is enabled), then results are returned on the original raw-space grid. See docs/glossary.md for the raw vs. model space distinction.

Validation: x_min < x_max and y_min < y_max are enforced; violation raises ValueError.
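
A rough sketch of how one axis of such a grid could be validated and built is shown below. The name grid_axis is hypothetical, the check that resolution is positive is an added assumption (only the min < max check is documented), and whether the tool includes the maximum coordinate as a node is not stated here.

```python
def grid_axis(v_min, v_max, resolution):
    """Return node coordinates from v_min toward v_max at `resolution` spacing."""
    if not v_min < v_max:
        raise ValueError("grid minimum must be strictly less than grid maximum")
    if resolution <= 0:
        # Assumption: not a documented check, but a non-positive step is unusable.
        raise ValueError("grid resolution must be > 0")
    n_nodes = int((v_max - v_min) / resolution) + 1
    return [v_min + i * resolution for i in range(n_nodes)]
```

Calling it once with the x keys and once with the y keys gives the two axes of the raw-space prediction grid; an extent that is not an exact multiple of the resolution simply stops short of the maximum.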


6. CRS Consistency Requirements

The tool does not perform any coordinate reprojection. All input shapefiles must share the same Coordinate Reference System (CRS):

  • Observation wells shapefile
  • Linesink river shapefile (if used)
  • Any additional line feature sources

The grid extent (grid.x_min, etc.) must also be specified in the same CRS.

If CRS values differ between inputs, the tool raises no error and the results will be silently incorrect. It is the user's responsibility to ensure CRS consistency before running the pipeline.
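
Since the tool performs no such check, a small guard can be run beforehand. The sketch below is a hypothetical helper (assert_same_crs is not part of UK_SSPA) that compares CRS values by their string form; with geopandas, each value could be taken from the crs attribute of the GeoDataFrame returned by geopandas.read_file().

```python
def assert_same_crs(crs_by_source):
    """Raise ValueError unless every source reports the same CRS.

    `crs_by_source` maps a label (e.g. "observation_wells") to a CRS value.
    Values are compared by string form, so two equivalent definitions that
    are written differently are reported as a mismatch (a conservative choice).
    """
    groups = {}
    for name, crs in crs_by_source.items():
        groups.setdefault(str(crs), []).append(name)
    if len(groups) > 1:
        detail = "; ".join(f"{crs}: {names}" for crs, names in groups.items())
        raise ValueError(f"inputs do not share one CRS ({detail})")
```

A source with no CRS defined shows up as "None" in the error message, which is usually a sign of a missing .prj file.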