Data Contracts
This document specifies the exact input data requirements for each source type consumed by UK_SSPA v2, the structure of all output artifacts, and the rules governing null handling, duplicate merging, and missing optional inputs.
See also:
docs/glossary.md for definitions of observation wells, control points, linesink, and coordinate space conventions.
1. Observation Wells Shapefile
Config key: data_sources.observation_wells
| Property | Requirement |
|---|---|
| Geometry type | Point. Non-Point geometries are accepted, but only their centroid is used and no warning is emitted; prefer true Point geometry |
| Required attribute column | The column named by water_level_col in config; must be numeric (float or int) |
| Null values | Rows with null/NaN geometry are assigned NaN coordinates; rows with null water_level_col values are passed through as NaN. Downstream kriging will fail if NaN values reach the model |
| CRS | Must be consistent with all other inputs. The tool does not reproject. |
| Minimum records | Zero records are accepted (empty arrays are returned), but kriging requires at least (number of drift terms + 1) points |
Config structure:
    "data_sources": {
        "observation_wells": {
            "path": "path/to/wells.shp",
            "water_level_col": "head_m"
        }
    }
Loaded by: load_observation_wells()
Returns: Three 1-D np.ndarray arrays (wx, wy, wh) — x coordinates, y coordinates, and head values — all in raw space.
Exceptions raised:
| Condition | Exception |
|---|---|
| path key missing from config | KeyError |
| File does not exist at path | FileNotFoundError |
| water_level_col key missing from config | KeyError |
| water_level_col not found in shapefile columns | KeyError |
| Length mismatch between geometry and value arrays | ValueError |
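Because NaN heads are passed through and empty inputs are accepted, it can help to screen the returned arrays before kriging. The following is a minimal pre-flight sketch, assuming plain Python sequences in place of the np.ndarray outputs; `screen_wells` and `n_drift_terms` are hypothetical names used for illustration and are not part of UK_SSPA.

```python
import math


def screen_wells(wx, wy, wh, n_drift_terms=0):
    """Drop wells with NaN coordinates or heads, then check the point count.

    Hypothetical helper: it illustrates the contract described above and
    is not part of the tool's API.
    """
    if not (len(wx) == len(wy) == len(wh)):
        raise ValueError("wx, wy, and wh must have the same length")

    kept = [
        (x, y, h)
        for x, y, h in zip(wx, wy, wh)
        if not (math.isnan(x) or math.isnan(y) or math.isnan(h))
    ]

    # Kriging needs at least (number of drift terms + 1) points.
    required = n_drift_terms + 1
    if len(kept) < required:
        raise ValueError(
            f"{len(kept)} usable wells, but at least {required} are required"
        )
    return kept


wells = screen_wells(
    [0.0, 10.0, float("nan")],   # third well has a null geometry
    [0.0, 5.0, 2.0],
    [12.5, float("nan"), 11.0],  # second well has a null head value
    n_drift_terms=0,
)
print(wells)  # [(0.0, 0.0, 12.5)]
```

Failing early with a clear message is a design choice of this sketch; the tool itself passes NaN values through and leaves the failure to the kriging step.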
2. Linesink River Shapefile (for AEM Drift and/or Control Points)
Config key: data_sources.linesink_river
| Property | Requirement |
|---|---|
| Geometry type | LineString or MultiLineString |
| CRS | Must match observation wells. The tool does not reproject. |
2.1 Required Columns (for AEM Drift)
| Column | Config Key | Type | Notes |
|---|---|---|---|
| Group identifier | group_column | string | Groups line segments into named linesink elements; all segments sharing the same value are summed into one drift term |
| Linesink strength | strength_col | numeric (float) | Hydraulic resistance or strength value per segment; default column name is "resistance" |
2.2 Optional Columns (for Control Points)
| Column | Config Key | Type | Notes |
|---|---|---|---|
| Start elevation | control_points.z_start_col | numeric (float) | Water surface elevation at the start vertex of each line feature |
| End elevation | control_points.z_end_col | numeric (float) | Water surface elevation at the end vertex of each line feature |
Null handling for control points: If either z_start_col or z_end_col is NaN for a given feature, that entire feature is skipped (no control points are generated from it). A warning is logged: "Skipped N line features due to missing Z-values (NaN)." If the columns are missing entirely from the shapefile, a warning is logged and all features are skipped.
Shapefile column name truncation: Shapefiles truncate column names to 10 characters. load_line_features() automatically checks both the full name and the first 10 characters when looking up z_start_col and z_end_col.
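The truncation-aware lookup can be pictured as follows. This is a sketch only: `resolve_column` is a hypothetical helper, the column names are made up, and the order in which load_line_features() tries the two candidates is an assumption.

```python
def resolve_column(name, columns, max_len=10):
    """Return the column matching `name` or its truncated form, else None.

    Assumption: the full name is tried first, then its first 10 characters.
    """
    if name in columns:
        return name
    truncated = name[:max_len]
    if truncated in columns:
        return truncated
    return None


# Hypothetical shapefile columns, already truncated to 10 characters.
columns = ["group", "resistance", "z_start_el", "z_end_elev"]

print(resolve_column("z_start_elevation", columns))  # z_start_el
print(resolve_column("z_end_elev", columns))         # z_end_elev
print(resolve_column("missing_col", columns))        # None
```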
2.3 Control Points Sub-Configuration
Control points are synthetic data points generated by sampling along line features at regular intervals. They are merged with observation wells before kriging.
| Config Key | Type | Default | Description |
|---|---|---|---|
| control_points.enabled | bool | true | If false, no control points are generated from this source (returns empty arrays) |
| control_points.spacing | float | 50.0 | Distance between sampled points along each line segment (in CRS units); must be > 0 |
| control_points.z_start_col | string | null | Column name for start elevation |
| control_points.z_end_col | string | null | Column name for end elevation |
| control_points.nugget_override | float | 0.0 | Nugget value assigned to all generated control points |
| control_points.avoid_vertices | bool | false | If true, sample points avoid the exact start/end vertices of each segment |
| control_points.perpendicular_offset | float | 0.0 | Offset distance applied perpendicular to the line direction at each sample point |
Elevation interpolation: The head value at each control point is linearly interpolated between z_start_col (at distance 0) and z_end_col (at the full segment length). This assumes line digitization direction matches the flow direction (start → end).
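As a worked illustration of the interpolation rule, the sketch below samples heads along a single line of known length. `sample_heads` is a hypothetical function; it assumes samples fall at 0, spacing, 2 × spacing, and so on up to the line length, and it ignores avoid_vertices and perpendicular_offset. The tool's actual sample positions may differ.

```python
def sample_heads(length, z_start, z_end, spacing):
    """Return (distance, head) pairs sampled every `spacing` units.

    Assumption: samples sit at 0, spacing, 2 * spacing, ... without
    exceeding `length`. Head varies linearly from z_start to z_end.
    """
    if spacing <= 0:
        raise ValueError("spacing must be > 0")
    if length <= 0:
        raise ValueError("length must be > 0")

    n_steps = int(length // spacing)  # number of full steps that fit
    samples = []
    for i in range(n_steps + 1):
        d = i * spacing
        h = z_start + (z_end - z_start) * d / length
        samples.append((d, h))
    return samples


samples = sample_heads(length=200.0, z_start=10.0, z_end=8.0, spacing=50.0)
print(samples)
# [(0.0, 10.0), (50.0, 9.5), (100.0, 9.0), (150.0, 8.5), (200.0, 8.0)]
```

If a line was digitized against the flow direction, swapping z_start and z_end in the attribute table gives the same result as reversing the geometry.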
Loaded by: load_line_features()
Returns: Four 1-D np.ndarray arrays (cp_x, cp_y, cp_h, cp_n) — x coordinates, y coordinates, head values, and per-point nugget values — all in raw space.
Exceptions raised:
| Condition | Exception |
|---|---|
| path key missing from config | KeyError |
| File does not exist at path | FileNotFoundError |
| control_points.spacing ≤ 0 | ValueError |
When control_points.enabled = false: Returns four empty arrays immediately without reading the file.
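The defaults and validation rules for the control_points block can be summarized in a few lines. This is a sketch under stated assumptions: `resolve_control_points` and `CONTROL_POINT_DEFAULTS` are hypothetical names, and checking enabled before validating spacing is an assumed ordering that the text above does not confirm.

```python
# Defaults copied from the sub-configuration table above.
CONTROL_POINT_DEFAULTS = {
    "enabled": True,
    "spacing": 50.0,
    "z_start_col": None,
    "z_end_col": None,
    "nugget_override": 0.0,
    "avoid_vertices": False,
    "perpendicular_offset": 0.0,
}


def resolve_control_points(source_cfg):
    """Merge user settings over the defaults and validate spacing."""
    merged = {**CONTROL_POINT_DEFAULTS, **source_cfg.get("control_points", {})}
    if not merged["enabled"]:
        # Assumption: disabled sources short-circuit before validation.
        return merged
    if merged["spacing"] <= 0:
        raise ValueError("control_points.spacing must be > 0")
    return merged


cfg = resolve_control_points(
    {"path": "path/to/rivers.shp",
     "control_points": {"spacing": 25.0, "z_start_col": "z_up"}}
)
print(cfg["spacing"], cfg["z_start_col"], cfg["nugget_override"])  # 25.0 z_up 0.0
```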
3. Data Preparation and Duplicate Handling
After loading, all sources are merged and cleaned by prepare_data().
3.1 Merging
Observation well arrays (wx, wy, wh) are concatenated with all control point source arrays (cx, cy, ch). The nugget array cp_n from control points is not passed to prepare_data() — it is handled separately in the pipeline.
3.2 Duplicate Removal
remove_duplicate_points() is called with min_dist = config["min_separation_distance"].
| Behavior | Detail |
|---|---|
| Algorithm | Pairwise distance clustering: for each unprocessed point, all points within min_dist are grouped into a cluster and replaced by their arithmetic mean (x, y, and h) |
| min_dist ≤ 0 | No merging is performed; all points are returned unchanged |
| min_dist not set | Defaults to 0.0 (no merging) |
| Empty input | Returns empty arrays immediately |
| Log output | "Removed duplicates: reduced N -> M points using min_dist=D" |
Important: Duplicate removal uses the raw space coordinates. Points that are co-located in raw space but represent different physical features (e.g., a well and a control point at the same location) will be merged into a single averaged point.
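The clustering rule can be sketched in pure Python as below. This is not the implementation of remove_duplicate_points(); `merge_close_points` is a hypothetical stand-in, and it assumes that "within min_dist" means a distance less than or equal to min_dist measured from the first unprocessed point of each cluster.

```python
import math


def merge_close_points(xs, ys, hs, min_dist):
    """Average every group of points lying within min_dist of a seed point."""
    if min_dist <= 0 or len(xs) == 0:
        return list(xs), list(ys), list(hs)

    used = [False] * len(xs)
    out_x, out_y, out_h = [], [], []
    for i in range(len(xs)):
        if used[i]:
            continue
        # Point i seeds a cluster; earlier points are already processed,
        # so only indices from i onward can still join.
        cluster = [
            j for j in range(i, len(xs))
            if not used[j]
            and math.hypot(xs[j] - xs[i], ys[j] - ys[i]) <= min_dist
        ]
        for j in cluster:
            used[j] = True
        n = len(cluster)
        out_x.append(sum(xs[j] for j in cluster) / n)
        out_y.append(sum(ys[j] for j in cluster) / n)
        out_h.append(sum(hs[j] for j in cluster) / n)
    return out_x, out_y, out_h


merged = merge_close_points(
    [0.0, 0.5, 10.0], [0.0, 0.0, 0.0], [10.0, 12.0, 20.0], min_dist=1.0
)
print(merged)  # ([0.25, 10.0], [0.0, 0.0], [11.0, 20.0])
```

In the example, a well at x = 0.0 and a control point at x = 0.5 collapse into one point with the averaged head 11.0, which is the behavior the note above warns about.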
3.3 What Happens When Optional Inputs Are Missing
| Scenario | Behavior |
|---|---|
| No linesink shapefile configured | control_points_list is empty; only observation wells are used |
| Linesink shapefile configured but control_points.enabled = false | Empty arrays returned; no control points added |
| Linesink shapefile has no valid features (all NaN elevations) | Warning logged; empty arrays returned; only observation wells used |
| Observation wells shapefile is empty (0 records) | Empty arrays returned; kriging will fail downstream |
4. Output Artifacts
4.1 Contour Shapefile
Triggered by: output.export_contours = true
Written by: export_contours()
| Property | Value |
|---|---|
| Geometry type | LineString (3D — Z coordinate = contour elevation value) |
| Output path | output.contour_output_path (default: "contours.shp") |
| Attribute columns | elevation (float) — the contour level value |
| CRS | Inherited from the observation wells shapefile CRS (passed through from the loaded GeoDataFrame) |
| Contour interval | output.contour_interval (float, must be > 0; default: 1.0) |
| Directory creation | Output directory is created automatically if it does not exist |
Null/empty behavior: If Z_grid contains only NaN values, or if no contour levels fall within the grid range, no file is written and a warning is logged.
Contour level alignment: Levels are aligned to multiples of contour_interval starting from floor(z_min / interval) * interval.
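The alignment rule is easiest to see with numbers. The sketch below computes aligned levels for a given data range; `contour_levels` is a hypothetical helper, and stopping at the last multiple not exceeding z_max is an assumption. A first level that lies below z_min simply produces no contour line.

```python
import math


def contour_levels(z_min, z_max, interval):
    """Return multiples of `interval` from floor(z_min / interval) * interval
    up to the last multiple that does not exceed z_max."""
    if interval <= 0:
        raise ValueError("contour_interval must be > 0")
    first = math.floor(z_min / interval)
    last = math.floor(z_max / interval)
    # Multiplying integer indices avoids accumulating floating-point error.
    return [k * interval for k in range(first, last + 1)]


print(contour_levels(2.3, 5.1, 1.0))   # [2.0, 3.0, 4.0, 5.0]
print(contour_levels(-0.4, 1.2, 0.5))  # [-0.5, 0.0, 0.5, 1.0]
```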
4.2 Auxiliary Points Shapefile
Triggered by: output.export_points = true
Written by: export_aux_points()
| Property | Value |
|---|---|
| Geometry type | Point |
| Output path | output.points_output_path (default: "observation_points.shp") |
| Attribute columns | x (float), y (float), h (float) — coordinates and head values of all merged data points (observation wells + control points after duplicate removal) |
| CRS | Inherited from the observation wells shapefile CRS |
4.3 Map Figure
Triggered by: output.generate_map = true
| Property | Value |
|---|---|
| Type | matplotlib figure |
| Content | Two subplots: kriged head surface (left) and kriging standard deviation surface (right) |
| Observation wells | Plotted as scatter points on the head surface |
| Control points | Plotted as scatter points (distinct marker) if present |
| Display/save | Displayed interactively (no save path is configurable in the current implementation) |
4.4 Water-Level GeoTIFF Raster
Triggered by: output.export_water_level_tif = true
Written by: export_water_level_tif()
| Property | Value |
|---|---|
| Format | GeoTIFF-compatible raster (.tif) |
| Output path | output.water_level_tif_output_path (default: "output/water_levels.tif") |
| Cell values | Kriged water levels from Z_grid |
| Grid orientation | Uses raw-space prediction grid; rows are written north-to-south |
| NoData handling | NaN values are written as -9999 |
| Dependencies | Requires rasterio |
| Directory creation | Output directory is created automatically if it does not exist |
4.5 Water-Level ASCII Grid Raster
Triggered by: output.export_water_level_asc = true
Written by: export_water_level_ascii_grid()
| Property | Value |
|---|---|
| Format | Arc/Info ASCII Grid (.asc) |
| Output path | output.water_level_asc_output_path (default: "output/water_levels.asc") |
| Cell values | Kriged water levels from Z_grid |
| Header fields | ncols, nrows, xllcorner, yllcorner, cellsize, NODATA_value |
| Grid orientation | Uses raw-space prediction grid; rows are written north-to-south |
| NoData handling | NaN values are written as -9999 |
| Grid requirement | Grid must be regular and have square cells |
| Directory creation | Output directory is created automatically if it does not exist |
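To make the header fields, row order, and NoData rule concrete, here is a minimal sketch of the file layout. It is not export_water_level_ascii_grid(): `format_ascii_grid` is a hypothetical function, and it assumes the in-memory grid stores its southernmost row first, so the rows must be reversed on output.

```python
import math


def format_ascii_grid(rows_south_to_north, xllcorner, yllcorner, cellsize,
                      nodata=-9999):
    """Return the text of an Arc/Info ASCII grid for a regular square grid."""
    nrows = len(rows_south_to_north)
    ncols = len(rows_south_to_north[0]) if nrows else 0
    if any(len(row) != ncols for row in rows_south_to_north):
        raise ValueError("grid must be rectangular")

    lines = [
        f"ncols {ncols}",
        f"nrows {nrows}",
        f"xllcorner {xllcorner}",
        f"yllcorner {yllcorner}",
        f"cellsize {cellsize}",
        f"NODATA_value {nodata}",
    ]
    # The format lists the northernmost row first.
    for row in reversed(rows_south_to_north):
        cells = [str(nodata) if math.isnan(v) else repr(float(v)) for v in row]
        lines.append(" ".join(cells))
    return "\n".join(lines) + "\n"


text = format_ascii_grid(
    [[1.0, 2.0], [3.0, float("nan")]],  # south row first (assumed layout)
    xllcorner=0.0, yllcorner=0.0, cellsize=10.0,
)
print(text)
```

The last two lines of the output are `3.0 -9999` and `1.0 2.0`: the northern row comes first and the NaN cell is written as the NoData value.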
5. Grid Definition
The prediction grid is defined in raw space using the following config keys:
| Config Key | Type | Description |
|---|---|---|
| grid.x_min | float | Minimum X coordinate of the grid extent |
| grid.x_max | float | Maximum X coordinate of the grid extent |
| grid.y_min | float | Minimum Y coordinate of the grid extent |
| grid.y_max | float | Maximum Y coordinate of the grid extent |
| grid.resolution | float | Grid cell size (spacing between prediction points) in CRS units |
Coordinate space note: The grid is defined in raw space. Internally, grid coordinates are transformed to model space for kriging prediction (when anisotropy is enabled), then results are returned on the original raw-space grid. See docs/glossary.md for the raw vs. model space distinction.
Validation: x_min < x_max and y_min < y_max are enforced; violation raises ValueError.
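The extent checks and the resulting prediction coordinates might look like the sketch below. `build_axes` is a hypothetical helper. Placing nodes from the minimum in steps of resolution without passing the maximum, and rejecting a non-positive resolution, are assumptions that the text above does not state.

```python
import math


def build_axes(grid_cfg):
    """Validate the grid extent and return (xs, ys) node coordinates."""
    x_min, x_max = grid_cfg["x_min"], grid_cfg["x_max"]
    y_min, y_max = grid_cfg["y_min"], grid_cfg["y_max"]
    res = grid_cfg["resolution"]

    if not x_min < x_max:
        raise ValueError("grid.x_min must be less than grid.x_max")
    if not y_min < y_max:
        raise ValueError("grid.y_min must be less than grid.y_max")
    if res <= 0:
        # Assumption: this check is not documented for the tool itself.
        raise ValueError("grid.resolution must be > 0")

    # Small tolerance so that e.g. 1.0 / 0.1 is not rounded down to 9 steps.
    nx = int(math.floor((x_max - x_min) / res + 1e-9)) + 1
    ny = int(math.floor((y_max - y_min) / res + 1e-9)) + 1
    xs = [x_min + i * res for i in range(nx)]
    ys = [y_min + j * res for j in range(ny)]
    return xs, ys


xs, ys = build_axes(
    {"x_min": 0.0, "x_max": 100.0, "y_min": 0.0, "y_max": 50.0,
     "resolution": 25.0}
)
print(xs)  # [0.0, 25.0, 50.0, 75.0, 100.0]
print(ys)  # [0.0, 25.0, 50.0]
```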
6. CRS Consistency Requirements
The tool does not perform any coordinate reprojection. All input shapefiles must share the same Coordinate Reference System (CRS):
- Observation wells shapefile
- Linesink river shapefile (if used)
- Any additional line feature sources
The grid extent (grid.x_min, etc.) must also be specified in the same CRS.
If CRS values differ between inputs, no error is raised by the tool — results will be silently incorrect. It is the user's responsibility to ensure CRS consistency before running the pipeline.
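Since the tool performs no CRS check, a small guard run before the pipeline can catch mismatches early. The sketch below compares CRS identifiers supplied by the caller; `check_crs_consistency` is a hypothetical helper and the source names are illustrative. In practice the identifiers could be the string form of each loaded GeoDataFrame's crs attribute.

```python
def check_crs_consistency(crs_by_source):
    """Raise ValueError unless every named source reports the same CRS.

    `crs_by_source` maps a label (e.g. "observation_wells") to a CRS
    identifier such as "EPSG:27700". A missing CRS (None) is rejected.
    """
    missing = sorted(name for name, crs in crs_by_source.items() if crs is None)
    if missing:
        raise ValueError(f"No CRS defined for: {', '.join(missing)}")

    if len(set(crs_by_source.values())) > 1:
        details = ", ".join(
            f"{name}={crs}" for name, crs in sorted(crs_by_source.items())
        )
        raise ValueError(f"Inconsistent CRS across inputs: {details}")


# Passes silently: both inputs share one CRS.
check_crs_consistency(
    {"observation_wells": "EPSG:27700", "linesink_river": "EPSG:27700"}
)
```

Comparing identifier strings is deliberately strict: two different spellings of the same CRS would be flagged, which errs on the side of prompting a manual check.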