Data Contracts

This document specifies the exact input data requirements for each source type consumed by UK_SSPA v2, the structure of all output artifacts, and the rules governing null handling, duplicate merging, and missing optional inputs.

See also: docs/glossary.md for definitions of observation wells, control points, linesink, and coordinate space conventions.

1. Observation Wells Shapefile

Config key: data_sources.observation_wells

Property	Requirement
Geometry type	Point (non-Point geometries are accepted, but their centroid is used and no warning is emitted; prefer true Point geometry)
Required attribute column	The column named by `water_level_col` in config — must be numeric (float or int)
Null values	Rows with null/NaN geometry are assigned `NaN` coordinates, and rows with null `water_level_col` values are passed through as `NaN`. Downstream kriging will fail if any `NaN` value reaches the model
CRS	Must be consistent with all other inputs. The tool does not reproject.
Minimum records	0 records are accepted (empty arrays are returned); however, kriging requires at least (number of drift terms + 1) points

Config structure:

"data_sources": {
  "observation_wells": {
    "path": "path/to/wells.shp",
    "water_level_col": "head_m"
  }
}

Loaded by: load_observation_wells()

Returns: Three 1-D np.ndarray objects (wx, wy, wh): x coordinates, y coordinates, and head values, all in raw space.

Exceptions raised:

Condition	Exception
`path` key missing from config	`KeyError`
File does not exist at `path`	`FileNotFoundError`
`water_level_col` key missing from config	`KeyError`
`water_level_col` not found in shapefile columns	`KeyError`
Length mismatch between geometry and value arrays	`ValueError`
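
Because the loader passes NaN coordinates and NaN head values through unchanged, a caller may want to clean the returned arrays before kriging. The following is a minimal sketch of such a pre-check, not part of the tool itself: `drop_nan_wells` and `has_enough_points` are hypothetical helper names, and the count rule simply restates the "drift terms + 1" requirement from the table above.

```python
import math


def drop_nan_wells(wx, wy, wh):
    """Keep only wells whose x, y and head are all finite numbers.

    Hypothetical helper: works on any sequences of equal length,
    including the 1-D arrays returned by the loader.
    """
    kept = [
        (x, y, h)
        for x, y, h in zip(wx, wy, wh)
        if all(math.isfinite(v) for v in (x, y, h))
    ]
    if not kept:
        return [], [], []
    xs, ys, hs = (list(col) for col in zip(*kept))
    return xs, ys, hs


def has_enough_points(n_points, n_drift_terms):
    """Restates the documented minimum: number of drift terms + 1."""
    return n_points >= n_drift_terms + 1
```

Running both checks right after loading turns a late kriging failure into an early, readable one.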

2. Linesink River Shapefile (for AEM Drift and/or Control Points)

Config key: data_sources.linesink_river

Property	Requirement
Geometry type	LineString or MultiLineString
CRS	Must match observation wells. The tool does not reproject.

2.1 Required Columns (for AEM Drift)

Column	Config Key	Type	Notes
Group identifier	`group_column`	string	Groups line segments into named linesink elements; all segments sharing the same value are summed into one drift term
Linesink strength	`strength_col`	numeric (float)	Hydraulic resistance or strength value per segment; default column name is `"resistance"`

2.2 Optional Columns (for Control Points)

Column	Config Key	Type	Notes
Start elevation	`control_points.z_start_col`	numeric (float)	Water surface elevation at the start vertex of each line feature
End elevation	`control_points.z_end_col`	numeric (float)	Water surface elevation at the end vertex of each line feature

Null handling for control points: If either z_start_col or z_end_col is NaN for a given feature, that entire feature is skipped (no control points are generated from it). A warning is logged: "Skipped N line features due to missing Z-values (NaN)." If the columns are missing entirely from the shapefile, a warning is logged and all features are skipped.

Shapefile column name truncation: Shapefiles truncate column names to 10 characters. load_line_features() automatically checks both the full name and the first 10 characters when looking up z_start_col and z_end_col.
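
The lookup described above can be sketched as follows. This is an illustration of the rule only, assuming the full name is tried first and the 10-character prefix second; `resolve_column` is a hypothetical name, not the function used inside load_line_features().

```python
def resolve_column(name, columns, max_len=10):
    """Return the column that matches `name`, or None if there is none.

    Tries the full configured name first, then its first `max_len`
    characters (shapefile attribute names are limited to 10 characters).
    """
    if name is None:
        return None
    if name in columns:
        return name
    truncated = name[:max_len]
    if truncated in columns:
        return truncated
    return None
```

For example, a configured name of "z_start_elev" would resolve to the stored column "z_start_el".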

2.3 Control Points Sub-Configuration

Control points are synthetic data points generated by sampling along line features at regular intervals. They are merged with observation wells before kriging.

Config Key	Type	Default	Description
`control_points.enabled`	bool	`true`	If `false`, no control points are generated from this source (returns empty arrays)
`control_points.spacing`	float	`50.0`	Distance between sampled points along each line segment (in CRS units); must be > 0
`control_points.z_start_col`	string	`null`	Column name for start elevation
`control_points.z_end_col`	string	`null`	Column name for end elevation
`control_points.nugget_override`	float	`0.0`	Nugget value assigned to all generated control points
`control_points.avoid_vertices`	bool	`false`	If `true`, sample points avoid the exact start/end vertices of each segment
`control_points.perpendicular_offset`	float	`0.0`	Offset distance applied perpendicular to the line direction at each sample point
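
For orientation, the keys in this section might be combined as shown below. This is a sketch written as a Python dict rather than a verified config file: the nesting under data_sources.linesink_river is inferred from the config keys listed above, and the path and the column names "river_name", "z_start", and "z_end" are hypothetical placeholders.

```python
config = {
    "data_sources": {
        "linesink_river": {
            "path": "path/to/river.shp",        # hypothetical path
            "group_column": "river_name",       # hypothetical column name
            "strength_col": "resistance",       # documented default column name
            "control_points": {
                "enabled": True,
                "spacing": 50.0,                # CRS units, must be > 0
                "z_start_col": "z_start",       # hypothetical column name
                "z_end_col": "z_end",           # hypothetical column name
                "nugget_override": 0.0,
                "avoid_vertices": False,
                "perpendicular_offset": 0.0,
            },
        }
    }
}
```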

Elevation interpolation: The head value at each control point is linearly interpolated between z_start_col (at distance 0) and z_end_col (at the full segment length). This assumes line digitization direction matches the flow direction (start → end).
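
The interpolation rule can be illustrated with a short sketch. It assumes samples are placed at distances 0, spacing, 2 × spacing, and so on up to the segment length; the real sampler may place points differently (for example when avoid_vertices or perpendicular_offset is set). `sample_heads` is a hypothetical name.

```python
def sample_heads(length, z_start, z_end, spacing):
    """Return (distance, head) pairs along one segment.

    Head is linearly interpolated from z_start at distance 0
    to z_end at distance `length`.
    """
    if spacing <= 0:
        raise ValueError("spacing must be > 0")
    if length <= 0:
        return []
    n_steps = int(length // spacing)  # assumed placement: multiples of spacing
    samples = []
    for i in range(n_steps + 1):
        distance = i * spacing
        fraction = distance / length
        samples.append((distance, z_start + fraction * (z_end - z_start)))
    return samples
```

With a 100-unit segment falling from 10.0 to 8.0 and a spacing of 50, the midpoint sample receives a head of 9.0. If the line was digitized against the flow direction, the same rule assigns the elevations in reverse.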

Loaded by: load_line_features()

Returns: Four 1-D np.ndarray objects (cp_x, cp_y, cp_h, cp_n): x coordinates, y coordinates, head values, and per-point nugget values, all in raw space.

Exceptions raised:

Condition	Exception
`path` key missing from config	`KeyError`
File does not exist at `path`	`FileNotFoundError`
`control_points.spacing` ≤ 0	`ValueError`

When control_points.enabled = false: Returns four empty arrays immediately without reading the file.

3. Data Preparation and Duplicate Handling

After loading, all sources are merged and cleaned by prepare_data().

3.1 Merging

Observation well arrays (wx, wy, wh) are concatenated with all control point source arrays (cx, cy, ch). The nugget array cp_n from control points is not passed to prepare_data() — it is handled separately in the pipeline.

3.2 Duplicate Removal

remove_duplicate_points() is called with min_dist = config["min_separation_distance"].

Behavior	Detail
Algorithm	Pairwise distance clustering: for each unprocessed point, all points within `min_dist` are grouped into a cluster and replaced by their arithmetic mean (x, y, and h)
`min_dist` ≤ 0	No merging is performed; all points are returned unchanged
`min_dist` not set	Defaults to `0.0` (no merging)
Empty input	Returns empty arrays immediately
Log output	`"Removed duplicates: reduced N -> M points using min_dist=D"`

Important: Duplicate removal operates on raw-space coordinates. Points that lie within min_dist of each other but represent different physical features (e.g., a well and a control point at the same location) will be merged into a single averaged point.
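
The clustering behavior in the table can be sketched in plain Python. This is an illustrative reimplementation under stated assumptions, not the code of remove_duplicate_points(): it assumes points are visited in input order and that "within min_dist" means a distance less than or equal to min_dist. `merge_close_points` is a hypothetical name.

```python
import math


def merge_close_points(x, y, h, min_dist):
    """Greedy clustering: each unprocessed point absorbs every other
    unprocessed point within min_dist; the cluster is replaced by its mean."""
    n = len(x)
    if n == 0 or min_dist <= 0:
        # Documented behavior: nothing is merged in these cases.
        return list(x), list(y), list(h)

    used = [False] * n
    out_x, out_y, out_h = [], [], []
    for i in range(n):
        if used[i]:
            continue
        cluster = [
            j for j in range(i, n)
            if not used[j] and math.hypot(x[j] - x[i], y[j] - y[i]) <= min_dist
        ]
        for j in cluster:
            used[j] = True
        size = len(cluster)
        out_x.append(sum(x[j] for j in cluster) / size)
        out_y.append(sum(y[j] for j in cluster) / size)
        out_h.append(sum(h[j] for j in cluster) / size)
    return out_x, out_y, out_h
```

Note how the averaged head hides a disagreement between two sources: a well at head 1.0 and a control point at head 3.0 half a unit apart become one point at head 2.0 when min_dist is 1.0.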

3.3 What Happens When Optional Inputs Are Missing

Scenario	Behavior
No linesink shapefile configured	`control_points_list` is empty; only observation wells are used
Linesink shapefile configured but `control_points.enabled = false`	Empty arrays returned; no control points added
Linesink shapefile has no valid features (all NaN elevations)	Warning logged; empty arrays returned; only observation wells used
Observation wells shapefile is empty (0 records)	Empty arrays returned; kriging will fail downstream

4. Output Artifacts

4.1 Contour Shapefile

Triggered by: output.export_contours = true

Written by: export_contours()

Property	Value
Geometry type	LineString (3D — Z coordinate = contour elevation value)
Output path	`output.contour_output_path` (default: `"contours.shp"`)
Attribute columns	`elevation` (float) — the contour level value
CRS	Inherited from the observation wells shapefile CRS (passed through from the loaded GeoDataFrame)
Contour interval	`output.contour_interval` (float, must be > 0; default: `1.0`)
Directory creation	Output directory is created automatically if it does not exist

Null/empty behavior: If Z_grid contains only NaN values, or if no contour levels fall within the grid range, no file is written and a warning is logged.

Contour level alignment: Levels are aligned to multiples of contour_interval starting from floor(z_min / interval) * interval.
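
A small worked sketch of the alignment rule follows. It assumes levels continue in steps of the interval up to the largest multiple not exceeding z_max; the upper bound is an assumption, since only the starting level is specified here. `contour_levels` is a hypothetical name.

```python
import math


def contour_levels(z_min, z_max, interval):
    """Multiples of `interval`, starting at floor(z_min / interval) * interval."""
    if interval <= 0:
        raise ValueError("contour interval must be > 0")
    first = math.floor(z_min / interval)
    last = math.floor(z_max / interval)  # assumed upper bound
    # Multiply integer indices instead of accumulating to limit float drift.
    return [k * interval for k in range(first, last + 1)]
```

For a grid ranging from 1.2 to 2.6 with an interval of 0.5, this yields 1.0, 1.5, 2.0, and 2.5. The first level can lie below z_min, in which case no contour line is produced for it.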

4.2 Auxiliary Points Shapefile

Triggered by: output.export_points = true

Written by: export_aux_points()

Property	Value
Geometry type	Point
Output path	`output.points_output_path` (default: `"observation_points.shp"`)
Attribute columns	`x` (float), `y` (float), `h` (float) — coordinates and head values of all merged data points (observation wells + control points after duplicate removal)
CRS	Inherited from the observation wells shapefile CRS

4.3 Map Figure

Triggered by: output.generate_map = true

Property	Value
Type	`matplotlib` figure
Content	Two subplots: kriged head surface (left) and kriging standard deviation surface (right)
Observation wells	Plotted as scatter points on the head surface
Control points	Plotted as scatter points (distinct marker) if present
Display/save	Displayed interactively (no save path is configurable in the current implementation)

4.4 Water-Level GeoTIFF Raster

Triggered by: output.export_water_level_tif = true

Written by: export_water_level_tif()

Property	Value
Format	GeoTIFF-compatible raster (`.tif`)
Output path	`output.water_level_tif_output_path` (default: `"output/water_levels.tif"`)
Cell values	Kriged water levels from `Z_grid`
Grid orientation	Uses raw-space prediction grid; rows are written north-to-south
NoData handling	`NaN` values are written as `-9999`
Dependencies	Requires `rasterio`
Directory creation	Output directory is created automatically if it does not exist

4.5 Water-Level ASCII Grid Raster

Triggered by: output.export_water_level_asc = true

Written by: export_water_level_ascii_grid()

Property	Value
Format	Arc/Info ASCII Grid (`.asc`)
Output path	`output.water_level_asc_output_path` (default: `"output/water_levels.asc"`)
Cell values	Kriged water levels from `Z_grid`
Header fields	`ncols`, `nrows`, `xllcorner`, `yllcorner`, `cellsize`, `NODATA_value`
Grid orientation	Uses raw-space prediction grid; rows are written north-to-south
NoData handling	`NaN` values are written as `-9999`
Grid requirement	Grid must be regular and have square cells
Directory creation	Output directory is created automatically if it does not exist
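
To make the file layout concrete, here is a minimal sketch of a writer that produces the six header fields, replaces NaN with -9999, and emits rows north-to-south. It is not the code of export_water_level_ascii_grid(): it assumes the in-memory grid stores its southernmost row first, and `write_ascii_grid` and its parameters are hypothetical.

```python
import io
import math

NODATA = -9999


def write_ascii_grid(stream, z_rows, xll, yll, cellsize):
    """Write an Arc/Info ASCII grid to a text stream.

    z_rows[0] is assumed to be the southernmost row, so rows are
    reversed on output to satisfy the north-to-south convention.
    """
    nrows = len(z_rows)
    ncols = len(z_rows[0]) if nrows else 0
    stream.write(f"ncols {ncols}\n")
    stream.write(f"nrows {nrows}\n")
    stream.write(f"xllcorner {xll}\n")
    stream.write(f"yllcorner {yll}\n")
    stream.write(f"cellsize {cellsize}\n")
    stream.write(f"NODATA_value {NODATA}\n")
    for row in reversed(z_rows):
        cells = [str(NODATA) if math.isnan(v) else repr(float(v)) for v in row]
        stream.write(" ".join(cells) + "\n")


# Usage: a 2 x 2 grid with one missing cell, written to an in-memory buffer.
buffer = io.StringIO()
write_ascii_grid(buffer, [[1.0, 2.0], [3.0, float("nan")]], 0.0, 0.0, 10.0)
```

The single cellsize header field is the reason the format requires square cells: there is no way to declare separate x and y spacings.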

5. Grid Definition

The prediction grid is defined in raw space using the following config keys:

Config Key	Type	Description
`grid.x_min`	float	Minimum X coordinate of the grid extent
`grid.x_max`	float	Maximum X coordinate of the grid extent
`grid.y_min`	float	Minimum Y coordinate of the grid extent
`grid.y_max`	float	Maximum Y coordinate of the grid extent
`grid.resolution`	float	Grid cell size (spacing between prediction points) in CRS units

Coordinate space note: The grid is defined in raw space. Internally, grid coordinates are transformed to model space for kriging prediction (when anisotropy is enabled), then results are returned on the original raw-space grid. See docs/glossary.md for the raw vs. model space distinction.

Validation: x_min < x_max and y_min < y_max are enforced; violation raises ValueError.
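
The validation rule, together with one plausible way of deriving axis coordinates from the extent and resolution, can be sketched as below. The resolution check and the inclusive treatment of the upper bound are assumptions made for illustration; `validate_grid` and `axis_coords` are hypothetical names.

```python
import math


def validate_grid(grid):
    """Raise ValueError when the extent is not strictly increasing."""
    if not grid["x_min"] < grid["x_max"]:
        raise ValueError("grid.x_min must be less than grid.x_max")
    if not grid["y_min"] < grid["y_max"]:
        raise ValueError("grid.y_min must be less than grid.y_max")
    # Assumption: a positive resolution is required (not stated in this document).
    if grid["resolution"] <= 0:
        raise ValueError("grid.resolution must be > 0")


def axis_coords(lower, upper, step):
    """Coordinates lower, lower + step, ... not exceeding upper (assumed inclusive)."""
    count = int(math.floor((upper - lower) / step)) + 1
    return [lower + i * step for i in range(count)]
```

For an extent of 0 to 100 at a resolution of 25, this gives five prediction columns at 0, 25, 50, 75, and 100.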

6. CRS Consistency Requirements

The tool does not perform any coordinate reprojection. All input shapefiles must share the same Coordinate Reference System (CRS):

Observation wells shapefile
Linesink river shapefile (if used)
Any additional line feature sources

The grid extent (grid.x_min, etc.) must also be specified in the same CRS.

If CRS values differ between inputs, the tool raises no error and the results will be silently incorrect. It is the user's responsibility to ensure CRS consistency before running the pipeline.
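
Since the tool performs no such check, a user-side guard run before the pipeline may be worthwhile. The sketch below compares CRS identifiers supplied by the caller as strings (for example the string form of each GeoDataFrame's crs attribute). Plain string equality is a deliberately naive assumption, because two differently written identifiers can describe the same CRS. `check_crs_consistency` is a hypothetical name.

```python
def check_crs_consistency(crs_by_source):
    """Raise ValueError if the sources do not all report the same CRS string.

    crs_by_source maps a source label (e.g. "observation_wells")
    to its CRS identifier.
    """
    distinct = {str(crs) for crs in crs_by_source.values()}
    if len(distinct) > 1:
        details = ", ".join(
            f"{name}={crs}" for name, crs in sorted(crs_by_source.items())
        )
        raise ValueError(f"Inconsistent CRS across inputs: {details}")
```

The grid extent carries no CRS metadata of its own, so it still has to be checked by hand against the same coordinate system.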