Skip to content

UK_SSPA v2 β€” Workflow Reference

Overview

This document describes the complete execution pipeline as implemented in main(). The pipeline is a linear sequence of stages that loads data, builds a Universal Kriging model with optional drift terms, and generates spatial predictions and outputs.


Pipeline Flowchart

flowchart TD
    S1["Stage 1: Load Configuration\n(raw space)"]
    S2["Stage 2: Initialize Variogram\n(no space)"]
    S3["Stage 3: Load Observation Wells\n(raw space)"]
    S4["Stage 4: Load Control Points\n(raw space)"]
    S5["Stage 5: Prepare & Merge Data\n(raw space)"]
    S6{"Anisotropy\nenabled?"}
    S6N["Skip transform\nx_model = all_x\ny_model = all_y\n(raw space)"]
    S6Y["Stage 6: Apply Coordinate Transform\n→ model space"]
    S7["Stage 7: Polynomial Drift\n(model space)"]
    S8{"Linesink drift\nenabled?"}
    S8Y["Stage 8: AEM Linesink Drift\n(model space OR raw space\nper apply_anisotropy flag)"]
    S9["Stage 9: Merge Drift Matrices\nnp.hstack([poly, aem])"]
    S10["Stage 10: Build UK Model\n(model space)"]
    S11["Stage 11: Diagnostics\n(model space)"]
    S12{"Cross-validation\nenabled?"}
    S12Y["Stage 12: LOOCV\n(raw space β†’ model space internally)"]
    S13["Stage 13: Grid Prediction\n(raw space grid β†’ model space internally)"]
    S14["Stage 14: Output Generation\n(raw space)"]

    S1 --> S2 --> S3 --> S4 --> S5 --> S6
    S6 -- No --> S6N --> S7
    S6 -- Yes --> S6Y --> S7
    S7 --> S8
    S8 -- No --> S9
    S8 -- Yes --> S8Y --> S9
    S9 --> S10 --> S11 --> S12
    S12 -- No --> S13
    S12 -- Yes --> S12Y --> S13
    S13 --> S14

Stage Reference Table

Stage Function(s) Called Inputs Outputs Coordinate Space Failure Modes
1. Configuration loading load_config() config.json path config dict N/A FileNotFoundError if path missing; KeyError if required keys absent; ValueError for invalid numeric values
2. Variogram initialization variogram() config dict variogram object N/A ValueError if sill ≀ 0, nugget β‰₯ sill, range ≀ 0, or ratio outside (0,1]
3. Observation well loading load_observation_wells() config dict wx, wy, wh arrays Raw FileNotFoundError if shapefile missing; KeyError if water_level_col absent
4. Control point loading load_line_features() source_conf, config cx, cy, ch, cn arrays per source Raw Warning logged and source skipped on failure; does not abort pipeline
5. Data preparation prepare_data() wx,wy,wh, ctrl_points_list, config all_x, all_y, all_h arrays Raw Returns empty arrays if no data; pipeline exits with error log
6. Coordinate transformation get_transform_params(), apply_transform() all_x, all_y, angle_major, ratio transform_params dict, x_model, y_model Raw β†’ Model Skipped entirely when anisotropy.enabled = false; x_model = all_x in that case
7. Polynomial drift compute_resc(), compute_polynomial_drift() x_model, y_model, config, resc drift_matrix_poly, term_names_poly Model Skipped if no drift terms enabled; returns zero-column matrix
8. AEM linesink drift compute_linesink_drift_matrix() calc_x, calc_y, linesinks_gdf, group_column, transform_params, sill drift_matrix_aem, term_names_aem, trained_scaling_factors Model or Raw (see Β§AEM Coordinate Space) Warning logged if shapefile path missing; returns zero-column matrix
9. Drift matrix merge np.hstack([poly, aem]) drift_matrix_poly, drift_matrix_aem drift_matrix, term_names Model Shape mismatch raises ValueError
10. Model building build_uk_model() x_model, y_model, all_h, drift_matrix, variogram_for_kriging uk_model (PyKrige UniversalKriging) Model PyKrige raises on singular kriging matrix
11. Diagnostics drift_diagnostics(), verify_drift_physics(), diagnose_kriging_system() uk_model, drift_matrix, term_names, variogram, all_h Log output; physics_results dict Model Non-fatal; logs warnings/errors but does not abort
12. Cross-validation cross_validate() all_x, all_y, all_h, config, variogram cv_results dict (rmse, mae, q1, q2) Raw β†’ Model internally Skipped when cross_validation.enabled = false; returns NaN metrics for < 3 points
13. Grid prediction predict_on_grid() uk_model, config, term_names, resc, transform_params, scaling_factors GX, GY, Z_grid, SS_grid Raw grid β†’ Model internally ValueError if grid bounds invalid (min β‰₯ max)
14. Output generation generate_map(), export_contours(), export_aux_points() GX, GY, Z_grid, SS_grid, config PNG map, contour shapefile, points shapefile Raw ValueError if contour interval ≀ 0; directory created automatically if missing

Detailed Stage Descriptions

Stage 1 β€” Configuration Loading

load_config() reads and validates config.json. All subsequent stages are driven by the returned config dict. See docs/configuration.md for the full key reference.

Key outputs: config dict


Stage 2 β€” Variogram Initialization

The variogram class is instantiated from the config dict. It stores all variogram parameters and exposes calculate_variogram(h) for distance-based semivariance evaluation.

When anisotropy.enabled = true, the variogram object is later cloned with anisotropy_enabled = False before being passed to PyKrige (see Stage 6). This prevents double-application of anisotropy.

Key outputs: variogram object with attributes sill, range_, nugget, anisotropy_enabled, anisotropy_ratio, angle_major


Stage 3 β€” Observation Well Loading

load_observation_wells() reads the point shapefile specified in data_sources.observation_wells.path and extracts the column named in water_level_col.

Coordinate space: Raw (CRS of the input shapefile).

Key outputs: wx, wy, wh β€” 1-D NumPy arrays of X coordinates, Y coordinates, and water level values.


Stage 4 β€” Control Point Loading

For each entry in data_sources that is not observation_wells, load_line_features() is called to generate synthetic control points along line features (e.g., rivers). The add_control_points flag (default true) controls whether a source is processed.

Each call returns a 4-tuple (cx, cy, ch, cn): - cx, cy β€” control point coordinates (raw space) - ch β€” interpolated elevation values - cn β€” per-point nugget overrides (used internally by prepare_data)

Failures are caught and logged as warnings; the pipeline continues with remaining sources.

Coordinate space: Raw.

Key outputs: ctrl_points_list β€” list of (cx, cy, ch) tuples.


Stage 5 β€” Data Preparation and Merging

prepare_data() merges observation wells and all control point sets into a single dataset, applying min_separation_distance deduplication via remove_duplicate_points().

If the merged dataset is empty, the pipeline logs an error and exits.

Coordinate space: Raw.

Key outputs: all_x, all_y, all_h β€” merged 1-D arrays used for all subsequent computation.


Stage 6 β€” Coordinate Transformation (Conditional)

Condition: variogram.anisotropy_enabled = true

When anisotropy is enabled:

  1. get_transform_params() computes the transformation parameters from the data centroid, angle_major, and anisotropy_ratio. Returns a dict with keys center, R (rotation matrix), S (scaling matrix).
  2. apply_transform() applies the transform: the input azimuth angle is internally converted to arithmetic (alpha = 90 - azimuth), then the standard rotation matrix is built. Coordinates are translated to the centroid, rotated, then the minor axis is scaled by 1/ratio.
  3. A clone of the variogram is created with anisotropy_enabled = False. This clone (variogram_for_kriging) is passed to PyKrige so that PyKrige does not apply its own internal anisotropy on top of the pre-transformed coordinates.

When anisotropy is disabled: - transform_params = None - x_model = all_x, y_model = all_y (raw coordinates used directly) - variogram_for_kriging = variogram (original object, unchanged)

Coordinate space: Raw β†’ Model.

Key outputs: transform_params, x_model, y_model, variogram_for_kriging

Angle convention: angle_major is an azimuth (clockwise from North), matching the KT3D SETROT convention. angle_major = 0Β° means the major axis of spatial correlation points North. angle_major = 90Β° means it points East. Internally, the code converts to arithmetic via alpha = 90 - azimuth. See docs/glossary.md for the full conversion table.


Stage 7 β€” Polynomial Drift Computation

Condition: At least one polynomial drift term is enabled in drift_terms.

  1. compute_resc() computes the rescaling factor: resc = sqrt(sill / max(radsqd, rangeΒ²)) where radsqd is the squared maximum radius of the data extent. The rangeΒ² floor prevents instability when data extent is small relative to the correlation range.
  2. compute_polynomial_drift() builds the drift matrix columns. Term ordering is always [linear_x, linear_y, quadratic_x, quadratic_y] regardless of config dict order.

Coordinate space: Model (uses x_model, y_model).

Key outputs: drift_matrix_poly (shape [N, n_poly_terms]), term_names_poly (ordered list of term name strings), resc (scalar)


Stage 8 β€” AEM Linesink Drift Computation (Conditional)

Condition: drift_terms.linesink_river is true or {"use": true, ...}, AND a valid shapefile path exists.

compute_linesink_drift_matrix() computes the Analytic Element Method potential field for each linesink group. It returns: - drift_matrix_aem β€” shape [N, n_groups] - term_names_aem β€” one name per group (the group_column value) - trained_scaling_factors β€” dict mapping group name β†’ scaling factor

AEM Coordinate Space

The coordinates passed to AEM computation depend on the apply_anisotropy flag:

apply_anisotropy Coordinates used for AEM Linesink geometry
true (default) x_model, y_model (model space) Transformed to model space
false all_x, all_y (raw space) Kept in raw space

Critical: When apply_anisotropy = false, the AEM drift is computed in raw space while polynomial drift and kriging use model space. This is physically meaningful when the river geometry should not be distorted by the anisotropy transformation, but requires careful interpretation.

Key outputs: drift_matrix_aem, term_names_aem, trained_scaling_factors


Stage 9 β€” Drift Matrix Merging

drift_matrix = np.hstack([drift_matrix_poly, drift_matrix_aem])
term_names   = term_names_poly + term_names_aem

The final drift_matrix has shape [N, n_poly + n_aem]. If no drift terms are active, both sub-matrices are zero-column arrays and drift_matrix has shape [N, 0].

Critical contract: The order of columns in drift_matrix and the corresponding entries in term_names must be identical between training (this stage) and prediction (Stage 13). Any change in enabled terms between runs will cause a column count mismatch error in predict_at_points().


Stage 10 β€” Model Building

build_uk_model() wraps PyKrige's UniversalKriging with drift_terms="specified". The drift matrix columns are passed as specified_drift_arrays.

Coordinate space: Model (x_model, y_model).

Key outputs: uk_model β€” trained UniversalKriging instance.


Stage 11 β€” Diagnostics

Three diagnostic functions run after model training. All are non-fatal (log warnings/errors but do not abort):

  1. drift_diagnostics() β€” checks drift column magnitudes relative to the variogram sill. Warns if any term's ratio exceeds 1000.
  2. verify_drift_physics() β€” for each polynomial term, fits a regression and checks RΒ² > 0.999 and slope error < 1%. Returns "PASS", "FAIL", or "SKIP" per term. AEM terms are skipped.
  3. diagnose_kriging_system() β€” performs an exact interpolation test (predicts at the first training point and checks error < 0.01) and re-reports drift magnitude ratios.
  4. output_drift_coefficients() β€” runs an OLS regression of all_h on drift_matrix and logs the coefficients. Non-intrusive diagnostic only.

Stage 12 β€” Cross-Validation (Conditional)

Condition: cross_validation.enabled = true

cross_validate() performs Leave-One-Out Cross-Validation (LOOCV). For each fold, it rebuilds the full pipeline (including coordinate transformation and drift computation) using the remaining Nβˆ’1 points, then predicts at the held-out point.

Coordinate space: Accepts raw coordinates; applies transformation internally per fold.

Key outputs: cv_results dict with keys rmse, mae, q1 (mean standardized error), q2 (variance of standardized errors).


Stage 13 β€” Grid Prediction

predict_on_grid() generates a regular grid from the bounds and resolution in config.grid, then predicts at every grid node.

Internally: 1. Grid nodes are defined in raw space from config.grid bounds. 2. If transform_params is not None, grid nodes are transformed to model space before prediction. 3. Drift columns are reconstructed at each grid node using compute_drift_at_points() with the same resc and term_names from training. 4. AEM drift at grid nodes uses scaling_factors = trained_scaling_factors (passed explicitly) to ensure the same normalization as training.

Critical contract: trained_scaling_factors from Stage 8 must be passed to predict_on_grid() via the scaling_factors parameter. If omitted, the AEM potential will be re-normalized to the grid point distribution, producing incorrect predictions.

Key outputs: GX, GY (meshgrid arrays, raw space), Z_grid (predicted head), SS_grid (kriging variance).


Stage 14 β€” Output Generation

Three optional outputs, each controlled by config.output:

Output Config Key Function Format
Prediction map generate_map: true generate_map() Matplotlib figure (displayed or saved as PNG)
Contour lines export_contours: true export_contours() Shapefile (LineString, 3D with elevation attribute)
Observation points export_points: true export_aux_points() Shapefile (Point with head attribute)
Water-level GeoTIFF export_water_level_tif: true export_water_level_tif() Raster .tif
Water-level ASCII grid export_water_level_asc: true export_water_level_ascii_grid() Raster .asc

Output directories are created automatically if they do not exist.


Critical Contracts Summary

These invariants must hold across the entire pipeline. Violating any of them produces incorrect results or runtime errors.

1. term_names Order Consistency

The list term_names produced in Stage 9 must be identical (same length, same order) when used in: - Stage 10: build_uk_model() β€” establishes the column mapping - Stage 13: predict_on_grid() β†’ predict_at_points() β€” validates column count

Consequence of violation: ValueError: Drift column count mismatch at prediction time.

2. AEM scaling_factors Persistence

The trained_scaling_factors dict returned by compute_linesink_drift_matrix() in Stage 8 must be passed unchanged to predict_on_grid() in Stage 13 via the scaling_factors parameter.

Consequence of violation: AEM drift columns at prediction points will be normalized to the prediction-point distribution rather than the training-point distribution, breaking the linear relationship between training and prediction drift.

3. Variogram Clone for Anisotropy

When anisotropy.enabled = true, the variogram passed to PyKrige (variogram_for_kriging) must have anisotropy_enabled = False. The original variogram object (with anisotropy enabled) is used only for parameter extraction (sill, range, angle, ratio).

Consequence of violation: PyKrige applies its own internal anisotropy transformation on top of the pre-transformed coordinates, resulting in double-application of anisotropy.

4. AEM Coordinate Space Consistency

When apply_anisotropy = false for linesink drift, the AEM drift matrix is computed in raw space (all_x, all_y). The polynomial drift and kriging still use model space (x_model, y_model). This mixed-space configuration is intentional and supported, but the same apply_anisotropy setting must be used consistently between training and prediction.


Coordinate Space Summary

Data Space at Each Stage
Observation wells loaded Raw
Control points loaded Raw
Merged dataset (all_x, all_y) Raw
After transform (x_model, y_model) Model (or Raw if anisotropy disabled)
Polynomial drift columns Model
AEM drift columns (apply_anisotropy=true) Model
AEM drift columns (apply_anisotropy=false) Raw
PyKrige training coordinates Model
Grid definition (config.grid bounds) Raw
Grid nodes passed to PyKrige Model (transformed internally)
Output grid arrays (GX, GY) Raw
Contour/point shapefiles Raw