Jiskta Methodology & Data Lineage

White Paper on Data Integrity and Quantitative Precision

1. Introduction

Industries ranging from algorithmic trading and real-estate physical risk assessment to academic research and strict regulatory ESG reporting (such as EU CSRD) increasingly rely on high-resolution climate data. Full transparency about the source and aggregation of this data has therefore become essential. Data lineage can no longer be a black box.

Jiskta is not a predictive modeling agency or an AI platform. Jiskta is a high-speed data conduit and quantitative aggregation engine that acts as an auditable bridge between official scientific archives (such as the European Union’s Copernicus programs) and downstream corporate, regulatory, or academic applications.

This white paper outlines the deterministic mathematical and spatial methodologies used by Jiskta to ensure 100% data integrity, exact scientific precision, and strict auditability.

2. Definitive Data Provenance (The European Gold Standard)

Jiskta relies exclusively on authoritative, peer-reviewed, and publicly funded institutional data. We do not use third-party commercial blends, proprietary IoT sensor networks of unknown calibration, or unverified secondary sources.

Atmospheric & Climate Data: Sourced directly from the Copernicus Atmosphere Monitoring Service (CAMS) and Copernicus Climate Change Service (C3S), operated by the European Centre for Medium-Range Weather Forecasts (ECMWF) on behalf of the European Commission.
- Datasets: EAC4 Validated Reanalysis, CAMS European Reanalysis, ERA5 Global Reanalysis.
Industrial Emissions Data: Sourced from the European Environment Agency (EEA).
- Dataset: European Pollutant Release and Transfer Register (E-PRTR).
Water Risk Data: Sourced from the World Resources Institute (WRI).
- Dataset: Aqueduct 4.0 Global Water Risk Indicators (catchment polygons, rasterised to 0.1°).
Biodiversity Protection Data: Sourced from the European Environment Agency (EEA).
- Dataset: Natura 2000 protected areas network (27,295 EU protected sites, rasterised to 0.1° with distance transform).
Spatial Land Cover & Nightlights: Sourced from NASA Earthdata and the European Commission Joint Research Centre (JRC).
- Datasets: VIIRS Nighttime Lights (economic intensity), MODIS Land Cover, Global Human Settlement Layer (GHSL).
Macro-Economic Context: Sourced openly from The World Bank and Eurostat.

3. Data Integrity & The “No AI” Guarantee

A critical risk in modern ESG reporting is the use of Generative AI or machine learning models to “estimate” local pollution where spatial data is missing. This introduces un-auditable margins of error (hallucinations).

Jiskta strictly prohibits the use of AI gap-filling.

Our pipeline ingests the official institutional NetCDF (Network Common Data Form) scientific files.
The raw values provided by ECMWF are quantized to a fixed, per-variable precision (e.g., 0.01 µg/m³ steps for NO2) and cached as integers using lossless delta encoding. Quantization is the only rounding step; the delta encoding itself is exactly reversible, so every decoded value is identical to the quantized source value at the stated precision.
If an official European grid cell lacks spatial data for a specific hour, Jiskta retains the mathematical null state and excludes it from the denominator during aggregation. We never invent spatial data.
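
The quantization, delta encoding, and null handling described above can be sketched as follows. This is a minimal illustration, not Jiskta's actual pipeline: the function names and the use of None for the null state are assumptions, and the scale of 100 corresponds to the 0.01 µg/m³ example given for NO2.

```python
SCALE = 100  # 1 / 0.01: integer steps of 0.01 µg/m³ (NO2 example)

def quantize(values, scale=SCALE):
    """Convert floats to integer counts of the precision step."""
    return [round(v * scale) for v in values]

def delta_encode(ints):
    """Store each value as the difference from its predecessor."""
    out, prev = [], 0
    for x in ints:
        out.append(x - prev)
        prev = x
    return out

def delta_decode(deltas):
    """Exactly invert delta_encode by running a cumulative sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out

def mean_excluding_nulls(values):
    """Arithmetic mean in which None entries leave the denominator."""
    valid = [v for v in values if v is not None]
    return sum(valid) / len(valid) if valid else None

quantized = quantize([12.34, 12.36, 12.31])
deltas = delta_encode(quantized)
```

Because the encoded values are integers, decoding reproduces the quantized series exactly; no floating-point drift can accumulate.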

3.1. Temporal Gap-Filling via Data Maturation

While Jiskta refuses to use statistical models to guess missing geographic data, we do solve the problem of temporal reporting lags (the delay between real-world emissions and the publication of validated scientific datasets) by employing an official ECMWF data maturation cascade.

To provide companies with up-to-date figures for the current reporting year, Jiskta bridges temporal gaps by automatically prioritizing the highest available official tier:

Near Real-Time (NRT) Analysis: Used to cover the immediate ~1-4 month temporal gap before reanalysis is published.
Interim Reanalysis: Automatically replaces NRT data after ~4 months, introducing heavier ground-station assimilation for stronger trend accuracy.
Validated Reanalysis (The Gold Standard): Replaces Interim data after 18-24 months. Fully validated and frozen for final regulatory compliance.

The Jiskta Audit Trail explicitly logs which data tier was mapped to each month so auditors can see the exact scientific maturity of the data retrieved.
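
The tier prioritisation and audit logging can be sketched as below. The tier identifiers, the availability mapping, and the function names are hypothetical; the sketch only illustrates the rule "use the most mature tier available for each month and record which one was used".

```python
# Most mature tier first (hypothetical identifiers).
TIER_PRIORITY = ["validated_reanalysis", "interim_reanalysis", "nrt_analysis"]

def select_tier(available):
    """Return the most mature tier present in `available`, or None."""
    for tier in TIER_PRIORITY:
        if tier in available:
            return tier
    return None

def build_audit_trail(availability_by_month):
    """Map each month (e.g. '2024-11') to the tier that was used."""
    return {
        month: select_tier(availability_by_month[month])
        for month in sorted(availability_by_month)
    }

trail = build_audit_trail({
    "2022-01": {"validated_reanalysis", "interim_reanalysis"},
    "2024-06": {"interim_reanalysis", "nrt_analysis"},
    "2024-11": {"nrt_analysis"},
})
```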

4. Spatial Methodology: Matching Locations to Grids

When a user uploads a physical address or GPS coordinate, Jiskta performs a deterministic mathematical transformation to query the surrounding environment.

4.1. Point-to-Grid Snapping (Nearest Neighbor)

Copernicus environmental data is published as a raster grid (e.g., 0.1° x 0.1° resolution over Europe). When a specific facility coordinate (lat, lon) is queried, Jiskta uses strict Nearest Neighbor snapping.

The system calculates the exact center of all surrounding official grid cells.
It snaps the coordinate to the single grid cell center that minimizes the Haversine distance, representing the immediate atmospheric column above the facility.
Mathematical offset: The snapping offset is bounded by half the grid resolution along each axis (e.g., ±0.05° in latitude and in longitude on a 0.1° grid).
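
A minimal sketch of the snapping step, assuming a regular grid whose cell centres lie at lat0 + i·res and lon0 + j·res. The parameters lat0 and lon0, the function names, and the Earth radius constant are illustrative assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (assumed constant)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def snap_to_grid(lat, lon, lat0, lon0, res=0.1):
    """Return (i, j) of the cell centre nearest to (lat, lon)."""
    i0 = math.floor((lat - lat0) / res)
    j0 = math.floor((lon - lon0) / res)
    # The four surrounding cell centres are the only candidates.
    candidates = [(i, j) for i in (i0, i0 + 1) for j in (j0, j0 + 1)]
    return min(
        candidates,
        key=lambda ij: haversine_km(lat, lon, lat0 + ij[0] * res, lon0 + ij[1] * res),
    )
```

For example, with lat0 = 30.0 and lon0 = -25.0, the coordinate (48.86, 2.34) snaps to the centre at (48.9, 2.3), an offset of 0.04° on each axis.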

4.2. Spatial Bounding Boxes (Area Averaging)

For larger sites (e.g., a port or a large industrial complex), users can query a bounding box defined by (lat_min, lat_max, lon_min, lon_max).

Jiskta identifies the complete set of grid cells (i, j) that fall within this bounding box.
The raw hourly values of all matched cells are uniformly averaged using a standard arithmetic mean: $\bar{x}_{t} = \frac{1}{N_t} \sum_{k=1}^{N_t} x_{k,t}$, where $N_t$ is the number of valid (non-null) grid cells in the bounding box at time $t$.
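
The area average for a single hour can be sketched as follows. It assumes the grid is a nested list indexed by [row][column], that lats and lons hold the cell-centre coordinates, and that None represents a null cell; these representations are illustrative choices.

```python
def bbox_mean(grid, lats, lons, lat_min, lat_max, lon_min, lon_max):
    """Arithmetic mean of non-null cells whose centres lie in the box."""
    values = [
        grid[i][j]
        for i, la in enumerate(lats) if lat_min <= la <= lat_max
        for j, lo in enumerate(lons) if lon_min <= lo <= lon_max
        if grid[i][j] is not None
    ]
    # Null cells never enter the denominator; an empty box yields None.
    return sum(values) / len(values) if values else None

lats = [50.0, 50.1]
lons = [4.0, 4.1, 4.2]
grid = [[10.0, 20.0, None],
        [30.0, 40.0, 99.0]]
result = bbox_mean(grid, lats, lons, 49.95, 50.15, 3.95, 4.15)
```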

4.3. Administrative Spatial Joins (Regional Aggregation)

To power macro-level reporting, Jiskta executes deterministic spatial joins between high-resolution raster grids and administrative boundaries (e.g., European NUTS3 regions or countries identified by ISO 3166-1 alpha-2 codes).

The engine mathematically evaluates intersecting raster grid cells within the defined administrative polygon.
Regional values are derived by calculating the mean of all cells native to that region, enabling exact programmatic correlations between physical atmospheric states and the region’s socio-economic metrics (like GDP or population density).
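
One way to sketch the regional mean is to assume the spatial join has already produced a raster of region identifiers aligned with the value grid. That membership raster, the region codes used as labels, and the function name are assumptions made for illustration only.

```python
from collections import defaultdict

def regional_means(values, region_ids):
    """Mean of non-null cells per region, given an aligned region-id raster."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for value_row, region_row in zip(values, region_ids):
        for value, region in zip(value_row, region_row):
            if value is None or region is None:
                continue  # null cells and unassigned cells are skipped
            sums[region] += value
            counts[region] += 1
    return {region: sums[region] / counts[region] for region in sums}

values = [[1.0, 3.0], [5.0, None]]
region_ids = [["BE100", "BE100"], ["FR101", "FR101"]]  # illustrative labels
means = regional_means(values, region_ids)
```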

5. Temporal Methodology: Aggregation & Statistics

To provide 10-year historical baselines (required by ESRS E2), Jiskta aggregates millions of hourly data points into human-readable daily, monthly, or annual metrics.

5.1. Temporal Means

All temporal aggregations are calculated strictly using arithmetic means over exactly matched time boundaries (UTC). For example, a monthly average for January 2023 is derived by summing the $24 \times 31 = 744$ hourly atmospheric state values and dividing by $744$ (or by the number of valid hours if any hour is null, as described in Section 3). No seasonal weighting or proprietary smoothing is applied.
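
The January 2023 example can be reproduced with a short sketch. The series representation (a list of UTC timestamp and value pairs, with None for a null hour) and the function name are assumptions; the synthetic values exist only to make the arithmetic checkable.

```python
from datetime import datetime, timedelta, timezone

def monthly_mean(series, year, month):
    """Arithmetic mean of the non-null hourly values in one UTC month."""
    valid = [
        value for ts, value in series
        if ts.year == year and ts.month == month and value is not None
    ]
    return sum(valid) / len(valid) if valid else None

# Synthetic series: 744 hourly values for January 2023 (UTC).
start = datetime(2023, 1, 1, tzinfo=timezone.utc)
january = [(start + timedelta(hours=h), float(h % 24)) for h in range(24 * 31)]
mean_jan = monthly_mean(january, 2023, 1)
```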

5.2. Trend Analysis (Ordinary Least Squares)

When calculating historical pollution or climate trends (e.g., “NO2 has changed by -0.4 µg/m³/year”, i.e. fallen by 0.4 µg/m³ per year), Jiskta uses standard Ordinary Least Squares (OLS) linear regression over the time series. This is a universally accepted statistical method that any auditor can reproduce in standard spreadsheet software.
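
For reference, the closed-form OLS slope and intercept for a single predictor can be computed as in the sketch below. The function name and the synthetic ten-year series are illustrative; the formula is the standard one and gives the same result as the SLOPE and INTERCEPT functions in common spreadsheet software.

```python
def ols_slope_intercept(xs, ys):
    """Closed-form simple linear regression: y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Synthetic annual means falling by 0.4 µg/m³ per year.
years = list(range(2014, 2024))
annual_no2 = [30.0 - 0.4 * (year - 2014) for year in years]
slope, intercept = ols_slope_intercept(years, annual_no2)
```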

5.3. Regulatory Threshold Analysis (Exceedance)

Many ESG directives and World Health Organization (WHO) guidelines define compliance using absolute limit values (e.g., measuring the exact number of hours an asset was exposed to PM2.5 > 15 µg/m³).

To compute this, Jiskta does not use probability distributions.
The query engine iterates sequentially over the historic time-series vector for the specific location and tallies a deterministic integer count of hours where the detected value strictly exceeds the defined threshold.
The result is a mathematically pure ratio of breached hours to valid observed hours.
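
The exceedance tally reduces to a strict comparison and two integer counts, as in this minimal sketch. The function name and the use of None for null hours are assumptions.

```python
def exceedance(series, threshold):
    """Return (breached_hours, valid_hours, ratio) for a strict '>' test."""
    valid = [value for value in series if value is not None]
    breached = sum(1 for value in valid if value > threshold)
    ratio = breached / len(valid) if valid else None
    return breached, len(valid), ratio

# A value exactly equal to the threshold does not count as a breach.
hourly_pm25 = [10.0, 15.0, 15.1, None, 40.0]
breached, valid_hours, ratio = exceedance(hourly_pm25, 15.0)
```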

6. Proximity Analysis (E-PRTR Industrial Facilities)

To provide context on nearby heavy polluters, Jiskta performs a radial Haversine search around the user’s coordinate.

The formula accounts for the Earth’s curvature (spherical distance).
It queries the EEA’s geocoded database of active industrial facilities.
All facilities within the requested radius (e.g., 10 km) are returned alongside their officially reported annual emission mass (in kg/year).
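
A radial search of this kind can be sketched as follows. The facility records, their field names, and the emission figures are invented for illustration and do not reflect the E-PRTR schema.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (assumed constant)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def facilities_within(lat, lon, radius_km, facilities):
    """Return facilities inside the radius, nearest first."""
    hits = []
    for facility in facilities:
        distance = haversine_km(lat, lon, facility["lat"], facility["lon"])
        if distance <= radius_km:
            hits.append({**facility, "distance_km": round(distance, 2)})
    return sorted(hits, key=lambda hit: hit["distance_km"])

# Hypothetical records for illustration only.
facilities = [
    {"name": "Plant A", "lat": 50.05, "lon": 4.0, "emission_kg_per_year": 120000},
    {"name": "Plant B", "lat": 50.50, "lon": 4.0, "emission_kg_per_year": 98000},
]
hits = facilities_within(50.0, 4.0, 10.0, facilities)
```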

7. Environmental Baseline Datasets: Conversion Methodology

Two static environmental risk datasets are rasterised to the standard 0.1° × 0.1° grid used throughout Jiskta. This section documents the exact conversion methodology so users understand the spatial precision and limitations of each.

7.1. WRI Aqueduct 4.0 — Water Risk Rasterisation

Source: World Resources Institute (WRI), Aqueduct Water Risk Atlas v4.0 (2023).
License: CC BY 4.0. Cite as: Kuzma et al. (2023), Aqueduct 4.0, WRI Technical Note.
Native format: Polygon shapefile — one record per hydrological catchment (sub-basin), globally covering ~100 k–500 k distinct catchment polygons.
Native variables: Baseline Water Stress (bws), Baseline Water Depletion (bwd), Riverine Flood Risk (rfr), Drought Risk (drr) — each reported as a continuous risk score and a discrete 1–5 category (1 = Low, 5 = Extremely High).

Conversion pipeline:

Download: The official Aqueduct GeoPackage is downloaded from the WRI ArcGIS Hub. Polygons already carry pre-computed risk category fields (bws_cat, bwd_cat, rfr_cat, drr_cat).
Rasterisation at 0.1°: A 0.1° × 0.1° global grid (3600 × 1800 cells) is constructed. For each grid cell, the dominant catchment polygon is identified using a centroid-overlap rule — the category value of the catchment whose polygon contains the grid cell centre (lat + 0.05, lon + 0.05) is assigned. For ocean/no-data cells, a sentinel value of 0 is written.
Tie-breaking: Where multiple catchment polygons overlap a single cell (common near river deltas), the polygon with the greatest intersection area takes precedence.
Continuous score: In addition to the integer category, the continuous bws_score, bwd_score, rfr_score, drr_score values are stored as float32 for research users.
Binary index: The rasterised grid is serialised as a compact binary file (water_stress.bin) using an AQRS v1 header (magic, version, n_lats, n_lons, res, lat_min, lon_min) followed by one 24-byte record per cell (4× int16 category + 4× float32 score, packed: 4 × 2 + 4 × 4 = 24 bytes).
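
The relationship between a coordinate, its cell in the 3600 × 1800 grid, and the cell centre used for catchment assignment can be sketched as below. The grid origin (lat_min = -90, lon_min = -180), the row and column ordering, and the function names are assumptions; the actual values are carried in the file header.

```python
import math

def cell_index(lat, lon, lat_min=-90.0, lon_min=-180.0, res=0.1,
               n_lats=1800, n_lons=3600):
    """Return (row, col) of the 0.1° cell containing the coordinate."""
    # Rounding before floor guards against float error at cell edges.
    row = math.floor(round((lat - lat_min) / res, 9))
    col = math.floor(round((lon - lon_min) / res, 9))
    # Clamp the upper edges (lat = 90, lon = 180) into the last cell.
    return min(row, n_lats - 1), min(col, n_lons - 1)

def cell_centre(row, col, lat_min=-90.0, lon_min=-180.0, res=0.1):
    """Centre of a cell: lower-left corner plus half the resolution."""
    return lat_min + row * res + res / 2, lon_min + col * res + res / 2

row, col = cell_index(48.86, 2.34)
centre_lat, centre_lon = cell_centre(row, col)
```

Under these assumptions, a lookup for a facility at (48.86, 2.34) reads the record for the cell whose centre is (48.85, 2.35).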

Precision & limitations:

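The assignment is based on the cell centre only. Any location within a 0.1° cell lies at most about 7.9 km (the half-diagonal of the cell at the equator) from the centre whose catchment it inherits, so a facility close to a catchment boundary may receive the value of the neighbouring catchment.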
Coastal cells may return a score of 0 if their centroid falls marginally in the ocean; this is expected behaviour (water risk is not defined for open water).
The dataset is a static 1979–2019 calibration baseline; it does not change seasonally or year-by-year. It will be updated when WRI releases Aqueduct 5.0.
API field: water_stress in every GET /api/v1/enrich response (global, no credits deducted).

7.2. EEA Natura 2000 — Biodiversity Proximity Time Series

Source: European Environment Agency (EEA), Natura 2000 protected area network (2024 update) + EEA N2K Backbone ArcGIS REST API.
License: CC BY 4.0. Attribution: “European Environment Agency (EEA), Natura 2000 network”.
Coverage: European Union + Norway, Switzerland, UK — approximately 30°N–72°N, 32°W–45°E.
Native format: ArcGIS MapServer polygon service (27,295 protected sites — Special Protection Areas (SPA) and Special Areas of Conservation / Sites of Community Importance (SAC/SCI)).
Time series: Annual rasters 2004–2024 (21 years), exposable via optional ?year= parameter.

Conversion pipeline:

High-resolution raster download: The EEA’s ArcGIS MapServer export endpoint is queried at 5× the target resolution — i.e. for a 770 × 450 output grid (0.1° cells), the download is 3850 × 2250 pixels. This oversampling is critical: at native 1:1 resolution the server applies cartographic generalisation (minimum symbol size) that inflates polygon footprints, creating false positives in urban cells adjacent to protected areas.
Majority-vote downsampling: Each 0.1° output cell corresponds to 5 × 5 = 25 sub-pixels in the high-resolution download. A cell is classified as in_natura = True only if more than 50% (>12/25) of its sub-pixels are covered by a Natura 2000 polygon fill. This strict majority vote eliminates rendering artefacts while correctly capturing genuine partial coverage (e.g. Brussels — Forêt de Soignes SAC genuinely intersects the 0.1° cell).
Validation against polygon API: Results are cross-checked by querying the EEA ArcGIS feature service for the exact 0.1° bounding box of suspicious cells. City-centre cells previously showing false positives (London, Berlin) now correctly return in_natura = False.
Distance transform: For cells outside any Natura 2000 site, the Euclidean distance to the nearest protected area boundary is computed using scipy.ndimage.distance_transform_edt applied to the binary in_natura grid. Distance is scaled at approximately 11.1 km per pixel (the latitude-direction cell size at 0.1° resolution), yielding dist_km rounded to 1 decimal place.
Binary index (N2KR): The raster result is stored as a compact N2KR v1 binary file (~1.7 MB): a 32-byte header followed by 5 bytes per cell (float32 dist_km + uint8 flags, bit 0 = in_natura). One file per year is produced: natura2000_2004.bin through natura2000_2024.bin.
Named-sites index (N2KS v2): A complementary N2KS v2 binary file (~4.6 MB, 27,295 records, 168 bytes/record) is built from the EEA ArcGIS feature service. V2 adds two uint16 fields per site: year_spa (year the site was first submitted as an SPA) and year_sac (year the site was first submitted as a SAC/SCI). The file layout is 32 bytes header + 168 bytes per site.
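
The majority-vote downsampling and the 5-byte N2KR cell record can be sketched with the standard library alone. The nested-list raster representation, the function names, and the little-endian byte order are assumptions; the 32-byte header is not modelled because its field layout is not specified here.

```python
import struct

FACTOR = 5  # 5 x 5 sub-pixels per 0.1° output cell

def majority_vote(hi_res, factor=FACTOR):
    """Downsample a 0/1 raster: True only if more than half the sub-pixels are set."""
    rows = len(hi_res) // factor
    cols = len(hi_res[0]) // factor
    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            covered = sum(
                hi_res[r * factor + dr][c * factor + dc]
                for dr in range(factor) for dc in range(factor)
            )
            out_row.append(covered * 2 > factor * factor)  # i.e. >= 13 of 25
        out.append(out_row)
    return out

def pack_cell(dist_km, in_natura):
    """Pack one cell as float32 dist_km + uint8 flags (bit 0 = in_natura)."""
    return struct.pack("<fB", dist_km, 1 if in_natura else 0)

# One output row of two cells: 13/25 covered (True) and 12/25 covered (False).
left = [1] * 13 + [0] * 12
right = [1] * 12 + [0] * 13
hi_res = [left[i * 5:(i + 1) * 5] + right[i * 5:(i + 1) * 5] for i in range(5)]
cells = majority_vote(hi_res)
```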

Designation year assignment (N2KS v2):

Designation years are derived from the EEA N2K Backbone ArcGIS REST API (N2K_Backbone/Releases_Spatial_R3_stg), which exposes the submission date of each of the 27,295 site records. Individual designation dates are not published in machine-readable form by the EEA, so the following proxy is used:

Sites whose earliest recorded submission date (min Date field) is on or before 31 December 2017 are assigned year = 2004 (the BASELINE_YEAR). These sites formed part of the original Habitats Directive compliance baseline and have been present in the network since its establishment.
Sites whose earliest recorded submission date is 2018 or later are assigned the actual year of first appearance in the Backbone.

This approach correctly captures the rapid network expansion that occurred in 2021–2024 (large batches of new sites submitted by France, Poland, and Greece) while avoiding the false impression that all sites were designated recently. Approximately 27,176 of the 27,295 sites have a year assigned; the remaining ~119 sites have insufficient metadata and default to year 0 (treated as always-present by the query engine).
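
The proxy rule can be expressed as a short function. The function name and the representation of a site's submission dates as a list of date objects are assumptions; the cutoff, the baseline year, and the default of 0 follow the rule described above.

```python
from datetime import date

BASELINE_YEAR = 2004
CUTOFF = date(2017, 12, 31)

def assign_year(submission_dates):
    """Designation-year proxy from a site's recorded submission dates."""
    if not submission_dates:
        return 0  # insufficient metadata: treated as always present
    earliest = min(submission_dates)
    if earliest <= CUTOFF:
        return BASELINE_YEAR
    return earliest.year
```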

Time series raster interpretation:

Year 2004–2020: ~1,749 cells classified as in_natura = True — the original baseline ~109 sites
Year 2021: ~2,167 cells (572 additional sites submitted)
Year 2023: ~3,873 cells (9,839 additional sites)
Year 2024: ~9,903 cells (14,137 additional sites — largest single batch submission)

This is a proxy for the expansion of the protected area network over time. Users researching biodiversity trends should be aware that the apparent rapid growth in 2021–2024 partly reflects EEA data digitisation catching up with long-standing national designations, not necessarily new land being protected.

Site type codes:

A — Special Protection Area (SPA, Birds Directive)
B — SPA + Special Area of Conservation (both Directives)
C — Special Area of Conservation / Site of Community Importance (SAC/SCI, Habitats Directive)

API response fields:

year_used — the actual raster year applied (floor to nearest available year ≤ requested ?year=)
year_spa, year_sac — site designation years from N2KS v2 (0 if unknown)
When in_natura=true: sitecode, sitename, sitetype (the containing site)
When in_natura=false: nearest_sitecode, nearest_sitename, nearest_sitetype (closest site by centroid distance)
Both index files must be present for site identification. If only N2KR is present, only in_natura, dist_km, and year_used are returned.
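
The year_used resolution can be sketched as a floor over the available raster years. The function name is hypothetical, and returning None for a request earlier than the first raster is an assumption, since that case is not specified here.

```python
AVAILABLE_YEARS = range(2004, 2025)  # annual rasters 2004 through 2024

def resolve_year(requested, available=AVAILABLE_YEARS):
    """Largest available raster year that does not exceed the request."""
    candidates = [year for year in available if year <= requested]
    return max(candidates) if candidates else None
```

For example, a request for ?year=2030 would resolve to 2024 under this rule.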

Precision & limitations:

Spatial resolution is 0.1° (about 11 km north-south; the east-west cell width shrinks with latitude, from about 9.6 km at 30°N to about 3.4 km at 72°N). A cell partially covered by a Natura 2000 site (e.g. a forest on the urban fringe) will be tagged in_natura = True if the majority of the cell is inside the polygon.
dist_km = 0.0 means the 0.1° cell is classified as in_natura = True under the majority vote; it does not mean the facility itself lies inside a protected area (which would require finer-resolution geometry).
The N2KS centroid is the bbox midpoint of the simplified ArcGIS polygon. For large or irregular sites (e.g. archipelago SPAs) this centroid may not fall inside the actual habitat; sitename + sitecode remain correct for CSRD disclosure purposes.
The year assignment proxy (EEA Backbone submission date) may not match the precise national gazette date of a designation. For CSRD purposes, the year provides a conservative approximation of when a site entered the EEA’s official Natura 2000 reporting.
The dataset is rebuilt annually when EEA publishes a new Natura 2000 boundaries update.
API field: natura_2000 in GET /api/v1/enrich responses — EU only. Field is absent for non-EU coordinates. ESRS E4-2 §30, E4-5 §40.

8. Conclusion

The Jiskta ecosystem is designed to be a transparent pane of glass between complex European scientific archives and corporate ESG reporting. By relying strictly on arithmetic aggregation and authoritative data sources, and by rejecting AI-based estimation models, Jiskta guarantees that all output metrics are 100% reproducible and audit-safe.