CDF Datasets

This section collects the atmospheric datasets produced by the Complete Data Fusion (CDF) algorithm — the output products of the fusion process. Each CDF dataset is a new atmospheric product obtained by fusing two or more tested input datasets using the CDF algorithm, inheriting and improving upon the information content of its sources.

A CDF-fused product is itself a full optimal-estimation (OE) product: it comes with its own state vector, averaging kernel matrix, total error covariance matrix, and a priori information. It can therefore be subjected to the same quality tests applied to the input datasets (completeness, auto-consistency), and can in principle be used as input to further fusion steps. The key difference from the input datasets is that a CDF dataset must also demonstrate, through independent validation, that the fusion has produced a genuine improvement over the individual inputs and not merely a formal combination.
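The four quantities that accompany every product can be pictured as one record per profile. The sketch below is a minimal illustration, assuming NumPy arrays on a single vertical grid of n levels; the class name `OEProduct`, its field names, and the `check_shapes` method are hypothetical and are not the project's data format. The check only verifies dimensional consistency; it does not reproduce the completeness or auto-consistency tests.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class OEProduct:
    """Hypothetical container for one optimal-estimation profile."""

    state: np.ndarray             # retrieved profile, shape (n,)
    averaging_kernel: np.ndarray  # averaging kernel matrix, shape (n, n)
    covariance: np.ndarray        # total error covariance matrix, shape (n, n)
    a_priori: np.ndarray          # a priori profile, shape (n,)

    def check_shapes(self) -> None:
        """Raise ValueError if the four quantities are dimensionally inconsistent."""
        if self.state.ndim != 1:
            raise ValueError("state must be a 1-D profile")
        n = self.state.shape[0]
        if self.a_priori.shape != (n,):
            raise ValueError("a_priori must have the same shape as state")
        for name in ("averaging_kernel", "covariance"):
            if getattr(self, name).shape != (n, n):
                raise ValueError(f"{name} must have shape ({n}, {n})")
        if not np.allclose(self.covariance, self.covariance.T):
            raise ValueError("covariance must be symmetric")
```

Because an input product and a fused product would share this layout, the same record could in principle be passed to the quality tests or to a further fusion step.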

Central goal of the EMM project. The production and public release of validated CDF datasets is the primary deliverable of the EMM project. The Tested Datasets section documents what goes into the CDF algorithm and verifies that the inputs are suitable; this section documents what comes out — the fused products, their quality characterization, and their validation against independent measurements. The two sections are designed to be read together, providing a complete, traceable chain from raw satellite products to fused atmospheric data.

What defines a CDF Dataset

Each CDF dataset is uniquely identified by:

Input combination — the specific set of satellite instruments and retrieval products that were fused (e.g., MIPAS + IASI/FORLI)
Atmospheric constituent — the target variable of the fused product (e.g., ozone, temperature)
CDF configuration — the algorithm version (CDF(2022) or CDF(2015)), the reference vertical grid and a priori, and the strategy adopted for interpolation and coincidence errors
Temporal and spatial coverage — the period, geographic extent, and spatio-temporal resolution of the production
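As an illustration, the identifying attributes above could be captured in a small immutable record. This is only a sketch: the class `CDFDatasetId` and its field names are hypothetical, the fields are a simplified subset of the full configuration, and the algorithm version shown for the example instance is a placeholder, since this page does not state which version produced the MIPAS+IASI dataset.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CDFDatasetId:
    """Hypothetical identifier for one CDF dataset (simplified)."""

    inputs: Tuple[str, ...]   # fused instruments or retrieval products
    constituent: str          # target variable, e.g. "O3"
    algorithm_version: str    # "CDF(2022)" or "CDF(2015)"
    vertical_grid: str        # reference vertical grid of the fused product
    period: Tuple[int, int]   # first and last year of the production

    def label(self) -> str:
        """Short name in the style used by the dataset index."""
        return "+".join(self.inputs) + " " + self.constituent


example = CDFDatasetId(
    inputs=("MIPAS", "IASI"),
    constituent="O3",
    algorithm_version="CDF(2022)",  # placeholder value, not stated on this page
    vertical_grid="IASI",
    period=(2008, 2011),
)
print(example.label())
```

Freezing the dataclass makes instances hashable, so they can serve as dictionary keys when indexing several datasets.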

For each dataset, the improvement introduced by the fusion is quantified through two complementary approaches:

Internal characterization — comparison of the fused product’s degrees of freedom (DOFs), averaging kernel diagonal, and total error profiles against those of the individual input datasets. A successful fusion must yield higher DOFs and lower total errors.
Independent validation — comparison of the fused product against reference measurements not used in the fusion (e.g., ozonesondes, ground-based lidars, independent satellite products). The validation must demonstrate that the improvement predicted by the internal characterization translates into a real reduction in bias and/or variability.
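The internal characterization can be sketched numerically. The DOFs of a product are the trace of its averaging kernel matrix, and the total error profile is the square root of the diagonal of the total error covariance matrix. The helper names below are hypothetical, and the comparison assumes that all products are already expressed on the same vertical grid and a priori, since DOFs and errors are not directly comparable otherwise.

```python
import numpy as np


def dofs(averaging_kernel):
    """Degrees of freedom: trace of the averaging kernel matrix."""
    return float(np.trace(averaging_kernel))


def total_error(covariance):
    """Total error profile: square root of the covariance diagonal."""
    return np.sqrt(np.diag(covariance))


def fusion_improves(fused_ak, fused_cov, input_aks, input_covs):
    """True if the fused product has more DOFs than every input and
    a total error that does not exceed any input error at any level."""
    more_dofs = all(dofs(fused_ak) > dofs(ak) for ak in input_aks)
    lower_err = all(
        bool(np.all(total_error(fused_cov) <= total_error(cov)))
        for cov in input_covs
    )
    return more_dofs and lower_err
```

This level-by-level criterion is deliberately strict; a production check might instead compare errors only in the altitude range of interest.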

Template for CDF Dataset documentation

Each CDF dataset is documented in a dedicated child page following a standardized template. The template mirrors and extends the structure used for the input datasets, adding the sections specific to fused products (fusion configuration, improvement characterization, validation). The sections are:

§	Section	Content
1	General information	Constituent, input instruments, observation period, geographic coverage, spatial and temporal resolution
2	Input datasets	Links to the Tested Datasets pages for each input product, with a summary of their auto-consistency test results. This provides full traceability of the fused product back to its sources.
3	CDF configuration	Algorithm version (CDF(2022) / CDF(2015)), reference vertical grid and a priori, coincidence criteria (spatial and temporal windows), interpolation error model, coincidence error model, Gram–Schmidt basis expansion (if applicable)
4	Product characterization	Averaging kernel diagonal, DOFs, total error profiles — shown alongside the corresponding quantities of the input datasets to demonstrate the improvement introduced by the fusion
5	Auto-consistency test	The fused product is itself a full OE product: the same auto-consistency tests applied to the input datasets are applied to the fused product to verify its internal coherence
6	Independent validation	Comparison with reference measurements not used in the fusion (ozonesondes, ground-based instruments, independent satellite products). Bias profiles, standard deviation, correlation statistics — shown for both the fused product and the individual inputs, to quantify the added value of the fusion
7	Data access and citation	FAIR references: persistent identifier (DOI), download URL, data format, licence, recommended citation. The goal is to make all CDF datasets publicly available for the scientific community.
8	Related work	Links to the pilot study that preceded the production, to the published paper describing the dataset (if available), and to the bibliography
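Row 3 of the template records the coincidence criteria as spatial and temporal windows. A minimal sketch of such a test follows, assuming each observation is a (latitude, longitude, time) tuple and using the haversine great-circle distance on a spherical Earth. The function names and the thresholds `max_km` and `max_dt` are hypothetical parameters; this page does not state the values adopted for any dataset.

```python
from datetime import datetime, timedelta
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in km between two points given in degrees."""
    p1, p2 = radians(lat1), radians(lat2)
    dphi = p2 - p1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(p1) * cos(p2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(min(1.0, sqrt(a)))


def is_coincident(obs_a, obs_b, max_km, max_dt):
    """True if two (lat, lon, time) observations fall within both windows."""
    lat_a, lon_a, t_a = obs_a
    lat_b, lon_b, t_b = obs_b
    close_in_space = great_circle_km(lat_a, lon_a, lat_b, lon_b) <= max_km
    close_in_time = abs(t_a - t_b) <= max_dt
    return close_in_space and close_in_time
```

For example, two observations 0.5° apart in latitude and two hours apart would be accepted with `max_km=100` and `max_dt=timedelta(hours=3)`, and rejected if the time window were reduced to one hour.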

Connection to the input datasets. The template is deliberately structured to echo the Tested Datasets pages: the same tests that verify the quality of the inputs are also applied to the outputs, ensuring a consistent quality framework throughout the entire CDF processing chain. The key additions in the CDF Dataset template are §3 (fusion configuration, which has no counterpart in the input template) and §6 (independent validation, which goes beyond the auto-consistency tests used for inputs).

Dataset index

CDF Dataset	Input instruments	Constituent	Period	Reference	Status
MIPAS+IASI O₃	MIPAS/IFAC (limb) + IASI/AERIS (nadir TIR)	O₃	2008–2011	Guidetti et al. (2026)	In production
MIPAS+GOME-2 O₃	MIPAS/IFAC (limb) + GOME-2/AC-SAF (nadir UV)	O₃	2008–2011	—	Planned
MIPAS+IASI+GOME-2 O₃	MIPAS/IFAC + IASI/AERIS + GOME-2/AC-SAF	O₃	2008–2011	—	Future

Legend — status: Published = dataset produced, validated, and publicly available with DOI · In production = dataset being produced, publication in progress · Planned = dedicated tuning and validation study required before production · Future = target combination identified, feasibility demonstrated in pilot studies.

MIPAS+IASI O₃ — first CDF dataset

The first CDF dataset produced from real satellite observations combines MIPAS (limb, Envisat) and IASI (nadir TIR, Metop) ozone profiles over the period 2008–2011. The dataset was developed and validated in the framework of L. Guidetti’s PhD project and is described in Guidetti et al. (2026).

Input instruments: MIPAS/IFAC (limb, ~24.6 DOFs) + IASI/AERIS FORLI-O₃ (nadir, ~2.8 DOFs)
Fusion result: fused O₃ profile on the IASI grid (~3.6 DOFs in the troposphere/lower stratosphere); reduced bias and total errors relative to both inputs
Validation: validated against WOUDC ozonesondes (5–55°N); improvement confirmed on independent years (2009–2011)
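The validation statistics named in the documentation template (bias profile, standard deviation, correlation) can be sketched as follows. This is an illustration only, assuming that product and reference profiles have already been paired by coincidence and brought onto a common vertical grid; the function name and array layout are hypothetical, and the sketch does not reproduce the procedure of Guidetti et al. (2026).

```python
import numpy as np


def validation_stats(product, reference):
    """Per-level statistics of product minus reference.

    Both inputs have shape (n_pairs, n_levels), with n_pairs >= 2.
    Returns (bias, std, corr), each of shape (n_levels,).
    """
    product = np.asarray(product, dtype=float)
    reference = np.asarray(reference, dtype=float)
    diff = product - reference
    bias = diff.mean(axis=0)         # mean difference at each level
    std = diff.std(axis=0, ddof=1)   # sample standard deviation of differences
    corr = np.array([
        np.corrcoef(product[:, k], reference[:, k])[0, 1]
        for k in range(product.shape[1])
    ])
    return bias, std, corr
```

Running the same function on the fused product and on each input, against the same set of reference profiles, gives directly comparable bias and variability profiles.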

The key scientific result is that the contribution of MIPAS improves the quality of the IASI product even in the troposphere, where MIPAS itself does not measure — a direct demonstration of the information propagation mechanism in the CDF framework. The fused product also enables the detection and characterization of stratospheric ozone intrusions that are not resolved by either instrument individually.

Traceability. The input MIPAS dataset is documented in the MIPAS/IFAC tested dataset page; the input IASI dataset in the IASI/AERIS page. Both pass the CDF(2022) auto-consistency test. The exploratory studies that preceded this production are described in the Pilot Studies section.

Planned and future datasets

MIPAS+GOME-2 O₃

The exploratory characterization carried out in the pilot studies has shown that the MIPAS+GOME-2 fusion yields DOFs and error reduction comparable to MIPAS+IASI when both are expressed on the same grid and a priori. However, a dedicated tuning study is required before production: the coincidence and interpolation error strategies must be optimized specifically for the MIPAS+GOME-2 combination, and the fused product must be validated against independent reference measurements (ozonesondes) with the same rigour applied to the MIPAS+IASI dataset.

MIPAS+IASI+GOME-2 O₃ — three-instrument gridded product

The combination of all three instruments — MIPAS (limb), IASI (nadir TIR), and GOME-2 (nadir UV) — on a regular 1°×1° grid over the full 2008–2011 period would represent the most comprehensive CDF ozone product achievable from these missions. The feasibility of this combination has been demonstrated in the pilot studies, and daily coverage analysis shows that the three instruments together provide dense spatial sampling. This dataset would combine MIPAS’s vertical resolution in the stratosphere with the complementary tropospheric sensitivity of IASI and GOME-2, and the dense horizontal coverage of the nadir instruments. Its production requires both the MIPAS+IASI and MIPAS+GOME-2 fusion chains to be individually validated and optimized.
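The regular 1°×1° grid and the notion of daily coverage can be illustrated with a short sketch. It assigns each observation to a grid cell and reports the fraction of cells containing at least one observation. The function names are hypothetical, the fraction is not area-weighted, and this is not the coverage analysis carried out in the pilot studies.

```python
from math import floor


def cell_index(lat, lon, step=1.0):
    """Return (row, col) of the step x step cell containing the point.

    Rows count from 90°S, columns from 180°W; longitudes are wrapped.
    """
    n_rows = int(round(180.0 / step))
    n_cols = int(round(360.0 / step))
    row = min(int(floor((lat + 90.0) / step)), n_rows - 1)  # keep 90°N in the top row
    col = int(floor(((lon + 180.0) % 360.0) / step)) % n_cols
    return row, col


def coverage_fraction(points, step=1.0):
    """Fraction of grid cells hit by at least one (lat, lon) point."""
    n_cells = int(round(180.0 / step)) * int(round(360.0 / step))
    cells = {cell_index(lat, lon, step) for lat, lon in points}
    return len(cells) / n_cells
```

Applying `coverage_fraction` to the observations of one day from each instrument, and then to their union, would show how much the combination adds to the horizontal sampling.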

Guiding principles

The production and documentation of CDF datasets follow these principles:

Full traceability: every fused product is linked back to its input datasets, which are independently tested and documented in the Tested Datasets section. The fusion configuration (algorithm version, grid, a priori, error strategies) is recorded in detail.
Self-consistency: the fused product is treated as a first-class OE product and subjected to the same auto-consistency tests used for input datasets. A fused product that fails its own auto-consistency test signals a problem in the fusion process.
Independent validation: internal characterization (DOFs, errors) is necessary but not sufficient. Each dataset must be validated against reference measurements that were not part of the fusion, and the validation must cover both the tuning period and an independent temporal segment.
FAIR data: all validated CDF datasets will be published with persistent identifiers (DOIs), open access, standardized metadata, and recommended citations, following the FAIR (Findable, Accessible, Interoperable, Reusable) principles for scientific data.
Reusability: because CDF-fused products carry a full OE characterization (state vector, averaging kernel (AK), variance-covariance matrix (VCM), a priori), they can be ingested by data assimilation systems, used as input to further CDF steps, or compared with model output using standard AK-smoothing techniques.
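The AK-smoothing technique mentioned under reusability is the standard way to compare a model profile with an OE product: the model profile is degraded to the vertical resolution of the product as x_s = x_a + A (x_model - x_a), where A is the averaging kernel matrix and x_a the a priori profile. A minimal sketch follows, assuming the model profile has already been interpolated to the vertical grid of the product; the function name is hypothetical.

```python
import numpy as np


def ak_smooth(x_model, x_a, averaging_kernel):
    """Smooth a model profile with the averaging kernel of an OE product.

    Implements x_s = x_a + A (x_model - x_a); all inputs share one grid.
    """
    x_model = np.asarray(x_model, dtype=float)
    x_a = np.asarray(x_a, dtype=float)
    return x_a + np.asarray(averaging_kernel, dtype=float) @ (x_model - x_a)
```

With an identity averaging kernel the model profile is returned unchanged, and with a zero averaging kernel the result collapses to the a priori, which are the two limiting cases of a perfectly sensitive and a completely insensitive measurement.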

Related pages

Tested Datasets: the input datasets used by the CDF algorithm, with completeness and auto-consistency test results
Pilot Studies: exploratory fusion experiments on real data that precede systematic production
CDF Algorithm: mathematical formulation, prerequisites, and test descriptions
CDF Tests: auto-consistency and mono-type fusion test descriptions
Bibliography: annotated references for the CDF algorithm and its applications

