This page collects the open questions, planned work, and longer-term research directions identified in the course of the EMM project. They span four areas: algorithm development, dataset production and testing, future instruments and missions, and tools and infrastructure. Items within each area are roughly ordered from most concrete and near-term to most exploratory.
1. Algorithm Development
These are open theoretical and methodological questions about the CDF algorithm itself — its mathematical foundations, numerical robustness, and extensions to new measurement scenarios. Most of these items affect the correctness or generality of the algorithm regardless of which instruments are used.
1.1 Rigorous test criteria
The success conditions for both the auto-consistency test and the mono-type fusion test are currently defined heuristically. For the auto-consistency test, the condition is informally “differences must be within total error”; for the mono-type fusion test, the fused product must show “higher DOFs and lower errors.” Neither criterion has been given a precise mathematical formulation that could be applied objectively across all datasets.
Developing rigorous, quantitative thresholds — ideally derived from the statistical properties of the input products rather than set empirically — would make the tests reproducible and interpretable as genuine data quality indicators. This is particularly important for the auto-consistency test, which doubles as a diagnostic tool for the uncertainty characterization stored in the input files.
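One way such a threshold could be made quantitative is a reduced chi-square statistic on the difference between output and input profiles. The sketch below is an illustration only: the function name auto_consistency_chi2, the use of the total covariance as the metric, and the acceptance limit of 1 are assumptions, not criteria adopted by the project.

```python
import numpy as np

def auto_consistency_chi2(x_out, x_in, s_total):
    """Reduced chi-square of the output-input difference (hypothetical criterion)."""
    d = np.asarray(x_out, float) - np.asarray(x_in, float)
    # Solve S z = d rather than forming the explicit inverse of S.
    z = np.linalg.solve(np.asarray(s_total, float), d)
    return float(d @ z) / d.size

# Toy 3-level profile with uncorrelated 5 % total errors (made-up numbers).
x_in = np.array([1.0, 2.0, 3.0])
s_total = np.diag((0.05 * x_in) ** 2)
x_out = x_in + np.array([0.01, -0.02, 0.03])

chi2_red = auto_consistency_chi2(x_out, x_in, s_total)
passed = chi2_red <= 1.0  # assumed limit, to be replaced by a derived threshold
print(f"reduced chi-square = {chi2_red:.3f}, passed = {passed}")
```

A statistic of this kind accounts for vertical correlations through the off-diagonal elements of the covariance, which a level-by-level comparison against the error bars does not.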
1.2 Guarantee of characterization matrix consistency in the output
A well-formed OE product must satisfy three algebraic relationships among its characterization matrices (P1–P3, see the Prerequisites page). The current CDF formulations derive the output matrices directly from the input quantities without explicitly enforcing these relationships. While in theory they are satisfied exactly, numerical errors can break them in practice, leaving the fused product internally inconsistent.
An improved CDF(2022) formulation is needed that guarantees P1–P3 hold for the output matrices exactly, independently of numerical round-off. This would both increase robustness and eliminate a potential source of downstream errors when the fused product is used as input to further processing steps (assimilation, additional CDF, etc.).
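The kind of check involved can be sketched with two textbook identities of linear optimal estimation, S = (I - A) S_a and S_n = A S. Whether these coincide with P1–P3 is not established here (the definitions are on the Prerequisites page), so they serve only as stand-ins; the toy Jacobian, the covariances, and the function names are assumptions.

```python
import numpy as np

def oe_characterization(k, s_eps, s_a):
    """Textbook linear OE matrices from a Jacobian and two covariances."""
    s_eps_inv = np.linalg.inv(s_eps)
    s_total = np.linalg.inv(k.T @ s_eps_inv @ k + np.linalg.inv(s_a))
    gain = s_total @ k.T @ s_eps_inv
    a = gain @ k                      # averaging kernel
    s_noise = gain @ s_eps @ gain.T   # retrieval noise covariance
    return a, s_total, s_noise

def identity_residuals(a, s_total, s_noise, s_a):
    """Relative Frobenius residuals of S = (I - A) S_a and S_n = A S."""
    eye = np.eye(a.shape[0])
    r1 = np.linalg.norm(s_total - (eye - a) @ s_a) / np.linalg.norm(s_total)
    r2 = np.linalg.norm(s_noise - a @ s_total) / np.linalg.norm(s_noise)
    return float(r1), float(r2)

rng = np.random.default_rng(0)
k = rng.normal(size=(6, 4))           # toy Jacobian
s_eps = 0.1 * np.eye(6)
s_a = np.eye(4)

a, s_total, s_noise = oe_characterization(k, s_eps, s_a)
exact = identity_residuals(a, s_total, s_noise, s_a)

# Mimic a file that stores the averaging kernel with three decimals only.
stored = identity_residuals(np.round(a, 3), s_total, s_noise, s_a)
print("full precision:", exact)
print("rounded AK:    ", stored)
```

Residuals of this type could be evaluated on the fused output as a diagnostic, independently of how the guarantee itself is eventually built into the formulation.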
1.3 Mathematical analysis of the CDF(2015) auto-consistency condition
When CDF(2015) is applied in auto-consistency mode, the output profile reduces to the input profile only if the additional condition $\mathbf{A}_i^T \mathbf{S}_{ni}^{-1} \approx \mathbf{S}_i^{-1}$ is satisfied. This condition is not always met by real data products, and it is not yet understood what it means physically or statistically, or under what circumstances it can be expected to hold. A mathematical analysis of this condition (its relationship to the properties of the retrieval, its sensitivity to singular covariance structures, and its implications for the validity of CDF(2015) applications) remains an open problem.
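How far a given product is from satisfying the condition can at least be measured. The sketch below computes a relative residual, assuming that S_i denotes the total covariance and S_ni the noise covariance of product i; the function name and both toy cases are illustrative. In the first case the matrices are generated from the textbook linear OE formulas with invertible matrices, and the residual is at round-off level by construction of the toy. This says nothing about real products, where the noise covariance may be singular or the stored matrices may not be mutually consistent.

```python
import numpy as np

def condition_residual(a, s_noise, s_total):
    """Relative Frobenius distance between A^T S_n^-1 and S^-1."""
    lhs = a.T @ np.linalg.pinv(s_noise)   # pseudo-inverse tolerates singular S_n
    rhs = np.linalg.inv(s_total)
    return float(np.linalg.norm(lhs - rhs) / np.linalg.norm(rhs))

# Case 1: matrices generated from the textbook linear OE formulas (S_a = I).
rng = np.random.default_rng(1)
k = rng.normal(size=(6, 4))
s_eps_inv = np.eye(6) / 0.1
s_total = np.linalg.inv(k.T @ s_eps_inv @ k + np.eye(4))
a = s_total @ k.T @ s_eps_inv @ k
s_noise = a @ s_total
res_oe = condition_residual(a, s_noise, s_total)

# Case 2: a deliberately inconsistent triplet of matrices.
res_bad = condition_residual(0.5 * np.eye(4), np.eye(4), np.eye(4))

print(f"textbook OE toy: {res_oe:.2e}, inconsistent toy: {res_bad:.2f}")
```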
1.4 A priori constraint strength
The CDF algorithm requires a choice of a priori profile and variance-covariance matrix for the fused product. This choice determines how strongly the output is constrained toward the a priori in regions where measurement information is limited. Two related problems remain open:
- For fused products: develop a principled method for selecting the fused-product a priori, ensuring that the constraint neither suppresses real atmospheric variability nor leaves the solution poorly constrained in unmeasured altitude regions.
- For generic OE products: the same problem exists for individual retrievals; understanding it in the single-instrument context is a prerequisite for addressing it in the fusion context.
The pilot studies have already shown empirically that the choice of a priori matters: when GOME-2 is reprojected onto the IASI grid using the IASI a priori, approximately 7 DOFs are lost compared to operating on the native GOME-2 grid. A theoretical understanding of this effect would guide the choice of reference a priori in future fusion configurations.
1.5 Coincidence and interpolation errors
Handling the errors introduced by imperfect spatial/temporal coincidence of measurements and by vertical interpolation between different grids is one of the most practically important open problems. Several aspects need attention:
- Coincidence error quantification: current approaches use fixed adjustments based on a priori atmospheric variability; more adaptive methods that account for the actual atmospheric state at the coincidence location and time are needed.
- Interpolation error strategies: nine different strategies were compared in the MIPAS+IASI pilot study, but no strategy has been shown to be universally optimal. A systematic theoretical analysis of the trade-offs — and validation on diverse instrument combinations — is needed.
- CDF(2022) formulas for configurations B and C: while Ceccherini et al. (2022) derived the CDF(2022) counterparts of the interpolation/coincidence error formulas, their properties in edge cases (very different grids, asymmetric coincidence windows) have not been fully characterized.
1.6 Systematic errors
The current CDF formulations assume that all input measurement errors are random (zero mean). Systematic errors — offsets and biases present in individual products — are not accounted for and can, in principle, propagate into the fused product in non-trivial ways. Developing a principled treatment of systematic errors in the CDF context is an important open problem, especially as CDF is applied to instrument combinations with known inter-calibration offsets.
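The mechanism can be illustrated with the scalar analogue of fusion, an inverse-variance weighted mean. This is a simplification and not the CDF formula; all numbers are made up. An offset in one input reaches the fused value scaled by that input's weight, while the reported variance is unchanged and therefore does not reflect it.

```python
def fuse_scalar(values, variances):
    """Inverse-variance weighted mean and its variance (scalar analogue of fusion)."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    fused = sum(w * x for w, x in zip(weights, values)) / total
    return fused, 1.0 / total

truth = 300.0              # made-up ozone column, DU
variances = [4.0, 1.0]     # random-error variances of two products
biases = [3.0, 0.0]        # product 1 carries an uncorrected offset

unbiased, var = fuse_scalar([truth, truth], variances)
biased, _ = fuse_scalar([truth + b for b in biases], variances)
fused_bias = biased - unbiased

print(f"fused bias = {fused_bias:.2f} DU, reported variance = {var:.2f} DU^2")
```

Here the weights are 0.2 and 0.8, so a 3 DU offset in the noisier product appears as 0.6 DU in the fused value.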
1.7 Fusion on non-overlapping vertical grids
The standard CDF formulation requires all input products to share a common vertical range. An extended algorithm using a union-of-grids approach (described on the Extensions page) has been developed, but it has not yet been validated on real data. The correct handling of altitude boundaries, the physical interpretation of the fused product in regions covered by only one instrument, and the behaviour of the averaging kernels at grid edges all require careful empirical validation before the extension can be used in production.
1.8 State vectors with components on different vertical grids
When fusing multi-target retrieval (MTR) products whose different atmospheric species are each defined on a different vertical grid (e.g., ozone on a 41-layer grid and CO on a 19-layer grid), the standard interpolation framework cannot be applied directly. The extension requires defining per-species interpolation matrices and generalizing the sampling matrices accordingly. This extension is particularly relevant for IASI/EUMETSAT products, where different constituents are already retrieved on different grids.
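A minimal sketch of per-species interpolation follows, assuming simple linear interpolation of level values and a hypothetical 25-level common grid. The altitude grids are invented for illustration, and layer-integrated quantities would need a different operator. Each species gets its own interpolation matrix, and the matrices are assembled into one block-diagonal operator acting on the stacked state vector.

```python
import numpy as np

def linear_interp_matrix(src_grid, dst_grid):
    """Matrix H such that H @ profile_on_src gives the profile on dst (linear)."""
    src = np.asarray(src_grid, float)
    dst = np.asarray(dst_grid, float)
    # Interpolating each unit vector yields one column of H.
    cols = [np.interp(dst, src, e) for e in np.eye(src.size)]
    return np.column_stack(cols)

def block_diag(blocks):
    """Assemble per-species matrices into one block-diagonal operator."""
    rows = sum(b.shape[0] for b in blocks)
    cols = sum(b.shape[1] for b in blocks)
    out = np.zeros((rows, cols))
    r = c = 0
    for b in blocks:
        out[r:r + b.shape[0], c:c + b.shape[1]] = b
        r += b.shape[0]
        c += b.shape[1]
    return out

o3_grid = np.linspace(0.0, 80.0, 41)   # toy 41-point ozone grid, km
co_grid = np.linspace(0.0, 72.0, 19)   # toy 19-point CO grid, km
common = np.linspace(0.0, 60.0, 25)    # hypothetical common grid, km

h_o3 = linear_interp_matrix(o3_grid, common)
h_co = linear_interp_matrix(co_grid, common)
h = block_diag([h_o3, h_co])           # acts on the stacked [O3, CO] state vector
print(h.shape)
```

The off-diagonal blocks are zero because interpolation does not mix species; cross-species correlations would enter only through the covariance matrices.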
1.9 Numerical stability and error analysis
The CDF equations involve repeated matrix inversions on products with potentially large dynamic range. Detailed analysis of truncation and round-off errors — and their dependence on the condition numbers of the covariance matrices and the magnitude of the state vector components — is needed to establish practical guidelines for safe numerical implementation. This includes defining diagnostic metrics to detect numerical instabilities before they corrupt the fused product silently.
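A diagnostic of this kind could start from a few inexpensive indicators computed before any inversion. In the sketch below the function name, the returned fields, and the threshold of 1e12 are assumptions chosen for illustration, not recommended values.

```python
import numpy as np

def check_covariance(s, max_cond=1e12):
    """Basic health indicators for a covariance matrix (threshold is assumed)."""
    s = np.asarray(s, float)
    asym = np.linalg.norm(s - s.T) / np.linalg.norm(s)
    # Eigenvalues of the symmetrized matrix, returned in ascending order.
    eigvals = np.linalg.eigvalsh(0.5 * (s + s.T))
    cond = np.inf if eigvals[0] <= 0 else eigvals[-1] / eigvals[0]
    return {
        "asymmetry": float(asym),
        "min_eigenvalue": float(eigvals[0]),
        "condition_number": float(cond),
        "ok": bool(eigvals[0] > 0 and cond < max_cond),
    }

good = check_covariance(np.diag([1.0, 2.0, 3.0]))
bad = check_covariance(np.diag([1.0, 1e-14]))
print(good["condition_number"], good["ok"])
print(bad["condition_number"], bad["ok"])
```

Where a linear system is all that is needed, np.linalg.solve avoids forming the explicit inverse, which removes one source of round-off error.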
1.10 Optimization of a priori constraint strength
The CDF framework allows, in principle, a choice of the overall strength of the a priori constraint applied to the fused product, independently of the choice of a priori profile (item 1.4). In practice, the a priori covariance matrix $\mathbf{S}_a$ can be scaled by a regularization parameter whose optimal value is not known in advance. Standard regularization theory offers objective criteria for this choice (L-curve, generalized cross-validation, discrepancy principle), but their adaptation to the CDF context, where the “signal” is the fused atmospheric state and the “noise” comes from multiple heterogeneous instrument sources, has not been studied. An overly strong constraint suppresses the measurement information contributed by the instruments; an overly weak one leads to poorly constrained solutions in altitude regions where no instrument has sensitivity. Developing an objective, data-driven method to optimize the constraint strength would improve the reliability and reproducibility of CDF-fused products across different instrument combinations and atmospheric regimes.
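The trade-off can be made visible by scanning the scaling parameter and recording the resulting degrees of freedom. The sketch below does not implement any of the selection criteria named above; it only shows the quantity they would have to balance, for a toy information matrix with decreasing sensitivity. The matrix values and the name dofs_for_scaling are assumptions.

```python
import numpy as np

def dofs_for_scaling(fisher, s_a, gammas):
    """DOFs of a linear OE-type solution when S_a is replaced by gamma * S_a."""
    s_a_inv = np.linalg.inv(s_a)
    out = []
    for g in gammas:
        # Averaging kernel A = (F + S_a^-1 / gamma)^-1 F, DOFs = trace(A).
        a = np.linalg.solve(fisher + s_a_inv / g, fisher)
        out.append(float(np.trace(a)))
    return np.array(out)

fisher = np.diag([100.0, 10.0, 1.0, 0.1])   # toy measurement information
s_a = np.eye(4)
gammas = np.array([0.1, 1.0, 10.0])

dofs = dofs_for_scaling(fisher, s_a, gammas)
for g, d in zip(gammas, dofs):
    print(f"gamma = {g:5.1f}  DOFs = {d:.3f}")
```

In this toy case the DOFs grow from about 1.5 to about 3.4 as the constraint is relaxed by two orders of magnitude; an objective criterion would have to decide where along this curve the added DOFs stop being supported by the measurements.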
1.11 Extension of CDF to non-optimal-estimation products
The CDF algorithm, in its current formulation, requires all input products to be retrieved using the optimal estimation (OE) method, so that the state vector, averaging kernel matrix, and error covariance matrices are available. An increasing number of operational and research satellite products are, however, derived using methods that do not directly provide these quantities: neural networks, machine learning retrievals, empirical regression algorithms, or physical retrieval codes that do not propagate full error covariance information. Extending the CDF framework to accommodate such products would substantially broaden the range of satellite data that can participate in CDF-based fusion. Two routes are conceivable: developing a surrogate OE characterization from the available information (e.g., deriving approximate AK and VCM from ensemble or perturbation methods), or reformulating the fusion to work with weaker characterization requirements. This extension is particularly relevant in view of next-generation operational missions, where machine learning retrievals are expected to become increasingly common.
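The perturbation route can be sketched as a finite-difference estimate of the sensitivity of the retrieved state to the true state, treating the retrieval as a black box. The function perturbation_ak, its arguments, and the linear toy model are assumptions; for a real machine learning retrieval the result would depend on the reference state and on the step size.

```python
import numpy as np

def perturbation_ak(retrieve, forward, x_ref, rel_step=1e-3):
    """Finite-difference estimate of d(retrieved state)/d(true state) at x_ref."""
    x_ref = np.asarray(x_ref, float)
    base = retrieve(forward(x_ref))
    ak = np.zeros((base.size, x_ref.size))
    for j in range(x_ref.size):
        step = rel_step * max(abs(x_ref[j]), 1.0)
        x_pert = x_ref.copy()
        x_pert[j] += step
        # Column j: response of the retrieval to a perturbation of element j.
        ak[:, j] = (retrieve(forward(x_pert)) - base) / step
    return ak

# Linear toy: forward model K and a black-box "retrieval" G (both made up).
k = np.array([[1.0, 0.5, 0.0],
              [0.0, 1.0, 0.5]])
g = 0.5 * k.T
ak_est = perturbation_ak(lambda y: g @ y, lambda x: k @ x, [1.0, 2.0, 3.0])
print(np.round(ak_est, 3))
```

For this linear toy the estimate reproduces the product G K; a surrogate covariance would require a separate step, for example propagating measurement noise through the same black box.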
2. Dataset Production and Testing
These items concern the application of the CDF algorithm to real satellite data — producing, validating, and characterizing the output datasets, and extending the testing of input datasets.
2.1 Smart-averages for all tested datasets
Mono-type fusion — fusing multiple collocated products from the same instrument into a single higher-quality product — is a useful application of CDF in its own right: it produces a regridded L3-like product that is free from the a priori bias of the individual inputs. For GOME-2 and IASI, preliminary results are already available from the 2021 pilot study. For all five tested datasets (GOME-2/AC-SAF, IASI/AERIS, IASI/EUMETSAT, MIPAS/IFAC, OMPS/NASA), dedicated scripts need to be developed to produce systematic smart-average results and document them in the corresponding dataset pages.
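What such a script computes at its core can be sketched as follows, assuming the commonly quoted CDF form in which each product contributes A^T S_n^-1 A to the information matrix and all products already share one vertical grid. The dictionary layout, the function names, and the synthetic noise-free products are assumptions for illustration; the project library may organize this differently.

```python
import numpy as np

def make_oe_product(k, s_eps, x_a, s_a, x_true):
    """Noise-free synthetic OE retrieval and its characterization (toy helper)."""
    s_eps_inv = np.linalg.inv(s_eps)
    s_total = np.linalg.inv(k.T @ s_eps_inv @ k + np.linalg.inv(s_a))
    gain = s_total @ k.T @ s_eps_inv
    a = gain @ k
    x_hat = x_a + a @ (x_true - x_a)
    return {"x": x_hat, "a": a, "s_noise": gain @ s_eps @ gain.T, "x_a": x_a}

def cdf_fuse(products, x_a, s_a):
    """CDF-style fusion on a common grid (assumed form, see the text above)."""
    n = x_a.size
    s_a_inv = np.linalg.inv(s_a)
    info = np.zeros((n, n))
    rhs = s_a_inv @ x_a
    for p in products:
        # alpha removes the product's own a priori contribution.
        alpha = p["x"] - (np.eye(n) - p["a"]) @ p["x_a"]
        w = p["a"].T @ np.linalg.pinv(p["s_noise"])   # A^T S_n^-1
        info += w @ p["a"]
        rhs += w @ alpha
    s_fused = np.linalg.inv(info + s_a_inv)
    return s_fused @ rhs, s_fused @ info, s_fused

rng = np.random.default_rng(2)
x_true = np.array([1.0, 2.0, 3.0])
x_a, s_a = np.zeros(3), np.eye(3)
products = [
    make_oe_product(rng.normal(size=(5, 3)), 0.1 * np.eye(5), x_a, s_a, x_true)
    for _ in range(2)
]
x_f, a_f, s_f = cdf_fuse(products, x_a, s_a)
dofs_in = [float(np.trace(p["a"])) for p in products]
dofs_fused = float(np.trace(a_f))
print("input DOFs:", np.round(dofs_in, 3), "fused DOFs:", round(dofs_fused, 3))
```

In this toy setup the fused DOFs exceed those of either input, which is the behaviour the mono-type fusion test looks for.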
2.2 MIPAS+GOME-2 O3 — dedicated fusion study
The MIPAS+IASI fusion has been thoroughly characterized and validated (Guidetti et al., 2026). The MIPAS+GOME-2 configuration was explored in the 2025 pilot study and shows comparable improvement, but a dedicated study is needed before production: optimization of the coincidence/interpolation error strategy specifically for the GOME-2 product characteristics, full validation against WOUDC ozonesondes comparable to that performed for MIPAS+IASI, and assessment of the sensitivity to the choice of a priori and reference grid.
2.3 MIPAS+IASI+GOME-2 gridded product (1°×1°, 2008–2011)
Combining all three instruments on a regular 1°×1° grid over the full four-year period 2008–2011 is the most comprehensive ozone product achievable from these missions. Coverage analysis has shown that the three instruments together provide dense spatial and temporal sampling. This production requires both the MIPAS+IASI and MIPAS+GOME-2 fusion chains to be individually validated and optimized, making it contingent on item 2.2 above.
2.4 Open issues in existing tested datasets
Two unresolved issues in the current tested dataset documentation require attention:
- MIPAS/IFAC state vector dimension: the dataset page refers to 91 state vector elements while the test scripts operate on 81. The discrepancy likely reflects the inclusion of 10 continuum parameters in the full state vector; this needs to be confirmed and documented explicitly.
- OMPS/NASA AK reading: two variants of averaging kernel reading from the OMPS product file (from the AK array directly, or reconstructed from the Jacobian K) give very different DOFs (4.91 vs. 9.08) and test outcomes. The correct interpretation of the file format needs to be established before OMPS can be used as a reliable CDF input.
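For the OMPS issue, a side-by-side comparison of the two readings could look like the sketch below. It assumes the textbook relation A = (K^T S_eps^-1 K + S_a^-1)^-1 K^T S_eps^-1 K and that the covariances needed to apply it can be obtained; whether the OMPS file provides them, and under which variable names, is not established here. The toy shows how a mismatch in the assumed a priori covariance alone changes the DOFs.

```python
import numpy as np

def ak_from_jacobian(k, s_eps, s_a):
    """Averaging kernel rebuilt from the Jacobian with the textbook OE relation."""
    fisher = k.T @ np.linalg.solve(s_eps, k)
    return np.linalg.solve(fisher + np.linalg.inv(s_a), fisher)

def compare_ak_variants(ak_stored, k, s_eps, s_a):
    """DOFs of the stored and rebuilt averaging kernels, plus their largest gap."""
    ak_rebuilt = ak_from_jacobian(k, s_eps, s_a)
    return {
        "dofs_stored": float(np.trace(ak_stored)),
        "dofs_rebuilt": float(np.trace(ak_rebuilt)),
        "max_abs_diff": float(np.max(np.abs(ak_stored - ak_rebuilt))),
    }

# Toy: the "stored" AK was produced with S_a = I, the reader assumes 3 * I.
k = np.eye(3)
s_eps = np.eye(3)
ak_stored = ak_from_jacobian(k, s_eps, np.eye(3))
report = compare_ak_variants(ak_stored, k, s_eps, 3.0 * np.eye(3))
print(report)
```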
2.5 Validation methodology
The validation of the MIPAS+IASI dataset against WOUDC ozonesondes established a methodology (AK-smoothing of the reference, stratification by latitude band, independent temporal validation) that should be standardized and applied consistently to all future CDF datasets. A shared validation framework — with agreed reference datasets, smoothing procedures, statistical metrics, and reporting format — would make the results comparable across different instrument combinations and time periods.
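The AK-smoothing step has a standard form: the reference profile is brought to the retrieval grid and then passed through the averaging kernel, x_s = x_a + A (x_ref - x_a). The sketch below assumes level values and linear interpolation; a product expressed as partial columns would need layer integration instead. Grids, profile values, and the function name are made up.

```python
import numpy as np

def smooth_reference(z_ref, x_ref, z_ret, a, x_a):
    """Interpolate a high-resolution reference to the retrieval grid, then apply the AK."""
    x_interp = np.interp(z_ret, z_ref, x_ref)
    return x_a + a @ (x_interp - x_a)

z_ref = np.linspace(0.0, 30.0, 301)    # toy sonde altitudes, km
x_ref = 1.0 + 0.1 * z_ref              # toy profile, arbitrary units
z_ret = np.array([5.0, 15.0, 25.0])    # toy retrieval grid, km
x_a = np.array([2.0, 2.0, 2.0])        # toy a priori on the retrieval grid

# Limiting cases: perfect sensitivity returns the reference,
# zero sensitivity returns the a priori.
perfect = smooth_reference(z_ref, x_ref, z_ret, np.eye(3), x_a)
blind = smooth_reference(z_ref, x_ref, z_ret, np.zeros((3, 3)), x_a)
print(perfect, blind)
```

Fixing this operation in a shared function is one way to ensure that every dataset is compared with the reference under the same smoothing convention.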
3. Future Instruments and Missions
The CDF framework is not limited to the instrument combinations currently tested. Several extensions to operational and future satellite missions are of interest.
3.1 TROPOMI (Sentinel-5P)
TROPOMI is an operational nadir-viewing UV–VIS–NIR–SWIR spectrometer on Sentinel-5P (launched 2017) with very high horizontal resolution (~3.5 × 5.5 km). Its ozone profile product is a natural candidate for fusion with limb instruments of MIPAS heritage or nadir instruments of the IASI type. The feasibility of integrating TROPOMI L2 products into the CDF framework was identified as a priority in the 2021 pilot study.
3.2 MetOp-SG: UVNS and IASI-NG
The next generation of EUMETSAT polar satellites (MetOp-SG, first launch 2025) carries UVNS (ultraviolet–visible–near-infrared–shortwave-infrared sounder, the Sentinel-5 instrument) and IASI-NG (next-generation thermal infrared sounder). Simulated fusion studies (Zoppetti et al., 2021) have demonstrated significant gains from combining these two instruments. Applying CDF to real UVNS and IASI-NG data as they become available is a direct follow-on to the simulation work.
3.3 MTG: UVN and IRS
The Meteosat Third Generation (MTG) geostationary satellites carry a UV–VIS–NIR sounder (UVN, the Sentinel-4 instrument) and an Infrared Sounder (IRS). Their geostationary orbit provides continuous temporal sampling over Europe and Africa. Fusion of MTG instruments with LEO sounders (MetOp-SG) exploits the complementarity of geostationary temporal coverage with LEO vertical resolution, the scenario studied in simulation by Zoppetti et al. (2021) and Tirelli et al. (2020).
3.4 Expanded constituent coverage
The current EMM work focuses on ozone. The CDF framework is constituent-agnostic and could be applied to other atmospheric species for which complementary satellite retrievals are available: water vapour (IASI TIR + GPS radio occultation), methane (TROPOMI + IASI), carbon monoxide, and temperature profiles. Multi-target retrieval (MTR) fusion — where the state vectors contain multiple species simultaneously — allows the correlations among species within a single retrieval to be preserved and propagated into the fused product.
4. Tools and Infrastructure
These items concern the software environment, data access, and dissemination infrastructure needed to scale CDF from a research tool to an operational service.
4.1 CDF online demonstrator
An interactive demonstrator — accessible via the web — that allows users to upload two L2 products and obtain a CDF-fused result in near-real time was identified as a goal in the 2021 pilot study. Such a tool would lower the barrier to testing CDF with new instrument combinations and serve as both an educational resource and a practical validation aid.
4.2 Integration with EUMETSAT data catalog
Integration of the CDF processing chain with operational data delivery systems (EUMETSAT Data Store, AERIS, AC-SAF) would enable on-demand fusion of newly acquired L2 products as part of routine data processing. This is a prerequisite for any real-time or near-real-time CDF application.
4.3 FAIR data publication of CDF datasets
All validated CDF datasets will be published following the FAIR principles (Findable, Accessible, Interoperable, Reusable): persistent DOIs, standardized netCDF format with CF-compliant metadata, complete uncertainty characterization (state vector, AK, VCM, a priori), and an open licence. Because CDF-fused products carry full OE characterization, they are immediately reusable in data assimilation systems without further processing. This is a significant advantage over L3 products, which typically discard the averaging kernel information.
4.4 Assimilation-ready output format
Data assimilation systems (NWP models, chemical transport models) require input observations in formats compatible with their observation operators. Defining a standard output format for CDF datasets — one that exposes the averaging kernel, error covariance, and a priori in a way directly compatible with major assimilation frameworks (ECMWF, NEMO, GEOS-Chem) — would maximize the scientific impact of the fused products and is a logical next step after FAIR publication.
4.5 Python library: reorganization, optimization, and publication
The CDF algorithm is currently implemented as a Python 3 library developed at IFAC-CNR in the course of the EMM project. The library is functional and has been used to produce all the results documented on this website, but it was developed primarily as a research tool and has not yet been prepared for public release. Several steps are needed before publication:
- Reorganization — review and rationalize the module structure, separating the core CDF algorithm (formulas, configurations, error handling) from application-specific scripts (dataset readers, coincidence search, validation tools) and from utilities (plotting, I/O, grid interpolation).
- Code cleanup — remove dead code and experimental branches, harmonize naming conventions, ensure consistent use of physical units and variable naming throughout, and add docstrings to all public functions and classes.
- Optimization — profile the most computationally intensive operations (matrix inversions, interpolation, coincidence search over large datasets) and identify bottlenecks. Implement vectorized or sparse-matrix approaches where appropriate to allow scaling to the full multi-year datasets planned in section 2.
- Testing — develop a test suite covering the core CDF formulas (unit tests against analytical results), the auto-consistency and mono-type fusion tests (regression tests against the reference outputs documented on this website), and the dataset readers (integration tests against the actual file formats).
- Publication — publish the library under an open-source licence (e.g., MIT or EUPL) with a persistent identifier (DOI via Zenodo or similar), a citation entry, and versioned releases aligned with the CDF dataset publications. A short code paper or software note in a relevant journal (e.g., Geoscientific Model Development or Journal of Open Source Software) would provide a citable reference for users.
Publication of the library is a prerequisite for full reproducibility of the CDF datasets: users who wish to verify or extend the published results must be able to access and run the same code that produced them.
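The Testing step listed above could start from unit tests of the following shape, which compare a fusion routine with results known analytically. Because the library's public interface is not documented on this page, the function fuse below is a hypothetical stand-in (a scalar inverse-variance mean); in the real suite it would be replaced by an import from the library.

```python
import io
import unittest

def fuse(values, variances):
    """Stand-in for the library's fusion entry point (hypothetical name and signature)."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    return sum(w * x for w, x in zip(weights, values)) / total, 1.0 / total

class TestFusionAgainstAnalytic(unittest.TestCase):
    def test_equal_variances_give_arithmetic_mean(self):
        fused, _ = fuse([1.0, 3.0], [2.0, 2.0])
        self.assertAlmostEqual(fused, 2.0)

    def test_n_identical_inputs_divide_variance_by_n(self):
        _, var = fuse([5.0] * 4, [2.0] * 4)
        self.assertAlmostEqual(var, 0.5)

def run_suite():
    """Run the test case programmatically and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestFusionAgainstAnalytic)
    return unittest.TextTestRunner(stream=io.StringIO(), verbosity=0).run(suite)

result = run_suite()
print("tests run:", result.testsRun, "successful:", result.wasSuccessful())
```

Regression tests against the reference outputs documented on this website would follow the same pattern, with stored arrays in place of the analytical values.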
Items from this page that are already in progress or have dedicated pages elsewhere on this site: MIPAS+IASI O3 production → CDF Datasets; pilot studies for MIPAS+IASI and IASI+GOME-2 → Pilot Studies; algorithmic details of extensions and numerical issues → Extensions and Algorithm — Open Questions.
