D5.4 Tape Archive Interface
The climate modelling community is preparing for the next major model inter-comparison programme, CMIP6. This activity and associated modelling projects are expected to generate an archive in the Earth System Grid Federation (ESGF) of at least 30-50 petabytes over the next 5 years. Making this available online to users is costly, and trade-off decisions have to be made.
Experience gained with the CMIP5 (Fifth Coupled Model Intercomparison Project) archive would suggest that a centralised online model may need to be adapted to make better use of the climate data produced and held by the modelling centres; specifically:
- A significant proportion of the data held in the archive is rarely or never accessed, yet the cost of submitting, archiving and managing the data in ESGF is high, and the modelling groups that generate these data also maintain their own tape/disk archives.
- Users outside the groups involved in model inter-comparison often require data that were produced as part of the MIP climate simulations but were not requested for archiving by the CMIP5 project (especially high-volume sub-daily data for use in regional simulations).
It is clear that the ESGF will need to adopt a different long-term solution for providing data access: heavily used core datasets held in the ESGF online disk archives, and less popular datasets held in existing modelling centre disk/tape archives and moved to ESGF only when requested by users or by a group within the climate research or climate impacts community. Given the wide scope of the datasets that may be required to support climate impact projects, it is important that this approach forms part of a future CLIPC environment.
In task 5.4 of the CLIPC project, work was undertaken to develop two interfaces between ESGF and remote climate archives held on robotic tape archive systems. These two activities looked at different aspects of a future distributed archive extension to ESGF:
Linköping University/National Supercomputer Centre (LIU/NSC) implemented a demonstration system that allows a user to request EURO4M data produced by SMHI and held in the MARS archive in GRIB (GRIdded Binary) format. Their SODA (System of Online Data Access) system enables a user to access EURO4M data in the MARS archive through the standard ESGF CoG interface. To achieve this, metadata from the MARS archive is transferred to the ESGF Solr database to be used in the data search systems. When a user requests a download of data from the EURO4M dataset, the request uses the standard wget mechanism of ESGF but is routed through the archive-specific plugin to the SODA scheduler and download management services.
Preliminary testing of this demonstrator has been completed, and it is the intention that the SODA service will ultimately become part of the ESGF software package.
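The request routing described for SODA can be pictured with a small sketch. It is an illustration only, not the SODA implementation: the class names, the prefix-based plugin lookup, the dataset identifiers and the returned strings are all assumptions made for this example.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, Dict


@dataclass
class DownloadRequest:
    dataset_id: str  # hypothetical identifier, as found through the search index
    file_name: str


@dataclass
class TapeScheduler:
    """Hypothetical stand-in for a scheduler that queues tape retrievals."""

    queue: Deque[DownloadRequest] = field(default_factory=deque)

    def submit(self, request: DownloadRequest) -> int:
        # Queue the request and return its position (1 = next to be served).
        self.queue.append(request)
        return len(self.queue)


class DownloadRouter:
    """Sends a download either to online disk or to an archive-specific plugin."""

    def __init__(self) -> None:
        self._plugins: Dict[str, Callable[[DownloadRequest], str]] = {}

    def register_plugin(
        self, dataset_prefix: str, handler: Callable[[DownloadRequest], str]
    ) -> None:
        # Assumption: datasets held on tape are recognised by an ID prefix.
        self._plugins[dataset_prefix] = handler

    def route(self, request: DownloadRequest) -> str:
        for prefix, handler in self._plugins.items():
            if request.dataset_id.startswith(prefix):
                return handler(request)
        # No plugin matched: the file is assumed to be on online disk.
        return f"online:{request.dataset_id}/{request.file_name}"


scheduler = TapeScheduler()
router = DownloadRouter()
router.register_plugin("euro4m", lambda r: f"queued:{scheduler.submit(r)}")

print(router.route(DownloadRequest("euro4m.example", "t2m.grb")))  # queued:1
print(router.route(DownloadRequest("cmip5.example", "tas.nc")))    # online:cmip5.example/tas.nc
```

The point of the sketch is that the user-facing download call stays the same, while the decision to involve a tape scheduler is hidden behind a per-archive plugin.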
The UK Met Office demonstration system looked at the issue of managing the distribution of climate datasets between the ESGF archives and the local modelling centre archives. This system has been deployed between the Met Office and the CEDA ESGF node, with the Met Office MASS archive system used as the demonstrator tape archive. The system allows the Met Office and CEDA to agree which datasets are routinely ingested into the ESGF archive and which are held in the MASS archive. If the Met Office or CEDA receives a request for data held in the MASS archive, the data can be made available and uploaded immediately. Care has been taken to consider the typical lifecycle of climate data, in order to deal with problems that are identified with data after it has been published. Functional and performance testing during 2016 has been successful, and it is the intention to use this system to support the management of all Met Office datasets for CMIP6.
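The placement agreement and on-request upload described above can be sketched as a small registry. This is a hypothetical illustration, not the Met Office or CEDA software: the `PlacementRegistry` class, the three location states (including `WITHDRAWN` as one possible way to handle problems found after publication) and the dataset identifiers are assumptions.

```python
from enum import Enum
from typing import Dict


class Location(Enum):
    ESGF_ONLINE = "esgf_online"    # routinely ingested, served from disk
    TAPE_ARCHIVE = "tape_archive"  # held only in the modelling centre archive
    WITHDRAWN = "withdrawn"        # assumption: pulled after a problem was found


class PlacementRegistry:
    """Records where each dataset is agreed to live."""

    def __init__(self) -> None:
        self._locations: Dict[str, Location] = {}

    def agree(self, dataset_id: str, location: Location) -> None:
        # Both parties record the agreed location of a dataset.
        self._locations[dataset_id] = location

    def request(self, dataset_id: str) -> Location:
        """Handle a user request, promoting tape-only data to the online node."""
        location = self._locations[dataset_id]  # KeyError if never agreed
        if location is Location.WITHDRAWN:
            raise LookupError(f"{dataset_id} has been withdrawn")
        if location is Location.TAPE_ARCHIVE:
            # Stand-in for the real upload from tape to the ESGF node.
            self._locations[dataset_id] = Location.ESGF_ONLINE
        return self._locations[dataset_id]

    def withdraw(self, dataset_id: str) -> None:
        # Mark a published dataset as no longer available.
        self._locations[dataset_id] = Location.WITHDRAWN


registry = PlacementRegistry()
registry.agree("core.example", Location.ESGF_ONLINE)
registry.agree("subdaily.example", Location.TAPE_ARCHIVE)

print(registry.request("subdaily.example").value)  # esgf_online
```

A real system would also need to track dataset versions and transfer status; the sketch only shows the state change triggered by a request.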
Although the two demonstrators take different approaches to the problem of distributed access to climate data held in tape archives, there are common features that could be used to support a unified approach. For example, the interfaces developed for the Met Office demonstrator implement the services that are required for the SODA plugins in the LIU/NSC demonstrator. It would also be possible to configure SODA to respond to data upload requests from the modelling centre as implemented in the Met Office solution.
While the results of this work will find immediate application in the CMIP6 project, any future evolution of the CLIPC portal should consider introducing features of both developments, in order to open up access to the wider range of data held in modelling centre tape archives.