
A Parquet Cube alternative to store gridded data for data analytics and modeling

Jean-Michel Zigna, Reda Semlal, Flavien Gouillon, Ethan Davis, Elisabeth Lambert, Frédéric Briol, Romain Prod-Homme, Sean Arms, Lionel Zawadzki

Crossref (2021)

Abstract
The volume of data in the field of Earth observation has increased considerably, especially with the emergence of new generations of satellites and models that provide much more precise measurements and therefore voluminous data and files. One of the most traditional and popular data formats in the scientific and education communities is the NetCDF format. However, it was designed before the development of cloud storage and parallel processing in big data architectures. Alternative solutions, under open-source or proprietary licences, have appeared in the past few years (e.g. Rasdaman, Open Data Cube). These data cubes manage the storage and the services for easy access to the data, but they also alter the input information by applying conversions and/or reprojections to homogenize their internal data structure, introducing a bias in the scientific value of the data. The consequence is that users are driven into a closed infrastructure, made of customized storage and access services. The objective of this study is to propose a lightweight new open-source solution that stores gridded datasets in a native big data format and makes the data available for parallel processing, analytics or artificial intelligence learning. There is a demand for a unique storage solution that would be open to different users: scientists, setting up their prototypes and models in their customized environment and qualifying their data for publication, for instance as Copernicus datasets; and operational teams, in charge of the daily processing of data, which can be run in another environment, to ingest the product into an archive and make it available to end users for additional modeling and data science processing. Data ingestion and storage are key factors to study to ensure good performance in subsequent subsetting access services and parallel processing.
Through typical end users' use cases, four storage and service implementations are compared through benchmarks: Unidata's THREDDS Data Server (TDS), a traditional NetCDF data access service built on NetCDF-Java; an extension of the THREDDS Data Server using an object store; the Pangeo/Dask/Python ecosystem; and the alternative Hadoop/Spark/Parquet solution, driven by CLS technical and business requirements.
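The Parquet alternative described above rests on flattening a gridded array into a columnar table, one row per grid cell, so that big data engines such as Spark or Dask can partition and scan it. A minimal sketch of that "cube to columns" step, using a toy 2x3 grid and hypothetical variable/file names (the paper's actual ingestion pipeline is not reproduced here), might look like:

```python
# Hedged sketch: flatten a small gridded field into a columnar table
# and write it as Parquet. Variable names (lats, lons, sst) and the
# output file name are illustrative assumptions, not the paper's code.
import numpy as np
import pandas as pd

# Toy 2x3 latitude/longitude grid of sea-surface temperature values.
lats = np.array([10.0, 20.0])
lons = np.array([100.0, 110.0, 120.0])
sst = np.arange(6, dtype=float).reshape(2, 3)

# Expand the coordinate axes to one (lat, lon) pair per grid cell,
# then ravel everything into flat columns: one row per cell.
lat_grid, lon_grid = np.meshgrid(lats, lons, indexing="ij")
df = pd.DataFrame({
    "lat": lat_grid.ravel(),
    "lon": lon_grid.ravel(),
    "sst": sst.ravel(),
})

# Writing Parquet requires a pyarrow or fastparquet engine; skip
# gracefully if neither is installed in this environment.
try:
    df.to_parquet("sst.parquet")
except ImportError:
    pass

print(len(df))  # one row per grid cell: 6
```

In a real pipeline the time dimension would typically become an additional column (or a partition key), which is what makes temporal and spatial subsetting cheap for the downstream parallel-processing use cases the benchmarks target.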