NTRS - NASA Technical Reports Server

Use of Schema on Read in Earth Science Data Archives
Traditionally, NASA Earth Science data archives have file-based storage using proprietary data file formats, such as HDF and HDF-EOS, which are optimized to support fast and efficient storage of spaceborne and model data as they are generated. The use of file-based storage essentially imposes an indexing strategy based on data dimensions. In most cases, NASA Earth Science data uses time as the primary index, leading to poor performance in accessing data in spatial dimensions. For example, producing a time series for a single spatial grid cell involves accessing a large number of data files. With exponential growth in data volume due to the ever-increasing spatial and temporal resolution of the data, using file-based archives poses significant performance and cost barriers to data discovery and access. Storing and disseminating data in proprietary data formats imposes an additional access barrier for users outside the mainstream research community. At the NASA Goddard Earth Sciences Data and Information Services Center (GES DISC), we have evaluated applying the schema-on-read principle to data access and distribution. We used Apache Parquet to store geospatial data, and have exposed data through Amazon Web Services (AWS) Athena, AWS Simple Storage Service (S3), and Apache Spark. Using the schema-on-read approach allows customization of indexing spatially or temporally to suit the data access pattern. Open formats such as Apache Parquet have widespread support in popular programming languages, and the wide range of solutions available for handling big data lowers the access barrier for all users. This presentation will discuss formats used for data storage, frameworks with support for schema-on-read used for data access, and common use cases covering data usage patterns seen in a geospatial data archive.
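The access-pattern point in the abstract (a time series for one grid cell touches every file when time is the primary index) can be illustrated with a minimal sketch. This is a toy model using only the Python standard library, not the GES DISC implementation and not Parquet, Athena, or Spark: the archive layout, the function names, and the synthetic values are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical toy archive: one "file" per time step, each holding a full grid.
# Keys are time steps; values map (lat_index, lon_index) -> measurement.
def build_time_indexed_archive(n_times, n_lat, n_lon):
    return {
        t: {(i, j): float(t * 100 + i * 10 + j)
            for i in range(n_lat) for j in range(n_lon)}
        for t in range(n_times)
    }

def time_series_from_time_index(archive, cell):
    """Read one cell's time series; every time-step 'file' must be opened."""
    files_opened = 0
    series = []
    for t in sorted(archive):
        files_opened += 1
        series.append((t, archive[t][cell]))
    return series, files_opened

def repartition_by_cell(archive):
    """Rewrite the same records so each spatial cell is its own partition."""
    by_cell = defaultdict(list)
    for t in sorted(archive):
        for cell, value in archive[t].items():
            by_cell[cell].append((t, value))
    return dict(by_cell)

def time_series_from_cell_index(by_cell, cell):
    """Read one cell's time series from a single spatial partition."""
    return by_cell[cell], 1

# Same data, two layouts: time-primary versus cell-primary.
archive = build_time_indexed_archive(n_times=5, n_lat=2, n_lon=3)
by_cell = repartition_by_cell(archive)
series_a, opened_a = time_series_from_time_index(archive, (1, 2))
series_b, opened_b = time_series_from_cell_index(by_cell, (1, 2))
```

Both layouts return the same series, but the time-primary layout opens one unit per time step while the cell-primary layout opens a single partition. In a columnar store, a comparable effect would presumably come from the choice of partition and sort keys, which is the kind of per-access-pattern customization the abstract describes.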
Document ID
20180000716
Document Type
Presentation
Authors
Hegde, Mahabaleshwara
(Adnet Systems, Inc. Greenbelt, MD, United States)
Smit, Christine
(Telophase Corporation Arlington, VA, United States)
Pilone, Paul
(Element84 Alexandria, VA, United States)
Petrenko, Maksym
(Adnet Systems, Inc. Greenbelt, MD, United States)
Pham, Long
(NASA Goddard Space Flight Center Greenbelt, MD, United States)
Date Acquired
January 25, 2018
Publication Date
December 11, 2017
Subject Category
Mathematical And Computer Sciences (General)
Report/Patent Number
GSFC-E-DAA-TN50564
Meeting Information
2017 American Geophysical Union (AGU) Fall Meeting (New Orleans, LA)
Funding Number(s)
CONTRACT_GRANT: 80GSFC17C0003
Distribution Limits
Public
Copyright
Use by or on behalf of the US Gov. Permitted.
Keywords
cloud applications
Earth science
Giovanni
big data

Available Downloads

Name
20180000716.pdf
Type
STI