NTRS - NASA Technical Reports Server

ClimateSpark: An In-Memory Distributed Computing Framework for Big Climate Data Analytics

The unprecedented growth of climate data creates new opportunities for climate studies, yet managing and analyzing these data efficiently poses a grand challenge for climatologists. The complexity of climate data content and of the analytical algorithms makes those algorithms harder to implement on high-performance computing systems. This paper proposes an in-memory, distributed computing framework, ClimateSpark, to facilitate complex big data analytics and time-consuming computational tasks. A chunking data structure improves parallel I/O efficiency, while a spatiotemporal index built over the chunks avoids unnecessary data reading and preprocessing. An integrated, multi-dimensional, array-based data model (ClimateRDD) and ETL operations are developed to address the variety of big climate data by integrating the processing components of the climate data lifecycle. ClimateSpark uses Spark SQL and Apache Zeppelin to provide a web portal that facilitates interaction among climatologists, climate data, analytic operations, and computing resources (e.g., through SQL queries and Scala/Python notebooks). Experimental results show that ClimateSpark conducts different spatiotemporal data queries and analytics with high efficiency and data locality. ClimateSpark is easily adaptable to other big multi-dimensional, array-based datasets in various geoscience domains.
Document ID
20180003101
Acquisition Source
Goddard Space Flight Center
Document Type
Reprint (Version printed in journal)
Authors
Hu, Fei
(George Mason Univ. Fairfax, VA, United States)
Yang, Chaowei
(George Mason Univ. Fairfax, VA, United States)
Schnase, John L.
(NASA Goddard Space Flight Center Greenbelt, MD, United States)
Duffy, Daniel Q.
(NASA Goddard Space Flight Center Greenbelt, MD, United States)
Xu, Mengchao
(George Mason Univ. Fairfax, VA, United States)
Bowen, Michael K.
(NASA Goddard Space Flight Center Greenbelt, MD, United States)
Lee, Tsengdar
(NASA Headquarters Washington, DC United States)
Song, Weiwei
(George Mason Univ. Fairfax, VA, United States)
Date Acquired
May 27, 2018
Publication Date
March 21, 2018
Publication Information
Publication: Computers and Geosciences
Publisher: Elsevier
Volume: 115
ISSN: 0098-3004
Subject Category
Computer Programming And Software
Report/Patent Number
GSFC-E-DAA-TN55663
Funding Number(s)
CONTRACT_GRANT: NSF ICER-1540998
CONTRACT_GRANT: NNX15AM85G
CONTRACT_GRANT: NSF IIP-1338925
Distribution Limits
Public
Copyright
Other
