NTRS - NASA Technical Reports Server

Application of ML/AI for Identifying Earth Science Datasets in Research Publications

NASA Distributed Active Archive Centers (DAACs) ingest, store, and distribute data acquired from satellites and ground systems, as well as model data. These data are organized into datasets, each a collection of files usually associated with a particular mission, instrument, processing level, parameter(s), algorithm, and/or model. The number of datasets a single DAAC offers to the public varies; GES DISC, for example, currently offers approximately 1,300. Although each publicly offered dataset comes with supporting documentation, both novice and experienced scientists find it challenging to navigate among datasets that offer similar parameters and to find the ones suited to their particular research application. Supplementing dataset documentation with citations of the scientific papers that refer to the dataset gives users a way to learn about the research in which that dataset is being applied. Collecting citations of the papers that use the datasets yields valuable insights into the application areas of those datasets and into which groups of datasets are used for specific applications. It also provides insight into the “deep metrics” of dataset usage, as opposed to common metrics such as the number of users who downloaded the dataset files and the volume of data downloaded. Associating a scientific paper with the dataset(s) it uses is challenging because most authors do not properly cite datasets, datasets usually have cryptic names, and the Digital Object Identifiers (DOIs) used to identify datasets were assigned only a few years ago.
A simple Google or online library search by dataset name or DOI does not return even a meaningful fraction of the relevant papers, whereas a search by broader terms such as mission and instrument names returns too many results. Attempts to create an AI system capable of identifying datasets in scientific papers have already been made, using neural network classifiers based on the dataset's mission, instrument, and variable names. This method was applied to NASA SEDAC, which has 41 datasets in total. In GES DISC there can be as many as ~100 datasets per mission/instrument, with some datasets consisting of multiple variables, so more differentiating parameters are needed to identify a dataset in a paper. The approach we are currently investigating is to create AI classifiers based on multiple dataset features, or keywords, extracted from the NASA Earthdata Common Metadata Repository (CMR). The features are weighted according to how precisely they can identify a dataset. The classifier takes preprocessed paper text as input and searches for the CMR datasets whose feature sets are closest to the feature set found in the paper. The challenges of dataset identification include the variety of ways in which authors describe datasets in their papers and incomplete tagging of the CMR dataset descriptions (DIFs).
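The weighted feature matching described above can be illustrated with a minimal sketch. This is not the authors' classifier: the dataset names and feature sets are hypothetical placeholders (real ones would be extracted from CMR metadata), the inverse-frequency weighting is only one assumed way to express "how precisely a feature identifies a dataset", and plain phrase matching stands in for whatever preprocessing and classification the actual system uses.

```python
import math
import re
from collections import Counter

# Hypothetical, hand-made feature sets; real ones would come from CMR metadata.
DATASETS = {
    "DATASET_A": {"airs", "aqua", "level 3", "surface temperature"},
    "DATASET_B": {"airs", "aqua", "level 2", "ozone"},
    "DATASET_C": {"trmm", "level 3", "precipitation"},
}


def feature_weights(datasets):
    """Weight a feature higher when fewer datasets share it.

    The IDF-style formula is an assumption, not taken from the source.
    """
    n = len(datasets)
    counts = Counter(f for feats in datasets.values() for f in feats)
    return {f: math.log(n / c) + 1.0 for f, c in counts.items()}


def preprocess(text):
    """Lowercase the text and reduce punctuation runs to single spaces."""
    return " " + re.sub(r"[^a-z0-9]+", " ", text.lower()).strip() + " "


def rank_datasets(paper_text, datasets):
    """Score each dataset by the weighted share of its features found in the text."""
    weights = feature_weights(datasets)
    text = preprocess(paper_text)
    scores = {}
    for name, feats in datasets.items():
        total = sum(weights[f] for f in feats)
        matched = sum(weights[f] for f in feats if f" {f} " in text)
        scores[name] = matched / total
    # Highest score first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    paper = "We used AIRS Level-3 surface temperature retrievals from the Aqua satellite."
    for name, score in rank_datasets(paper, DATASETS):
        print(f"{name}: {score:.2f}")
```

In this toy setup a feature shared by several datasets (such as the instrument name) contributes less to the score than one unique to a single dataset, which is the intuition behind weighting features by how precisely they identify a dataset.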
Document ID
20210010138
Acquisition Source
Goddard Space Flight Center
Document Type
Presentation
Authors
Irina Gerasimov
(Adnet Systems (United States) Bethesda, Maryland, United States)
Jacob Atkins
(GSFC INTERNS)
Edward Jahoda
(GSFC INTERNS)
Andrey Savtchenko
(Adnet Systems (United States) Bethesda, Maryland, United States)
Jerome Alfred
(Adnet Systems (United States) Bethesda, Maryland, United States)
Jennifer Wei
(Goddard Space Flight Center Greenbelt, Maryland, United States)
Date Acquired
February 12, 2021
Subject Category
Cybernetics, Artificial Intelligence And Robotics
Computer Programming And Software
Meeting Information
Meeting: NASA 2nd AI Workshop
Location: Virtual
Country: US
Start Date: February 9, 2021
End Date: February 11, 2021
Sponsors: Jet Propulsion Lab
Funding Number(s)
CONTRACT_GRANT: 80GSFC17C0003
Distribution Limits
Public
Copyright
Public Use Permitted.
Technical Review
NASA Peer Committee