NTRS - NASA Technical Reports Server

A Hybrid Approach to Labeling Datasets in Earth Science Publications
NASA Data Centers provide the public with thousands of datasets that result in published papers,
reports, and conference proceedings. Collecting accurate metrics on usage of these datasets is
key to connecting different areas of knowledge and evaluating the datasets’ impact. While most
of the datasets have Digital Object Identifiers (DOIs) assigned, most publications do not cite
them, which hampers automated searches for these publications. Instead, articles mention attributes
like organization, instrument, mission, variable, or a publication describing the dataset. Often
only domain experts can deduce the dataset that was used in the publication text. The lack of a
citation slows the spread of information and reduces the research’s impact. With thousands of
papers produced each year, an automated means of labeling datasets is critical. This paper
explores a hybrid approach of heuristics and a Natural Language Processing (NLP) Named
Entity Recognition (NER) model to find and label the datasets used within Earth Science papers.
Heuristics are used to produce the labeled sentences and any potential dataset candidates that
can be derived from a sentence. The heuristics label the sentences with the names of missions,
instruments, re-analysis models, and science keywords taken from the Global Change Master
Directory (GCMD) ontology. Additionally, they use those labels to generate the dataset citation
candidates. If the mission, instrument, and variable are sufficient to create the citation for the
dataset, the domain expert reviews the citation and the label directly, without going through the
NLP model. If the extracted labels are not sufficient to identify the dataset on their own, the sentence
and its associated dataset labels are passed to the NER model. The model outputs the
labeled sentence and the potential dataset candidates with their associated probabilities. The
domain expert then reviews the NER model’s output and determines the correct labels. The
newly labeled papers can then be used as additional training data. This creates an iterative
process through which the approach continuously improves. Because all the possible mentions are gathered
by the model, the domain expert can quickly and easily label the papers, resulting in large time
savings.
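The heuristic step and the routing decision described above can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the small vocabularies stand in for terms that would be drawn from the GCMD ontology, and `VOCAB`, `DATASET_INDEX`, `EXAMPLE_DATASET_001`, `label_sentence`, `candidate_datasets`, and `route` are hypothetical names. The rule that "exactly one matching dataset" counts as sufficient for a citation is likewise an assumption.

```python
import re

# Hypothetical vocabularies; a real system would load these from GCMD.
VOCAB = {
    "mission": ["Aqua", "Terra"],
    "instrument": ["AIRS", "MODIS"],
    "variable": ["surface temperature", "aerosol optical depth"],
}

# Hypothetical lookup from (mission, instrument, variable) to a dataset
# identifier. The identifier is a placeholder, not a real DOI or short name.
DATASET_INDEX = {
    ("Aqua", "AIRS", "surface temperature"): "EXAMPLE_DATASET_001",
}


def label_sentence(sentence):
    """Return the vocabulary terms found in the sentence, per category."""
    labels = {}
    for category, terms in VOCAB.items():
        found = []
        for term in terms:
            # Whole-word, case-insensitive match of the term.
            pattern = r"\b" + re.escape(term) + r"\b"
            if re.search(pattern, sentence, flags=re.IGNORECASE):
                found.append(term)
        labels[category] = found
    return labels


def candidate_datasets(labels):
    """Look up every (mission, instrument, variable) combination found."""
    candidates = []
    for mission in labels["mission"]:
        for instrument in labels["instrument"]:
            for variable in labels["variable"]:
                dataset = DATASET_INDEX.get((mission, instrument, variable))
                if dataset is not None:
                    candidates.append(dataset)
    return candidates


def route(sentence):
    """Decide whether a sentence goes straight to expert review or to NER.

    Assumption: the labels are "sufficient" when they resolve to exactly
    one dataset; anything else is deferred to the NER model.
    """
    labels = label_sentence(sentence)
    candidates = candidate_datasets(labels)
    destination = "expert_review" if len(candidates) == 1 else "ner_model"
    return destination, labels, candidates
```

Under these assumptions, `route("We used AIRS surface temperature retrievals from the Aqua satellite.")` resolves to a single candidate and is sent to expert review, while a sentence that names only an instrument, such as `route("MODIS data were analyzed.")`, yields no candidate and is deferred to the NER model together with its partial labels.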
Document ID
20205010366
Acquisition Source
Goddard Space Flight Center
Document Type
Presentation
Authors
Jacob Atkins
Irina Gerasimov
(Adnet Systems (United States) Bethesda, Maryland, United States)
Mo Khayat
Date Acquired
November 18, 2020
Subject Category
Documentation And Information Science
Meeting Information
Meeting: AGU Fall Meeting 2020
Location: Virtual
Country: US
Start Date: December 1, 2020
End Date: December 17, 2020
Sponsors: American Geophysical Union
Funding Number(s)
CONTRACT_GRANT: 80GSFC17C0003
Distribution Limits
Public
Copyright
Public Use Permitted.
Technical Review
NASA Peer Committee