NTRS - NASA Technical Reports Server

Scraping Unstructured Data to Explore the Relationship between Rainfall Anomalies and Vector-Borne Disease Outbreaks

According to the World Health Organization (WHO), vector-borne diseases such as malaria and dengue account for 17% of all infectious disease cases and lead to more than 700,000 deaths per year. Tracking and predicting the spread of vector-borne diseases is a vital task that could save hundreds of thousands of lives annually. The first reports of vector-borne disease outbreaks often appear in emails and online reporting systems long before they are officially documented. Tracking and predicting the emergence and spread of these outbreaks therefore requires extracting data from such unstructured sources and combining it with historical weather and climate data to understand the underlying background triggers and disease dynamics. In this work, we develop a data extraction pipeline for the online outbreak reporting website ProMED-mail that uses a web scraper, a transformer neural network summarizer, and a named entity recognizer to obtain a dataset of malaria, dengue, Zika, and chikungunya outbreaks over the last 30 years. As a preliminary analysis of the effect of global rainfall patterns on the spread of vector-borne diseases, this scraped dataset was analyzed in association with global rainfall anomalies derived from NASA’s Integrated Multi-satellitE Retrievals for GPM [Global Precipitation Measurement] (IMERG) dataset. Analysis of the ProMED-mail and GPM data shows that vector-borne disease outbreaks are clustered towards the tropics and that outbreaks are often amplified during the rainy seasons. Our scraped dataset can be a valuable tool in creating comprehensive georeferenced disease records for modeling and predicting future outbreaks.
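
To make the idea of associating outbreak reports with rainfall anomalies concrete, the following is a minimal sketch, not the authors' pipeline. It assumes a rainfall anomaly is defined as a month's total rainfall minus the mean for that calendar month over the available years, and it assumes outbreak reports have already been reduced to (year, month) pairs for a single location. The function names and all numbers are hypothetical and purely illustrative.

```python
from collections import defaultdict
from statistics import mean


def monthly_anomalies(rainfall):
    """Return rainfall anomalies per (year, month).

    rainfall: dict mapping (year, month) -> total rainfall in mm.
    The anomaly is the value minus the mean for that calendar month
    across all years present (an assumed, simplified definition).
    """
    by_month = defaultdict(list)
    for (_, month), mm in rainfall.items():
        by_month[month].append(mm)
    climatology = {month: mean(values) for month, values in by_month.items()}
    return {(y, m): mm - climatology[m] for (y, m), mm in rainfall.items()}


def share_in_wet_months(outbreaks, anomalies):
    """Fraction of outbreak reports falling in months with a positive anomaly.

    outbreaks: list of (year, month) pairs; reports with no rainfall
    record are ignored.
    """
    matched = [anomalies[key] for key in outbreaks if key in anomalies]
    if not matched:
        return 0.0
    return sum(a > 0 for a in matched) / len(matched)


# Made-up toy data for one hypothetical location (not from IMERG or ProMED-mail).
rainfall = {
    (2019, 6): 100.0, (2020, 6): 140.0, (2021, 6): 120.0,
    (2019, 7): 200.0, (2020, 7): 260.0, (2021, 7): 230.0,
}
outbreaks = [(2020, 6), (2020, 7), (2019, 7)]

anomalies = monthly_anomalies(rainfall)
share = share_in_wet_months(outbreaks, anomalies)
print(f"Share of reports in wetter-than-usual months: {share:.2f}")
```

A real analysis would work with gridded satellite data and georeferenced reports rather than a single location, so this only illustrates the anomaly calculation and the matching step.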
Document ID
20220000528
Acquisition Source
Goddard Space Flight Center
Document Type
Conference Paper
Authors
Ethan Joseph
(Rensselaer Polytechnic Institute Troy, New York, United States)
Thilanka Munasinghe
(Rensselaer Polytechnic Institute Troy, New York, United States)
Heidi Tubbs
(Universities Space Research Association Columbia, Maryland, United States)
Bhaskar Bishnoi
(Universities Space Research Association Columbia, Maryland, United States)
Assaf Anyamba
(Universities Space Research Association Columbia, Maryland, United States)
Date Acquired
January 25, 2022
Publication Date
January 13, 2022
Publication Information
Publication: 2021 IEEE International Conference on Big Data (Big Data)
Publisher: IEEE
ISBN: 978-1-6654-4599-3
e-ISBN: 978-1-6654-3902-2
Subject Category
Documentation And Information Science
Meeting Information
Meeting: 2021 IEEE International Conference on Big Data (Big Data)
Location: Virtual
Country: US
Start Date: December 15, 2021
End Date: December 18, 2021
Sponsors: Institute of Electrical and Electronics Engineers
Funding Number(s)
PROJECT: 17- HAQ17-0065
CONTRACT_GRANT: 80NSSC22M0001
Distribution Limits
Public
Copyright
Portions of document may include copyright protected material.
Technical Review
External Peer Committee
Keywords
Web scraping
data mining
epidemiology