NTRS - NASA Technical Reports Server

Experiences with Text Mining Large Collections of Unstructured Systems Development Artifacts at JPL
Often repositories of systems engineering artifacts at NASA's Jet Propulsion Laboratory (JPL) are so large and poorly structured that they have outgrown our capability to effectively manually process their contents to extract useful information. Sophisticated text mining methods and tools seem a quick, low-effort approach to automating our limited manual efforts. Our experiences of exploring such methods mainly in three areas including historical risk analysis, defect identification based on requirements analysis, and over-time analysis of system anomalies at JPL, have shown that obtaining useful results requires substantial unanticipated efforts - from preprocessing the data to transforming the output for practical applications. We have not observed any quick 'wins' or realized benefit from short-term effort avoidance through automation in this area. Surprisingly we have realized a number of unexpected long-term benefits from the process of applying text mining to our repositories. This paper elaborates some of these benefits and our important lessons learned from the process of preparing and applying text mining to large unstructured system artifacts at JPL aiming to benefit future TM applications in similar problem domains and also in hope for being extended to broader areas of applications.
Document ID
Document Type
Conference Paper
Authors
Port, Dan (Hawaii Univ. Honolulu, HI, United States)
Nikora, Allen (Jet Propulsion Lab., California Inst. of Tech. Pasadena, CA, United States)
Hihn, Jairus (Jet Propulsion Lab., California Inst. of Tech. Pasadena, CA, United States)
Huang, LiGuo (Southern Methodist Univ. Dallas, TX, United States)
Date Acquired
August 25, 2013
Publication Date
March 1, 2011
Publication Information
Publication: 33rd International Conference on Software Engineering (ICSE '11)
Subject Category
Documentation and Information Science
Meeting Information
33rd International Conference on Software Engineering (ICSE '11). ACM, New York, NY, USA, 701-710 (Honolulu, HI)
Distribution Limits
Keywords
text mining
risk identification
failure report analysis
requirement analysis