NTRS - NASA Technical Reports Server

Back to Results
Mining Distance Based Outliers in Near Linear Time with Randomization and a Simple Pruning RuleDefining outliers by their distance to neighboring examples is a popular approach to finding unusual examples in a data set. Recently, much work has been conducted with the goal of finding fast algorithms for this task. We show that a simple nested loop algorithm that in the worst case is quadratic can give near linear time performance when the data is in random order and a simple pruning rule is used. We test our algorithm on real high-dimensional data sets with millions of examples and show that the near linear scaling holds over several orders of magnitude. Our average case analysis suggests that much of the efficiency is because the time to process non-outliers, which are the majority of examples, does not depend on the size of the data set.
Document ID
Document Type
Preprint (Draft being sent to journal)
Bay, Stephen D. (Institute for the Study of Learning and Expertise Palo Alto, CA, United States)
Schwabacher, Mark (NASA Ames Research Center Moffett Field, CA, United States)
Date Acquired
September 7, 2013
Publication Date
February 28, 2003
Subject Category
Statistics and Probability
Meeting Information
ACM International Conference on Knowledge Discovery and Data Mining(Washington, D.C.)
Funding Number(s)
Distribution Limits
Work of the US Gov. Public Use Permitted.

Available Downloads

NameType 20030022754.pdf STI