Mining Distance Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule

Bay, Stephen D.; Schwabacher, Mark

Defining outliers by their distance to neighboring examples is a popular approach to finding unusual examples in a data set. Recently, much work has been conducted with the goal of finding fast algorithms for this task. We show that a simple nested-loop algorithm that is quadratic in the worst case can give near linear time performance when the data is in random order and a simple pruning rule is used. We test our algorithm on real high-dimensional data sets with millions of examples and show that the near linear scaling holds over several orders of magnitude. Our average case analysis suggests that much of the efficiency arises because the time to process non-outliers, which make up the majority of examples, does not depend on the size of the data set.
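
The abstract does not spell out the pruning rule, so the following is only a minimal sketch of how a nested loop with randomization and pruning might look. It assumes (these are illustrative choices, not details given in this record) that an example's outlier score is its average distance to its k nearest neighbors, that the goal is the top n outliers, and that an example can be abandoned as soon as its running score falls below the weakest score among the current top n. The function name `top_outliers` and its parameters are hypothetical.

```python
import heapq
import math
import random


def top_outliers(points, k=2, n=1, seed=0):
    """Return [(score, index), ...] for the n highest-scoring points.

    Assumed score: average Euclidean distance to the k nearest neighbors.
    """
    rng = random.Random(seed)
    order = list(range(len(points)))
    rng.shuffle(order)  # scan the data in random order

    cutoff = 0.0  # weakest score among the current top n (0 until n are found)
    top = []      # min-heap of (score, index) holding the best candidates so far

    for i in order:
        nearest = []  # max-heap (negated distances) of the k closest seen so far
        pruned = False
        for j in order:
            if j == i:
                continue
            d = math.dist(points[i], points[j])
            if len(nearest) < k:
                heapq.heappush(nearest, -d)
            elif d < -nearest[0]:
                heapq.heapreplace(nearest, -d)
            # The running score can only shrink as more neighbors are seen,
            # so it is an upper bound on the final score.
            if len(nearest) == k and -sum(nearest) / k < cutoff:
                pruned = True  # cannot reach the top n; stop scanning early
                break
        if pruned or len(nearest) < k:
            continue
        score = -sum(nearest) / k
        if len(top) < n:
            heapq.heappush(top, (score, i))
        elif score > top[0][0]:
            heapq.heapreplace(top, (score, i))
        if len(top) == n:
            cutoff = top[0][0]

    return sorted(top, reverse=True)


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
    print(top_outliers(data, k=2, n=1))
```

In this sketch a typical non-outlier finds k close neighbors after scanning only a few randomly ordered examples and is pruned at that point, which is consistent with the abstract's remark that the time to process non-outliers does not depend on the size of the data set.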

Document ID

20030022754

Acquisition Source

Ames Research Center

Document Type

Preprint (Draft being sent to journal)

Date Acquired

September 7, 2013

Publication Date

February 28, 2003

Meeting Information

Meeting: ACM International Conference on Knowledge Discovery and Data Mining

Location: Washington, D.C.

Country: United States

Start Date: August 24, 2003

End Date: August 27, 2003

Distribution Limits

Public

Work of the US Gov. Public Use Permitted.

Available Downloads

20030022754.pdf (Type: STI)