Scalable Detection of Anomalous Patterns With Connectivity Constraints

Skyler Speakman, Edward McFowland, Daniel Neill

Research output: Contribution to journal (Article)

Abstract

We present GraphScan, a novel method for detecting arbitrarily shaped connected clusters in graph or network data. Given a graph structure, data observed at each node, and a score function defining the anomalousness of a set of nodes, GraphScan can efficiently and exactly identify the most anomalous (highest-scoring) connected subgraph. Kulldorff’s spatial scan, which searches over circles consisting of a center location and its k − 1 nearest neighbors, has been extended to include connectivity constraints by FlexScan. However, FlexScan performs an exhaustive search over connected subsets and is computationally infeasible for k > 30. Alternatively, the upper level set (ULS) scan scales well to large graphs but is not guaranteed to find the highest-scoring subset. We demonstrate that GraphScan is able to scale to graphs an order of magnitude larger than FlexScan, while guaranteeing that the highest-scoring subgraph will be identified. We evaluate GraphScan, Kulldorff’s spatial scan (searching over circles) and ULS in two different settings of public health surveillance. The first examines detection power using simulated disease outbreaks injected into real-world Emergency Department data. GraphScan improved detection power by identifying connected, irregularly shaped spatial clusters while requiring less than 4.3 sec of computation time per day of data. The second scenario uses contaminant plumes spreading through a water distribution system to evaluate the spatial accuracy of the methods. GraphScan improved spatial accuracy using data generated from noisy, binary sensors in the network while requiring less than 0.22 sec of computation time per hour of data.
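The problem statement in the abstract (a graph, data at each node, a score function, and a search for the highest-scoring connected subset) can be illustrated with a small brute-force sketch. This is not GraphScan, whose contribution is precisely to avoid exhaustive enumeration; it is closer in spirit to the exhaustive search attributed to FlexScan. The score function used here (an expectation-based Poisson log-likelihood ratio), the toy path graph, and all function names are assumptions chosen for illustration and are not taken from the paper.

```python
import math
from itertools import combinations


def is_connected(nodes, adj):
    """Return True if `nodes` induces a connected subgraph of `adj`."""
    nodes = set(nodes)
    if not nodes:
        return False
    start = next(iter(nodes))
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v in nodes and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen == nodes


def poisson_score(count, baseline):
    """Assumed score: C*log(C/B) + B - C when C > B, else 0."""
    if count <= baseline:
        return 0.0
    return count * math.log(count / baseline) + baseline - count


def best_connected_subset(adj, counts, baselines):
    """Exhaustively score every connected subset (exponential in node count)."""
    nodes = sorted(adj)
    best_score, best_subset = 0.0, frozenset()
    for k in range(1, len(nodes) + 1):
        for subset in combinations(nodes, k):
            if not is_connected(subset, adj):
                continue
            c = sum(counts[v] for v in subset)
            b = sum(baselines[v] for v in subset)
            s = poisson_score(c, b)
            if s > best_score:
                best_score, best_subset = s, frozenset(subset)
    return best_score, best_subset


# Hypothetical toy data: a path graph 0-1-2-3-4 with elevated counts at 1 and 3.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
counts = {0: 10, 1: 30, 2: 10, 3: 30, 4: 10}
baselines = {v: 10 for v in adj}

score, subset = best_connected_subset(adj, counts, baselines)
print(sorted(subset), round(score, 2))  # [1, 2, 3] 19.31
```

In this toy case the unconstrained optimum would be the disconnected pair {1, 3}; the connectivity constraint forces node 2 into the result. The loop visits all 2^n − 1 non-empty subsets, which is why exhaustive approaches of this kind stop being practical at small neighborhood sizes.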

Original language: English (US)
Pages (from-to): 1014-1033
Number of pages: 20
Journal: Journal of Computational and Graphical Statistics
Volume: 24
Issue number: 4
DOI: 10.1080/10618600.2014.960926
State: Published - Oct 2 2015

Keywords

  • Biosurveillance
  • Event detection
  • Graph mining
  • Scan statistics
  • Spatial scan statistic

ASJC Scopus subject areas

  • Discrete Mathematics and Combinatorics
  • Statistics and Probability
  • Statistics, Probability and Uncertainty

Cite this

Scalable Detection of Anomalous Patterns With Connectivity Constraints. / Speakman, Skyler; McFowland, Edward; Neill, Daniel.

In: Journal of Computational and Graphical Statistics, Vol. 24, No. 4, 02.10.2015, p. 1014-1033.

@article{96230b91a24449b0842e2d1bfb38f76e,
title = "Scalable Detection of Anomalous Patterns With Connectivity Constraints",
abstract = "We present GraphScan, a novel method for detecting arbitrarily shaped connected clusters in graph or network data. Given a graph structure, data observed at each node, and a score function defining the anomalousness of a set of nodes, GraphScan can efficiently and exactly identify the most anomalous (highest-scoring) connected subgraph. Kulldorff’s spatial scan, which searches over circles consisting of a center location and its k − 1 nearest neighbors, has been extended to include connectivity constraints by FlexScan. However, FlexScan performs an exhaustive search over connected subsets and is computationally infeasible for k > 30. Alternatively, the upper level set (ULS) scan scales well to large graphs but is not guaranteed to find the highest-scoring subset. We demonstrate that GraphScan is able to scale to graphs an order of magnitude larger than FlexScan, while guaranteeing that the highest-scoring subgraph will be identified. We evaluate GraphScan, Kulldorff’s spatial scan (searching over circles) and ULS in two different settings of public health surveillance. The first examines detection power using simulated disease outbreaks injected into real-world Emergency Department data. GraphScan improved detection power by identifying connected, irregularly shaped spatial clusters while requiring less than 4.3 sec of computation time per day of data. The second scenario uses contaminant plumes spreading through a water distribution system to evaluate the spatial accuracy of the methods. GraphScan improved spatial accuracy using data generated from noisy, binary sensors in the network while requiring less than 0.22 sec of computation time per hour of data.",
keywords = "Biosurveillance, Event detection, Graph mining, Scan statistics, Spatial scan statistic",
author = "Skyler Speakman and Edward McFowland and Daniel Neill",
year = "2015",
month = "10",
day = "2",
doi = "10.1080/10618600.2014.960926",
language = "English (US)",
volume = "24",
pages = "1014--1033",
journal = "Journal of Computational and Graphical Statistics",
issn = "1061-8600",
publisher = "American Statistical Association",
number = "4"
}

TY  - JOUR
T1  - Scalable Detection of Anomalous Patterns With Connectivity Constraints
AU  - Speakman, Skyler
AU  - McFowland, Edward
AU  - Neill, Daniel
PY  - 2015/10/2
Y1  - 2015/10/2
AB  - We present GraphScan, a novel method for detecting arbitrarily shaped connected clusters in graph or network data. Given a graph structure, data observed at each node, and a score function defining the anomalousness of a set of nodes, GraphScan can efficiently and exactly identify the most anomalous (highest-scoring) connected subgraph. Kulldorff’s spatial scan, which searches over circles consisting of a center location and its k − 1 nearest neighbors, has been extended to include connectivity constraints by FlexScan. However, FlexScan performs an exhaustive search over connected subsets and is computationally infeasible for k > 30. Alternatively, the upper level set (ULS) scan scales well to large graphs but is not guaranteed to find the highest-scoring subset. We demonstrate that GraphScan is able to scale to graphs an order of magnitude larger than FlexScan, while guaranteeing that the highest-scoring subgraph will be identified. We evaluate GraphScan, Kulldorff’s spatial scan (searching over circles) and ULS in two different settings of public health surveillance. The first examines detection power using simulated disease outbreaks injected into real-world Emergency Department data. GraphScan improved detection power by identifying connected, irregularly shaped spatial clusters while requiring less than 4.3 sec of computation time per day of data. The second scenario uses contaminant plumes spreading through a water distribution system to evaluate the spatial accuracy of the methods. GraphScan improved spatial accuracy using data generated from noisy, binary sensors in the network while requiring less than 0.22 sec of computation time per hour of data.
KW  - Biosurveillance
KW  - Event detection
KW  - Graph mining
KW  - Scan statistics
KW  - Spatial scan statistic
UR  - http://www.scopus.com/inward/record.url?scp=84949544374&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=84949544374&partnerID=8YFLogxK
U2  - 10.1080/10618600.2014.960926
DO  - 10.1080/10618600.2014.960926
M3  - Article
VL  - 24
SP  - 1014
EP  - 1033
JO  - Journal of Computational and Graphical Statistics
JF  - Journal of Computational and Graphical Statistics
SN  - 1061-8600
IS  - 4
ER  -