Fast mining of massive tabular data via approximate distance computations

Graham Cormode, Piotr Indyk, Nick Koudas, Shanmugavelayutham Muthukrishnan

    Research output: Contribution to journal (Article)

    Abstract

    Tabular data abound in many data stores: traditional relational databases store tables, and new applications also generate massive tabular datasets. For example, consider the geographic distribution of cell phone traffic at different base stations across the country or the evolution of traffic at Internet routers over time. Detecting similarity patterns in such data sets (e.g., which geographic regions have similar cell phone usage distributions, which IP subnet traffic distributions over time intervals are similar, etc.) is of great importance. Identification of such patterns poses many conceptual challenges (what is a suitable similarity distance function for two “regions”?) as well as technical challenges (how to perform similarity computations efficiently as massive tables accumulate over time?) that we address. We present methods for determining similar regions in massive tabular data. Our methods compute the “distance” between any two subregions of a table: they are approximate, but highly accurate as we prove mathematically, and they are fast, running in time nearly linear in the table size. Our methods are general, since these distance computations can be applied to any mining or similarity algorithm that uses Lp norms. A novelty of our distance computation procedures is that they work for any Lp norm: not only the traditional p = 2 or p = 1, but all p ≤ 2; the choice of p, say fractional p, provides an interesting alternative similarity behavior! We use our algorithms in a detailed experimental study of the clustering patterns in real tabular data obtained from one of AT&T’s data stores and show that our methods are substantially faster than straightforward methods while remaining highly accurate, and able to detect interesting patterns by varying the value of p.
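To make the role of p concrete, here is a minimal sketch of the exact Lp distance between two equal-sized rectangular subregions of a table. It illustrates only the straightforward baseline computation, not the authors' approximate method, which the abstract does not describe in detail. The function name `lp_distance`, its parameters, and the toy table are assumptions made for illustration.

```python
def lp_distance(table, top_left_a, top_left_b, shape, p):
    """Exact L_p distance between two subregions of a table (list of rows).

    Each subregion is given by its top-left (row, col) corner; both share
    the same (height, width) shape. Hypothetical helper, for illustration.
    """
    if not 0 < p <= 2:
        raise ValueError("this sketch only covers 0 < p <= 2")
    height, width = shape
    (row_a, col_a), (row_b, col_b) = top_left_a, top_left_b
    total = 0.0
    for dr in range(height):
        for dc in range(width):
            diff = table[row_a + dr][col_a + dc] - table[row_b + dr][col_b + dc]
            total += abs(diff) ** p  # accumulate |x - y|^p cell by cell
    return total ** (1.0 / p)


# Toy table: each row is treated as one 1x4 "region".
table = [
    [0, 0, 0, 0],  # reference region
    [4, 0, 0, 0],  # differs from the reference by one large spike
    [1, 1, 1, 1],  # differs from the reference by many small amounts
]

for p in (0.5, 1.0, 2.0):
    spike = lp_distance(table, (0, 0), (1, 0), (1, 4), p)
    spread = lp_distance(table, (0, 0), (2, 0), (1, 4), p)
    print(f"p={p}: spike={spike:.2f}, spread={spread:.2f}")
```

In this toy case the two regions are equally far from the reference at p = 1, the spread-out difference is closer at p = 2, and the single spike is closer at p = 0.5. This is one way to read the abstract's remark that fractional p gives a different similarity behavior. Note that for p < 1 the quantity is not a true norm, although the formula is still well defined.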

    Original language: English (US)
    Pages (from-to): 605-614
    Number of pages: 10
    Journal: Proceedings - International Conference on Data Engineering
    DOI: 10.1109/ICDE.2002.994778
    State: Published - Jan 1 2002

    Fingerprint

    Routers
    Base stations
    Internet

    ASJC Scopus subject areas

    • Software
    • Signal Processing
    • Information Systems

    Cite this

    Fast mining of massive tabular data via approximate distance computations. / Cormode, Graham; Indyk, Piotr; Koudas, Nick; Muthukrishnan, Shanmugavelayutham.

    In: Proceedings - International Conference on Data Engineering, 01.01.2002, p. 605-614.

    Cormode, Graham ; Indyk, Piotr ; Koudas, Nick ; Muthukrishnan, Shanmugavelayutham. / Fast mining of massive tabular data via approximate distance computations. In: Proceedings - International Conference on Data Engineering. 2002 ; pp. 605-614.
    @article{12b3ba59c789409eaa04165de56028c3,
    title = "Fast mining of massive tabular data via approximate distance computations",
    abstract = "Tabular data abound in many data stores: traditional relational databases store tables, and new applications also generate massive tabular datasets. For example, consider the geographic distribution of cell phone traffic at different base stations across the country or the evolution of traffic at Internet routers over time. Detecting similarity patterns in such data sets (e.g., which geographic regions have similar cell phone usage distributions, which IP subnet traffic distributions over time intervals are similar, etc.) is of great importance. Identification of such patterns poses many conceptual challenges (what is a suitable similarity distance function for two “regions”?) as well as technical challenges (how to perform similarity computations efficiently as massive tables accumulate over time?) that we address. We present methods for determining similar regions in massive tabular data. Our methods compute the “distance” between any two subregions of a table: they are approximate, but highly accurate as we prove mathematically, and they are fast, running in time nearly linear in the table size. Our methods are general, since these distance computations can be applied to any mining or similarity algorithm that uses Lp norms. A novelty of our distance computation procedures is that they work for any Lp norm: not only the traditional p = 2 or p = 1, but all p ≤ 2; the choice of p, say fractional p, provides an interesting alternative similarity behavior! We use our algorithms in a detailed experimental study of the clustering patterns in real tabular data obtained from one of AT&T’s data stores and show that our methods are substantially faster than straightforward methods while remaining highly accurate, and able to detect interesting patterns by varying the value of p.",
    author = "Graham Cormode and Piotr Indyk and Nick Koudas and Shanmugavelayutham Muthukrishnan",
    year = "2002",
    month = "1",
    day = "1",
    doi = "10.1109/ICDE.2002.994778",
    language = "English (US)",
    pages = "605--614",
    journal = "Proceedings - International Conference on Data Engineering",
    issn = "1084-4627",
    publisher = "Institute of Electrical and Electronics Engineers Inc.",

    }

    TY - JOUR

    T1 - Fast mining of massive tabular data via approximate distance computations

    AU - Cormode, Graham

    AU - Indyk, Piotr

    AU - Koudas, Nick

    AU - Muthukrishnan, Shanmugavelayutham

    PY - 2002/1/1

    Y1 - 2002/1/1

    AB - Tabular data abound in many data stores: traditional relational databases store tables, and new applications also generate massive tabular datasets. For example, consider the geographic distribution of cell phone traffic at different base stations across the country or the evolution of traffic at Internet routers over time. Detecting similarity patterns in such data sets (e.g., which geographic regions have similar cell phone usage distributions, which IP subnet traffic distributions over time intervals are similar, etc.) is of great importance. Identification of such patterns poses many conceptual challenges (what is a suitable similarity distance function for two “regions”?) as well as technical challenges (how to perform similarity computations efficiently as massive tables accumulate over time?) that we address. We present methods for determining similar regions in massive tabular data. Our methods compute the “distance” between any two subregions of a table: they are approximate, but highly accurate as we prove mathematically, and they are fast, running in time nearly linear in the table size. Our methods are general, since these distance computations can be applied to any mining or similarity algorithm that uses Lp norms. A novelty of our distance computation procedures is that they work for any Lp norm: not only the traditional p = 2 or p = 1, but all p ≤ 2; the choice of p, say fractional p, provides an interesting alternative similarity behavior! We use our algorithms in a detailed experimental study of the clustering patterns in real tabular data obtained from one of AT&T’s data stores and show that our methods are substantially faster than straightforward methods while remaining highly accurate, and able to detect interesting patterns by varying the value of p.

    UR - http://www.scopus.com/inward/record.url?scp=0036215013&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=0036215013&partnerID=8YFLogxK

    U2 - 10.1109/ICDE.2002.994778

    DO - 10.1109/ICDE.2002.994778

    M3 - Article

    AN - SCOPUS:0036215013

    SP - 605

    EP - 614

    JO - Proceedings - International Conference on Data Engineering

    JF - Proceedings - International Conference on Data Engineering

    SN - 1084-4627

    ER -