Cluster-aware compression with provable k-means preservation

Nikolaos Freris, Michail Vlachos, Deepak S. Turaga

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    Abstract

    This work rigorously explores the design of cluster-preserving compression schemes for high-dimensional data. We focus on the K-means algorithm and identify conditions under which running the algorithm on the compressed data yields the same clustering outcome as on the original. Compression uses single-bit and multi-bit minimum mean square error (MMSE) quantization schemes together with a given clustering assignment of the original data. We provide theoretical guarantees on post-quantization cluster preservation under certain conditions on the cluster structure, and propose an additional data transformation that can ensure cluster preservation unconditionally; this transformation is invertible and thus induces virtually no distortion on the compressed data. In addition, we provide an efficient scheme for multi-bit allocation, per cluster and data dimension, which enables a trade-off between high compression efficiency and low data distortion. In our experimental studies, the suggested scheme preserved the clusters in all cases while incurring minimal distortion on the data shapes. Our results can find many applications, e.g., in (a) clustering, analysis and distribution of massive datasets, where the proposed data compression can boost performance while providing provable guarantees on the clustering result, and (b) cloud computing services, as the optional transformation provides a data-hiding functionality in addition to preserving the K-means clustering outcome.
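
    The single-bit quantization idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the threshold in each dimension is the cluster mean, and the function names `one_bit_quantize` and `centroids` are hypothetical. Within each cluster and dimension, every value is replaced by the mean of the values on its side of the threshold. Because each group is replaced by its own mean, the cluster centroids are unchanged; whether K-means then returns the same assignment depends on the conditions established in the paper.

```python
from math import isclose
from statistics import fmean


def one_bit_quantize(points, labels):
    """Two-level quantization per cluster and per dimension.

    Assumption (not taken from the paper): the threshold is the cluster
    mean in that dimension. Each value is replaced by the mean of the
    values on its side of the threshold.
    """
    dims = len(points[0])
    out = [list(p) for p in points]
    for c in set(labels):
        idx = [i for i, lab in enumerate(labels) if lab == c]
        for d in range(dims):
            vals = [points[i][d] for i in idx]
            mu = fmean(vals)
            upper = [v for v in vals if v >= mu]
            lower = [v for v in vals if v < mu]
            # Fall back to the other level if one side happens to be empty.
            hi = fmean(upper) if upper else fmean(lower)
            lo = fmean(lower) if lower else hi
            for i in idx:
                out[i][d] = hi if points[i][d] >= mu else lo
    return out


def centroids(points, labels):
    """Mean of each cluster in every dimension, keyed by cluster label."""
    dims = len(points[0])
    return {
        c: [fmean(p[d] for p, lab in zip(points, labels) if lab == c)
            for d in range(dims)]
        for c in sorted(set(labels))
    }


# Toy data: five 2-D points with a given clustering assignment.
points = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [10.0, 10.0], [12.0, 14.0]]
labels = [0, 0, 0, 1, 1]

quantized = one_bit_quantize(points, labels)

# Centroids before and after quantization coincide.
before, after = centroids(points, labels), centroids(quantized, labels)
for c in before:
    assert all(isclose(a, b) for a, b in zip(before[c], after[c]))
```

    In this sketch each value costs one bit, plus two stored levels per cluster and dimension.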

    Original language: English (US)
    Title of host publication: Proceedings of the 12th SIAM International Conference on Data Mining, SDM 2012
    Pages: 82-93
    Number of pages: 12
    State: Published - Dec 1 2012
    Event: 12th SIAM International Conference on Data Mining, SDM 2012 - Anaheim, CA, United States
    Duration: Apr 26 2012 to Apr 28 2012

    Other

    Other: 12th SIAM International Conference on Data Mining, SDM 2012
    Country: United States
    City: Anaheim, CA
    Period: 4/26/12 to 4/28/12

    Fingerprint

    Data compression
    Cloud computing
    Mean square error

    Keywords

    • Cluster preservation
    • Clustering
    • Compression
    • K-means
    • MMSE quantization

    ASJC Scopus subject areas

    • Computer Science Applications

    Cite this

    Freris, N., Vlachos, M., & Turaga, D. S. (2012). Cluster-aware compression with provable k-means preservation. In Proceedings of the 12th SIAM International Conference on Data Mining, SDM 2012 (pp. 82-93)

    @inproceedings{c08ed03425214297bb6703d5b842b396,
    title = "Cluster-aware compression with provable k-means preservation",
    keywords = "Cluster preservation, Clustering, Compression, K-means, MMSE quantization",
    author = "Nikolaos Freris and Michail Vlachos and Turaga, {Deepak S.}",
    year = "2012",
    month = "12",
    day = "1",
    language = "English (US)",
    isbn = "9781611972320",
    pages = "82--93",
    booktitle = "Proceedings of the 12th SIAM International Conference on Data Mining, SDM 2012",

    }

    TY - GEN

    T1 - Cluster-aware compression with provable k-means preservation

    AU - Freris, Nikolaos

    AU - Vlachos, Michail

    AU - Turaga, Deepak S.

    PY - 2012/12/1

    Y1 - 2012/12/1

    KW - Cluster preservation

    KW - Clustering

    KW - Compression

    KW - K-means

    KW - MMSE quantization

    UR - http://www.scopus.com/inward/record.url?scp=84880213193&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=84880213193&partnerID=8YFLogxK

    M3 - Conference contribution

    SN - 9781611972320

    SP - 82

    EP - 93

    BT - Proceedings of the 12th SIAM International Conference on Data Mining, SDM 2012

    ER -