Scalable computation of distributions from large scale data sets

Abon Chaudhuri, Teng Yok Lee, Bo Zhou, Cong Wang, Tiantian Xu, Han Wei Shen, Tom Peterka, Yi-Jen Chiang

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    Abstract

    As we approach the era of exascale computing, the role of distributions to summarize, analyze and visualize large scale data is becoming more and more important. Since histograms continue to be a popular way of modeling the underlying data distribution, we propose a scalable and distributed framework for computing histograms from scalar and vector data at different levels of detail required by various types of analysis algorithms. We present efficient parallel techniques for histogram computation from regular as well as rectilinear grid data. We also study a technique called cross-validation to estimate the quality of computed histograms as a model of the actual data distribution. We parallelize cross-validation in a scalable manner to support histogram evaluation and selection of histogram parameters such as number of bins. We also present our distributed software framework for supporting science applications which require large scale distribution-based data analysis. The presented case studies highlight how the proposed algorithms and the related software benefit information theoretic and other distribution-driven analysis.
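The two ideas named in the abstract, merging histograms computed independently on data blocks and scoring a histogram by cross-validation to choose the number of bins, can be illustrated with a small serial sketch. The abstract does not state which estimator the authors use, so the score below is an assumption: the standard leave-one-out cross-validation risk for histograms, J(h) = 2/((n-1)h) - (n+1)/((n-1)h) * sum_j p_j^2, where h is the bin width and p_j is the fraction of samples in bin j. All function names are hypothetical, and the per-block loop only stands in for what would be a distributed reduction in a parallel setting.

```python
import random


def block_histogram(values, lo, hi, bins):
    """Count the values of one data block in `bins` equal-width bins over [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        if lo <= v <= hi:
            # Clamp so that v == hi falls into the last bin.
            idx = min(int((v - lo) / width), bins - 1)
            counts[idx] += 1
    return counts


def merge_histograms(partials):
    """Sum per-block counts bin by bin (the reduction step)."""
    return [sum(column) for column in zip(*partials)]


def cv_risk(counts, lo, hi):
    """Leave-one-out cross-validation risk of a histogram; lower is better.

    Assumed estimator (not taken from the paper):
    J(h) = 2/((n-1)h) - (n+1)/((n-1)h) * sum_j p_j^2
    """
    n = sum(counts)
    if n < 2:
        raise ValueError("need at least two samples")
    h = (hi - lo) / len(counts)
    sum_p2 = sum((c / n) ** 2 for c in counts)
    return 2.0 / ((n - 1) * h) - (n + 1) / ((n - 1) * h) * sum_p2


def select_bins(blocks, lo, hi, candidates):
    """Return (bin_count, risk) for the candidate with the lowest risk."""
    best = None
    for m in candidates:
        merged = merge_histograms([block_histogram(b, lo, hi, m) for b in blocks])
        risk = cv_risk(merged, lo, hi)
        if best is None or risk < best[1]:
            best = (m, risk)
    return best


if __name__ == "__main__":
    rng = random.Random(0)
    # Four synthetic "blocks" of scalar samples clipped to [0, 1].
    blocks = [
        [min(max(rng.gauss(0.5, 0.15), 0.0), 1.0) for _ in range(2500)]
        for _ in range(4)
    ]
    m, risk = select_bins(blocks, 0.0, 1.0, [4, 8, 16, 32, 64, 128])
    print("selected bins:", m, "risk:", risk)
```

Because the risk depends only on the bin counts, each block can contribute its partial counts independently and the score is evaluated once on the merged histogram, which is what makes this kind of evaluation straightforward to distribute.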

    Original language: English (US)
    Title of host publication: IEEE Symposium on Large Data Analysis and Visualization 2012, LDAV 2012 - Proceedings
    Pages: 113-120
    Number of pages: 8
    DOI: https://doi.org/10.1109/LDAV.2012.6378985
    State: Published - 2012
    Event: 2nd Symposium on Large-Scale Data Analysis and Visualization, LDAV 2012 - Seattle, WA, United States
    Duration: Oct 14 2012 to Oct 19 2012

    ASJC Scopus subject areas

    • Computer Vision and Pattern Recognition
    • Information Systems

    Cite this

    Chaudhuri, A., Lee, T. Y., Zhou, B., Wang, C., Xu, T., Shen, H. W., ... Chiang, Y-J. (2012). Scalable computation of distributions from large scale data sets. In IEEE Symposium on Large Data Analysis and Visualization 2012, LDAV 2012 - Proceedings (pp. 113-120). [6378985] https://doi.org/10.1109/LDAV.2012.6378985

    Scalable computation of distributions from large scale data sets. / Chaudhuri, Abon; Lee, Teng Yok; Zhou, Bo; Wang, Cong; Xu, Tiantian; Shen, Han Wei; Peterka, Tom; Chiang, Yi-Jen.

    IEEE Symposium on Large Data Analysis and Visualization 2012, LDAV 2012 - Proceedings. 2012. p. 113-120 6378985.

    Chaudhuri, A, Lee, TY, Zhou, B, Wang, C, Xu, T, Shen, HW, Peterka, T & Chiang, Y-J 2012, Scalable computation of distributions from large scale data sets. in IEEE Symposium on Large Data Analysis and Visualization 2012, LDAV 2012 - Proceedings., 6378985, pp. 113-120, 2nd Symposium on Large-Scale Data Analysis and Visualization, LDAV 2012, Seattle, WA, United States, 10/14/12. https://doi.org/10.1109/LDAV.2012.6378985
    Chaudhuri A, Lee TY, Zhou B, Wang C, Xu T, Shen HW et al. Scalable computation of distributions from large scale data sets. In IEEE Symposium on Large Data Analysis and Visualization 2012, LDAV 2012 - Proceedings. 2012. p. 113-120. 6378985 https://doi.org/10.1109/LDAV.2012.6378985
    Chaudhuri, Abon ; Lee, Teng Yok ; Zhou, Bo ; Wang, Cong ; Xu, Tiantian ; Shen, Han Wei ; Peterka, Tom ; Chiang, Yi-Jen. / Scalable computation of distributions from large scale data sets. IEEE Symposium on Large Data Analysis and Visualization 2012, LDAV 2012 - Proceedings. 2012. pp. 113-120
    @inproceedings{fac146802b7f4b35a4194f11d1dcb8af,
    title = "Scalable computation of distributions from large scale data sets",
    author = "Abon Chaudhuri and Lee, {Teng Yok} and Bo Zhou and Cong Wang and Tiantian Xu and Shen, {Han Wei} and Tom Peterka and Yi-Jen Chiang",
    year = "2012",
    doi = "10.1109/LDAV.2012.6378985",
    language = "English (US)",
    isbn = "9781467347334",
    pages = "113--120",
    booktitle = "IEEE Symposium on Large Data Analysis and Visualization 2012, LDAV 2012 - Proceedings",

    }

    TY - GEN
    T1 - Scalable computation of distributions from large scale data sets
    AU - Chaudhuri, Abon
    AU - Lee, Teng Yok
    AU - Zhou, Bo
    AU - Wang, Cong
    AU - Xu, Tiantian
    AU - Shen, Han Wei
    AU - Peterka, Tom
    AU - Chiang, Yi-Jen
    PY - 2012
    Y1 - 2012
    UR - http://www.scopus.com/inward/record.url?scp=84872198627&partnerID=8YFLogxK
    UR - http://www.scopus.com/inward/citedby.url?scp=84872198627&partnerID=8YFLogxK
    U2 - 10.1109/LDAV.2012.6378985
    DO - 10.1109/LDAV.2012.6378985
    M3 - Conference contribution
    AN - SCOPUS:84872198627
    SN - 9781467347334
    SP - 113
    EP - 120
    BT - IEEE Symposium on Large Data Analysis and Visualization 2012, LDAV 2012 - Proceedings
    ER -