Making interval-based clustering rank-aware

Julia Stoyanovich, Sihem Amer-Yahia, Tova Milo

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    Abstract

    In online applications, such as online dating, users often query and rank large collections of structured items. Top results tend to be homogeneous, which hinders data exploration. For example, a dating website user who is looking for a partner between 20 and 40 years old, and who sorts the matches by income from higher to lower, will see a large number of matches in their late 30s who hold an MBA degree and work in the financial industry, before seeing any matches in different age groups and walks of life. An alternative to presenting results in a ranked list is to find clusters in the result space, identified by a combination of attributes that correlate with rank. Such clusters may describe matches between 35 and 40 with an MBA, matches between 25 and 30 who work in the software industry, etc., allowing for data exploration of ranked results. We refer to the problem of finding such clusters as rank-aware interval-based clustering and argue that it is not addressed by standard clustering algorithms. We formally define the problem and, to solve it, propose a novel measure of locality, together with a family of clustering quality measures appropriate for this application scenario. These ingredients may be used by a variety of clustering algorithms, and we present BARAC, a particular subspace-clustering algorithm that enables rank-aware interval-based clustering in domains with heterogeneous attributes. We validate the effectiveness of our approach with a large-scale user study, and perform an extensive experimental evaluation of efficiency, demonstrating that our methods are practical on the large scale. Our evaluation is performed on large datasets from Yahoo! Personals, a leading online dating site, and on restaurant data from Yahoo! Local.
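    The motivating example can be made concrete with a toy sketch. The code below is not BARAC and does not use the paper's locality or quality measures. It only assumes a hypothetical list of matches with an age and an income, ranks them by income, and groups them into fixed age intervals (the 5-year width, the data values, and all function names are assumptions made for illustration).

```python
from collections import defaultdict

# Hypothetical matches: (name, age, income). All values are made up.
MATCHES = [
    ("a", 38, 250), ("b", 37, 240), ("c", 39, 230), ("d", 36, 220),
    ("e", 27, 150), ("f", 26, 140), ("g", 22, 90), ("h", 31, 120),
]


def rank_by_income(matches):
    """Return matches sorted by income, highest first (rank 1 = best)."""
    return sorted(matches, key=lambda m: m[2], reverse=True)


def group_by_age_interval(ranked, width=5):
    """Map each age interval [lo, lo + width) to the ranks of its members."""
    groups = defaultdict(list)
    for rank, (_, age, _) in enumerate(ranked, start=1):
        lo = (age // width) * width  # left endpoint of the fixed-width interval
        groups[(lo, lo + width)].append(rank)
    return dict(groups)


ranked = rank_by_income(MATCHES)
top4_ages = [age for _, age, _ in ranked[:4]]  # ages seen first in a ranked list
groups = group_by_age_interval(ranked)         # interval view of the same ranking
```

    In this toy data the top four ranked results all fall in the 35 to 40 age interval, whereas the interval view lists the 25 to 30 group alongside it, so a user can jump to the best-ranked matches in another age group without scrolling through the list.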

    Original language: English (US)
    Title of host publication: Advances in Database Technology - EDBT 2011
    Subtitle of host publication: 14th International Conference on Extending Database Technology, Proceedings
    Pages: 437-448
    Number of pages: 12
    DOI: https://doi.org/10.1145/1951365.1951417
    State: Published - Apr 18 2011
    Event: 14th International Conference on Extending Database Technology: Advances in Database Technology, EDBT 2011 - Uppsala, Sweden
    Duration: Mar 22 2011 to Mar 24 2011

    Other

    Other: 14th International Conference on Extending Database Technology: Advances in Database Technology, EDBT 2011
    Country: Sweden
    City: Uppsala
    Period: 3/22/11 to 3/24/11

    Keywords

    • Clustering
    • Data exploration
    • Ranking

    ASJC Scopus subject areas

    • Software
    • Human-Computer Interaction
    • Computer Vision and Pattern Recognition
    • Computer Networks and Communications

    Cite this

    Stoyanovich, J., Amer-Yahia, S., & Milo, T. (2011). Making interval-based clustering rank-aware. In Advances in Database Technology - EDBT 2011: 14th International Conference on Extending Database Technology, Proceedings (pp. 437-448) https://doi.org/10.1145/1951365.1951417

    @inproceedings{44045b38dd0043e3b5eb58a72c05bed7,
    title = "Making interval-based clustering rank-aware",
    keywords = "Clustering, Data exploration, Ranking",
    author = "Julia Stoyanovich and Sihem Amer-Yahia and Tova Milo",
    year = "2011",
    month = "4",
    day = "18",
    doi = "10.1145/1951365.1951417",
    language = "English (US)",
    isbn = "9781450305280",
    pages = "437--448",
    booktitle = "Advances in Database Technology - EDBT 2011",

    }

    TY - GEN

    T1 - Making interval-based clustering rank-aware

    AU - Stoyanovich, Julia

    AU - Amer-Yahia, Sihem

    AU - Milo, Tova

    PY - 2011/4/18

    Y1 - 2011/4/18

    KW - Clustering

    KW - Data exploration

    KW - Ranking

    UR - http://www.scopus.com/inward/record.url?scp=79953839805&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=79953839805&partnerID=8YFLogxK

    U2 - 10.1145/1951365.1951417

    DO - 10.1145/1951365.1951417

    M3 - Conference contribution

    SN - 9781450305280

    SP - 437

    EP - 448

    BT - Advances in Database Technology - EDBT 2011

    ER -