On distributing symmetric streaming computations

Jon Feldman, Shanmugavelayutham Muthukrishnan, Anastasios Sidiropoulos, Cliff Stein, Zoya Svitkina

    Research output: Contribution to journal (Article)

    Abstract

    A common approach for dealing with large datasets is to stream over the input in one pass, and perform computations using sublinear resources. For truly massive datasets, however, even making a single pass over the data is prohibitive. Therefore, streaming computations must be distributed over many machines. In practice, obtaining significant speedups using distributed computation has numerous challenges including synchronization, load balancing, overcoming processor failures, and data distribution. Successful systems in practice such as Google's MapReduce and Apache's Hadoop address these problems by only allowing a certain class of highly distributable tasks defined by local computations that can be applied in any order to the input. The fundamental question that arises is: How does the class of computational tasks supported by these systems differ from the class for which streaming solutions exist? We introduce a simple algorithmic model for massive, unordered, distributed (mud) computation, as implemented by these systems. We show that in principle, mud algorithms are equivalent in power to symmetric streaming algorithms. More precisely, we show that any symmetric (order-invariant) function that can be computed by a streaming algorithm can also be computed by a mud algorithm, with comparable space and communication complexity. Our simulation uses Savitch's theorem and therefore has superpolynomial time complexity. We extend our simulation result to some natural classes of approximate and randomized streaming algorithms. We also give negative
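The contrast the abstract draws between a one-pass streaming computation and an order-independent distributed one can be illustrated with a small Python sketch. It assumes a mud-style algorithm can be described by three pieces: a local function applied to each item, an aggregator that merges two messages, and a post-processing step. The names `phi`, `combine`, `eta`, `mud_eval`, and `streaming_eval` are hypothetical, chosen for this illustration, and the example function (range, i.e. max minus min) is not taken from the paper. The sketch does not reproduce the Savitch-based simulation mentioned above; it only shows a symmetric function for which both styles agree.

```python
import random

def phi(x):
    """Local function: map one input item to a message (min, max)."""
    return (x, x)

def combine(a, b):
    """Aggregator: merge two messages. Associative and commutative here."""
    return (min(a[0], b[0]), max(a[1], b[1]))

def eta(msg):
    """Post-processing: turn the final message into the output value."""
    return msg[1] - msg[0]

def mud_eval(items, rng):
    """Merge messages pairwise in a random order (an arbitrary merge tree)."""
    msgs = [phi(x) for x in items]
    while len(msgs) > 1:
        i, j = rng.sample(range(len(msgs)), 2)
        merged = combine(msgs[i], msgs[j])
        msgs = [m for k, m in enumerate(msgs) if k not in (i, j)]
        msgs.append(merged)
    return eta(msgs[0])

def streaming_eval(items):
    """One-pass streaming computation of the same symmetric function."""
    lo = hi = None
    for x in items:
        lo = x if lo is None else min(lo, x)
        hi = x if hi is None else max(hi, x)
    return hi - lo

data = [7, 3, 9, 1, 4]
rng = random.Random(0)
# Collect the outputs of 20 different random merge orders.
results = {mud_eval(data, rng) for _ in range(20)}
```

Because `combine` is associative and commutative, every merge order yields the same final message, so `results` holds a single value that matches `streaming_eval(data)`. Each message is a pair of numbers, which is the same amount of state the streaming loop keeps.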

    Original language: English (US)
    Article number: 66
    Journal: ACM Transactions on Algorithms
    Volume: 6
    Issue number: 4
    DOI: https://doi.org/10.1145/1824777.1824786
    State: Published - Aug 1 2010

    Keywords

    • Distributed
    • Distributed computations
    • MapReduce
    • Streaming
    • Symmetric

    ASJC Scopus subject areas

    • Mathematics (miscellaneous)

    Cite this

    Feldman, J., Muthukrishnan, S., Sidiropoulos, A., Stein, C., & Svitkina, Z. (2010). On distributing symmetric streaming computations. ACM Transactions on Algorithms, 6(4), [66]. https://doi.org/10.1145/1824777.1824786

    @article{0f00215095b041618d3a87b56ffec053,
    title = "On distributing symmetric streaming computations",
    keywords = "Distributed, Distributed computations, Mapreduce, Streaming, Symmetric",
    author = "Jon Feldman and Shanmugavelayutham Muthukrishnan and Anastasios Sidiropoulos and Cliff Stein and Zoya Svitkina",
    year = "2010",
    month = "8",
    day = "1",
    doi = "10.1145/1824777.1824786",
    language = "English (US)",
    volume = "6",
    journal = "ACM Transactions on Algorithms",
    issn = "1549-6325",
    publisher = "Association for Computing Machinery (ACM)",
    number = "4",

    }

    TY - JOUR

    T1 - On distributing symmetric streaming computations

    AU - Feldman, Jon

    AU - Muthukrishnan, Shanmugavelayutham

    AU - Sidiropoulos, Anastasios

    AU - Stein, Cliff

    AU - Svitkina, Zoya

    PY - 2010/8/1

    Y1 - 2010/8/1

    KW - Distributed

    KW - Distributed computations

    KW - Mapreduce

    KW - Streaming

    KW - Symmetric

    UR - http://www.scopus.com/inward/record.url?scp=77956508106&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=77956508106&partnerID=8YFLogxK

    U2 - 10.1145/1824777.1824786

    DO - 10.1145/1824777.1824786

    M3 - Article

    VL - 6

    JO - ACM Transactions on Algorithms

    JF - ACM Transactions on Algorithms

    SN - 1549-6325

    IS - 4

    M1 - 66

    ER -