Improved file synchronization techniques for maintaining large replicated collections over slow networks

Torsten Suel, Patrick Noel, Dimitre Trendafilov

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    Abstract

    We study the problem of maintaining large replicated collections of files or documents in a distributed environment with limited bandwidth. This problem arises in a number of important applications, such as synchronization of data between accounts or devices, content distribution and web caching networks, web site mirroring, storage networks, and large-scale web search and mining. At the core of the problem lies the following challenge, called the file synchronization problem: given two versions of a file on different machines, say an outdated and a current one, how can we update the outdated version with minimum communication cost, by exploiting the significant similarity between the versions? While a popular open source tool for this problem called rsync is used in hundreds of thousands of installations, there have been only a few attempts to improve upon this tool in practice. In this paper, we propose a framework for remote file synchronization and describe several new techniques that result in significant bandwidth savings. Our focus is on applications where very large collections have to be maintained over slow connections. We show that a prototype implementation of our framework and techniques achieves significant improvements over rsync. As an example application, we focus on the efficient synchronization of very large web page collections for the purpose of search, mining, and content distribution.
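
    For readers unfamiliar with the underlying mechanism, the sketch below illustrates the block-matching idea behind rsync-style synchronization, which this work builds on: the holder of the outdated file sends one checksum per block, and the holder of the current file replies with a delta consisting of references to those blocks plus literal bytes for everything else. This is a simplified Python illustration with a hypothetical fixed block size, using MD5 in place of rsync's rolling and strong checksum pair; it does not reproduce the improved techniques proposed in the paper.

    import hashlib

    BLOCK = 4096  # hypothetical fixed block size; real tools choose this more carefully

    def block_signatures(old_data: bytes) -> dict:
        # Receiver side: checksum each block of the outdated file.
        sigs = {}
        for i in range(0, len(old_data), BLOCK):
            sigs[hashlib.md5(old_data[i:i + BLOCK]).hexdigest()] = i
        return sigs

    def make_delta(new_data: bytes, sigs: dict) -> list:
        # Sender side: scan the current file; where a window matches a block the
        # receiver already has, emit a cheap ("copy", offset) reference, otherwise
        # accumulate literal bytes. rsync keeps this scan fast with a rolling
        # checksum; recomputing MD5 at every offset here is only for clarity.
        delta, pos, literal = [], 0, bytearray()
        while pos < len(new_data):
            h = hashlib.md5(new_data[pos:pos + BLOCK]).hexdigest()
            if h in sigs:
                if literal:
                    delta.append(("literal", bytes(literal)))
                    literal = bytearray()
                delta.append(("copy", sigs[h]))
                pos += BLOCK
            else:
                literal.append(new_data[pos])
                pos += 1
        if literal:
            delta.append(("literal", bytes(literal)))
        return delta

    def apply_delta(old_data: bytes, delta: list) -> bytes:
        # Receiver side: rebuild the current file from local old blocks plus the
        # literal bytes that actually crossed the network.
        out = bytearray()
        for kind, val in delta:
            out += old_data[val:val + BLOCK] if kind == "copy" else val
        return bytes(out)

    if __name__ == "__main__":
        old = b"the quick brown fox jumps over the lazy dog " * 200
        new = old[:5000] + b"a small local edit" + old[5000:]
        delta = make_delta(new, block_signatures(old))
        # Blocks untouched by the edit are sent as short references, not bytes.
        assert apply_delta(old, delta) == new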

    Original language: English (US)
    Title of host publication: Proceedings - 20th International Conference on Data Engineering - ICDE 2004
    Pages: 153-164
    Number of pages: 12
    Volume: 20
    DOIs: https://doi.org/10.1109/ICDE.2004.1319992
    State: Published - 2004
    Event: 20th International Conference on Data Engineering - ICDE 2004 - Boston, MA, United States
    Duration: Mar 30, 2004 - Apr 2, 2004

    Fingerprint

    Synchronization
    Websites
    Bandwidth
    World Wide Web
    Communication
    Costs

    ASJC Scopus subject areas

    • Software
    • Engineering (all)
    • Engineering (miscellaneous)

    Cite this

    Suel, T., Noel, P., & Trendafilov, D. (2004). Improved file synchronization techniques for maintaining large replicated collections over slow networks. In Proceedings - 20th International Conference on Data Engineering - ICDE 2004 (Vol. 20, pp. 153-164). https://doi.org/10.1109/ICDE.2004.1319992
