Cluster-based delta compression of a collection of files

Z. Ouyang, N. Memon, T. Suel, D. Trendafilov

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    Abstract

    Delta compression techniques are commonly used to succinctly represent an updated version of a file with respect to an earlier one. We study the use of delta compression in a somewhat different scenario, where we wish to compress a large collection of (more or less) related files by performing a sequence of pairwise delta compressions. The problem of finding an optimal delta encoding for a collection of files by taking pairwise deltas can be reduced to the problem of computing a branching of maximum weight in a weighted directed graph, but this solution is inefficient and thus does not scale to larger file collections. This motivates us to propose a framework for cluster-based delta compression that uses text clustering techniques to prune the graph of possible pairwise delta encodings. To demonstrate the efficacy of our approach, we present experimental results on collections of Web pages. Our experiments show that cluster-based delta compression of collections provides significant improvements in compression ratio as compared to individually compressing each file or using tar+gzip, at a moderate cost in efficiency.
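The pipeline described in the abstract (prune the graph of candidate pairwise deltas, then choose a branching over what remains) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not the authors' implementation: `zlib` with a preset dictionary stands in for a real delta compressor, word-shingle Jaccard similarity with a hypothetical `threshold` stands in for the text clustering step, and a greedy edge selection replaces an exact maximum-weight branching algorithm, so the result is a valid branching but not necessarily an optimal one. All function names are hypothetical.

```python
import zlib
from itertools import permutations


def delta_encode(target: bytes, reference: bytes) -> bytes:
    """Compress target using reference as a preset dictionary (delta stand-in)."""
    comp = zlib.compressobj(level=9, zdict=reference)
    return comp.compress(target) + comp.flush()


def delta_decode(delta: bytes, reference: bytes) -> bytes:
    """Recover target from its delta and the same reference."""
    decomp = zlib.decompressobj(zdict=reference)
    return decomp.decompress(delta) + decomp.flush()


def shingles(data: bytes, k: int = 4) -> set:
    """Set of k-word shingles, used as a cheap similarity signature."""
    words = data.split()
    return {tuple(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def candidate_edges(files: dict, threshold: float = 0.2) -> list:
    """Keep only (gain, reference, target) edges between similar files.

    gain is the number of bytes saved by delta-encoding target against
    reference instead of compressing target on its own.
    """
    sigs = {name: shingles(data) for name, data in files.items()}
    edges = []
    for ref, tgt in permutations(files, 2):
        if jaccard(sigs[ref], sigs[tgt]) < threshold:
            continue  # pruned: never compute this pairwise delta
        standalone = len(zlib.compress(files[tgt], 9))
        gain = standalone - len(delta_encode(files[tgt], files[ref]))
        if gain > 0:
            edges.append((gain, ref, tgt))
    return edges


def greedy_branching(files: dict, edges: list) -> dict:
    """Greedy heuristic: at most one reference per file, no cycles.

    Returns a map from each file to its chosen reference, or None if the
    file should be compressed on its own.
    """
    parent = {name: None for name in files}

    def is_ancestor(node, start):
        # Walk reference pointers upward from start, looking for node.
        while start is not None:
            if start == node:
                return True
            start = parent[start]
        return False

    for gain, ref, tgt in sorted(edges, reverse=True):
        if parent[tgt] is None and not is_ancestor(tgt, ref):
            parent[tgt] = ref
    return parent


if __name__ == "__main__":
    page = b"<html><body>shared navigation and layout text for the site " * 20
    files = {
        "a.html": page + b"article one</body></html>",
        "b.html": page + b"article two</body></html>",
    }
    print(greedy_branching(files, candidate_edges(files)))
```

The pruning step is where the cost is saved: the expensive delta is computed only for pairs that pass the similarity check, rather than for all ordered pairs of files. Note that a zlib preset dictionary only helps within the 32 KB deflate window, so a real delta compressor would be needed for larger files.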

    Original language: English (US)
    Title of host publication: WISE 2002 - Proceedings of the 3rd International Conference on Web Information Systems Engineering
    Publisher: Institute of Electrical and Electronics Engineers Inc.
    Pages: 257-266
    Number of pages: 10
    ISBN (Print): 0769517668, 9780769517667
    DOI: https://doi.org/10.1109/WISE.2002.1181662
    State: Published - 2002
    Event: 3rd International Conference on Web Information Systems Engineering, WISE 2002 - Singapore, Singapore
    Duration: Dec 12 2002 to Dec 14 2002


    Fingerprint

    Directed graphs
    Tar
    Websites
    Costs
    Experiments

    ASJC Scopus subject areas

    • Computer Networks and Communications
    • Information Systems
    • Control and Systems Engineering

    Cite this

    Ouyang, Z., Memon, N., Suel, T., & Trendafilov, D. (2002). Cluster-based delta compression of a collection of files. In WISE 2002 - Proceedings of the 3rd International Conference on Web Information Systems Engineering (pp. 257-266). [1181662] Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/WISE.2002.1181662

    @inproceedings{139432df319d4b129ee11a7c24544c94,
    title = "Cluster-based delta compression of a collection of files",
    abstract = "Delta compression techniques are commonly used to succinctly represent an updated version of a file with respect to an earlier one. We study the use of delta compression in a somewhat different scenario, where we wish to compress a large collection of (more or less) related files by performing a sequence of pairwise delta compressions. The problem of finding an optimal delta encoding for a collection of files by taking pairwise deltas can be reduced to the problem of computing a branching of maximum weight in a weighted directed graph, but this solution is inefficient and thus does not scale to larger file collections. This motivates us to propose a framework for cluster-based delta compression that uses text clustering techniques to prune the graph of possible pairwise delta encodings. To demonstrate the efficacy of our approach, we present experimental results on collections of Web pages. Our experiments show that cluster-based delta compression of collections provides significant improvements in compression ratio as compared to individually compressing each file or using tar+gzip, at a moderate cost in efficiency.",
    author = "Z. Ouyang and N. Memon and T. Suel and D. Trendafilov",
    year = "2002",
    doi = "10.1109/WISE.2002.1181662",
    language = "English (US)",
    isbn = "0769517668",
    pages = "257--266",
    booktitle = "WISE 2002 - Proceedings of the 3rd International Conference on Web Information Systems Engineering",
    publisher = "Institute of Electrical and Electronics Engineers Inc.",
    }

    TY - GEN

    T1 - Cluster-based delta compression of a collection of files

    AU - Ouyang, Z.

    AU - Memon, N.

    AU - Suel, T.

    AU - Trendafilov, D.

    PY - 2002

    Y1 - 2002

    N2 - Delta compression techniques are commonly used to succinctly represent an updated version of a file with respect to an earlier one. We study the use of delta compression in a somewhat different scenario, where we wish to compress a large collection of (more or less) related files by performing a sequence of pairwise delta compressions. The problem of finding an optimal delta encoding for a collection of files by taking pairwise deltas can be reduced to the problem of computing a branching of maximum weight in a weighted directed graph, but this solution is inefficient and thus does not scale to larger file collections. This motivates us to propose a framework for cluster-based delta compression that uses text clustering techniques to prune the graph of possible pairwise delta encodings. To demonstrate the efficacy of our approach, we present experimental results on collections of Web pages. Our experiments show that cluster-based delta compression of collections provides significant improvements in compression ratio as compared to individually compressing each file or using tar+gzip, at a moderate cost in efficiency.

    UR - http://www.scopus.com/inward/record.url?scp=84961214036&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=84961214036&partnerID=8YFLogxK

    U2 - 10.1109/WISE.2002.1181662

    DO - 10.1109/WISE.2002.1181662

    M3 - Conference contribution

    AN - SCOPUS:84961214036

    SN - 0769517668

    SN - 9780769517667

    SP - 257

    EP - 266

    BT - WISE 2002 - Proceedings of the 3rd International Conference on Web Information Systems Engineering

    PB - Institute of Electrical and Electronics Engineers Inc.

    ER -