Approximate string joins in a database (almost) for free

Luis Gravano, Panagiotis G. Ipeirotis, H. V. Jagadish, Nick Koudas, Shanmugavelayutham Muthukrishnan, Divesh Srivastava

    Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

    Abstract

    String data is ubiquitous, and its management has taken on particular importance in the past few years. Approximate queries are very important on string data especially for more complex queries involving joins. This is due, for example, to the prevalence of typographical errors in data, and multiple conventions for recording attributes such as name and address. Commercial databases do not support approximate string joins directly, and it is a challenge to implement this functionality efficiently with user-defined functions (UDFs). In this paper, we develop a technique for building approximate string join capabilities on top of commercial databases by exploiting facilities already available in them. At the core, our technique relies on matching short substrings of length q, called q-grams, and taking into account both positions of individual matches and the total number of such matches. Our approach applies to both approximate full string matching and approximate substring matching, with a variety of possible edit distance functions. The approximate string match predicate, with a suitable edit distance threshold, can be mapped into a vanilla relational expression and optimized by conventional relational optimizers. We demonstrate experimentally the benefits of our technique over the direct use of UDFs, using commercial database systems and real data. To study the I/O and CPU behavior of approximate string join algorithms with variations in edit distance and q-gram length, we also describe detailed experiments based on a prototype implementation.
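
The q-gram idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `#` and `$` padding characters (assumed not to occur in the data), and the lower bound max(|a|, |b|) - 1 - (k - 1)q on matching q-grams are assumptions drawn from the usual formulation of q-gram filtering. The sketch uses cheap length, position, and count checks to discard pairs that cannot be within edit distance k, and runs the expensive edit distance computation only on the survivors.

```python
def positional_qgrams(s, q=3):
    """Pad s with q-1 sentinel characters per side; list (position, q-gram) pairs."""
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return [(i, padded[i:i + q]) for i in range(len(padded) - q + 1)]


def edit_distance(a, b):
    """Plain Levenshtein distance with unit-cost insert, delete, substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # match or substitute
        prev = cur
    return prev[-1]


def passes_filters(a, b, k, q=3):
    """Necessary (not sufficient) conditions for edit_distance(a, b) <= k."""
    # Length filter: k edits change the length by at most k.
    if abs(len(a) - len(b)) > k:
        return False
    # Position filter: only count equal q-grams whose positions differ by <= k.
    grams_b = positional_qgrams(b, q)
    matches = sum(1
                  for i, g in positional_qgrams(a, q)
                  for j, h in grams_b
                  if g == h and abs(i - j) <= k)
    # Count filter: assumed lower bound on the number of matching q-grams.
    return matches >= max(len(a), len(b)) - 1 - (k - 1) * q


def approx_join(left, right, k, q=3):
    """Return pairs within edit distance k; filters prune before verification."""
    return [(a, b)
            for a in left
            for b in right
            if passes_filters(a, b, k, q) and edit_distance(a, b) <= k]


if __name__ == "__main__":
    print(approx_join(["john smith", "mary jones"],
                      ["jon smith", "marie jones"], k=1))
```

The nested loops stand in for what the abstract describes as a relational expression: in a database, the q-grams would presumably live in auxiliary tables and the three filters would become join and grouping conditions, leaving the edit distance function to verify only the remaining candidate pairs.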

    Original language: English (US)
    Title of host publication: VLDB 2001 - Proceedings of 27th International Conference on Very Large Data Bases
    Editors: Peter M. G. Apers, Paolo Atzeni, Richard T. Snodgrass, Stefano Ceri, Kotagiri Ramamohanarao, Stefano Paraboschi
    Publisher: Morgan Kaufmann
    Pages: 491-500
    Number of pages: 10
    ISBN (Electronic): 1558608044, 9781558608047
    State: Published - Jan 1 2001
    Event: 27th International Conference on Very Large Data Bases, VLDB 2001 - Roma, Italy
    Duration: Sep 11 2001 to Sep 14 2001

    Publication series

    Name: VLDB 2001 - Proceedings of 27th International Conference on Very Large Data Bases

    ASJC Scopus subject areas

    • Information Systems and Management
    • Computer Science Applications
    • Hardware and Architecture
    • Software
    • Computer Networks and Communications
    • Information Systems

    Cite this

    Gravano, L., Ipeirotis, P. G., Jagadish, H. V., Koudas, N., Muthukrishnan, S., & Srivastava, D. (2001). Approximate string joins in a database (almost) for free. In P. M. G. Apers, P. Atzeni, R. T. Snodgrass, S. Ceri, K. Ramamohanarao, & S. Paraboschi (Eds.), VLDB 2001 - Proceedings of 27th International Conference on Very Large Data Bases (pp. 491-500). (VLDB 2001 - Proceedings of 27th International Conference on Very Large Data Bases). Morgan Kaufmann.

    @inproceedings{86b9e5cda3c54ab5a78b8cb00cb56775,
    title = "Approximate string joins in a database (almost) for free",
    author = "Luis Gravano and Ipeirotis, {Panagiotis G.} and Jagadish, {H. V.} and Nick Koudas and Shanmugavelayutham Muthukrishnan and Divesh Srivastava",
    year = "2001",
    month = "1",
    day = "1",
    language = "English (US)",
    series = "VLDB 2001 - Proceedings of 27th International Conference on Very Large Data Bases",
    publisher = "Morgan Kaufmann",
    pages = "491--500",
    editor = "Apers, {Peter M. G.} and Paolo Atzeni and Snodgrass, {Richard T.} and Stefano Ceri and Kotagiri Ramamohanarao and Stefano Paraboschi",
    booktitle = "VLDB 2001 - Proceedings of 27th International Conference on Very Large Data Bases",

    }

    TY - GEN

    T1 - Approximate string joins in a database (almost) for free

    AU - Gravano, Luis

    AU - Ipeirotis, Panagiotis G.

    AU - Jagadish, H. V.

    AU - Koudas, Nick

    AU - Muthukrishnan, Shanmugavelayutham

    AU - Srivastava, Divesh

    PY - 2001/1/1

    Y1 - 2001/1/1

    UR - http://www.scopus.com/inward/record.url?scp=84944318804&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=84944318804&partnerID=8YFLogxK

    M3 - Conference contribution

    AN - SCOPUS:84944318804

    T3 - VLDB 2001 - Proceedings of 27th International Conference on Very Large Data Bases

    SP - 491

    EP - 500

    BT - VLDB 2001 - Proceedings of 27th International Conference on Very Large Data Bases

    A2 - Apers, Peter M. G.

    A2 - Atzeni, Paolo

    A2 - Snodgrass, Richard T.

    A2 - Ceri, Stefano

    A2 - Ramamohanarao, Kotagiri

    A2 - Paraboschi, Stefano

    PB - Morgan Kaufmann

    ER -