DataSynthesizer: Privacy-preserving synthetic datasets

Haoyue Ping, Julia Stoyanovich, Bill Howe

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    Abstract

    To facilitate collaboration over sensitive data, we present DataSynthesizer, a tool that takes a sensitive dataset as input and generates a structurally and statistically similar synthetic dataset with strong privacy guarantees. The data owners need not release their data, while potential collaborators can begin developing models and methods with some confidence that their results will work similarly on the real dataset. The distinguishing feature of DataSynthesizer is its usability: the data owner does not have to specify any parameters to start generating and sharing data safely and effectively. DataSynthesizer consists of three high-level modules: DataDescriber, DataGenerator, and ModelInspector. The first, DataDescriber, investigates the data types, correlations and distributions of the attributes in the private dataset, and produces a data summary, adding noise to the distributions to preserve privacy. DataGenerator samples from the summary computed by DataDescriber and outputs synthetic data. ModelInspector shows an intuitive description of the data summary that was computed by DataDescriber, allowing the data owner to evaluate the accuracy of the summarization process and adjust any parameters, if desired. We describe DataSynthesizer and illustrate its use in an urban science context, where sharing sensitive, legally encumbered data between agencies and with outside collaborators is reported as the primary obstacle to data-driven governance.
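
The describe-then-generate flow in the abstract can be illustrated with a minimal sketch for a single categorical attribute. This is not the authors' implementation or the tool's API: the function names `describe` and `generate`, the `epsilon` parameter, and the choice of the Laplace mechanism with scale 1/epsilon (assuming neighbouring datasets differ by adding or removing one record) are all assumptions made for illustration.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def describe(values, domain, epsilon, rng):
    """Hypothetical describer step: a noisy, normalized histogram."""
    counts = Counter(values)
    # Assumption: each count changes by at most 1 when one record is
    # added or removed, so Laplace noise with scale 1/epsilon is used.
    noisy = {
        v: max(0.0, counts.get(v, 0) + laplace_noise(1.0 / epsilon, rng))
        for v in domain
    }
    total = sum(noisy.values())
    if total == 0:
        # All noisy counts were clipped to zero: fall back to uniform.
        return {v: 1.0 / len(domain) for v in domain}
    return {v: c / total for v, c in noisy.items()}


def generate(summary, n, rng):
    """Hypothetical generator step: sample n values from the summary only."""
    values = list(summary)
    weights = [summary[v] for v in values]
    return rng.choices(values, weights=weights, k=n)


if __name__ == "__main__":
    rng = random.Random(0)
    private = ["A"] * 50 + ["B"] * 30 + ["C"] * 20  # made-up toy data
    summary = describe(private, ["A", "B", "C"], epsilon=1.0, rng=rng)
    print(summary)
    print(generate(summary, 10, rng))
```

The generator reads only the noisy summary, never the private values, which mirrors the separation between the two modules described above.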

    Original language: English (US)
    Title of host publication: SSDBM 2017
    Subtitle of host publication: 29th International Conference on Scientific and Statistical Database Management
    Publisher: Association for Computing Machinery
    Volume: Part F128636
    ISBN (Electronic): 9781450352826
    DOI: 10.1145/3085504.3091117
    State: Published - Jun 27 2017
    Event: 29th International Conference on Scientific and Statistical Database Management, SSDBM 2017 - Chicago, United States
    Duration: Jun 27 2017 to Jun 29 2017

    Other

    Other: 29th International Conference on Scientific and Statistical Database Management, SSDBM 2017
    Country: United States
    City: Chicago
    Period: 6/27/17 to 6/29/17

    Keywords

    • Data sharing
    • Differential privacy
    • Synthetic data

    ASJC Scopus subject areas

    • Human-Computer Interaction
    • Computer Networks and Communications
    • Computer Vision and Pattern Recognition
    • Software

    Cite this

    Ping, H., Stoyanovich, J., & Howe, B. (2017). DataSynthesizer: Privacy-preserving synthetic datasets. In SSDBM 2017: 29th International Conference on Scientific and Statistical Database Management (Vol. Part F128636). [a42] Association for Computing Machinery. https://doi.org/10.1145/3085504.3091117

    @inproceedings{119a857d08ee43ccb57b6cb7e23a5c8a,
    title = "DataSynthesizer: Privacy-preserving synthetic datasets",
    abstract = "To facilitate collaboration over sensitive data, we present DataSynthesizer, a tool that takes a sensitive dataset as input and generates a structurally and statistically similar synthetic dataset with strong privacy guarantees. The data owners need not release their data, while potential collaborators can begin developing models and methods with some confidence that their results will work similarly on the real dataset. The distinguishing feature of DataSynthesizer is its usability: the data owner does not have to specify any parameters to start generating and sharing data safely and effectively. DataSynthesizer consists of three high-level modules: DataDescriber, DataGenerator, and ModelInspector. The first, DataDescriber, investigates the data types, correlations and distributions of the attributes in the private dataset, and produces a data summary, adding noise to the distributions to preserve privacy. DataGenerator samples from the summary computed by DataDescriber and outputs synthetic data. ModelInspector shows an intuitive description of the data summary that was computed by DataDescriber, allowing the data owner to evaluate the accuracy of the summarization process and adjust any parameters, if desired. We describe DataSynthesizer and illustrate its use in an urban science context, where sharing sensitive, legally encumbered data between agencies and with outside collaborators is reported as the primary obstacle to data-driven governance.",
    keywords = "Data sharing, Differential privacy, Synthetic data",
    author = "Haoyue Ping and Julia Stoyanovich and Bill Howe",
    year = "2017",
    month = "6",
    day = "27",
    doi = "10.1145/3085504.3091117",
    language = "English (US)",
    volume = "Part F128636",
    booktitle = "SSDBM 2017",
    publisher = "Association for Computing Machinery",

    }

    TY - GEN

    T1 - DataSynthesizer

    T2 - Privacy-preserving synthetic datasets

    AU - Ping, Haoyue

    AU - Stoyanovich, Julia

    AU - Howe, Bill

    PY - 2017/6/27

    Y1 - 2017/6/27

    N2 - To facilitate collaboration over sensitive data, we present DataSynthesizer, a tool that takes a sensitive dataset as input and generates a structurally and statistically similar synthetic dataset with strong privacy guarantees. The data owners need not release their data, while potential collaborators can begin developing models and methods with some confidence that their results will work similarly on the real dataset. The distinguishing feature of DataSynthesizer is its usability: the data owner does not have to specify any parameters to start generating and sharing data safely and effectively. DataSynthesizer consists of three high-level modules: DataDescriber, DataGenerator, and ModelInspector. The first, DataDescriber, investigates the data types, correlations and distributions of the attributes in the private dataset, and produces a data summary, adding noise to the distributions to preserve privacy. DataGenerator samples from the summary computed by DataDescriber and outputs synthetic data. ModelInspector shows an intuitive description of the data summary that was computed by DataDescriber, allowing the data owner to evaluate the accuracy of the summarization process and adjust any parameters, if desired. We describe DataSynthesizer and illustrate its use in an urban science context, where sharing sensitive, legally encumbered data between agencies and with outside collaborators is reported as the primary obstacle to data-driven governance.

    KW - Data sharing

    KW - Differential privacy

    KW - Synthetic data

    UR - http://www.scopus.com/inward/record.url?scp=85025678631&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=85025678631&partnerID=8YFLogxK

    U2 - 10.1145/3085504.3091117

    DO - 10.1145/3085504.3091117

    M3 - Conference contribution

    AN - SCOPUS:85025678631

    VL - Part F128636

    BT - SSDBM 2017

    PB - Association for Computing Machinery

    ER -