StatiX: Making XML count

Juliana Freire, Jayant R. Haritsa, Maya Ramanath, Prasan Roy, Jérôme Siméon

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

The availability of summary data for XML documents has many applications, from providing users with quick feedback about their queries, to cost-based storage design and query optimization. StatiX is a novel XML Schema-aware statistics framework that exploits the structure derived by regular expressions (which define elements in an XML Schema) to pinpoint places in the schema that are likely sources of structural skew. As we discuss below, this information can be used to build concise, yet accurate, statistical summaries for XML data. StatiX leverages standard XML technology for gathering statistics, notably XML Schema validators, and it uses histograms to summarize both the structure and values in an XML document. In this paper we describe the StatiX system. We develop algorithms that decompose schemas to obtain statistics at different granularities and discuss how statistics can be gathered as documents are validated. We also present an experimental evaluation which demonstrates the accuracy and scalability of our approach and show an application of these statistics to cost-based XML storage design.

Original language: English (US)
Title of host publication: Proceedings of the ACM SIGMOD International Conference on Management of Data
Editors: M.F.B. Moon, A. Ailamaki
Pages: 181-191
Number of pages: 11
State: Published - 2002
Event: ACM SIGMOD 2002 Proceedings of the ACM SIGMOD International Conference on Management of Data - Madison, WI, United States
Duration: Jun 3 2002 → Jun 6 2002

Other

Other: ACM SIGMOD 2002 Proceedings of the ACM SIGMOD International Conference on Management of Data
Country: United States
City: Madison, WI
Period: 6/3/02 → 6/6/02

Fingerprint

XML
Statistics
Scalability
Costs
Availability
Feedback

ASJC Scopus subject areas

  • Computer Science (all)

Cite this

Freire, J., Haritsa, J. R., Ramanath, M., Roy, P., & Siméon, J. (2002). StatiX: Making XML count. In M. F. B. Moon, & A. Ailamaki (Eds.), Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 181-191)

StatiX: Making XML count. / Freire, Juliana; Haritsa, Jayant R.; Ramanath, Maya; Roy, Prasan; Siméon, Jérôme.

Proceedings of the ACM SIGMOD International Conference on Management of Data. ed. / M.F.B. Moon; A. Ailamaki. 2002. p. 181-191.


Freire, J, Haritsa, JR, Ramanath, M, Roy, P & Siméon, J 2002, StatiX: Making XML count. in MFB Moon & A Ailamaki (eds), Proceedings of the ACM SIGMOD International Conference on Management of Data. pp. 181-191, ACM SIGMOD 2002 Proceedings of the ACM SIGMOD International Conference on Management of Data, Madison, WI, United States, 6/3/02.
Freire J, Haritsa JR, Ramanath M, Roy P, Siméon J. StatiX: Making XML count. In Moon MFB, Ailamaki A, editors, Proceedings of the ACM SIGMOD International Conference on Management of Data. 2002. p. 181-191
Freire, Juliana; Haritsa, Jayant R.; Ramanath, Maya; Roy, Prasan; Siméon, Jérôme. / StatiX: Making XML count. Proceedings of the ACM SIGMOD International Conference on Management of Data. editor / M.F.B. Moon; A. Ailamaki. 2002. pp. 181-191
@inproceedings{eb587d8512844b64af63706e3b1af62e,
title = "StatiX: Making XML count",
abstract = "The availability of summary data for XML documents has many applications, from providing users with quick feedback about their queries, to cost-based storage design and query optimization. StatiX is a novel XML Schema-aware statistics framework that exploits the structure derived by regular expressions (which define elements in an XML Schema) to pinpoint places in the schema that are likely sources of structural skew. As we discuss below, this information can be used to build concise, yet accurate, statistical summaries for XML data. StatiX leverages standard XML technology for gathering statistics, notably XML Schema validators, and it uses histograms to summarize both the structure and values in an XML document. In this paper we describe the StatiX system. We develop algorithms that decompose schemas to obtain statistics at different granularities and discuss how statistics can be gathered as documents are validated. We also present an experimental evaluation which demonstrates the accuracy and scalability of our approach and show an application of these statistics to cost-based XML storage design.",
author = "Juliana Freire and Haritsa, {Jayant R.} and Maya Ramanath and Prasan Roy and J{\'e}r{\^o}me Sim{\'e}on",
year = "2002",
language = "English (US)",
pages = "181--191",
editor = "M.F.B. Moon and A. Ailamaki",
booktitle = "Proceedings of the ACM SIGMOD International Conference on Management of Data",

}

TY - GEN

T1 - StatiX

T2 - Making XML count

AU - Freire, Juliana

AU - Haritsa, Jayant R.

AU - Ramanath, Maya

AU - Roy, Prasan

AU - Siméon, Jérôme

PY - 2002

Y1 - 2002

AB - The availability of summary data for XML documents has many applications, from providing users with quick feedback about their queries, to cost-based storage design and query optimization. StatiX is a novel XML Schema-aware statistics framework that exploits the structure derived by regular expressions (which define elements in an XML Schema) to pinpoint places in the schema that are likely sources of structural skew. As we discuss below, this information can be used to build concise, yet accurate, statistical summaries for XML data. StatiX leverages standard XML technology for gathering statistics, notably XML Schema validators, and it uses histograms to summarize both the structure and values in an XML document. In this paper we describe the StatiX system. We develop algorithms that decompose schemas to obtain statistics at different granularities and discuss how statistics can be gathered as documents are validated. We also present an experimental evaluation which demonstrates the accuracy and scalability of our approach and show an application of these statistics to cost-based XML storage design.

UR - http://www.scopus.com/inward/record.url?scp=0036373389&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0036373389&partnerID=8YFLogxK

M3 - Conference contribution

SP - 181

EP - 191

BT - Proceedings of the ACM SIGMOD International Conference on Management of Data

A2 - Moon, M.F.B.

A2 - Ailamaki, A.

ER -