Blink and it's done: Interactive queries on very large data

Sameer Agarwal, Aurojit Panda, Barzan Mozafari, Anand P. Iyer, Samuel Madden, Ion Stoica

Research output: Contribution to journal (Article)

Abstract

In this demonstration, we present BlinkDB, a massively parallel, sampling-based approximate query processing framework for running interactive queries on large volumes of data. The key observation in BlinkDB is that one can make reasonable decisions in the absence of perfect answers. BlinkDB extends the Hive/HDFS stack and can handle the same set of SPJA (selection, projection, join and aggregate) queries as supported by these systems. BlinkDB provides real-time answers along with statistical error guarantees, and can scale to petabytes of data and thousands of machines in a fault-tolerant manner. Our experiments using the TPC-H benchmark and on an anonymized real-world video content distribution workload from Conviva Inc. show that BlinkDB can execute a wide range of queries up to 150× faster than Hive on MapReduce and 10-150× faster than Shark (Hive on Spark) over tens of terabytes of data stored across 100 machines, all with an error of 2-10%.
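
The idea of answering an aggregate query from a sample and attaching a statistical error bound can be illustrated with a minimal sketch. This is not BlinkDB's implementation: the function name `approximate_mean`, the synthetic `session_times` data, the uniform sampling scheme, and the normal-approximation 95% interval are all assumptions chosen for illustration.

```python
import math
import random
import statistics


def approximate_mean(values, sample_fraction, z=1.96, seed=0):
    """Estimate the mean of `values` from a uniform random sample.

    Returns (estimate, margin); estimate +/- margin is an approximate
    95% confidence interval (for z=1.96) under the central limit theorem.
    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    # Use at least 2 rows so the sample standard deviation is defined.
    n = min(len(values), max(2, int(len(values) * sample_fraction)))
    sample = rng.sample(values, n)
    estimate = statistics.fmean(sample)
    std_error = statistics.stdev(sample) / math.sqrt(n)
    return estimate, z * std_error


# Synthetic data standing in for a large table column (assumption).
data_rng = random.Random(42)
session_times = [data_rng.expovariate(1 / 30.0) for _ in range(200_000)]

# Scan only 1% of the rows instead of the full column.
estimate, margin = approximate_mean(session_times, sample_fraction=0.01)
exact = statistics.fmean(session_times)

print(f"approximate mean: {estimate:.2f} +/- {margin:.2f}")
print(f"exact mean:       {exact:.2f}")
```

The sketch reads 1% of the rows and reports an interval rather than a single number, which is the trade-off the abstract describes: a bounded error in exchange for a much smaller scan.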

Original language: English (US)
Pages (from-to): 1902-1905
Number of pages: 4
Journal: Proceedings of the VLDB Endowment
Volume: 5
Issue number: 12
DOI: https://doi.org/10.14778/2367502.2367533
State: Published - Jan 1 2012

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
  • Computer Science (all)

Cite this

Agarwal, S., Panda, A., Mozafari, B., Iyer, A. P., Madden, S., & Stoica, I. (2012). Blink and it's done: Interactive queries on very large data. Proceedings of the VLDB Endowment, 5(12), 1902-1905. https://doi.org/10.14778/2367502.2367533

@article{2fa58090fde64bbb849d935ab64d867e,
title = "Blink and it's done: Interactive queries on very large data",
abstract = "In this demonstration, we present BlinkDB, a massively parallel, sampling-based approximate query processing framework for running interactive queries on large volumes of data. The key observation in BlinkDB is that one can make reasonable decisions in the absence of perfect answers. BlinkDB extends the Hive/HDFS stack and can handle the same set of SPJA (selection, projection, join and aggregate) queries as supported by these systems. BlinkDB provides real-time answers along with statistical error guarantees, and can scale to petabytes of data and thousands of machines in a fault-tolerant manner. Our experiments using the TPC-H benchmark and on an anonymized real-world video content distribution workload from Conviva Inc. show that BlinkDB can execute a wide range of queries up to 150× faster than Hive on MapReduce and 10-150× faster than Shark (Hive on Spark) over tens of terabytes of data stored across 100 machines, all with an error of 2 - 10{\%}.",
author = "Sameer Agarwal and Aurojit Panda and Barzan Mozafari and Iyer, {Anand P.} and Samuel Madden and Ion Stoica",
year = "2012",
month = "1",
day = "1",
doi = "10.14778/2367502.2367533",
language = "English (US)",
volume = "5",
pages = "1902--1905",
journal = "Proceedings of the VLDB Endowment",
issn = "2150-8097",
publisher = "Very Large Data Base Endowment Inc.",
number = "12",

}

TY - JOUR

T1 - Blink and it's done

T2 - Interactive queries on very large data

AU - Agarwal, Sameer

AU - Panda, Aurojit

AU - Mozafari, Barzan

AU - Iyer, Anand P.

AU - Madden, Samuel

AU - Stoica, Ion

PY - 2012/1/1

Y1 - 2012/1/1

N2 - In this demonstration, we present BlinkDB, a massively parallel, sampling-based approximate query processing framework for running interactive queries on large volumes of data. The key observation in BlinkDB is that one can make reasonable decisions in the absence of perfect answers. BlinkDB extends the Hive/HDFS stack and can handle the same set of SPJA (selection, projection, join and aggregate) queries as supported by these systems. BlinkDB provides real-time answers along with statistical error guarantees, and can scale to petabytes of data and thousands of machines in a fault-tolerant manner. Our experiments using the TPC-H benchmark and on an anonymized real-world video content distribution workload from Conviva Inc. show that BlinkDB can execute a wide range of queries up to 150× faster than Hive on MapReduce and 10-150× faster than Shark (Hive on Spark) over tens of terabytes of data stored across 100 machines, all with an error of 2 - 10%.

UR - http://www.scopus.com/inward/record.url?scp=84873191849&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84873191849&partnerID=8YFLogxK

U2 - 10.14778/2367502.2367533

DO - 10.14778/2367502.2367533

M3 - Article

AN - SCOPUS:84873191849

VL - 5

SP - 1902

EP - 1905

JO - Proceedings of the VLDB Endowment

JF - Proceedings of the VLDB Endowment

SN - 2150-8097

IS - 12

ER -