Online aggregation is a technique for improving the interactive behavior of database systems processing expensive analytical queries. Almost all database operations are performed in batch mode, i.e. the user issues a query and waits till the database has finished processing the entire query. On the contrary, using online aggregation, the user gets estimates of an aggregate query in an online fashion as soon as the query is issued. For example, if the final answer is 1000, after k seconds, the user gets the estimates in form of a confidence interval like [990, 1020] with 95% probability. This confidence keeps on shrinking as the system gets more and more samples.
Online aggregation was proposed in 1997 by Hellerstein, Haas and Wang[1] for group-by aggregation queries over a single table. Later, the authors showed how to evaluate joins in an online fashion.[2] In 2007, Jermaine et al. designed and implemented a prototype database system called Database-Online (or DBO) that computes group-by aggregate query over multiple tables in an online and more importantly in a scalable fashion.[3] All the approaches for online aggregation use random sampling, which is non-trivial in a distributed environment due to inspection paradox of renewal reward theory. In 2011, Pansare et al. proposed a Bayesian model to deal with the inspection paradox and implemented online aggregation for a MapReduce-like environment.[4]