Apache Druid Explained

Apache Druid[1]
Author: Metamarkets
Developer: Apache Software Foundation
Programming Language: Java
Operating System: Cross-platform
License: Apache License 2.0

Druid is a column-oriented, open-source, distributed data store written in Java. Druid is designed to quickly ingest massive quantities of event data, and provide low-latency queries on top of the data.[2] The name Druid comes from the shapeshifting Druid class in many role-playing games, to reflect that the architecture of the system can shift to solve different types of data problems.

Druid is commonly used in business intelligence and OLAP applications to analyze high volumes of real-time and historical data.[3] Druid is used in production by technology companies such as Alibaba, Airbnb, Cisco,[4] eBay,[5] Lyft, Netflix,[6] PayPal, Pinterest, Reddit,[7] Twitter,[8] Walmart,[9] the Wikimedia Foundation,[10] and Yahoo.[11]

History

Druid was started in 2011 by Eric Tschetter, Fangjin Yang, Gian Merlino and Vadim Ogievetsky[12] to power the analytics product of Metamarkets. The project was open-sourced under the GPL in October 2012,[13][14][15] and relicensed under the Apache License in February 2015.[16][17]

Architecture

When fully deployed, Druid runs as a cluster of specialized processes (called nodes in Druid) that form a fault-tolerant architecture[18] in which data is stored redundantly and there is no single point of failure.[19] The cluster relies on external dependencies for coordination (Apache ZooKeeper), metadata storage (e.g. MySQL, PostgreSQL, or Derby), and deep storage (e.g. HDFS or Amazon S3) for permanent data backup.

Query management

Client queries first hit broker nodes, which forward them to the appropriate data nodes (either historical or real-time). Since Druid segments may be partitioned, an incoming query can require data from multiple segments and partitions (or shards) stored on different nodes in the cluster. Brokers are able to learn which nodes have the required data, and also merge partial results before returning the aggregated result.
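
The routing and merging described above can be illustrated with a toy model. The sketch below is not Druid's broker implementation: the segment names, node names, data, and the functions `query_node` and `broker_query` are all hypothetical, and the "cluster" is a pair of in-memory dictionaries. It only shows the general scatter-gather pattern of grouping segments by the node that holds them, collecting a partial aggregate from each node, and merging the partials.

```python
from collections import defaultdict

# Hypothetical cluster state: which data node serves which segment.
# All names here are invented for illustration.
SEGMENT_LOCATIONS = {
    "events_2024-01-01_p0": "historical-1",
    "events_2024-01-01_p1": "historical-2",
    "events_2024-01-02_p0": "realtime-1",
}

# Hypothetical per-node data: segment -> rows of (country, clicks).
NODE_DATA = {
    "historical-1": {"events_2024-01-01_p0": [("US", 3), ("DE", 1)]},
    "historical-2": {"events_2024-01-01_p1": [("US", 2), ("FR", 4)]},
    "realtime-1": {"events_2024-01-02_p0": [("DE", 5)]},
}


def query_node(node, segments):
    """Data node: partial SUM(clicks) GROUP BY country over its own segments."""
    partial = defaultdict(int)
    for segment in segments:
        for country, clicks in NODE_DATA[node][segment]:
            partial[country] += clicks
    return dict(partial)


def broker_query(segments):
    """Broker: group segments by owning node, fan out, then merge the partials."""
    by_node = defaultdict(list)
    for segment in segments:
        by_node[SEGMENT_LOCATIONS[segment]].append(segment)

    merged = defaultdict(int)
    for node, node_segments in by_node.items():
        for country, clicks in query_node(node, node_segments).items():
            merged[country] += clicks
    return dict(merged)


result = broker_query(list(SEGMENT_LOCATIONS))
print(result)
```

Summation is used here because partial sums can be merged by adding them; an aggregate such as an exact distinct count would need a different merge step.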

Cluster management

Operations relating to data management in historical nodes are overseen by coordinator nodes. Apache ZooKeeper is used to register all nodes, manage certain aspects of internode communications, and provide for leader elections.
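
As a rough illustration of why a coordinating process matters for redundancy, the following sketch assigns each segment to more than one historical node and then checks that every segment stays available if any single node fails. It is a simplified assumption, not Druid's actual assignment or balancing logic: the round-robin placement, the function names `assign_segments` and `survives_any_single_failure`, and the node and segment names are all hypothetical.

```python
def assign_segments(segments, nodes, replicas=2):
    """Place each segment on `replicas` distinct nodes, round-robin (toy policy)."""
    if replicas > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    assignment = {node: set() for node in nodes}
    for i, segment in enumerate(segments):
        for r in range(replicas):
            # Consecutive ring positions are distinct because replicas <= len(nodes).
            assignment[nodes[(i + r) % len(nodes)]].add(segment)
    return assignment


def survives_any_single_failure(assignment, segments):
    """Return True if every segment is still served after losing any one node."""
    for failed in assignment:
        available = set()
        for node, held in assignment.items():
            if node != failed:
                available |= held
        if not set(segments) <= available:
            return False
    return True


segments = ["seg-a", "seg-b", "seg-c", "seg-d"]
nodes = ["historical-1", "historical-2", "historical-3"]

replicated = assign_segments(segments, nodes, replicas=2)
unreplicated = assign_segments(segments, nodes, replicas=1)

print(survives_any_single_failure(replicated, segments))
print(survives_any_single_failure(unreplicated, segments))
```

With two replicas per segment the check passes; with a single replica the loss of any node that holds a segment makes that segment unavailable.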

Features

Performance

In 2019, researchers compared the performance of Hive, Presto, and Druid using a denormalized Star Schema Benchmark based on the TPC-H standard. Druid was tested in two configurations: “Druid Best”, which uses tables with hashed partitions, and “Druid Suboptimal”, which does not.[20]

Tests were conducted by running the benchmark's 13 queries at Scale Factor 30 (a 30 GB database), Scale Factor 100 (a 100 GB database), and Scale Factor 300 (a 300 GB database).

Scale Factor   Hive   Presto   Druid Best   Druid Suboptimal
30             256s   33s      2.09s        3.21s
100            424s   90s      6.12s        8.08s
300            982s   452s     7.60s        20.02s

Even in the Druid Suboptimal configuration, Druid's query time was about 98% lower than Hive's and at least 90% lower than Presto's at every scale factor.
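
The percentages quoted above can be reproduced from the table with a few lines of arithmetic. The sketch below is not taken from the cited study; it assumes that the comparison refers to the relative reduction in execution time, and the names `TIMINGS` and `reduction_pct` are introduced only for this example.

```python
# Execution times in seconds, copied from the table above.
TIMINGS = {
    30: {"Hive": 256.0, "Presto": 33.0, "Druid Best": 2.09, "Druid Suboptimal": 3.21},
    100: {"Hive": 424.0, "Presto": 90.0, "Druid Best": 6.12, "Druid Suboptimal": 8.08},
    300: {"Hive": 982.0, "Presto": 452.0, "Druid Best": 7.60, "Druid Suboptimal": 20.02},
}


def reduction_pct(baseline, candidate):
    """Percentage by which `candidate` time is lower than `baseline` time."""
    return (1 - candidate / baseline) * 100


for scale_factor, times in TIMINGS.items():
    druid = times["Druid Suboptimal"]
    vs_hive = reduction_pct(times["Hive"], druid)
    vs_presto = reduction_pct(times["Presto"], druid)
    print(f"SF{scale_factor}: {vs_hive:.2f}% less time than Hive, "
          f"{vs_presto:.2f}% less time than Presto")
```

Under this reading, the reduction relative to Presto stays above 90% at every scale factor, while the reduction relative to Hive at Scale Factor 300 comes out at roughly 97.96%, which reaches 98% only after rounding.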

Notes and References

  1. "Apache Druid at GitHub". github.com. 4 May 2021.
  2. Hemsoth, Nicole. "Druid Summons Strength in Real-Time". Datanami. 8 November 2012. Retrieved 2014-02-07. Archived 2013-02-27: https://web.archive.org/web/20130227173609/http://www.datanami.com/datanami/2012-11-08/druid_summons_strength_in_real-time.html
  3. "Druid Powered by Druid". druid.apache.org. 2016-06-29.
  4. Butler, Brandon. "Under the hood of Cisco's Tetration Analytics platform". 20 June 2016. Retrieved 2016-06-23. Archived 2024-04-26: https://web.archive.org/web/20240426190409/https://www.networkworld.com/article/952579/under-the-hood-of-cisco-s-tetration-analytics-platform.html
  5. "Druid at Pulsar - ebay的专栏 - 博客频道 - CSDN.NET". blog.csdn.net. 2016-06-23.
  6. "The Netflix Tech Blog: Announcing Suro: Backbone of Netflix's Data Pipeline". techblog.netflix.com. 2016-06-23.
  7. "Scaling Reporting at Reddit - Upvoted". www.redditinc.com. 26 February 2021. Retrieved 2022-09-13.
  8. "Interactive Analytics at MoPub: Querying Terabytes of Data in Seconds". blog.twitter.com. 2020-01-29.
  9. Nayak, Amaresh. "Event Stream Analytics at Walmart with Druid". Medium. 2018-02-23. Retrieved 2020-01-29.
  10. "Conferences - O'Reilly Media".
  11. "Complementing Hadoop at Yahoo: Interactive Analytics with Druid". 2016-06-23.
  12. "Druid: A Real-time Analytical Data Store".
  13. Tschetter, Eric. "Introducing Druid". druid.apache.org. 24 October 2012. Retrieved 2019-06-12. Archived 2022-02-08: https://web.archive.org/web/20220208191443/https://druid.apache.org/blog/2012/10/24/introducing-druid.html
  14. Higginbotham, Stacey. "Metamarkets open sources Druid, its in-memory database". GigaOM. 24 October 2012. Retrieved 2014-02-07. Archived 2021-09-18: https://web.archive.org/web/20210918034842/http://gigaom.com/2012/10/24/metamarkets-open-sources-druid-its-in-memory-database/
  15. "Metamarkets Open Sources Druid, Streaming Real-Time Data Store". Yahoo News. 2012-10-24. Retrieved 2023-07-24.
  16. Harris, Derrick. "The Druid real-time database moves to an Apache license". 2015-02-20. Retrieved 2015-08-04. Archived 2015-08-22: https://web.archive.org/web/20150822044850/https://gigaom.com/2015/02/20/the-druid-real-time-database-moves-to-an-apache-license/
  17. "Druid Gets Open Source-ier Under the Apache License". 2015-08-04.
  18. "Druid Project Documentation".
  19. Yang, Fangjin; Tschetter, Eric; Léauté, Xavier; Ray, Nelson; Merlino, Gian; Ganguli, Deep. "Druid: A Real-time Analytical Data Store". Metamarkets. Retrieved 6 February 2014.
  20. Correia, José; Costa, Carlos; Santos, Maribel Yasmina. "Challenging SQL-on-Hadoop Performance with Apache Druid". In Abramowicz, Witold; Corchuelo, Rafael (eds.). Business Information Systems. Lecture Notes in Business Information Processing, vol. 353. Cham: Springer International Publishing. 2019. pp. 149–161. doi:10.1007/978-3-030-20485-3_12. ISBN 978-3-030-20485-3. hdl:1822/66785. S2CID 190005302. https://link.springer.com/chapter/10.1007/978-3-030-20485-3_12