
What is Apache Druid? A Complete Guide

Chandan Gaur | 21 May 2024

Introduction to Apache Druid

Apache Druid is a real-time analytics database designed for rapid analytics on large datasets. It is most often used to power use cases where real-time ingestion, high uptime, and fast query performance are needed. Druid can analyze billions of rows not only in batch but also in real time. It offers integrations with many technologies, such as Apache Kafka, cloud storage, Amazon S3, Hive, HDFS, DataSketches, and Redis. Druid also follows an "immutable past, append-only future" model: past events happen once and never change, so they are immutable, while new events are only ever appended. This gives users fast, deep exploration of large-scale transactional data.

Characteristics of Apache Druid

Some of the exciting characteristics are:

  • Cloud-native, making horizontal scaling easy
  • Supports SQL for analyzing data
  • REST API for querying and uploading data (see the sketch after this list)
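As a rough illustration of the SQL and REST API characteristics above, the sketch below posts a SQL query to Druid's HTTP SQL endpoint. It is a minimal sketch, assuming a quickstart-style deployment with a Router on localhost:8888 and a hypothetical `wikipedia` datasource with a `channel` column.

```python
# Minimal sketch: querying Druid over its REST SQL endpoint.
# The host, the "wikipedia" datasource, and the "channel" column are illustrative assumptions.
import requests

DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql/"  # Router's SQL endpoint in a quickstart setup

query = """
SELECT channel, COUNT(*) AS edits
FROM wikipedia
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
GROUP BY channel
ORDER BY edits DESC
LIMIT 10
"""

response = requests.post(DRUID_SQL_URL, json={"query": query})
response.raise_for_status()
for row in response.json():  # Druid returns a JSON array of result rows
    print(row)
```

Note that the `WHERE __time ...` clause is the time filter discussed later under Timestamp Partitioning.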

What are its use cases?

Some of the common use cases of Druid are:
  • Clickstream analytics
  • Server metrics storage
  • OLAP/Business intelligence
  • Digital Marketing/Advertising analytics
  • Network Telemetry analytics
  • Supply chain analytics
  • Application performance metrics

What are its key features?

Druid’s core architecture combines ideas from data warehouses, log search systems, and time-series databases.

Columnar Storage Format

Druid uses column-oriented storage, so it loads only the columns needed for a particular query. This enables fast scans and aggregations.

Parallel Processing

Druid can process a query in parallel across the entire cluster, an approach also termed massively parallel processing.

Scalable Distributed System

Druid is typically deployed in clusters of tens to hundreds of servers, offering ingest rates of millions of records per second, query latencies from sub-second to a few seconds, and retention of trillions of records.

Real-time or Batch Ingestion

Druid can ingest data either in real time (ingested data can be queried immediately) or in batches.

Cloud-Native

Druid has a fault-tolerant architecture that won’t lose data. Once Druid ingests data, a copy is safely stored in deep storage (cloud object storage such as Amazon S3, HDFS, and more). Data can be recovered from deep storage even if every Druid server fails, and replication ensures that queries are still possible while the system recovers.

Indexing

Druid uses compressed bitmap indexes (Concise and Roaring) that enable fast filtering.

Timestamp Partitioning

Every row in Druid must have a timestamp column, because data is always partitioned by time and every query includes a time filter.

Easy Integration with Existing Pipelines

Druid can natively stream data from message buses such as Kafka and Kinesis, and it can load batch files from data lakes such as HDFS and Amazon S3.
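To make the streaming integration concrete, here is a minimal sketch of submitting a Kafka ingestion supervisor spec to Druid's indexing API. The topic, broker address, datasource, and column names are illustrative assumptions, and the URL assumes a quickstart-style Router that proxies indexer requests to the Overlord.

```python
# Minimal sketch: submitting a Kafka ingestion supervisor spec.
# Topic, broker address, datasource, and column names are illustrative assumptions.
import requests

SUPERVISOR_URL = "http://localhost:8888/druid/indexer/v1/supervisor"  # Router proxying to the Overlord

supervisor_spec = {
    "type": "kafka",
    "spec": {
        "dataSchema": {
            "dataSource": "clickstream",
            "timestampSpec": {"column": "timestamp", "format": "iso"},
            "dimensionsSpec": {"dimensions": ["user_id", "page", "country"]},
            "granularitySpec": {"segmentGranularity": "hour", "queryGranularity": "none"},
        },
        "ioConfig": {
            "topic": "clickstream-events",
            "inputFormat": {"type": "json"},
            "consumerProperties": {"bootstrap.servers": "localhost:9092"},
        },
        "tuningConfig": {"type": "kafka"},
    },
}

resp = requests.post(SUPERVISOR_URL, json=supervisor_spec)
resp.raise_for_status()
print(resp.json())  # the Overlord replies with the supervisor id
```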

General Architecture of Apache Druid

Druid is mainly composed of the following processes:

  • Coordinator – This process manages data availability on the cluster.
  • Overlord – This process controls the assignment of data ingestion workloads.
  • Broker – This process handles queries from external clients.
  • Historical – This process stores data that is queryable.
  • MiddleManager – This process is responsible for ingesting data.
  • Router – This process routes requests to Brokers, Coordinators, and Overlords. Router processes are optional.

The processes described above are organized into three types of servers: Master, Query, and Data.

Master

It runs the Coordinator and Overlord processes and manages data ingestion and availability: the Master server is responsible for assigning ingestion jobs and coordinating the availability of data on the Data servers.

Query

It runs Broker and optional Router processes. The Query server handles queries from external clients, providing the endpoints that users and client applications interact with and routing queries to Data servers or other Query servers.

Data

It runs MiddleManager and Historical processes, executing ingestion jobs and storing queryable data. Besides these three server types and six processes, Druid also requires Metadata storage and Deep storage.
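As a rough way to see these processes in action, the sketch below polls each process's /status/health endpoint on a single-machine deployment. The hosts and ports are the usual defaults and are assumptions about your setup.

```python
# Minimal sketch: checking Druid processes via their /status/health endpoints.
# Hosts and ports assume a default single-machine deployment.
import requests

PROCESSES = {
    "coordinator": "http://localhost:8081",
    "broker": "http://localhost:8082",
    "historical": "http://localhost:8083",
    "overlord": "http://localhost:8090",
    "middlemanager": "http://localhost:8091",
    "router": "http://localhost:8888",
}

for name, base_url in PROCESSES.items():
    try:
        healthy = requests.get(f"{base_url}/status/health", timeout=5).json()
    except requests.RequestException:
        healthy = False
    print(f"{name}: {'up' if healthy else 'down'}")
```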

Metadata Storage

It stores the system’s metadata (audit records, datasources, schemas, and so on). Apache Derby is the default metadata store and is fine for experimentation, but it is not suitable for production because it does not support a multi-node cluster with high availability. For production, MySQL or PostgreSQL is the better choice. The metadata store holds metadata that is essential for a Druid cluster to work. MySQL as the metadata storage database is typically chosen for:
  • Long-term flexibility
  • Scaling on a budget
  • Handling large datasets
  • High read speed
PostgreSQL as the metadata storage database is typically chosen for:
  • Complex database designs
  • Customized stored procedures
  • Diverse indexing techniques
  • A variety of replication methods
  • High read and write speed

Deep Storage

Apache Druid stores a separate copy of all ingested data in deep storage, which makes it fault-tolerant. Deep storage technologies include cloud object storage such as Amazon S3 and Google Cloud Storage, HDFS, and more.

Data Ingestion in Druid

Data in Druid is organized into segments that generally contain up to a few million rows each. Loading data into Druid is known as ingestion or indexing. Druid fully supports both batch and streaming ingestion. Some of the sources supported by Druid are Kinesis, cloud storage, Apache Kafka, and local storage. Druid requires some structure in the data it ingests: in general, data should consist of a timestamp, dimensions, and metrics.
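The sketch below shows how that structure (timestamp, dimensions, metrics) maps onto a native batch ingestion task. It is a minimal sketch: the file location, datasource name, and columns are illustrative assumptions, and the URL again assumes a quickstart-style Router proxying to the Overlord.

```python
# Minimal sketch: a native batch ("index_parallel") ingestion task.
# File location, datasource, and column names are illustrative assumptions.
import requests

TASK_URL = "http://localhost:8888/druid/indexer/v1/task"  # Router proxying to the Overlord

ingestion_spec = {
    "type": "index_parallel",
    "spec": {
        "dataSchema": {
            "dataSource": "server_metrics",
            "timestampSpec": {"column": "timestamp", "format": "iso"},
            "dimensionsSpec": {"dimensions": ["host", "service"]},
            "metricsSpec": [
                {"type": "count", "name": "events"},
                {"type": "doubleSum", "name": "cpu_sum", "fieldName": "cpu"},
            ],
            "granularitySpec": {"segmentGranularity": "day", "queryGranularity": "minute"},
        },
        "ioConfig": {
            "type": "index_parallel",
            "inputSource": {"type": "local", "baseDir": "/tmp/metrics", "filter": "*.json"},
            "inputFormat": {"type": "json"},
        },
        "tuningConfig": {"type": "index_parallel"},
    },
}

resp = requests.post(TASK_URL, json=ingestion_spec)
resp.raise_for_status()
print(resp.json())  # the Overlord replies with the task id, which can be polled for status
```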

Zookeeper for Apache Druid

Druid uses Apache Zookeeper to integrate all of its services. Users can rely on the Zookeeper bundled with Druid for experiments, but a dedicated Zookeeper installation is needed for production. A Druid cluster can only be as stable as its Zookeeper ensemble: Zookeeper is responsible for most of the coordination that keeps the Druid cluster functioning, since Druid nodes do not talk to each other directly for these operations.

Duties of a Zookeeper

Zookeeper is responsible for the following operations:
  • Segment “publishing” protocol from Historical
  • Coordinator leader election
  • Overlord and MiddleManager task management
  • Segment load/drop protocol between Coordinator and Historical
  • Overlord leader election

How to Keep a Zookeeper Stable?

For maximum Zookeeper stability, the user has to follow the following practices:
  • There should be a Zookeeper dedicated to Druid; avoid sharing it with any other products/applications.
  • Maintain an odd number of Zookeepers for increased reliability.
  • For highly available Zookeeper, 3-5 Zookeeper nodes are recommended. Users can either install Zookeeper on their own system or run 3 or 5 master servers and configure Zookeeper on them appropriately.
  • Co-locate Zookeeper with the Master servers rather than with Data or Query servers, because Query and Data servers do far more intensive work than the Master server (Coordinator and Overlord).
  • To fully achieve high availability, never put Zookeeper behind a load balancer.
If Zookeeper goes down, the cluster will keep operating, but in a degraded state: no new data segments can be added, and the cluster cannot effectively react to the loss of a node.

How to monitor Apache Druid?

Users can monitor Druid by using the metrics it generates. Druid produces metrics related to queries, coordination, and ingestion. Each metric is emitted as a JSON object, either to a runtime log file or over HTTP (to a service like Kafka). Metric emission is disabled by default.

Fields of Metrics Emitted

Metrics emitted by Druid share a common set of fields (see the sketch after this list):
  • Timestamp – the time at which the metric was created
  • Metric – the name given to the metric
  • Service – the name of the service that emitted the metric
  • Host – the name of the host that emitted the metric
  • Value – the numeric value associated with the emitted metric
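Because every event shares these fields, consuming them is straightforward. Below is a minimal sketch that reads metric events from a log file, assuming the events have already been extracted as one JSON object per line; the file path is an illustrative assumption.

```python
# Minimal sketch: reading Druid metric events, assuming one JSON object per line.
# The file path is an illustrative assumption.
import json

with open("/var/log/druid/metrics.log") as fh:
    for line in fh:
        event = json.loads(line)
        print(event["timestamp"], event["service"], event["metric"], event["value"])
```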

Briefing About Available Metrics

Emitted metrics may have dimensions beyond those listed. The emission period is 1 minute by default and can be changed with `druid.monitoring.emissionPeriod`. The available metrics are:
  • Query Metrics, mainly categorized as Broker, Historical, Real-time, Jetty and Cache
  • SQL Metrics (Only if SQL is enabled)
  • Ingestion Metrics (Kafka Indexing Service)
  • Real-time Metrics (Real-time process, available if Real-time Metrics Monitor is included)
  • Indexing Service
  • Coordination
  • JVM (Available if JVM Monitor module is included)
  • Event Receiver Firehose (available if Event Receiver Firehose Monitor module is included)
  • Sys (Available if Sys Monitor module is included)
  • General Health, mainly Historical

Conclusion

Apache Druid is among the best options on the market for analyzing data across clusters and providing quick insight into all the data it processes. With Zookeeper at its side, working with Druid becomes much easier for anyone operating in the DataOps space, and there are many client libraries for interacting with it. To validate that a service is running, one can use the `jps` command: since Druid nodes are Java processes, they show up in the output of `$ jps -m`. With monitoring this easy and an architecture this capable, Druid really is the cherry on top for a DataOps engineer.