Public Repository

Last pushed: 2 years ago
Short Description
A lambda architecture based on kafka flume storm/trident hadoop and druid
Full Description

A lambda architecture that is all based on open technologies, including kafka, flume, storm/trident, hadoop, druid and sql4d.

The implementation of this particular lambda architecture is an attempt to avoid the issues of what Jay Kreps of Kafka described as "coding your application twice" and "the operational burden of running and debugging two systems", while at the same time to enable the system at its entirety to provide the promises of real-time, distributed, linearly scalable, big-data, and other features that are associated with the concept of lambda architecture, which is to minimize the constraints imposed by the so-called CAP theorem.

To achieve those goals, with this implementation, kafka and druid play a central role.

First of all, kafka serves as a data sink, where it maintains several topic queues, one for real-time ETL storm/trident, one for druid ingestion, and one, or optionally several, through the flume channels, for the hadoop storage, from which druid will then ingest from time to time to index all that are stored in hadoop. Druid will also ingest the real-time stream from kafka, and merges the data, very efficiently with its indexing, with those in the hadoop storage. Finally, the sql4d uses sql-like queries to retrieve the merged results from druid's indexing.

These all are handled supposedly in a very prompt manner, in that kafka is generally claimed to be able to handle millions of records per second, and druid is to handle billions per second. In addition, the ETL performed by storm/trident are also very responsive. This real-time streaming of kafka-to-storm/trident-back-to-kafka-then-to-druid makes a very responsive pipeline. Benchmarks, however, are needed to support this overall performance/latency claims of the implementation.

Basically the implementation is like the framework, such that there are, as it were, three layers: hot, warm and cold. With the lambda architecture, the kafka, storm and druid pipeline constitutes the hot streaming for real-time, the kafka queues are the warm layer, that can retain in-memory or at local persistence for as long as weeks or months, and the hadoop storage is made up of the cold layer that contains all the archived and historical data. It is druid that efficiently indexes and merges all of them and provides a very responsive access at query times.

Docker Pull Command