As Kafka clusters grow over time, so does the complexity of managing their performance. Topics
increase, brokers are added, and partition traffic becomes uneven — all of which lead to imbalance across the cluster.
That’s where Kafka Cruise Control comes in — an open-source tool by LinkedIn that
automatically monitors, detects, and rebalances workloads across brokers.
When you initially deploy Kafka, partition distribution across brokers seems balanced. But over
time:
Some topics handle heavy traffic (e.g. logs, telemetry).
Result?
Latency increases
*Hive is a SQL engine that converts SQL queries into MapReduce (or similar) jobs; Trino and Impala are also SQL engines.
What Is a Broker?
A broker is a Kafka server that stores and manages message data. It handles reads/writes from
producers and consumers. A Kafka cluster typically consists of multiple brokers working
together.
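For illustration, here is a minimal sketch (assuming a broker reachable at localhost:9092 and the kafka-python client; topic name is made up) of a producer and a consumer talking to a broker:

```python
# Minimal sketch: produce and consume one message through a Kafka broker.
# Assumes a broker at localhost:9092 and the kafka-python package (pip install kafka-python).
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("demo-topic", b"hello from a producer")  # the broker appends this to the topic's log
producer.flush()

consumer = KafkaConsumer(
    "demo-topic",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",   # read from the beginning of the partition
    consumer_timeout_ms=5000,       # stop iterating when no new messages arrive
)
for message in consumer:
    print(message.topic, message.partition, message.offset, message.value)
```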
Cruise Control continuously monitors your Kafka cluster’s resource usage and intelligently
redistributes partition workloads based on actual traffic, not just partition count.
It can detect:
Broker failures
Replica skew
And then it automatically rebalances partitions across brokers to correct the imbalance.
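As a rough sketch (assuming Cruise Control's REST API is exposed on port 9090 of the Cruise Control host; adjust for your deployment), you can query cluster state and ask for a dry-run rebalance proposal:

```python
# Sketch: talk to Kafka Cruise Control over its REST API.
# Host/port are assumptions; endpoints follow the open-source Cruise Control REST API.
import requests

BASE = "http://cruise-control-host:9090/kafkacruisecontrol"

# Current state of the cluster (load monitor, anomaly detector, executor)
state = requests.get(f"{BASE}/state", params={"json": "true"})
print(state.json())

# Ask for a rebalance proposal without executing it (dry run)
proposal = requests.post(f"{BASE}/rebalance", params={"dryrun": "true", "json": "true"})
print(proposal.text)
```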
In a CDP Private Cloud Base environment, Data Analytics Studio (DAS) provides a web-based
interface for interacting with Hive. It allows users to write and run HiveQL queries, view
execution plans, explore metadata, and analyze historical query performance.
DAS is particularly useful for teams that work with large-scale Hive workloads and need visibility
into performance and resource usage. It integrates closely with Hive, Tez, HDFS, and other
components in the Hadoop ecosystem.
The DAS system in CDP is built around several components working in a layered architecture. These layers interact as queries are submitted, executed, and monitored.
1. DAS Webapp
Users can write Hive queries, explore metadata, track query history, and analyze diagnostics.
The webapp itself does not execute queries; it delegates that to backend services.
2. Query Compilation - Apache Hive
Hive compiles each query into an execution plan represented as a DAG (Directed Acyclic Graph), which is passed to Tez for execution.
3. Execution Layer - Apache Tez and YARN
Apache Tez takes the DAG from Hive and runs it efficiently using parallel execution.
YARN allocates the necessary containers and resources for the Tez jobs to run.
Tez reads data from and writes results back to HDFS as needed.
4. DAS Event Processor
The DAS Event Processor captures execution events generated by Hive and Tez.
It also replicates metadata from the Hive Metastore (such as table and column definitions).
All collected data is processed and stored into a PostgreSQL database used by the DAS
Webapp.
5. Storage - HDFS and PostgreSQL
HDFS stores the Hive-managed datasets, including raw data, intermediate files, and result outputs.
The DAS PostgreSQL database stores:
o Query history
o Execution diagnostics
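For example, here is a minimal sketch (assuming HiveServer2 is reachable at hive-host:10000 and the PyHive package is installed; table and column names are made up) of submitting a query that Hive compiles into a Tez DAG — DAS would then show its history and diagnostics:

```python
# Sketch: submit a HiveQL query that Hive compiles into a Tez DAG run on YARN.
# Assumes HiveServer2 at hive-host:10000 and the PyHive package (pip install pyhive).
from pyhive import hive

conn = hive.Connection(host="hive-host", port=10000, username="analyst", database="default")
cursor = conn.cursor()

# EXPLAIN shows the plan (the DAG of Tez vertices) without running the query
cursor.execute("EXPLAIN SELECT city, COUNT(*) FROM customers GROUP BY city")
for row in cursor.fetchall():
    print(row[0])

cursor.close()
conn.close()
```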
GCS Connector
The GCS Connector (Google Cloud Storage Connector) is a library that allows Hadoop-
compatible tools like Hive, Spark, and Hadoop MapReduce to read from and write to Google
Cloud Storage (GCS) as if it were HDFS.
Key Features
Handles file-level consistency and streaming writes
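As a sketch (assuming the GCS connector jar is on the Spark classpath and a service-account key is configured via the connector's Hadoop properties; bucket and path names are made up), Hadoop-compatible tools can then address gs:// paths just like HDFS paths:

```python
# Sketch: read a dataset directly from Google Cloud Storage via the GCS connector.
# Assumes the gcs-connector jar is on the classpath and credentials are configured
# (e.g. google.cloud.auth.service.account.json.keyfile); bucket/path are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gcs-demo").getOrCreate()

# The connector registers the gs:// filesystem scheme, so this reads from GCS like HDFS
df = spark.read.csv("gs://example-bucket/sales/2024/*.csv", header=True)
df.groupBy("region").count().show()
```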
The HBase Indexer (also known as Lily HBase Indexer) is a tool that integrates Apache HBase
with Apache Solr to enable near real-time full-text indexing and search on HBase data.
🔹 Purpose
Enables fast search and filtering on HBase datasets using Solr's search capabilities
🔹 Key Features
Supports custom mapping scripts to control how HBase rows are indexed
Can filter columns, transform values, or create derived fields before indexing
What is Hue?
Hue (Hadoop User Experience) is a web-based SQL editor and data browser included in
Cloudera Data Platform (CDP). It provides an intuitive UI for analysts and engineers to interact
with various data services like Hive, Impala, HDFS, and Spark SQL—without needing to use the
CLI.
CDP Search is a distributed search engine built on Apache Solr, integrated with the rest of the
CDP ecosystem (like HDFS, HBase, and YARN). It enables fast querying, filtering, and full-text
search across large-scale datasets.
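For instance, here is a small sketch (assuming a Solr collection named "logs" on solr-host:8983 and the pysolr client; field names are made up) of querying an indexed collection:

```python
# Sketch: full-text query against a Solr collection in CDP Search.
# Assumes a collection named "logs" on solr-host:8983 and the pysolr package (pip install pysolr).
import pysolr

solr = pysolr.Solr("http://solr-host:8983/solr/logs", timeout=10)

# Find documents whose message field mentions "timeout", newest first
results = solr.search("message:timeout", sort="timestamp desc", rows=10)
for doc in results:
    print(doc.get("id"), doc.get("message"))
```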
What is a Schema Registry?
A Schema Registry is a service that stores and manages data structure definitions (schemas) so
different systems can share data safely and understand its format.
When you're sending data from one system to another (e.g., via Kafka), you need to make sure both sides agree on the structure of the data — such as what fields are present, their types (string, number, boolean), and the format.
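As a rough illustration (this sketch assumes a registry exposing the Confluent-compatible REST API on port 8081; Cloudera's Schema Registry has its own UI and API, so treat the endpoint and names as assumptions):

```python
# Sketch: register an Avro schema with a Confluent-compatible Schema Registry REST API.
# Registry URL, subject name, and schema are illustrative assumptions.
import json
import requests

registry_url = "http://schema-registry:8081"
subject = "users-value"

user_schema = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
    ],
}

resp = requests.post(
    f"{registry_url}/subjects/{subject}/versions",
    headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    data=json.dumps({"schema": json.dumps(user_schema)}),
)
print(resp.json())  # e.g. {"id": 1} – consumers can now fetch this schema by id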
Streams Messaging Manager (SMM) is a tool provided by Cloudera for monitoring, managing,
and debugging Apache Kafka clusters in real time.
Think of SMM as a control panel or dashboard for everything happening inside Kafka — topics,
messages, producers, consumers, brokers — all visualized in one place.
Kafka is powerful, but by default, it lacks a visual interface. That means:
You can’t track consumer lag (how much data they haven’t consumed yet)
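SMM visualizes this for you; as a sketch of what consumer lag means, you can compute it yourself with kafka-python (broker address, topic, and group name are assumptions):

```python
# Sketch: consumer lag = latest offset in the partition - last committed offset of the group.
# Assumes a broker at localhost:9092, topic "events", and consumer group "reports-group".
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers="localhost:9092", group_id="reports-group")
tp = TopicPartition("events", 0)

end_offset = consumer.end_offsets([tp])[tp]   # newest offset written by producers
committed = consumer.committed(tp) or 0       # last offset this group has committed
print(f"lag on {tp}: {end_offset - committed} messages")
```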
Streams Replication Manager (SRM) is a Cloudera tool used to replicate Kafka topics and
messages from one Kafka cluster to another.
It allows you to synchronize data across environments — for backup, disaster recovery,
migration, or hybrid cloud use cases.
In simple words:
SRM copies Kafka data from one place to another, in real time.
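SRM itself is configured on the cluster side (cluster aliases, topic allow-lists). As a sketch of the consumer-facing effect: like MirrorMaker 2, SRM typically prefixes replicated topics with the source cluster alias, so on the target cluster a consumer might read the copy like this (broker address, alias, and topic name are assumptions):

```python
# Sketch: consume a topic replicated by SRM into a backup cluster.
# Topic "orders" from source cluster alias "primary" appears as "primary.orders" on the target.
# Broker address, alias, and topic name are illustrative assumptions.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "primary.orders",
    bootstrap_servers="backup-kafka:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,
)
for msg in consumer:
    print(msg.offset, msg.value)
```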
HBase Connectors
Definition:
HBase connectors are integrations that allow HBase to connect with other tools such as
Apache Spark, Hive, Kafka, or MapReduce — so data in HBase can be read/written from those
platforms.
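For example, here is a sketch (assuming the HBase Thrift server is running on hbase-host:9090 and the happybase Python client; table and column names are made up) of reading and writing HBase from another platform:

```python
# Sketch: read/write an HBase table from Python via the Thrift gateway.
# Assumes HBase Thrift server at hbase-host:9090, a table "users" with column family "info",
# and the happybase package (pip install happybase).
import happybase

connection = happybase.Connection("hbase-host", port=9090)
table = connection.table("users")

# Write one row (row key + column-family:qualifier values)
table.put(b"user-001", {b"info:name": b"Alice", b"info:city": b"Lahore"})

# Read it back
row = table.row(b"user-001")
print(row[b"info:name"].decode())

connection.close()
```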
Hive Metastore
Definition:
The Hive Metastore is a central metadata repository for Hive. It stores information about:
Tables
Columns
Data types
Location of data in HDFS
Other tools like Spark, Impala, Presto also use HMS to read metadata
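For instance, a sketch (assuming a Spark session configured against the same metastore via hive-site.xml; the table name is made up) of Spark discovering Hive tables purely through HMS metadata:

```python
# Sketch: Spark using the Hive Metastore to discover tables, columns, and HDFS locations.
# Assumes Spark is configured with hive-site.xml pointing at the shared metastore.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hms-demo")
    .enableHiveSupport()   # makes Spark read table/column metadata from HMS
    .getOrCreate()
)

spark.sql("SHOW TABLES IN default").show()
spark.sql("DESCRIBE FORMATTED default.customers").show(truncate=False)  # columns, types, HDFS location
```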
Hive on Tez
Definition:
Hive on Tez is a faster execution engine for Hive queries, using Apache Tez instead of MapReduce. It optimizes DAG-based query execution for interactive and batch queries.
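A minimal illustration (assuming a HiveServer2 session reachable via PyHive; host and table are made up) — switching the execution engine is a session-level setting:

```python
# Sketch: run the same query on Tez instead of MapReduce by switching the engine.
# Assumes HiveServer2 at hive-host:10000 and the PyHive package.
from pyhive import hive

cursor = hive.Connection(host="hive-host", port=10000).cursor()
cursor.execute("SET hive.execution.engine=tez")   # use the Tez DAG engine for this session
cursor.execute("SELECT city, COUNT(*) FROM customers GROUP BY city")
print(cursor.fetchall())
```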
Hive Warehouse Connector
Definition:
The Hive Warehouse Connector (HWC) allows Apache Spark to read/write data from Hive
tables using the Hive LLAP engine.
Use Case:
If you're running Hive on LLAP (in Cloudera or HDP), this connector helps Spark access data
more efficiently.
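A sketch of the Spark-side usage (the module and builder names below follow the Hortonworks/Cloudera HWC examples and may differ by version, so treat them as assumptions):

```python
# Sketch: read a Hive table from Spark through the Hive Warehouse Connector (LLAP).
# Assumes the HWC jar and its Python zip are on the Spark classpath/PYTHONPATH and
# spark.sql.hive.hiveserver2.jdbc.url is configured; API names may vary by CDP/HDP version.
from pyspark.sql import SparkSession
from pyspark_llap import HiveWarehouseSession

spark = SparkSession.builder.appName("hwc-demo").getOrCreate()
hive = HiveWarehouseSession.session(spark).build()

df = hive.executeQuery("SELECT city, COUNT(*) AS cnt FROM customers GROUP BY city")
df.show()
```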
Spark Atlas Connector
Definition:
The Spark Atlas Connector helps integrate Apache Spark with Apache Atlas, so that metadata
lineage (like data flow from table A → B) can be captured automatically.
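A sketch of how it is typically wired in (the listener class name below follows the open-source spark-atlas-connector project and is an assumption; check your distribution's documentation):

```python
# Sketch: enable the Spark Atlas Connector so Spark jobs report lineage to Atlas.
# Assumes the connector jar is on the classpath and atlas-application.properties is available;
# the listener class name may differ by version.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("lineage-demo")
    .config("spark.extraListeners", "com.hortonworks.spark.atlas.SparkAtlasEventTracker")
    .config("spark.sql.queryExecutionListeners", "com.hortonworks.spark.atlas.SparkAtlasEventTracker")
    .getOrCreate()
)

# A table-to-table write like this would then produce lineage (table A -> table B) in Atlas
spark.sql("CREATE TABLE db.b AS SELECT * FROM db.a")
```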
A Schema Registry is a system that stores data schemas centrally and ensures that producers
and consumers of data follow compatible formats. It is mostly used with Avro, Protobuf, or
JSON data formats in systems like Kafka or Spark.
Think of it as a "data contract manager" — ensuring data structure stays consistent across systems.
STORAGE IN CDP
1. Clients (Top)
These are the applications or users that send data (called producers) or read data (called
consumers).
Clients talk to Pulsar using read/write commands — like "send this message" or "get this
message."
2. Brokers
Inside Pulsar, there are brokers. These brokers act like traffic managers.
They handle read and write requests from clients. But brokers do not store messages permanently — they just forward them to storage.
3. BookKeeper (Storage)
BookKeeper is the part that actually stores the data.
It has storage servers called bookies. When a message comes in, the Pulsar broker sends it to
several bookies.
This way, even if one bookie fails, the message is still safe on the others.
4. ZooKeeper (Metadata)
ZooKeeper stores important information about the system (metadata), but not the actual messages.
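Putting the pieces together, here is a minimal sketch with the official pulsar-client Python package (broker URL and topic name are assumptions):

```python
# Sketch: produce and consume through a Pulsar broker; brokers route, bookies store.
# Assumes a broker at pulsar://localhost:6650 and the pulsar-client package (pip install pulsar-client).
import pulsar

client = pulsar.Client("pulsar://localhost:6650")

producer = client.create_producer("persistent://public/default/demo-topic")
producer.send("hello pulsar".encode("utf-8"))   # the broker hands the message to several bookies

consumer = client.subscribe("persistent://public/default/demo-topic", subscription_name="demo-sub")
msg = consumer.receive(timeout_millis=5000)
print(msg.data().decode("utf-8"))
consumer.acknowledge(msg)

client.close()
```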
Feature-by-feature comparison: Apache Kafka vs RabbitMQ vs Apache Pulsar

Message Delivery Model
Kafka: Publish-Subscribe – one producer, many consumers (can replay data).
RabbitMQ: Queue-based – one producer, one consumer (or round-robin if multiple consumers).
Pulsar: Supports both models – Pub-Sub (Kafka style) and Queue (RabbitMQ style).

Ordering Guarantees
Kafka: Guarantees order within a partition. Consumers need to manage parallelism manually.
RabbitMQ: Delivers in order unless multiple consumers are pulling from the same queue.
Pulsar: Guarantees order within partitions just like Kafka, but with better cursor-based reading.

Performance/Throughput
Kafka: Very high – built for big data pipelines and high-throughput streaming.
RabbitMQ: Lower compared to Kafka – good for lightweight tasks, not large streams.
Pulsar: Matches Kafka performance, and scales well due to decoupled storage and processing.

Scalability
Kafka: Horizontal scaling via partitions. Needs manual tuning.
RabbitMQ: Limited scalability – queues can become a bottleneck. Scaling often needs sharding.
Pulsar: Highly scalable due to segregation of compute and storage. Topics can scale independently.

Consumer Model
Kafka: Pull-based – consumer pulls data from Kafka.
RabbitMQ: Push-based – broker pushes data to consumers.
Pulsar: Both pull and push available – very flexible.

Fault Tolerance & Durability
Kafka: Very high – with replication, leader election, and durable logs.
RabbitMQ: Configurable – queues can be replicated, but persistence is optional.
Pulsar: Very high – due to BookKeeper, each segment is replicated and durable.

Use Cases
Kafka: Streaming logs, analytics, IoT telemetry, ETL pipelines.
RabbitMQ: Background job processing, real-time tasks (emails, notifications).
Pulsar: Multi-tenant data platforms, hybrid message & event-driven systems, cloud-native apps.

Cloud-Native Features
Kafka: Not built-in. Needs external tooling for features like geo-replication or tiered storage.
RabbitMQ: Minimal cloud features. Needs plugins or wrappers for advanced use.
Pulsar: Cloud-native by design – supports geo-replication, tiered storage, and multi-tenancy.

Multi-Tenant Support
Kafka: Weak – needs multiple clusters or complex namespace setups.
RabbitMQ: No built-in multi-tenant support.
Pulsar: Strong – has namespaces, quotas, isolation policies for multi-tenancy.

Operations & Management
Kafka: Complex – ZooKeeper required, tuning partitions is hard at scale.
RabbitMQ: Easier to manage – good tools like the management UI, but can get messy under load.
Pulsar: Simpler to operate due to separation of concerns. No ZooKeeper needed in latest versions.

Maturity & Community
Kafka: Very mature, large community, wide enterprise usage.
RabbitMQ: Oldest among the three, mature for classic queuing tasks.
Pulsar: Newer (from Yahoo), but growing fast – adopted in cloud-first companies.

Language Support
Kafka: Java, Python, Go, Scala, C++, etc.
RabbitMQ: Java, Python, .NET, Ruby, Go, etc.
Pulsar: Java, Python, Go, C++, Node.js, etc.
As writing to the dataset is relatively cheap and easy in row format, row formatting is
recommended for any case in which you have write-heavy operations. However, read
operations can be very inefficient.
Row-Oriented Storage
Data Layout: Each row is stored together in memory (e.g., ID, Name, Age, City...).
Read Efficiency: Poor for analytical queries – must read entire rows even if only one
column is needed.
Example: To sum age, all rows are loaded, then age values extracted. Costly in large datasets.
Column-Oriented Storage
Data Layout: Each column is stored separately (e.g., all IDs together, all Names together).
Read Efficiency: Good for analytical queries – only the needed columns are read.
Example: To sum age, only the age column is loaded from disk in a single scan – fast and efficient.
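To make the difference concrete, here is a small sketch with pyarrow (file name and data are made up): Parquet stores columns separately, so summing age reads only that column.

```python
# Sketch: columnar storage lets you read just the columns a query needs.
# Uses pyarrow (pip install pyarrow); file name and data are illustrative.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "id":   [1, 2, 3],
    "name": ["Alice", "Bob", "Carol"],
    "age":  [25, 31, 28],
    "city": ["Lahore", "Karachi", "Multan"],
})
pq.write_table(table, "people.parquet")          # written column by column

ages = pq.read_table("people.parquet", columns=["age"])   # only the age column is read from disk
print(sum(ages.column("age").to_pylist()))
```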
Database: Accumulo
Data Storage Model: Wide-Column + Security
How data is stored (simple explanation): Like HBase, but each cell has a security label. Data stored on HDFS, sorted by row + column + timestamp. Ideal for secure/classified data.
Architecture Type: Master-Slave (Master + TabletServers)

Database: CouchDB
Data Storage Model: Document Store (JSON)
How data is stored (simple explanation): Data is stored as JSON documents with unique IDs and version history. Schema-free. Great for offline syncing. Uses an append-only B-Tree for writes.
Architecture Type: Multi-Master (every node is writable and syncs)
*Cassandra Data Storage (Short Format)
1. Partition Key
o Decides which node the data goes to
o Helps distribute data evenly across all nodes
2. SSTable
o Data is stored in sorted files called SSTables
o These files never change once written
3. LSM Tree
o Data is first written in memory
o Then saved in batches to disk (as SSTables)
o This makes writes very fast
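A small sketch with the DataStax Python driver (contact point, keyspace, and table are assumptions) showing how the partition key appears in the table definition:

```python
# Sketch: the partition key (user_id) decides which node stores each row.
# Assumes a local Cassandra node and the cassandra-driver package (pip install cassandra-driver).
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()
session.execute(
    "CREATE KEYSPACE IF NOT EXISTS demo "
    "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}"
)
session.execute(
    "CREATE TABLE IF NOT EXISTS demo.events ("
    "  user_id text, event_time timestamp, action text,"
    "  PRIMARY KEY (user_id, event_time))"   # user_id = partition key, event_time = clustering key
)
# Writes land in the in-memory memtable first, then are flushed to immutable SSTables on disk
session.execute(
    "INSERT INTO demo.events (user_id, event_time, action) VALUES (%s, toTimestamp(now()), %s)",
    ("user-001", "login"),
)
print(session.execute("SELECT * FROM demo.events WHERE user_id = %s", ("user-001",)).one())
```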
Document-based (CouchDB)
➤ Data stored like this:
{ "name": "Alice", "age": 25, "address": "Lahore" }
Feature comparison: HBase (Master-Slave) vs Cassandra (Masterless)

Failure Handling
HBase: If HMaster goes down, the cluster can pause unless a backup master is ready.
Cassandra: If any node goes down, the others keep working.

Best For
HBase: Systems where data accuracy & consistency are critical.
Cassandra: Systems where uptime & speed matter most.

Example Use Case
HBase: Bank transaction history, audit logs.
Cassandra: Messaging apps, IoT data, activity streams.
Feature comparison: Apache Geode vs Apache Ignite (with explanations of terms)

Tooling & Monitoring
Geode: GFSH, JMX, Spring Data Geode
Ignite: SQLLine, Control Center, Prometheus, Spring Data Ignite
Terms: GFSH: Geode CLI. JMX: Java monitoring. SQLLine: SQL CLI. Prometheus: metric collector. Control Center: Ignite's UI.

Language & API Support
Geode: Java, C++, .NET, REST
Ignite: Java, C++, .NET, Python, Node.js, JDBC, ODBC
Terms: API: interface for developers. JDBC/ODBC: SQL access via tools. REST: HTTP-based interface.

Use Cases
Geode: High-speed caching, session replication
Ignite: HTAP, distributed SQL, real-time analytics, ML
Terms: HTAP: Hybrid Transaction + Analytical Processing. Caching: fast access to frequently used data.

System of Record
Geode: ❌ Not suitable as the main database
Ignite: ✅ Can act as a full persistent database
Terms: System of Record: the authoritative source of data that applications trust.

Learning Curve
Geode: Steep (OQL + custom configs)
Ignite: Moderate (SQL + familiar tools)
Terms: Learning Curve: how easy it is for new developers to learn and use the system.
Key Features of Avro
Storage Type: Row-based (stores one full record at a time)
Schema Support: Schema is written in JSON, but data is stored in binary
Schema Evolution: ✅ Supports changes like adding/removing fields
Compression: Compact binary format (smaller than JSON)
Language Support: Works with many languages like Python, Java, C++, etc.
🔶 How It Works
1. Define a Schema in JSON (e.g., a user has name and age).
2. Serialize the data using that schema into compact binary format.
3. Deserialize the binary data later using the same (or updated) schema.
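A sketch of those three steps with the fastavro package (schema, records, and file name are made up):

```python
# Sketch: define a schema, serialize records to Avro binary, then read them back.
# Uses fastavro (pip install fastavro); schema and file name are illustrative.
from fastavro import parse_schema, reader, writer

schema = parse_schema({
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
    ],
})

records = [{"name": "Alice", "age": 25}, {"name": "Bob", "age": 31}]

with open("users.avro", "wb") as out:        # 2. serialize to compact binary
    writer(out, schema, records)

with open("users.avro", "rb") as f:          # 3. deserialize using the (embedded) schema
    for user in reader(f):
        print(user["name"], user["age"])
```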
🔶 Advantages
✅ Fast and compact
✅ Handles schema changes well
✅ Works across many languages
✅ Great with streaming tools like Kafka
🔶 Limitations
❌ Not human-readable (it's binary)
❌ Not ideal for analytical queries (use Parquet or ORC instead)
❌ Requires schema management
🔶 Key Features
Storage Type: Columnar → Data is stored by columns, not rows
Compression: Supports high compression (like Snappy, Gzip) to reduce file size
Read Speed: Very fast, especially when reading only a few columns
Write Speed: Slower than row-based formats, because it arranges columns separately
Supports Nesting: Yes → Can store complex data (like arrays, maps, structures)
Splittable: Yes → Large files can be split for parallel processing
🔶 Limitations
❌ Not human-readable (binary) – requires tools to inspect
❌ Not as query-efficient as columnar formats like Parquet or ORC
❌ Slower for read-heavy analytics workflows
Normally, when moving data between systems (e.g., Pandas ➡️ Spark ➡️ R), the data needs to be
serialized/deserialized. Arrow eliminates that step by using a standardized memory format, allowing all tools to
directly access the same data without conversion.
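As a small sketch with pyarrow and pandas (the DataFrame contents are made up): both libraries work against the same Arrow columnar memory, so the hand-off does not re-serialize the data.

```python
# Sketch: move tabular data between pandas and Arrow without serializing it.
# Uses pyarrow and pandas; the data is illustrative.
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [25, 31]})

table = pa.Table.from_pandas(df)     # pandas -> Arrow columnar memory
print(table.schema)

df_again = table.to_pandas()         # Arrow -> pandas, same columns, no custom serialization step
print(df_again)
```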