
Traditional Database vs. Apache Geode

| Feature | Traditional Database (RDBMS/NoSQL) | Apache Geode (Data Grid) |
| --- | --- | --- |
| Primary Storage Medium | Disk-first (writes to disk always) | Memory-first (RAM is primary; disk optional) |
| Data Model | Rows & tables (RDBMS), documents (NoSQL) | Key-value pairs or objects (regions) |
| Query Language | Full SQL (ANSI-92/99/2003) | OQL (SQL-like, but not complete SQL) |
| Schema Enforcement | Strict (columns + data types) | Schema-less (stores Java/JSON-like objects) |
| Durability (the “D” in ACID) | Always durable | Optional (can be turned on/off) |
| Designed for | Long-term data storage | Fast, transient, in-memory access |
| Secondary Indexes | Supported extensively | Limited support |
| Cold Data Support | Yes (cold storage, backups) | Not ideal (memory-first) |

Kafka Cruise Control — Smarter Kafka Cluster Optimization

As Kafka clusters grow over time, so does the complexity of managing their performance. Topics
increase, brokers are added, and partition traffic becomes uneven — all of which leads to
imbalance across the cluster.

That’s where Kafka Cruise Control comes in — an open-source tool by LinkedIn that
automatically monitors, detects, and rebalances workloads across brokers.

Why Do We Need It?

When you initially deploy Kafka, partition distribution across brokers seems balanced. But over
time:

 Some topics handle heavy traffic (e.g. logs, telemetry).

 Others stay light.

 New brokers only get new partitions.

 Older brokers remain overloaded.

Kafka’s default logic doesn’t adjust for these changes.

Result?

 Some brokers hit disk and CPU limits

 Others stay mostly idle

 Latency increases

 Risk of consumer lag and broker failure rises

*Note: Hive is a SQL engine that converts SQL queries into MapReduce (or Tez) jobs; Trino and Impala are also SQL engines.

SQL → Hive → MapReduce → YARN → HDFS

What Is a Broker?

A broker is a Kafka server that stores and manages message data. It handles reads/writes from
producers and consumers. A Kafka cluster typically consists of multiple brokers working
together.

What Cruise Control Actually Does

Cruise Control continuously monitors your Kafka cluster’s resource usage and intelligently
redistributes partition workloads based on actual traffic, not just partition count.

It can detect:

 CPU or disk pressure on specific brokers

 Imbalanced network traffic

 Broker failures

 Replica skew

And then:

 Generate rebalance plans

 Respect Kafka’s rack-awareness and replication rules

 Execute these plans automatically or with your approval
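Cruise Control exposes these capabilities through a REST API. Below is a minimal sketch of calling it from Python, assuming a standalone deployment on its default port 9090; the host name is a placeholder and endpoint details can vary by version and vendor packaging.

```python
# Minimal sketch: querying Kafka Cruise Control's REST API (host is a placeholder;
# endpoint paths follow the open-source Cruise Control documentation).
import requests

BASE = "http://cruise-control-host:9090/kafkacruisecontrol"  # assumed default port

# Check the cluster state and load as seen by Cruise Control.
state = requests.get(f"{BASE}/state", params={"json": "true"})
print(state.json())

# Ask for a rebalance proposal without executing it (dry run).
proposal = requests.post(f"{BASE}/rebalance", params={"dryrun": "true", "json": "true"})
print(proposal.text)
```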

Understanding Data Analytics Studio (DAS) in CDP

In a CDP Private Cloud Base environment, Data Analytics Studio (DAS) provides a web-based
interface for interacting with Hive. It allows users to write and run HiveQL queries, view
execution plans, explore metadata, and analyze historical query performance.

DAS is particularly useful for teams that work with large-scale Hive workloads and need visibility
into performance and resource usage. It integrates closely with Hive, Tez, HDFS, and other
components in the Hadoop ecosystem.

High-Level Architecture of DAS

The DAS system in CDP is built from several components working together in a layered
architecture. These layers interact as queries are submitted, executed, and monitored.

1. Web UI Layer - DAS Webapp

 This is the user-facing component accessed via a web browser.

 Users can write Hive queries, explore metadata, track query history, and analyze
diagnostics.

 The webapp itself does not execute queries; it delegates that to backend services.

2. Query Layer - HiveServer2 (Interactive Mode)

 Queries from the DAS Webapp are sent to HiveServer2.

 Hive parses and validates the HiveQL.

 It compiles the query into a logical and physical execution plan.

 The execution plan is represented as a DAG (Directed Acyclic Graph) which is passed to
Tez for execution.
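To make this step concrete, here is a minimal sketch of a client submitting HiveQL to HiveServer2 over its Thrift interface using the PyHive library. This is illustrative only: the host, port, username, and table name are placeholders, and DAS itself manages its own connections internally.

```python
# Minimal sketch: submitting a HiveQL query to HiveServer2, the same entry point
# the DAS webapp delegates to. Connection details and table name are placeholders.
from pyhive import hive

conn = hive.connect(host="hiveserver2.example.com", port=10000, username="analyst")
cursor = conn.cursor()

# HiveServer2 parses and compiles this into a Tez DAG and runs it on YARN.
cursor.execute("SELECT city, COUNT(*) AS orders FROM sales GROUP BY city")
for row in cursor.fetchall():
    print(row)
```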

3. Execution Layer - Apache Tez and YARN

 Apache Tez takes the DAG from Hive and runs it efficiently using parallel execution.

 YARN allocates the necessary containers and resources for the Tez jobs to run.

 Tez reads data from and writes results back to HDFS as needed.

4. Event and Metadata Processing - DAS Event Processor

 The DAS Event Processor captures execution events generated by Hive and Tez.

 It also replicates metadata from the Hive Metastore (such as table and column
definitions).

 All collected data is processed and stored into a PostgreSQL database used by the DAS
Webapp.

5. Data Storage - HDFS and PostgreSQL

 HDFS stores the Hive-managed datasets, including raw data, intermediate files, and
result outputs.

 PostgreSQL serves as the internal metadata store for DAS. It contains:

o Query history

o Execution diagnostics

o Replicated table metadata

o User-accessible metadata for the web interface

GCS Connector

The GCS Connector (Google Cloud Storage Connector) is a library that allows Hadoop-
compatible tools like Hive, Spark, and Hadoop MapReduce to read from and write to Google
Cloud Storage (GCS) as if it were HDFS.

Key Features

 Integrates GCS with Hadoop FileSystem APIs (gs:// prefix)

 Supports read/write operations from Hive, Spark, Pig, and Hadoop

 Allows use of GCS buckets as input/output in distributed jobs

 Handles file-level consistency and streaming writes

 Works with Hadoop tools in both on-prem and Dataproc clusters
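A minimal PySpark sketch of what this looks like in practice, assuming the GCS Connector JAR and Google Cloud credentials are already configured on the cluster; the bucket and paths are placeholders.

```python
# Minimal sketch: reading from and writing to Google Cloud Storage via the
# connector's gs:// scheme. Because the connector registers gs:// as a Hadoop
# FileSystem, Spark treats the bucket like any HDFS path.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gcs-connector-demo").getOrCreate()

df = spark.read.parquet("gs://example-bucket/raw/events/")
df.filter(df["status"] == "OK").write.mode("overwrite") \
  .parquet("gs://example-bucket/curated/events_ok/")
```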

HBase Indexer – Overview

The HBase Indexer (also known as Lily HBase Indexer) is a tool that integrates Apache HBase
with Apache Solr to enable near real-time full-text indexing and search on HBase data.

🔹 Purpose

 Automatically indexes HBase table data into Solr as it is written or updated

 Enables fast search and filtering on HBase datasets using Solr's search capabilities

🔹 Key Features

 Real-time indexing of HBase write operations (PUTs)

 Integration with Solr (via SolrJ client or HTTP)

 Supports custom mapping scripts to control how HBase rows are indexed

 Can filter columns, transform values, or create derived fields before indexing

 Works well for search-heavy use cases built on top of HBase
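Once the indexer has pushed HBase rows into a Solr collection, searches go against Solr rather than scanning HBase. A minimal sketch with the pysolr client, assuming a hypothetical collection and field populated by the indexer's mapping:

```python
# Minimal sketch: querying the Solr collection that the Lily HBase Indexer keeps
# in sync with an HBase table. The Solr URL, collection, and field are placeholders.
import pysolr

solr = pysolr.Solr("http://solr-host:8983/solr/hbase_events", timeout=10)

# Full-text search over indexed HBase cell values; each matching document
# typically carries the HBase row key for follow-up lookups.
results = solr.search("message_text:timeout", rows=10)
for doc in results:
    print(doc.get("id"), doc.get("message_text"))
```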

What is Hue?

Hue (Hadoop User Experience) is a web-based SQL editor and data browser included in
Cloudera Data Platform (CDP). It provides an intuitive UI for analysts and engineers to interact
with various data services like Hive, Impala, HDFS, and Spark SQL—without needing to use the
CLI.

What is CDP Search?

CDP Search is a distributed search engine built on Apache Solr, integrated with the rest of the
CDP ecosystem (like HDFS, HBase, and YARN). It enables fast querying, filtering, and full-text
search across large-scale datasets.

What is a Schema Registry?

A Schema Registry is a service that stores and manages data structure definitions (schemas) so
different systems can share data safely and understand its format.

It ensures that both:

 The data sender (producer)

 The data receiver (consumer)

...agree on the structure of the data — such as what fields are present, their types (string,
number, boolean), and the format.

🔹 Why is Schema Registry Needed?

When you're sending data from one system to another (e.g., via Kafka), you need to make
sure:

 The data format is consistent

 New versions of data (schemas) don't break existing systems

 Everyone knows what the fields mean

Without Schema Registry:

 One system may send data in a new format

 The other system crashes or misinterprets it
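As a concrete sketch, registering a schema makes the "contract" explicit before any producer sends data. The example below uses the Confluent-compatible Python client purely for illustration; Cloudera's Schema Registry exposes a different but analogous REST API, and the URL and subject name are placeholders.

```python
# Minimal sketch: registering an Avro schema so producers and consumers agree on
# the record structure (illustrative; registry URL and subject are placeholders).
from confluent_kafka.schema_registry import SchemaRegistryClient, Schema

user_schema = """
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age",  "type": "int"}
  ]
}
"""

client = SchemaRegistryClient({"url": "http://schema-registry.example.com:8081"})
schema_id = client.register_schema("users-value", Schema(user_schema, "AVRO"))
print("registered schema id:", schema_id)
```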

What is Streams Messaging Manager (SMM)?

Streams Messaging Manager (SMM) is a tool provided by Cloudera for monitoring, managing,
and debugging Apache Kafka clusters in real time.

Think of SMM as a control panel or dashboard for everything happening inside Kafka — topics,
messages, producers, consumers, brokers — all visualized in one place.

🔹 Why is SMM Used?

Kafka is powerful, but by default, it lacks a visual interface. That means:

 You don’t easily see which topics exist

 You can’t track consumer lag (how much data they haven’t consumed yet)

 You can't monitor who is producing or consuming what

SMM solves these problems by giving you:

 A web UI to explore the Kafka cluster

 Real-time monitoring of topics, partitions, lags, brokers, consumers

 Tools for debugging Kafka issues

What is Streams Messaging Manager (SMM)?


 Streams Messaging Manager (SMM) is a web-based monitoring and management tool
developed by Cloudera. It is used to visualize, monitor, and troubleshoot Apache Kafka
in real time.
 Imagine SMM as a control panel for Kafka, where you can see everything that’s
happening in your cluster — like who’s sending messages, who’s consuming them,
which topics exist, and whether any part of the system is falling behind.

Why Use SMM?

Kafka by itself is powerful but has no built-in user interface.

SMM solves that by giving you:

 A visual dashboard for topics, producers, consumers


 Real-time monitoring of message flow and performance
 Tools to debug lag and throughput issues
 Message browsing for Avro, JSON, and plain text data

What is Streams Replication Manager (SRM)?

Streams Replication Manager (SRM) is a Cloudera tool used to replicate Kafka topics and
messages from one Kafka cluster to another.
It allows you to synchronize data across environments — for backup, disaster recovery,
migration, or hybrid cloud use cases.

In simple words:

SRM copies Kafka data from one place to another, in real time.

HBase Connectors

Definition:
HBase connectors are integrations that allow HBase to connect with other tools such as
Apache Spark, Hive, Kafka, or MapReduce — so data in HBase can be read/written from those
platforms.

Hive Meta Store (HMS)

Definition:
The Hive Metastore is a central metadata repository for Hive. It stores information about:

 Tables
 Columns
 Data types
 Location of data in HDFS

Other tools such as Spark, Impala, and Presto also use HMS to read metadata.

Hive on Tez

Definition:
Hive on Tez is a faster execution engine for Hive queries that uses Apache Tez instead of
MapReduce. It optimizes DAG-based query execution for interactive and batch queries.

Hive Warehouse Connector

Definition:
The Hive Warehouse Connector (HWC) allows Apache Spark to read/write data from Hive
tables using the Hive LLAP engine.

Use Case:
If you're running Hive on LLAP (in Cloudera or HDP), this connector helps Spark access data
more efficiently.
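A minimal sketch of the Spark side, using the pyspark_llap module that ships with HWC. This assumes the HWC JAR and Python zip are already attached to the Spark job (exact package names and session options vary by CDP/HDP version), and the table is a placeholder.

```python
# Minimal sketch: reading a Hive table from Spark through the Hive Warehouse
# Connector. The query runs through HiveServer2/LLAP rather than Spark reading
# HDFS files directly. Table and column names are placeholders.
from pyspark.sql import SparkSession
from pyspark_llap import HiveWarehouseSession

spark = SparkSession.builder.appName("hwc-demo").getOrCreate()
hive = HiveWarehouseSession.session(spark).build()

df = hive.executeQuery(
    "SELECT customer_id, SUM(amount) AS total FROM sales GROUP BY customer_id")
df.show()
```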

Spark Atlas Connector

Definition:
The Spark Atlas Connector helps integrate Apache Spark with Apache Atlas, so that metadata
lineage (like data flow from table A → B) can be captured automatically.

What is a Schema Registry?

A Schema Registry is a system that stores data schemas centrally and ensures that producers
and consumers of data follow compatible formats. It is mostly used with Avro, Protobuf, or
JSON data formats in systems like Kafka or Spark.

Think of it as a "data contract manager" — ensuring data structure stays consistent across
systems.

STORAGE IN CDP

Apache BookKeeper vs. Apache Pulsar Tiered Storage

1. Clients (Top)

These are the applications or users that send data (called producers) or read data (called
consumers).
Clients talk to Pulsar using read/write commands — like "send this message" or "get this
message."

2. Apache Pulsar (Middle Left)

Inside Pulsar, there are brokers. These brokers act like traffic managers.
They handle:

 All incoming messages from clients


 Deciding where to store the message
 Giving messages to clients when asked

But brokers do not store messages permanently — they just forward them to storage.

3. Apache BookKeeper (Middle Right)

BookKeeper is the part that actually stores the data.
It has storage servers called bookies. When a message comes in, the Pulsar broker sends it to
several bookies.
This way, even if one bookie fails, the message is still safe on the others.

4. Apache ZooKeeper (Bottom)

ZooKeeper is used to manage and monitor the whole system.


It keeps track of:

 Which broker is doing what


 Which bookie is storing which data
 If something goes wrong, ZooKeeper helps recover the system

It stores important information about the system, but not the actual messages.
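From a client's point of view, only the broker is visible; BookKeeper and ZooKeeper stay behind the scenes. A minimal sketch with the pulsar-client Python library, assuming a local broker and a placeholder topic:

```python
# Minimal sketch: a producer and a consumer talking only to the Pulsar broker.
# The broker persists each message to multiple bookies before acknowledging it;
# the client never contacts BookKeeper or ZooKeeper directly.
import pulsar

client = pulsar.Client("pulsar://localhost:6650")

# Subscribe first so the new subscription sees the message we are about to send.
consumer = client.subscribe("persistent://public/default/demo-topic",
                            subscription_name="demo-sub")

producer = client.create_producer("persistent://public/default/demo-topic")
producer.send(b"hello pulsar")   # broker replicates this to bookies before acking

msg = consumer.receive()
print(msg.data())                # b'hello pulsar'
consumer.acknowledge(msg)

client.close()
```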

APACHE KAFKA vs. RABBITMQ vs. APACHE PULSAR

| Feature | Apache Kafka | RabbitMQ | Apache Pulsar |
| --- | --- | --- | --- |
| System Type | Distributed log-based messaging system – focuses on ordered data streams with replay ability. | Traditional message queue – focuses on tasks getting processed and removed. | Hybrid of log-based + message queue – combines Kafka-like logs with RabbitMQ-style queues. |
| Core Architecture | Uses brokers only – brokers store and serve messages. All logic is packed into Kafka brokers. | Uses a central broker with internal queues – queues are tightly bound to brokers. | Uses a segregated architecture – brokers serve traffic, storage is fully offloaded to BookKeeper. |
| Storage Mechanism | Messages are stored as immutable logs in topic-partitions on disk. Can replay any old message. | Messages are deleted after being consumed, unless specifically persisted. | Messages are stored in segments using BookKeeper. Retention and replay are built-in. |
| Message Retention | Time- or size-based – data stays even after it's consumed. Consumers can replay anytime. | Messages are removed after delivery (default). If persistent queues are used, can store for a bit. | Time- or size-based just like Kafka – but more efficient as storage is handled separately. |
| Message Delivery Model | Publish-subscribe – one producer, many consumers (can replay data). | Queue-based – one producer, one consumer (or round-robin if multiple consumers). | Supports both models – pub-sub (Kafka style) and queue (RabbitMQ style). |
| Ordering Guarantees | Guarantees order within a partition. Consumers need to manage parallelism manually. | Delivers in order unless multiple consumers are pulling from the same queue. | Guarantees order within partitions just like Kafka, but with better cursor-based reading. |
| Performance/Throughput | Very high – built for big data pipelines and high-throughput streaming. | Lower compared to Kafka – good for lightweight tasks, not large streams. | Matches Kafka performance, and scales well due to decoupled storage and processing. |
| Scalability | Horizontal scaling via partitions. Needs manual tuning. | Limited scalability – queues can become a bottleneck. Scaling often needs sharding. | Highly scalable due to segregation of compute and storage. Topics can scale independently. |
| Consumer Model | Pull-based – consumer pulls data from Kafka. | Push-based – broker pushes data to consumers. | Both pull and push available – very flexible. |
| Fault Tolerance & Durability | Very high – with replication, leader election, and durable logs. | Configurable – queues can be replicated, but persistence is optional. | Very high – due to BookKeeper, each segment is replicated and durable. |
| Use Cases | Streaming logs, analytics, IoT telemetry, ETL pipelines. | Background job processing, real-time tasks (emails, notifications). | Multi-tenant data platforms, hybrid message & event-driven systems, cloud-native apps. |
| Cloud-Native Features | Not built-in. Needs external tooling for features like geo-replication or tiered storage. | Minimal cloud features. Needs plugins or wrappers for advanced use. | Cloud-native by design – supports geo-replication, tiered storage, and multi-tenancy. |
| Multi-Tenant Support | Weak – needs multiple clusters or complex namespace setups. | No built-in multi-tenant support. | Strong – has namespaces, quotas, and isolation policies for multi-tenancy. |
| Operations & Management | Complex – ZooKeeper is required, tuning partitions is hard at scale. | Easier to manage – good tools like the management UI, but can get messy under load. | Simpler to operate due to separation of concerns. No ZooKeeper needed in latest versions. |
| Maturity & Community | Very mature, large community, wide enterprise usage. | Oldest among the three, mature for classic queuing tasks. | Newer (from Yahoo), but growing fast – adopted in cloud-first companies. |
| Language Support | Java, Python, Go, Scala, C++, etc. | Java, Python, .NET, Ruby, Go, etc. | Java, Python, Go, C++, Node.js, etc. |

Apache Pulsar vs. Apache Kafka

Apache Pulsar Tiered Storage

| Feature | Apache Pulsar | Apache BookKeeper |
| --- | --- | --- |
| What it is | A messaging system that sends and receives real-time data between applications | A storage system that saves the actual messages sent through Pulsar |
| Main job | Handles producers and consumers, manages topics, routes messages | Stores the messages durably and safely using a distributed setup |
| User-facing? | Yes – producers and consumers directly interact with Pulsar | No – it works in the background; only Pulsar brokers talk to it |
| Data storage | Doesn’t store data itself – it sends messages to BookKeeper | Stores data in a format called ledgers, spread across multiple servers (bookies) |
| Fault tolerance | Uses BookKeeper’s replication to ensure messages are not lost | Replicates messages across bookies so that if one fails, data is still safe |
| Part of | Pulsar is a full system that uses BookKeeper for storage | BookKeeper is just the storage backend that Pulsar depends on |

Columnar database vs. row-oriented database

As writing to the dataset is relatively cheap and easy in row format, row formatting is
recommended for any case in which you have write-heavy operations. However, read
operations can be very inefficient.

Row-Oriented Storage

 Data Layout: Each row is stored together in memory (e.g., ID, Name, Age, City...).

 Write Efficiency: Very efficient – appends are fast and simple.

 Read Efficiency: Poor for analytical queries – must read entire rows even if only one
column is needed.

 Ideal Use Case: Write-heavy operations, OLTP systems, wide datasets.

 Drawback: Inefficient for queries on specific columns due to non-contiguous column


data.

Example: To sum age, all rows are loaded, then age values extracted. Costly in large datasets.

Column-Oriented Storage

 Data Layout: Each column is stored separately (e.g., all IDs together, all Names together).

 Write Efficiency: Slower – updating or inserting requires writing to multiple locations.

 Read Efficiency: Highly efficient – only relevant columns are read.

 Ideal Use Case: Read-heavy, analytical queries, OLAP systems.

 Compression: Superior compression due to data type uniformity per column.

 Encoding: Optimized (e.g., different encoding for integers, strings).

Example: To sum age, only the age column is loaded from a single disk – fast and efficient.
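A toy sketch contrasting the two layouts (illustrative only — real engines operate on disk pages and compressed blocks, not Python lists): summing one field over row-oriented records touches every whole record, while the columnar layout reads a single contiguous array.

```python
# Row-oriented layout: every record is loaded just to pull out one field.
rows = [
    {"id": 1, "name": "Alice", "age": 25, "city": "Lahore"},
    {"id": 2, "name": "Bob",   "age": 30, "city": "Karachi"},
    {"id": 3, "name": "Carol", "age": 28, "city": "Multan"},
]
total_age_rows = sum(record["age"] for record in rows)

# Column-oriented layout: the same data stored per column; only "age" is touched.
columns = {
    "id":   [1, 2, 3],
    "name": ["Alice", "Bob", "Carol"],
    "age":  [25, 30, 28],
    "city": ["Lahore", "Karachi", "Multan"],
}
total_age_columns = sum(columns["age"])

assert total_age_rows == total_age_columns == 83
```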

Comparison of Apache HBase, Cassandra, Accumulo, Kudu, and CouchDB

| Database | Data Storage Model | How Data is Stored (Simple Explanation) | Architecture Type |
| --- | --- | --- | --- |
| HBase | Wide-Column Store | Data is stored like a big Excel sheet: rows grouped into column families; each cell can have versions (timestamps). Data is saved on HDFS. | Master-Slave (HMaster + RegionServers) |
| Cassandra | Wide-Column Store | Data is split across nodes using partition keys, written to SSTables with LSM Trees. Every node is equal; data is evenly distributed. | Masterless (Peer-to-peer ring) |
| Accumulo | Wide-Column + Security | Like HBase, but each cell has a security label. Data stored on HDFS, sorted by row + column + timestamp. Ideal for secure/classified data. | Master-Slave (Master + TabletServers) |
| Kudu | Columnar Storage | Data is stored by columns, not rows. Same-column values are stored together for fast analytics. Allows updates & deletes, uses its own storage (not HDFS). | Master-Slave (Kudu Master + TabletServers) |
| CouchDB | Document Store (JSON) | Data is stored as JSON documents with unique IDs and version history. Schema-free. Great for offline syncing. Uses an append-only B-Tree for writes. | Multi-Master (Every node is writable and syncs) |

*Cassandra Data Storage (Short Format)
1. Partition Key
o Decides which node the data goes to
o Helps distribute data evenly across all nodes
2. SSTable
o Data is stored in sorted files called SSTables
o These files never change once written
3. LSM Tree
o Data is first written in memory
o Then saved in batches to disk (as SSTables)
o This makes writes very fast
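A toy sketch of that write path — not Cassandra's actual implementation (which adds commit logs, compaction, bloom filters, and more), just the memtable-then-SSTable idea described above:

```python
# Toy LSM sketch: writes land in an in-memory memtable and are periodically
# flushed to immutable, sorted "SSTables". Reads check newest data first.
import bisect

class ToyLSM:
    def __init__(self, flush_threshold=3):
        self.memtable = {}            # in-memory writes (fast)
        self.sstables = []            # immutable sorted (key, value) lists
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            # Flush: sort the memtable and write it out as an immutable SSTable.
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:               # newest data first
            return self.memtable[key]
        for table in reversed(self.sstables):  # then newest SSTable to oldest
            keys = [k for k, _ in table]
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return table[i][1]
        return None

store = ToyLSM()
for i in range(5):
    store.put(f"user:{i}", {"age": 20 + i})
print(store.get("user:1"))   # {'age': 21}
```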

Row-based storage (HBase/Cassandra)


➤ Data stored like this:
Row1 → [Name, Age, Address]
Row2 → [Name, Age, Address]

Column-based storage (Kudu)


➤ Data stored like this:
Column: Name → [Alice, Bob, Carol]
Column: Age → [25, 30, 28]

Document-based (CouchDB)
➤ Data stored like this:
{ "name": "Alice", "age": 25, "address": "Lahore" }

Master-Slave vs Masterless (HBase vs Cassandra)

| Feature | HBase (Master-Slave) | Cassandra (Masterless) |
| --- | --- | --- |
| Architecture | Has one Master (HMaster) and many Slaves (RegionServers) | All nodes are equal – no master, all are peers |
| Who Controls? | HMaster controls assignment and coordination | Every node can handle read/write independently |
| Data Storage | RegionServers store data on HDFS | Data is split across all nodes using the partition key |
| Failure Handling | If the HMaster goes down, the cluster can pause unless a backup is ready | If any node goes down, others keep working |
| Best For | Systems where data accuracy & consistency are critical | Systems where uptime & speed matter most |
| Example Use Case | Bank transaction history, audit logs | Messaging apps, IoT data, activity streams |
Case

Why Accumulo is Strong in Security


 Every cell has a label: Unlike row-level or table-level controls, this ensures that sensitive information is
protected even within the same row.
 Visibility expressions: Logical expressions (&, |, !) allow complex access control logic.
 Auditable and Transparent: Enforces who can see what, down to the byte level.
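A simplified sketch of the idea of cell-level visibility checks: a cell is returned only if the user's authorizations satisfy the cell's visibility expression. This is a toy evaluator supporting `&`, `|`, and parentheses, not Accumulo's real parser.

```python
# Toy sketch of cell-level visibility: evaluate an expression like
# "secret & (ops | admin)" against the set of authorizations a user holds.
import re

def can_see(expression: str, user_auths: set) -> bool:
    # Replace each label token with True/False based on the user's authorizations,
    # then evaluate the resulting boolean expression.
    def replace(match):
        return str(match.group(0) in user_auths)
    python_expr = re.sub(r"[A-Za-z_][A-Za-z0-9_]*", replace, expression)
    python_expr = python_expr.replace("&", " and ").replace("|", " or ")
    return eval(python_expr)  # acceptable for a toy sketch with trusted input

print(can_see("secret & (ops | admin)", {"secret", "ops"}))   # True
print(can_see("secret & (ops | admin)", {"ops"}))             # False
```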

Apache Geode vs Apache Ignite

| Feature | Apache Geode | Apache Ignite | Explanation of Term |
| --- | --- | --- | --- |
| System Type | In-Memory Data Grid | In-Memory Computing Platform | In-memory: stores data in RAM for speed. Data grid: distributed key-value store. Computing platform: includes cache, compute, SQL, persistence, ML. |
| Cluster Architecture | Client-server with locators | Peer-to-peer with automatic discovery | Client-server: clients connect to designated servers. Locator: finds servers. Peer-to-peer: all nodes equal, no central server. |
| Data Model | Regions (key-value, object-oriented) | Caches as SQL tables (schema-aware) | Region: Geode’s map-like storage unit. Cache/table: Ignite’s container that supports SQL. Schema-aware: understands field names/types. |
| Query Language | OQL (Object Query Language, limited SQL-like) | Full ANSI SQL (SELECT, JOIN, DML, subqueries) | OQL: Geode’s object-based query language. ANSI SQL: industry-standard SQL. DML: data changes (INSERT, UPDATE, DELETE). |
| Persistence Model | Optional Write-Ahead Log (WAL), snapshots | Native persistence (acts as a durable database) | WAL: writes changes to a log first for recovery. Snapshot: point-in-time backup. Native persistence: stores all data to disk. |
| Compute Model | Function execution (basic) | Full compute grid (MapReduce, closures, distributed tasks) | Function execution: sends Java code to a node. Compute grid: distributed job execution. MapReduce/closures: parallel computation patterns. |
| Machine Learning | No built-in ML | Built-in ML algorithms & training APIs | Machine learning (ML): algorithms that learn patterns from data. Built-in: no need for external tools. |
| Streaming Support | External tools like Kafka, Spark | Native streaming APIs, Kafka, Flink integration | Streaming: processing real-time continuous data. Kafka/Flink: tools for data pipelines. Native: built into Ignite. |
| Tooling & Monitoring | GFSH, JMX, Spring Data Geode | SQLLine, Control Center, Prometheus, Spring Data Ignite | GFSH: Geode CLI. JMX: Java monitoring. SQLLine: SQL CLI. Prometheus: metric collector. Control Center: Ignite’s UI. |
| Language & API Support | Java, C++, .NET, REST | Java, C++, .NET, Python, Node.js, JDBC, ODBC | API: interface for developers. JDBC/ODBC: SQL access via tools. REST: HTTP-based interface. |
| Use Cases | High-speed caching, session replication | HTAP, distributed SQL, real-time analytics, ML | HTAP: Hybrid Transactional + Analytical Processing. Caching: fast access to frequently used data. |
| System of Record | ❌ Not suitable as the main database | ✅ Can act as a full persistent database | System of record: the authoritative source of data that applications trust. |
| Learning Curve | Steep (OQL + custom configs) | Moderate (SQL + familiar tools) | Learning curve: how easy the system is for new developers to learn and use. |

Apache Pinot vs Apache Druid

| Difference | Apache Pinot | Apache Druid | Why It Matters |
| --- | --- | --- | --- |
| Query Speed & Latency | Designed for ultra-low latency (sub-100 ms) even at high concurrency | Fast, but can be slower than Pinot for high concurrency or complex queries | 🟢 If you need instant results for user-facing apps (like dashboards), Pinot is better. |
| Join Support | Full support for SQL joins (inner, outer, etc.) | Only supports small lookup joins (no full joins) | 🔄 Pinot can combine data from different sources like a normal SQL DB; Druid cannot. |
| Data Update (Upserts) | Supports upserts (update + insert) natively | Does not support upserts (only appends data) | ✍️ Pinot is more flexible for data corrections or updates. Druid is append-only. |
| Indexing | Advanced indexes: star-tree, bloom, text, geo, vector | Basic indexes: bitmap, dictionary, no star-tree | ⚡ Pinot can speed up different types of queries (e.g. group-by, geo, search). |
| SQL Support | Full ANSI SQL support using Apache Calcite | Limited SQL support; JSON queries needed for complex logic | 📊 Pinot feels like a modern database. Druid has its own query-language quirks. |
| Use Case Focus | Real-time analytics for external users (e.g., LinkedIn, Uber dashboards) | Time-series metrics, logs, internal dashboards | 🎯 Pinot = great for external user-facing dashboards; Druid = better for internal monitoring. |
| Streaming Data Handling | Real-time ingestion + upserts | Real-time ingestion only (no updates) | Pinot fits better for dynamic data pipelines where freshness and updates matter. |

Key Features of Avro

| Feature | Description |
| --- | --- |
| Storage Type | Row-based (stores one full record at a time) |
| Schema Support | Schema is written in JSON, but data is stored in binary |
| Schema Evolution | ✅ Supports changes like adding/removing fields |
| Compression | Compact binary format (smaller than JSON) |
| Language Support | Works with many languages like Python, Java, C++, etc. |

🔶 How It Works
1. Define a Schema in JSON (e.g., a user has name and age).
2. Serialize the data using that schema into compact binary format.
3. Deserialize the binary data later using the same (or updated) schema.
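A minimal sketch of that flow using the fastavro library (one of several Avro implementations for Python); the schema and records are illustrative placeholders.

```python
# Minimal sketch of Avro's schema-on-write flow: serialize records to a compact
# binary container, then read them back with the same (or a compatible) schema.
from io import BytesIO
from fastavro import writer, reader, parse_schema

schema = parse_schema({
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age",  "type": "int"},
    ],
})

records = [{"name": "Alice", "age": 25}, {"name": "Bob", "age": 30}]

# Serialize: compact binary container with the schema embedded in the header.
buf = BytesIO()
writer(buf, schema, records)

# Deserialize later using the schema carried in the container.
buf.seek(0)
for record in reader(buf):
    print(record)
```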

🔶 Why Use Avro?


| Use Case | Reason |
| --- | --- |
| Kafka messaging | Sends small and fast messages with schema |
| ETL pipelines | Efficient writing of row-based data |
| Microservices | Communicate between systems using compact data |
| Schema changes | Easily add/remove fields without breaking code |

🔶 Advantages
✅ Fast and compact
✅ Handles schema changes well
✅ Works across many languages
✅ Great with streaming tools like Kafka

🔶 Limitations
❌ Not human-readable (it's binary)
❌ Not ideal for analytical queries (use Parquet or ORC instead)
❌ Requires schema management

What is Apache Parquet?


Apache Parquet is a column-based file format used to store large amounts of data efficiently. It’s designed mainly
for analytics and reporting, not for frequent updates or real-time use.
It’s widely used in tools like:
 Apache Spark
 Apache Hive
 Amazon Athena
 Google BigQuery
 Power BI (via connectors)

🔶 Key Features

| Feature | Description |
| --- | --- |
| Storage Type | Columnar → data is stored by columns, not rows |
| Compression | Supports high compression (like Snappy, Gzip) to reduce file size |
| Read Speed | Very fast, especially when reading only a few columns |
| Write Speed | Slower than row-based formats, because it arranges columns separately |
| Supports Nesting | Yes → can store complex data (like arrays, maps, structures) |
| Splittable | Yes → large files can be split for parallel processing |

🔶 How Parquet Works


1. Data is stored column by column, not row by row.
2. When you query, only specific columns are read (not full rows).
3. It uses compression and encoding for space and performance.
4. Works very well with big data tools.
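A minimal sketch of Parquet's columnar read path with pyarrow: write a table once, then read back only the columns a query needs. The file path and column names are illustrative.

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "id":   [1, 2, 3],
    "name": ["Alice", "Bob", "Carol"],
    "age":  [25, 30, 28],
})

# Write with Snappy compression (a common default for analytics workloads).
pq.write_table(table, "users.parquet", compression="snappy")

# Column pruning: only the 'age' column is read from disk.
ages = pq.read_table("users.parquet", columns=["age"])
print(ages.column("age").to_pylist())   # [25, 30, 28]
```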

🔶 When Should You Use Parquet?


| Use Case | Why It Works Well |
| --- | --- |
| Big data analytics | Fast to read huge amounts of data for analysis |
| Data warehousing | Efficient for tools that scan only a few columns |
| BI dashboards | Reduces memory by loading only the necessary fields |
| Batch processing | Ideal when writing once and reading many times |

🔶 Advantages of Apache Avro


✅ Compact Binary Format – Smaller than JSON, faster to serialize/deserialize
✅ Rich Data Types – Supports nested records, arrays, maps, enums, etc.
✅ Efficient Schema Evolution – Handles changing schemas with minimal overhead
✅ Tooling Support – Well-integrated with Kafka, Hive, Spark, NiFi, etc.
✅ Splittable – Works well with Hadoop MapReduce and parallel file readers

🔶 Limitations
❌ Not human-readable (binary) – requires tools to inspect
❌ Not as query-efficient as columnar formats like Parquet or ORC
❌ Slower for read-heavy analytics workflows

What is Apache ORC?


Apache ORC (Optimized Row Columnar) is a columnar file format designed specifically for Apache Hive and the
Hadoop ecosystem. It’s built to store, compress, and process large-scale data efficiently.

🔶 Key Features of ORC


| Feature | Description |
| --- | --- |
| Format Type | Column-based storage |
| Best Use Case | Hive-based data warehouses and analytics workloads |
| Compression | Very high compression (supports Zlib, LZO, Snappy) |
| Query Performance | Very fast – only the relevant columns are read |
| Schema Evolution | Limited support (not as flexible as Avro) |
| Indexing | Stores lightweight indexes with each stripe for faster access |

🔶 ORC File Structure


An ORC file is structured like this:
 Stripes: The file is divided into stripes, each containing row groups.
 Row Indexes: Allow fast lookup of specific data.
 Footer: Stores metadata like schema, compression type, etc.
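A minimal sketch of writing and reading ORC from Python, assuming a recent pyarrow version that ships the orc module (Hive itself writes ORC natively via STORED AS ORC); paths and columns are placeholders.

```python
import pyarrow as pa
from pyarrow import orc

table = pa.table({
    "id":  [1, 2, 3],
    "age": [25, 30, 28],
})

orc.write_table(table, "users.orc")

# Column projection: only the 'age' data within each stripe is decoded.
subset = orc.ORCFile("users.orc").read(columns=["age"])
print(subset.column("age").to_pylist())   # [25, 30, 28]
```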

🔶 When is ORC Most Useful?


| Scenario | Why ORC is Better |
| --- | --- |
| Used with Hive | Native support + faster queries |
| Analytical queries on large tables | High compression and efficient column reads |
| Read-heavy workloads | Optimized for selective column access |
| Hadoop processing | Efficient for distributed/parallel processing |

What is Apache Arrow?


Apache Arrow is an in-memory columnar data format designed for high-performance data analytics. Unlike file
formats like Parquet or ORC (used for storing data on disk), Arrow is optimized for processing data in memory,
enabling fast analytics, cross-language data exchange, and zero-copy reads.
It's widely used in tools like Pandas, Dask, Spark, DuckDB, and languages like Python, R, C++, and Java.

🔶 Key Features of Apache Arrow


| Feature | Description |
| --- | --- |
| Format Type | Columnar (in-memory) |
| Storage Medium | RAM (not disk) |
| Primary Use | Fast data processing & analytics |
| Language Interoperability | Supports data sharing between Python, R, C++, Java, etc. without serialization |
| Zero-Copy Reads | No data copying – data is accessed directly in memory |
| Highly Optimized | Built for SIMD & CPU cache efficiency |

🔶 Why Apache Arrow?


Apache Arrow solves a major problem in data science and analytics: slow and inefficient data exchange between
tools.

Normally, when moving data between systems (e.g., Pandas ➡️ Spark ➡️ R), the data needs to be
serialized/deserialized. Arrow eliminates that step by using a standardized memory format, allowing all tools to
directly access the same data without conversion.

🔶 How Arrow Works (Simplified)


 Data is stored in columnar format in memory.
 Each column is tightly packed and aligned for fast access.
 Supports vectorized processing (SIMD) – e.g., apply operations to entire arrays at once.
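A minimal pyarrow sketch of the in-memory columnar model: columns are typed, contiguous arrays, computations are vectorized over whole columns, and handoff to other tools (e.g. pandas) can avoid copies for compatible types.

```python
import pyarrow as pa
import pyarrow.compute as pc

table = pa.table({
    "name": ["Alice", "Bob", "Carol"],
    "age":  [25, 30, 28],
})

# Vectorized, columnar computation over the whole 'age' array at once.
print(pc.mean(table.column("age")).as_py())   # 27.666...

# Hand the same in-memory columns to pandas for further analysis.
df = table.to_pandas()
print(df.head())
```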

| Feature | Apache ORC | Apache Parquet | Apache Avro | Apache Arrow |
| --- | --- | --- | --- | --- |
| Storage Format Type | Columnar | Columnar | Row-based | Columnar (in-memory) |
| Best Use Case | Hive, Hadoop analytics | Spark, Presto, Hive, Athena | Kafka, streaming data, serialization | In-memory data processing & sharing across systems |
| Compression Efficiency | 🔥 Excellent (lightweight indexing + compression) | ✅ Great | ⚠️ Moderate | ❌ Not compressed (in-memory) |
| Read Performance | 🔥 High (column pruning + indexes) | ✅ High | ❌ Slower due to row-wise reads | 🔥 Extremely fast (zero-copy reads) |
| Write Performance | Moderate | Moderate | ✅ Very fast | ✅ Fast (writes in memory, not disk) |
| Schema Evolution Support | ⚠️ Limited | ⚠️ Partial | ✅ Full support | ❌ Not designed for schema evolution |
| Splittable | ✅ Yes | ✅ Yes | ✅ Yes (efficient for parallel processing) | ❌ No (in-memory structure) |
| Interoperability | Hive-centric | Wide (Spark, Hive, Presto, etc.) | Great for streaming platforms (Kafka, Flink) | Designed for interoperability between systems (Pandas, R, Spark, etc.) |
| Row-level Access | ❌ Not ideal | ❌ Not ideal | ✅ Yes | ✅ Yes (in-memory access) |
| Column-level Access | ✅ Optimized | ✅ Optimized | ❌ Not available | ✅ Available (in memory) |
| Data Type Support | Rich (including complex & nested types) | Rich (including complex & nested types) | Limited nested types | Rich types & vectors for machine learning, analytics |
| Usage Focus | Storage + batch analytics | Storage + batch analytics | Serialization + messaging | In-memory analytics + interop + zero-copy |
