Unit 3: Analyzing Data with Hadoop
1. Data Formats
Before big data can be analyzed, it must be stored in a suitable format. Common formats include:
- Text Files: Simple and human-readable, but inefficient for large datasets.
- Sequence Files: Binary files storing sequences of key-value pairs (see the sketch after this list).
- Avro: Row-based storage format used for serializing data.
- Parquet: Column-based storage ideal for analytical queries.
Example: A retail store might store customer purchase history in Parquet format for faster analysis.
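As a minimal sketch of the Sequence File format, the following Java snippet writes a few key-value records. It assumes the Hadoop client libraries are on the classpath; the output path and the (customer, purchase count) record layout are purely illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("purchases.seq"); // hypothetical output path

        // Append (customerId, purchaseCount) pairs as binary key-value records
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(IntWritable.class))) {
            writer.append(new Text("customer-1001"), new IntWritable(3));
            writer.append(new Text("customer-1002"), new IntWritable(7));
        }
    }
}
```

Because records are binary and appended sequentially, Sequence Files work well as intermediate MapReduce output.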
2. Analyzing Data with Hadoop
Hadoop is designed for large-scale data analysis. The key concepts are:
- Scaling Out: Uses many inexpensive machines (nodes) to process data in parallel.
- Hadoop Streaming: Lets you write MapReduce programs in any language that reads standard input and writes standard output, such as Python or Perl.
- Hadoop Pipes: A C++ interface for Hadoop MapReduce.
Example: Processing log files from thousands of users with Hadoop Streaming or native MapReduce (a minimal Java job is sketched below).
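To make the map and reduce steps concrete, here is a minimal word-count job, the canonical MapReduce example, in Java. The class names and whitespace tokenizer are illustrative; input and output paths come from the command line.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map: emit (word, 1) for every token in the input line
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    ctx.write(word, ONE);
                }
            }
        }
    }

    // Reduce: sum the counts emitted for each word
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Note that the combiner reuses the reducer class to pre-aggregate counts on each map node, which cuts the volume of data shuffled to the reducers.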
3. Hadoop Distributed File System (HDFS)
HDFS is the storage layer of Hadoop. It splits large files into blocks and distributes them across the cluster:
- Blocks: Default block size is 128 MB (often raised to 256 MB in practice).
- Replication: Each block is copied to multiple nodes (default replication factor = 3).
- Java Interface: APIs for interacting with HDFS programmatically (a short example follows this list).
- Data Flow (MapReduce): Input -> Split -> Map -> Shuffle & Sort -> Reduce -> Output.
Example: A 500 MB file is split into four blocks (three of 128 MB plus one of 116 MB), each replicated to three nodes for fault tolerance.
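A short sketch of the Java interface: the program below opens a file through Hadoop's FileSystem abstraction and copies its contents to standard output. The HDFS URI is hypothetical; given a URI with no scheme, the same code reads from the local filesystem.

```java
import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
    public static void main(String[] args) throws Exception {
        String uri = args[0]; // e.g. hdfs://namenode:8020/data/file.txt (hypothetical)
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        try (InputStream in = fs.open(new Path(uri))) {
            // Copy the file's contents to stdout, 4 KB at a time
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}
```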
4. Hadoop I/O
Hadoop I/O handles serialization, compression, and data integrity:
- Data Integrity: Uses checksums to detect corruption.
- Compression: Reduces disk space (e.g., Snappy, Gzip).
- Serialization: Converts data into a byte stream for storage or transmission.
- Writable Interface: Hadoop's native mechanism for custom serialization (sketched below).
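As a minimal sketch of the Writable interface, the illustrative record type below pairs a product name with a count. The key contract is that readFields must deserialize fields in exactly the order that write serialized them.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

// Illustrative custom value type: a product name paired with a quantity
public class ProductCountWritable implements Writable {
    private final Text product = new Text();
    private final IntWritable count = new IntWritable();

    public void set(String p, int c) { product.set(p); count.set(c); }

    @Override
    public void write(DataOutput out) throws IOException {
        // Serialize fields in a fixed order
        product.write(out);
        count.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Deserialize in the same order they were written
        product.readFields(in);
        count.readFields(in);
    }
}
```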
5. Avro
Apache Avro is a framework for data serialization:
- File-Based Data Structures: Avro container files store the schema alongside the data, making them self-describing.
- Schema Evolution: Supports forward/backward compatibility.
Use Cases: data pipelines, Kafka message schemas, and inter-service communication (a minimal write example follows).
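A minimal sketch of writing an Avro container file in Java, assuming the Apache Avro library is on the classpath; the Purchase schema and file name are illustrative. The schema is embedded in the file header, which is what makes Avro files self-describing.

```java
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroWriteDemo {
    public static void main(String[] args) throws Exception {
        // Define the schema in JSON; it is written into the output file's header
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Purchase\",\"fields\":["
          + "{\"name\":\"customer\",\"type\":\"string\"},"
          + "{\"name\":\"amount\",\"type\":\"double\"}]}");

        // Build a record without generated classes, using the generic API
        GenericRecord rec = new GenericData.Record(schema);
        rec.put("customer", "customer-1001");
        rec.put("amount", 42.5);

        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, new File("purchases.avro"));
            writer.append(rec);
        }
    }
}
```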
Summary Table
| Topic | Key Points |
| --- | --- |
| Data Formats | Text, Sequence Files, Avro, Parquet |
| Hadoop Analysis | Scaling out, Streaming, Pipes |
| HDFS | Blocks, Replication, Java interface |
| Hadoop I/O | Compression, Serialization, Data integrity |
| Avro | Self-describing files, schema evolution |