
Unit 3: Analyzing Data with Hadoop

1. Data Format

Before big data can be analyzed, it must be stored in a suitable format. Common formats include:

- Text Files: Simple and human-readable, but inefficient for large datasets.

- Sequence Files: Binary files storing sequences of key-value pairs.

- Avro: Row-based storage format used for serializing data.

- Parquet: Column-based storage ideal for analytical queries.

Example: A retail store might store customer purchase history in Parquet format for faster analysis.
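
For illustration, here is a minimal Java sketch of the Sequence File format: it writes a few key-value records with Hadoop's SequenceFile API. The output file name and the record contents are made up for the example.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("purchases.seq");   // hypothetical output file

        // The writer stores key-value pairs in Hadoop's binary SequenceFile format
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(IntWritable.class),
                SequenceFile.Writer.valueClass(Text.class))) {
            writer.append(new IntWritable(1), new Text("customer-42,order-1001"));
            writer.append(new IntWritable(2), new Text("customer-17,order-1002"));
        }
    }
}
```

The resulting binary file is splittable and compact, which is why sequence files are preferred over plain text for intermediate MapReduce data.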

2. Analyzing Data with Hadoop

Hadoop is used for large-scale data analysis. The key concepts are:

- Scaling Out: Uses many inexpensive machines (nodes) to process data in parallel.

- Hadoop Streaming: Lets you write MapReduce programs in any language that can read standard input and write standard output, such as Python or Perl.

- Hadoop Pipes: A C++ interface for Hadoop MapReduce.

Example: Processing log files from thousands of users using Hadoop Streaming and MapReduce.
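
Hadoop Streaming runs mapper and reducer scripts written in other languages; the same MapReduce model expressed in Hadoop's native Java API looks roughly like the sketch below, which counts log lines per user. The class names, the comma-separated log layout, and the input/output paths are assumptions for the example.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class UserLogCount {

    // Mapper: emits (userId, 1) for each log line; assumes userId is the first field
    public static class LogMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            if (fields.length > 0) {
                context.write(new Text(fields[0]), ONE);
            }
        }
    }

    // Reducer: sums the counts emitted for each user
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "user log count");
        job.setJarByClass(UserLogCount.class);
        job.setMapperClass(LogMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Between the map and reduce phases, Hadoop shuffles and sorts the (userId, 1) pairs by key, which is the data flow listed under HDFS below.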

3. Hadoop Distributed File System (HDFS)

HDFS is the storage layer of Hadoop. It splits large files into blocks and distributes them across the cluster:

- Blocks: Default block size is 128 MB (configurable; 256 MB is also common).

- Replication: Each block is copied across nodes (default replication factor = 3).

- Java Interface: Java APIs (such as the FileSystem class) for interacting with HDFS.

- Data Flow: Input -> Split -> Map -> Shuffle & Sort -> Reduce -> Output.

Example: A 500 MB file is split into four blocks (three of 128 MB and one of 116 MB), and each block is replicated across nodes for fault tolerance.
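
Below is a minimal sketch of the Java interface to HDFS: it uses the FileSystem API to inspect a file's block size and replication factor and then streams the contents to standard output. The HDFS path is hypothetical.

```java
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();          // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/data/sales/big-file.dat");  // hypothetical HDFS path
        FileStatus status = fs.getFileStatus(path);
        System.out.println("Block size: " + status.getBlockSize()
                + ", replication: " + status.getReplication());

        // Stream the file contents from HDFS to stdout
        try (InputStream in = fs.open(path)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}
```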

4. Hadoop I/O

Handles serialization, compression, and data integrity:

- Data Integrity: Uses checksums to detect data corruption.


- Compression: Reduces disk space (e.g., Snappy, Gzip).

- Serialization: Converts data into a byte stream for storage or transmission over the network.

- Writable Interface: Custom serialization in Hadoop.
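
As a sketch of the Writable interface, here is a hypothetical custom type (the PurchaseWritable name and its fields are invented for the example) that implements Hadoop's serialization contract:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Hypothetical record type showing Hadoop's Writable serialization contract
public class PurchaseWritable implements Writable {
    private long customerId;
    private double amount;

    public PurchaseWritable() { }                          // no-arg constructor required by Hadoop

    public PurchaseWritable(long customerId, double amount) {
        this.customerId = customerId;
        this.amount = amount;
    }

    @Override
    public void write(DataOutput out) throws IOException {    // serialize fields in a fixed order
        out.writeLong(customerId);
        out.writeDouble(amount);
    }

    @Override
    public void readFields(DataInput in) throws IOException { // deserialize in the same order
        customerId = in.readLong();
        amount = in.readDouble();
    }

    public long getCustomerId() { return customerId; }
    public double getAmount() { return amount; }
}
```

A type like this can be used as a MapReduce value; to use it as a key it would also need to be comparable so that keys can be sorted during the shuffle.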

5. Avro

Apache Avro is a language-neutral framework for data serialization:

- File-Based Data Structures: Avro data files store the schema together with the data, making them self-describing.

- Schema Evolution: Supports forward/backward compatibility.

Use Cases: Data pipelines, Kafka messaging, and communication between services.
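
Here is a small sketch of Avro serialization in Java using a generic record: the schema is embedded in the data file, which is what makes Avro files self-describing. The Purchase schema and the file name are made up for the example.

```java
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroDemo {
    // Hypothetical schema; it is written into the file header alongside the records
    private static final String SCHEMA_JSON =
        "{\"type\":\"record\",\"name\":\"Purchase\",\"fields\":["
      + "{\"name\":\"customerId\",\"type\":\"long\"},"
      + "{\"name\":\"amount\",\"type\":\"double\"}]}";

    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);
        File file = new File("purchases.avro");

        // Write a record; the schema is embedded in the file (self-describing)
        GenericRecord rec = new GenericData.Record(schema);
        rec.put("customerId", 42L);
        rec.put("amount", 19.99);
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, file);
            writer.append(rec);
        }

        // Read the records back; the reader picks up the writer's schema from the file
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<GenericRecord>(file, new GenericDatumReader<GenericRecord>())) {
            for (GenericRecord r : reader) {
                System.out.println(r);
            }
        }
    }
}
```

Because the reader resolves the writer's schema against its own, fields can be added or removed over time, which is the schema evolution property noted above.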

Summary Table

Topic           | Key Points
----------------|-------------------------------------------
Data Formats    | Text, Sequence, Avro, Parquet
Hadoop Analysis | Scaling, Streaming, Pipes
HDFS            | Blocks, Replication, Java Interface
Hadoop I/O      | Compression, Serialization, Data Integrity
Avro            | Self-describing, schema evolution
