Detailed Exam Notes: Big Data and Hadoop
Unit I: Introduction to Big Data and Hadoop
1. Big Data Analytics:
- Big Data refers to datasets that are too large or complex to process using traditional methods.
- Big Data Analytics involves analyzing such datasets to uncover hidden patterns, correlations, and insights.
- Types of Data:
- Structured: Tabular data with rows and columns (e.g., databases).
- Semi-structured: Data with some structure, like JSON or XML.
- Unstructured: Data with no predefined format (e.g., images, videos, emails).
2. History of Hadoop:
- Hadoop was inspired by Google's MapReduce and GFS (Google File System).
- Doug Cutting and Mike Cafarella created Hadoop (it grew out of the Apache Nutch web-crawler project), and it became an open-source Apache framework.
- Yahoo played a major role in the development and adoption of Hadoop.
3. Hadoop Ecosystem:
- Comprises tools that work together to process and analyze Big Data.
- Core components: HDFS (storage), MapReduce (processing), and YARN (resource management, from Hadoop 2 onward).
- Supporting tools: Hive, Pig, HBase, Sqoop, Flume, Oozie, and Zookeeper.
4. IBM Big Data Strategy:
- IBM InfoSphere BigInsights integrates Hadoop into enterprise environments for better data management.
- Provides advanced tools like text analytics, machine learning, and enterprise-grade security.
Unit II: HDFS (Hadoop Distributed File System)
1. HDFS Concepts:
- HDFS is a distributed storage system designed to store very large datasets across multiple nodes.
- Data is divided into blocks (default size: 128 MB in Hadoop 2.x and later) and stored across a cluster of nodes.
- Features include fault tolerance (each block is replicated, by default three times), high throughput, and scalability; a minimal client sketch follows below.
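The sketch below shows how a Java client could write and read a small file in HDFS through the org.apache.hadoop.fs.FileSystem API. It is a minimal sketch, assuming the Hadoop client libraries are on the classpath; the NameNode address and file path are hypothetical.

    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsClientSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical NameNode address; in practice this comes from core-site.xml.
            conf.set("fs.defaultFS", "hdfs://namenode:9000");
            FileSystem fs = FileSystem.get(conf);

            Path path = new Path("/user/demo/notes.txt");  // hypothetical path

            // Write a small file; HDFS splits large files into blocks behind the scenes.
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }

            // Read the file back.
            try (FSDataInputStream in = fs.open(path)) {
                byte[] buf = new byte[1024];
                int n = in.read(buf);
                System.out.println(new String(buf, 0, n, StandardCharsets.UTF_8));
            }

            fs.close();
        }
    }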
2. Data Ingestion:
- Flume: Used for collecting, aggregating, and moving large amounts of log data into HDFS.
- Sqoop: Transfers data between HDFS and relational databases like MySQL.
3. Hadoop I/O:
- Compression: Reduces data size to save storage and cut disk and network I/O; common codecs include gzip, bzip2, and Snappy.
- Serialization: Converts data into a format that can be stored or transmitted; Hadoop's native mechanism is the Writable interface (see the sketch below), with Avro and Thrift as language-neutral alternatives.
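Below is a minimal sketch of a custom Writable, Hadoop's native serialization interface. The record type and field names (a page-view event with a URL and a hit count) are made up for illustration.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Writable;

    // Hypothetical record type: a page-view event with a URL and a hit count.
    public class PageViewWritable implements Writable {
        private String url;
        private long hits;

        public PageViewWritable() { }                 // no-arg constructor required by Hadoop

        public PageViewWritable(String url, long hits) {
            this.url = url;
            this.hits = hits;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeUTF(url);    // serialize fields in a fixed order
            out.writeLong(hits);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            url = in.readUTF();   // deserialize in the same order
            hits = in.readLong();
        }

        @Override
        public String toString() {
            return url + "\t" + hits;
        }
    }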
Unit III: MapReduce
1. Anatomy of MapReduce Job:
- The input data is split into smaller chunks (input splits), with one map task per split.
- Each Mapper processes its split in parallel and emits intermediate key-value pairs.
- Each Reducer aggregates the values grouped by key and writes the final output (see the WordCount sketch below).
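The classic WordCount job illustrates this flow. The sketch below uses the standard org.apache.hadoop.mapreduce API; the driver class (job configuration and submission) is omitted for brevity.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

        // Map phase: emit (word, 1) for every word in a line of input.
        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce phase: sum the counts for each word after shuffle and sort.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }
    }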
2. Shuffle and Sort:
- Mapper outputs are partitioned by key, copied to the reducers (shuffle), and merged and sorted by key before reduction.
3. Job Scheduling:
- Decides how cluster resources are shared among competing jobs so that tasks are executed efficiently.
- Types of schedulers: FIFO (First In First Out), Fair Scheduler, Capacity Scheduler.
Unit IV: Hadoop Ecosystem Tools
1. Pig:
- High-level scripting platform for data transformation and analysis.
- Uses Pig Latin, a dataflow language that is far more concise than equivalent Java MapReduce code.
- Suitable for tasks like ETL (Extract, Transform, Load); see the embedded-Pig sketch below.
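Pig Latin scripts can be run from the Grunt shell, as standalone scripts, or embedded in Java. The sketch below embeds a tiny ETL-style script through the PigServer API; the file names, field names, and filter condition are assumptions made for illustration.

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigSketch {
        public static void main(String[] args) throws Exception {
            // Local mode for illustration; ExecType.MAPREDUCE would run on the cluster.
            PigServer pig = new PigServer(ExecType.LOCAL);

            // A hypothetical ETL-style script: load, filter, and store.
            pig.registerQuery("logs = LOAD 'access_log.txt' USING PigStorage(' ') "
                    + "AS (ip:chararray, url:chararray, status:int);");
            pig.registerQuery("errors = FILTER logs BY status >= 500;");
            pig.store("errors", "error_logs");   // writes the filtered relation to 'error_logs'

            pig.shutdown();
        }
    }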
2. Hive:
- A data warehouse infrastructure on top of Hadoop.
- HiveQL allows querying data using an SQL-like syntax.
- Stores table data in HDFS and translates queries into distributed jobs (classically MapReduce) for large-scale analysis; see the JDBC sketch below.
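HiveQL can be run from the Hive CLI/Beeline or from Java through the HiveServer2 JDBC driver. In the sketch below the connection URL, credentials, table name, and query are all hypothetical.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQuerySketch {
        public static void main(String[] args) throws Exception {
            // Explicitly register the HiveServer2 JDBC driver.
            Class.forName("org.apache.hive.jdbc.HiveDriver");

            // Host, port, and database below are assumptions.
            String url = "jdbc:hive2://localhost:10000/default";

            try (Connection conn = DriverManager.getConnection(url, "hive", "");
                 Statement stmt = conn.createStatement()) {

                // A hypothetical aggregation over a 'page_views' table stored in HDFS.
                ResultSet rs = stmt.executeQuery(
                        "SELECT url, COUNT(*) AS hits FROM page_views GROUP BY url");

                while (rs.next()) {
                    System.out.println(rs.getString("url") + "\t" + rs.getLong("hits"));
                }
            }
        }
    }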
3. HBase:
- A NoSQL database built on top of HDFS for real-time, random read/write access.
- Stores data in a column-family (wide-column) format, which makes point lookups on very large, sparse tables far more practical than in a typical RDBMS; see the client sketch below.
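The sketch below uses the standard HBase Java client to write and read a single cell. The table name, column family, qualifier, and values are hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml

            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("users"))) {

                // Write one cell: row key "user1", column family "info", qualifier "city".
                Put put = new Put(Bytes.toBytes("user1"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"), Bytes.toBytes("Pune"));
                table.put(put);

                // Read it back by row key (random, real-time access).
                Result result = table.get(new Get(Bytes.toBytes("user1")));
                byte[] city = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"));
                System.out.println("city = " + Bytes.toString(city));
            }
        }
    }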
Unit V: Data Analytics with R and Machine Learning
1. Supervised Learning:
- Models are trained using labeled data (input-output pairs).
- Examples: Regression (predicting numeric values) and Classification (assigning categories); a small regression example follows below.
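As a worked example of regression, the snippet below fits a simple least-squares line y = a + b*x to a few made-up points. It is written in plain Java rather than R, purely to illustrate the idea of learning from labeled input-output pairs.

    public class LinearRegressionSketch {
        public static void main(String[] args) {
            // Hypothetical labeled data: inputs x with known outputs y.
            double[] x = {1, 2, 3, 4, 5};
            double[] y = {2.1, 3.9, 6.2, 8.1, 9.8};
            int n = x.length;

            double meanX = 0, meanY = 0;
            for (int i = 0; i < n; i++) { meanX += x[i]; meanY += y[i]; }
            meanX /= n;
            meanY /= n;

            // Least squares: slope b = cov(x, y) / var(x), intercept a = meanY - b * meanX.
            double cov = 0, var = 0;
            for (int i = 0; i < n; i++) {
                cov += (x[i] - meanX) * (y[i] - meanY);
                var += (x[i] - meanX) * (x[i] - meanX);
            }
            double b = cov / var;
            double a = meanY - b * meanX;

            System.out.printf("y = %.2f + %.2f * x%n", a, b);
            System.out.printf("prediction for x = 6: %.2f%n", a + b * 6);
        }
    }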
2. Unsupervised Learning:
- Works on unlabeled data to identify patterns and relationships.
- Examples: Clustering (grouping similar items; see the k-means sketch below) and Dimensionality Reduction.
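The sketch below runs a minimal k-means clustering on one-dimensional data with k = 2. Again it is plain Java rather than R, and the data points and initial centroids are made up for illustration.

    import java.util.Arrays;

    public class KMeansSketch {
        public static void main(String[] args) {
            double[] points = {1.0, 1.5, 2.0, 8.0, 9.0, 9.5};              // hypothetical data
            double[] centroids = {points[0], points[points.length - 1]};   // initial guesses
            int[] assignment = new int[points.length];

            for (int iter = 0; iter < 10; iter++) {
                // Assignment step: attach each point to the nearest centroid.
                for (int i = 0; i < points.length; i++) {
                    assignment[i] = Math.abs(points[i] - centroids[0])
                            <= Math.abs(points[i] - centroids[1]) ? 0 : 1;
                }
                // Update step: move each centroid to the mean of its assigned points.
                for (int c = 0; c < 2; c++) {
                    double sum = 0;
                    int count = 0;
                    for (int i = 0; i < points.length; i++) {
                        if (assignment[i] == c) { sum += points[i]; count++; }
                    }
                    if (count > 0) centroids[c] = sum / count;
                }
            }
            System.out.println("Centroids: " + Arrays.toString(centroids));
            System.out.println("Assignments: " + Arrays.toString(assignment));
        }
    }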
3. Collaborative Filtering:
- Used in recommender systems (e.g., Amazon, Netflix).
- Suggests relevant items based on user behavior (user-based) or item similarity (item-based); a similarity sketch follows below.
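A common building block of item-based collaborative filtering is the cosine similarity between the rating vectors of two items. The ratings below are made up; 0 stands for "not rated".

    public class ItemSimilaritySketch {
        // Cosine similarity between two rating vectors.
        static double cosine(double[] a, double[] b) {
            double dot = 0, normA = 0, normB = 0;
            for (int i = 0; i < a.length; i++) {
                dot += a[i] * b[i];
                normA += a[i] * a[i];
                normB += b[i] * b[i];
            }
            return dot / (Math.sqrt(normA) * Math.sqrt(normB));
        }

        public static void main(String[] args) {
            // Hypothetical ratings from five users for two items.
            double[] itemA = {5, 3, 0, 4, 4};
            double[] itemB = {4, 3, 0, 5, 5};
            System.out.printf("similarity(A, B) = %.3f%n", cosine(itemA, itemB));
        }
    }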