
Big Data Unit-1


DISTRIBUTED FILE SYSTEM (DFS) AND BIG DATA

 A Distributed File System (e.g., HDFS in Hadoop) is a storage system that enables
large-scale data storage and processing across a network of multiple computers.
 It divides large datasets into smaller chunks and distributes them across multiple nodes in
a cluster.

Features:

o Fault Tolerance: Data is replicated across nodes, ensuring reliability even if some nodes fail.
o Scalability: Easily scales to handle petabytes or exabytes of data by adding more
nodes.
o Parallel Processing: Enables faster processing as data can be accessed and
processed simultaneously on multiple nodes.

IMPORTANCE OF BIG DATA

 Big Data refers to massive volumes of structured, unstructured, or semi-structured data that traditional systems cannot efficiently process.
 It is crucial for:
o Enhanced Decision-Making: Provides insights through analytics.
o Predictive Analysis: Facilitates forecasting trends and behaviors.
o Innovation: Drives the development of new products and services.
o Efficiency: Optimizes operations and reduces costs.

FOUR V's OF BIG DATA

1. Volume: Refers to the vast amount of data generated every second (e.g., social media
posts, IoT sensors, transaction logs).
2. Velocity: The speed at which data is generated and processed in real time or near real-
time.
3. Variety: The diverse types of data, including structured (databases), semi-structured
(JSON, XML), and unstructured (text, images, videos).
4. Veracity: The trustworthiness and quality of the data, which can be impacted by
inaccuracies or inconsistencies.

DRIVERS FOR BIG DATA

1. Technological Advancements:
o IoT devices generating continuous streams of data.
o Cloud computing offering scalable storage and processing.
2. Business Needs:
o Need for data-driven insights to gain a competitive edge.
o Real-time customer engagement and personalized experiences.
3. Data Explosion:
o Increase in data sources such as social media, mobile applications, and sensors.
4. Regulatory Requirements:
o Compliance with data retention and analysis regulations.
5. Cost Reduction:
o Affordable cloud storage and distributed processing tools like Hadoop and Spark.

BIG DATA ANALYTICS

 The process of examining large datasets to uncover patterns, correlations, and actionable
insights.
 Types of Big Data Analytics:
o Descriptive Analytics: Summarizes past data to understand what happened.
o Predictive Analytics: Uses historical data to predict future outcomes.
o Prescriptive Analytics: Provides recommendations for decision-making.
o Diagnostic Analytics: Identifies causes behind trends or anomalies.
 Tools for Big Data Analytics:
o Apache Hadoop: Distributed storage and processing framework.
o Apache Spark: Fast in-memory analytics engine.
o Tableau, Power BI: Visualization and reporting tools.

BIG DATA APPLICATIONS

1. Healthcare:
o Predicting disease outbreaks and personalizing treatments.
o Analyzing patient records for preventive care.
2. Finance:
o Fraud detection using transaction patterns.
o Risk management and algorithmic trading.
3. Retail:
o Enhancing customer experience with personalized recommendations.
o Inventory and supply chain optimization.
4. Manufacturing:
o Predictive maintenance using IoT data.
o Improving production efficiency and quality control.
5. Transportation:
o Real-time traffic management and route optimization.
o Predictive analytics for vehicle maintenance.
6. Entertainment:
o Content recommendation engines (e.g., Netflix, Spotify).
o Social media trend analysis.
7. Energy:
o Smart grids for efficient energy distribution.
o Analyzing consumption patterns for energy-saving initiatives.
MAPREDUCE OVERVIEW
 MapReduce is a programming model for processing large-scale data in a distributed
environment, developed by Google.
 It has two main phases:
1. Map Phase: Processes input data and transforms it into key-value pairs.
2. Reduce Phase: Aggregates the key-value pairs based on keys and produces the
output.
 Features of MapReduce:
o Scalability and fault tolerance.
o Parallel processing of large datasets.
o Simplifies coding for distributed systems.
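
The classic word-count job illustrates the two phases described above, using Hadoop's Java MapReduce API. This is a minimal sketch: the class name WordCount and the input/output paths passed on the command line are placeholders.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input line.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts received for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // optional local aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a JAR, the job could be submitted with a command of the form: hadoop jar wordcount.jar WordCount <input directory> <output directory>.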

COMMON ALGORITHMS USING MAPREDUCE

1. Word Count:
o Map: Emit each word with a count of 1.
o Reduce: Aggregate counts for each word.
2. Inverted Index (used in search engines; see the sketch after this list):
o Map: Emit words as keys and document IDs as values.
o Reduce: Group document IDs for each word.
3. Sorting:
o Map: Emit data as key-value pairs where the key is the sorting criterion.
o Reduce: Consolidate sorted keys.
4. Distributed Grep:
o Map: Emit lines matching a pattern.
o Reduce: Consolidate matching lines.
5. Join Operations (e.g., database joins):
o Map: Emit common join keys from datasets.
o Reduce: Combine data from different datasets based on the join key.
6. PageRank:
o Used in ranking web pages.
o Map: Compute contributions of each page to its linked pages.
o Reduce: Aggregate contributions to update ranks iteratively.
7. Matrix Multiplication:
o Map: Emit elements of matrices with their respective indices.
o Reduce: Compute products for grouped indices.
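
The sketch below illustrates the inverted index from item 2 above. It assumes a file-based InputFormat so that the input file name can serve as the document ID; the driver is omitted because it would mirror the word-count driver shown earlier.

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Map: emit (word, documentId) for every word in the line.
class InvertedIndexMapper extends Mapper<Object, Text, Text, Text> {
    private final Text word = new Text();
    private final Text docId = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Use the input file name as the document ID.
        String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
        docId.set(fileName);
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken().toLowerCase());
            context.write(word, docId);
        }
    }
}

// Reduce: group the document IDs seen for each word into one posting list.
class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        Set<String> docs = new HashSet<>();          // de-duplicate document IDs
        for (Text value : values) {
            docs.add(value.toString());
        }
        context.write(key, new Text(String.join(", ", docs)));
    }
}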

MATRIX-VECTOR MULTIPLICATION

Matrix-vector multiplication multiplies a matrix A (with dimensions m × n) by a vector v (with n components) to produce a vector b with m components:

b = A · v,   where b_i = Σ_j (a_ij × v_j),  j = 1 … n

Each component of b is the dot product of a row of A with the vector v. The steps below implement this computation using Hive tables.


Step 1: Create Tables

1. Matrix Table: The matrix table should store each element of the matrix along with its row and column indices.

CREATE TABLE matrix (row_id INT, col_id INT, value DOUBLE)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

2. Vector Table: The vector table should store the index and value of each vector element.

CREATE TABLE vector (col_id INT, value DOUBLE)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

Step 2: Load Data

Load the matrix and vector data into the respective tables.

1. Load matrix data:

LOAD DATA LOCAL INPATH '/home/cloudera/matrix.csv' INTO TABLE matrix;
(The LOCAL keyword is needed here because /home/cloudera/... is a local file system path, not an HDFS path.)

Example matrix data in CSV:

1,1,2.0
1,2,3.0
2,1,4.0
2,2,5.0

2. Load vector data:

LOAD DATA LOCAL INPATH '/home/cloudera/vector.csv' INTO TABLE vector;

Example vector data in CSV:

1,0.5
2,1.5

Step 3: Perform Matrix-Vector Multiplication

To multiply the matrix and the vector, you need to:

1. Join the matrix table with the vector table on the col_id.
2. Compute the product for each element.
3. Aggregate the results by row_id.
Here’s the query:

SELECT
m.row_id,
SUM(m.value * v.value) AS result
FROM
matrix m
JOIN
vector v
ON
m.col_id = v.col_id
GROUP BY
m.row_id;

Step 4: View the Results

The result of this query will be the new vector where each row corresponds to the computed
value for a row in the matrix:

row_id    result
1         5.5
2         9.5

(Row 1: 2.0×0.5 + 3.0×1.5 = 5.5;  Row 2: 4.0×0.5 + 5.0×1.5 = 9.5.)
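
The same computation can also be expressed directly in MapReduce. The sketch below is a simplified illustration: it assumes the matrix arrives as "row,col,value" text lines and that the vector is small enough to be held in memory by every mapper (here it is hard-coded; in practice it would be shipped to the mappers, for example through the distributed cache).

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: for each matrix element a_ij, emit (i, a_ij * v_j).
class MatrixVectorMapper extends Mapper<Object, Text, IntWritable, DoubleWritable> {
    // Illustrative only: the vector v, indexed from 1 (index 0 unused).
    private static final double[] VECTOR = {0.0, 0.5, 1.5};

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] parts = value.toString().split(",");   // "row,col,value"
        int row = Integer.parseInt(parts[0].trim());
        int col = Integer.parseInt(parts[1].trim());
        double a = Double.parseDouble(parts[2].trim());
        context.write(new IntWritable(row), new DoubleWritable(a * VECTOR[col]));
    }
}

// Reduce: sum the partial products for each row to obtain b_i.
class MatrixVectorReducer
        extends Reducer<IntWritable, DoubleWritable, IntWritable, DoubleWritable> {
    @Override
    public void reduce(IntWritable key, Iterable<DoubleWritable> values, Context context)
            throws IOException, InterruptedException {
        double sum = 0.0;
        for (DoubleWritable v : values) {
            sum += v.get();
        }
        context.write(key, new DoubleWritable(sum));
    }
}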

MOVING DATA INTO HADOOP


Data Sources

 Structured Data: From relational databases, logs, or CSV files.
 Semi-Structured Data: JSON, XML, or Avro.
 Unstructured Data: Text files, images, or raw data.

Methods for Importing Data

1. HDFS (Hadoop Distributed File System):
o Primary storage system for Hadoop.
o Use the hdfs dfs -put command to upload data to HDFS.

Command

hdfs dfs -put /home/cloudera/filename.txt filename.txt

MOVING DATA OUT OF HADOOP

Export Methods

1. HDFS to Local File System:
o Use the hdfs dfs -get command to retrieve data.
Command

hdfs dfs -get /user/cloudera/filename.txt /home/cloudera/filename.txt

INPUTS AND OUTPUTS OF MAPREDUCE

Input to MapReduce

 InputFormat:
o Defines how input data is split and read into the system.
o Common implementations:
 TextInputFormat: Processes plain text files, line by line.
 KeyValueTextInputFormat: Reads input as key-value pairs.
 SequenceFileInputFormat: For binary input files in sequence format.
 CustomInputFormat: Custom formats based on application needs.
 Input Splits:
o Data is divided into logical splits, each processed by a separate Mapper.
o Example: Splitting a large file into chunks of 128 MB.
 RecordReader:
o Converts input splits into key-value pairs for the Mapper.

Output from MapReduce

 OutputFormat:
o Defines how the output of the Reducer is written.
o Common implementations:
 TextOutputFormat: Outputs data as plain text files.
 SequenceFileOutputFormat: Writes data in a binary sequence file.
 MultipleOutputs: Writes output to multiple files.
 CustomOutputFormat: Tailored to specific needs.
 Reducer Output:
o Data written back to HDFS or sent to other systems (e.g., Hive tables).
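
The hypothetical driver fragment below shows where the InputFormat and OutputFormat classes plug into a job; the class name, paths, and format choices are placeholders for illustration. No mapper or reducer is set, so Hadoop's identity classes simply pass the key-value pairs through.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class FormatDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "format demo");
        job.setJarByClass(FormatDemo.class);

        // InputFormat: read each input line as a tab-separated key-value pair.
        job.setInputFormatClass(KeyValueTextInputFormat.class);

        // OutputFormat: write reducer output as plain "key<TAB>value" text lines.
        job.setOutputFormatClass(TextOutputFormat.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}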

DATA SERIALIZATION

Data serialization is the process of converting structured or semi-structured data into a format
that can be easily stored, transmitted, and processed. In Big Data systems, serialization plays a
critical role in enabling efficient communication between distributed systems, storing data
compactly, and facilitating data processing.

 Serialization converts data into a format suitable for storage or transmission.
 Hadoop uses Writable objects for serialization.
o Example: Text, IntWritable, LongWritable.
 Benefits:
o Compact and efficient for distributed environments.
Why is Serialization Important in Big Data?

 Efficient Storage: Reduces the size of data for storage in distributed systems.
 Fast Transmission: Ensures data can be sent over networks with minimal overhead.
 Interoperability: Enables communication between systems with different architectures.
 Ease of Processing: Serialized formats are structured and easier to parse.
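
A minimal sketch of a custom Writable shows how Hadoop serializes a record to and from a byte stream; the WordCountPair class and its fields are invented here for illustration.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

// A simple custom Writable holding a word and its count.
public class WordCountPair implements Writable {
    private String word;
    private int count;

    public WordCountPair() { }                        // required no-arg constructor

    public WordCountPair(String word, int count) {
        this.word = word;
        this.count = count;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(word);                           // serialize fields in a fixed order
        out.writeInt(count);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        word = in.readUTF();                          // deserialize in the same order
        count = in.readInt();
    }

    @Override
    public String toString() {
        return word + "\t" + count;
    }
}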

PROBLEMS WITH TRADITIONAL LARGE-SCALE SYSTEMS

 Scalability Issues:
o Cannot handle exponential data growth.
 High Cost:
o Expensive hardware and infrastructure.
 Limited Fault Tolerance:
o Failure of a node affects the entire system.
 Complex Data Integration:
o Difficulty in processing structured, unstructured, and semi-structured data.

REQUIREMENTS FOR A NEW APPROACH

1. Distributed architecture to process large datasets efficiently.
2. Fault tolerance for hardware failures.
3. Scalability to handle increasing data volumes.
4. Support for diverse data formats and sources.

HADOOP: A NEW APPROACH

 Scaling:
o Horizontal scaling by adding commodity hardware.
 Distributed Framework:
o Consists of HDFS for storage and MapReduce for processing.
o Fault tolerance through data replication.

HADOOP VS. RDBMS

Aspect          | Hadoop                         | RDBMS
Data Type       | Unstructured, semi-structured  | Structured (tables)
Processing      | Batch processing               | Real-time, transaction-based
Scalability     | Horizontally scalable          | Limited by hardware
Fault Tolerance | High (replication)             | Low
Cost            | Low (commodity hardware)       | High
Flexibility     | Supports diverse data sources  | Rigid schema
BRIEF HISTORY OF HADOOP

1. Origins of Hadoop

 2003:
o Google publishes the GFS paper:
 Google File System (GFS) outlines a distributed file system designed to
handle large-scale data processing across commodity hardware.
 Inspired the core design of Hadoop's storage system.
 2004:
o Google publishes the MapReduce paper:
 Describes a programming model and processing framework for large-scale
data.
 It becomes the foundation of Hadoop's processing engine.

2. Early Development of Hadoop

 2005:
o Apache Nutch Integration:
 Hadoop begins as a sub-project of Apache Nutch, an open-source web
crawler.
 Developers Doug Cutting and Mike Cafarella incorporate distributed
processing concepts from Google's MapReduce and GFS papers.
 The project is named Hadoop, after Doug Cutting's son's toy elephant.
 2006:
o Hadoop Becomes an Independent Project:
 Hadoop is split from Nutch and becomes its own project under the Apache
Software Foundation (ASF).
 Focuses on building a scalable framework for storing and processing large
datasets.

3. Hadoop's First Milestone

 2008:
o Yahoo adopts Hadoop:
 Yahoo uses Hadoop to power its web search engine.
 Yahoo's cluster achieves the first major milestone:
 Sorts 1 TB of data in 209 seconds using Hadoop.
o Hadoop becomes a top-level Apache project.

4. Growth of the Hadoop Ecosystem

 2009–2011:
o Hadoop's popularity grows, and its ecosystem expands:
 HDFS (Hadoop Distributed File System): Becomes a robust, fault-
tolerant storage system.
 MapReduce: Becomes the de facto standard for batch processing.
 New projects join the ecosystem:
 Hive: SQL-like querying for Hadoop.
 Pig: Scripting language for data transformation.
 HBase: NoSQL database built on Hadoop.
 Zookeeper: Coordination service for distributed systems.

5. Commercial Adoption and Enhancements

 2011–2013:
o Major tech companies like Facebook, Twitter, and LinkedIn adopt Hadoop.
o Commercial distributions emerge, offering enterprise-ready Hadoop solutions:
 Cloudera (2008)
 Hortonworks (2011)
 MapR (2011)
o The Hadoop ecosystem diversifies with tools for data ingestion (Flume, Sqoop),
real-time processing (Storm), and visualization.

6. Introduction of YARN

 2013:
o YARN (Yet Another Resource Negotiator) is introduced in Hadoop 2.0.
 Decouples resource management from MapReduce.
 Allows Hadoop to support other processing models (e.g., Apache Spark,
Flink).
 Marks a significant shift, making Hadoop more versatile.

7. Recent Developments

 2014–Present:
o Hadoop transitions from being just a batch processing framework to a platform
for diverse workloads:
 Integration with Apache Spark for faster, in-memory processing.
 Support for cloud-based deployment on AWS, Azure, and GCP.
 Increasing use of columnar storage formats like Parquet and ORC for
analytics.
