
Big Data Unit-1


DISTRIBUTED FILE SYSTEM (DFS) AND BIG DATA

 A Distributed File System (e.g., HDFS in Hadoop) is a storage system that enables
large-scale data storage and processing across a network of multiple computers.
 It divides large datasets into smaller chunks and distributes them across multiple nodes in
a cluster.

Features:

o Fault Tolerance: Data is replicated across nodes, ensuring reliability even if some nodes fail.
o Scalability: Easily scales to handle petabytes or exabytes of data by adding more
nodes.
o Parallel Processing: Enables faster processing as data can be accessed and
processed simultaneously on multiple nodes.

IMPORTANCE OF BIG DATA

 Big Data refers to massive volumes of structured, unstructured, or semi-structured data that traditional systems cannot efficiently process.
 It is crucial for:
o Enhanced Decision-Making: Provides insights through analytics.
o Predictive Analysis: Facilitates forecasting trends and behaviors.
o Innovation: Drives the development of new products and services.
o Efficiency: Optimizes operations and reduces costs.

FOUR V's OF BIG DATA

1. Volume: Refers to the vast amount of data generated every second (e.g., social media
posts, IoT sensors, transaction logs).
2. Velocity: The speed at which data is generated and processed in real time or near real-
time.
3. Variety: The diverse types of data, including structured (databases), semi-structured
(JSON, XML), and unstructured (text, images, videos).
4. Veracity: The trustworthiness and quality of the data, which can be impacted by
inaccuracies or inconsistencies.

DRIVERS FOR BIG DATA

1. Technological Advancements:
o IoT devices generating continuous streams of data.
o Cloud computing offering scalable storage and processing.
2. Business Needs:
o Need for data-driven insights to gain a competitive edge.
o Real-time customer engagement and personalized experiences.
3. Data Explosion:
o Increase in data sources such as social media, mobile applications, and sensors.
4. Regulatory Requirements:
o Compliance with data retention and analysis regulations.
5. Cost Reduction:
o Affordable cloud storage and distributed processing tools like Hadoop and Spark.

BIG DATA ANALYTICS

 The process of examining large datasets to uncover patterns, correlations, and actionable
insights.
 Types of Big Data Analytics:
o Descriptive Analytics: Summarizes past data to understand what happened.
o Predictive Analytics: Uses historical data to predict future outcomes.
o Prescriptive Analytics: Provides recommendations for decision-making.
o Diagnostic Analytics: Identifies causes behind trends or anomalies.
 Tools for Big Data Analytics:
o Apache Hadoop: Distributed storage and processing framework.
o Apache Spark: Fast in-memory analytics engine.
o Tableau, Power BI: Visualization and reporting tools.

BIG DATA APPLICATIONS

1. Healthcare:
o Predicting disease outbreaks and personalizing treatments.
o Analyzing patient records for preventive care.
2. Finance:
o Fraud detection using transaction patterns.
o Risk management and algorithmic trading.
3. Retail:
o Enhancing customer experience with personalized recommendations.
o Inventory and supply chain optimization.
4. Manufacturing:
o Predictive maintenance using IoT data.
o Improving production efficiency and quality control.
5. Transportation:
o Real-time traffic management and route optimization.
o Predictive analytics for vehicle maintenance.
6. Entertainment:
o Content recommendation engines (e.g., Netflix, Spotify).
o Social media trend analysis.
7. Energy:
o Smart grids for efficient energy distribution.
o Analyzing consumption patterns for energy-saving initiatives.
MAPREDUCE OVERVIEW
 MapReduce is a programming model for processing large-scale data in a distributed
environment, developed by Google.
 It has two main phases:
1. Map Phase: Processes input data and transforms it into key-value pairs.
2. Reduce Phase: Aggregates the key-value pairs based on keys and produces the
output.
 Features of MapReduce:
o Scalability and fault tolerance.
o Parallel processing of large datasets.
o Simplifies coding for distributed systems.
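
The classic word-count job illustrates the two phases described above, using Hadoop's Java MapReduce API. This is a minimal sketch: the class name WordCount and the input/output paths passed on the command line are placeholders.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input line.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts received for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // optional local aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a JAR, the job could be submitted with a command of the form: hadoop jar wordcount.jar WordCount <input directory> <output directory>.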

COMMON ALGORITHMS USING MAPREDUCE

1. Word Count:
o Map: Emit each word with a count of 1.
o Reduce: Aggregate counts for each word.
2. Inverted Index (used in search engines; see the sketch after this list):
o Map: Emit words as keys and document IDs as values.
o Reduce: Group document IDs for each word.
3. Sorting:
o Map: Emit data as key-value pairs where the key is the sorting criterion.
o Reduce: Consolidate sorted keys.
4. Distributed Grep:
o Map: Emit lines matching a pattern.
o Reduce: Consolidate matching lines.
5. Join Operations (e.g., database joins):
o Map: Emit common join keys from datasets.
o Reduce: Combine data from different datasets based on the join key.
6. PageRank:
o Used in ranking web pages.
o Map: Compute contributions of each page to its linked pages.
o Reduce: Aggregate contributions to update ranks iteratively.
7. Matrix Multiplication:
o Map: Emit elements of matrices with their respective indices.
o Reduce: Compute products for grouped indices.
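
The sketch below illustrates the inverted index from item 2 above. It assumes a file-based InputFormat so that the input file name can serve as the document ID; the driver is omitted because it would mirror the word-count driver shown earlier.

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Map: emit (word, documentId) for every word in the line.
class InvertedIndexMapper extends Mapper<Object, Text, Text, Text> {
    private final Text word = new Text();
    private final Text docId = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Use the input file name as the document ID.
        String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
        docId.set(fileName);
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken().toLowerCase());
            context.write(word, docId);
        }
    }
}

// Reduce: group the document IDs seen for each word into one posting list.
class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        Set<String> docs = new HashSet<>();          // de-duplicate document IDs
        for (Text value : values) {
            docs.add(value.toString());
        }
        context.write(key, new Text(String.join(", ", docs)));
    }
}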

MATRIX-VECTOR MULTIPLICATION

Matrix-vector multiplication multiplies a matrix A (with dimensions m × n) by a vector v (with n components) to produce a vector b with m components:

b = A · v,   where b_i = Σ_j (a_ij × v_j),  j = 1 … n

Each component of b is the dot product of a row of A with the vector v. The steps below implement this computation using Hive tables.


Step 1: Create Tables

1. Matrix Table: The matrix table should store each element of the matrix along with its row and column indices.

CREATE TABLE matrix (row_id INT, col_id INT, value DOUBLE)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

2. Vector Table: The vector table should store the index and value of each vector element.

CREATE TABLE vector (col_id INT, value DOUBLE)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

Step 2: Load Data

Load the matrix and vector data into the respective tables.

1. Load matrix data:

LOAD DATA LOCAL INPATH '/home/cloudera/matrix.csv' INTO TABLE matrix;
(The LOCAL keyword is needed here because /home/cloudera/... is a local file system path, not an HDFS path.)

Example matrix data in CSV:

1,1,2.0
1,2,3.0
2,1,4.0
2,2,5.0

2. Load vector data:

LOAD DATA LOCAL INPATH '/home/cloudera/vector.csv' INTO TABLE vector;

Example vector data in CSV:

1,0.5
2,1.5

Step 3: Perform Matrix-Vector Multiplication

To multiply the matrix and the vector, you need to:

1. Join the matrix table with the vector table on the col_id.
2. Compute the product for each element.
3. Aggregate the results by row_id.
Here’s the query:

SELECT
m.row_id,
SUM(m.value * v.value) AS result
FROM
matrix m
JOIN
vector v
ON
m.col_id = v.col_id
GROUP BY
m.row_id;

Step 4: View the Results

The result of this query will be the new vector where each row corresponds to the computed
value for a row in the matrix:

row_id    result
1         5.5
2         9.5

(Row 1: 2.0×0.5 + 3.0×1.5 = 5.5;  Row 2: 4.0×0.5 + 5.0×1.5 = 9.5.)
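
The same computation can also be expressed directly in MapReduce. The sketch below is a simplified illustration: it assumes the matrix arrives as "row,col,value" text lines and that the vector is small enough to be held in memory by every mapper (here it is hard-coded; in practice it would be shipped to the mappers, for example through the distributed cache).

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: for each matrix element a_ij, emit (i, a_ij * v_j).
class MatrixVectorMapper extends Mapper<Object, Text, IntWritable, DoubleWritable> {
    // Illustrative only: the vector v, indexed from 1 (index 0 unused).
    private static final double[] VECTOR = {0.0, 0.5, 1.5};

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] parts = value.toString().split(",");   // "row,col,value"
        int row = Integer.parseInt(parts[0].trim());
        int col = Integer.parseInt(parts[1].trim());
        double a = Double.parseDouble(parts[2].trim());
        context.write(new IntWritable(row), new DoubleWritable(a * VECTOR[col]));
    }
}

// Reduce: sum the partial products for each row to obtain b_i.
class MatrixVectorReducer
        extends Reducer<IntWritable, DoubleWritable, IntWritable, DoubleWritable> {
    @Override
    public void reduce(IntWritable key, Iterable<DoubleWritable> values, Context context)
            throws IOException, InterruptedException {
        double sum = 0.0;
        for (DoubleWritable v : values) {
            sum += v.get();
        }
        context.write(key, new DoubleWritable(sum));
    }
}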

MOVING DATA INTO HADOOP


Data Sources

 Structured Data: From relational databases, logs, or CSV files.
 Semi-Structured Data: JSON, XML, or Avro.
 Unstructured Data: Text files, images, or raw data.

Methods for Importing Data

1. HDFS (Hadoop Distributed File System):
o Primary storage system for Hadoop.
o Use the hdfs dfs -put command to upload data to HDFS.

Command

hdfs dfs -put /home/cloudera/filename.txt filename.txt

MOVING DATA OUT OF HADOOP

Export Methods

1. HDFS to Local File System:
o Use the hdfs dfs -get command to retrieve data.
Command

hdfs dfs -get /user/cloudera/filename.txt /home/cloudera/filename.txt

INPUTS AND OUTPUTS OF MAPREDUCE

Input to MapReduce

 InputFormat:
o Defines how input data is split and read into the system.
o Common implementations:
 TextInputFormat: Processes plain text files, line by line.
 KeyValueTextInputFormat: Reads input as key-value pairs.
 SequenceFileInputFormat: For binary input files in sequence format.
 CustomInputFormat: Custom formats based on application needs.
 Input Splits:
o Data is divided into logical splits, each processed by a separate Mapper.
o Example: Splitting a large file into chunks of 128 MB.
 RecordReader:
o Converts input splits into key-value pairs for the Mapper.

Output from MapReduce

 OutputFormat:
o Defines how the output of the Reducer is written.
o Common implementations:
 TextOutputFormat: Outputs data as plain text files.
 SequenceFileOutputFormat: Writes data in a binary sequence file.
 MultipleOutputs: Writes output to multiple files.
 CustomOutputFormat: Tailored to specific needs.
 Reducer Output:
o Data written back to HDFS or sent to other systems (e.g., Hive tables).
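
The hypothetical driver fragment below shows where the InputFormat and OutputFormat classes plug into a job; the class name, paths, and format choices are placeholders for illustration. No mapper or reducer is set, so Hadoop's identity classes simply pass the key-value pairs through.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class FormatDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "format demo");
        job.setJarByClass(FormatDemo.class);

        // InputFormat: read each input line as a tab-separated key-value pair.
        job.setInputFormatClass(KeyValueTextInputFormat.class);

        // OutputFormat: write reducer output as plain "key<TAB>value" text lines.
        job.setOutputFormatClass(TextOutputFormat.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}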

DATA SERIALIZATION

Data serialization is the process of converting structured or semi-structured data into a format
that can be easily stored, transmitted, and processed. In Big Data systems, serialization plays a
critical role in enabling efficient communication between distributed systems, storing data
compactly, and facilitating data processing.

 Serialization converts data into a format suitable for storage or transmission.
 Hadoop uses Writable objects for serialization.
o Example: Text, IntWritable, LongWritable.
 Benefits:
o Compact and efficient for distributed environments.
Why is Serialization Important in Big Data?

 Efficient Storage: Reduces the size of data for storage in distributed systems.
 Fast Transmission: Ensures data can be sent over networks with minimal overhead.
 Interoperability: Enables communication between systems with different architectures.
 Ease of Processing: Serialized formats are structured and easier to parse.
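
A minimal sketch of a custom Writable shows how Hadoop serializes a record to and from a byte stream; the WordCountPair class and its fields are invented here for illustration.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

// A simple custom Writable holding a word and its count.
public class WordCountPair implements Writable {
    private String word;
    private int count;

    public WordCountPair() { }                        // required no-arg constructor

    public WordCountPair(String word, int count) {
        this.word = word;
        this.count = count;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(word);                           // serialize fields in a fixed order
        out.writeInt(count);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        word = in.readUTF();                          // deserialize in the same order
        count = in.readInt();
    }

    @Override
    public String toString() {
        return word + "\t" + count;
    }
}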

PROBLEMS WITH TRADITIONAL LARGE-SCALE SYSTEMS

 Scalability Issues:
o Cannot handle exponential data growth.
 High Cost:
o Expensive hardware and infrastructure.
 Limited Fault Tolerance:
o Failure of a node affects the entire system.
 Complex Data Integration:
o Difficulty in processing structured, unstructured, and semi-structured data.

REQUIREMENTS FOR A NEW APPROACH

1. Distributed architecture to process large datasets efficiently.
2. Fault tolerance for hardware failures.
3. Scalability to handle increasing data volumes.
4. Support for diverse data formats and sources.

HADOOP: A NEW APPROACH

 Scaling:
o Horizontal scaling by adding commodity hardware.
 Distributed Framework:
o Consists of HDFS for storage and MapReduce for processing.
o Fault tolerance through data replication.

HADOOP VS. RDBMS

Aspect          | Hadoop                         | RDBMS
Data Type       | Unstructured, semi-structured  | Structured (tables)
Processing      | Batch processing               | Real-time, transaction-based
Scalability     | Horizontally scalable          | Limited by hardware
Fault Tolerance | High (replication)             | Low
Cost            | Low (commodity hardware)       | High
Flexibility     | Supports diverse data sources  | Rigid schema
BRIEF HISTORY OF HADOOP

1. Origins of Hadoop

 2003:
o Google publishes the GFS paper:
 Google File System (GFS) outlines a distributed file system designed to
handle large-scale data processing across commodity hardware.
 Inspired the core design of Hadoop's storage system.
 2004:
o Google publishes the MapReduce paper:
 Describes a programming model and processing framework for large-scale
data.
 It becomes the foundation of Hadoop's processing engine.

2. Early Development of Hadoop

 2005:
o Apache Nutch Integration:
 Hadoop begins as a sub-project of Apache Nutch, an open-source web
crawler.
 Developers Doug Cutting and Mike Cafarella incorporate distributed
processing concepts from Google's MapReduce and GFS papers.
 The project is named Hadoop, after Doug Cutting's son's toy elephant.
 2006:
o Hadoop Becomes an Independent Project:
 Hadoop is split from Nutch and becomes its own project under the Apache
Software Foundation (ASF).
 Focuses on building a scalable framework for storing and processing large
datasets.

3. Hadoop's First Milestone

 2008:
o Yahoo adopts Hadoop:
 Yahoo uses Hadoop to power its web search engine.
 Yahoo's cluster achieves the first major milestone:
 Sorts 1 TB of data in 209 seconds using Hadoop.
o Hadoop becomes a top-level Apache project.

4. Growth of the Hadoop Ecosystem

 2009–2011:
o Hadoop's popularity grows, and its ecosystem expands:
 HDFS (Hadoop Distributed File System): Becomes a robust, fault-
tolerant storage system.
 MapReduce: Becomes the de facto standard for batch processing.
 New projects join the ecosystem:
 Hive: SQL-like querying for Hadoop.
 Pig: Scripting language for data transformation.
 HBase: NoSQL database built on Hadoop.
 Zookeeper: Coordination service for distributed systems.

5. Commercial Adoption and Enhancements

 2011–2013:
o Major tech companies like Facebook, Twitter, and LinkedIn adopt Hadoop.
o Commercial distributions emerge, offering enterprise-ready Hadoop solutions:
 Cloudera (2008)
 Hortonworks (2011)
 MapR (2011)
o The Hadoop ecosystem diversifies with tools for data ingestion (Flume, Sqoop),
real-time processing (Storm), and visualization.

6. Introduction of YARN

 2013:
o YARN (Yet Another Resource Negotiator) is introduced in Hadoop 2.0.
 Decouples resource management from MapReduce.
 Allows Hadoop to support other processing models (e.g., Apache Spark,
Flink).
 Marks a significant shift, making Hadoop more versatile.

7. Recent Developments

 2014–Present:
o Hadoop transitions from being just a batch processing framework to a platform
for diverse workloads:
 Integration with Apache Spark for faster, in-memory processing.
 Support for cloud-based deployment on AWS, Azure, and GCP.
 Increasing use of columnar storage formats like Parquet and ORC for
analytics.
