Custom Notes

What is MapReduce?

MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster. It was introduced by Google in a 2004 paper by Jeff Dean and Sanjay Ghemawat.

The model is inspired by the map and reduce functions commonly used in
functional programming. The basic idea is to split the input data into smaller
chunks and process them in parallel, then combine the results to produce
the final output. The two main functions in the MapReduce model are the
"map" function and the "reduce" function.

The "map" function takes an input and produces a set of intermediate key-
value pairs. The "reduce" function takes all the values associated with the
same key, and combines them in some way, typically by summing or
averaging them, to produce a single output value. The output of the reduce
function is typically a smaller set of key-value pairs, which can be further
processed by additional reduce functions.
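
For example, here is a minimal, single-machine Python sketch of the model using word count. It is purely illustrative: a real framework would run the map and reduce tasks in parallel across a cluster and shuffle the intermediate pairs between them.

    # Minimal word-count sketch of the MapReduce model (single machine).
    from collections import defaultdict

    def map_fn(document):
        # Map: emit an intermediate (key, value) pair for every word.
        for word in document.split():
            yield (word.lower(), 1)

    def reduce_fn(key, values):
        # Reduce: combine all values for one key into a single output value.
        return (key, sum(values))

    def map_reduce(documents):
        intermediate = defaultdict(list)
        # Map phase: produce intermediate key-value pairs, grouped by key
        # (this grouping plays the role of the shuffle step).
        for doc in documents:
            for key, value in map_fn(doc):
                intermediate[key].append(value)
        # Reduce phase: one call per distinct key.
        return dict(reduce_fn(k, v) for k, v in intermediate.items())

    print(map_reduce(["the quick brown fox", "the lazy dog"]))
    # {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}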

MapReduce is a powerful model for distributed data processing, and it can be implemented using various open-source frameworks like Apache Hadoop, Apache Spark, and others. It is widely used for handling big data sets and can be applied to various use cases like data mining, log processing, machine learning, and more.

YARN?

YARN (Yet Another Resource Negotiator) is a resource management framework in Apache Hadoop, used for managing resources and scheduling applications in a large-scale, distributed computing environment. It enables Hadoop to run a variety of processing frameworks (such as MapReduce, Apache Spark, Apache Tez, etc.) on a single platform, allowing multiple users to share a cluster and use it for different purposes.

YARN was introduced in Hadoop 2.0 to remove the Job Tracker bottleneck that existed in Hadoop 1.0. At launch it was described as a “Redesigned Resource Manager”, but it has since evolved into what is often called a large-scale distributed operating system for Big Data processing.

The YARN architecture separates the resource management layer from the processing layer. With YARN, the responsibility of the Hadoop 1.0 Job Tracker is split between the Resource Manager and the per-application Application Master.
YARN also allows different data processing engines, such as graph processing, interactive processing, stream processing, and batch processing, to run and process data stored in HDFS (Hadoop Distributed File System), making the system much more efficient. Through its various components, it can dynamically allocate resources and schedule application processing. For large-volume data processing, it is essential to manage the available resources properly so that every application can leverage them.

YARN Features: YARN gained popularity because of the following features:

 Scalability: The scheduler in the YARN Resource Manager allows Hadoop to extend and manage thousands of nodes and clusters.
 Compatibility: YARN supports existing MapReduce applications without disruption, making it compatible with Hadoop 1.0 as well.
 Cluster Utilization: YARN supports dynamic utilization of the cluster in Hadoop, which enables optimized cluster utilization.
 Multi-tenancy: It allows multiple engines to access the cluster, giving organizations the benefit of multi-tenancy.

The main components of the YARN architecture are:
 Client: It submits MapReduce jobs.
 Resource Manager: It is the master daemon of YARN and is responsible
for resource assignment and management among all the applications.
Whenever it receives a processing request, it forwards it to the
corresponding node manager and allocates resources for the completion
of the request accordingly. It has two major components:
 Scheduler: It performs scheduling based on the application's resource requirements and the available resources. It is a pure scheduler, meaning it does not perform other tasks such as monitoring or tracking and does not guarantee a restart if a task fails. The YARN scheduler supports plugins such as the Capacity Scheduler and the Fair Scheduler to partition the cluster resources.
 Application manager: It is responsible for accepting the
application and negotiating the first container from the resource
manager. It also restarts the Application Master container if a task
fails.
 Node Manager: It takes care of an individual node in the Hadoop cluster and manages the applications and workflow on that particular node. Its primary job is to keep up with the Resource Manager: it registers with the Resource Manager and sends heartbeats with the health status of the node. It monitors resource usage, performs log management, and kills containers based on directions from the Resource Manager. It is also responsible for creating the container process and starting it at the request of the Application Master.
 Application Master: An application is a single job submitted to a
framework. The application master is responsible for negotiating
resources with the resource manager, tracking the status and monitoring
progress of a single application. The application master requests the
container from the node manager by sending a Container Launch
Context(CLC) which includes everything an application needs to run.
Once the application is started, it sends the health report to the resource
manager from time-to-time.
 Container: It is a collection of physical resources such as RAM, CPU
cores and disk on a single node. The containers are invoked by Container
Launch Context(CLC) which is a record that contains information such as
environment variables, security tokens, dependencies etc.
Application workflow in Hadoop YARN:
1. Client submits an application
2. The Resource Manager allocates a container to start the Application Master
3. The Application Master registers itself with the Resource Manager
4. The Application Master negotiates containers from the Resource Manager
5. The Application Master notifies the Node Manager to launch containers
6. Application code is executed in the container
7. Client contacts the Resource Manager/Application Master to monitor the application’s status (see the sketch after this list)
8. Once processing is complete, the Application Master un-registers with the Resource Manager
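
As a hedged illustration of step 7, the Python sketch below polls the ResourceManager REST API (the /ws/v1/cluster/apps endpoint) to list applications and their progress. The hostname is a placeholder and 8088 is only the default ResourceManager web UI port; both would need to match a real cluster.

    # Hypothetical client-side monitoring of YARN applications via the
    # ResourceManager REST API. Host and port below are placeholders.
    import requests

    RM_URL = "http://resourcemanager.example.com:8088"

    def list_applications(state="RUNNING"):
        # GET /ws/v1/cluster/apps returns metadata for cluster applications.
        resp = requests.get(f"{RM_URL}/ws/v1/cluster/apps", params={"states": state})
        resp.raise_for_status()
        apps = resp.json().get("apps") or {}
        for app in apps.get("app", []):
            print(app["id"], app["name"], app["state"], app["progress"])

    if __name__ == "__main__":
        list_applications()
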
Big data file formats

Some things to consider when choosing the format are:

 The structure of your data: Some formats accept nested data, such as JSON, Avro, or Parquet, and others do not. Even the ones that do may not be highly optimized for it. Avro is the most efficient format for nested data; I recommend not using Parquet nested types because they are very inefficient. Processing nested JSON is also very CPU intensive. In general, it is recommended to flatten the data when ingesting it.

 Performance: Some formats, such as Avro and Parquet, perform better than others, such as JSON. Even between Avro and Parquet, one will be better than the other depending on the use case. For example, since Parquet is a column-based format, it is great for querying your data lake using SQL, whereas Avro is better for row-level ETL transformations.

 Easy to read: Consider whether you need people to read the data or not. JSON and CSV are text formats and are human readable, whereas more performant formats such as Parquet or Avro are binary.

 Compression: Some formats offer higher compression rates than others.

 Schema evolution: Adding or removing fields is far more complicated in a data lake than in a database. Some formats, like Avro or Parquet, provide some degree of schema evolution, which allows you to change the data schema and still query the data. Tools such as the Delta Lake format provide even better support for dealing with schema changes.

 Compatibility: JSON and CSV are widely adopted and compatible with almost any tool, while more performant options have fewer integration points.

File formats

 Avro: Great for storing row data, very efficient. It has a schema and supports schema evolution. Great integration with Kafka. Supports file splitting. Use it for row-level operations or in Kafka. Great for writing data, slower to read.
 Parquet: Columnar storage. It has schema support. It works very well with Hive and Spark as a way to store columnar data in deep storage that is queried using SQL. Because it stores data in columns, query engines read only the selected columns rather than the entire data set, as opposed to Avro. Use it as a reporting layer.

 ORC: Similar to Parquet, but it offers better compression. It also provides better schema evolution support, but it is less popular.

Conclusion

As we can see, CSV and JSON are easy to use, human-readable, and common formats, but they lack many of the capabilities of other formats, making them too slow for querying the data lake. ORC and Parquet are widely used in the Hadoop ecosystem to query data, whereas Avro is also used outside of Hadoop, especially together with Kafka for ingestion, and it is very good for row-level ETL processing. Row-oriented formats have better schema evolution capabilities than column-oriented formats, making them a great option for data ingestion.
Detailed

PARQUET file format

Parquet is an open-source file format for Hadoop.

Parquet helps achieve efficient storage and performance. It is a column-oriented format, in which the values of each column across the records are stored together.

To understand the Parquet file format, let’s take an example.

Example: There is a table consisting of four fields: CustId, First_Name, Last_Name, and City. All the values for the CustId column are stored together, the values for the First_Name column are stored together, and every other column is stored in the same way. With this record schema, the table looks like:

CustId     First_Name   Last_Name   City
9563218    FN1          LN1         Delhi
9558120    FN2          LN2         Kolkata

For this table, the data in a row-wise storage format will be stored as follows:

9563218  FN1  LN1  Delhi  9558120  FN2  LN2  Kolkata

Whereas the same data in a column-oriented storage format will look like this:

9563218  9558120  FN1  FN2  LN1  LN2  Delhi  Kolkata

 The columnar storage format is relatively more efficient when the requirement is to fetch data by querying only a few columns of a table.

The column-oriented file format increases query performance because it takes less time to fetch the required column values. Less IO is also required, since the values of the required columns are stored adjacent to each other.
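
A hedged sketch of this in practice, assuming pandas with pyarrow installed (the file name is arbitrary and the data is just the example table above): the table is written to Parquet and only two columns are read back, so only those column chunks are scanned.

    # Write the example table to Parquet and read back selected columns only.
    import pandas as pd

    df = pd.DataFrame({
        "CustId": [9563218, 9558120],
        "First_Name": ["FN1", "FN2"],
        "Last_Name": ["LN1", "LN2"],
        "City": ["Delhi", "Kolkata"],
    })

    df.to_parquet("customers.parquet")  # stored column by column on disk
    subset = pd.read_parquet("customers.parquet", columns=["CustId", "City"])
    print(subset)  # only the CustId and City column chunks are read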

AVRO file format

Avro is a row-based storage format, which is widely used for serialization.


Avro relies on a schema, which is stored in JSON format; this makes it easy for any program to read and understand. The data itself is stored in a binary format, making it compact and efficient.
One of the prime features of Avro is that it supports dynamic data schemas that change
over time. Since this format supports schema evolution, it can easily handle schema
changes like missing fields, added fields, and changed fields.

 The Avro format is preferred for loading a data lake landing zone, because downstream systems can easily retrieve table schemas from the files, and any source schema changes can be easily handled.
 Due to its efficient serialization and deserialization properties, it offers good performance.
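
A hedged sketch of Avro's row-oriented storage and schema evolution, assuming the fastavro library (the schema and field names are illustrative): records written with an older schema are read with a newer one, and the added field is filled in from its default.

    # Write rows with schema v1, then read them back with schema v2.
    from fastavro import writer, reader

    writer_schema = {
        "type": "record", "name": "Customer",
        "fields": [
            {"name": "CustId", "type": "long"},
            {"name": "City", "type": "string"},
        ],
    }

    reader_schema = {
        "type": "record", "name": "Customer",
        "fields": [
            {"name": "CustId", "type": "long"},
            {"name": "City", "type": "string"},
            {"name": "email", "type": "string", "default": "unknown"},  # new field
        ],
    }

    records = [{"CustId": 9563218, "City": "Delhi"},
               {"CustId": 9558120, "City": "Kolkata"}]

    with open("customers.avro", "wb") as out:
        writer(out, writer_schema, records)  # rows stored along with the schema

    with open("customers.avro", "rb") as inp:
        for rec in reader(inp, reader_schema):  # old data, new reader schema
            print(rec)  # "email" is filled with its default value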

ORC file format

The Optimized Row Columnar (ORC) file format provides a highly efficient way to store data. This format was designed to overcome the limitations of other file formats. It improves the overall performance when Hive (a SQL-like query interface built on top of Hadoop) reads, writes, and processes the data.

ORC stores collections of rows in one file and within the collection, the row data is
stored in a columnar format.
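
For completeness, a hedged sketch of writing and reading ORC from Python, assuming a recent pyarrow build that ships the pyarrow.orc module (in practice ORC files are usually produced by Hive or Spark rather than written by hand):

    # Write a small table to ORC and read back a single column.
    import pyarrow as pa
    import pyarrow.orc as orc

    table = pa.table({"CustId": [9563218, 9558120],
                      "City": ["Delhi", "Kolkata"]})

    orc.write_table(table, "customers.orc")                    # columnar, compressed
    print(orc.read_table("customers.orc", columns=["City"]))   # read one column only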

Link of file formats: [Link]



Why use AWS Redshift over AWS S3?


AWS Redshift and S3 are both storage solutions offered by AWS, but they
have different use cases and strengths. Some reasons to use Redshift over
S3 include:

1. Performance: Redshift is optimized for fast query performance, making it well-suited for data warehousing and business intelligence workloads.
2. Structure: Redshift uses a columnar database model, which allows it to store
data in a way that is optimized for analytics and data retrieval. In contrast,
S3 is an object storage service and does not have a built-in structure for
querying data.
3. Integration: Redshift integrates with other AWS services such as QuickSight,
Glue, and Athena, allowing you to easily perform analytics on your data.
4. Cost: Redshift provides cost savings for storing and querying large amounts
of data. Redshift automatically compresses and stores data in a columnar
format, which reduces the amount of storage required compared to a row-
based storage format.

In conclusion, Redshift is a more specialized storage solution for use cases where query performance and analytics are important, while S3 is a more general-purpose storage solution that can be used for a wide range of use cases.
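
As a hedged illustration, the sketch below loads Parquet files from S3 into a Redshift table with COPY and then runs an analytic query. The cluster endpoint, credentials, table, bucket, and IAM role ARN are all placeholders, and psycopg2 is used only because Redshift speaks the PostgreSQL wire protocol.

    # Hypothetical: bulk-load data from S3 into Redshift, then query it.
    import psycopg2

    conn = psycopg2.connect(
        host="my-cluster.xxxxxx.us-east-1.redshift.amazonaws.com",  # placeholder
        port=5439, dbname="analytics", user="admin", password="...",
    )

    with conn, conn.cursor() as cur:
        # Load Parquet files sitting in S3 into an existing "sales" table.
        cur.execute("""
            COPY sales
            FROM 's3://my-bucket/sales/'
            IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
            FORMAT AS PARQUET;
        """)
        # A fast, columnar aggregation (the kind of workload Redshift is built for).
        cur.execute("SELECT region, SUM(amount) FROM sales GROUP BY region;")
        print(cur.fetchall())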

What is Airflow?

Airflow is an open-source platform for programmatically authoring, scheduling, and monitoring workflows. In the context of big data, Airflow can be used to manage and automate tasks related to data ingestion, transformation, and analysis, making it easier for data engineers and scientists to work with large datasets. Airflow's features, such as its intuitive UI, support for multiple data sources, and ability to handle complex dependencies, make it a popular choice for big data workflows.
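
A minimal sketch of an Airflow 2.x DAG with three placeholder tasks forming a small ingest, transform, load pipeline (the DAG id, schedule, and task bodies are illustrative only):

    # Minimal Airflow 2.x DAG: ingest -> transform -> load, run daily.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def ingest():
        print("pulling raw data")          # placeholder for real ingestion code

    def transform():
        print("cleaning and reshaping")    # placeholder

    def load():
        print("writing to the warehouse")  # placeholder

    with DAG(
        dag_id="daily_etl",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        t1 = PythonOperator(task_id="ingest", python_callable=ingest)
        t2 = PythonOperator(task_id="transform", python_callable=transform)
        t3 = PythonOperator(task_id="load", python_callable=load)

        t1 >> t2 >> t3  # dependencies that Airflow will schedule and monitor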

What is Step Functions?

Step Functions is a serverless workflow service provided by AWS (Amazon Web Services) that makes it easy to coordinate distributed applications and microservices using visual workflows. In the context of big data, Step Functions can be used to automate and manage big data processing pipelines, enabling data engineers and scientists to build and run complex data processing tasks using a visual interface. With Step Functions, you can model your entire big data processing workflow as a series of steps, including data ingestion, transformation, and analysis, and automate the coordination and execution of these steps. This makes it easier to manage and monitor big data processing workflows, and provides a flexible and scalable solution for big data processing.
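
As a hedged illustration, the sketch below defines a tiny two-step pipeline in Amazon States Language and starts an execution with boto3. The Lambda ARNs, role ARN, and state machine ARN are placeholders, and the state machine is assumed to exist already (the create call is shown commented out).

    # Hypothetical two-step Step Functions workflow driven from Python.
    import json
    import boto3

    definition = {
        "StartAt": "Ingest",
        "States": {
            "Ingest": {
                "Type": "Task",
                "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ingest",
                "Next": "Transform",
            },
            "Transform": {
                "Type": "Task",
                "Resource": "arn:aws:lambda:us-east-1:123456789012:function:transform",
                "End": True,
            },
        },
    }

    sfn = boto3.client("stepfunctions")

    # One-time setup (placeholder role ARN):
    # sfn.create_state_machine(name="daily-etl", definition=json.dumps(definition),
    #                          roleArn="arn:aws:iam::123456789012:role/StatesExecutionRole")

    # Kick off one run of the pipeline with an input payload.
    sfn.start_execution(
        stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:daily-etl",
        input=json.dumps({"date": "2023-01-01"}),
    )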
