Custom Notes

What is MapReduce?

MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster. It was introduced by Google in a 2004 paper by Jeff Dean and Sanjay Ghemawat.

The model is inspired by the map and reduce functions commonly used in
functional programming. The basic idea is to split the input data into smaller
chunks and process them in parallel, then combine the results to produce
the final output. The two main functions in the MapReduce model are the
"map" function and the "reduce" function.

The "map" function takes an input and produces a set of intermediate key-
value pairs. The "reduce" function takes all the values associated with the
same key, and combines them in some way, typically by summing or
averaging them, to produce a single output value. The output of the reduce
function is typically a smaller set of key-value pairs, which can be further
processed by additional reduce functions.
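
For example, here is a minimal, single-machine Python sketch of the model using word count. It is purely illustrative: a real framework would run the map and reduce tasks in parallel across a cluster and shuffle the intermediate pairs between them.

    # Minimal word-count sketch of the MapReduce model (single machine).
    from collections import defaultdict

    def map_fn(document):
        # Map: emit an intermediate (key, value) pair for every word.
        for word in document.split():
            yield (word.lower(), 1)

    def reduce_fn(key, values):
        # Reduce: combine all values for one key into a single output value.
        return (key, sum(values))

    def map_reduce(documents):
        intermediate = defaultdict(list)
        # Map phase: produce intermediate key-value pairs, grouped by key
        # (this grouping plays the role of the shuffle step).
        for doc in documents:
            for key, value in map_fn(doc):
                intermediate[key].append(value)
        # Reduce phase: one call per distinct key.
        return dict(reduce_fn(k, v) for k, v in intermediate.items())

    print(map_reduce(["the quick brown fox", "the lazy dog"]))
    # {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}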

MapReduce is a powerful model for distributed data processing, and it can be implemented using various open-source frameworks like Apache Hadoop, Apache Spark, and others. It is widely used for handling big data sets and can be applied to various use cases like data mining, log processing, machine learning, and more.

YARN?

YARN (Yet Another Resource Negotiator) is a resource management framework in Apache Hadoop, used for managing resources and scheduling applications in a large-scale, distributed computing environment. It enables Hadoop to run a variety of processing frameworks (such as MapReduce, Apache Spark, Apache Tez, etc.) on a single platform, allowing multiple users to share a cluster and use it for different purposes.

YARN was introduced in Hadoop 2.0 to remove the Job Tracker bottleneck that existed in Hadoop 1.0. At launch it was described as a “Redesigned Resource Manager”, but it has since evolved into what is often called a large-scale distributed operating system for Big Data processing.

The YARN architecture separates the resource management layer from the processing layer. With YARN, the responsibility of the Hadoop 1.0 Job Tracker is split between the Resource Manager and the per-application Application Master.
YARN also allows different data processing engines, such as graph processing, interactive processing, stream processing, and batch processing, to run and process data stored in HDFS (Hadoop Distributed File System), making the system much more efficient. Through its various components, it can dynamically allocate resources and schedule application processing. For large-volume data processing, it is essential to manage the available resources properly so that every application can leverage them.

YARN Features: YARN gained popularity because of the following features:

 Scalability: The scheduler in the YARN Resource Manager allows Hadoop to extend and manage thousands of nodes and clusters.
 Compatibility: YARN supports existing MapReduce applications without disruption, making it compatible with Hadoop 1.0 as well.
 Cluster Utilization: YARN supports dynamic utilization of the cluster in Hadoop, which enables optimized cluster utilization.
 Multi-tenancy: It allows multiple engines to access the cluster, giving organizations the benefit of multi-tenancy.

The main components of the YARN architecture are:
 Client: It submits MapReduce jobs.
 Resource Manager: It is the master daemon of YARN and is responsible
for resource assignment and management among all the applications.
Whenever it receives a processing request, it forwards it to the
corresponding node manager and allocates resources for the completion
of the request accordingly. It has two major components:
 Scheduler: It performs scheduling based on the application's resource requirements and the available resources. It is a pure scheduler, meaning it does not perform other tasks such as monitoring or tracking and does not guarantee a restart if a task fails. The YARN scheduler supports plugins such as the Capacity Scheduler and the Fair Scheduler to partition the cluster resources.
 Application manager: It is responsible for accepting the
application and negotiating the first container from the resource
manager. It also restarts the Application Master container if a task
fails.
 Node Manager: It takes care of an individual node in the Hadoop cluster and manages the applications and workflow on that particular node. Its primary job is to keep up with the Resource Manager: it registers with the Resource Manager and sends heartbeats with the health status of the node. It monitors resource usage, performs log management, and kills containers based on directions from the Resource Manager. It is also responsible for creating the container process and starting it at the request of the Application Master.
 Application Master: An application is a single job submitted to a
framework. The application master is responsible for negotiating
resources with the resource manager, tracking the status and monitoring
progress of a single application. The application master requests the
container from the node manager by sending a Container Launch
Context(CLC) which includes everything an application needs to run.
Once the application is started, it sends the health report to the resource
manager from time-to-time.
 Container: It is a collection of physical resources such as RAM, CPU
cores and disk on a single node. The containers are invoked by Container
Launch Context(CLC) which is a record that contains information such as
environment variables, security tokens, dependencies etc.
Application workflow in Hadoop YARN:
1. Client submits an application
2. The Resource Manager allocates a container to start the Application Master
3. The Application Master registers itself with the Resource Manager
4. The Application Master negotiates containers from the Resource Manager
5. The Application Master notifies the Node Manager to launch containers
6. Application code is executed in the container
7. Client contacts the Resource Manager/Application Master to monitor the application’s status (see the sketch after this list)
8. Once processing is complete, the Application Master un-registers with the Resource Manager
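
As a hedged illustration of step 7, the Python sketch below polls the ResourceManager REST API (the /ws/v1/cluster/apps endpoint) to list applications and their progress. The hostname is a placeholder and 8088 is only the default ResourceManager web UI port; both would need to match a real cluster.

    # Hypothetical client-side monitoring of YARN applications via the
    # ResourceManager REST API. Host and port below are placeholders.
    import requests

    RM_URL = "http://resourcemanager.example.com:8088"

    def list_applications(state="RUNNING"):
        # GET /ws/v1/cluster/apps returns metadata for cluster applications.
        resp = requests.get(f"{RM_URL}/ws/v1/cluster/apps", params={"states": state})
        resp.raise_for_status()
        apps = resp.json().get("apps") or {}
        for app in apps.get("app", []):
            print(app["id"], app["name"], app["state"], app["progress"])

    if __name__ == "__main__":
        list_applications()
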
Big data file formats

Some things to consider when choosing the format are:

 The structure of your data: Some formats accept nested data, such as JSON, Avro, or Parquet, and others do not. Even the ones that do may not be highly optimized for it. Avro is the most efficient format for nested data; I recommend not using Parquet nested types because they are very inefficient. Processing nested JSON is also very CPU intensive. In general, it is recommended to flatten the data when ingesting it.

 Performance: Some formats, such as Avro and Parquet, perform better than others, such as JSON. Even between Avro and Parquet, one will be better than the other depending on the use case. For example, since Parquet is a column-based format, it is great for querying your data lake using SQL, whereas Avro is better for row-level ETL transformations.

 Easy to read: Consider whether you need people to read the data or not. JSON and CSV are text formats and are human readable, whereas more performant formats such as Parquet or Avro are binary.

 Compression: Some formats offer higher compression rates than others.

 Schema evolution: Adding or removing fields is far more complicated in a data lake than in a database. Some formats, like Avro or Parquet, provide some degree of schema evolution, which allows you to change the data schema and still query the data. Tools such as the Delta Lake format provide even better support for dealing with schema changes.

 Compatibility: JSON and CSV are widely adopted and compatible with almost any tool, while more performant options have fewer integration points.

File formats

 Avro: Great for storing row data, very efficient. It has a schema and supports schema evolution. Great integration with Kafka. Supports file splitting. Use it for row-level operations or in Kafka. Great for writing data, slower to read.
 Parquet: Columnar storage. It has schema support. It works very well with Hive and Spark as a way to store columnar data in deep storage that is queried using SQL. Because it stores data in columns, query engines read only the selected columns rather than the entire data set, as opposed to Avro. Use it as a reporting layer.

 ORC: Similar to Parquet, but it offers better compression. It also provides better schema evolution support, but it is less popular.

Conclusion

As we can see, CSV and JSON are easy to use, human-readable, and common formats, but they lack many of the capabilities of other formats, making them too slow for querying the data lake. ORC and Parquet are widely used in the Hadoop ecosystem to query data, whereas Avro is also used outside of Hadoop, especially together with Kafka for ingestion, and it is very good for row-level ETL processing. Row-oriented formats have better schema evolution capabilities than column-oriented formats, making them a great option for data ingestion.
Detailed

PARQUET file format

Parquet is an open-source file format for Hadoop.

Parquet helps achieve efficient storage and performance. It is a column-oriented format, in which the values of each column across the records are stored together.

To understand the Parquet file format, let’s take an example.

Example: There is a table consisting of four fields: CustId, First_Name, Last_Name, and City. All the values for the CustId column are stored together, the values for the First_Name column are stored together, and every other column is stored in the same way. With this record schema, the table looks like:

CustId     First_Name   Last_Name   City
9563218    FN1          LN1         Delhi
9558120    FN2          LN2         Kolkata

For this table, the data in a row-wise storage format will be stored as follows:

9563218  FN1  LN1  Delhi  9558120  FN2  LN2  Kolkata

Whereas the same data in a column-oriented storage format will look like this:

9563218  9558120  FN1  FN2  LN1  LN2  Delhi  Kolkata

 The columnar storage format is relatively more efficient when the requirement is to fetch data by querying only a few columns of a table.

The column-oriented file format increases query performance because it takes less time to fetch the required column values. Less IO is also required, since the values of the required columns are stored adjacent to each other.
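
A hedged sketch of this in practice, assuming pandas with pyarrow installed (the file name is arbitrary and the data is just the example table above): the table is written to Parquet and only two columns are read back, so only those column chunks are scanned.

    # Write the example table to Parquet and read back selected columns only.
    import pandas as pd

    df = pd.DataFrame({
        "CustId": [9563218, 9558120],
        "First_Name": ["FN1", "FN2"],
        "Last_Name": ["LN1", "LN2"],
        "City": ["Delhi", "Kolkata"],
    })

    df.to_parquet("customers.parquet")  # stored column by column on disk
    subset = pd.read_parquet("customers.parquet", columns=["CustId", "City"])
    print(subset)  # only the CustId and City column chunks are read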

AVRO file format

Avro is a row-based storage format, which is widely used for serialization.


Avro relies on a schema, which is stored in JSON format; this makes it easy for any program to read and understand. The data itself is stored in a binary format, making it compact and efficient.
One of the prime features of Avro is that it supports dynamic data schemas that change
over time. Since this format supports schema evolution, it can easily handle schema
changes like missing fields, added fields, and changed fields.

 The Avro format is preferred for loading a data lake landing zone, because downstream systems can easily retrieve table schemas from the files, and any source schema changes can be easily handled.
 Due to its efficient serialization and deserialization properties, it offers good performance.
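
A hedged sketch of Avro's row-oriented storage and schema evolution, assuming the fastavro library (the schema and field names are illustrative): records written with an older schema are read with a newer one, and the added field is filled in from its default.

    # Write rows with schema v1, then read them back with schema v2.
    from fastavro import writer, reader

    writer_schema = {
        "type": "record", "name": "Customer",
        "fields": [
            {"name": "CustId", "type": "long"},
            {"name": "City", "type": "string"},
        ],
    }

    reader_schema = {
        "type": "record", "name": "Customer",
        "fields": [
            {"name": "CustId", "type": "long"},
            {"name": "City", "type": "string"},
            {"name": "email", "type": "string", "default": "unknown"},  # new field
        ],
    }

    records = [{"CustId": 9563218, "City": "Delhi"},
               {"CustId": 9558120, "City": "Kolkata"}]

    with open("customers.avro", "wb") as out:
        writer(out, writer_schema, records)  # rows stored along with the schema

    with open("customers.avro", "rb") as inp:
        for rec in reader(inp, reader_schema):  # old data, new reader schema
            print(rec)  # "email" is filled with its default value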

ORC file format

The Optimized Row Columnar (ORC) file format provides a highly efficient way to store data. This format was designed to overcome the limitations of other file formats. It improves the overall performance when Hive (a SQL-like query interface built on top of Hadoop) reads, writes, and processes the data.

ORC stores collections of rows in one file and within the collection, the row data is
stored in a columnar format.
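
For completeness, a hedged sketch of writing and reading ORC from Python, assuming a recent pyarrow build that ships the pyarrow.orc module (in practice ORC files are usually produced by Hive or Spark rather than written by hand):

    # Write a small table to ORC and read back a single column.
    import pyarrow as pa
    import pyarrow.orc as orc

    table = pa.table({"CustId": [9563218, 9558120],
                      "City": ["Delhi", "Kolkata"]})

    orc.write_table(table, "customers.orc")                    # columnar, compressed
    print(orc.read_table("customers.orc", columns=["City"]))   # read one column only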

Link of file formats: [Link]



Why use AWS Redshift over AWS S3?


AWS Redshift and S3 are both storage solutions offered by AWS, but they
have different use cases and strengths. Some reasons to use Redshift over
S3 include:

1. Performance: Redshift is optimized for fast query performance, making it well-suited for data warehousing and business intelligence workloads.
2. Structure: Redshift uses a columnar database model, which allows it to store
data in a way that is optimized for analytics and data retrieval. In contrast,
S3 is an object storage service and does not have a built-in structure for
querying data.
3. Integration: Redshift integrates with other AWS services such as QuickSight,
Glue, and Athena, allowing you to easily perform analytics on your data.
4. Cost: Redshift provides cost savings for storing and querying large amounts
of data. Redshift automatically compresses and stores data in a columnar
format, which reduces the amount of storage required compared to a row-
based storage format.

In conclusion, Redshift is a more specialized storage solution for use cases where query performance and analytics are important, while S3 is a more general-purpose storage solution that can be used for a wide range of use cases.
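
As a hedged illustration, the sketch below loads Parquet files from S3 into a Redshift table with COPY and then runs an analytic query. The cluster endpoint, credentials, table, bucket, and IAM role ARN are all placeholders, and psycopg2 is used only because Redshift speaks the PostgreSQL wire protocol.

    # Hypothetical: bulk-load data from S3 into Redshift, then query it.
    import psycopg2

    conn = psycopg2.connect(
        host="my-cluster.xxxxxx.us-east-1.redshift.amazonaws.com",  # placeholder
        port=5439, dbname="analytics", user="admin", password="...",
    )

    with conn, conn.cursor() as cur:
        # Load Parquet files sitting in S3 into an existing "sales" table.
        cur.execute("""
            COPY sales
            FROM 's3://my-bucket/sales/'
            IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
            FORMAT AS PARQUET;
        """)
        # A fast, columnar aggregation (the kind of workload Redshift is built for).
        cur.execute("SELECT region, SUM(amount) FROM sales GROUP BY region;")
        print(cur.fetchall())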

What is Airflow?

Airflow is an open-source platform for programmatically authoring, scheduling, and monitoring workflows. In the context of big data, Airflow can be used to manage and automate tasks related to data ingestion, transformation, and analysis, making it easier for data engineers and scientists to work with large datasets. Airflow's features, such as its intuitive UI, support for multiple data sources, and ability to handle complex dependencies, make it a popular choice for big data workflows.
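
A minimal sketch of an Airflow 2.x DAG with three placeholder tasks forming a small ingest, transform, load pipeline (the DAG id, schedule, and task bodies are illustrative only):

    # Minimal Airflow 2.x DAG: ingest -> transform -> load, run daily.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def ingest():
        print("pulling raw data")          # placeholder for real ingestion code

    def transform():
        print("cleaning and reshaping")    # placeholder

    def load():
        print("writing to the warehouse")  # placeholder

    with DAG(
        dag_id="daily_etl",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        t1 = PythonOperator(task_id="ingest", python_callable=ingest)
        t2 = PythonOperator(task_id="transform", python_callable=transform)
        t3 = PythonOperator(task_id="load", python_callable=load)

        t1 >> t2 >> t3  # dependencies that Airflow will schedule and monitor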

What is Step Functions?

Step Functions is a serverless workflow service provided by AWS (Amazon Web Services) that makes it easy to coordinate distributed applications and microservices using visual workflows. In the context of big data, Step Functions can be used to automate and manage big data processing pipelines, enabling data engineers and scientists to build and run complex data processing tasks using a visual interface. With Step Functions, you can model your entire big data processing workflow as a series of steps, including data ingestion, transformation, and analysis, and automate the coordination and execution of these steps. This makes it easier to manage and monitor big data processing workflows, and provides a flexible and scalable solution for big data processing.
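
As a hedged illustration, the sketch below defines a tiny two-step pipeline in Amazon States Language and starts an execution with boto3. The Lambda ARNs, role ARN, and state machine ARN are placeholders, and the state machine is assumed to exist already (the create call is shown commented out).

    # Hypothetical two-step Step Functions workflow driven from Python.
    import json
    import boto3

    definition = {
        "StartAt": "Ingest",
        "States": {
            "Ingest": {
                "Type": "Task",
                "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ingest",
                "Next": "Transform",
            },
            "Transform": {
                "Type": "Task",
                "Resource": "arn:aws:lambda:us-east-1:123456789012:function:transform",
                "End": True,
            },
        },
    }

    sfn = boto3.client("stepfunctions")

    # One-time setup (placeholder role ARN):
    # sfn.create_state_machine(name="daily-etl", definition=json.dumps(definition),
    #                          roleArn="arn:aws:iam::123456789012:role/StatesExecutionRole")

    # Kick off one run of the pipeline with an input payload.
    sfn.start_execution(
        stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:daily-etl",
        input=json.dumps({"date": "2023-01-01"}),
    )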
