DATA 228
Big Data Technologies and Applications (Fall 2024)
Sangjin Lee
Hadoop: YARN, Hadoop’s
distributed compute
Chapter 4, “Hadoop: The Definitive Guide”, 4th Edition, Tom White
What is distributed compute/computing?
A distributed compute system is a system which solves a computing problem utilizing a set of
multiple devices over a network.
Distributed compute
• Sometimes referred to as a Scheduler or Orchestrator
Elements of distributed compute
Fundamentals
• Run tasks in parallel across multiple machines
• Schedule tasks in an efficient and fair manner
• Monitor and account for resource usage across multiple machines
• Scale horizontally by adding more machines to the cluster
• Recover from all manners of failures: node failures, task failures, network failures, etc.
Elements of distributed compute
More advanced elements
• Support the notion of an “application”
• Support both stateless and stateful types of applications
• (Big-data-specific) Schedule tasks to be as local to data as possible
• (Stateless-app-specific) Provide support for traffic ingress for applications
Elements of distributed compute
Stateless and stateful
              Stateless                            Stateful
Coordination  No coordination required             Coordination required among workloads
Duration      Tends to run long (long-running)     Tends to complete its job and shut down
State         Tasks don't need to maintain state   State is critical to tasks
Examples      Web services, microservices          Big data jobs, batch jobs, Cassandra, AI/ML jobs requiring GPUs
Examples of distributed compute systems
• Not quite, but pointing the way: servlet containers (Tomcat, Jetty), application servers
• Mesos
• Kubernetes
• Orchestrates container workloads
• The leader in stateless distributed compute systems
• Supports more complex application types (stateful, GPU, etc.)
• Spawned an ecosystem of supporting technologies: Helm, containerd, Istio, Envoy, etc. (in CNCF)
• Old competition: Mesos
Examples of distributed compute systems
• YARN
• Orchestrates mostly big-data or data-related workloads
• Can support other types of applications
• Data-aware scheduling
Distributed compute in Hadoop
Characteristics
• Accomplish data processing over large data (TBs or PBs) within a short amount of time
• Use resources (memory, CPUs, network, and I/O) efficiently to accomplish this
• Handle data-aware scheduling
• Handle task scheduling in bursts
YARN
Basics
• Hadoop’s general-purpose distributed compute system
• It is NOT a data computing framework itself
• Data computing frameworks (MapReduce, Spark, Tez, etc.) are YARN applications
• Data practitioners don’t interact with YARN directly for the most part
• Supports applications and containers (tasks)
YARN
Architecture
• ResourceManager and NodeManagers
YARN
ResourceManager
• “One” for a single cluster
• Processes resource requests
• Resource requests
• Amount of compute resources being requested: virtual cores (vCPUs), memory (MBs), GPUs, etc.
• Locality constraints: specific node, specific rack, or off-rack (= anywhere)
• ResourceManager schedules containers based on resource requests
• ResourceManager tries to schedule based on the locality constraints
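Roughly what such a resource request looks like through YARN's Java client API; a minimal sketch, where the container size, node name, and rack name are made-up placeholders:

```java
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class ResourceRequestSketch {
  public static ContainerRequest buildRequest() {
    // Ask for one container with 1024 MB of memory and 1 vCore.
    Resource capability = Resource.newInstance(1024, 1);

    // Locality preference: a specific node first, then its rack; relaxLocality
    // lets the scheduler fall back to off-rack (anywhere) if neither is free.
    return new ContainerRequest(
        capability,
        new String[] {"worker-node-01"},  // preferred node (hypothetical hostname)
        new String[] {"/rack-1"},         // preferred rack (hypothetical rack name)
        Priority.newInstance(1),
        true);                            // relaxLocality
  }
}
```

An Application Master hands requests like this to the ResourceManager, which allocates containers that satisfy them as closely as capacity allows.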
YARN
ResourceManager
• Maintains the state of available and allocated resources on nodes
• Maintains the state of all running applications and containers
• ResourceManager requires a large amount of memory
• ResourceManager can be a scalability bottleneck
YARN
ResourceManager high availability (HA)
• Active and standby ResourceManagers
• ResourceManager state is persisted in an RMStateStore (filesystems, ZooKeeper, etc.)
• Manual and automatic failover
• HA can recover all running applications during failover
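These settings normally live in yarn-site.xml; a minimal sketch of the equivalent configuration keys, assuming hypothetical hosts rm1.example.com / rm2.example.com and a ZooKeeper-backed state store:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class RmHaConfigSketch {
  public static Configuration build() {
    Configuration conf = new YarnConfiguration();

    // Enable HA with two ResourceManagers, rm1 and rm2 (hostnames are placeholders).
    conf.setBoolean(YarnConfiguration.RM_HA_ENABLED, true);   // yarn.resourcemanager.ha.enabled
    conf.set(YarnConfiguration.RM_HA_IDS, "rm1,rm2");         // yarn.resourcemanager.ha.rm-ids
    conf.set("yarn.resourcemanager.hostname.rm1", "rm1.example.com");
    conf.set("yarn.resourcemanager.hostname.rm2", "rm2.example.com");

    // Persist RM state in ZooKeeper so the standby can recover running apps on failover.
    conf.set(YarnConfiguration.RM_STORE,                      // yarn.resourcemanager.store.class
        "org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore");
    conf.set("hadoop.zk.address", "zk1.example.com:2181");    // ZooKeeper ensemble (placeholder)
    return conf;
  }
}
```

With this in place, failover can be triggered manually or handled automatically by the ZooKeeper-based elector.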
YARN
NodeManager
• Responsible for launching and managing containers
• Runs health checks on the node it runs on and communicates the state to the RM
• Reports resource usage status to the RM
• NodeManager can be restarted without affecting running containers
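The health and usage that NodeManagers report to the RM can be read back from any client; a small sketch using the YarnClient API (the cluster configuration is assumed to be on the classpath):

```java
import java.util.List;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class NodeReportSketch {
  public static void main(String[] args) throws Exception {
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());
    yarnClient.start();

    // One report per RUNNING node: capacity, current usage, and health.
    List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
    for (NodeReport node : nodes) {
      System.out.printf("%s: used %s of %s, health: %s%n",
          node.getNodeId(), node.getUsed(), node.getCapability(),
          node.getHealthReport());
    }
    yarnClient.stop();
  }
}
```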
YARN
Application taxonomy
• Application
• A single entity that represents the distributed compute job as a whole
• Containers
• Individual compute tasks as part of the application
• Not the same as the (Docker/Kubernetes) container
• Application master (AM)
• A special container that manages the other containers for the application
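A sketch of how this taxonomy looks through the YarnClient API: applications contain attempts, and each attempt owns an AM container plus its worker containers (the printed labels are illustrative only):

```java
import org.apache.hadoop.yarn.api.records.ApplicationAttemptReport;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.ContainerReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class AppTaxonomySketch {
  public static void main(String[] args) throws Exception {
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());
    yarnClient.start();

    // Each ApplicationReport is one application; its attempts own the containers.
    for (ApplicationReport app : yarnClient.getApplications()) {
      System.out.println(app.getApplicationId() + " " + app.getName());
      for (ApplicationAttemptReport attempt :
           yarnClient.getApplicationAttempts(app.getApplicationId())) {
        // The AM is itself a container managed as part of the attempt.
        System.out.println("  AM container: " + attempt.getAMContainerId());
        for (ContainerReport container :
             yarnClient.getContainers(attempt.getApplicationAttemptId())) {
          System.out.println("  container: " + container.getContainerId());
        }
      }
    }
    yarnClient.stop();
  }
}
```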
YARN
YARN application
• Client requests an Application Master (AM) from the ResourceManager (RM)
• The RM selects a node to launch the AM container
• The AM may request more containers from the RM
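A condensed sketch of this flow from the client side using YARN's Java API; the application name, AM command, and sizes below are placeholders:

```java
import java.util.Collections;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SubmitAppSketch {
  public static void main(String[] args) throws Exception {
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());
    yarnClient.start();

    // 1. Client asks the RM for a new application.
    YarnClientApplication app = yarnClient.createApplication();
    ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
    appContext.setApplicationName("demo-app");              // placeholder name

    // 2. Describe the AM container; the RM picks a node and launches it there.
    appContext.setResource(Resource.newInstance(2048, 1));  // 2 GB, 1 vCore for the AM
    appContext.setAMContainerSpec(ContainerLaunchContext.newInstance(
        null, null,
        Collections.singletonList("java -jar my-am.jar"),   // placeholder AM command
        null, null, null));

    ApplicationId appId = yarnClient.submitApplication(appContext);
    System.out.println("Submitted " + appId);

    // 3. Once running, the AM itself (via AMRMClient) asks the RM for more containers.
    yarnClient.stop();
  }
}
```

Step 3 is where the AM issues resource requests like the earlier ContainerRequest sketch.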
YARN
Schedulers
• Scheduler: a ResourceManager component/algorithm that allocates containers based on a
certain policy
• A scheduler tries to optimize resource usage (utilization) and the timeliness of application
completions (throughput)
• The central consideration is multi-tenancy
• YARN supports 3 schedulers: the FIFO (first-in-first-out) scheduler, the Capacity scheduler, and the Fair
scheduler
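The scheduler is selected with a single ResourceManager property, normally set in yarn-site.xml; a sketch of the equivalent setting and the three implementation classes:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SchedulerChoiceSketch {
  public static Configuration build() {
    Configuration conf = new YarnConfiguration();
    // yarn.resourcemanager.scheduler.class selects the scheduler implementation:
    //   org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler
    //   org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler (default)
    //   org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler
    conf.set(YarnConfiguration.RM_SCHEDULER,
        "org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler");
    return conf;
  }
}
```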
YARN
Schedulers
• FIFO scheduler
• Simplest scheduler
• It’s a FIFO (first-in-first-out) queue
• Applications run in the order of submission: the next application gets scheduled after the
previous applications have been completed
• Not suitable for a multi-tenant cluster
YARN
Schedulers
• Capacity scheduler (default)
• Capacity is partitioned with multiple dedicated queues (e.g. teams, groups of apps, etc.)
• Queues get resource guarantees even if other queues are contended
• Improves overall throughput compared to FIFO
• Provides teams with predictable capacity
• May strand resources if utilization across queues is uneven
• Queue sizing becomes very important
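Which queue an application lands in is chosen at submission time; a minimal sketch, assuming a hypothetical queue named "teamA" has been defined in capacity-scheduler.xml (e.g. via yarn.scheduler.capacity.root.queues and yarn.scheduler.capacity.root.teamA.capacity):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class QueueSubmissionSketch {
  public static Job buildJob() throws Exception {
    Configuration conf = new Configuration();
    // Route this MapReduce job to the hypothetical "teamA" queue; it will be
    // scheduled against that queue's guaranteed capacity.
    conf.set("mapreduce.job.queuename", "teamA");
    return Job.getInstance(conf, "capacity-demo");
  }
}
```

The same property can also be passed on the command line (e.g. -Dmapreduce.job.queuename=teamA) when running the demo sleep job below.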
YARN
Schedulers
• Fair scheduler
• Dynamically balances resources between all running apps so that all apps get an equal share
of resources over time: “fair share”
• Tries to balance between having all apps make good progress and letting large apps finish in a
timely manner
• Queues are, on paper, unnecessary, but they can be used to increase predictability
• It can be the best of both worlds in large multi-tenanted clusters
• Some loss of predictability
YARN
Demo
Exploring schedulers using a sleep app (job)
YARN
Demo: sleep job
• The jar for the sleep job (hadoop-mapreduce-client-jobclient-3.4.0-tests.jar) is available on Canvas
• Sleep job parameters
• -m: number of mappers
• -r: number of reducers
• -mt: mapper sleep duration (ms)
• -rt: reducer sleep duration (ms)
• bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.4.0-tests.jar
sleep -m 2 -r 1 -mt 90000 -rt 90000
YARN
Demo: sleep job
• AM size: 2 GB memory and 1 vCore
• MR container size: 1 GB memory and 1 vCore each
• (My) YARN cluster size: 8 GB memory and 8 vCores
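Rough arithmetic for these numbers: each sleep job needs 2 GB (and 1 vCore) for its AM plus 1 GB (and 1 vCore) per concurrently running map or reduce task, so one job with 2 mappers occupies about 4 GB and 3 vCores of the 8 GB / 8 vCore cluster. Submitting a second job therefore forces the scheduler to decide how the remaining capacity is shared, which is what the following slides walk through.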
YARN
Demo: sleep job: first app
YARN
Demo: sleep job: first app (mappers running)
YARN
Demo: sleep job: second app submitted
YARN
Demo: sleep job: second app running mappers
YARN
Demo: sleep job: second app completes running