Simplifying Hadoop Usage and
Administration
Or, With Great Power Comes Great
Responsibility in MapReduce Systems
Shivnath Babu
Duke University
[Timeline figure: Relational DBMS and MapReduce/Hadoop plotted on the same time axis]
Relational DBMS:
  1975-1985: New & useful technology
  1985-1995: Features +++++, Open source ++
  1995-2005: Manageability crisis, Research +++
  2005-2010: Claims of self-managing, hard to add new features
MapReduce/Hadoop:
  New & useful technology, then Features +++++, Open source ++
  2020: ?
Different Aspects of Manageability
Testing
Tuning
Diagnosis
Applying fixes
Configuring
Benchmarking
Capacity planning
Disaster/failure
recovery automation
Detection/repair of
data corruption
Roles (often overlap)
User (writes MapReduce
programs, Pig scripts,
HiveQL queries, etc.)
Developer
Administrator
Lifecycle of a MapReduce Job
Map function
Reduce function
Run this program as a
MapReduce job
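The map and reduce functions above can be sketched in Python, with word count as the running example. This is an illustrative simulation, not Hadoop's API; the in-memory grouping step stands in for Hadoop's shuffle.

```python
from collections import defaultdict

# Map function: emit (word, 1) for every word in an input record.
def map_fn(record):
    for word in record.split():
        yield (word, 1)

# Reduce function: sum the counts for one key.
def reduce_fn(word, counts):
    return (word, sum(counts))

def run_job(records):
    # Map phase, followed by an in-memory "shuffle" that groups by key.
    intermediate = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            intermediate[key].append(value)
    # Reduce phase: one reduce_fn call per distinct key.
    return dict(reduce_fn(k, v) for k, v in intermediate.items())

print(run_job(["the quick fox", "the lazy dog"]))
# {'the': 2, 'quick': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```

In Hadoop the same two functions are supplied as Mapper and Reducer classes, and the framework handles the shuffle, partitioning, and sorting between them.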
Lifecycle of a MapReduce Job
[Figure: job timeline. Input splits are processed by map tasks in waves (Map Wave 1, Map Wave 2), followed by reduce tasks in waves (Reduce Wave 1, Reduce Wave 2)]
How are the number of splits, number of map and reduce
tasks, memory allocation to tasks, etc., determined?
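One of these decisions, the number of map tasks, falls out of the input-split computation. A Python sketch of the split-size rule used by Hadoop's FileInputFormat, split size = max(min split, min(max split, block size)), with one map task per split; the reduce-task count, by contrast, comes directly from the mapred.reduce.tasks parameter:

```python
import math

# Split size as computed by Hadoop's FileInputFormat:
# max(minSplitSize, min(maxSplitSize, blockSize)).
def compute_split_size(block_size, min_split_size, max_split_size):
    return max(min_split_size, min(max_split_size, block_size))

# One map task per input split.
def num_map_tasks(file_size, block_size, min_split=1, max_split=float("inf")):
    split_size = compute_split_size(block_size, min_split, max_split)
    return math.ceil(file_size / split_size)

# 50 GB of input with 64 MB blocks -> 800 splits -> 800 map tasks.
print(num_map_tasks(50 * 1024**3, 64 * 1024**2))
# 800
```

With the defaults shown, the split size equals the HDFS block size, so the map-task count tracks the input size; raising the minimum split size is one way to get fewer, larger map tasks.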
Job Configuration Parameters
190+ parameters in
Hadoop
Set manually or defaults
are used
Are defaults or rules-of-thumb good enough?
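To see why hand-tuning is daunting: even a crude search that tries just two candidate values for each of the 190+ parameters faces an astronomically large configuration space. A back-of-the-envelope illustration:

```python
# Back-of-the-envelope: 2 candidate values for each of 190 parameters.
num_params = 190
values_per_param = 2
search_space = values_per_param ** num_params

# The full grid exceeds 10^57 configurations.
print(search_space > 10**57)
# True
```

And, as the Terasort result below suggests, the parameters interact, so tuning them one at a time is not a reliable shortcut either.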
Experiments
On EC2 and local clusters
[Charts: y-axes show running time in seconds and minutes across configuration settings]
Illustrative Result: 50GB Terasort
17-node cluster, 64+32 concurrent map+reduce slots

| mapred.reduce.tasks | io.sort.factor | io.sort.record.percent | Notes                          |
|---------------------|----------------|------------------------|--------------------------------|
| 10                  | 10             | 0.15                   | based on popular rule-of-thumb |
| 10                  | 500            | 0.15                   |                                |
| 28                  | 10             | 0.15                   |                                |
| 300                 | 10             | 0.15                   |                                |
| 300                 | 500            | 0.15                   |                                |

[Chart: running time for each setting]
Performance at default and rule-of-thumb settings can be poor
Cross-parameter interactions are significant
Problem Space
[Figure: two-dimensional problem space]
Complexity: job configuration parameters, space of execution choices, declarative HiveQL/Pig operations, multi-job workflows
Performance objectives: energy considerations, cost in pay-as-you-go environment
Current approaches: predominantly manual, post-mortem analysis
Is this where we want to be?
Challenges
Features of Hadoop from a usability perspective:
- Ability to specify schema late
- Easy integration with programming lang.
- Pluggability: input data formats, storage engines, schedulers, instrumentation

These features are very useful when dealing with:
- Multiple data formats
- Mix of structured and unstructured data
- Multiple computational engines (e.g., R, DBMS)
- Changes/evolution

But, they pose nontrivial manageability challenges
Some Thoughts on Possible Solutions
Exploit opportunities to learn
Schema can be learned from Pig Latin scripts, HiveQL queries,
MapReduce jobs
Profile-driven optimization from the compiler world
High ratio of repeated jobs to new jobs is common
Exploit the MapReduce/Hadoop design
Common sort-partition-merge skeleton
Design for robustness gives many mechanisms for adaptation &
observation (speculative execution, storing intermediate data)
Multiple map waves
Fine-grained and pluggable scheduler
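The "high ratio of repeated jobs" observation suggests keeping per-job profiles around and reusing them when the same job comes back. A hypothetical sketch; the job signature and profile store are illustrative, not a Hadoop API:

```python
# Hypothetical profile store: because the same jobs run repeatedly,
# measurements from past runs can seed tuning decisions for new runs.
profile_store = {}

def job_signature(job):
    # Assumed signature: the job's program name plus its input format.
    return (job["name"], job["input_format"])

def record_profile(job, runtime_seconds):
    profile_store.setdefault(job_signature(job), []).append(runtime_seconds)

def predicted_runtime(job):
    # Average of past runtimes, or None if the job has never been seen.
    history = profile_store.get(job_signature(job))
    return sum(history) / len(history) if history else None

job = {"name": "daily-etl", "input_format": "TextInputFormat"}
record_profile(job, 620.0)
record_profile(job, 580.0)
print(predicted_runtime(job))
# 600.0
```

A real profile would of course record far more than runtime (per-phase timings, data sizes, the configuration used), which is what makes profile-driven optimization of the next run possible.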
Some Thoughts on Possible Solutions
Automate try-it-out and trial-and-error approaches
For example, use 5% of cluster resources to run MapReduce
tasks with a different configuration
Exploit the cloud's pay-as-you-go resources, e.g., EC2 spot instances
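A hedged sketch of the try-it-out idea: run a small sample of the job's tasks under each candidate configuration and keep the cheapest. Here run_sample_tasks and the toy cost model are stand-ins for real measurement on ~5% of cluster resources, not Hadoop interfaces:

```python
# Pick the candidate configuration whose sampled tasks ran fastest.
# run_sample_tasks stands in for executing ~5% of a job's tasks under
# a given configuration and returning their average running time.
def pick_best_config(candidates, run_sample_tasks):
    timings = {name: run_sample_tasks(cfg) for name, cfg in candidates.items()}
    return min(timings, key=timings.get)

# Toy cost model used only for illustration (lower is better).
def toy_runtime(cfg):
    return 100 / cfg["io.sort.factor"] + cfg["mapred.reduce.tasks"] * 0.1

candidates = {
    "rule-of-thumb": {"io.sort.factor": 10, "mapred.reduce.tasks": 10},
    "candidate-A":   {"io.sort.factor": 500, "mapred.reduce.tasks": 10},
    "candidate-B":   {"io.sort.factor": 500, "mapred.reduce.tasks": 300},
}
print(pick_best_config(candidates, toy_runtime))
# candidate-A
```

The design choice is to pay a small, bounded overhead (the sampled tasks) in exchange for measurements on the actual job and data, rather than relying on defaults or rules of thumb.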
[Timeline figure, repeated from the opening: Relational DBMS and MapReduce/Hadoop plotted on the same time axis]
Relational DBMS:
  1975-1985: New & useful technology
  1985-1995: Features +++++, Open source ++
  1995-2005: Manageability crisis, Research +++
  2005-2010: Claims of self-managing, hard to add new features
MapReduce/Hadoop:
  New & useful technology, then Features +++++, Open source ++
  2020: ?