Simplifying Hadoop Usage and
Administration
Or, With Great Power Comes Great
Responsibility in MapReduce Systems
Shivnath Babu
Duke University
[Timeline figure: Relational DBMS and MapReduce/Hadoop plotted on the same time axis]
Relational DBMS:
  1975-1985: New & useful technology
  1985-1995: Features +++++, Open source ++
  1995-2005: Manageability crisis, Research +++
  2005-2010: Claims of self-managing, hard to add new features
MapReduce/Hadoop:
  New & useful technology, then Features +++++, Open source ++
  2020: ?
Different Aspects of Manageability
Testing
Tuning
Diagnosis
Applying fixes
Configuring
Benchmarking
Capacity planning
Disaster/failure
recovery automation
Detection/repair of
data corruption
Roles (often overlap)
User (writes MapReduce
programs, Pig scripts,
HiveQL queries, etc.)
Developer
Administrator
Lifecycle of a MapReduce Job
Map function
Reduce function
Run this program as a
MapReduce job
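The map and reduce functions above can be sketched in Python, with word count as the running example. This is an illustrative simulation, not Hadoop's API; the in-memory grouping step stands in for Hadoop's shuffle.

```python
from collections import defaultdict

# Map function: emit (word, 1) for every word in an input record.
def map_fn(record):
    for word in record.split():
        yield (word, 1)

# Reduce function: sum the counts for one key.
def reduce_fn(word, counts):
    return (word, sum(counts))

def run_job(records):
    # Map phase, followed by an in-memory "shuffle" that groups by key.
    intermediate = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            intermediate[key].append(value)
    # Reduce phase: one reduce_fn call per distinct key.
    return dict(reduce_fn(k, v) for k, v in intermediate.items())

print(run_job(["the quick fox", "the lazy dog"]))
# {'the': 2, 'quick': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```

In Hadoop the same two functions are supplied as Mapper and Reducer classes, and the framework handles the shuffle, partitioning, and sorting between them.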
Lifecycle of a MapReduce Job
[Figure: job timeline. Input splits are processed by map tasks in waves (Map Wave 1, Map Wave 2), followed by reduce tasks in waves (Reduce Wave 1, Reduce Wave 2)]
How are the number of splits, number of map and reduce
tasks, memory allocation to tasks, etc., determined?
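One of these decisions, the number of map tasks, falls out of the input-split computation. A Python sketch of the split-size rule used by Hadoop's FileInputFormat, split size = max(min split, min(max split, block size)), with one map task per split; the reduce-task count, by contrast, comes directly from the mapred.reduce.tasks parameter:

```python
import math

# Split size as computed by Hadoop's FileInputFormat:
# max(minSplitSize, min(maxSplitSize, blockSize)).
def compute_split_size(block_size, min_split_size, max_split_size):
    return max(min_split_size, min(max_split_size, block_size))

# One map task per input split.
def num_map_tasks(file_size, block_size, min_split=1, max_split=float("inf")):
    split_size = compute_split_size(block_size, min_split, max_split)
    return math.ceil(file_size / split_size)

# 50 GB of input with 64 MB blocks -> 800 splits -> 800 map tasks.
print(num_map_tasks(50 * 1024**3, 64 * 1024**2))
# 800
```

With the defaults shown, the split size equals the HDFS block size, so the map-task count tracks the input size; raising the minimum split size is one way to get fewer, larger map tasks.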
Job Configuration Parameters
190+ parameters in
Hadoop
Set manually or defaults
are used
Are defaults or rules-of-thumb good enough?
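To see why hand-tuning is daunting: even a crude search that tries just two candidate values for each of the 190+ parameters faces an astronomically large configuration space. A back-of-the-envelope illustration:

```python
# Back-of-the-envelope: 2 candidate values for each of 190 parameters.
num_params = 190
values_per_param = 2
search_space = values_per_param ** num_params

# The full grid exceeds 10^57 configurations.
print(search_space > 10**57)
# True
```

And, as the Terasort result below suggests, the parameters interact, so tuning them one at a time is not a reliable shortcut either.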
Experiments
On EC2 and local clusters
[Charts: y-axes show running time in seconds and minutes across configuration settings]
Illustrative Result: 50GB Terasort
17-node cluster, 64+32 concurrent map+reduce slots

| mapred.reduce.tasks | io.sort.factor | io.sort.record.percent | Notes                          |
|---------------------|----------------|------------------------|--------------------------------|
| 10                  | 10             | 0.15                   | based on popular rule-of-thumb |
| 10                  | 500            | 0.15                   |                                |
| 28                  | 10             | 0.15                   |                                |
| 300                 | 10             | 0.15                   |                                |
| 300                 | 500            | 0.15                   |                                |

[Chart: running time for each setting]
Performance at default and rule-of-thumb settings can be poor
Cross-parameter interactions are significant
Problem Space
[Figure: two-dimensional problem space]
Complexity: job configuration parameters, space of execution choices, declarative HiveQL/Pig operations, multi-job workflows
Performance objectives: energy considerations, cost in pay-as-you-go environment
Current approaches: predominantly manual, post-mortem analysis
Is this where we want to be?
Challenges
Features of Hadoop from a usability perspective:
- Ability to specify schema late
- Easy integration with programming lang.
- Pluggability: input data formats, storage engines, schedulers, instrumentation

These features are very useful when dealing with:
- Multiple data formats
- Mix of structured and unstructured data
- Multiple computational engines (e.g., R, DBMS)
- Changes/evolution

But, they pose nontrivial manageability challenges
Some Thoughts on Possible Solutions
Exploit opportunities to learn
Schema can be learned from Pig Latin scripts, HiveQL queries,
MapReduce jobs
Profile-driven optimization from the compiler world
High ratio of repeated jobs to new jobs is common
Exploit the MapReduce/Hadoop design
Common sort-partition-merge skeleton
Design for robustness gives many mechanisms for adaptation &
observation (speculative execution, storing intermediate data)
Multiple map waves
Fine-grained and pluggable scheduler
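The "high ratio of repeated jobs" observation suggests keeping per-job profiles around and reusing them when the same job comes back. A hypothetical sketch; the job signature and profile store are illustrative, not a Hadoop API:

```python
# Hypothetical profile store: because the same jobs run repeatedly,
# measurements from past runs can seed tuning decisions for new runs.
profile_store = {}

def job_signature(job):
    # Assumed signature: the job's program name plus its input format.
    return (job["name"], job["input_format"])

def record_profile(job, runtime_seconds):
    profile_store.setdefault(job_signature(job), []).append(runtime_seconds)

def predicted_runtime(job):
    # Average of past runtimes, or None if the job has never been seen.
    history = profile_store.get(job_signature(job))
    return sum(history) / len(history) if history else None

job = {"name": "daily-etl", "input_format": "TextInputFormat"}
record_profile(job, 620.0)
record_profile(job, 580.0)
print(predicted_runtime(job))
# 600.0
```

A real profile would of course record far more than runtime (per-phase timings, data sizes, the configuration used), which is what makes profile-driven optimization of the next run possible.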
Some Thoughts on Possible Solutions
Automate try-it-out and trial-and-error approaches
For example, use 5% of cluster resources to run MapReduce
tasks with a different configuration
Exploit the cloud's pay-as-you-go resources, e.g., EC2 spot instances
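A hedged sketch of the try-it-out idea: run a small sample of the job's tasks under each candidate configuration and keep the cheapest. Here run_sample_tasks and the toy cost model are stand-ins for real measurement on ~5% of cluster resources, not Hadoop interfaces:

```python
# Pick the candidate configuration whose sampled tasks ran fastest.
# run_sample_tasks stands in for executing ~5% of a job's tasks under
# a given configuration and returning their average running time.
def pick_best_config(candidates, run_sample_tasks):
    timings = {name: run_sample_tasks(cfg) for name, cfg in candidates.items()}
    return min(timings, key=timings.get)

# Toy cost model used only for illustration (lower is better).
def toy_runtime(cfg):
    return 100 / cfg["io.sort.factor"] + cfg["mapred.reduce.tasks"] * 0.1

candidates = {
    "rule-of-thumb": {"io.sort.factor": 10, "mapred.reduce.tasks": 10},
    "candidate-A":   {"io.sort.factor": 500, "mapred.reduce.tasks": 10},
    "candidate-B":   {"io.sort.factor": 500, "mapred.reduce.tasks": 300},
}
print(pick_best_config(candidates, toy_runtime))
# candidate-A
```

The design choice is to pay a small, bounded overhead (the sampled tasks) in exchange for measurements on the actual job and data, rather than relying on defaults or rules of thumb.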
[Timeline figure, repeated from the opening: Relational DBMS and MapReduce/Hadoop plotted on the same time axis]
Relational DBMS:
  1975-1985: New & useful technology
  1985-1995: Features +++++, Open source ++
  1995-2005: Manageability crisis, Research +++
  2005-2010: Claims of self-managing, hard to add new features
MapReduce/Hadoop:
  New & useful technology, then Features +++++, Open source ++
  2020: ?