Cloudera Tutorial
Introduction to Data Science with Hadoop
Glynn Durham, Senior Instructor, Cloudera
glynn@[Link]


Terms
I will cover:
Hadoop, Hadoop ecosystem
HDFS
MapReduce
Sqoop
Flume
Hive
Pig
Mahout
Machine learning
Data science using Hadoop

with a few extras:
YARN
HBase
Impala
Oozie
data products


Hadoop
Hadoop is:
a platform for big data
several Apache Software Foundation (ASF) projects
free open source software
Major parts:
Hadoop Core
Hadoop ecosystem
Copyright 2010-2013 Cloudera. All rights reserved. Not to be reproduced without prior written consent.


Hadoop Core Main Features: File System and Batch Programming


Hadoop Core

Hadoop Core consists of:

HDFS (Hadoop Distributed File System), for storage
MapReduce, for batch programming


HDFS Writes


HDFS Reads


HDFS Strengths and Weaknesses

HDFS is good at:

storing enormous files
storing a lot of data reliably
throughput on sequential writes
throughput on sequential reads of a file or part of a file

HDFS is not good at:

high speed random reads of parts of a file

HDFS cannot:
update any part of a file once written*

* but you can always write a new file, and/or delete, move, and rename files and directories

MapReduce: Programming with Simple Functions
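
As a rough illustration of what "simple functions" can mean here, the sketch below models a word count as a map function and a reduce function, with a toy single-process driver standing in for the framework. The names `map_words`, `reduce_counts`, and `run_mapreduce` are made up for this example and are not part of any Hadoop API; real jobs run the same two roles across many machines.

```python
from collections import defaultdict


def map_words(line):
    """Map: emit one (word, 1) pair for each word in a line of text."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce: combine every value that was emitted for one key."""
    return word, sum(counts)


def run_mapreduce(records, mapper, reducer):
    """Toy driver (an assumption, not Hadoop): map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # grouping stands in for the shuffle
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


lines = ["the quick brown fox", "the lazy dog", "The fox"]
word_counts = run_mapreduce(lines, map_words, reduce_counts)
print(word_counts)
```

The programmer writes only the two small functions; grouping the mapped pairs by key is the part a framework would handle.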


MapReduce Chains


MapReduce at Scale


MapReduce in Hadoop


MapReduce Strengths and Weaknesses

MapReduce is good at:

processing enormous amounts of data
scaling out as you add more machines
continuing to completion, even when some machines die

MapReduce is not good at:

running any algorithm you can think up
algorithms that require shared state overall*
* but maybe you can get clever with your algorithm design

MapReduce cannot:
run in real time: MapReduce jobs are batch jobs

Detour: YARN, Yet Another Resource Negotiator (near future)


Hadoop Ecosystem
The Hadoop Ecosystem consists of other projects that round out Hadoop Core to make it a useful platform:

Sqoop, for RDBMS integration
Flume, for event ingestion
Hive, for "SQL"-like high-level programming
Pig, another high-level programming paradigm
Mahout, a Java library for machine learning in Hadoop
Plus:
HBase, a "NoSQL" database system
Oozie, a workflow manager for Hadoop actions
....

Sqoop: RDBMS to Hadoop and Back


Flume: Ingesting Continuing Event Data


Detour: General File Input/Output


MapReduce revisited: How to write MapReduce programs?

Java MapReduce API

The most expressive technique possible

The most work, by far

(Can be easier with Hadoop Streaming: a way to use streaming programming such as shell scripting or Python)
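
To make the Hadoop Streaming remark concrete, here is a minimal Python sketch of a streaming-style mapper and reducer. In Hadoop Streaming both roles read lines from standard input and write tab-separated key/value lines to standard output, and the reducer receives its input sorted by key. The `user,amount` input format and the function names are assumptions invented for this example; the local `sorted()` call only imitates the sort that the framework performs between the two phases.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Turn hypothetical 'user,amount' lines into 'user<TAB>amount' records."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank input lines
        user, amount = line.split(",", 1)
        yield f"{user}\t{amount}"


def reducer(sorted_lines):
    """Sum the amounts for each user; relies on input sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for user, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(float(amount) for _, amount in group)
        yield f"{user}\t{total:.2f}"


def main(role, stream=sys.stdin, out=sys.stdout):
    """Entry point a real streaming script would call with 'map' or 'reduce'."""
    step = mapper if role == "map" else reducer
    for record in step(stream):
        out.write(record + "\n")


# Local simulation of map -> sort -> reduce on made-up sample data.
sample = ["alice,10.50", "bob,3.00", "alice,4.50", ""]
mapped = sorted(mapper(sample))
totals = list(reducer(mapped))
print(totals)
```

Split into two scripts that each call `main` with their role, the same functions could serve as the mapper and reducer programs of a streaming job, with no Java involved.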

Hive: MapReduce as "SQL"

Familiar language and programming paradigm

Provides interface to many SQL-compliant tools


Detour: Impala, High Speed Analytics in Hadoop

5 to 30 times faster than Hive queries (sometimes hundreds of times faster!)

Cloudera exclusive offering, but Apache licensed, so it's free and open source

Impala Does Not Use MapReduce


Detour: HBase, A NoSQL Database System


Detour: A bit more about HBase

HBase is a NoSQL database system:

programmers create and use database tables
high volume, high performance access to individual cells
much weaker query language than SQL
lacks ACID-compliant transactions

HBase is not strictly needed to do "data science"

a resource hog; competes with analytical programs
often deployed on its own separate cluster
may be part of your organization's data storage and delivery, so you may need to get or put data into an HBase system*
* (or other NoSQL system)

Pig: Another Language for MapReduce


Mahout: Machine Learning in MapReduce

Mahout is:
a collection of algorithms, mainly focused on "the three C's" of machine learning
written in Java
largely implemented over Hadoop MapReduce
invocable from the command line
extensible, with the Java API
Mahout is not:
a turnkey solution for doing machine learning
always user-friendly

Machine Learning

"The three C's" of machine learning:
classification
clustering
collaborative filtering


Supervised Machine Learning: Classification


Machine Learning: Clustering


Machine Learning: Collaborative Filtering for Recommenders


Simple Enterprise Deployment: Hadoop as ETL Appliance


Detour: Oozie, Workflow within Hadoop

Simple workflow within Hadoop:
1. Clear out staging directory in HDFS
2. Sqoop import from OLTP tables
3. Hive (or Pig) script to transform data
4. Sqoop export to data warehouse
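
The control flow of the four steps above can be modeled in a few lines of Python. This is only a sketch of the sequencing idea: Oozie itself describes workflows in XML, and every action below is a hypothetical placeholder that records its name instead of touching HDFS, Sqoop, or Hive. The names `run_workflow` and `placeholder` are invented for this example.

```python
def run_workflow(steps):
    """Run (name, action) pairs in order and stop at the first failure."""
    completed = []
    for name, action in steps:
        try:
            action()
        except Exception as error:
            # Later steps are skipped so they never see half-finished data.
            return completed, f"{name} failed: {error}"
        completed.append(name)
    return completed, None


executed = []


def placeholder(name):
    """Hypothetical stand-in for a real action (HDFS command, Sqoop job, Hive script)."""
    def action():
        executed.append(name)
    return action


step_names = [
    "clear staging directory",
    "sqoop import from OLTP tables",
    "hive script to transform data",
    "sqoop export to data warehouse",
]
workflow = [(name, placeholder(name)) for name in step_names]
completed, error = run_workflow(workflow)
print(completed, error)
```

Stopping at the first failed step is the design point: an export should not run if the import or the transform before it did not finish.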


Hadoop: The Bigger Picture


Data Science with Hadoop

A data scientist will:
1. Identify internal and external data for potential use (general data wrangling tools).
2. Help build ingestion pipelines to obtain data for use (Flume, Sqoop, other).
3. Examine, clean, and anonymize ingested data (Hive, Impala, Pig, Hadoop Streaming).
4. Shape data into useful formats (Hive, Pig).
5. Explore data sets to gain understanding of problems, trends, reality (Impala, Hive, Pig, statistical programming).
6. Build predictive models using statistical programming, machine learning (Mahout).
7. Contribute to data products: products in the organization that are built in large part from the data itself (Mahout, Sqoop export, general file export).
8. Conduct experiments with data products, quantifying benefits and/or tradeoffs of system changes (Flume, Sqoop, statistical tests).
9. Communicate results and insights to stakeholders (visualization*).


Visualization: Needs Visualization Software


Thank you!
Questions? Contributions?
Glynn Durham, Senior Instructor, Cloudera
glynn@[Link]
