Big Data Open Source Implementation and Administration

Hours: 40
Instructor: Ing. Yonogy Curi

This course covers the implementation and administration of an open source Big Data stack. Its road map spans four modules: Big Data management systems, data acquisition and storage, data access and processing, and data unification and analysis. A Big Data virtual machine is used throughout to demonstrate technologies such as Apache Hadoop, HDFS, MapReduce, YARN, Hive, Pig, and Impala. In the table of contents below, references use lesson-page numbering (for example, 1-5 is page 5 of Lesson 1).

1 Introduction
Objectives 1-2
Questions About You 1-3
Course Objectives 1-4
Course Road Map: Module 1 Big Data Management System 1-5
Course Road Map: Module 2 Data Acquisition and Storage 1-6
Course Road Map: Module 3 Data Access and Processing 1-7
Course Road Map: Module 4 Data Unification and Analysis 1-8
The Big Data Virtual Machine (Used in this Course) Home Page 1-10
Connecting to the Practice Environment 1-11
Starting the Big Data Virtual Machine (VM) Used in this Course 1-12
Starting the Big Data (BDLite) Virtual Machine (VM) Used in this Course 1-13
Accessing the Getting Started Page from the BDVM 1-14
Big Data Appliances Documentation 1-19

2 Big Data and the Information Management System


Lesson Objectives 2-3
Big Data: A Strategic IM Perspective 2-4
Big Data 2-5
Characteristics of Big Data 2-6
Importance of Big Data 2-8
Big Data Opportunities: Some Examples 2-9
Big Data Challenges 2-10
Information Management Landscape 2-12
Extending the Boundaries of Information Management 2-13
A Simple Functional Model for Big Data 2-14
Information Management Conceptual Architecture 2-16
Design Patterns to Component Usage Map 2-18
Big Data Adoption and Implementation Patterns 2-20
IM Architecture Data Approaches: Schema-on-Write vs Schema-on-Read 2-22
Course Approach: Big Data Project Phases 2-24
IM System for Big Data 2-26
Additional Resources 2-30
Summary 2-31
3 Using the Big Data Virtual Machine
Objectives 3-3
Lesson Agenda 3-4
Big Data Virtual Machine: Introduction 3-5
Big Data VM Components 3-6
Initializing the Environment for the Big Data VM 3-7
Initializing the Environment 3-8
Lesson Agenda 3-9
MoviePlex Case Study: Introduction 3-10
Big Data Challenge 3-12
Derive Value from Big Data 3-13
MoviePlex: Goal 3-14
MoviePlex: Big Data Challenges 3-15
MoviePlex: Architecture 3-16
MoviePlex: Data Generation 3-17
MoviePlex: Data Generation Format 3-18
MoviePlex Application 3-19
Summary 3-20

4 Introduction to the Big Data Ecosystem


Objectives 4-3
Computer Clusters 4-4
Distributed Computing 4-5
Apache Hadoop 4-6
Types of Analysis That Use Hadoop 4-7
Apache Hadoop Ecosystem 4-8
Apache Hadoop Core Components 4-9
HDFS Key Definitions 4-11
NameNode (NN) & DataNodes 4-12
MapReduce Framework 4-14
Benefits of MapReduce 4-15
MapReduce Job 4-16
MapReduce Versions 4-19
Choosing a Hadoop Distribution and Version 4-20
Additional Resources: Apache Hadoop 4-22
Cloudera’s Distribution Including Apache Hadoop (CDH) 4-23
CDH Architecture 4-24
CDH Components 4-25
CDH Architecture 4-26
CDH Components 4-28
Summary 4-30
5 Introduction to the Hadoop Distributed File System (HDFS)
Objectives 5-3
HDFS: Characteristics 5-5
HDFS Deployments: High Availability (HA) and Non-HA 5-7
HDFS Key Definitions 5-8
Functions of the NameNode 5-10
Secondary NameNode (Non-HA) 5-11
Functions of DataNodes 5-13
NameNode and Secondary NameNodes 5-14
Storing and Accessing Data Files in HDFS 5-15
HDFS Architecture: HA 5-17
Configuring an HA Cluster: Hardware Resources 5-19
Data Replication Process 5-26
Accessing HDFS 5-27
HDFS Commands 5-29
Shell Interface 5-30
Accessing HDFS 5-32
FS Shell Commands 5-33
Sample FS Shell Commands 5-35
HDFS Administration Commands 5-38
Using the hdfs fsck Command: Example 5-39
HDFS Features and Benefits 5-40
Summary 5-41
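
As a preview of the HDFS access topics above (FS shell and programmatic access), here is a minimal sketch using the Hadoop Java FileSystem API. It only illustrates the idea; the NameNode URI and the /user/demo paths are assumptions, not values from the course.

    // Minimal sketch: listing a directory and reading a file in HDFS with
    // the Java FileSystem API. URI and paths are illustrative placeholders.
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsListAndRead {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:8020"); // assumed NameNode URI
        FileSystem fs = FileSystem.get(conf);

        // Analogous to the FS shell command: hdfs dfs -ls /user/demo
        for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
          System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }

        // Analogous to: hdfs dfs -cat /user/demo/sample.txt
        try (BufferedReader in = new BufferedReader(
            new InputStreamReader(fs.open(new Path("/user/demo/sample.txt"))))) {
          String line;
          while ((line = in.readLine()) != null) {
            System.out.println(line);
          }
        }
      }
    }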

6 Acquire Data Using CLI, Fuse DFS, and Flume


Objectives 6-3
Reviewing the Command Line Interface (CLI) 6-4
Viewing File System Contents Using the CLI 6-5
Loading Data Using the CLI 6-6
What is Fuse DFS? 6-7
Enabling Fuse DFS on Big Data 6-8
Using Fuse DFS 6-9
What is Flume? 6-10
Flume: Architecture 6-11
Flume Sources (Consume Events) 6-12
Flume Channels (Hold Events) 6-13
Flume Sinks (Deliver Events) 6-14
Configuring Flume 6-16
Exploring a flume*.conf File 6-17
Additional Resources 6-18
Summary 6-19
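
To make the source/channel/sink pipeline and the "Exploring a flume*.conf File" topic concrete, here is a minimal sketch of a Flume agent configuration. The agent name, port, and HDFS path are illustrative assumptions.

    # One agent: a netcat source feeding a memory channel drained by an HDFS sink
    agent1.sources  = src1
    agent1.channels = ch1
    agent1.sinks    = snk1

    # Source (consumes events): turns each line received on a TCP port into an event
    agent1.sources.src1.type = netcat
    agent1.sources.src1.bind = localhost
    agent1.sources.src1.port = 44444
    agent1.sources.src1.channels = ch1

    # Channel (holds events): buffers events in memory between source and sink
    agent1.channels.ch1.type = memory
    agent1.channels.ch1.capacity = 10000

    # Sink (delivers events): writes events to an HDFS directory
    agent1.sinks.snk1.type = hdfs
    agent1.sinks.snk1.hdfs.path = /user/demo/flume/events
    agent1.sinks.snk1.hdfs.fileType = DataStream
    agent1.sinks.snk1.channel = ch1
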
7 Acquire and Access Data Using NoSQL Database
Objectives 7-3
What is a NoSQL Database? 7-4
RDBMS Compared to NoSQL 7-5
HDFS Compared to NoSQL 7-6
NoSQL Database 7-7
Points to Consider Before Choosing NoSQL 7-8
NoSQL Key-Value Data Model 7-9
Acquiring and Accessing Data in a NoSQL DB 7-11
Primary (Parent) Table Data Model 7-12
Table Data Model: Child Tables 7-13
Creating Tables 7-14
Creating Tables: Two Options 7-15
Data Definition Language (DDL) Commands 7-16
CREATE TABLE 7-17
Accessing the CLI 7-19
Executing a DDL Command 7-20
Viewing Table Descriptions 7-21
Recommendation: Using Scripts 7-22
Loading Data Into Tables 7-23
Accessing the KVStore 7-24
Introducing the TableAPI 7-25
Write Operations: put() Methods 7-26
Writing Rows to Tables: Steps 7-27
Constructing a Handle 7-28
Creating Row Object, Adding Fields, and Writing Record 7-29
Reading Data from Tables 7-30
Read Operations: get() Methods 7-31
Retrieving Table Data: Steps 7-32
Retrieving a Single Row 7-33
Retrieving Multiple Rows 7-34
Retrieving Child Tables 7-35
Removing Data From Tables 7-36
Delete Operations: Three TableAPI Methods 7-37
Deleting Row(s) From a Table: Steps 7-38
Additional Resources 7-39
Summary 7-40
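
The write and read flow above (construct a handle, create a Row object, put(), then build a PrimaryKey and get()) might look like this minimal Java sketch. The store name, helper host:port, and the users table are illustrative; the table is assumed to already exist, created with a DDL command such as CREATE TABLE users (id INTEGER, name STRING, PRIMARY KEY (id)).

    import oracle.kv.KVStore;
    import oracle.kv.KVStoreConfig;
    import oracle.kv.KVStoreFactory;
    import oracle.kv.table.PrimaryKey;
    import oracle.kv.table.Row;
    import oracle.kv.table.Table;
    import oracle.kv.table.TableAPI;

    public class UsersReadWrite {
      public static void main(String[] args) {
        // Construct a handle to the KVStore (store name and helper host are assumed)
        KVStore store = KVStoreFactory.getStore(
            new KVStoreConfig("kvstore", "localhost:5000"));
        TableAPI tableAPI = store.getTableAPI();
        Table users = tableAPI.getTable("users");

        // Write: create a Row object, add fields, write the record
        Row row = users.createRow();
        row.put("id", 1);
        row.put("name", "Alice");
        tableAPI.put(row, null, null);

        // Read: build a PrimaryKey and retrieve a single row
        PrimaryKey key = users.createPrimaryKey();
        key.put("id", 1);
        Row fetched = tableAPI.get(key, null);
        System.out.println(fetched.get("name").asString().get());

        store.close();
      }
    }
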
8 Primary Administrative Tasks for NoSQL Database
Objectives 8-3
Installation Planning: KVStore Analysis 8-4
InitialCapacityPlanning Spreadsheet 8-5
Planning Spreadsheet Sections 8-6
Next Topic 8-7
Configuration Requirements 8-8
Determine the Number of Shards 8-9
Determine the Number of Partitions and Replication Factor 8-10
Determine the Number of Storage Nodes 8-11
Installation and Configuration Steps 8-12
Step 1: Creating Directories 8-13
Step 2: Extracting Software 8-14
Step 3: Verifying the Installation 8-15
Step 4: Configuring Nodes (Using the makebootconfig Utility) 8-16
Using the makebootconfig Utility 8-18
Starting the Storage Node Agents 8-19
Pinging the Replication Nodes 8-20
Next Topic 8-21
Configuration and Monitoring Tools 8-22
Steps to Deploy a KVStore 8-23
Introducing Plans 8-24
States of a Plan 8-25
Starting the Configuration Tool 8-26
Configuring KVStore 8-27
Creating a Zone 8-28
Deploying Storage and Admin Nodes 8-29
Creating a Storage Pool 8-30
Joining Nodes to the Storage Pool 8-31
Creating a Topology 8-32
Deploying the KVStore 8-33
Testing the KVStore 8-34
Additional Resources 8-35
Summary 8-36
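
On a single Storage Node, the installation and configuration steps above might reduce to a shell sketch like the following. KVHOME, the root directory, host name, and port ranges are illustrative, and exact makebootconfig flags vary by release.

    # Step 1: create the root directory
    mkdir -p /var/kvroot
    # Step 3: verify the extracted software
    java -jar $KVHOME/lib/kvstore.jar version
    # Step 4: boot-configure the node with the makebootconfig utility
    java -jar $KVHOME/lib/kvstore.jar makebootconfig \
      -root /var/kvroot -host node01 -port 5000 \
      -harange 5010,5025 -capacity 1
    # Start the Storage Node Agent, then verify with ping
    nohup java -jar $KVHOME/lib/kvstore.jar start -root /var/kvroot &
    java -jar $KVHOME/lib/kvstore.jar ping -host node01 -port 5000
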
9 Introduction to MapReduce
Objectives 9-3
MapReduce 9-4
MapReduce Architecture 9-5
MapReduce Version 1 (MRv1) Architecture 9-6
MapReduce Phases 9-7
MapReduce Framework 9-8
Parallel Processing with MapReduce 9-9
MapReduce Jobs 9-10
Interacting with MapReduce 9-11
MapReduce Processing 9-12
MapReduce (MRv1) Daemons 9-13
Hadoop Basic Cluster (MRv1): Example 9-14
MapReduce Application Workflow 9-15
Data Locality Optimization in Hadoop 9-17
MapReduce Mechanics: Deck of Cards Example 9-18
MapReduce Mechanics Example: Assumptions 9-19
MapReduce Mechanics: The Map Phase 9-20
MapReduce Mechanics: The Shuffle and Sort Phase 9-21
MapReduce Mechanics: The Reduce Phase 9-22
Word Count Process: Example 9-23
Submitting a MapReduce Job 9-24
Summary 9-25
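
The word count process above is the canonical MapReduce example. A minimal Java sketch of its map and reduce phases follows; input and output paths are passed as arguments, and the class names are illustrative.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map phase: emit (word, 1) for every word in the input split
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reduce phase: after shuffle and sort, sum the counts for each word
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          context.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }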

10 Resource Management Using YARN


Objectives 10-3
Agenda 10-4
Apache Hadoop YARN: Overview 10-5
MapReduce 2.0 or YARN Architecture 10-7
MapReduce 2.0 (MRv2) or YARN Daemons 10-8
Hadoop Basic Cluster YARN (MRv2): Example 10-9
YARN Versus MRv1 Architecture 10-10
YARN (MRv2) Architecture 10-11
MapReduce 2.0 (MRv2) or YARN Daemons 10-13
YARN (MRv2) Daemons 10-14
YARN: Features 10-15
Launching an Application on a YARN Cluster 10-16
MRv1 Versus MRv2 10-18
Job Scheduling in YARN 10-20
YARN Fair Scheduler 10-21
Cloudera Manager Resource Management Features 10-23
Static Service Pools 10-25
Working with the Fair Scheduler 10-26
Cloudera Manager Dynamic Resource Management: Example 10-27
Submitting a Job to hrpool by User lucy from the hr Group 10-33
Monitoring the Status of the Submitted MapReduce Job 10-34
Examining the marketingpool 10-35
Submitting a Job to marketingpool by User lucy from the hr Group 10-36
Monitoring the Status of the Submitted MapReduce Job 10-37
Submitting a Job to marketingpool by User bob from the marketing Group 10-38
Monitoring the Status of the Submitted MapReduce Job 10-39
Delay Scheduling 10-40
Agenda 10-41
YARN application Command 10-42
YARN application Command: Example 10-43
Monitoring an Application Using the UI 10-45
The Scheduler: BDA Example 10-46
Summary 10-47
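
A short sketch of the yarn application command and of directing a job to a Fair Scheduler pool, in the spirit of the hrpool and marketingpool examples above. The application ID, jar name, and queue path are illustrative, and the -D generic option assumes the job driver uses ToolRunner.

    # List, inspect, and kill applications from the command line
    yarn application -list
    yarn application -status application_1446699275562_0001
    yarn application -kill   application_1446699275562_0001

    # Submit a MapReduce job to a specific Fair Scheduler pool
    hadoop jar my-job.jar MyDriver -Dmapreduce.job.queuename=root.hrpool in out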

11 Overview of Hive and Pig


Objectives 11-3
Hive 11-4
Use Case: Storing Clickstream Data 11-5
Defining Tables over HDFS 11-6
Hive: Data Units 11-8
The Hive Metastore Database 11-9
Hive Framework 11-10
Creating a Hive Database 11-11
Data Manipulation in Hive 11-12
Data Manipulation in Hive: Nested Queries 11-13
Steps in a Hive Query 11-14
Hive-Based Applications 11-15
Hive: Limitations 11-16
Pig: Overview 11-17
Pig Latin 11-18
Pig Applications 11-19
Running Pig Latin Statements 11-20
Pig Latin: Features 11-21
Working with Pig 11-22
Summary 11-23
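
The clickstream use case above (defining tables over files already in HDFS, then querying them schema-on-read) might look like this minimal HiveQL sketch. The database, columns, and HDFS location are illustrative, and the lesson covers the Pig Latin alternative separately.

    -- Define a table over existing HDFS files; nothing is loaded or moved
    CREATE DATABASE IF NOT EXISTS movieplex;

    CREATE EXTERNAL TABLE movieplex.clickstream (
      cust_id  INT,
      movie_id INT,
      activity STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/user/demo/clickstream';

    -- The files are parsed only at query time (schema-on-read)
    SELECT movie_id, COUNT(*) AS views
    FROM movieplex.clickstream
    GROUP BY movie_id
    ORDER BY views DESC
    LIMIT 10;
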
12 Overview of Cloudera Impala
Objectives 12-3
Hadoop: Some Data Access/Processing Options 12-4
Cloudera Impala 12-5
Cloudera Impala: Key Features 12-6
Cloudera Impala: Supported Data Formats 12-7
Cloudera Impala: Programming Interfaces 12-8
How Impala Fits Into the Hadoop Ecosystem 12-9
How Impala Works with Hive 12-10
How Impala Works with HDFS and HBase 12-11
Summary of Cloudera Impala Benefits 12-12
Impala and Hadoop: Limitations 12-13
Summary 12-14

13 Using XQuery for Hadoop


Objectives 13-3
XML 13-4
XML Elements 13-6
XML Attributes 13-8
XML Path Language 13-9
XPath Terminology: Node Types 13-10
XPath Terminology: Family Relationships 13-11
XPath Expressions 13-12
Location Path Expression: Example 13-13
XQuery: Review 13-14
XQuery Terminology 13-15
XQuery Review: books.xml Document Example 13-16
XQuery for Hadoop (OXH) 13-18
OXH Features 13-19
XQuery for Hadoop Data Flow 13-20
Using OXH 13-21
OXH Installation 13-22
OXH Functions 13-23
OXH Adapters 13-24
Running a Query: Syntax 13-25
OXH: Configuration Properties 13-26
XQuery Transformation and Basic Filtering: Example 13-27
Viewing the Completed Application in YARN 13-30
Calling Custom Java Functions from XQuery 13-31
Additional Resources 13-32
Summary 13-33
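
A minimal sketch of an OXH transformation-and-filtering query using the text adapter functions (text:collection to read, text:put to write). The input glob, delimiter, and field positions are illustrative assumptions.

    import module "oxh:text";

    for $line in text:collection("mydata/visits*.log")
    let $split := fn:tokenize($line, "\s*,\s*")
    where $split[1] eq "2013-10-28"
    return text:put($split[2])

    (: Submitted roughly as: hadoop jar $OXH_HOME/lib/oxh.jar filter.xq -output myout :)
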
14 Overview of Solr
Objectives 14-3
Apache Solr (Cloudera Search) 14-4
Types of Indexing 14-5
The solrctl Command 14-12
SchemaXML File 14-13
Creating a Solr Collection 14-14
Using OXH with Solr 14-15
Using Solr with Hue 14-16
Summary 14-18

15 Apache Spark
Objectives 15-3
Apache Spark 15-4
Introduction to Spark 15-5
Spark: Components for Distributed Execution 15-6
Resilient Distributed Dataset (RDD) 15-7
RDD Operations 15-8
Characteristics of RDD 15-9
Directed Acyclic Graph Execution Engine 15-10
Scala Language: Overview 15-11
Scala Program: Word Count Example 15-12
Spark Shells 15-13
Summary 15-14
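
The lesson's word count example is a Scala program; for consistency with the other sketches in this outline, here is an equivalent using Spark's Java API. Spark 2.x or later is assumed, and the paths and local master setting are illustrative.

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class SparkWordCount {
      public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("WordCount").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Build an RDD of words, then count via pair-RDD transformations
        JavaRDD<String> lines = sc.textFile("hdfs:///user/demo/input.txt");
        JavaPairRDD<String, Integer> counts = lines
            .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
            .mapToPair(word -> new Tuple2<>(word, 1))
            .reduceByKey(Integer::sum);

        counts.saveAsTextFile("hdfs:///user/demo/wordcounts");
        sc.stop();
      }
    }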

16 Options for Integrating Your Big Data


Objectives 16-3
Unifying Data: A Typical Requirement 16-4
Introducing Data Unification Options 16-6
Data Unification: Batch Loading 16-7
Sqoop 16-8
Loader for Hadoop (OLH) 16-9
Copy to BDA 16-10
Data Unification: Batch and Dynamic Loading 16-11
SQL Connector for HDFS 16-12
Data Unification: ETL and Synchronization 16-13
Big Data Heterogeneous Integration with Hadoop Environments 16-14
Data Unification: Dynamic Access 16-16
Big Data SQL: A New Architecture 16-17
When to Use Different Technologies? 16-18
Summary 16-19
17 Overview of Apache Sqoop
Objectives 17-3
Apache Sqoop 17-4
Sqoop Components 17-5
Sqoop Features 17-6
Sqoop: Connectors 17-7
Importing Data into Hive 17-8
Sqoop: Advantages 17-9
Summary 17-10
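
A minimal sketch of the "Importing Data into Hive" topic above; the JDBC URL, credentials file, and table names are illustrative assumptions.

    # Import a relational table into a Hive table, using 4 parallel map tasks
    sqoop import \
      --connect jdbc:mysql://dbhost:3306/sales \
      --username demo --password-file /user/demo/.pw \
      --table customers \
      --hive-import --hive-table sales.customers \
      --num-mappers 4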

18 Using Loader for Hadoop (OLH)


Objectives 18-3
Loader for Hadoop 18-4
Software Prerequisites 18-5
Modes of Operation 18-6
OLH: Online Database Mode 18-7
Running an OLH Job 18-8
OLH Use Cases 18-9
Load Balancing in OLH 18-10
Input Formats 18-11
OLH: Offline Database Mode 18-12
Offline Load Advantages in OLH 18-13
OLH Versus Sqoop 18-14
Summary 18-15

19 Using Copy to BDA


Course Road Map 19-2
Objectives 19-3
Copy to BDA 19-4
Requirements for Using Copy to BDA 19-5
How Does Copy to BDA Work? 19-6
Copy to BDA: Functional Steps 19-7
Querying the Data in Hive 19-13
Summary 19-14
20 Using SQL Connector for HDFS
Objectives 20-3
SQL Connector for HDFS 20-4
OSCH Architecture 20-5
Using OSCH: Two Simple Steps 20-6
Using OSCH: Creating External Directory 20-7
Using OSCH: Database Objects and Grants 20-8
Using OSCH: Supported Data Formats 20-9
Using OSCH: HDFS Text File Support 20-10
Using OSCH: Hive Table Support 20-12
Using OSCH: Partitioned Hive Table Support 20-14
OSCH: Features 20-15
OSCH: Performance Tuning 20-17
OSCH: Key Benefits 20-18
Summary 20-20

21 Data Integrator with Hadoop


Objectives 21-3
Data Integrator 21-4
Declarative Design 21-5
Big Data Heterogeneous Integration with Hadoop Environments 21-7
Resources for Integration 21-13
Summary 21-14

22 Using Big Data SQL


Objectives 22-3
Barriers to Effective Big Data Adoption 22-4
Overcoming Big Data Barriers 22-5
Goal and Benefits 22-7
Using Big Data SQL 22-8
Configuring Big Data SQL 22-9
Create External Tables Over HDFS Data and Query the Data 22-14
Create External Tables to Leverage the Hive Metastore and Query the Data 22-16
Using Access Parameters with _hive 22-17
Automating External Table Creation 22-19
Applying Database Security Policies 22-20
Viewing the Results 22-21
Applying Redaction Policies to Data in Hadoop 22-22
Viewing Results from the Hive (Avro) Source 22-23
Viewing the Results from Joined RDBMS and HDFS Data 22-24
Summary 22-25
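
The "Create External Tables Over HDFS Data" topic above might look like this minimal DDL sketch using the ORACLE_HDFS access driver. The table name, columns, directory object, HDFS location, and the customers table in the join are all illustrative.

    -- External table over HDFS files, queryable with ordinary SQL
    CREATE TABLE movielog_ext (
      cust_id  NUMBER,
      movie_id NUMBER,
      activity VARCHAR2(20)
    )
    ORGANIZATION EXTERNAL (
      TYPE ORACLE_HDFS
      DEFAULT DIRECTORY DEFAULT_DIR
      LOCATION ('/user/demo/movielog/*')
    )
    REJECT LIMIT UNLIMITED;

    -- Join Hadoop data with RDBMS tables in one statement
    SELECT c.name, COUNT(*)
    FROM customers c JOIN movielog_ext m ON (c.cust_id = m.cust_id)
    GROUP BY c.name;
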
23 Using Advanced Analytics: Data Mining and R Enterprise
Objectives 23-3
Advanced Analytics 23-4
Data Mining Overview 23-5
What Is Data Mining? 23-6
Common Uses of Data Mining 23-7
Defining Key Data Mining Properties 23-8
Data Mining Categories 23-10
Supervised Data Mining Techniques 23-11
Supervised Data Mining Algorithms 23-12
Unsupervised Data Mining Techniques 23-13
Unsupervised Data Mining Algorithms 23-14
Data Mining: Overview 23-15
Data Miner GUI 23-16
DM SQL Interface 23-17
Data Miner 4.1 Big Data Enhancement 23-18
Example Workflow Using JSON Query Node 23-19
ODM Resources 23-20
What Is R? 23-23
Who Uses R? 23-24
Why Do Statisticians, Data Analysts, and Data Scientists Use R? 23-25
Limitations of R 23-26
Strategy for the R Community 23-27
R Enterprise 23-28
R: Software Features 23-29
R Packages 23-30
Functions for Interacting with the Database 23-31
R: Target Environment 23-32
R: Data Sources 23-33
R and Hadoop 23-34
R and HDFS Connectivity and Interaction 23-37
Hadoop Connectivity and Interaction 23-40
Summary 23-45
24 Introducing Big Data Discovery
Course Road Map 24-2
Objectives 24-3
Big Data Discovery 24-4
Find Data 24-5
Explore Data 24-6
Transform and Enrich Data 24-7
Discover Information 24-8
Share Insights 24-9
BDD: Technical Innovation on Hadoop 24-10
Additional Resources 24-11
Summary 24-12

25 Introduction to the Big Data Appliance (BDA)


Objectives 25-3
Big Data Appliance 25-4
Big Data Appliance: Key Component of the Big Data Management System 25-5
Engineered Systems for Big Data 25-6
The Available BDA Configurations 25-7
Using the Mammoth Utility 25-8
Using the BDA Configuration Generation Utility 25-10
Configuring Big Data Appliance 25-11
The Generated Configuration Files 25-13
The BDA Configuration Generation Utility Pages 25-15
Big Data Appliance: Software Components 25-16
Big Data Appliance and YARN 25-17
Stopping the YARN Service 25-18
Hardware Failure in NoSQL 25-22
Integrated Lights Out Manager (ILOM): Overview 25-23
ILOM Users 25-24
Connecting to ILOM Using the Network 25-25
ILOM: Integrated View 25-26
Monitoring the Health of BDA: Management Utilities 25-27
Big Data Appliance: Usage Guidelines 25-39
Summary 25-40
26 Managing BDA
Objectives 26-3
Lesson Agenda 26-4
Mammoth Utility 26-5
Installation Types 26-6
Mammoth Code: Examples 26-7
Mammoth Installation Steps 26-8
Lesson Agenda 26-10
Monitoring BDA 26-11
BDA Command-Line Interface 26-12
bdacli 26-13
setup-root-ssh 26-14
Lesson Agenda 26-15
Monitor BDA with Enterprise Manager 26-16
OEM: Web and Command-Line Interfaces 26-17
OEM: Hardware Monitoring 26-18
Hadoop Cluster Monitoring 26-19
Lesson Agenda 26-20
Managing CDH Operations 26-21
Using Cloudera Manager 26-22
Monitoring BDA Status 26-23
Performing Administrative Tasks 26-24
Managing Services 26-25
Lesson Agenda 26-26
Monitoring MapReduce Jobs 26-27
Monitoring the Health of HDFS 26-28
Lesson Agenda 26-29
Cloudera Hue 26-30
Hive Query Editor (Hue) Interface 26-31
Logging in to Hue 26-32
Lesson Agenda 26-33
Starting BDA 26-34
Stopping BDA 26-35
BDA Port Assignments 26-36
Summary 26-37
27 Balancing MapReduce Jobs
Objectives 27-3
Ideal World: Neatly Balanced MapReduce Jobs 27-4
Real World: Skewed Data and Unbalanced Jobs 27-5
Data Skew 27-6
Data Skew Can Slow Down the Entire Hadoop Job 27-7
Perfect Balance 27-8
How Does Perfect Balance Work? 27-9
Using Perfect Balance 27-10
Application Requirements for Using Perfect Balance 27-11
Perfect Balance: Benefits 27-12
Using Job Analyzer 27-13
Getting Started with Perfect Balance 27-14
Using Job Analyzer 27-16
Environmental Setup for Perfect Balance and Job Analyzer 27-17
Using Job Analyzer as a Stand-Alone Utility: Example with a YARN Cluster 27-19
Configuring Perfect Balance 27-20
Using Perfect Balance to Run a Balanced MapReduce Job 27-21
Running a Job Using Perfect Balance: Examples 27-23
Perfect Balance–Generated Reports 27-25
The Job Analyzer Reports: Structure of the Job Output Directory 27-26
Reading the Job Analyzer Reports 27-27
Reading the Job Analyzer Report in HDFS Using a Web Browser 27-28
Reading the Job Analyzer Report in the Local File System in a Web Browser 27-29
Looking for Skew Indicators in the Job Analyzer Reports 27-30
Job Analyzer Sample Reports 27-31
Collecting Additional Metrics with Job Analyzer 27-32
Using Perfect Balance API 27-34
Troubleshooting Jobs Running with Perfect Balance 27-37
Perfect Balance Examples Available with Installation 27-38
Summary 27-40
28 Securing Your Data
Objectives 28-3
Security Trends 28-4
Security Levels 28-5
Outline 28-6
Relaxed Security 28-7
HDFS ACLs 28-10
Changing Access Privileges 28-11
Challenges with Relaxed Security 28-13
Create Databases (in Hive) 28-15
Privileges on Source Data for Tables 28-16
Granting Privileges on Source Data for Tables 28-18
Creating the Table and Loading the Data 28-19
Grant and Revoke Access to Table 28-21
Database Access to HDFS 28-22
Auditing 28-24
Encryption 28-25
Network Encryption 28-26
Data at Rest Encryption 28-28
Summary 28-30
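
A minimal sketch of the HDFS ACL topics above ("HDFS ACLs" and "Changing Access Privileges"); the paths, user, and group are illustrative, and Hive-level GRANT/REVOKE is covered separately in the lesson.

    # Restrict POSIX permissions, then grant finer-grained access via ACLs
    hdfs dfs -chmod 750 /data/sales
    hdfs dfs -setfacl -m user:lucy:r-x,group:analysts:r-x /data/sales
    hdfs dfs -getfacl /data/sales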

29 Introduction to Big Data on Cloud


Objectives 29-2
Big Data on Cloud Service 29-3
Big Data Cloud Service: Key Features 29-4
Big Data Cloud Service: Benefits 29-5
Elasticity: Dedicated Compute Bursting 29-6
Security Made Easy 29-8
Comprehensive Analytics Toolset Included 29-9
Big Data Deployment Models: Choices 29-12
Resources 29-15
Summary 29-16
