Big Data Open Source Implementation and Administration
Hours: 40
Instructor: Ing. Yonogy Curi
1 Introduction
Objectives 1-2
Questions About You 1-3
Course Objectives 1-4
Course Road Map: Module 1 Big Data Management System 1-5
Course Road Map: Module 2 Data Acquisition and Storage 1-6
Course Road Map: Module 3 Data Access and Processing 1-7
Course Road Map: Module 4 Data Unification and Analysis 1-8
The Big Data Virtual Machine (Used in this Course) Home Page 1-10
Connecting to the Practice Environment 1-11
Starting the Big Data Virtual Machine (VM) Used in this Course 1-12
Starting the Big Data (BDLite) Virtual Machine (VM) Used in this Course 1-13
Accessing the Getting Started Page from the BDVM 1-14
Big Data Appliance Documentation 1-19
2 Big Data and the Information Management System
Lesson Objectives 2-3
Big Data: A Strategic IM Perspective 2-4
Big Data 2-5
Characteristics of Big Data 2-6
Importance of Big Data 2-8
Big Data Opportunities: Some Examples 2-9
Big Data Challenges 2-10
Information Management Landscape 2-12
Extending the Boundaries of Information Management 2-13
A Simple Functional Model for Big Data 2-14
Information Management Conceptual Architecture 2-16
Design Patterns to Component Usage Map 2-18
Big Data Adoption and Implementation Patterns 2-20
IM Architecture Data Approaches: Schema-on-Write vs Schema-on-Read 2-22
Course Approach: Big Data Project Phases 2-24
IM System for Big Data 2-26
Additional Resources 2-30
Summary 2-31
3 Using the Big Data Virtual Machine
Objectives 3-3
Lesson Agenda 3-4
Big Data Virtual Machine: Introduction 3-5
Big Data VM Components 3-6
Initializing the Environment for the Big Data VM 3-7
Initializing the Environment 3-8
Lesson Agenda 3-9
MoviePlex Case Study: Introduction 3-10
Big Data Challenge 3-12
Derive Value from Big Data 3-13
MoviePlex: Goal 3-14
MoviePlex: Big Data Challenges 3-15
MoviePlex: Architecture 3-16
MoviePlex: Data Generation 3-17
MoviePlex: Data Generation Format 3-18
MoviePlex Application 3-19
Summary 3-20
4 Introduction to the Big Data Ecosystem
Objectives 4-3
Computer Clusters 4-4
Distributed Computing 4-5
Apache Hadoop 4-6
Types of Analysis That Use Hadoop 4-7
Apache Hadoop Ecosystem 4-8
Apache Hadoop Core Components 4-9
HDFS Key Definitions 4-11
NameNode (NN) & DataNodes 4-12
MapReduce Framework 4-14
Benefits of MapReduce 4-15
MapReduce Job 4-16
MapReduce Versions 4-19
Choosing a Hadoop Distribution and Version 4-20
Additional Resources: Apache Hadoop 4-22
Cloudera’s Distribution Including Apache Hadoop (CDH) 4-23
CDH Architecture 4-24
CDH Components 4-25
CDH Architecture 4-26
CDH Components 4-28
Summary 4-30
5 Introduction to the Hadoop Distributed File System (HDFS)
Objectives 5-3
HDFS: Characteristics 5-5
HDFS Deployments: High Availability (HA) and Non-HA 5-7
HDFS Key Definitions 5-8
Functions of the NameNode 5-10
Secondary NameNode (Non-HA) 5-11
Functions of DataNodes 5-13
NameNode and Secondary NameNodes 5-14
Storing and Accessing Data Files in HDFS 5-15
HDFS Architecture: HA 5-17
Configuring an HA Cluster: Hardware Resources 5-19
Data Replication Process 5-26
Accessing HDFS 5-27
HDFS Commands 5-29
Shell Interface 5-30
Accessing HDFS 5-32
FS Shell Commands 5-33
Sample FS Shell Commands 5-35
HDFS Administration Commands 5-38
Using the hdfs fsck Command: Example 5-39
HDFS Features and Benefits 5-40
Summary 5-41
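
Lesson 5's FS shell material (for example, hdfs dfs -ls) maps directly onto Hadoop's Java FileSystem API. The minimal sketch below lists a directory; the /user/oracle path is a placeholder for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Java equivalent of "hdfs dfs -ls /user/oracle"; the path is a placeholder.
    public class HdfsList {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // reads core-site.xml/hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);       // connects to the default FS (HDFS)
            for (FileStatus status : fs.listStatus(new Path("/user/oracle"))) {
                System.out.println(status.getPath() + "\t" + status.getLen() + " bytes");
            }
            fs.close();
        }
    }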
6 Acquire Data Using CLI, Fuse DFS, and Flume
Objectives 6-3
Reviewing the Command Line Interface (CLI) 6-4
Viewing File System Contents Using the CLI 6-5
Loading Data Using the CLI 6-6
What Is Fuse DFS? 6-7
Enabling Fuse DFS on Big Data 6-8
Using Fuse DFS 6-9
What Is Flume? 6-10
Flume: Architecture 6-11
Flume Sources (Consume Events) 6-12
Flume Channels (Hold Events) 6-13
Flume Sinks (Deliver Events) 6-14
Configuring Flume 6-16
Exploring a flume*.conf File 6-17
Additional Resources 6-18
Summary 6-19
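
Lesson 6's CLI load (hdfs dfs -put) also has a one-call Java equivalent; Flume itself is configured declaratively through a flume*.conf file rather than code. A minimal sketch, with both file paths as placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Java equivalent of "hdfs dfs -put"; source and target paths are placeholders.
    public class HdfsPut {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            fs.copyFromLocalFile(new Path("file:///tmp/movieapp.log"),
                                 new Path("/user/oracle/movielogs/movieapp.log"));
            fs.close();
        }
    }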
7 Acquire and Access Data Using NoSQL Database
Objectives 7-3
What Is a NoSQL Database? 7-4
RDBMS Compared to NoSQL 7-5
HDFS Compared to NoSQL 7-6
NoSQL Database 7-7
Points to Consider Before Choosing NoSQL 7-8
NoSQL Key-Value Data Model 7-9
Acquiring and Accessing Data in a NoSQL DB 7-11
Primary (Parent) Table Data Model 7-12
Table Data Model: Child Tables 7-13
Creating Tables 7-14
Creating Tables: Two Options 7-15
Data Definition Language (DDL) Commands 7-16
CREATE TABLE 7-17
Accessing the CLI 7-19
Executing a DDL Command 7-20
Viewing Table Descriptions 7-21
Recommendation: Using Scripts 7-22
Loading Data Into Tables 7-23
Accessing the KVStore 7-24
Introducing the TableAPI 7-25
Write Operations: put() Methods 7-26
Writing Rows to Tables: Steps 7-27
Constructing a Handle 7-28
Creating Row Object, Adding Fields, and Writing Record 7-29
Reading Data from Tables 7-30
Read Operations: get() Methods 7-31
Retrieving Table Data: Steps 7-32
Retrieving a Single Row 7-33
Retrieving Multiple Rows 7-34
Retrieving Child Tables 7-35
Removing Data From Tables 7-36
Delete Operations: Three TableAPI Methods 7-37
Deleting Row(s) From a Table: Steps 7-38
Additional Resources 7-39
Summary 7-40
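
Lesson 7's write/read flow (construct a store handle, create a Row, put(), then get()) looks roughly like the Java sketch below. The store name, host:port, and the users table (id INTEGER primary key, name STRING) are assumptions for illustration:

    import oracle.kv.KVStore;
    import oracle.kv.KVStoreConfig;
    import oracle.kv.KVStoreFactory;
    import oracle.kv.table.PrimaryKey;
    import oracle.kv.table.Row;
    import oracle.kv.table.Table;
    import oracle.kv.table.TableAPI;

    // Sketch of the TableAPI flow; store name, host:port, and the "users"
    // table definition are hypothetical.
    public class TableApiDemo {
        public static void main(String[] args) {
            KVStore store = KVStoreFactory.getStore(
                    new KVStoreConfig("kvstore", "localhost:5000")); // store handle
            TableAPI tableApi = store.getTableAPI();
            Table users = tableApi.getTable("users");

            Row row = users.createRow();      // create Row, add fields, write record
            row.put("id", 1);
            row.put("name", "lucy");
            tableApi.put(row, null, null);    // default ReturnRow/WriteOptions

            PrimaryKey key = users.createPrimaryKey();
            key.put("id", 1);
            Row fetched = tableApi.get(key, null);   // read the row back
            System.out.println(fetched.get("name"));

            store.close();
        }
    }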
8 Primary Administrative Tasks for NoSQL Database
Objectives 8-3
Installation Planning: KVStore Analysis 8-4
InitialCapacityPlanning Spreadsheet 8-5
Planning Spreadsheet Sections 8-6
Next Topic 8-7
Configuration Requirements 8-8
Determine the Number of Shards 8-9
Determine the Number of Partitions and Replication Factor 8-10
Determine the Number of Storage Nodes 8-11
Installation and Configuration Steps 8-12
Step 1: Creating Directories 8-13
Step 2: Extracting Software 8-14
Step 3: Verifying the Installation 8-15
Step 4: Configuring Nodes (Using the makebootconfig Utility) 8-16
Using the makebootconfig Utility 8-18
Starting the Storage Node Agents 8-19
Pinging the Replication Nodes 8-20
Next Topic 8-21
Configuration and Monitoring Tools 8-22
Steps to Deploy a KVStore 8-23
Introducing Plans 8-24
States of a Plan 8-25
Starting the Configuration Tool 8-26
Configuring KVStore 8-27
Creating a Zone 8-28
Deploying Storage and Admin Nodes 8-29
Creating a Storage Pool 8-30
Joining Nodes to the Storage Pool 8-31
Creating a Topology 8-32
Deploying the KVStore 8-33
Testing the KVStore 8-34
Additional Resources 8-35
Summary 8-36
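
Step 4 of Lesson 8 configures each storage node with the makebootconfig utility, which is invoked through kvstore.jar. A minimal single-node sketch that shells out to it from Java follows; the jar path, KVROOT, host name, and port range are placeholders:

    // Hypothetical single-node bootstrap; the jar location, root directory,
    // host, and ports are placeholders for illustration.
    public class BootstrapStorageNode {
        public static void main(String[] args) throws Exception {
            ProcessBuilder pb = new ProcessBuilder(
                    "java", "-jar", "/u01/nosql/kv/lib/kvstore.jar", "makebootconfig",
                    "-root", "/u02/kvroot",
                    "-host", "node01",
                    "-port", "5000",
                    "-harange", "5010,5020");
            pb.inheritIO();                      // show the utility's output
            System.exit(pb.start().waitFor());
        }
    }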
9 Introduction to MapReduce
Objectives 9-3
MapReduce 9-4
MapReduce Architecture 9-5
MapReduce Version 1 (MRv1) Architecture 9-6
MapReduce Phases 9-7
MapReduce Framework 9-8
Parallel Processing with MapReduce 9-9
MapReduce Jobs 9-10
Interacting with MapReduce 9-11
MapReduce Processing 9-12
MapReduce (MRv1) Daemons 9-13
Hadoop Basic Cluster (MRv1): Example 9-14
MapReduce Application Workflow 9-15
Data Locality Optimization in Hadoop 9-17
MapReduce Mechanics: Deck of Cards Example 9-18
MapReduce Mechanics Example: Assumptions 9-19
MapReduce Mechanics: The Map Phase 9-20
MapReduce Mechanics: The Shuffle and Sort Phase 9-21
MapReduce Mechanics: The Reduce Phase 9-22
Word Count Process: Example 9-23
Submitting a MapReduce Job 9-24
Summary 9-25
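
Lesson 9's word-count walkthrough (slide 9-23) corresponds to the classic Hadoop Java job below: map emits (word, 1) pairs, the framework shuffles and sorts by key, and reduce sums the counts. Input and output paths come in as job arguments:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);       // map phase: (word, 1)
                }
            }
        }

        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();               // reduce phase: sum per word
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);   // optional local pre-aggregation
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }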
10 Resource Management Using YARN
Objectives 10-3
Agenda 10-4
Apache Hadoop YARN: Overview 10-5
MapReduce 2.0 or YARN Architecture 10-7
MapReduce 2.0 (MRv2) or YARN Daemons 10-8
Hadoop Basic Cluster YARN (MRv2): Example 10-9
YARN Versus MRv1 Architecture 10-10
YARN (MRv2) Architecture 10-11
MapReduce 2.0 (MRv2) or YARN Daemons 10-13
YARN (MRv2) Daemons 10-14
YARN: Features 10-15
Launching an Application on a YARN Cluster 10-16
MRv1 Versus MRv2 10-18
Job Scheduling in YARN 10-20
YARN Fair Scheduler 10-21
Cloudera Manager Resource Management Features 10-23
Static Service Pools 10-25
Working with the Fair Scheduler 10-26
Cloudera Manager Dynamic Resource Management: Example 10-27
Submitting a Job to hrpool by User lucy from the hr Group 10-33
Monitoring the Status of the Submitted MapReduce Job 10-34
Examining the marketingpool 10-35
Submitting a Job to marketingpool by User lucy from the hr Group 10-36
Monitoring the Status of the Submitted MapReduce Job 10-37
Submitting a Job to marketingpool by User bob from the marketing Group 10-38
Monitoring the Status of the Submitted MapReduce Job 10-39
Delay Scheduling 10-40
Agenda 10-41
YARN application Command 10-42
YARN application Command: Example 10-43
Monitoring an Application Using the UI 10-45
The Scheduler: BDA Example 10-46
Summary 10-47
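
Lesson 10's yarn application command has a straightforward Java counterpart via the YarnClient API; the sketch below is the rough equivalent of "yarn application -list" and also prints each application's queue (the fair-scheduler pool, such as root.hrpool, which a job targets by setting mapreduce.job.queuename at submission):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.yarn.api.records.ApplicationReport;
    import org.apache.hadoop.yarn.client.api.YarnClient;

    // Rough Java equivalent of "yarn application -list".
    public class ListYarnApps {
        public static void main(String[] args) throws Exception {
            YarnClient yarn = YarnClient.createYarnClient();
            yarn.init(new Configuration());
            yarn.start();
            for (ApplicationReport app : yarn.getApplications()) {
                System.out.println(app.getApplicationId() + "\t"
                        + app.getQueue() + "\t"      // e.g. root.hrpool
                        + app.getYarnApplicationState());
            }
            yarn.stop();
        }
    }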
11 Overview of Hive and Pig
Objectives 11-3
Hive 11-4
Use Case: Storing Clickstream Data 11-5
Defining Tables over HDFS 11-6
Hive: Data Units 11-8
The Hive Metastore Database 11-9
Hive Framework 11-10
Creating a Hive Database 11-11
Data Manipulation in Hive 11-12
Data Manipulation in Hive: Nested Queries 11-13
Steps in a Hive Query 11-14
Hive-Based Applications 11-15
Hive: Limitations 11-16
Pig: Overview 11-17
Pig Latin 11-18
Pig Applications 11-19
Running Pig Latin Statements 11-20
Pig Latin: Features 11-21
Working with Pig 11-22
Summary 11-23
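
Lesson 11's HiveQL can be issued programmatically through the HiveServer2 JDBC driver as well as from the shell. A minimal sketch; the host, port, user, and movie_log table are placeholders:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    // Run a HiveQL aggregation over JDBC; connection details and the
    // movie_log table are hypothetical.
    public class HiveQuery {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                         "jdbc:hive2://localhost:10000/default", "oracle", "");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                         "SELECT custid, COUNT(*) AS clicks "
                       + "FROM movie_log GROUP BY custid")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }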
12 Overview of Cloudera Impala
Objectives 12-3
Hadoop: Some Data Access/Processing Options 12-4
Cloudera Impala 12-5
Cloudera Impala: Key Features 12-6
Cloudera Impala: Supported Data Formats 12-7
Cloudera Impala: Programming Interfaces 12-8
How Impala Fits Into the Hadoop Ecosystem 12-9
How Impala Works with Hive 12-10
How Impala Works with HDFS and HBase 12-11
Summary of Cloudera Impala Benefits 12-12
Impala and Hadoop: Limitations 12-13
Summary 12-14
13 Using XQuery for Hadoop
Objectives 13-3
XML 13-4
XML Elements 13-6
XML Attributes 13-8
XML Path Language 13-9
XPath Terminology: Node Types 13-10
XPath Terminology: Family Relationships 13-11
XPath Expressions 13-12
Location Path Expression: Example 13-13
XQuery: Review 13-14
XQuery Terminology 13-15
XQuery Review: books.xml Document Example 13-16
Oracle XQuery for Hadoop (OXH) 13-18
OXH Features 13-19
XQuery for Hadoop Data Flow 13-20
Using OXH 13-21
OXH Installation 13-22
OXH Functions 13-23
OXH Adapters 13-24
Running a Query: Syntax 13-25
OXH: Configuration Properties 13-26
XQuery Transformation and Basic Filtering: Example 13-27
Viewing the Completed Application in YARN 13-30
Calling Custom Java Functions from XQuery 13-31
Additional Resources 13-32
Summary 13-33
14 Overview of Solr
Objectives 14-3
Apache Solr (Cloudera Search) 14-4
Types of Indexing 14-5
The solrctl Command 14-12
SchemaXML File 14-13
Creating a Solr Collection 14-14
Using OXH with Solr 14-15
Using Solr with Hue 14-16
Summary 14-18
15 Apache Spark
Objectives 15-3
Apache Spark 15-4
Introduction to Spark 15-5
Spark: Components for Distributed Execution 15-6
Resilient Distributed Dataset (RDD) 15-7
RDD Operations 15-8
Characteristics of RDD 15-9
Directed Acyclic Graph Execution Engine 15-10
Scala Language: Overview 15-11
Scala Program: Word Count Example 15-12
Spark Shells 15-13
Summary 15-14
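
Lesson 15 shows word count in Scala; the sketch below is the Java equivalent, assuming the Spark 2.x Java API (the 1.x flatMap signature differs). Each call builds up the RDD lineage (a DAG); nothing runs until the saveAsTextFile action triggers the job. Both HDFS paths are placeholders:

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class SparkWordCount {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("SparkWordCount");
            JavaSparkContext sc = new JavaSparkContext(conf);
            JavaRDD<String> lines = sc.textFile("hdfs:///user/oracle/input");
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey((a, b) -> a + b);   // transformations: still lazy
            counts.saveAsTextFile("hdfs:///user/oracle/wordcounts"); // action runs the DAG
            sc.stop();
        }
    }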
16 Options for Integrating Your Big Data
Objectives 16-3
Unifying Data: A Typical Requirement 16-4
Introducing Data Unification Options 16-6
Data Unification: Batch Loading 16-7
Sqoop 16-8
Oracle Loader for Hadoop (OLH) 16-9
Copy to BDA 16-10
Data Unification: Batch and Dynamic Loading 16-11
SQL Connector for HDFS 16-12
Data Unification: ETL and Synchronization 16-13
Big Data Heterogeneous Integration with Hadoop Environments 16-14
Data Unification: Dynamic Access 16-16
Big Data SQL: A New Architecture 16-17
When to Use Different Technologies? 16-18
Summary 16-19
17 Overview of Apache Sqoop
Objectives 17-3
Apache Sqoop 17-4
Sqoop Components 17-5
Sqoop Features 17-6
Sqoop: Connectors 17-7
Importing Data into Hive 17-8
Sqoop: Advantages 17-9
Summary 17-10
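
Sqoop is driven from the command line, so Lesson 17's Hive import (slide 17-8) is most simply scripted by shelling out to it, as in this sketch; the JDBC URL, credentials path, and table name are placeholders:

    // Launch "sqoop import --hive-import" from Java; all connection
    // details are hypothetical.
    public class SqoopHiveImport {
        public static void main(String[] args) throws Exception {
            ProcessBuilder pb = new ProcessBuilder(
                    "sqoop", "import",
                    "--connect", "jdbc:oracle:thin:@//dbhost:1521/orcl",
                    "--username", "moviedemo",
                    "--password-file", "/user/oracle/.sqoop.pwd",
                    "--table", "CUSTOMERS",
                    "--hive-import");          // create and load a matching Hive table
            pb.inheritIO();                    // stream Sqoop's output to the console
            System.exit(pb.start().waitFor());
        }
    }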
18 Using Oracle Loader for Hadoop (OLH)
Objectives 18-3
Loader for Hadoop 18-4
Software Prerequisites 18-5
Modes of Operation 18-6
OLH: Online Database Mode 18-7
Running an OLH Job 18-8
OLH Use Cases 18-9
Load Balancing in OLH 18-10
Input Formats 18-11
OLH: Offline Database Mode 18-12
Offline Load Advantages in OLH 18-13
OLH Versus Sqoop 18-14
Summary 18-15
19 Using Copy to BDA
Course Road Map 19-2
Objectives 19-3
Copy to BDA 19-4
Requirements for Using Copy to BDA 19-5
How Does Copy to BDA Work? 19-6
Copy to BDA: Functional Steps 19-7
Querying the Data in Hive 19-13
Summary 19-14
20 Using SQL Connector for HDFS
Objectives 20-3
Oracle SQL Connector for HDFS (OSCH) 20-4
OSCH Architecture 20-5
Using OSCH: Two Simple Steps 20-6
Using OSCH: Creating External Directory 20-7
Using OSCH: Database Objects and Grants 20-8
Using OSCH: Supported Data Formats 20-9
Using OSCH: HDFS Text File Support 20-10
Using OSCH: Hive Table Support 20-12
Using OSCH: Partitioned Hive Table Support 20-14
OSCH: Features 20-15
OSCH: Performance Tuning 20-17
OSCH: Key Benefits 20-18
Summary 20-20
21 Data Integrator with Hadoop
Objectives 21-3
Data Integrator 21-4
Declarative Design 21-5
Big Data Heterogeneous Integration with Hadoop Environments 21-7
Resources for Integration 21-13
Summary 21-14
22 Using Big Data SQL
Objectives 22-3
Barriers to Effective Big Data Adoption 22-4
Overcoming Big Data Barriers 22-5
Goal and Benefits 22-7
Using Big Data SQL 22-8
Configuring Big Data SQL 22-9
Create External Tables Over HDFS Data and Query the Data 22-14
Create External Tables to Leverage the Hive Metastore and Query the Data 22-16
Using Access Parameters with ORACLE_HIVE 22-17
Automating External Table Creation 22-19
Applying Database Security Policies 22-20
Viewing the Results 22-21
Applying Redaction Policies to Data in Hadoop 22-22
Viewing Results from the Hive (Avro) Source 22-23
Viewing the Results from Joined RDBMS and HDFS Data 22-24
Summary 22-25
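
The external-table pattern from slides 22-14 through 22-17 has the general shape sketched below: an ORACLE_HIVE table maps a Hive Metastore definition into the database so ordinary SQL can query, and join against, the Hadoop data. This is only an illustration; the connection details, table names, directory object, and exact access-parameter syntax vary by release:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    // Create an external table over a Hive source; every name here is a placeholder.
    public class CreateHiveExternalTable {
        public static void main(String[] args) throws Exception {
            String ddl =
                "CREATE TABLE movie_log_ext (custid NUMBER, movieid NUMBER, rating NUMBER) "
              + "ORGANIZATION EXTERNAL ("
              + "  TYPE ORACLE_HIVE"
              + "  DEFAULT DIRECTORY DEFAULT_DIR"
              + "  ACCESS PARAMETERS (com.oracle.bigdata.tablename=default.movie_log)"
              + ") REJECT LIMIT UNLIMITED";
            try (Connection conn = DriverManager.getConnection(
                         "jdbc:oracle:thin:@//dbhost:1521/orcl", "scott", "tiger");
                 Statement stmt = conn.createStatement()) {
                stmt.execute(ddl);
                // The Hadoop data is now queryable and joinable to RDBMS tables,
                // e.g. SELECT custid, COUNT(*) FROM movie_log_ext GROUP BY custid;
            }
        }
    }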
23 Using Advanced Analytics: Data Mining and R Enterprise
Objectives 23-3
Advanced Analytics 23-4
Data Mining Overview 23-5
What Is Data Mining? 23-6
Common Uses of Data Mining 23-7
Defining Key Data Mining Properties 23-8
Data Mining Categories 23-10
Supervised Data Mining Techniques 23-11
Supervised Data Mining Algorithms 23-12
Unsupervised Data Mining Techniques 23-13
Unsupervised Data Mining Algorithms 23-14
Data Mining: Overview 23-15
Data Miner GUI 23-16
DM SQL Interface 23-17
Data Miner 4.1 Big Data Enhancement 23-18
Example Workflow Using JSON Query Node 23-19
ODM Resources 23-20
What Is R? 23-23
Who Uses R? 23-24
Why Do Statisticians, Data Analysts, and Data Scientists Use R? 23-25
Limitations of R 23-26
Strategy for the R Community 23-27
R Enterprise 23-28
R: Software Features 23-29
R Packages 23-30
Functions for Interacting with Database 23-31
R: Target Environment 23-32
R: Data Sources 23-33
R and Hadoop 23-34
R and HDFS Connectivity and Interaction 23-37
Hadoop Connectivity and Interaction 23-40
Summary 23-45
24 Introducing Big Data Discovery
Course Road Map 24-2
Objectives 24-3
Big Data Discovery 24-4
Find Data 24-5
Explore Data 24-6
Transform and Enrich Data 24-7
Discover Information 24-8
Share Insights 24-9
BDD: Technical Innovation on Hadoop 24-10
Additional Resources 24-11
Summary 24-12
25 Introduction to the Big Data Appliance (BDA)
Objectives 25-3
Big Data Appliance 25-4
Big Data Appliance: Key Component of the Big Data Management System 25-5
Engineered Systems for Big Data 25-6
The Available BDA Configurations 25-7
Using the Mammoth Utility 25-8
Using BDA Configuration Generation Utility 25-10
Configuring Big Data Appliance 25-11
The Generated Configuration Files 25-13
The BDA Configuration Generation Utility Pages 25-15
Big Data Appliance: Software Components 25-16
Big Data Appliance and YARN 25-17
Stopping the YARN Service 25-18
Hardware Failure in NoSQL 25-22
Integrated Lights Out Manager (ILOM): Overview 25-23
ILOM Users 25-24
Connecting to ILOM Using the Network 25-25
ILOM: Integrated View 25-26
Monitoring the Health of BDA: Management Utilities 25-27
Big Data Appliance: Usage Guidelines 25-39
Summary 25-40
26 Managing BDA
Objectives 26-3
Lesson Agenda 26-4
Mammoth Utility 26-5
Installation Types 26-6
Mammoth Code: Examples 26-7
Mammoth Installation Steps 26-8
Lesson Agenda 26-10
Monitoring BDA 26-11
BDA Command-Line Interface 26-12
bdacli 26-13
setup-root-ssh 26-14
Lesson Agenda 26-15
Monitor BDA with Enterprise Manager 26-16
OEM: Web and Command-Line Interfaces 26-17
OEM: Hardware Monitoring 26-18
Hadoop Cluster Monitoring 26-19
Lesson Agenda 26-20
Managing CDH Operations 26-21
Using Cloudera Manager 26-22
Monitoring BDA Status 26-23
Performing Administrative Tasks 26-24
Managing Services 26-25
Lesson Agenda 26-26
Monitoring MapReduce Jobs 26-27
Monitoring the Health of HDFS 26-28
Lesson Agenda 26-29
Cloudera Hue 26-30
Hive Query Editor (Hue) Interface 26-31
Logging in to Hue 26-32
Lesson Agenda 26-33
Starting BDA 26-34
Stopping BDA 26-35
BDA Port Assignments 26-36
Summary 26-37
27 Balancing MapReduce Jobs
Objectives 27-3
Ideal World: Neatly Balanced MapReduce Jobs 27-4
Real World: Skewed Data and Unbalanced Jobs 27-5
Data Skew 27-6
Data Skew Can Slow Down the Entire Hadoop Job 27-7
Perfect Balance 27-8
How Does Perfect Balance Work? 27-9
Using Perfect Balance 27-10
Application Requirements for Using Perfect Balance 27-11
Perfect Balance: Benefits 27-12
Using Job Analyzer 27-13
Getting Started with Perfect Balance 27-14
Using Job Analyzer 27-16
Environmental Setup for Perfect Balance and Job Analyzer 27-17
Using Job Analyzer as a Stand-Alone Utility: Example with a YARN Cluster 27-19
Configuring Perfect Balance 27-20
Using Perfect Balance to Run a Balanced MapReduce Job 27-21
Running a Job Using Perfect Balance: Examples 27-23
Perfect Balance–Generated Reports 27-25
The Job Analyzer Reports: Structure of the Job Output Directory 27-26
Reading the Job Analyzer Reports 27-27
Reading the Job Analyzer Report in HDFS Using a Web Browser 27-28
Reading the Job Analyzer Report in the Local File System in a Web Browser 27-29
Looking for Skew Indicators in the Job Analyzer Reports 27-30
Job Analyzer Sample Reports 27-31
Collecting Additional Metrics with Job Analyzer 27-32
Using Perfect Balance API 27-34
Troubleshooting Jobs Running with Perfect Balance 27-37
Perfect Balance Examples Available with Installation 27-38
Summary 27-40
28 Securing Your Data
Objectives 28-3
Security Trends 28-4
Security Levels 28-5
Outline 28-6
Relaxed Security 28-7
HDFS ACLs 28-10
Changing Access Privileges 28-11
Challenges with Relaxed Security 28-13
Create Databases (in Hive) 28-15
Privileges on Source Data for Tables 28-16
Granting Privileges on Source Data for Tables 28-18
Creating the Table and Loading the Data 28-19
Grant and Revoke Access to Table 28-21
Database Access to HDFS 28-22
Auditing 28-24
Encryption 28-25
Network Encryption 28-26
Data at Rest Encryption 28-28
Summary 28-30
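
Lesson 28's HDFS ACL changes (slides 28-10 and 28-11) can also be made through the Java FileSystem API. The sketch below is the rough equivalent of "hdfs dfs -setfacl -m user:lucy:r-x /data/movie"; the user and path are placeholders:

    import java.util.Collections;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.AclEntry;
    import org.apache.hadoop.fs.permission.AclEntryScope;
    import org.apache.hadoop.fs.permission.AclEntryType;
    import org.apache.hadoop.fs.permission.FsAction;

    // Grant read/execute on an HDFS path to one user via an access ACL entry.
    public class GrantHdfsAcl {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            AclEntry entry = new AclEntry.Builder()
                    .setScope(AclEntryScope.ACCESS)
                    .setType(AclEntryType.USER)
                    .setName("lucy")                       // hypothetical user
                    .setPermission(FsAction.READ_EXECUTE)
                    .build();
            fs.modifyAclEntries(new Path("/data/movie"),   // hypothetical path
                                Collections.singletonList(entry));
            fs.close();
        }
    }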
29 Introduction to Big Data on Cloud
Objectives 29-2
Big Data Cloud Service 29-3
Big Data Cloud Service: Key Features 29-4
Big Data Cloud Service: Benefits 29-5
Elasticity: Dedicated Compute Bursting 29-6
Security Made Easy 29-8
Comprehensive Analytics Toolset Included 29-9
Big Data Deployment Models: Choices 29-12
Resources 29-15
Summary 29-16