AMC ENGINEERING COLLEGE
Dept. Of Computer Science and Engineering
Big Data Analytics [21CS71] Assignment-1
Team-4 Topic: Spark and Spark MLlib Kalyan G V (1AM21CS077)
Introduction to Apache Spark
1) What is Apache Spark?
Overview: Apache Spark is a fast, open-source distributed computing system that simplifies big data
processing. Designed as a unified engine, Spark supports various data processing needs like batch
processing, real-time processing, and analytics.
Key Features:
o Speed: Optimized for both in-memory and disk-based data processing.
o Ease of Use: Supports Java, Python, Scala, and R with high-level APIs.
o Versatility: Suits multiple big data use cases, such as ETL, data streaming, machine learning,
and graph processing.
2) Spark Architecture
Spark follows a master-slave architecture: a cluster consists of a single master and multiple slave (worker) nodes.
The Spark architecture depends upon two abstractions:
o Resilient Distributed Dataset (RDD)
o Directed Acyclic Graph (DAG)
Resilient Distributed Datasets (RDD)
A Resilient Distributed Dataset is a collection of data items that can be stored in memory on the worker nodes. Here,
o Resilient: able to restore the data on failure.
o Distributed: the data is distributed among different nodes.
o Dataset: a group of data items.
Directed Acyclic Graph (DAG)
A Directed Acyclic Graph is a finite directed graph that represents the sequence of computations to be performed on the data. Each node is an RDD partition, and each edge is a transformation applied on top of the data. The graph is directed because computation flows in one direction from node to node, and acyclic because that flow never loops back on itself. A small sketch of this is shown below.
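As an illustration, here is a minimal PySpark sketch (assuming a local SparkContext) of how transformations build up an RDD lineage, the DAG, which is only executed once an action such as collect() is called:

from pyspark import SparkContext

# Minimal sketch: transformations (map, filter) only extend the DAG;
# nothing is computed until an action such as collect() is called.
sc = SparkContext("local[*]", "RDDExample")

numbers = sc.parallelize(range(1, 11), numSlices=4)   # RDD split into 4 partitions
squares = numbers.map(lambda x: x * x)                # transformation (lazy)
evens = squares.filter(lambda x: x % 2 == 0)          # transformation (lazy)

print(evens.collect())   # action: triggers execution of the whole lineage
sc.stop()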
Let's understand the Spark architecture.
Driver Program
The Driver Program is the process that runs the main() function of the application and creates the SparkContext object. The purpose of the SparkContext is to coordinate the Spark application, which runs as an independent set of processes on the cluster.
To run on a cluster, the SparkContext connects to one of several types of cluster managers and then performs the following tasks:
o It acquires executors on nodes in the cluster.
o Then, it sends your application code to the executors. Here, the application code can be JAR or Python files passed to the SparkContext.
o Finally, the SparkContext sends tasks to the executors to run.
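A minimal sketch of such a driver program is shown below; the master URL ("local[*]") and the tiny RDD job are placeholders chosen only for illustration:

from pyspark.sql import SparkSession

# Minimal driver-program sketch: main() creates the SparkSession, whose
# underlying SparkContext negotiates executors with the cluster manager.
def main():
    spark = (SparkSession.builder
             .appName("DriverExample")
             .master("local[*]")        # or e.g. "spark://host:7077", "yarn"
             .getOrCreate())
    sc = spark.sparkContext            # the SparkContext described above

    rdd = sc.parallelize([1, 2, 3, 4])
    print(rdd.sum())                   # tasks are sent to the executors to run

    spark.stop()

if __name__ == "__main__":
    main()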
Cluster Manager
o The role of the cluster manager is to allocate resources across applications. Spark can run on a wide range of cluster configurations.
o Supported cluster managers include Hadoop YARN, Apache Mesos, and the Standalone Scheduler.
o The Standalone Scheduler is Spark's built-in cluster manager, which makes it easy to install Spark on an otherwise empty set of machines.
Worker Node
o The worker node is a slave node.
o Its role is to run the application code in the cluster.
Executor
o An executor is a process launched for an application on a worker node.
o It runs tasks and keeps data in memory or disk storage across them.
o It reads and writes data to and from external sources.
o Every application has its own executors.
Task
o A unit of work that will be sent to one executor.
Spark Components
The Spark project consists of several tightly integrated components. At its core, Spark is a computational engine that can schedule, distribute, and monitor multiple applications.
Let's understand each Spark component in detail.
Spark Core
o The Spark Core is the heart of Spark and performs the core functionality.
o It holds the components for task scheduling, fault recovery, interacting with storage systems and
memory management.
Spark SQL
o Spark SQL is built on top of Spark Core. It provides support for structured data.
o It allows the data to be queried via SQL (Structured Query Language) as well as the Apache Hive variant of SQL called HQL (Hive Query Language).
o It supports JDBC and ODBC connections, which let Spark work with existing databases, data warehouses, and business intelligence tools.
o It also supports various sources of data like Hive tables, Parquet, and JSON.
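The following is a minimal PySpark sketch of Spark SQL in use; people.json is a hypothetical input file assumed to contain name and age fields:

from pyspark.sql import SparkSession

# Sketch: Spark SQL over a JSON source (people.json is a hypothetical file).
spark = SparkSession.builder.appName("SparkSQLExample").getOrCreate()

df = spark.read.json("people.json")          # JSON is one of the supported sources
df.createOrReplaceTempView("people")         # register the DataFrame as a SQL table

adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()

spark.stop()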
Spark Streaming
o Spark Streaming is a Spark component that supports scalable and fault-tolerant processing of
streaming data.
o It uses Spark Core's fast scheduling capability to perform streaming analytics.
o It accepts data in mini-batches and performs RDD transformations on that data.
o Its design ensures that the applications written for streaming data can be reused to analyze batches of
historical data with little modification.
o The log files generated by web servers are a real-world example of a data stream.
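Below is a sketch of the classic DStream word count, assuming a text source on localhost port 9999 (for example, one started with nc -lk 9999); each 5-second mini-batch is processed with ordinary RDD transformations:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Sketch: streaming word count over a socket source (assumed on localhost:9999).
sc = SparkContext("local[2]", "StreamingWordCount")
ssc = StreamingContext(sc, batchDuration=5)   # 5-second mini-batches

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                               # print each mini-batch's counts

ssc.start()
ssc.awaitTermination()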
MLlib
o MLlib is Spark's machine learning library and contains various machine learning algorithms.
o These include correlations and hypothesis testing, classification and regression, clustering, and
principal component analysis.
o It is reported to be about nine times faster than the disk-based implementation used by Apache Mahout.
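As a small illustration of the statistics side of MLlib, the sketch below computes a Pearson correlation matrix over a tiny made-up set of feature vectors:

from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import Correlation

spark = SparkSession.builder.appName("MLlibStats").getOrCreate()

# Tiny made-up dataset of feature vectors.
data = [(Vectors.dense([1.0, 2.0, 3.0]),),
        (Vectors.dense([2.0, 4.0, 6.5]),),
        (Vectors.dense([3.0, 6.1, 9.0]),)]
df = spark.createDataFrame(data, ["features"])

# Pearson correlation matrix across the feature dimensions.
matrix = Correlation.corr(df, "features").head()[0]
print(matrix)

spark.stop()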
GraphX
o GraphX is a library used to manipulate graphs and perform graph-parallel computations.
o It makes it possible to create a directed graph with arbitrary properties attached to each vertex and edge.
o To manipulate graphs, it supports fundamental operators such as subgraph, joinVertices, and aggregateMessages.
Introduction to Spark MLlib
What is Spark MLlib?
Overview: Spark MLlib is a library of machine learning algorithms optimized for large-scale data
processing on distributed systems.
Goals: Simplifies the process of developing and deploying machine learning models on big data, and
provides easy integration with other Spark modules.
MLlib Algorithms
Categories:
o Classification: Algorithms for tasks that predict categorical outcomes (e.g., Decision Trees,
Logistic Regression).
o Regression: Used for predicting continuous values (e.g., Linear Regression).
o Clustering: Groups similar data points into clusters (e.g., K-Means).
o Collaborative Filtering: Builds recommendation systems (e.g., Alternating Least Squares).
Features: Supports basic statistical operations, feature engineering, and model selection tools.
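To illustrate the clustering category listed above, here is a minimal PySpark sketch that groups a tiny synthetic set of 2-D points into two clusters with K-Means (the data points are made up for illustration):

from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("KMeansExample").getOrCreate()

# Small synthetic dataset: two obvious groups of 2-D points.
points = [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([0.5, 0.5]),),
          (Vectors.dense([9.0, 9.0]),), (Vectors.dense([9.5, 8.5]),)]
df = spark.createDataFrame(points, ["features"])

kmeans = KMeans(k=2, seed=42)          # group the points into 2 clusters
model = kmeans.fit(df)
model.transform(df).show()             # adds a "prediction" column with the cluster id

spark.stop()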
Key Features of Spark MLlib
Scalability: MLlib algorithms are optimized for distributed data processing, making them suitable for
big data.
Versatility: Covers a range of supervised and unsupervised learning algorithms.
Compatibility: Works seamlessly with Spark SQL and DataFrames for streamlined data handling.
Practical Use Case with Spark MLlib
Setting up Spark and MLlib
Installation: Instructions on setting up Spark on local and cloud environments.
Loading Data: Demonstrates how to load and work with structured data in Spark DataFrames for ML
tasks.
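A minimal loading sketch is shown below; titanic.csv is a hypothetical file path, and the header and schema-inference options are assumptions about its layout:

from pyspark.sql import SparkSession

# Sketch: load structured data into a DataFrame for ML tasks.
spark = SparkSession.builder.appName("LoadData").getOrCreate()

df = spark.read.csv("titanic.csv",     # hypothetical file path
                    header=True,       # first line contains column names
                    inferSchema=True)  # infer numeric/string column types

df.printSchema()
df.show(5)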
Example Use Case: Building a Simple Classification Model
Dataset: Use a sample dataset, such as the Titanic or Iris dataset, for illustration.
Steps:
o Load and preprocess the dataset, including steps like handling missing values and scaling features.
o Split the data into training and testing sets.
o Train a classifier (e.g., Logistic Regression) and evaluate the model.
Code Example: A PySpark example illustrating each step with sample code for building and
evaluating a model.
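The following sketch walks through those steps with Logistic Regression on a Titanic-style dataset; the file path and column names (Survived, Pclass, Age, Fare) are assumptions for illustration rather than a definitive implementation:

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("TitanicClassifier").getOrCreate()

# 1. Load and preprocess (titanic.csv and its column names are assumed).
df = spark.read.csv("titanic.csv", header=True, inferSchema=True)
df = df.select("Survived", "Pclass", "Age", "Fare")
df = df.na.fill({"Age": 30.0}).na.drop()            # handle missing values

# Assemble and scale the numeric features.
assembler = VectorAssembler(inputCols=["Pclass", "Age", "Fare"], outputCol="rawFeatures")
scaler = StandardScaler(inputCol="rawFeatures", outputCol="features")
df = assembler.transform(df)
df = scaler.fit(df).transform(df)

# 2. Split into training and testing sets.
train, test = df.randomSplit([0.8, 0.2], seed=42)

# 3. Train a logistic regression classifier and evaluate it.
lr = LogisticRegression(featuresCol="features", labelCol="Survived")
model = lr.fit(train)
predictions = model.transform(test)

evaluator = BinaryClassificationEvaluator(labelCol="Survived", metricName="areaUnderROC")
print("Test AUC:", evaluator.evaluate(predictions))

spark.stop()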
Summary and Future of Spark MLlib
Benefits: Discusses the advantages of using Spark and MLlib in big data and machine learning
applications.
Future Developments: Potential advancements in MLlib’s ecosystem, such as the addition of more
algorithms, enhanced model interpretability, and tighter integration with emerging big data tools.
Core Features of Spark MLlib
Scalability and Distributed Processing
MLlib is built on Spark's distributed architecture, allowing it to process terabytes of data efficiently
across multiple nodes in a cluster. This enables the handling of large-scale machine learning
operations that would be challenging with traditional libraries.
Integration with Spark Ecosystem
MLlib integrates seamlessly with Spark SQL, Spark Streaming, and other Spark components. This integration lets MLlib use Spark DataFrames for structured data handling and the DataFrame API for streamlined data processing.
Compatibility with Spark SQL means users can apply SQL queries directly to data being processed
in MLlib, which is particularly useful in data exploration and preprocessing stages.
ML Pipelines
MLlib provides a high-level API for building machine learning pipelines, which consist of a sequence of stages. Each stage can be a data transformation (e.g., feature scaling) or an algorithm (e.g., logistic regression).
Pipeline components include Transformers (data transformation steps such as normalization) and Estimators (algorithms that are fitted to data, such as machine learning models).
Pipelines make it easy to automate the workflow, from data preprocessing to model training,
evaluation, and tuning.
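Below is a minimal pipeline sketch on a made-up DataFrame, chaining a StringIndexer, VectorAssembler, and StandardScaler with a Logistic Regression stage; the column names and values are invented for illustration:

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("PipelineExample").getOrCreate()

# Tiny made-up dataset: a label plus one categorical and one numeric column.
df = spark.createDataFrame(
    [(1.0, "a", 2.3), (0.0, "b", 1.1), (1.0, "a", 3.3), (0.0, "b", 0.5)],
    ["label", "category", "amount"])

# Stages: indexing, assembling, and scaling, followed by the model.
indexer = StringIndexer(inputCol="category", outputCol="categoryIdx")
assembler = VectorAssembler(inputCols=["categoryIdx", "amount"], outputCol="rawFeatures")
scaler = StandardScaler(inputCol="rawFeatures", outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=[indexer, assembler, scaler, lr])
model = pipeline.fit(df)            # fits every stage in sequence
model.transform(df).select("label", "prediction").show()

spark.stop()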
Distributed Algorithms and Data Transformation Tools
MLlib includes a range of distributed algorithms, from basic statistical analysis to advanced
machine learning models, all designed for big data.
Feature transformation and selection tools, such as one-hot encoding, feature scaling, and PCA
(Principal Component Analysis), help in preparing the data for machine learning tasks.
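As a closing illustration, the sketch below applies one-hot encoding, feature scaling, and PCA to a tiny made-up DataFrame; it assumes the Spark 3.x OneHotEncoder API, and the column names are invented for the example:

from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler, StandardScaler, PCA

spark = SparkSession.builder.appName("FeatureTransforms").getOrCreate()

# Made-up dataset: one categorical and two numeric columns.
df = spark.createDataFrame(
    [("red", 1.0, 10.0), ("blue", 2.0, 20.0), ("red", 3.0, 30.0)],
    ["color", "x1", "x2"])

# One-hot encode the categorical column (index it first).
indexer = StringIndexer(inputCol="color", outputCol="colorIdx")
encoder = OneHotEncoder(inputCols=["colorIdx"], outputCols=["colorVec"])

# Assemble, scale, then reduce dimensionality with PCA.
assembler = VectorAssembler(inputCols=["colorVec", "x1", "x2"], outputCol="rawFeatures")
scaler = StandardScaler(inputCol="rawFeatures", outputCol="scaledFeatures")
pca = PCA(k=2, inputCol="scaledFeatures", outputCol="pcaFeatures")

df = indexer.fit(df).transform(df)
df = encoder.fit(df).transform(df)
df = assembler.transform(df)
df = scaler.fit(df).transform(df)
df = pca.fit(df).transform(df)
df.select("pcaFeatures").show(truncate=False)

spark.stop()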