AMC ENGINEERING COLLEGE
Dept. Of Computer Science and Engineering
Big Data Analytics [21CS71] Assignment-1
Team-4 Topic: Spark and Spark MLlib Kalyan G V (1AM21CS077)
Introduction to Apache Spark
1) What is Apache Spark?
Overview: Apache Spark is a fast, open-source distributed computing system that simplifies big data
processing. Designed as a unified engine, Spark supports various data processing needs like batch
processing, real-time processing, and analytics.
Key Features:
o Speed: Optimized for both in-memory and disk-based data processing.
o Ease of Use: Supports Java, Python, Scala, and R with high-level APIs.
o Versatility: Suits multiple big data use cases, such as ETL, data streaming, machine learning,
and graph processing.
2) Spark Architecture
Spark follows a master-slave architecture: a cluster consists of a single master and multiple slave (worker) nodes.
The Spark architecture depends upon two abstractions:
o Resilient Distributed Dataset (RDD)
o Directed Acyclic Graph (DAG)
Resilient Distributed Datasets (RDD)
A Resilient Distributed Dataset is a collection of data items that can be stored in memory on the worker nodes. Here,
o Resilient: able to restore the data on failure.
o Distributed: the data is distributed among different nodes.
o Dataset: a group of data items.
Directed Acyclic Graph (DAG)
A Directed Acyclic Graph is a finite directed graph that represents the sequence of computations to be performed on the data. Each node is an RDD partition, and each edge is a transformation applied on top of the data. The graph is directed because computation flows in one direction from node to node, and acyclic because that flow never loops back on itself. A small sketch of this is shown below.
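As an illustration, here is a minimal PySpark sketch (assuming a local SparkContext) of how transformations build up an RDD lineage, the DAG, which is only executed once an action such as collect() is called:

from pyspark import SparkContext

# Minimal sketch: transformations (map, filter) only extend the DAG;
# nothing is computed until an action such as collect() is called.
sc = SparkContext("local[*]", "RDDExample")

numbers = sc.parallelize(range(1, 11), numSlices=4)   # RDD split into 4 partitions
squares = numbers.map(lambda x: x * x)                # transformation (lazy)
evens = squares.filter(lambda x: x % 2 == 0)          # transformation (lazy)

print(evens.collect())   # action: triggers execution of the whole lineage
sc.stop()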
Let's understand the Spark architecture.
Driver Program
The Driver Program is the process that runs the main() function of the application and creates the SparkContext object. The purpose of the SparkContext is to coordinate the Spark application, which runs as an independent set of processes on the cluster.
To run on a cluster, the SparkContext connects to one of several types of cluster managers and then performs the following tasks:
o It acquires executors on nodes in the cluster.
o Then, it sends your application code to the executors. Here, the application code can be JAR or Python files passed to the SparkContext.
o Finally, the SparkContext sends tasks to the executors to run.
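A minimal sketch of such a driver program is shown below; the master URL ("local[*]") and the tiny RDD job are placeholders chosen only for illustration:

from pyspark.sql import SparkSession

# Minimal driver-program sketch: main() creates the SparkSession, whose
# underlying SparkContext negotiates executors with the cluster manager.
def main():
    spark = (SparkSession.builder
             .appName("DriverExample")
             .master("local[*]")        # or e.g. "spark://host:7077", "yarn"
             .getOrCreate())
    sc = spark.sparkContext            # the SparkContext described above

    rdd = sc.parallelize([1, 2, 3, 4])
    print(rdd.sum())                   # tasks are sent to the executors to run

    spark.stop()

if __name__ == "__main__":
    main()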
Cluster Manager
o The role of the cluster manager is to allocate resources across applications. Spark can run on a wide range of cluster configurations.
o Supported cluster managers include Hadoop YARN, Apache Mesos, and the Standalone Scheduler.
o The Standalone Scheduler is Spark's built-in cluster manager, which makes it easy to install Spark on an otherwise empty set of machines.
Worker Node
o The worker node is a slave node.
o Its role is to run the application code in the cluster.
Executor
o An executor is a process launched for an application on a worker node.
o It runs tasks and keeps data in memory or disk storage across them.
o It reads and writes data to and from external sources.
o Every application has its own executors.
Task
o A unit of work that will be sent to one executor.
Spark Components
The Spark project consists of several tightly integrated components. At its core, Spark is a computational engine that can schedule, distribute, and monitor multiple applications.
Let's understand each Spark component in detail.
Spark Core
o The Spark Core is the heart of Spark and performs the core functionality.
o It holds the components for task scheduling, fault recovery, interacting with storage systems and
memory management.
Spark SQL
o Spark SQL is built on top of Spark Core. It provides support for structured data.
o It allows the data to be queried via SQL (Structured Query Language) as well as the Apache Hive variant of SQL called HQL (Hive Query Language).
o It supports JDBC and ODBC connections, which let Spark work with existing databases, data warehouses, and business intelligence tools.
o It also supports various sources of data like Hive tables, Parquet, and JSON.
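The following is a minimal PySpark sketch of Spark SQL in use; people.json is a hypothetical input file assumed to contain name and age fields:

from pyspark.sql import SparkSession

# Sketch: Spark SQL over a JSON source (people.json is a hypothetical file).
spark = SparkSession.builder.appName("SparkSQLExample").getOrCreate()

df = spark.read.json("people.json")          # JSON is one of the supported sources
df.createOrReplaceTempView("people")         # register the DataFrame as a SQL table

adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()

spark.stop()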
Spark Streaming
o Spark Streaming is a Spark component that supports scalable and fault-tolerant processing of
streaming data.
o It uses Spark Core's fast scheduling capability to perform streaming analytics.
o It accepts data in mini-batches and performs RDD transformations on that data.
o Its design ensures that the applications written for streaming data can be reused to analyze batches of
historical data with little modification.
o The log files generated by web servers are a real-world example of a data stream.
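Below is a sketch of the classic DStream word count, assuming a text source on localhost port 9999 (for example, one started with nc -lk 9999); each 5-second mini-batch is processed with ordinary RDD transformations:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Sketch: streaming word count over a socket source (assumed on localhost:9999).
sc = SparkContext("local[2]", "StreamingWordCount")
ssc = StreamingContext(sc, batchDuration=5)   # 5-second mini-batches

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                               # print each mini-batch's counts

ssc.start()
ssc.awaitTermination()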
MLlib
o MLlib is Spark's machine learning library and contains various machine learning algorithms.
o These include correlations and hypothesis testing, classification and regression, clustering, and
principal component analysis.
o It is reported to be about nine times faster than the disk-based implementation used by Apache Mahout.
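As a small illustration of the statistics side of MLlib, the sketch below computes a Pearson correlation matrix over a tiny made-up set of feature vectors:

from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import Correlation

spark = SparkSession.builder.appName("MLlibStats").getOrCreate()

# Tiny made-up dataset of feature vectors.
data = [(Vectors.dense([1.0, 2.0, 3.0]),),
        (Vectors.dense([2.0, 4.0, 6.5]),),
        (Vectors.dense([3.0, 6.1, 9.0]),)]
df = spark.createDataFrame(data, ["features"])

# Pearson correlation matrix across the feature dimensions.
matrix = Correlation.corr(df, "features").head()[0]
print(matrix)

spark.stop()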
GraphX
o GraphX is a library used to manipulate graphs and perform graph-parallel computations.
o It makes it possible to create a directed graph with arbitrary properties attached to each vertex and edge.
o To manipulate graphs, it supports fundamental operators such as subgraph, joinVertices, and aggregateMessages.
Introduction to Spark MLlib
What is Spark MLlib?
Overview: Spark MLlib is a library of machine learning algorithms optimized for large-scale data
processing on distributed systems.
Goals: Simplifies the process of developing and deploying machine learning models on big data, and
provides easy integration with other Spark modules.
MLlib Algorithms
Categories:
o Classification: Algorithms for tasks that predict categorical outcomes (e.g., Decision Trees,
Logistic Regression).
o Regression: Used for predicting continuous values (e.g., Linear Regression).
o Clustering: Groups similar data points into clusters (e.g., K-Means).
o Collaborative Filtering: Builds recommendation systems (e.g., Alternating Least Squares).
Features: Supports basic statistical operations, feature engineering, and model selection tools.
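To illustrate the clustering category listed above, here is a minimal PySpark sketch that groups a tiny synthetic set of 2-D points into two clusters with K-Means (the data points are made up for illustration):

from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("KMeansExample").getOrCreate()

# Small synthetic dataset: two obvious groups of 2-D points.
points = [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([0.5, 0.5]),),
          (Vectors.dense([9.0, 9.0]),), (Vectors.dense([9.5, 8.5]),)]
df = spark.createDataFrame(points, ["features"])

kmeans = KMeans(k=2, seed=42)          # group the points into 2 clusters
model = kmeans.fit(df)
model.transform(df).show()             # adds a "prediction" column with the cluster id

spark.stop()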
Key Features of Spark MLlib
Scalability: MLlib algorithms are optimized for distributed data processing, making them suitable for
big data.
Versatility: Covers a range of supervised and unsupervised learning algorithms.
Compatibility: Works seamlessly with Spark SQL and DataFrames for streamlined data handling.
Practical Use Case with Spark MLlib
Setting up Spark and MLlib
Installation: Instructions on setting up Spark on local and cloud environments.
Loading Data: Demonstrates how to load and work with structured data in Spark DataFrames for ML
tasks.
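A minimal loading sketch is shown below; titanic.csv is a hypothetical file path, and the header and schema-inference options are assumptions about its layout:

from pyspark.sql import SparkSession

# Sketch: load structured data into a DataFrame for ML tasks.
spark = SparkSession.builder.appName("LoadData").getOrCreate()

df = spark.read.csv("titanic.csv",     # hypothetical file path
                    header=True,       # first line contains column names
                    inferSchema=True)  # infer numeric/string column types

df.printSchema()
df.show(5)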
Example Use Case: Building a Simple Classification Model
Dataset: Use a sample dataset, such as the Titanic or Iris dataset, for illustration.
Steps:
o Load and preprocess the dataset, including steps like handling missing values and scaling features.
o Split the data into training and testing sets.
o Train a classifier (e.g., Logistic Regression) and evaluate the model.
Code Example: A PySpark example illustrating each step with sample code for building and
evaluating a model.
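The following sketch walks through those steps with Logistic Regression on a Titanic-style dataset; the file path and column names (Survived, Pclass, Age, Fare) are assumptions for illustration rather than a definitive implementation:

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("TitanicClassifier").getOrCreate()

# 1. Load and preprocess (titanic.csv and its column names are assumed).
df = spark.read.csv("titanic.csv", header=True, inferSchema=True)
df = df.select("Survived", "Pclass", "Age", "Fare")
df = df.na.fill({"Age": 30.0}).na.drop()            # handle missing values

# Assemble and scale the numeric features.
assembler = VectorAssembler(inputCols=["Pclass", "Age", "Fare"], outputCol="rawFeatures")
scaler = StandardScaler(inputCol="rawFeatures", outputCol="features")
df = assembler.transform(df)
df = scaler.fit(df).transform(df)

# 2. Split into training and testing sets.
train, test = df.randomSplit([0.8, 0.2], seed=42)

# 3. Train a logistic regression classifier and evaluate it.
lr = LogisticRegression(featuresCol="features", labelCol="Survived")
model = lr.fit(train)
predictions = model.transform(test)

evaluator = BinaryClassificationEvaluator(labelCol="Survived", metricName="areaUnderROC")
print("Test AUC:", evaluator.evaluate(predictions))

spark.stop()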
Summary and Future of Spark MLlib
Benefits: Discusses the advantages of using Spark and MLlib in big data and machine learning
applications.
Future Developments: Potential advancements in MLlib’s ecosystem, such as the addition of more
algorithms, enhanced model interpretability, and tighter integration with emerging big data tools.
Core Features of Spark MLlib
Scalability and Distributed Processing
MLlib is built on Spark's distributed architecture, allowing it to process terabytes of data efficiently
across multiple nodes in a cluster. This enables the handling of large-scale machine learning
operations that would be challenging with traditional libraries.
Integration with Spark Ecosystem
MLlib integrates seamlessly with Spark SQL, Spark Streaming, and other Spark components. This integration lets MLlib use Spark DataFrames for structured data handling and the DataFrame API for streamlined data processing.
Compatibility with Spark SQL means users can apply SQL queries directly to data being processed
in MLlib, which is particularly useful in data exploration and preprocessing stages.
ML Pipelines
MLlib provides a high-level API for building machine learning pipelines, which consist of a sequence of stages. Each stage can be a data transformation (e.g., feature scaling) or an algorithm (e.g., logistic regression).
Pipeline components include Transformers (data transformation steps such as normalization) and Estimators (algorithms that are fitted to data, such as machine learning models).
Pipelines make it easy to automate the workflow, from data preprocessing to model training,
evaluation, and tuning.
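Below is a minimal pipeline sketch on a made-up DataFrame, chaining a StringIndexer, VectorAssembler, and StandardScaler with a Logistic Regression stage; the column names and values are invented for illustration:

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("PipelineExample").getOrCreate()

# Tiny made-up dataset: a label plus one categorical and one numeric column.
df = spark.createDataFrame(
    [(1.0, "a", 2.3), (0.0, "b", 1.1), (1.0, "a", 3.3), (0.0, "b", 0.5)],
    ["label", "category", "amount"])

# Stages: indexing, assembling, and scaling, followed by the model.
indexer = StringIndexer(inputCol="category", outputCol="categoryIdx")
assembler = VectorAssembler(inputCols=["categoryIdx", "amount"], outputCol="rawFeatures")
scaler = StandardScaler(inputCol="rawFeatures", outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=[indexer, assembler, scaler, lr])
model = pipeline.fit(df)            # fits every stage in sequence
model.transform(df).select("label", "prediction").show()

spark.stop()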
Distributed Algorithms and Data Transformation Tools
MLlib includes a range of distributed algorithms, from basic statistical analysis to advanced
machine learning models, all designed for big data.
Feature transformation and selection tools, such as one-hot encoding, feature scaling, and PCA
(Principal Component Analysis), help in preparing the data for machine learning tasks.
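As a closing illustration, the sketch below applies one-hot encoding, feature scaling, and PCA to a tiny made-up DataFrame; it assumes the Spark 3.x OneHotEncoder API, and the column names are invented for the example:

from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler, StandardScaler, PCA

spark = SparkSession.builder.appName("FeatureTransforms").getOrCreate()

# Made-up dataset: one categorical and two numeric columns.
df = spark.createDataFrame(
    [("red", 1.0, 10.0), ("blue", 2.0, 20.0), ("red", 3.0, 30.0)],
    ["color", "x1", "x2"])

# One-hot encode the categorical column (index it first).
indexer = StringIndexer(inputCol="color", outputCol="colorIdx")
encoder = OneHotEncoder(inputCols=["colorIdx"], outputCols=["colorVec"])

# Assemble, scale, then reduce dimensionality with PCA.
assembler = VectorAssembler(inputCols=["colorVec", "x1", "x2"], outputCol="rawFeatures")
scaler = StandardScaler(inputCol="rawFeatures", outputCol="scaledFeatures")
pca = PCA(k=2, inputCol="scaledFeatures", outputCol="pcaFeatures")

df = indexer.fit(df).transform(df)
df = encoder.fit(df).transform(df)
df = assembler.transform(df)
df = scaler.fit(df).transform(df)
df = pca.fit(df).transform(df)
df.select("pcaFeatures").show(truncate=False)

spark.stop()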