Apache Spark vs Hadoop: Key Features & Differences
Spark

Features, Components, and Differences between Hadoop and Spark


Batch vs Real-Time Processing
Limitations of MapReduce in Hadoop

1. Since MapReduce is suitable only for batch processing jobs, implementing interactive
jobs and models becomes impossible.

2. Implementing iterative MapReduce jobs is expensive due to the huge space consumed
by each job.

3. Joining two large data sets with complex conditions is difficult.

4. Processing graphs is inefficient.

5. It is unfit for processing large data over a network.
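To make the batch model concrete, here is a toy word count written in the MapReduce style in plain Python (no Hadoop required; the documents are made-up sample data). Each job runs the full map → shuffle/sort → reduce cycle, which is why chaining many such jobs iteratively is expensive:

```python
from itertools import groupby
from operator import itemgetter

# Toy word count in the MapReduce style (plain Python, no Hadoop).
docs = ["spark is fast", "hadoop is batch", "spark extends hadoop"]

# Map phase: emit a (word, 1) pair for every word in every document.
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle/sort phase: bring pairs with the same key (word) together.
mapped.sort(key=itemgetter(0))

# Reduce phase: sum the counts for each word.
counts = {word: sum(c for _, c in group)
          for word, group in groupby(mapped, key=itemgetter(0))}

print(counts["spark"])  # → 2
```

In real Hadoop, the output of each such job is written back to disk (HDFS) before the next job can read it, which is the root of the iterative-job cost noted above.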


Evolution of Apache Spark

● Spark is one of Hadoop's sub-projects, developed in 2009 in UC Berkeley's AMPLab
by Matei Zaharia.
● It was open-sourced in 2010 under a BSD license, donated to the Apache Software
Foundation in 2013, and became a top-level Apache project in February 2014.
Spark
● Apache Spark is a fast, general-purpose cluster computing technology.
● It is based on Hadoop MapReduce and extends the MapReduce model to efficiently
support more types of computation, including interactive queries and stream
processing.
● The main feature of Spark is its in-memory cluster computing, which increases the
processing speed of an application.
● Spark can use Hadoop in two ways – for storage and for processing. Since Spark
has its own cluster management, it typically uses Hadoop for storage only.
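The in-memory idea above can be sketched in plain Python – a minimal, hypothetical illustration (not Spark code) of why caching a dataset in memory beats recomputing it on every iteration, as a disk-based MapReduce pipeline effectively does:

```python
# Sketch: compute a derived dataset once, keep it in memory, and reuse
# it across iterations, instead of rebuilding it each time.
raw = list(range(100_000))

recompute_count = 0

def expensive_transform(data):
    # Stand-in for a costly transformation (parsing, joining, ...).
    global recompute_count
    recompute_count += 1
    return [x * 2 for x in data]

# Without caching: the transform runs once per iteration.
for _ in range(3):
    total = sum(expensive_transform(raw))
assert recompute_count == 3

# With caching (in Spark: rdd.cache()): the transform runs once.
recompute_count = 0
cached = expensive_transform(raw)   # materialized in memory
for _ in range(3):
    total = sum(cached)
assert recompute_count == 1
```

In Spark the same effect is achieved by calling `cache()` or `persist()` on an RDD, so that iterative algorithms reuse the in-memory result rather than re-reading from disk.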
Features of Apache Spark

Apache Spark has the following features.

● Speed − Spark runs applications in a Hadoop cluster up to 100 times faster in
memory, and 10 times faster when running on disk.

● Supports multiple languages − Spark provides built-in APIs in Java, Scala, and Python,
so you can write applications in different languages.

● Advanced analytics − Spark supports not only 'map' and 'reduce' but also SQL
queries, streaming data, machine learning (ML), and graph algorithms.
Components of Spark
Apache Spark Core
Spark Core is the underlying general execution engine for the Spark platform upon which all other
functionality is built. It provides in-memory computing and the ability to reference datasets in external
storage systems.

Spark SQL
Spark SQL is a component on top of Spark Core that introduces a new data abstraction called SchemaRDD,
which provides support for structured and semi-structured data.

Spark Streaming
Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming analytics. It ingests
data in mini-batches and performs RDD (Resilient Distributed Datasets) transformations on those mini-batches
of data.

Spark uses micro-batching for real-time streaming.

Micro-batching is a technique that lets a process or task treat a stream as a sequence of small
batches of data.
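A minimal plain-Python sketch of micro-batching (the event stream and batch size are made-up examples; Spark Streaming would instead receive records from a real source such as Kafka or a socket):

```python
def micro_batches(stream, batch_size):
    """Group an (unbounded) stream into small fixed-size batches."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch

# A toy event stream standing in for a live feed.
events = ["click", "view", "click", "buy", "view", "click", "view"]

# Process the stream as mini-batches of 3, DStream-style:
# each batch is handled by ordinary batch logic.
running_clicks = 0
for batch in micro_batches(events, 3):
    running_clicks += batch.count("click")

print(running_clicks)  # → 3
```

This is the core trade-off of micro-batching: each mini-batch is processed with ordinary batch machinery, giving near-real-time results at the cost of per-batch latency.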
MLlib (Machine Learning Library)
MLlib is a distributed machine learning framework on top of Spark. Because of Spark's distributed
memory-based architecture, MLlib is as much as nine times as fast as the Hadoop disk-based version
of Apache Mahout.

GraphX
GraphX is a distributed graph-processing framework on top of Spark. It provides an API for expressing
graph computations that can model user-defined graphs using the Pregel abstraction API.
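The Pregel model can be sketched in plain Python as a superstep loop in which each vertex exchanges messages with its neighbors until values stop changing. Here is a toy connected-components computation on a made-up graph (not actual GraphX code):

```python
# Pregel-style supersteps: each vertex holds a value, receives messages
# from its neighbors, and updates until nothing changes. Every vertex
# converges to the smallest vertex id in its connected component.
edges = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
value = {v: v for v in edges}           # initial vertex value = own id

changed = True
while changed:                          # one loop iteration = one superstep
    changed = False
    # Each vertex sends its current value to all of its neighbors...
    inbox = {v: [value[u] for u in edges[v]] for v in edges}
    # ...and keeps the minimum of its own value and incoming messages.
    for v, msgs in inbox.items():
        new = min([value[v]] + msgs)
        if new != value[v]:
            value[v] = new
            changed = True

print(value)  # → {1: 1, 2: 1, 3: 1, 4: 4, 5: 4}
```

GraphX's `Pregel` operator follows the same pattern at cluster scale: a vertex program, a message-sending function, and a message-combining function, iterated until no messages remain.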
Differences between Hadoop and Spark
1. Hadoop: an open-source framework that uses the MapReduce algorithm.
   Spark: a lightning-fast cluster computing technology that extends the MapReduce
   model to efficiently support more types of computation.

2. Hadoop: the MapReduce model reads from and writes to disk, which slows down
   processing.
   Spark: reduces the number of read/write cycles to disk and stores intermediate
   data in memory, hence its faster processing speed.

3. Hadoop: designed to handle batch processing efficiently.
   Spark: designed to handle real-time data efficiently.

5. Hadoop: with MapReduce, a developer can process data only in batch mode.
   Spark: can process real-time data from real-time events such as Twitter and
   Facebook.

6. Hadoop: the cheaper option in terms of cost.
   Spark: requires a lot of RAM to run in memory, which increases cluster size and
   hence cost.
