2015
Apache Spark, a recent big data processing framework, caches data in memory and then processes it. Spark creates Resilient Distributed Datasets (RDDs) from data and caches them in memory. Although Spark is popular for its performance in iterative applications, its performance can be limited by several factors, one of which is disk access time. This paper incorporates approaches for improving Spark's performance by reducing its disk access time. Keywords: Apache Spark, Caching, Prefetching.
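The core idea behind hiding disk access time — fetch the next blocks in the background while the current one is being processed — can be illustrated with a toy prefetcher. This is a pure-Python sketch, not the paper's implementation; the block payloads and the `read_block` callback are invented for illustration:

```python
import threading
import queue

def prefetching_reader(blocks, read_block, depth=2):
    """Simulate a prefetching reader: a background thread fetches the
    next blocks from "disk" while the caller processes the current one."""
    buf = queue.Queue(maxsize=depth)  # bounded look-ahead window

    def producer():
        for b in blocks:
            buf.put(read_block(b))    # fetch ahead of the consumer
        buf.put(None)                 # sentinel: no more blocks

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buf.get()
        if item is None:
            break
        yield item

# Pretend reading a block from disk returns its payload.
disk = {0: b"spark", 1: b"caches", 2: b"rdds"}
payloads = list(prefetching_reader(sorted(disk), disk.__getitem__))
```

The bounded queue keeps the prefetcher only `depth` blocks ahead, so memory stays capped while I/O overlaps computation.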
2020
In the era of Big Data, processing large amounts of data through data-intensive applications presents a challenge. Apache Spark, an in-memory distributed computing system, is often used to speed up big data applications. It caches intermediate data in memory, so there is no need to repeat computation or reload data from disk when that data is reused later. This mechanism of caching data in memory makes Apache Spark much faster than other systems. When the memory used for caching is full, the cache replacement policy used by Apache Spark is Least Recently Used (LRU); however, the LRU algorithm performs poorly on some workloads. This review gives an insight into the different replacement algorithms proposed to address LRU's problems, categorizes their selection factors, and compares the algorithms in terms of selection factors, performance, and the benchmarks used in the research.
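The LRU policy this review critiques can be sketched in a few lines — an illustration of the eviction rule itself, not Spark's actual BlockManager code; the cached keys and values are invented:

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache: evict the least recently used entry when full."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)         # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used

cache = LRUCache(2)
cache.put("rdd1", "partition data 1")
cache.put("rdd2", "partition data 2")
cache.get("rdd1")                      # rdd1 becomes most recently used
cache.put("rdd3", "partition data 3")  # evicts rdd2, not rdd1
```

The weakness the review surveys is visible here: recency alone decides eviction, so a block that is cheap to recompute but recently touched survives while an expensive-to-recompute block may be dropped.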
Journal of Big Data
Many applications generate and handle very large volumes of data: social networking, cloud applications, public web sites, search engines, scientific simulations, data warehouses and so on. These large volumes of data come from a variety of sources and comprise both unstructured and structured data. In order to efficiently transform this massive data of various types into valuable information and meaningful knowledge, we need large-scale cluster infrastructures. In this context, one challenging problem is to realize effective resource management of these large-scale cluster infrastructures in order to run distributed data analytics.
2020 IEEE International Conference on Big Data (Big Data)
Apache Spark is a distributed computing framework used for big data processing. A common pattern in many Spark applications is to iteratively evolve a dataset until reaching some user-specified convergence condition. Unfortunately, some aspects of Spark's execution model make it difficult for developers who are not familiar with the implementation-level details of Spark to write efficient iterative programs. Since results are constructed iteratively and results from previous iterations may be used multiple times, effective use of caching is necessary to avoid recomputing intermediate results. Currently, developers of Spark applications must manually indicate which intermediate results should be cached. We present a method for using metadata already captured by Spark to automate caching decisions for many Spark programs. We show how this allows Spark applications to benefit from caching without the need for manual caching annotations.
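The core heuristic — cache any intermediate result that is consumed by more than one downstream operation — can be sketched as a reference count over the lineage graph. This is a simplification of the paper's metadata-driven method; the graph shape and node names are invented:

```python
from collections import Counter

def auto_cache_candidates(dag):
    """Given a DAG as {node: [parent, ...]}, mark for caching every
    intermediate node consumed by more than one downstream operation."""
    uses = Counter(p for parents in dag.values() for p in parents)
    return {node for node, n in uses.items() if n > 1}

# Hypothetical lineage: 'base' feeds both branches of an iterative job,
# so recomputing it twice is avoided by caching it once.
dag = {
    "base":   [],
    "mapped": ["base"],
    "stats":  ["base"],
    "result": ["mapped", "stats"],
}
candidates = auto_cache_candidates(dag)
```

In a real system the decision would also weigh recomputation cost and partition size, which this sketch deliberately omits.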
While cluster computing frameworks are continuously evolving to provide real-time data analysis capabilities, Apache Spark has managed to stay at the forefront of big data analytics by being a unified framework for both batch and stream data processing. However, recent studies on micro-architectural characterization of in-memory data analytics are limited to batch processing workloads only. We compare the micro-architectural performance of batch processing and stream processing workloads in Apache Spark using hardware performance counters on a dual-socket server. In our evaluation experiments, we found that while batch processing workloads are bound by the latency of frequent data accesses to DRAM, stream processing workloads are curbed by L1 instruction cache misses. For data accesses we found that simultaneous multi-threading is effective in hiding the data latencies. We also observed that (i) data locality on NUMA nodes can improve performance by up to 12%, (ii) disabling next-line L1-D prefetchers can reduce execution time by up to 15%, and (iii) multiple small executors can provide up to 36% speedup over a single large executor.
IRJET, 2020
In the recent past, Apache Spark has become the most popular Big Data analytics framework, having overtaken MapReduce-based Apache Hive thanks to the edge offered by Spark's in-memory computation. The key obstacles for Big Data analytics are operating on tremendous volumes of data, managing wide variations in data, and high-speed data processing. Spark provides default configurations which were evaluated on low-capability hardware and are not optimal for specific types of data and the computations performed on them. Fine-grained control through the various performance parameters is essential to leverage Spark's maximum capabilities. Tuning parameters to the available hardware resources takes the highest precedence in achieving optimal performance for Spark applications. However, an in-depth understanding of the impact of these performance parameters is lacking. This paper discusses in detail the various parameters, such as the number of executors, memory persistence levels, caching, broadcasting, serialization, compression, repartitioning, and network parameters, that can be tuned to enhance the efficiency of Spark applications tailored to the data being handled and the execution environment.
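The kinds of knobs this paper surveys are usually passed as `--conf` key-value pairs to `spark-submit`. As a hedged sketch: the configuration keys below are standard Spark property names, but the values are placeholders to be fitted to the actual cluster and workload, not recommendations from the paper:

```python
# Hypothetical tuning profile: real Spark configuration keys,
# illustrative values.
tuning = {
    "spark.executor.instances":     "8",
    "spark.executor.memory":        "8g",
    "spark.executor.cores":         "4",
    "spark.serializer":             "org.apache.spark.serializer.KryoSerializer",
    "spark.rdd.compress":           "true",
    "spark.sql.shuffle.partitions": "200",
}

def to_submit_args(conf):
    """Render a configuration dict as spark-submit --conf arguments."""
    return [arg for k in sorted(conf) for arg in ("--conf", f"{k}={conf[k]}")]

args = to_submit_args(tuning)
```

The same pairs could equally live in `spark-defaults.conf`; the point is that each knob the paper discusses maps to one such property.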
Hadoop is a very popular general-purpose framework for many different classes of data-intensive applications. However, it is not well suited to iterative operations because of the cost paid to reload data from disk at each iteration. As an emerging framework designed with a global cache mechanism, Spark can achieve better response times, since in-memory access across the distributed machines of a cluster persists throughout the entire iterative process. Although Spark's time performance has been evaluated against Hadoop [1], memory consumption, another system performance criterion, has not been deeply analyzed in the literature. In this work, we conducted extensive experiments on iterative operations to compare Hadoop and Spark in both time and memory cost. We found that although Spark is in general faster than Hadoop for iterative operations, it pays for this with higher memory consumption. Also, its speed advantage weakens when memory is insufficient to store newly created intermediate results.
Big Data is the name now used ubiquitously for the distributed paradigm on the web. As the name points out, it covers collections of very large data sets, in petabytes, exabytes and beyond, as well as the systems and algorithms used to analyze this enormous data. Hadoop, as a big data processing technology, has proven to be the go-to solution for processing enormous data sets. MapReduce is a conspicuous solution for computations that require a single pass to complete, but it is not efficient for use cases that need multi-pass computations and algorithms. The job output data between every stage has to be stored in the file system before the next stage can begin; consequently, this method is slow due to replication and disk input/output operations. Additionally, the Hadoop ecosystem does not have every component needed to complete a big data use case. If we want to run an iterative job, we have to stitch together a sequence of MapReduce jobs and execute them in order; each of these jobs has high latency, and each depends upon the completion of the previous stage. Apache Spark is one of the most widely used open source processing engines for big data, with rich language-integrated APIs and an extensive range of libraries. It is a general framework for distributed computing that offers high performance for both batch and interactive processing. In this paper, we aim to give a close-up view of Apache Spark and its features, and of working with Spark using Hadoop. In a nutshell, we discuss Resilient Distributed Datasets (RDDs), RDD operations, features, and limitations. Spark can be used alongside MapReduce in the same Hadoop cluster or on its own as a processing framework. The paper closes with a comparative analysis between Spark and Hadoop MapReduce.
Studies in Big Data, 2018
The rapidly growing human genomics data driven by advances in sequencing technologies demands fast and cost-effective processing. However, processing this data brings some challenges, particularly in selecting appropriate algorithms and computing platforms. Computing systems need data close to the processor for fast processing. Previously, due to the cost, volatility and other physical constraints of DRAM, it was not feasible to place large working data sets in memory. However, new emerging storage-class memories allow storing and processing big data closer to the processor. In this work, we show how a commonly used genomics data format, Sequence Alignment/Map (SAM), can be represented in the Apache Arrow in-memory data format to benefit from in-memory processing and to ensure better scalability through the shared-memory Plasma Object Store, avoiding huge (de)serialization overheads in cross-language interoperability. To demonstrate the benefits of such a system, we present an in-memory SAM representation, which we call ArrowSAM, and integrate the Apache Arrow framework into genome pre-processing applications including BWA-MEM, sorting and Picard as use cases. Our implementation comprises three components: first, we integrated Apache Arrow into BWA-MEM to write output SAM data in ArrowSAM; secondly, we sorted all the ArrowSAM data by coordinate in parallel through pandas dataframes; finally, Apache Arrow was integrated into the HTSJDK library (used in Picard for disk I/O handling), where all ArrowSAM data is processed in parallel for duplicate removal. This implementation gives promising performance improvements for genome data pre-processing in terms of both speedup and system resource utilization. Due to the columnar data format, better cache locality is exploited in both applications, and shared-memory objects enable parallel processing.
SAMRIDDHI : A Journal of Physical Sciences, Engineering and Technology
Apache Spark has recently become the most popular big data analytics framework, and default configurations are provided by Spark. HDFS stands for Hadoop Distributed File System: large files are physically stored on multiple nodes in a distributed fashion. The block size determines how large files are distributed, while the replication factor determines how reliable the files are. If there is just one copy of each block of a given file and a node fails, the data in that file becomes unreadable. The block size and replication factor are configurable per file. This paper describes the results and analysis of an experimental study to determine the efficiency of tuning Apache Spark's settings to minimize application execution time compared to the standard values. Based on a vast number of studies, we employed a trial-and-error strategy to fine-tune these values. We chose two workloads to test the Apache framework for comparative analysis: Wordcount ...
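The interplay of block size and replication factor described above is simple arithmetic, and a small helper makes it concrete. The 128 MB block size and replication factor 3 used below are common HDFS defaults, but the specific numbers are illustrative, not from this paper's experiments:

```python
import math

def hdfs_footprint(file_bytes, block_bytes, replication):
    """How many blocks a file splits into, how many block replicas exist
    cluster-wide, and how much raw storage replication consumes."""
    blocks = math.ceil(file_bytes / block_bytes)
    return blocks, blocks * replication, file_bytes * replication

GB = 1024 ** 3
MB = 1024 ** 2
# A 1 GB file with a 128 MB block size and replication factor 3:
blocks, replicas, raw = hdfs_footprint(1 * GB, 128 * MB, 3)
```

This makes the tradeoff visible: a higher replication factor buys fault tolerance linearly in raw storage, while a smaller block size increases parallelism at the cost of more block metadata.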
Journal of Technological Advancements
As social networking services and e-commerce grow rapidly, the number of online users is also growing dynamically, facilitating the contribution of huge content to the digital world. In such a dynamic environment, meeting the demand for computing is very challenging, especially with the existing computing model. Although Spark was recently introduced to alleviate these problems with the concept of in-memory computing for big data analytics, and exposes many configuration parameters for improving its performance, it still has performance bottlenecks that require investigating improvement mechanisms, here focusing on combinations of the scheduler and shuffle manager with data serialization and intermediate data caching options. A standalone cluster computing model was selected as the experimental methodology, with the submit command line used for data submission. Three Spark applications, WordCount, TeraSort and PageRank, were selected and developed for the experiment. As a result, 2.45% and 8.01%...
Sheer increase in the volume of data over the last decade has triggered research into cluster computing frameworks that enable web enterprises to extract big insights from big data. While Apache Spark is gaining popularity for exhibiting superior scale-out performance on commodity machines, the impact of data volume on the performance of Spark-based data analytics in a scale-up configuration is not well understood. We present a deep-dive analysis of Spark-based applications on a large scale-up server machine. Our analysis reveals that Spark-based data analytics are DRAM bound and do not benefit from using more than 12 cores for an executor. As input data size grows, application performance degrades significantly due to a substantial increase in wait time during I/O operations and garbage collection, despite a 10% better instruction retirement rate (due to lower L1 cache misses and higher core utilization). We match memory behavior with the garbage collector to improve application performance by 1.6x to 3x.
International Journal of Data Science and Analytics, 2016
Apache Spark has emerged as the de facto framework for big data analytics with its advanced in-memory programming model and upper-level libraries for scalable machine learning, graph analysis, streaming and structured data processing. It is a general-purpose cluster computing framework with language-integrated APIs in Scala, Java, Python and R. As a rapidly evolving open source project, with an increasing number of contributors from both academia and industry, it is difficult for researchers to comprehend the full body of development and research behind Apache Spark, especially those who are beginners in this area. In this paper, we present a technical review on big data analytics using Apache Spark. This review focuses on the key components, abstractions and features of Apache Spark. More specifically, it shows what Apache Spark has for designing and implementing big data algorithms and pipelines for machine learning, graph analysis and stream processing. In addition, we highlight some research and development directions on Apache Spark for big data analytics.
Proceedings of the 2019 8th International Conference on Networks, Communication and Computing, 2019
Apache Spark is a high-speed "in-memory computing" framework that runs on the JVM. As data volumes increase, performance optimization requires management of the JVM heap space. Managing the JVM heap space in turn means managing garbage-collector pause times, which affect application performance. Different parameters can be passed to Spark to control JVM heap space and GC time overhead and thereby increase application performance. Passing an appropriate heap size with an appropriate type of GC as a parameter is a performance optimization known as Spark garbage collection tuning. To reduce GC overhead, an experiment was done by adjusting certain parameters for the loading, dataframe creation and data retrieval processes. The result shows a 3.23% improvement in latency and a 1.62% improvement in throughput compared to the default parameter configuration in the garbage collection tuning approach.
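GC tuning of this kind is typically done by passing JVM options through Spark's executor configuration. As a hedged sketch, the snippet below assembles a `spark-submit` invocation: the flag names (`-XX:+UseG1GC`, `spark.executor.extraJavaOptions`, `spark.memory.fraction`) are standard JVM/Spark options, but the chosen sizes are placeholders, not the paper's measured configuration:

```python
# Sketch of a GC-tuned spark-submit invocation (illustrative values).
gc_opts = "-XX:+UseG1GC -XX:MaxGCPauseMillis=200"
cmd = [
    "spark-submit",
    "--conf", "spark.executor.memory=6g",          # executor heap size
    "--conf", "spark.memory.fraction=0.6",         # execution/storage share
    "--conf", f"spark.executor.extraJavaOptions={gc_opts}",
    "app.py",
]
command_line = " ".join(cmd)
```

Choosing G1 with a pause-time target is one common way to trade a little throughput for shorter, more predictable pauses; the right heap size and GC depend on the workload, which is exactly what the paper's experiment varies.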
Cybernetics and Information Technologies, 2020
The optimization of large-scale data sets depends on the technologies and methods used. The MapReduce model, implemented on Apache Hadoop or Spark, allows splitting large data sets into a set of blocks distributed on several machines. Data compression reduces data size and the transfer time between disk and memory, but requires additional processing. Therefore, finding an optimal tradeoff is a challenge, as a high compression factor may underload Input/Output but overload the processor. The paper presents a system enabling the selection of compression tools and tuning of the compression factor to reach the best performance in Apache Hadoop and Spark infrastructures, based on simulation analyses.
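The CPU-versus-I/O tradeoff described here can be captured with a deliberately crude cost model: total time is CPU compression time plus transfer time of the compressed data. The codec names, ratios and throughputs below are invented for illustration, not measurements from the paper's simulator:

```python
def best_compression(size_mb, codecs, transfer_mb_s):
    """Pick the codec minimizing total time = CPU compression time
    + transfer time of the compressed data (a crude two-term model)."""
    def total_time(name):
        ratio, cpu_mb_s = codecs[name]
        return size_mb / cpu_mb_s + (size_mb * ratio) / transfer_mb_s
    return min(codecs, key=total_time)

# Hypothetical codec profiles: (compressed fraction, compression MB/s).
codecs = {
    "none":   (1.00, float("inf")),  # no CPU cost, full-size transfer
    "fast":   (0.50, 400.0),         # light compression, cheap CPU
    "strong": (0.25, 40.0),          # high ratio, heavy CPU
}
```

With a fast link, light compression (or none) wins; as the link slows down, the heavier codec's smaller transfer eventually pays for its CPU cost — which is the tradeoff the proposed system tunes automatically.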
2017 IEEE International Conference on Big Data (Big Data), 2017
2018
Apache Spark employs lazy evaluation [11, 6]; that is, in Spark, a dataset is represented as a Resilient Distributed Dataset (RDD), and a single-threaded application (driver) program simply describes transformations (RDD to RDD), referred to as lineage [7, 12], without performing distributed computation until output is requested. The lineage traces computation and dependency back to external (and assumed durable) data sources, allowing Spark to opportunistically cache intermediate RDDs, because it can recompute everything from the external data sources. To initiate computation on worker machines, the driver process constructs a directed acyclic graph (DAG) representing computation and dependency according to the requested RDD's lineage. Then the driver broadcasts this DAG to all involved workers, requesting that they execute their portion of the result RDD. When a requested RDD has a long lineage, as one would expect from iterative convergent or streaming applications [9, 15], constructing and ...
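The lazy-evaluation model described above can be reduced to a toy single-machine sketch: transformations only extend a recorded lineage, and nothing executes until an action walks that lineage back to the (assumed durable) source. This is an illustration of the concept, not Spark's API or implementation:

```python
class LazyRDD:
    """Toy model of lazy evaluation: transformations only extend the
    lineage; nothing runs until collect() replays it from the source."""
    def __init__(self, source, lineage=()):
        self.source = source     # assumed durable external data
        self.lineage = lineage   # chain of recorded transformations

    def map(self, fn):
        return LazyRDD(self.source, self.lineage + (("map", fn),))

    def filter(self, pred):
        return LazyRDD(self.source, self.lineage + (("filter", pred),))

    def collect(self):
        data = list(self.source)  # recompute from the external source
        for op, fn in self.lineage:
            data = list(map(fn, data)) if op == "map" else list(filter(fn, data))
        return data

rdd = LazyRDD(range(6)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
# No computation has happened yet; collect() triggers the whole lineage.
result = rdd.collect()
```

The sketch also shows why caching matters: every `collect()` replays the full lineage from the source, so a long lineage makes recomputation (and, in Spark, DAG construction) increasingly expensive.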
International Journal of Computer Techniques, 2022
Recently, due to the advent of social networks, bio-computing, and the Internet of Things, more data is being generated than in the previous IT environment, and as a result, research on efficient large-capacity data processing techniques is being conducted. MapReduce is an effective programming model for data-intensive computational applications; a typical MapReduce implementation is Hadoop, which is developed and supported by the Apache Software Foundation. This paper proposes a data prefetching technique and a streaming technique to improve the performance of Hadoop MapReduce. One of the performance issues of Hadoop MapReduce is job delay due to input data transmission during the MapReduce process. In order to minimize this data transfer time, a prefetching thread in charge of data transfer was created separately, unlike in existing MapReduce. As a result, data transmission became possible even during the MapReduce operation, reducing the overall data processing time. Even with this prefetching technique, the job still waits for the first data transmission due to the characteristics of Hadoop MapReduce, so the streaming technique was used to further cut the waiting time due to data transmission. Mathematical modeling was performed to measure the performance of the proposed method, and the performance measurements confirmed that MapReduce with the streaming method additionally applied improved over both existing Hadoop MapReduce and MapReduce with only the prefetching method applied.
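The benefit of overlapping transfer with computation can be captured with the usual back-of-the-envelope pipelining model — an idealization (uniform block times, no contention), not the paper's actual mathematical model:

```python
def sequential_time(n, t_transfer, t_compute):
    """No prefetching: each of n blocks is fully transferred,
    then fully computed, one after another."""
    return n * (t_transfer + t_compute)

def prefetch_time(n, t_transfer, t_compute):
    """Transfer of block i+1 overlaps computation on block i; only the
    first transfer is exposed, then the slower stage sets the pace."""
    return t_transfer + n * max(t_transfer, t_compute)

# Illustrative numbers: 10 blocks, 2 s transfer, 3 s compute per block.
seq = sequential_time(10, 2, 3)   # 50 s
pre = prefetch_time(10, 2, 3)     # 32 s
```

The residual `t_transfer` term for the first block is exactly the startup wait that the paper's streaming technique then attacks on top of prefetching.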
International Journal of Computer Science and Information Security (IJCSIS), Vol. 22, No. 6, November-December, 2024
Apache Spark is a widely used framework for distributed data processing, offering scalability and efficiency for handling large datasets. However, its default file-splitting and data allocation mechanisms often result in significant inter-executor data movement, causing increased shuffle costs, job latency, and resource contention. This paper proposes a novel content-aware strategy integrating Apache Iceberg technology to optimise file-splitting and dynamic data-to-executor mapping. The proposed approach minimises inter-node data transfers by preserving data locality and embedding intelligence into data placement during the file-reading phase. Experimental evaluation using datasets of varying sizes (500 GB, 1 TB, and 10 TB) demonstrates significant improvements over state-of-the-art baseline methods. The proposed method reduces job execution time by up to 31%, shuffle costs by up to 35%, and data transfer volumes by up to 44%. Additionally, it achieves better resource utilisation and cost efficiency, highlighting its scalability and economic benefits for cloud-based environments. Keywords: apache spark; optimization; data allocation; data locality; distributed data processing; cloud
Regular, 2020
Big data applications play an important role in real-time data processing. Apache Spark is a data processing framework with an in-memory data engine that quickly processes large data sets. It can also distribute data processing tasks across multiple computers, either on its own or in tandem with other distributed computing tools. Spark's in-memory processing cannot share data between applications, and RAM will be insufficient for storing petabytes of data. Alluxio is a virtual distributed storage system that leverages memory for data storage and provides faster access to data in different storage systems. Alluxio helps to speed up data-intensive Spark applications across various storage systems. In this work, the performance of applications on Spark, as well as Spark running over Alluxio, has been studied with respect to several storage formats such as Parquet, ORC, CSV, and JSON, and four types of queries from the Star Schema Benchmark (SSB). A benchmark is evolved to s...
While cluster computing frameworks are continuously evolving to provide real-time data analysis capabilities, Apache Spark has managed to stay at the forefront of big data analytics by being a unified framework for both batch and stream data processing. However, recent studies on micro-architectural characterization of in-memory data analytics are limited to batch processing workloads only. We compare the micro-architectural performance of batch processing and stream processing workloads in Apache Spark using hardware performance counters on a dual-socket server. In our evaluation experiments, we found that batch processing and stream processing have the same micro-architectural behavior in Spark if the difference between the two implementations is only micro-batching. If the input data rates are small, stream processing workloads are front-end bound. However, the front-end-bound stalls are reduced at larger input data rates and instruction retirement improves. Moreover, Spark workloads using DataFrames have improved instruction retirement over workloads using RDDs.