Performance Improvement Approaches for Apache Spark

2015

Abstract

Apache Spark, a new big data processing framework, caches data in memory and then processes it. Spark creates Resilient Distributed Datasets (RDDs) from the data, and these RDDs are cached in memory. Although Spark is popular for its performance in iterative applications, that performance can still be limited by several factors. One such factor is disk access time. This paper presents approaches that aim to improve Spark's performance by reducing its disk access time.

Keywords: Apache Spark, Caching, Prefetching.
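
As an illustration only, the general idea behind prefetching (overlapping slow reads with processing so that disk access time is partly hidden) can be sketched in plain Python. This is not Spark code and not the approach taken in the paper: `read_block`, `prefetching_reader`, `total`, and the `depth` parameter are hypothetical names introduced for this sketch, and the disk read is simulated with a short sleep.

```python
import queue
import threading
import time


def read_block(block_id, delay=0.01):
    """Hypothetical stand-in for a slow disk read: sleeps, then returns fake data."""
    time.sleep(delay)
    return [block_id * 10 + i for i in range(3)]


def prefetching_reader(block_ids, depth=2):
    """Yield blocks in order while a background thread reads ahead.

    At most `depth` blocks wait in the buffer, so memory use stays bounded.
    """
    buf = queue.Queue(maxsize=depth)
    sentinel = object()  # marks the end of the stream

    def producer():
        for block_id in block_ids:
            buf.put(read_block(block_id))  # blocks when the buffer is full
        buf.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()

    while True:
        item = buf.get()
        if item is sentinel:
            return
        yield item


def total(block_ids):
    """Consume the blocks; the next read proceeds while this sum is computed."""
    return sum(sum(block) for block in prefetching_reader(block_ids))


if __name__ == "__main__":
    print(total([0, 1, 2]))
```

The bounded queue is the main design choice here: it lets the reader run ahead of the consumer without allowing unread blocks to accumulate in memory without limit.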