2015
Apache Spark, a recent big data processing framework, caches data in memory and then processes it. Spark creates Resilient Distributed Datasets (RDDs) from data and caches them in memory. Although Spark is popular for its performance in iterative applications, its performance can be limited by several factors, one of which is disk access time. This paper incorporates approaches for improving Spark's performance by reducing its disk access time. Keywords: Apache Spark, Caching, Prefetching.
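The core idea behind hiding disk access time — fetch the next blocks in the background while the current one is being processed — can be illustrated with a toy prefetcher. This is a pure-Python sketch, not the paper's implementation; the block payloads and the `read_block` callback are invented for illustration:

```python
import threading
import queue

def prefetching_reader(blocks, read_block, depth=2):
    """Simulate a prefetching reader: a background thread fetches the
    next blocks from "disk" while the caller processes the current one."""
    buf = queue.Queue(maxsize=depth)  # bounded look-ahead window

    def producer():
        for b in blocks:
            buf.put(read_block(b))    # fetch ahead of the consumer
        buf.put(None)                 # sentinel: no more blocks

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buf.get()
        if item is None:
            break
        yield item

# Pretend reading a block from disk returns its payload.
disk = {0: b"spark", 1: b"caches", 2: b"rdds"}
payloads = list(prefetching_reader(sorted(disk), disk.__getitem__))
```

The bounded queue keeps the prefetcher only `depth` blocks ahead, so memory stays capped while I/O overlaps computation.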
2020
In the era of Big Data, processing large amounts of data through data-intensive applications presents a challenge. Apache Spark, an in-memory distributed computing system, is often used to speed up big data applications. It caches intermediate data in memory, so there is no need to repeat computation or reload data from disk when that data is reused later. This mechanism of caching data in memory makes Apache Spark much faster than other systems. When the memory used for caching is full, the cache replacement policy used by Apache Spark is Least Recently Used (LRU); however, the LRU algorithm performs poorly on some workloads. This review gives an insight into the different replacement algorithms proposed to address LRU's problems, categorizes their selection factors, and compares the algorithms in terms of selection factors, performance, and the benchmarks used in the research.
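The LRU policy this review critiques can be sketched in a few lines — an illustration of the eviction rule itself, not Spark's actual BlockManager code; the cached keys and values are invented:

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache: evict the least recently used entry when full."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)         # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used

cache = LRUCache(2)
cache.put("rdd1", "partition data 1")
cache.put("rdd2", "partition data 2")
cache.get("rdd1")                      # rdd1 becomes most recently used
cache.put("rdd3", "partition data 3")  # evicts rdd2, not rdd1
```

The weakness the review surveys is visible here: recency alone decides eviction, so a block that is cheap to recompute but recently touched survives while an expensive-to-recompute block may be dropped.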
Journal of Big Data
Many applications generate and handle very large volumes of data: social networking, cloud applications, public web sites, search engines, scientific simulations, data warehouses and so on. These large volumes of data come from a variety of sources and comprise both unstructured and structured data. In order to efficiently transform this massive data of various types into valuable information and meaningful knowledge, we need large-scale cluster infrastructures. In this context, one challenging problem is to realize effective resource management of these large-scale cluster infrastructures in order to run distributed data analytics.
2020 IEEE International Conference on Big Data (Big Data)
Apache Spark is a distributed computing framework used for big data processing. A common pattern in many Spark applications is to iteratively evolve a dataset until reaching some user-specified convergence condition. Unfortunately, some aspects of Spark's execution model make it difficult for developers who are not familiar with the implementation-level details of Spark to write efficient iterative programs. Since results are constructed iteratively and results from previous iterations may be used multiple times, effective use of caching is necessary to avoid recomputing intermediate results. Currently, developers of Spark applications must manually indicate which intermediate results should be cached. We present a method for using metadata already captured by Spark to automate caching decisions for many Spark programs. We show how this allows Spark applications to benefit from caching without the need for manual caching annotations.
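The core heuristic — cache any intermediate result that is consumed by more than one downstream operation — can be sketched as a reference count over the lineage graph. This is a simplification of the paper's metadata-driven method; the graph shape and node names are invented:

```python
from collections import Counter

def auto_cache_candidates(dag):
    """Given a DAG as {node: [parent, ...]}, mark for caching every
    intermediate node consumed by more than one downstream operation."""
    uses = Counter(p for parents in dag.values() for p in parents)
    return {node for node, n in uses.items() if n > 1}

# Hypothetical lineage: 'base' feeds both branches of an iterative job,
# so recomputing it twice is avoided by caching it once.
dag = {
    "base":   [],
    "mapped": ["base"],
    "stats":  ["base"],
    "result": ["mapped", "stats"],
}
candidates = auto_cache_candidates(dag)
```

In a real system the decision would also weigh recomputation cost and partition size, which this sketch deliberately omits.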
While cluster computing frameworks are continuously evolving to provide real-time data analysis capabilities, Apache Spark has managed to stay at the forefront of big data analytics by being a unified framework for both batch and stream data processing. However, recent studies on micro-architectural characterization of in-memory data analytics are limited to batch processing workloads only. We compare the micro-architectural performance of batch processing and stream processing workloads in Apache Spark using hardware performance counters on a dual-socket server. In our evaluation experiments, we found that while batch processing workloads are bound by the latency of frequent data accesses to DRAM, stream processing workloads are curbed by L1 instruction cache misses. For data accesses we found that simultaneous multi-threading is effective in hiding the data latencies. We also observed that (i) data locality on NUMA nodes can improve performance by up to 12%, (ii) disabling next-line L1-D prefetchers can reduce execution time by up to 15%, and (iii) multiple small executors can provide up to 36% speedup over a single large executor.
IRJET, 2020
In the recent past, Apache Spark has become the most popular Big Data analytics framework, having overtaken MapReduce-based Apache Hive thanks to the edge offered by Spark's in-memory computation. The key obstacles for Big Data analytics are operating on tremendous volumes of data, managing wide variations in data, and high-speed data processing. Spark provides default configurations which were evaluated on low-capability hardware and are not optimal for specific types of data and the computations performed on them. Fine-grained control through the various performance parameters is essential to leverage Spark's maximum capabilities. Tuning parameters to the available hardware resources takes the highest precedence in achieving optimal performance for Spark applications. However, an in-depth understanding of the impact of these performance parameters is lacking. This paper discusses in detail the various parameters, such as the number of executors, memory persistence levels, caching, broadcasting, serialization, compression, repartitioning, and network parameters, that can be tuned to enhance the efficiency of Spark applications tailored to the data being handled and the execution environment.
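The kinds of knobs this paper surveys are usually passed as `--conf` key-value pairs to `spark-submit`. As a hedged sketch: the configuration keys below are standard Spark property names, but the values are placeholders to be fitted to the actual cluster and workload, not recommendations from the paper:

```python
# Hypothetical tuning profile: real Spark configuration keys,
# illustrative values.
tuning = {
    "spark.executor.instances":     "8",
    "spark.executor.memory":        "8g",
    "spark.executor.cores":         "4",
    "spark.serializer":             "org.apache.spark.serializer.KryoSerializer",
    "spark.rdd.compress":           "true",
    "spark.sql.shuffle.partitions": "200",
}

def to_submit_args(conf):
    """Render a configuration dict as spark-submit --conf arguments."""
    return [arg for k in sorted(conf) for arg in ("--conf", f"{k}={conf[k]}")]

args = to_submit_args(tuning)
```

The same pairs could equally live in `spark-defaults.conf`; the point is that each knob the paper discusses maps to one such property.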
Hadoop is a very popular general-purpose framework for many different classes of data-intensive applications. However, it is not well suited to iterative operations because of the cost paid to reload data from disk at each iteration. As an emerging framework designed with a global cache mechanism, Spark can achieve better response times, since in-memory access across the distributed machines of a cluster persists throughout the entire iterative process. Although Spark's time performance has been evaluated against Hadoop [1], memory consumption, another system performance criterion, has not been deeply analyzed in the literature. In this work, we conducted extensive experiments on iterative operations to compare Hadoop and Spark in both time and memory cost. We found that although Spark is in general faster than Hadoop for iterative operations, it pays for this with higher memory consumption. Also, its speed advantage weakens when memory is insufficient to store newly created intermediate results.
Big Data is the name now used ubiquitously for the distributed paradigm on the web. As the name points out, it covers collections of very large data sets, in petabytes, exabytes and beyond, as well as the systems and algorithms used to analyze this enormous data. Hadoop, as a big data processing technology, has proven to be the go-to solution for processing enormous data sets. MapReduce is a conspicuous solution for computations that require a single pass to complete, but it is not efficient for use cases that need multi-pass computations and algorithms. The job output data between every stage has to be stored in the file system before the next stage can begin; consequently, this method is slow due to replication and disk input/output operations. Additionally, the Hadoop ecosystem does not have every component needed to complete a big data use case. If we want to run an iterative job, we have to stitch together a sequence of MapReduce jobs and execute them in order; each of these jobs has high latency, and each depends upon the completion of the previous stage. Apache Spark is one of the most widely used open source processing engines for big data, with rich language-integrated APIs and an extensive range of libraries. It is a general framework for distributed computing that offers high performance for both batch and interactive processing. In this paper, we aim to give a close-up view of Apache Spark and its features, and of working with Spark using Hadoop. In a nutshell, we discuss Resilient Distributed Datasets (RDDs), RDD operations, features, and limitations. Spark can be used alongside MapReduce in the same Hadoop cluster or on its own as a processing framework. The paper closes with a comparative analysis between Spark and Hadoop MapReduce.
Studies in Big Data, 2018
The rapidly growing human genomics data driven by advances in sequencing technologies demands fast and cost-effective processing. However, processing this data brings some challenges, particularly in selecting appropriate algorithms and computing platforms. Computing systems need data close to the processor for fast processing. Previously, due to the cost, volatility and other physical constraints of DRAM, it was not feasible to place large working data sets in memory. However, new emerging storage-class memories allow storing and processing big data closer to the processor. In this work, we show how a commonly used genomics data format, Sequence Alignment/Map (SAM), can be represented in the Apache Arrow in-memory data format to benefit from in-memory processing and to ensure better scalability through the shared-memory Plasma Object Store, avoiding huge (de)serialization overheads in cross-language interoperability. To demonstrate the benefits of such a system, we present an in-memory SAM representation, which we call ArrowSAM, and integrate the Apache Arrow framework into genome pre-processing applications including BWA-MEM, sorting and Picard as use cases. Our implementation comprises three components: first, we integrated Apache Arrow into BWA-MEM to write output SAM data in ArrowSAM; secondly, we sorted all the ArrowSAM data by coordinate in parallel through pandas dataframes; finally, Apache Arrow was integrated into the HTSJDK library (used in Picard for disk I/O handling), where all ArrowSAM data is processed in parallel for duplicate removal. This implementation gives promising performance improvements for genome data pre-processing in terms of both speedup and system resource utilization. Due to the columnar data format, better cache locality is exploited in both applications, and shared-memory objects enable parallel processing.
SAMRIDDHI : A Journal of Physical Sciences, Engineering and Technology
Apache Spark has recently become the most popular big data analytics framework, and default configurations are provided by Spark. HDFS stands for Hadoop Distributed File System: large files are physically stored on multiple nodes in a distributed fashion. The block size determines how large files are distributed, while the replication factor determines how reliable the files are. If there is just one copy of each block of a given file and a node fails, the data in that file becomes unreadable. The block size and replication factor are configurable per file. This paper describes the results and analysis of an experimental study to determine the efficiency of tuning Apache Spark's settings to minimize application execution time compared to the standard values. Based on a vast number of studies, we employed a trial-and-error strategy to fine-tune these values. We chose two workloads to test the Apache framework for comparative analysis: Wordcount ...
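The interplay of block size and replication factor described above is simple arithmetic, and a small helper makes it concrete. The 128 MB block size and replication factor 3 used below are common HDFS defaults, but the specific numbers are illustrative, not from this paper's experiments:

```python
import math

def hdfs_footprint(file_bytes, block_bytes, replication):
    """How many blocks a file splits into, how many block replicas exist
    cluster-wide, and how much raw storage replication consumes."""
    blocks = math.ceil(file_bytes / block_bytes)
    return blocks, blocks * replication, file_bytes * replication

GB = 1024 ** 3
MB = 1024 ** 2
# A 1 GB file with a 128 MB block size and replication factor 3:
blocks, replicas, raw = hdfs_footprint(1 * GB, 128 * MB, 3)
```

This makes the tradeoff visible: a higher replication factor buys fault tolerance linearly in raw storage, while a smaller block size increases parallelism at the cost of more block metadata.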
Journal of Technological Advancements
As social networking services and e-commerce grow rapidly, the number of online users is also growing dynamically, facilitating the contribution of huge content to the digital world. In such a dynamic environment, meeting the demand for computing is very challenging, especially with the existing computing model. Although Spark was recently introduced to alleviate these problems with the concept of in-memory computing for big data analytics, and exposes many configuration parameters for improving its performance, it still has performance bottlenecks that require investigating improvement mechanisms, here focusing on combinations of the scheduler and shuffle manager with data serialization and intermediate data caching options. A standalone cluster computing model was selected as the experimental methodology, with the submit command line used for data submission. Three Spark applications, WordCount, TeraSort and PageRank, were selected and developed for the experiment. As a result, 2.45% and 8.01%...
Sheer increase in the volume of data over the last decade has triggered research into cluster computing frameworks that enable web enterprises to extract big insights from big data. While Apache Spark is gaining popularity for exhibiting superior scale-out performance on commodity machines, the impact of data volume on the performance of Spark-based data analytics in a scale-up configuration is not well understood. We present a deep-dive analysis of Spark-based applications on a large scale-up server machine. Our analysis reveals that Spark-based data analytics are DRAM bound and do not benefit from using more than 12 cores for an executor. As input data size grows, application performance degrades significantly due to a substantial increase in wait time during I/O operations and garbage collection, despite a 10% better instruction retirement rate (due to lower L1 cache misses and higher core utilization). We match memory behavior with the garbage collector to improve application performance by 1.6x to 3x.
International Journal of Data Science and Analytics, 2016
Apache Spark has emerged as the de facto framework for big data analytics with its advanced in-memory programming model and upper-level libraries for scalable machine learning, graph analysis, streaming and structured data processing. It is a general-purpose cluster computing framework with language-integrated APIs in Scala, Java, Python and R. As a rapidly evolving open source project, with an increasing number of contributors from both academia and industry, it is difficult for researchers to comprehend the full body of development and research behind Apache Spark, especially those who are beginners in this area. In this paper, we present a technical review on big data analytics using Apache Spark. This review focuses on the key components, abstractions and features of Apache Spark. More specifically, it shows what Apache Spark has for designing and implementing big data algorithms and pipelines for machine learning, graph analysis and stream processing. In addition, we highlight some research and development directions on Apache Spark for big data analytics.
Proceedings of the 2019 8th International Conference on Networks, Communication and Computing, 2019
Apache Spark is a high-speed "in-memory computing" framework that runs on the JVM. As data volumes increase, performance optimization requires management of the JVM heap space. Managing the JVM heap space in turn means managing garbage-collector pause times, which affect application performance. Different parameters can be passed to Spark to control JVM heap space and GC time overhead and thereby increase application performance. Passing an appropriate heap size with an appropriate type of GC as a parameter is a performance optimization known as Spark garbage collection tuning. To reduce GC overhead, an experiment was done by adjusting certain parameters for the loading, dataframe creation and data retrieval processes. The result shows a 3.23% improvement in latency and a 1.62% improvement in throughput compared to the default parameter configuration in the garbage collection tuning approach.
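GC tuning of this kind is typically done by passing JVM options through Spark's executor configuration. As a hedged sketch, the snippet below assembles a `spark-submit` invocation: the flag names (`-XX:+UseG1GC`, `spark.executor.extraJavaOptions`, `spark.memory.fraction`) are standard JVM/Spark options, but the chosen sizes are placeholders, not the paper's measured configuration:

```python
# Sketch of a GC-tuned spark-submit invocation (illustrative values).
gc_opts = "-XX:+UseG1GC -XX:MaxGCPauseMillis=200"
cmd = [
    "spark-submit",
    "--conf", "spark.executor.memory=6g",          # executor heap size
    "--conf", "spark.memory.fraction=0.6",         # execution/storage share
    "--conf", f"spark.executor.extraJavaOptions={gc_opts}",
    "app.py",
]
command_line = " ".join(cmd)
```

Choosing G1 with a pause-time target is one common way to trade a little throughput for shorter, more predictable pauses; the right heap size and GC depend on the workload, which is exactly what the paper's experiment varies.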
Cybernetics and Information Technologies, 2020
The optimization of large-scale data sets depends on the technologies and methods used. The MapReduce model, implemented on Apache Hadoop or Spark, allows splitting large data sets into a set of blocks distributed on several machines. Data compression reduces data size and the transfer time between disk and memory, but requires additional processing. Therefore, finding an optimal tradeoff is a challenge, as a high compression factor may underload Input/Output but overload the processor. The paper presents a system enabling the selection of compression tools and tuning of the compression factor to reach the best performance in Apache Hadoop and Spark infrastructures, based on simulation analyses.
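The CPU-versus-I/O tradeoff described here can be captured with a deliberately crude cost model: total time is CPU compression time plus transfer time of the compressed data. The codec names, ratios and throughputs below are invented for illustration, not measurements from the paper's simulator:

```python
def best_compression(size_mb, codecs, transfer_mb_s):
    """Pick the codec minimizing total time = CPU compression time
    + transfer time of the compressed data (a crude two-term model)."""
    def total_time(name):
        ratio, cpu_mb_s = codecs[name]
        return size_mb / cpu_mb_s + (size_mb * ratio) / transfer_mb_s
    return min(codecs, key=total_time)

# Hypothetical codec profiles: (compressed fraction, compression MB/s).
codecs = {
    "none":   (1.00, float("inf")),  # no CPU cost, full-size transfer
    "fast":   (0.50, 400.0),         # light compression, cheap CPU
    "strong": (0.25, 40.0),          # high ratio, heavy CPU
}
```

With a fast link, light compression (or none) wins; as the link slows down, the heavier codec's smaller transfer eventually pays for its CPU cost — which is the tradeoff the proposed system tunes automatically.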
2017 IEEE International Conference on Big Data (Big Data), 2017
2018
Apache Spark employs lazy evaluation [11, 6]; that is, in Spark, a dataset is represented as a Resilient Distributed Dataset (RDD), and a single-threaded application (driver) program simply describes transformations (RDD to RDD), referred to as lineage [7, 12], without performing distributed computation until output is requested. The lineage traces computation and dependency back to external (and assumed durable) data sources, allowing Spark to opportunistically cache intermediate RDDs, because it can recompute everything from the external data sources. To initiate computation on worker machines, the driver process constructs a directed acyclic graph (DAG) representing computation and dependency according to the requested RDD's lineage. Then the driver broadcasts this DAG to all involved workers, requesting that they execute their portion of the result RDD. When a requested RDD has a long lineage, as one would expect from iterative convergent or streaming applications [9, 15], constructing and ...
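The lazy-evaluation model described above can be reduced to a toy single-machine sketch: transformations only extend a recorded lineage, and nothing executes until an action walks that lineage back to the (assumed durable) source. This is an illustration of the concept, not Spark's API or implementation:

```python
class LazyRDD:
    """Toy model of lazy evaluation: transformations only extend the
    lineage; nothing runs until collect() replays it from the source."""
    def __init__(self, source, lineage=()):
        self.source = source     # assumed durable external data
        self.lineage = lineage   # chain of recorded transformations

    def map(self, fn):
        return LazyRDD(self.source, self.lineage + (("map", fn),))

    def filter(self, pred):
        return LazyRDD(self.source, self.lineage + (("filter", pred),))

    def collect(self):
        data = list(self.source)  # recompute from the external source
        for op, fn in self.lineage:
            data = list(map(fn, data)) if op == "map" else list(filter(fn, data))
        return data

rdd = LazyRDD(range(6)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
# No computation has happened yet; collect() triggers the whole lineage.
result = rdd.collect()
```

The sketch also shows why caching matters: every `collect()` replays the full lineage from the source, so a long lineage makes recomputation (and, in Spark, DAG construction) increasingly expensive.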
International Journal of Computer Techniques, 2022
Recently, due to the advent of social networks, bio-computing, and the Internet of Things, more data is being generated than in the previous IT environment, and as a result, research on efficient large-capacity data processing techniques is being conducted. MapReduce is an effective programming model for data-intensive computational applications; a typical MapReduce implementation is Hadoop, which is developed and supported by the Apache Software Foundation. This paper proposes a data prefetching technique and a streaming technique to improve the performance of Hadoop MapReduce. One of the performance issues of Hadoop MapReduce is job delay due to input data transmission during the MapReduce process. In order to minimize this data transfer time, a prefetching thread in charge of data transfer was created separately, unlike in existing MapReduce. As a result, data transmission became possible even during the MapReduce operation, reducing the overall data processing time. Even with this prefetching technique, the job still waits for the first data transmission due to the characteristics of Hadoop MapReduce, so the streaming technique was used to further cut the waiting time due to data transmission. Mathematical modeling was performed to measure the performance of the proposed method, and the performance measurements confirmed that MapReduce with the streaming method additionally applied improved over both existing Hadoop MapReduce and MapReduce with only the prefetching method applied.
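The benefit of overlapping transfer with computation can be captured with the usual back-of-the-envelope pipelining model — an idealization (uniform block times, no contention), not the paper's actual mathematical model:

```python
def sequential_time(n, t_transfer, t_compute):
    """No prefetching: each of n blocks is fully transferred,
    then fully computed, one after another."""
    return n * (t_transfer + t_compute)

def prefetch_time(n, t_transfer, t_compute):
    """Transfer of block i+1 overlaps computation on block i; only the
    first transfer is exposed, then the slower stage sets the pace."""
    return t_transfer + n * max(t_transfer, t_compute)

# Illustrative numbers: 10 blocks, 2 s transfer, 3 s compute per block.
seq = sequential_time(10, 2, 3)   # 50 s
pre = prefetch_time(10, 2, 3)     # 32 s
```

The residual `t_transfer` term for the first block is exactly the startup wait that the paper's streaming technique then attacks on top of prefetching.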
International Journal of Computer Science and Information Security (IJCSIS), Vol. 22, No. 6, November-December, 2024
Apache Spark is a widely used framework for distributed data processing, offering scalability and efficiency for handling large datasets. However, its default file-splitting and data allocation mechanisms often result in significant inter-executor data movement, causing increased shuffle costs, job latency, and resource contention. This paper proposes a novel content-aware strategy integrating Apache Iceberg technology to optimise file-splitting and dynamic data-to-executor mapping. The proposed approach minimises inter-node data transfers by preserving data locality and embedding intelligence into data placement during the file-reading phase. Experimental evaluation using datasets of varying sizes (500 GB, 1 TB, and 10 TB) demonstrates significant improvements over state-of-the-art baseline methods. The proposed method reduces job execution time by up to 31%, shuffle costs by up to 35%, and data transfer volumes by up to 44%. Additionally, it achieves better resource utilisation and cost efficiency, highlighting its scalability and economic benefits for cloud-based environments. Keywords: apache spark; optimization; data allocation; data locality; distributed data processing; cloud
Regular, 2020
Big data applications play an important role in real-time data processing. Apache Spark is a data processing framework with an in-memory data engine that quickly processes large data sets. It can also distribute data processing tasks across multiple computers, either on its own or in tandem with other distributed computing tools. Spark's in-memory processing cannot share data between applications, and RAM will be insufficient for storing petabytes of data. Alluxio is a virtual distributed storage system that leverages memory for data storage and provides faster access to data in different storage systems. Alluxio helps to speed up data-intensive Spark applications across various storage systems. In this work, the performance of applications on Spark, as well as Spark running over Alluxio, has been studied with respect to several storage formats such as Parquet, ORC, CSV, and JSON, and four types of queries from the Star Schema Benchmark (SSB). A benchmark is evolved to s...
While cluster computing frameworks are continuously evolving to provide real-time data analysis capabilities, Apache Spark has managed to stay at the forefront of big data analytics by being a unified framework for both batch and stream data processing. However, recent studies on micro-architectural characterization of in-memory data analytics are limited to batch processing workloads only. We compare the micro-architectural performance of batch processing and stream processing workloads in Apache Spark using hardware performance counters on a dual-socket server. In our evaluation experiments, we found that batch processing and stream processing have the same micro-architectural behavior in Spark if the difference between the two implementations is only micro-batching. If the input data rates are small, stream processing workloads are front-end bound. However, the front-end-bound stalls are reduced at larger input data rates and instruction retirement improves. Moreover, Spark workloads using DataFrames have improved instruction retirement over workloads using RDDs.