2021, 9th International Conference "Distributed Computing and Grid Technologies in Science and Education"
The Spark and Hadoop ecosystem includes a wide variety of components and can be integrated with any tool required for Big Data nowadays. From release to release, the developers of these frameworks optimize the inner workings of the components and make their usage more flexible and elaborate. Nevertheless, since the invention of MapReduce as a programming model and the first Hadoop releases, data skew has been the main problem of distributed data processing. Data skew leads to performance degradation, i.e., slowdown of application execution due to idling while waiting for resources to become available. The newest Spark framework versions allow handling this situation easily out of the box. However, there is no opportunity to upgrade versions of tools and the corresponding logic in the case of corporate environments with multiple large-scale projects whose development was started years ago. In this article we consider approaches to execution optimization of an SQL query in the case of data skew on a concrete example wi...
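As an illustration of the two situations this abstract contrasts, the sketch below shows the out-of-the-box skew handling available in newer Spark versions (adaptive query execution) alongside a manual key-salting workaround for older deployments. The table and column names (events, users, user_id) and the output path are hypothetical; this is not the paper's own solution.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object SkewJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("skew-join-sketch")
      // On Spark 3.x the adaptive engine can split skewed partitions automatically.
      .config("spark.sql.adaptive.enabled", "true")
      .config("spark.sql.adaptive.skewJoin.enabled", "true")
      .getOrCreate()

    // Hypothetical tables: 'events' is heavily skewed on user_id, 'users' is smaller.
    val events = spark.table("events")
    val users  = spark.table("users")

    // Manual salting for older Spark versions without AQE:
    val saltBuckets = 16
    val saltedEvents = events.withColumn("salt", (rand() * saltBuckets).cast("int"))
    val saltedUsers  = users.withColumn("salt", explode(array((0 until saltBuckets).map(lit): _*)))

    // Joining on the original key plus the salt spreads one hot key over 16 tasks.
    val joined = saltedEvents.join(saltedUsers, Seq("user_id", "salt"))
    joined.write.mode("overwrite").parquet("/tmp/joined_output")
  }
}
```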
TURKISH JOURNAL OF ELECTRICAL ENGINEERING & COMPUTER SCIENCES
Apache Spark is one of the most technically sophisticated frameworks for cluster computing, in which data are processed in parallel across a cluster of unreliable machines. It processes large amounts of data faster than the MapReduce framework. To provide optimized and fast SQL query processing, a module named Spark SQL was developed for Apache Spark. It allows users to combine relational processing and functional programming in one place. Its Catalyst optimizer leverages Spark's core and applies many rules to produce efficient execution plans. In this paper, we discuss a scenario in which the Catalyst optimizer is not able to optimize the query competently for a specific case, leading to inefficient memory usage and an increase in the time Spark SQL requires to execute the query. To deal with this issue, we propose a solution that optimizes the query further, significantly reducing the time and memory consumed by the shuffling process.
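The paper's own rewrite is not reproduced here, but the following sketch shows the generic workflow it presupposes: inspecting the plan Catalyst produces and manually intervening (here with a broadcast hint) when the default shuffle-heavy plan is poor. The input paths and column names are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object PlanInspectionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("plan-inspection").getOrCreate()

    // Hypothetical inputs: a large fact table and a small dimension table.
    val orders  = spark.read.parquet("/data/orders")
    val regions = spark.read.parquet("/data/regions")

    // Catalyst's default plan may pick a shuffle-based sort-merge join here.
    val byRegion = orders.join(regions, "region_id")
      .groupBy("region_name").count()
    byRegion.explain(true)   // prints the logical and physical plans

    // A broadcast hint removes the shuffle of the large side entirely,
    // one way to step in when the optimizer's choice is inefficient.
    val hinted = orders.join(broadcast(regions), "region_id")
      .groupBy("region_name").count()
    hinted.explain()
  }
}
```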
Journal of Big Data
Many applications generate and handle very large volumes of data, such as social networks, cloud applications, public web sites, search engines, scientific simulations, and data warehouses. These large volumes of data come from a variety of sources and include both unstructured and structured data. To transform this massive, heterogeneous data efficiently into valuable information and meaningful knowledge, large-scale cluster infrastructures are needed. In this context, one challenging problem is effective resource management of these large-scale cluster infrastructures for running distributed data analytics.
2015
Apache Spark, a new big data processing framework, caches data in memory and then processes it. Spark creates Resilient Distributed Datasets (RDDs) from data, which are cached in memory. Although Spark is popular for its performance in iterative applications, its performance can be limited by several factors. One such factor is disk access time. This paper incorporates some approaches for improving Spark's performance by reducing its disk access time. Keywords: Apache Spark, Caching, Prefetching.
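The caching this abstract refers to is controlled per RDD through storage levels. The minimal sketch below shows the mechanism; the input path and parsing logic are placeholders, and the paper's own prefetching approach is not shown.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

object CachingSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("caching-sketch"))

    // Hypothetical input; the parsed RDD is reused by several actions.
    val lines  = sc.textFile("hdfs:///data/pageviews")
    val parsed = lines.map(_.split('\t')).filter(_.length > 1)

    // MEMORY_ONLY keeps partitions purely in memory; MEMORY_AND_DISK_SER spills
    // serialized blocks to disk, trading CPU for fewer recomputations from disk.
    parsed.persist(StorageLevel.MEMORY_AND_DISK_SER)

    // Both actions reuse the cached partitions instead of re-reading the input.
    val total        = parsed.count()
    val distinctKeys = parsed.map(_(0)).distinct().count()
    println(s"rows=$total distinctKeys=$distinctKeys")

    parsed.unpersist()
    sc.stop()
  }
}
```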
IOP Conference Series: Materials Science and Engineering
Big data is becoming bigger every day. Even for relatively simple applications such as the Digital Bibliography & Library Project (DBLP) database, the data is becoming unmanageable with conventional databases because of its size. Applying big data processing frameworks such as Hadoop and Spark is therefore becoming more popular. In this work, we investigate the use of Hadoop and Spark for querying big data and compare their performance in terms of execution time, using the DBLP database as a case study. Results show that Hadoop and Spark reduce query execution time significantly compared with conventional database management systems. We also found that Spark improves the execution time over Hadoop.
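The study's exact queries are not reproduced here; the sketch below only illustrates the kind of aggregation query whose latency such a comparison measures, run through Spark SQL over an assumed DBLP export with columns (title, author, venue, year).

```scala
import org.apache.spark.sql.SparkSession

object DblpQuerySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dblp-query-sketch").getOrCreate()

    // Hypothetical DBLP export in Parquet.
    spark.read.parquet("/data/dblp/publications").createOrReplaceTempView("publications")

    // Example analytical query over the bibliography data.
    val perVenue = spark.sql(
      """SELECT venue, year, COUNT(*) AS n_papers
        |FROM publications
        |WHERE year >= 2000
        |GROUP BY venue, year
        |ORDER BY n_papers DESC""".stripMargin)

    perVenue.show(20)
  }
}
```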
Big data is the term used ubiquitously nowadays in the distributed paradigm on the web. As the name points out, it refers to collections of very large data sets, measured in petabytes, exabytes, and so on, together with the systems and algorithms used to analyze this enormous data. Hadoop, as a big data processing technology, has proven to be the go-to solution for processing enormous data sets. MapReduce is a prominent solution for computations that need only one pass to complete, but it is not efficient for use cases that need multi-pass computations and algorithms. The job output data between every stage has to be stored in the file system before the next stage can begin; consequently, this approach is slow due to disk input/output operations and replication. Additionally, the Hadoop ecosystem does not have every component needed to complete a big data use case. If we want to run an iterative job, we have to stitch together a sequence of MapReduce jobs and execute them in order; each of these jobs has high latency, and each depends on the completion of the previous stage. Apache Spark is one of the most widely used open-source processing engines for big data, with rich language-integrated APIs and an extensive range of libraries. Apache Spark is a general framework for distributed computing that offers high performance for both batch and interactive processing. In this paper, we aim to present a close-up view of Apache Spark, its features, and working with Spark on Hadoop. We briefly discuss Resilient Distributed Datasets (RDDs), RDD operations, their features, and limitations. Spark can be used along with MapReduce in the same Hadoop cluster or can be used alone as a processing framework. The paper closes with a comparative analysis between Spark and Hadoop MapReduce.
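The iterative-job argument in this abstract is easiest to see in code. The classic PageRank-style loop below, adapted from the standard Spark RDD example, keeps the link structure cached in memory across iterations, whereas a chain of MapReduce jobs would write that intermediate state to the file system after every stage. The input path and edge-list format are assumptions.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object IterativeSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("iterative-sketch"))

    // Hypothetical edge list: one "src dst" pair per line.
    val links = sc.textFile("hdfs:///data/edges")
      .map { line => val p = line.split("\\s+"); (p(0), p(1)) }
      .groupByKey()
      .cache()                      // reused on every iteration, so keep it in memory

    var ranks = links.mapValues(_ => 1.0)

    // Each MapReduce stage would persist this intermediate state to HDFS;
    // with RDDs it stays in the cluster's memory between iterations.
    for (_ <- 1 to 10) {
      val contribs = links.join(ranks).values.flatMap {
        case (dests, rank) => dests.map(d => (d, rank / dests.size))
      }
      ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
    }

    ranks.take(10).foreach(println)
    sc.stop()
  }
}
```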
2018
At the Uppsala Monitoring Centre (UMC), individual case safety reports (ICSRs) are managed, analyzed and processed for publishing statistics on adverse drug reactions. On top of the UMC's ICSR database there is a data processing tool used to analyze the data. Unfortunately, some constraints limit the current processing tool, and the amount of arriving data to be processed grows at a rapid rate. The UMC's processing system must be improved in order to handle future demands. To improve performance, various frameworks for parallelization can be used. In this work, the in-memory computing framework Spark was used to parallelize one of the current data processing tasks. Local clusters for running the new implementation in parallel were also established.
Proceedings of the 2011 international conference on Management of data - SIGMOD '11, 2011
Hadapt is a start-up company currently commercializing the Yale University research project called HadoopDB. The company focuses on building a platform for Big Data analytics in the cloud by introducing a storage layer optimized for structured data and by providing a framework for executing SQL queries efficiently. This work considers processing data warehousing queries over very large datasets. Our goal is to maximize performance while, at the same time, not giving up fault tolerance and scalability. We analyze the complexity of this problem in the split execution environment of HadoopDB. Here, incoming queries are examined; parts of the query are pushed down and executed inside the higher performing database layer; and the rest of the query is processed in a more generic MapReduce framework. In this paper, we discuss in detail performance-oriented query execution strategies for data warehouse queries in split execution environments, with particular focus on join and aggregation operations. The efficiency of our techniques is demonstrated by running experiments using the TPC-H benchmark with 3TB of data. In these experiments we compare our results with a standard commercial parallel database and an open-source MapReduce implementation featuring a SQL interface (Hive). We show that HadoopDB successfully competes with other systems.
Proceedings of the VLDB Endowment
The cost of big-data query execution is dominated by stateful operators. These include sort and hash-aggregate that typically materialize intermediate data in memory, and exchange that materializes data to disk and transfers data over the network. In this paper we focus on several query optimization techniques that reduce the cost of these operators. First, we introduce a novel exchange placement algorithm that improves the state-of-the-art and significantly reduces the amount of data exchanged. The algorithm simultaneously minimizes the number of exchanges required and maximizes computation reuse via multi-consumer exchanges. Second, we introduce three partial push-down optimizations that push down partial computation derived from existing operators (group-bys, intersections and joins) below these stateful operators. While these optimizations are generically applicable we find that two of these optimizations (partial aggregate and partial semi-join push-down) are only benefici...
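The partial aggregate push-down the abstract describes can be illustrated by a hand-written query rewrite: aggregating the large input on the join key before the join so that less data reaches the stateful exchange and final group-by. The sketch below expresses this in Spark SQL over hypothetical TPC-H-style tables; it is an illustration of the idea, not the paper's optimizer implementation.

```scala
import org.apache.spark.sql.SparkSession

object PartialAggPushdownSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("partial-agg-pushdown").getOrCreate()

    // Hypothetical tables: lineitem (large) joined to orders, then aggregated.
    spark.read.parquet("/data/lineitem").createOrReplaceTempView("lineitem")
    spark.read.parquet("/data/orders").createOrReplaceTempView("orders")

    // Original form: the full join output flows into the group-by.
    val original = spark.sql(
      """SELECT o.o_custkey, SUM(l.l_extendedprice) AS revenue
        |FROM lineitem l JOIN orders o ON l.l_orderkey = o.o_orderkey
        |GROUP BY o.o_custkey""".stripMargin)

    // Partial push-down: pre-aggregate lineitem on the join key first,
    // so far less data goes through the join and the final exchange.
    val pushedDown = spark.sql(
      """SELECT o.o_custkey, SUM(l.partial_revenue) AS revenue
        |FROM (SELECT l_orderkey, SUM(l_extendedprice) AS partial_revenue
        |      FROM lineitem GROUP BY l_orderkey) l
        |JOIN orders o ON l.l_orderkey = o.o_orderkey
        |GROUP BY o.o_custkey""".stripMargin)

    original.explain()
    pushedDown.explain()
  }
}
```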
International Journal of Computer Science and Information Security (IJCSIS), Vol. 22, No. 6, November-December, 2024
Apache Spark is a widely used framework for distributed data processing, offering scalability and efficiency for handling large datasets. However, its default file-splitting and data allocation mechanisms often result in significant inter-executor data movement, causing increased shuffle costs, job latency, and resource contention. This paper proposes a novel content-aware strategy integrating Apache Iceberg Technology to optimise file-splitting and dynamic data-to-executor mapping. The proposed approach minimises inter-node data transfers by preserving data locality and embedding intelligence into data placement during the file-reading phase. Experimental evaluation using datasets of varying sizes (500 GB, 1 TB, and 10 TB) demonstrates significant improvements over state-of-the-art baseline methods. The proposed method reduces job execution time by up to 31%, shuffle costs by up to 35%, and data transfer volumes by up to 44%. Additionally, it achieves better resource utilisation and cost efficiency, highlighting its scalability and economic benefits for cloud-based environments. Keywords: apache spark; optimization; data allocation; data locality; distributed data processing; cloud
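For context on the default mechanisms this paper improves upon, the sketch below shows the stock knobs that govern how Spark splits input files and how long it waits for locality, plus Iceberg's per-read split-size option. It assumes the Iceberg runtime and a configured catalog; the table name is hypothetical, and this is not the paper's content-aware allocator.

```scala
import org.apache.spark.sql.SparkSession

object SplitTuningSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("split-tuning-sketch")
      .config("spark.sql.files.maxPartitionBytes", "256MB") // coarser splits, fewer tasks
      .config("spark.locality.wait", "3s")                  // how long to wait for a data-local executor
      .getOrCreate()

    // With an Iceberg table, the target split size can also be set per read.
    val events = spark.read
      .format("iceberg")
      .option("split-size", (256L * 1024 * 1024).toString)  // Iceberg read option, in bytes
      .load("warehouse.db.events")                          // hypothetical table identifier

    println(s"input partitions: ${events.rdd.getNumPartitions}")
  }
}
```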
Hadoop is a very popular general-purpose framework for many different classes of data-intensive applications. However, it is not good for iterative operations because of the cost of reloading data from disk at each iteration. As an emerging framework, Spark, which is designed with a global cache mechanism, can achieve better response times since data is accessed in memory across the distributed machines of the cluster during the entire iterative process. Although the time performance of Spark over Hadoop has been evaluated [1], memory consumption, another system performance criterion, is not deeply analyzed in the literature. In this work, we conducted extensive experiments on iterative operations to compare the performance in both time and memory cost between Hadoop and Spark. We found that although Spark is in general faster than Hadoop for iterative operations, it pays for this with higher memory consumption. Also, its speed advantage is weakened when memory is not sufficient to store newly created intermediate results.
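The memory cost the study measures can be inspected directly from a Spark application. The sketch below caches an RDD in a serialized storage level (one way to shrink the footprint at some CPU cost), reports how much memory the cached blocks occupy, and releases them; the input path is a placeholder, and this is not the experiment setup from the paper.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

object MemoryCostSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("memory-cost-sketch"))

    // Hypothetical input used by an iterative job.
    val data = sc.textFile("hdfs:///data/iterative_input").map(_.toLowerCase)

    // Serialized caching trades CPU for a smaller in-memory footprint.
    data.persist(StorageLevel.MEMORY_ONLY_SER)
    data.count()   // materialize the cache

    // Report how much memory and disk the cached RDD actually occupies.
    sc.getRDDStorageInfo.foreach { info =>
      println(s"${info.name}: memory=${info.memSize} bytes, disk=${info.diskSize} bytes")
    }

    data.unpersist(blocking = true)   // release memory before the next phase
    sc.stop()
  }
}
```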