2018
Apache Spark employs lazy evaluation [11, 6]; that is, in Spark, a dataset is represented as a Resilient Distributed Dataset (RDD), and a single-threaded application (driver) program simply describes transformations (RDD to RDD), referred to as lineage [7, 12], without performing distributed computation until output is requested. The lineage traces computation and dependencies back to external (and assumed durable) data sources, allowing Spark to opportunistically cache intermediate RDDs, because it can recompute everything from the external data sources. To initiate computation on worker machines, the driver process constructs a directed acyclic graph (DAG) representing computation and dependencies according to the requested RDD’s lineage. Then the driver broadcasts this DAG to all involved workers, requesting that they execute their portion of the result RDD. When a requested RDD has a long lineage, as one would expect from iterative convergent or streaming applications [9, 15], constructing and ...
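A minimal sketch of the execution model described above, using the public Spark RDD API in Scala (the input path and the word-count pipeline are placeholder assumptions, not taken from the paper): transformations only extend the lineage, and the final action is what makes the driver build the DAG and ship tasks to the workers.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyLineageSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("lineage-sketch").setMaster("local[*]"))

    // Transformations below only describe lineage (RDD to RDD);
    // no distributed computation happens yet.
    val lines  = sc.textFile("hdfs:///data/events.txt") // hypothetical path
    val tokens = lines.flatMap(_.split("\\s+"))
    val counts = tokens.map(w => (w, 1)).reduceByKey(_ + _)

    // Optionally mark an intermediate RDD for caching; if a cached
    // partition is lost, Spark can recompute it from the external
    // source by replaying the lineage.
    counts.cache()

    // The action triggers DAG construction from the lineage on the
    // driver and task execution on the workers.
    counts.sortBy(_._2, ascending = false).take(10).foreach(println)

    sc.stop()
  }
}
```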
2020 IEEE International Conference on Big Data (Big Data)
Apache Spark is a distributed computing framework used for big data processing. A common pattern in many Spark applications is to iteratively evolve a dataset until reaching some user-specified convergence condition. Unfortunately, some aspects of Spark's execution model make it difficult for developers who are not familiar with the implementation-level details of Spark to write efficient iterative programs. Since results are constructed iteratively and results from previous iterations may be used multiple times, effective use of caching is necessary to avoid recomputing intermediate results. Currently, developers of Spark applications must manually indicate which intermediate results should be cached. We present a method for using metadata already captured by Spark to automate caching decisions for many Spark programs. We show how this allows Spark applications to benefit from caching without the need for manual caching annotations.
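The manual annotations the abstract refers to typically look like the hedged sketch below (the step function, convergence metric, and tolerance are illustrative assumptions, not the paper's method): an RDD produced in one iteration is reused both by the convergence check and by the next iteration, so the developer must place persist/unpersist calls by hand.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Hypothetical iterative refinement loop: `step` and `error` stand in
// for whatever transformation and convergence metric an application uses.
def iterateUntilConverged(initial: RDD[Double],
                          step: RDD[Double] => RDD[Double],
                          error: RDD[Double] => Double,
                          tol: Double): RDD[Double] = {
  var current = initial.persist(StorageLevel.MEMORY_ONLY) // manual caching annotation
  var converged = false
  while (!converged) {
    val next = step(current).persist(StorageLevel.MEMORY_ONLY)
    // `next` is used twice (convergence check here, input of the next
    // iteration later); without the persist above it would be recomputed
    // from its lineage on each reuse.
    converged = error(next) < tol
    current.unpersist() // manually release the previous iteration's cache
    current = next
  }
  current
}
```

Automating these decisions amounts to inferring, from metadata such as the lineage, which intermediate RDDs will be reused and inserting the equivalent of these calls without developer involvement.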
Procedia Computer Science, 2018
The increasing computational complexity of Big Data software requires scaling up the nodes of the commodity-hardware clusters that have been widely used for Big Data workloads. Thus, FPGA-based accelerators and GPU devices have recently become first-class citizens in data centers. Utilizing these devices is not a trivial task, however, from an engineering-effort perspective, since developers versed in distributed computing frameworks such as Apache Spark are used to developing in higher-level languages and APIs, like Python and Scala, while accelerators require the use of low-level APIs like CUDA and OpenCL. Through recent developments in accelerator virtualization like VineTalk [6], a software layer that handles the complex communication between applications and FPGA or GPU devices, software development using accelerators has been simplified. This paper presents HetSpark, a heterogeneous modification of Apache Spark. HetSpark enables Apache Spark to operate with two classes of executors: an accelerated class and a commodity class. HetSpark applications are expected to use VineTalk for their entire interaction with accelerators. The schedulers of HetSpark are sophisticated enough to detect the presence of VineTalk routines in the Java bytecode. Thus, they decide which tasks require the use of accelerators and send them only to executors of the accelerated class. Finally, we thoroughly evaluated the performance of HetSpark with different mixes of executors from the two classes. When applications run linear tasks, we observed that the use of CPU-only executors is preferable to GPU-enhanced executors, while for applications with computationally challenging tasks, the time savings from the use of GPUs compensate for data transfers between commodity and accelerated executors.
ArXiv, 2021
Distributed data processing ecosystems are widespread and their components are highly specialized, such that efficient interoperability is urgent. Recently, Apache Arrow was chosen by the community to serve as a format mediator, providing efficient in-memory data representation. Arrow enables efficient data movement between data processing and storage engines, significantly improving interoperability and overall performance. In this work, we design a new zero-cost data interoperability layer between Apache Spark and Arrow-based data sources through the Arrow Dataset API. Our novel data interface helps separate the computation (Spark) and data (Arrow) layers. This enables practitioners to seamlessly use Spark to access data from all Arrow Dataset API-enabled data sources and frameworks. To benefit our community, we open-source our work and show that consuming data through Apache Arrow is zero-cost: our novel data interface is either on par with or more performant than native Spark.
Journal of Big Data
Many applications generate and handle very large volumes of data, such as social networking, cloud applications, public web sites, search engines, scientific simulations, data warehouses, and so on. These large volumes of data come from a variety of sources and include both unstructured and structured data. In order to efficiently transform this massive data of various types into valuable information and meaningful knowledge, we need large-scale cluster infrastructures. In this context, one challenging problem is to realize effective resource management of these large-scale cluster infrastructures in order to run distributed data analytics.
Italy, 2016
IEEE Access, 2021
Spark programs are typically written to reuse some of their generated datasets, called partition instances, so that their subsequent computations complete in a reasonable time. At runtime, however, the underlying Spark platform may independently delete such instances or accidentally render them inaccessible to the program executions. This invalidates the assumption, made when writing these programs, that the depended-upon instances are present, which leads to performance bloat and may even break the executions. In this paper, we present FAR, a novel and effective framework that handles such performance bloat and actively repairs the executions by maintaining the instance dependencies in Spark program executions. FAR monitors partition instance lifecycle activities at all levels and determines, from the execution plan of the current Spark action in the current program execution, whether a partition instance will have a dependency relation with a later one underlying the computation of that action. The experimental results showed that with the active execution repair mechanism of FAR, when some depended-upon partition instances were inaccessible, programs achieved a 7.3x to 67.0x speedup in regenerating them. The results also, interestingly, revealed that program executions actively repaired by FAR can run to successful completion in environments with 1.7x-2.0x less available memory. INDEX TERMS: Debugging, execution repair, dataset dependency, big data.
2015
Apache Spark, a big data processing framework, caches data in memory and then processes it. Spark creates Resilient Distributed Datasets (RDDs) from data, which are cached in memory. Although Spark is popular for its performance in iterative applications, its performance can be limited by several factors. One such factor is disk access time. This paper incorporates some approaches for performance improvement in Spark by trying to improve its disk access time. Keywords: Apache Spark, Caching, Prefetching.
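As context for where disk access time enters the picture, here is a hedged sketch using Spark's standard storage levels (the input path and aggregation are placeholders, not the paper's technique): when a cached RDD does not fit in memory under MEMORY_AND_DISK_SER, the overflow partitions are spilled to local disk and re-read on every reuse, which is the kind of cost that prefetching-style approaches target.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

// Illustrative only: choosing where cached partitions live.
def cacheWithSpill(sc: SparkContext): Unit = {
  val ratings = sc.textFile("hdfs:///data/ratings.csv") // hypothetical input
    .map(_.split(","))
    .map(fields => (fields(0), fields(2).toDouble))

  // Partitions that do not fit in memory are serialized and spilled to disk.
  ratings.persist(StorageLevel.MEMORY_AND_DISK_SER)

  // Repeated actions reuse the cached partitions instead of re-reading the
  // source, but disk-resident partitions still pay local I/O cost on reuse.
  val count = ratings.count()
  val mean  = ratings.map(_._2).mean()
  println(s"$count ratings, mean rating = $mean")
}
```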
Springer eBooks, 2017
The success of using workflows for modeling large-scale scientific applications has fostered research on parallel execution of scientific workflows in shared-nothing clusters, in which large volumes of scientific data may be stored and processed in parallel using ordinary machines. However, most current scientific workflow management systems do not handle memory and data locality appropriately. Apache Spark deals with these issues by chaining activities that should be executed in a specific node, among other optimizations such as the in-memory storage of intermediate data in RDDs (Resilient Distributed Datasets). However, to take advantage of RDDs, Spark requires existing workflows to be described using its own API, which forces the activities to be reimplemented in Python, Java, Scala or R, and this demands a big effort from the workflow programmers. In this paper, we propose a parallel scientific workflow engine called TARDIS, whose objective is to run existing workflows inside a Spark cluster, using RDDs and smart caching, in a completely transparent way for the user, i.e., without needing to reimplement the workflows in the Spark API. We evaluated our system through experiments and compared its performance with Swift/K. The results show that TARDIS performs better (up to 138% improvement) than Swift/K for parallel scientific workflow execution.
Big Data is a term used ubiquitously nowadays for the distributed paradigm on the web. As the name points out, it refers to collections of very large amounts of data, in petabytes, exabytes, and so on, to the related systems, as well as to the algorithms used to analyze this enormous data. Hadoop, as a big data processing technology, has proven to be the go-to solution for processing enormous data sets. MapReduce is a prominent solution for computations that require a single pass to complete, but it is not particularly efficient for use cases that need multi-pass computations and algorithms. The job output data of every stage has to be stored in the file system before the next stage can begin; consequently, this method is slow due to disk input/output operations and replication. Additionally, the Hadoop ecosystem does not have every component needed to complete a big data use case. If you want to run an iterative job, you have to stitch together a sequence of MapReduce jobs and execute them in sequence; each of these jobs has high latency, and each depends on the completion of the previous stage. Apache Spark is one of the most widely used open source processing engines for big data, with rich language-integrated APIs and an extensive range of libraries. Apache Spark is a general framework for distributed computing that offers high performance for both batch and interactive processing. In this paper, we aim to provide a close-up view of Apache Spark, its features, and working with Spark using Hadoop. In a nutshell, we discuss Resilient Distributed Datasets (RDDs), RDD operations, features, and limitations. Spark can be used along with MapReduce in the same Hadoop cluster or can be used alone as a processing framework. Finally, the paper presents a comparative analysis between Spark, Hadoop, and MapReduce.
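To make the single-pass versus multi-pass contrast concrete, below is a hedged sketch of an iterative PageRank-style job in Spark (the edge-list path and iteration count are placeholder assumptions). In a chained-MapReduce implementation, each iteration would be a separate job writing its output to the file system; here the link structure is cached once and every iteration reuses it in memory.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object IterativeSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("iterative-sketch").setMaster("local[*]"))

    // Edge list with one "src dst" pair per line; the path is a placeholder.
    val links = sc.textFile("hdfs:///data/edges.txt")
      .map { line => val parts = line.split("\\s+"); (parts(0), parts(1)) }
      .groupByKey()
      .cache() // reused in every iteration, so keep it in memory

    var ranks = links.mapValues(_ => 1.0)

    // Each pass stays in memory; no intermediate results hit the file system.
    for (_ <- 1 to 10) {
      val contribs = links.join(ranks).values.flatMap {
        case (neighbors, rank) => neighbors.map(dst => (dst, rank / neighbors.size))
      }
      ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
    }

    ranks.take(5).foreach(println)
    sc.stop()
  }
}
```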
Proceedings of the 7th International Conference on Cloud Computing and Services Science, 2017
The contribution of this paper is twofold. First, we propose a Domain Specific Language (DSL) to easily reconfigure and compose Spark applications. For each Spark application we define its input and output interfaces. Then, given a set of connections that map outputs of some Spark applications to free inputs of other Spark applications, we automatically embed Spark applications with the required synchronization and communication to properly run them according to the user-defined mapping. Second, we present an adaptive quality management/selection method for Spark applications. The method takes as input a pipeline of parameterized Spark applications, where the execution time of each Spark application is an unknown increasing function of quality-level parameters. The method builds a controller that automatically computes an adequate quality level for each Spark application to meet a user-defined deadline. Consequently, users can submit a pipeline of Spark applications and a deadline, and our method automatically runs all the Spark applications with the maximum quality while respecting the deadline specified by the user. We present experimental results showing the effectiveness of our method.
2017 IEEE International Conference on Big Data (Big Data), 2017
Proceedings of the 5th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond, 2018
arXiv (Cornell University), 2017
International Journal of Computer Science and Information Security (IJCSIS), Vol. 22, No. 6, November-December, 2024
Complexity, 2021
International Journal of Data Science and Analytics, 2016
Proceedings of the 26th International Conference on World Wide Web Companion - WWW '17 Companion, 2017
Proceedings of the 2019 ACM SIGPLAN International Symposium on Memory Management, 2019
Service-Oriented and Cloud Computing
Lecture Notes in Computer Science, 2016
Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing - HPDC '16, 2016