This paper investigates the performance of the BSP/CGM model for high-performance computing (HPC) in cloud environments through experiments on classical graph problems such as Connected Components and Eulerian Path. The results demonstrate how the BSP model provides advantages over MapReduce in computation efficiency, particularly in large cloud deployments, paving the way for future applications in scalable computing that exploit locality and partial synchronization.
Nowadays the concepts and infrastructure of Cloud Computing are becoming a standard for many applications. Scalability is no longer just a buzzword; it is being exploited effectively. However, despite the economic advantages of virtualization and scalability, factors such as latency, bandwidth, and processor sharing can be obstacles to doing Parallel Computing on the Cloud.
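To make the BSP superstep pattern behind these results concrete, here is a minimal single-process sketch of Connected Components by label propagation, one of the problems studied above. The function name and toy graph are illustrative only; a real BSP/CGM implementation would partition vertices across processors and exchange labels as messages at each synchronization barrier.

```python
# Minimal sketch of BSP-style connected components via label propagation.
# Each superstep: every vertex gathers its neighbours' current component
# labels (the "communication" phase), then adopts the smallest label seen
# (the "computation" phase); the loop boundary plays the barrier's role.

def bsp_connected_components(edges, num_vertices):
    adj = [[] for _ in range(num_vertices)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    label = list(range(num_vertices))  # each vertex starts alone
    changed = True
    while changed:                      # one iteration == one superstep
        changed = False
        incoming = [min((label[n] for n in adj[v]), default=label[v])
                    for v in range(num_vertices)]
        for v in range(num_vertices):
            if incoming[v] < label[v]:
                label[v] = incoming[v]
                changed = True
    return label

print(bsp_connected_components([(0, 1), (1, 2), (3, 4)], 5))
# -> [0, 0, 0, 3, 3]: two components
```

The number of supersteps is bounded by the longest shortest path in a component, which is exactly the kind of synchronization cost the BSP/CGM analysis above accounts for.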
MapReduce is a framework for processing large data sets that is widely used in cloud computing. MapReduce implementations like Hadoop can tolerate crashes and file corruptions, but there is evidence that general arbitrary faults do occur and can affect the correctness of job executions. Furthermore, many individual cloud outages have been reported, raising concerns about depending on a single cloud.
2009
Recently, there has been a huge growth in the amount of data processed by enterprises and the scientific computing community. Two promising trends ensure that applications will be able to deal with ever-increasing data volumes: first, the emergence of cloud computing, which provides transparent access to a large number of compute, storage and networking resources; and second, the development of the MapReduce programming model, which provides a high-level abstraction for data-intensive computing. However, the design space of these systems has not been explored in detail. Specifically, the impact of various design choices and run-time parameters of a MapReduce system on application performance remains an open question.
Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, 2010
In recent years the MapReduce framework has emerged as one of the most widely used parallel computing platforms for processing data on terabyte and petabyte scales. Used daily at companies such as Yahoo!, Google, Amazon, and Facebook, and adopted more recently by several universities, it allows for easy parallelization of data intensive computations over many machines. One key feature of MapReduce that differentiates it from previous models of parallel computation is that it interleaves sequential and parallel computation. We propose a model of efficient computation using the MapReduce paradigm. Since MapReduce is designed for computations over massive data sets, our model limits the number of machines and the memory per machine to be substantially sublinear in the size of the input. On the other hand, we place very loose restrictions on the computational power of any individual machine: our model allows each machine to perform sequential computations in time polynomial in the size of the original input. We compare MapReduce to the PRAM model of computation. We prove a simulation lemma showing that a large class of PRAM algorithms can be efficiently simulated via MapReduce. The strength of MapReduce, however, lies in the fact that it uses both sequential and parallel computation. We demonstrate how algorithms can take advantage of this fact to compute an MST of a dense graph in only two rounds, as opposed to the Ω(log(n)) rounds needed in the standard PRAM model. We show how to evaluate a wide class of functions using the MapReduce framework. We conclude by applying this result to show how to solve some basic algorithmic problems such as undirected s-t connectivity in the MapReduce framework.
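The round structure this model formalizes can be summarized in a few lines. The sketch below simulates a single MapReduce round in memory, with word count as an illustrative stand-in for the mapper and reducer; the model's machine count and sublinear-memory constraints are of course not enforced in this toy.

```python
# In-memory simulation of one MapReduce round: an embarrassingly parallel
# map phase, a shuffle that groups values by key (the only communication),
# and a reduce phase that processes each key's group sequentially.
from collections import defaultdict
from itertools import chain

def mapreduce_round(records, mapper, reducer):
    pairs = chain.from_iterable(mapper(r) for r in records)   # map
    groups = defaultdict(list)
    for k, v in pairs:                                        # shuffle
        groups[k].append(v)
    return {k: reducer(k, vs) for k, vs in groups.items()}    # reduce

counts = mapreduce_round(
    ["a b a", "b c"],
    mapper=lambda line: [(w, 1) for w in line.split()],
    reducer=lambda w, ones: sum(ones),
)
print(counts)  # {'a': 2, 'b': 2, 'c': 1}
```

Multi-round algorithms such as the two-round MST above chain several such rounds, each round's output feeding the next round's map phase.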
2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum, 2013
Implementing parallel graph algorithms on large shared-memory machines, such as the Cray XMT, can be challenging for programmers. Synchronization, deadlock, hotspotting, and other issues can be barriers to obtaining linear scalability. Alternative programming models, such as the bulk synchronous parallel (BSP) programming model used in Google's Pregel, have been proposed for large graph analytics. This model eases programmer complexity by eliminating deadlock and simplifying data sharing. We investigate the algorithmic effects of the BSP model for graph algorithms and compare performance and scalability with hand-tuned, open source software using GraphCT. We analyze the innermost iterations of these algorithms to understand the differences in scalability between BSP and shared-memory algorithms. We show that scalable performance can be obtained with graph algorithms in the BSP model on the Cray XMT. These algorithms perform within a factor of 10 of hand-tuned C code.
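For readers unfamiliar with the vertex-centric BSP style being compared, the following is a tiny single-process runtime sketch: vertices exchange messages only between global supersteps, so there are no locks and no deadlock. The API and the single-source shortest paths example are illustrative assumptions, not GraphCT's or Pregel's actual interfaces.

```python
# A tiny single-process Pregel-style runtime: per superstep, each active
# vertex reads messages from the previous superstep, updates its value,
# and sends messages that are delivered only after the global barrier.

def pregel_sssp(graph, source):
    """graph: {vertex: [(neighbour, weight), ...]}; returns distances."""
    dist = {v: float("inf") for v in graph}
    inbox = {v: [] for v in graph}
    inbox[source].append(0)
    active = {source}
    while active:                      # one loop iteration == one superstep
        outbox = {v: [] for v in graph}
        for v in active:               # compute phase: independent vertices
            best = min(inbox[v])
            if best < dist[v]:
                dist[v] = best
                for n, w in graph[v]:
                    outbox[n].append(best + w)
        inbox = outbox                 # barrier + message delivery
        active = {v for v in graph if inbox[v]}
    return dist

g = {"a": [("b", 1), ("c", 4)], "b": [("c", 1)], "c": []}
print(pregel_sssp(g, "a"))  # {'a': 0, 'b': 1, 'c': 2}
```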
2012
The MapReduce framework, originally proposed by Google [8], and its open source implementation, Hadoop [27], are nowadays considered the standard frameworks, both in industry and academia, for dealing with petabyte-scale datasets. In this paper we describe a two-round MapReduce approach to biconnectivity in undirected graphs, that is, the computation of the set of articulation points, the set of bridges, and the set of biconnected components of a graph G. We recall that an articulation point (resp. a bridge) is a vertex (resp. an edge) whose removal increases the number of connected components. A biconnected component is a maximal biconnected subgraph, i.e., it contains neither articulation points nor bridges.
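As a reminder of these definitions, the sketch below computes articulation points and bridges with the classical sequential Tarjan-style DFS; the paper's contribution is expressing this computation in two MapReduce rounds, which this single-machine baseline does not attempt.

```python
# Tarjan-style DFS: disc[u] is u's discovery time, low[u] the earliest
# discovery time reachable from u's subtree via at most one back edge.
# Edge (u, v) is a bridge iff low[v] > disc[u]; u is an articulation
# point iff some child v has low[v] >= disc[u] (roots are special).

def articulation_points_and_bridges(adj):
    """adj: {vertex: set(neighbours)} of an undirected graph."""
    disc, low = {}, {}
    points, bridges = set(), set()
    timer = [0]

    def dfs(u, parent):
        disc[u] = low[u] = timer[0]; timer[0] += 1
        children = 0
        for v in adj[u]:
            if v == parent:
                continue
            if v in disc:                      # back edge
                low[u] = min(low[u], disc[v])
            else:
                children += 1
                dfs(v, u)
                low[u] = min(low[u], low[v])
                if low[v] > disc[u]:           # no back edge past u
                    bridges.add((u, v))
                if parent is not None and low[v] >= disc[u]:
                    points.add(u)
        if parent is None and children > 1:    # root with >= 2 DFS children
            points.add(u)

    for u in adj:
        if u not in disc:
            dfs(u, None)
    return points, bridges

# triangle 0-1-2 with pendant vertex 3 hanging off vertex 2
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(articulation_points_and_bridges(adj))  # ({2}, {(2, 3)})
```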
Lecture Notes in Computer Science, 2012
MapReduce (MR) has become a de facto standard for large-scale data analysis. Moreover, it has also attracted the attention of the HPC community due to its simplicity, efficiency, and highly scalable parallel model. However, MR implementations present some issues that may complicate their execution in existing HPC clusters, especially concerning job submission. While MR requires no strict parameters to submit a job, in a typical HPC cluster users must specify the number of nodes and the amount of time required to complete the job execution. This paper presents the MR Job Adaptor, a component to optimize the scheduling of MR jobs alongside HPC jobs in an HPC cluster. Experiments performed using real-world HPC and MapReduce workloads show that the MR Job Adaptor can properly transform MR jobs to be scheduled in an HPC cluster, minimizing job turnaround time and exploiting unused resources in the cluster.
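A hypothetical sketch of the adaptor's core decision may help: an MR job arrives without node or walltime limits, so both must be estimated before handing the job to the HPC scheduler. All of the sizing heuristics and defaults below are assumptions for illustration, not the MR Job Adaptor's actual policy.

```python
# Hypothetical MR-to-HPC job translation: derive the number of map tasks
# from the input size, size the node request against currently idle
# nodes, and estimate walltime from the number of task "waves".

def adapt_mr_job(input_bytes, split_bytes=128 * 2**20,
                 slots_per_node=8, secs_per_task=60, idle_nodes=16):
    """Return (nodes, walltime_seconds) for an HPC submission."""
    map_tasks = max(1, -(-input_bytes // split_bytes))   # ceil division
    nodes = min(idle_nodes, max(1, -(-map_tasks // slots_per_node)))
    waves = -(-map_tasks // (nodes * slots_per_node))
    walltime = waves * secs_per_task
    return nodes, walltime

print(adapt_mr_job(input_bytes=10 * 2**30))  # (10, 60): 80 map tasks
```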
Computer Communications and Networks, 2015
Hadoop is a software framework that allows distributed processing of large datasets across clusters of computers using simple programming constructs/models. It is designed to scale up from a single server to thousands of nodes, and to detect failures at the application level rather than rely on hardware for high availability, thereby delivering a highly available service on top of a cluster of commodity hardware nodes, each of which is prone to failures [2]. While Hadoop can run on a single machine, its true power is realized in its ability to scale up to thousands of computers, each with several processor cores, and to distribute large amounts of work across the cluster efficiently [1]. The lower end of Hadoop scale is probably in the hundreds of gigabytes, as it was designed to handle web-scale datasets on the order of terabytes to petabytes. At this scale the dataset will not even fit on a single computer's hard drive, much less in memory. Hadoop's distributed file system breaks the data into chunks and distributes them across several computers. Computation then proceeds in parallel on all these chunks, obtaining the results as efficiently as possible.
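The chunk-and-distribute idea can be sketched in a few lines: split a file into fixed-size blocks and assign each block, with replicas, to nodes round-robin. HDFS's real placement policy is rack-aware and more involved; the 128 MiB block size and 3x replication below are merely its common defaults.

```python
# Toy HDFS-style block placement: fixed-size blocks, each replicated on
# several distinct nodes chosen round-robin over the cluster.

def place_blocks(file_size, nodes, block_size=128 * 2**20, replicas=3):
    num_blocks = max(1, -(-file_size // block_size))  # ceil division
    placement = {}
    for b in range(num_blocks):
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replicas)]
    return placement

nodes = ["node1", "node2", "node3", "node4"]
print(place_blocks(file_size=300 * 2**20, nodes=nodes))
# {0: ['node1', 'node2', 'node3'], 1: ['node2', 'node3', 'node4'],
#  2: ['node3', 'node4', 'node1']}
```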
Parallel Computing, 2000
We describe a transport protocol suitable for BSPlib programs running on a cluster of PCs connected by a 100 Mbps Ethernet switch. The protocol provides a reliable packet-delivery mechanism that uses global knowledge of a program's communication pattern to maximise switch performance. The performance is comparable to previous low-latency protocols on similar hardware, but the addition of reliability means that this protocol can be directly used by application software. For a modest budget of US$20,000 it is possible to build a machine that outperforms an IBM SP2 on all the NAS benchmarks (BT +80%, SP +70%, MG +9%, and LU +65% improvement), and an SGI Origin 2000 on half (BT +10%, SP −24%, MG +10%, and LU −28%). The protocol has a per-packet CPU overhead of a few microseconds for download and upload; small packets can be communicated through the switch in a pipelined fashion every few microseconds, and application-to-application one-way latency is a few microseconds plus the latency of the switch. A raw link bandwidth of 93 Mbps is achieved for 1400-byte packets, and 50 Mbps for 128-byte packets. This scales to eight processors communicating at 91 Mbps per link, to give a sustained global bandwidth of 728 Mbps.
Procedia Computer Science, 2010
The composition of Map and Reduce algorithmic skeletons was widely studied at the end of the last century and has proven effective on a wide class of problems. We recall the theoretical results motivating the introduction of these skeletons, then we discuss an experiment implementing three algorithmic skeletons: a map, a reduce, and an optimized composition of a map followed by a reduce skeleton (map+reduce). The map+reduce skeleton computes the same kind of problems computed by Google MapReduce, but the data flows through the skeleton as a stream rather than relying on already distributed (and possibly quite large) data items. We discuss the implementation of the three skeletons on top of ProActive/GCM in the MareMare prototype and we present some experimental results obtained on a COTS cluster.
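A minimal sketch of the three skeletons, assuming plain Python stand-ins for the ProActive/GCM implementation: a map, a reduce, and a fused map+reduce that consumes its input item by item instead of materializing the mapped collection, mirroring the streamed data flow described above.

```python
# map and reduce skeletons, plus an optimized map+reduce composition that
# folds each mapped item into the accumulator as it arrives (streaming),
# never building the intermediate mapped list.
from functools import reduce as fold

def map_skel(f, xs):
    return [f(x) for x in xs]

def reduce_skel(op, xs):
    return fold(op, xs)

def map_reduce_skel(f, op, stream):
    it = iter(stream)
    acc = f(next(it))
    for x in it:
        acc = op(acc, f(x))
    return acc

# sum of squares, as a two-skeleton composition and as the fused skeleton
xs = range(1, 5)
print(reduce_skel(lambda a, b: a + b, map_skel(lambda x: x * x, xs)))  # 30
print(map_reduce_skel(lambda x: x * x, lambda a, b: a + b, xs))        # 30
```

The fused form needs only O(1) intermediate storage per worker, which is exactly what makes streaming the data items through the skeleton attractive.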
Concurrency and Computation: Practice and Experience, 2015
Efficiently scheduling MapReduce tasks is considered one of the major challenges facing MapReduce frameworks. Many algorithms have been introduced to tackle this issue, most of which focus on data locality for task scheduling. Data locality may cause lower physical resource utilization in non-virtualized clusters and higher power consumption. Virtualized clusters provide a viable solution that supports both data locality and better cluster resource utilization. In this paper, we evaluate the major MapReduce scheduling algorithms, such as FIFO, Matchmaking, Delay, and multithreading locality (MTL), on virtualized infrastructure. Two major factors are used to test the evaluated algorithms: simulation time and energy consumption. The evaluated schedulers are compared, and the results show the superiority of the MTL scheduler over the other existing schedulers. We also present a comparison study between virtualized and non-virtualized clusters for MapReduce task scheduling.
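Of the evaluated schedulers, Delay scheduling has the simplest core idea and is sketched below: if no queued task has its data on the free node, skip this scheduling opportunity for a bounded number of attempts in the hope that a data-local slot frees up. The data structures and the max_delay bound are illustrative, not taken from a specific Hadoop version.

```python
# Simplified Delay scheduling: prefer a data-local task for the free
# node; otherwise skip up to max_delay times before settling for a
# non-local task at the head of the queue.
from collections import namedtuple

Task = namedtuple("Task", "name local_nodes")

def delay_schedule(queue, free_node, skip_count, max_delay=3):
    for i, task in enumerate(queue):
        if free_node in task.local_nodes:
            return queue.pop(i), 0        # found a data-local task
    if skip_count < max_delay:
        return None, skip_count + 1       # wait for a better node
    return queue.pop(0), 0                # give up on locality

queue = [Task("t1", {"n2"}), Task("t2", {"n1"})]
print(delay_schedule(queue, "n1", skip_count=0))
# -> (Task(name='t2', local_nodes={'n1'}), 0)
```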
2012
Traditional HPC architectures separate compute nodes and storage nodes, which are interconnected with high-speed links to satisfy data access requirements in multi-user environments. However, the capacity of those high-speed links is still much less than the aggregate bandwidth of all compute nodes. In Data Parallel Systems such as GFS/MapReduce, clusters are built with commodity hardware and each node takes the roles of both computation and storage, which makes it possible to bring compute to data.
2010 10th IEEE/ACM …, 2010
MapReduce is a programming paradigm for parallel processing that is increasingly being used for data-intensive applications in cloud computing environments. An understanding of the characteristics of workloads running in MapReduce environments benefits both service providers in the cloud and their users: the service provider can use this knowledge to make better scheduling decisions, while users can learn what aspects of their jobs impact performance. This paper analyzes 10 months of MapReduce logs from the M45 supercomputing cluster, which Yahoo! made freely available to select universities for systems research. We characterize resource utilization patterns, job patterns, and sources of failures. We use an instance-based learning technique that exploits temporal locality to predict job completion times from historical data and identify potential performance problems in our dataset.
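An instance-based prediction of the kind described can be sketched as k-nearest-neighbour regression over historical jobs: estimate a new job's completion time as the mean runtime of the k most similar past jobs. The feature set and k below are assumptions, not the paper's exact setup.

```python
# k-nearest-neighbour runtime prediction over a log of completed jobs.

def predict_runtime(history, job, k=3):
    """history: list of (features, runtime); features/job: float tuples."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    nearest = sorted(history, key=lambda h: dist(h[0], job))[:k]
    return sum(rt for _, rt in nearest) / len(nearest)

# illustrative features: (input GB, num map tasks, num reduce tasks)
history = [((10, 80, 8), 120.0), ((20, 160, 16), 240.0),
           ((12, 96, 8), 150.0), ((100, 800, 64), 1500.0)]
print(predict_runtime(history, (11, 88, 8)))  # (120+240+150)/3 = 170.0
```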
2012 7th Open Cirrus Summit, 2012
Many applications in real life can be modeled as graphs, and in many fields the data scale is very large, so large-scale graph processing has attracted increasing attention. This paper proposes a BSP-based system with a disk cache for large-scale graph processing. The system has the ability to extend its functions and strategies (such as adjusting parameters according to the volume of data and supporting multiple aggregation functions at the same time), to process large-scale data, to balance load, and to run clustering or classification algorithms on metric datasets. Experiments are conducted to evaluate the scalability of the implemented system, and a comparison between BC-BSP-based applications and MapReduce-based ones is made. The experimental results show that BSP-based applications achieve higher efficiency than MapReduce-based ones when the data fits in memory during processing; otherwise, the latter outperforms the former.
Future Generation Computer Systems, 2013
The scalability of High Performance Computing (HPC) applications depends heavily on efficient support of network communications in virtualized environments. However, Infrastructure as a Service (IaaS) providers are more focused on deploying systems with higher computational power interconnected via high-speed networks than on improving the scalability of the communication middleware. This paper analyzes the main performance bottlenecks in HPC application scalability on the Amazon EC2 Cluster Compute platform: (1) evaluating the communication performance on shared memory and a virtualized 10 Gigabit Ethernet network; (2) assessing the scalability of representative HPC codes, the NAS Parallel Benchmarks, using a large number of cores, up to 512; (3) analyzing the new cluster instances (CC2) in terms of single-instance performance, scalability, and cost-efficiency; (4) suggesting techniques for reducing the impact of virtualization overhead on the scalability of communication-intensive HPC codes, such as direct access of the Virtual Machine to the network and reducing the number of processes per instance; and (5) proposing the combination of message-passing with multithreading as the most scalable and cost-effective option for running HPC applications on the Amazon EC2 Cluster Compute platform.
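The hybrid option proposed in (5) pairs one message-passing process per instance with several threads inside it, so fewer processes cross the virtualized network. A minimal sketch using mpi4py and Python threads follows; Python threads illustrate only the structure here (real HPC codes would use OpenMP or similar for true shared-memory parallelism), and the workload is a toy.

```python
# Hybrid message-passing + multithreading sketch: each MPI process
# computes a partial result with local threads (shared memory), and only
# one value per process crosses the network in the final reduction.
# Run with e.g.:  mpiexec -n 2 python hybrid.py
from mpi4py import MPI
from concurrent.futures import ThreadPoolExecutor

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# per-process slice of the work, processed by a local thread pool
chunk = range(rank * 1000, (rank + 1) * 1000)
with ThreadPoolExecutor(max_workers=4) as pool:
    partial = sum(pool.map(lambda x: x * x, chunk))

# only per-process partial sums traverse the (virtualized) network
total = comm.reduce(partial, op=MPI.SUM, root=0)
if rank == 0:
    print("sum of squares:", total)
```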
2015
The Blossom V algorithm is used in graph theory to compute a perfect matching of minimum cost. We conducted performance studies on the algorithm using the maya cluster in the UMBC High Performance Computing Facility to better understand the performance capabilities and emphasize potential approaches for improvement. In the performance studies, we varied the number of nodes, graph density, and weight range for numerous graphs. For each graph, we recorded execution times and memory usage. For all graphs used in the performance studies, we found that the majority of time is spent in initialization. We also found that as graph density increases, both execution time and memory usage increase. While we anticipated these conclusions, we reached other conclusions that were more surprising. We determined that as the weight range of a graph increases, initialization time and total execution time increase. We also found that scaling down integer weight ranges to real-valued weight ranges has a...
2016
OpenHPC is a newly formed, community-based project that is providing an integrated collection of HPC-centric software components that can be used to implement a full-featured reference HPC compute resource. Components span the entire HPC software ecosystem including provisioning and system administration tools, resource management, I/O services, development tools, numerical libraries, and performance analysis tools. Common clustering tools and scientific libraries are distributed as pre-built and validated binaries and are meant to seamlessly layer on top of existing Linux distributions. The architecture of OpenHPC is intentionally modular to allow end users to pick and choose from the provided components, as well as to foster a community of open contribution. This paper presents an overview of the underlying community vision, governance structure, packaging conventions, build and release infrastructure and validation methodologies.
2018 IEEE International Conference on Cluster Computing (CLUSTER), 2018
The question "Can big data and HPC infrastructure converge?" has important implications for many operators and clients of modern computing. However, answering it is challenging. The hardware is currently different, and fast evolving: big data uses machines with modest numbers of fat cores per socket, large caches, and much memory, whereas HPC uses machines with larger numbers of (thinner) cores, non-trivial NUMA architectures, and fast interconnects. In this work, we investigate the convergence of big data and HPC infrastructure for one of the most challenging application domains, the highly irregular graph processing. We contrast through a systematic, experimental study of over 300,000 core-hours the performance of a modern multicore, Intel Knights Landing (KNL) and of traditional big data hardware, in processing representative graph workloads using state-of-the-art graph analytics platforms. The experimental results indicate KNL is convergence-ready, performance-wise, but only after extensive and expert-level tuning of software and hardware parameters.
2005
Many emerging large-scale data science applications require searching large graphs distributed across multiple memories and processors. This paper presents a distributed breadth-first search (BFS) scheme that scales for random graphs with up to three billion vertices and 30 billion edges. Scalability was tested on IBM BlueGene/L with 32,768 nodes at the Lawrence Livermore National Laboratory. Scalability was obtained through a series of optimizations, in particular those that ensure scalable use of memory. We use 2D (edge) partitioning of the graph instead of conventional 1D (vertex) partitioning to reduce communication overhead. For Poisson random graphs, we show that the expected size of the messages is scalable for both 2D and 1D partitionings. Finally, we have developed efficient collective communication functions for the 3D torus architecture of BlueGene/L that also take advantage of the structure in the problem. The performance and characteristics of the algorithm are measured and reported.
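The 2D edge partitioning can be illustrated with its block arithmetic: processors form an R x C grid, and edge (u, v) lives on the processor owning u's row block and v's column block, so BFS frontier exchanges involve only a processor's grid row and column rather than all R*C processors. The function below shows this standard scheme; the BlueGene/L implementation details differ.

```python
# Owner computation for 2D (edge) partitioning on an R x C processor grid:
# vertices are split into R row blocks and C column blocks, and edge
# (u, v) is stored by the processor at (row block of u, column block of v).

def owner_2d(u, v, n, R, C):
    """Return the (r, c) grid position owning edge (u, v) of an n-vertex graph."""
    row_block = -(-n // R)   # ceil(n / R) vertices per processor row
    col_block = -(-n // C)   # ceil(n / C) vertices per processor column
    return (u // row_block, v // col_block)

# 12 vertices on a 2 x 3 processor grid
print(owner_2d(u=7, v=10, n=12, R=2, C=3))  # (1, 2)
```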