1994, Journal of Parallel and Distributed Computing
We discuss issues pertinent to performance analysis of massively parallel systems. We first argue that single-parameter characterization of parallel software or of parallel hardware rarely provides insight into the complex interactions among the software and the hardware components of a parallel system. In particular, bounds for the speedup based upon simple models of parallelism are violated when a model ignores the effects of communication.
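The point about communication invalidating simple speedup bounds can be made concrete with a small numerical sketch (my own illustration, not the paper's model): the serial fraction `f` and the per-processor communication cost `c` below are hypothetical parameters.

```python
# Sketch: how a communication term changes speedup relative to the
# communication-free Amdahl bound. All parameter values are illustrative.

def amdahl_speedup(f, p):
    """Classic Amdahl bound: serial fraction f, p processors, no communication."""
    return 1.0 / (f + (1.0 - f) / p)

def speedup_with_comm(f, p, c):
    """Same model plus a hypothetical per-processor communication cost c."""
    t1 = 1.0                        # normalized single-processor time
    tp = f + (1.0 - f) / p + c * p  # parallel time including communication
    return t1 / tp

if __name__ == "__main__":
    f, c = 0.05, 0.002
    for p in (1, 8, 64):
        print(p, round(amdahl_speedup(f, p), 2), round(speedup_with_comm(f, p, c), 2))
```

With these (made-up) parameters, the measured speedup not only falls short of the Amdahl bound but eventually decreases as processors are added, which is the behavior a communication-free model cannot capture.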
Statistics: A Series of Textbooks and Monographs, 2005
1990
In this paper we analyze a model of a parallel processing system. In our model there is a single queue which is served by K ≥ 1 identical processors. Jobs are assumed to consist of a sequence of barrier synchronizations where, at each step, the number of tasks that must be synchronized is random with a known distribution. An exact analysis of the model is derived. The model leads to a rich set of results characterizing the performance of parallel processing systems. We show that the number of jobs concurrently in execution, as well as the number of synchronization variables, grows linearly with the load of the system and strongly depends on the average number of parallel tasks found in the workload. Properties of the expected response time of such systems are extensively analyzed and, in particular, we report on some non-obvious response time behavior that arises as a function of the variance of parallelism found in the workload. Based on exact response time analysis, we propose a simple calculation that can be used as a rule of thumb to predict speedups. This can be viewed as a generalization of Amdahl's law that includes queueing effects. This generalization is reformulated when precise workloads cannot be characterized, but rather when only the fraction of sequential work and the average number of parallel tasks are assumed to be known.
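For orientation, here is the deterministic back-of-the-envelope version of such a rule of thumb (my own illustrative variant, not the paper's queueing-based result): Amdahl's law with the parallel part capped by the workload's average number of parallel tasks.

```python
# Illustrative sketch only: Amdahl's law, with the parallel portion unable
# to use more processors than the workload's average parallelism. The
# paper derives a queueing-theoretic generalization; this is just the
# deterministic simplification for comparison.

def capped_speedup(f, p, avg_parallelism):
    """Speedup with sequential fraction f when the parallel part runs on
    at most min(p, avg_parallelism) processors (hypothetical model)."""
    effective = min(p, avg_parallelism)
    return 1.0 / (f + (1.0 - f) / effective)
```

Note that in this simplified model, adding processors beyond the workload's average parallelism yields no further speedup.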
Journal of Parallel and Distributed Computing, 2013
In this paper we introduce our estimation method for parallel execution times, based on identifying separate "parts" of the work done by parallel programs. Our run time analysis works without any source code inspection. The time of parallel program execution is expressed in terms of the sequential work and the parallel penalty. We measure these values for different problem sizes and numbers of processors and estimate them for unknown values in both dimensions using statistical methods. This allows us to predict parallel execution time for unknown inputs and non-available processor numbers with high precision. Our prediction methods require orders of magnitude less data points than existing approaches. We verified our approach on parallel machines ranging from a multicore computer to a peta-scale supercomputer.
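A minimal sketch of this style of estimation (the model form below is an assumed simplification, not the paper's exact method): express parallel time as sequential work divided by processor count plus a penalty term, fit the two constants from a few measurements, then extrapolate to unmeasured processor counts.

```python
# Sketch: fit T(p) ~ W/p + c from a handful of (processors, time)
# measurements by simple linear regression in x = 1/p, then predict.
# W stands in for the sequential work, c for the parallel penalty;
# the paper's actual model and statistics are richer than this.

def fit_time_model(ps, times):
    """Least-squares fit of T(p) = W/p + c. Returns (W, c)."""
    xs = [1.0 / p for p in ps]
    n = len(xs)
    mx = sum(xs) / n
    mt = sum(times) / n
    W = (sum((x - mx) * (t - mt) for x, t in zip(xs, times))
         / sum((x - mx) ** 2 for x in xs))
    c = mt - W * mx
    return W, c

def predict_time(W, c, p):
    """Predicted execution time on p processors."""
    return W / p + c
```

Given measurements at p = 1, 2, 4, 8 that follow the model, the fit recovers W and c and can be evaluated at an unmeasured processor count such as p = 16.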
1989
This technical report has been reviewed and is approved for publication.
Proceedings of the fourth international workshop on High-level parallel programming and applications - HLPP '10, 2010
In this paper we estimate parallel execution times, based on identifying separate "parts" of the work done by parallel programs. We assume that programs are described using algorithmic skeletons. Therefore our runtime analysis works without any source code inspection. The time of parallel program execution is expressed in terms of the sequential work and the parallel penalty. We measure these values for different problem sizes and numbers of processors and estimate them for unknown values in both dimensions. This allows us to predict parallel execution time for unknown inputs and non-available processor numbers.
IEEE Transactions on Computers, 1988
Abstract—A software tool for measuring parallelism in large scientific/engineering applications is described in this paper. The proposed tool measures the total parallelism present in programs, filtering out the effects of communication/synchronization delays, finite storage, limited number of processors, the policies for management of processors and storage, etc. Although an ideal machine which can exploit the total parallelism is not realizable, such measures would aid the calibration and design of various architectures/compilers. The proposed software tool accepts ordinary Fortran programs as its input. Therefore, parallelism can be measured easily on many fairly big programs. Some measurements for parallelism obtained with the help of this tool are also reported. It is observed that the average parallelism in the chosen programs is in the range of 500-3500 Fortran statements executing concurrently in each clock cycle in an idealized environment.
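The "total parallelism" of an idealized execution is commonly summarized as total work divided by critical-path length over a task dependence graph. The sketch below is my own illustration of that standard measure, not the paper's Fortran tool.

```python
# Sketch: average parallelism of an idealized execution, computed as
# total work / critical-path length over a task DAG. Tasks and costs
# here are hypothetical; a real tool would build the DAG from a trace.

def average_parallelism(costs, deps):
    """costs: {task: cost}; deps: {task: [predecessor tasks]}.
    Returns total work divided by critical-path length."""
    finish = {}

    def cp(t):
        # earliest finish time of task t on an ideal machine
        if t not in finish:
            finish[t] = costs[t] + max((cp(d) for d in deps.get(t, [])),
                                       default=0.0)
        return finish[t]

    total = sum(costs.values())
    critical = max(cp(t) for t in costs)
    return total / critical
```

For a diamond-shaped graph of four unit-cost tasks (one task fans out to two, which join into a fourth), total work is 4 and the critical path is 3, giving an average parallelism of 4/3.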
Journal of Parallel and Distributed Computing, 1993
There are several metrics that characterize the performance of a parallel system, such as parallel execution time, speedup and efficiency. A number of properties of these metrics have been studied. For example, it is a well known fact that given a parallel architecture and a problem of a fixed size, the speedup of a parallel algorithm does not continue to increase with increasing number of processors. It usually tends to saturate or peak at a certain limit. Thus it may not be useful to employ more than an optimal number of processors for solving a problem on a parallel computer. This optimal number of processors depends on the problem size, the parallel algorithm and the parallel architecture. In this paper we study the impact of parallel processing overheads and the degree of concurrency of a parallel algorithm on the optimal number of processors to be used when the criterion for optimality is minimizing the parallel execution time. We then study a more general criterion of optimality and show how operating at the optimal point is equivalent to operating at a unique value of efficiency which is characteristic of the criterion of optimality and the properties of the parallel system under study. We put the technical results derived in this paper in perspective with similar results that have appeared in the literature before and show how this paper generalizes and/or extends these earlier results.
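The existence of an optimal processor count is easy to see under a simple (assumed, not the paper's general) overhead model: if T(p) = W/p + c·p for work W and a per-processor overhead c, execution time is minimized near p* = sqrt(W/c), and adding processors beyond that point slows the program down.

```python
# Sketch: optimal processor count under the illustrative overhead model
# T(p) = W/p + c*p. The paper treats far more general overhead functions.

import math

def exec_time(W, c, p):
    """Parallel execution time under the assumed model."""
    return W / p + c * p

def optimal_processors(W, c):
    """Integer p minimizing W/p + c*p: check the neighbors of sqrt(W/c)."""
    p0 = max(1, int(math.sqrt(W / c)))
    return min((p0, p0 + 1), key=lambda p: exec_time(W, c, p))
```

With W = 100 and c = 1, the minimum lies at p = 10; running on 20 processors takes longer than running on 10.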
Parallel Computing, 2004
This work presents a new approach to the relation between theoretical complexity models and performance analysis and tuning. The analysis of an algorithm produces a complexity function that gives an approach to the asymptotic number of operations performed by the algorithm. The time spent on these operations depends on the software-hardware platform being used. Usually such platforms are described, from the performance point of view, through a number of parameters. Those parameters are evaluated by a benchmarking program. Though for a given available platform, the algorithmic constants associated with the complexity formula can be computed using multidimensional linear regression, there is still the problem of predicting the performance when the platform is not available. We introduce the concept of Universal Instruction Class and derive from it a set of equations relating the values of the algorithmic constants with the platform parameters. Due to the hierarchical design of current memory systems, the performance behavior of most algorithms varies in a small number of large regions corresponding to small size, medium size and large size inputs. The constants involved in the complexity formula usually have different values for these regions. Assuming we have a complexity formula for the memory resources, it is possible to find a partition of the input size space and the different values of the algorithmic constants. This way, though the complexity formula is the same, the family of constants provides the adaptability of the formula to the different stationary uses of the memory.
Parallel Computing, 1996
In this paper we describe how a performance tuning tool-set, AIMS, guides the user towards developing efficient and scalable production-level parallel programs by locating performance improvement opportunities and determining optimization benefits. AIMS's Xisk helps identify potential optimizations by computing various pre-defined normalized performance indices from program traces. Inspection of these indices points to specific optimizations that may benefit program performance. After identifying and characterizing performance problems, AIMS's MK can provide quantitative estimates of performance benefits to help the user avoid arduous optimizations that may not lead to the expected performance improvements. MK also helps identify potential pitfalls or benefits of changing any of various system parameters. Based on MK's performance projections, an informed decision can be made regarding the most beneficial program optimizations or upgrades in execution environments.
IEEE Transactions on Computers, 1989
2010
The purpose of this study is to determine analytically what the speedup from parallel execution of a task depends on, and how. It stands to reason that as the level of parallelism increases, the costs of synchronization also increase, and upon reaching a certain degree of granularity the speedup of multi-program execution begins to decrease.
2014
The analysis and understanding of the behavior of parallel applications on varied computing platforms is a recurring problem in the scientific computing community. When the execution environments are not available, simulation becomes a reasonable approach for obtaining objective performance indicators and for exploring several "what-if?" scenarios. In this thesis, we present a framework for the off-line simulation of applications written with MPI. The main originality of our work with respect to previous efforts lies in the definition of time-independent traces. They allow maximal scalability, since heterogeneous and distributed resources can be used to acquire a trace. We propose a format in which, for each event occurring during the execution of an application, we record information on the volume of instructions for a computation phase, or the number of bytes and the type of a communication. To obtain time-independent traces during the execution of MPI applications, we must instrument them to collect the required data. Several instrumentation tools exist that can instrument an application. We propose a scoring system matching the needs of our framework and evaluate the instrumentation tools against it. Moreover, we introduce an original tool called Minimal Instrumentation, designed to meet the needs of our framework. We study several instrumentation methods and several acquisition strategies. We detail the tools that extract time-independent traces from the instrumentation traces of some well-known profiling tools. Finally, we evaluate the complete acquisition procedure and present the acquisition of large-scale instances.
We describe in detail the procedure for supplying a realistic simulated platform file to our trace-replay tool, which takes into account the topology of the target platform, as well as the calibration procedure with respect to the application to be simulated. Moreover, we show that our simulator can predict the performance of certain MPI benchmarks with less than 11% relative error between real execution and simulation in cases where there are no performance issues. Finally, we identify the causes of performance issues and propose solutions to remedy them.
Lecture Notes in Computer Science, 2011
In this paper two classifiers have been derived in order to determine if identical computer tasks have been executed at different processors. The classifiers have been developed analytically following a classical hypothesis testing approach. The main assumption of this work is that the probability distribution function (pdf) of the random times taken by the processors to serve tasks are known. This assumption has been fulfilled by empirically characterizing the pdf of such random times. The performance of the classifiers developed here has been assessed using traces from real processors. Further, the performance of the classifiers is compared to heuristic classifiers, linear discriminants, and non-linear discriminants among other classifiers.
2011
Abstract. This paper introduces the ADVANCE approach to engineering concurrent systems using a new component-based approach. A cost-directed tool-chain maps concurrent programs onto emerging hardware architectures, where costs are expressed in terms of programmer annotations for the throughput, latency and jitter of components.
Microprocessing and microprogramming, 1994
Many parallel algorithms can be modelled as directed acyclic task graphs. Recently, Degree of Simultaneousness (DS) and Degree of Connection (DC) have been defined as the two measures of parallelism in algorithms represented by task graphs. ...
[1990] Proceedings. Second IEEE Workshop on Future Trends of Distributed Computing Systems
Generic queueing models of parallel systems with K ≥ 2 exponential servers, where jobs may be split into K independent tasks, are considered. The queueing of jobs is distributed if each server has its own queue, and centralized if there is a common queue. The scheduling of jobs is no-splitting if all tasks of a job must run on one processor, and splitting if they can run concurrently on different processors. Exact and approximate expressions for the mean response time, T(r:K), of the r-th, r = 1, 2, ..., K, departing task in a job are obtained and compared for four models: Distributed/Splitting (D/S), Distributed/No Splitting (D/NS), Centralized/Splitting (C/S) and Centralized/No Splitting (C/NS). It is shown that T(r:K) for D/S systems is lower than that of C/S systems for small values of r and medium to high utilizations. The effect of splitting jobs into tasks is studied and it is shown that C/NS systems yield lower values of T(r:K) than D/S systems only when r and the utilization are quite high. Also, the relative range, defined as (T(K:K) − T(1:K)) / T(K:K), is shown to be bounded for the systems considered except for D/S systems, which implies the possibility of overflow of the waiting space used for task synchronization. These results are useful in evaluating the performance of parallel algorithms with r-out-of-K operations, in assessing the lock holding times in replicated database systems, and in comparing alternative architectures of parallel processing systems.
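A standard building block behind such fork-join analyses (a textbook result, not specific to this paper's four models): when a job is split into K independent exponential(μ) tasks run in parallel, the join completes at the maximum of K exponentials, whose mean is H_K/μ with H_K the K-th harmonic number, so the synchronization delay grows logarithmically in K.

```python
# Sketch: mean join (synchronization) time for K independent
# exponential(mu) parallel tasks: E[max of K iid Exp(mu)] = H_K / mu,
# where H_K = 1 + 1/2 + ... + 1/K.

def mean_join_time(K, mu):
    """Expected time until the last of K iid Exp(mu) tasks finishes."""
    return sum(1.0 / i for i in range(1, K + 1)) / mu
```

For K = 1 this reduces to the ordinary mean 1/μ, and the growth with K is only logarithmic, which is why moderate task fan-out is often affordable.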