2000, Proceedings 14th International Parallel and Distributed Processing Symposium. IPDPS 2000
This paper addresses the problem of designing scalable concurrent priority queues for large-scale multiprocessors, that is, machines with up to several hundred processors. Priority queues are fundamental in the design of modern multiprocessor algorithms, with many classical applications ranging from numerical algorithms through discrete event simulation and expert systems.
2017
Priority queues are data structures that store information in an orderly fashion. They are of tremendous importance because they are an integral part of many applications, such as Dijkstra's shortest path algorithm, MST algorithms, and priority schedulers. Since priority queues by nature have high contention on the delete-min operation, the design of an efficient priority queue should involve an intelligent choice of the data structure as well as relaxation bounds on the data structure. Lock-free data structures provide higher scalability and stronger progress guarantees than lock-based ones, which is another factor to consider in the priority queue design. We present a relaxed non-blocking priority queue based on skiplists that addresses all the design issues mentioned above. Use of skiplists allows multiple threads to concurrently access different parts of the skiplist quickly, whereas relaxing the priority queue delete-min operation distributes ...
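To make the idea of a relaxed delete-min concrete, here is a minimal sequential sketch in Python. It is not the skiplist-based, non-blocking structure the abstract describes. It only illustrates one common relaxation semantics, assumed here for illustration: `delete_min` may return any of the `k` smallest keys rather than the exact minimum. The class name `RelaxedPQ` and the parameter `k` are hypothetical.

```python
import bisect
import random


class RelaxedPQ:
    """Sequential illustration of a k-relaxed priority queue (hypothetical API)."""

    def __init__(self, k, seed=None):
        if k < 1:
            raise ValueError("k must be at least 1")
        self._items = []              # keys kept in ascending order
        self._k = k                   # relaxation bound (assumed semantics)
        self._rng = random.Random(seed)

    def insert(self, key):
        # Keep the list sorted so the k smallest keys are a prefix.
        bisect.insort(self._items, key)

    def delete_min(self):
        if not self._items:
            raise IndexError("delete_min from an empty queue")
        # Pick uniformly among the (at most) k smallest keys.
        window = min(self._k, len(self._items))
        return self._items.pop(self._rng.randrange(window))

    def __len__(self):
        return len(self._items)
```

With `k = 1` this degenerates to a strict priority queue. In a concurrent setting, a larger `k` would let threads remove different items instead of all contending for the single minimum.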
Lecture Notes in Computer Science, 1996
In this paper, we explore parallel implementations of the abstract data type priority queue. We use the BSP* model, an extension of Valiant's BSP model which rewards blockwise communication, i.e. sending a few large messages instead of many small ones. We present two randomized approaches for different relations between the size of the data structure and the number of parallel updates to be performed. Both yield work optimal algorithms that need asymptotically less communication than computation time and use large messages. All previous work optimal algorithms need asymptotically as much communication as computation or do not consider blockwise communication. We use a work optimal randomized selection algorithm as a building block. This might be of independent interest. It uses less communication than computation time, if the keys are distributed at random. A similar selection algorithm was independently developed by Gerbessiotis and Siniolakis for the standard BSP model. We improve upon previous work by both reducing the amount of communication and by using large messages.
CPH STL TR 2004, 2004
Abstract. We introduce a framework for reducing the number of element comparisons performed in priority-queue operations. In particular, we give a priority queue which guarantees the worst-case cost of O(1) per minimum finding and insertion, and the worst-...
Journal of the ACM, 1993
In this paper, analytical models for a multiprocessor executing a stream consisting of K classes of fork-join jobs are developed. Here, a fork-join job consists of a random number of tasks that can be executed independently of each other. Several priority policies are analyzed: (a) a strict nonpreemptive head-of-the-line policy, (b) a preemptive policy that allows preemptions at the job level, (c) a preemptive policy that allows preemptions at the task level, and (d) a policy in which the priority is a nondecreasing function of the number of tasks in the queue with preemptions at the job level. Using these models, the mean job response time for the different classes under the different policies is compared. These policies are compared to a system in which processors are partitioned so that classes are allocated only to certain processor groups. It is shown that, for the system considered, the task preemption policy has a uniformly better mean class response time and thus is preferable to a system with partitioned processors.
2015
Existing concurrent priority queues do not allow the priority of an element to be updated after its insertion. As a result, algorithms that need this functionality, such as Dijkstra's single-source shortest path algorithm, resort to cumbersome and inefficient workarounds. We report on a heap-based concurrent priority queue which allows the priority of an element to be changed after its insertion. We show that the enriched interface allows Dijkstra's algorithm to be expressed in a more natural way, and that its implementation, using our concurrent priority queue, outperforms existing algorithms.
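The kind of workaround the abstract refers to can be sketched as follows. This is a sequential Python illustration, not the paper's concurrent queue. Because the standard `heapq` module has no operation for changing a priority, a well-known substitute is to push a duplicate entry with the better distance and skip stale entries when they are popped. The function name `dijkstra` and the adjacency-list format are assumptions made for this example.

```python
import heapq


def dijkstra(graph, source):
    """Shortest distances from source; graph maps node -> list of (neighbor, weight)."""
    dist = {source: 0}
    heap = [(0, source)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            # Stale duplicate left behind by an earlier "priority update".
            continue
        done.add(u)
        for v, w in graph.get(u, ()):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                # No decrease-key available: push a second entry instead.
                heapq.heappush(heap, (nd, v))
    return dist
```

The cost of this workaround is that the heap can hold up to one entry per edge rather than one per vertex, which is the inefficiency a queue with a native priority-update operation avoids.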
Queueing Systems, 2005
We present the first near-exact analysis of an M/PH/k queue with m > 2 preemptive-resume priority classes. Our analysis introduces a new technique, which we refer to as Recursive Dimensionality Reduction (RDR). The key idea in RDR is that the m-dimensionally infinite Markov chain, representing the m class state space, is recursively reduced to a 1-dimensionally infinite Markov chain, that is easily and quickly solved. RDR involves no truncation and results in only small inaccuracy when compared with simulation, for a wide range of loads and variability in the job size distribution.
Priority queues are used in many applications including real-time systems, operating systems, and simulations. Their implementation may have a profound effect on the performance of such applications. In this article, we study the performance of well-known sequential priority queue implementations and the recently proposed parallel access priority queues. To accurately assess the performance of a priority queue, the performance measurement methodology must be appropriate. We use the Classic Hold, the Markov Model, and an Up/Down access pattern to measure performance and look at both the average access time and the worst-case time that are of vital interest to real-time applications. Our results suggest that the best choice for priority queue algorithms depends heavily on the application. For queue sizes smaller than 1,000 elements, the Splay Tree, the Skew Heap, and Henriksen's algorithm show good average access times. For large queue sizes of 5,000 elements or more, the Calendar Queue and the Lazy Queue offer good average access times but have very long worst-case access times. The Skew Heap and the Splay Tree exhibit the best worst-case access times. Among the parallel access priority queues tested, the Parallel Access Skew Heap provides the best performance on small shared memory multiprocessors.
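As a rough illustration of the Classic Hold access pattern mentioned above, the following Python sketch times repeated hold operations on a binary heap. A hold operation removes the minimum and reinserts it with its priority increased by a random increment, so the queue size stays constant. The function name `classic_hold`, the exponential increment distribution, and the use of `heapq` are assumptions for this example; the article's own measurement setup is not reproduced here.

```python
import heapq
import random
import time


def classic_hold(size, operations, seed=0):
    """Run `operations` hold steps on a heap of `size` items.

    Returns the final heap and the mean time per hold operation in seconds.
    """
    rng = random.Random(seed)
    queue = [rng.random() for _ in range(size)]
    heapq.heapify(queue)

    start = time.perf_counter()
    for _ in range(operations):
        # One hold: delete the minimum, then reinsert it with a later priority.
        t = heapq.heappop(queue)
        heapq.heappush(queue, t + rng.expovariate(1.0))
    elapsed = time.perf_counter() - start

    return queue, elapsed / operations if operations else 0.0
```

Calling `classic_hold(1000, 100_000)` and `classic_hold(5000, 100_000)` gives a crude average access time at the two queue sizes the article discusses. Measuring worst-case times would require timing each operation individually.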
2001
ABSTRACT This paper presents a new implementation technique for priority search queues. This abstract data type is an amazing blend of finite maps and priority queues. Our implementation supports logarithmic access to a binding with a given key and constant access to a binding with the minimum value. Priority search queues can be used, for instance, to give a simple, purely functional implementation of Dijkstra's single-source shortest-paths algorithm.
2014 Proceedings of the Sixteenth Workshop on Algorithm Engineering and Experiments (ALENEX), 2013
The theory community has proposed several new heap variants in the recent past which have remained largely untested experimentally. We take the field back to the drawing board, with straightforward implementations of both classic and novel structures using only standard, well-known optimizations. We study the behavior of each structure on a variety of inputs, including artificial workloads, workloads generated by running algorithms on real map data, and workloads from a discrete event simulator used in recent systems networking research. We provide observations about which characteristics are most correlated to performance. For example, we find that the L1 cache miss rate appears to be strongly correlated with wallclock time. We also provide observations about how the input sequence affects the relative performance of the different heap variants. For example, we show (both theoretically and in practice) that certain random insertion-deletion sequences are degenerate and can lead to misleading results. Overall, our findings suggest that while the conventional wisdom holds in some cases, it is sorely mistaken in others.
Proceedings 11th International Parallel Processing Symposium, 1997
We present a parallel priority data structure that improves the running time of certain algorithms for problems that lack a fast and work-efficient parallel solution. As a main application, we give a parallel implementation of Dijkstra's algorithm which runs in O(n) time while performing O(m log n) work on a CREW PRAM. This is a logarithmic factor improvement for the running time compared with previous approaches. The main feature of our data structure is that the operations needed in each iteration of Dijkstra's algorithm can be supported in O(1) time.
The Journal of Supercomputing, 1992
We describe a new parallel data structure, namely parallel heap, for exclusive-read exclusive-write parallel random access machines. To our knowledge, it is the first such data structure to efficiently implement a truly parallel priority queue based on a heap structure. Employing p processors, the parallel heap allows deletions of O(p) highest priority items and insertions of O(p) new items, each in O(log n) time, where n is the size of the parallel heap. Furthermore, it can efficiently utilize processors in the range 1 through n.
Journal of Parallel and Distributed Computing, 1998
We present a parallel priority queue that supports the following operations in constant time: parallel insertion of a sequence of elements ordered according to key, parallel decrease-key for a sequence of elements ordered according to key, deletion of the minimum key element, and deletion of an arbitrary element. Our data structure is the first to support multi-insertion and multi-decrease-key in constant time. The priority queue can be implemented on the EREW PRAM and can perform any sequence of n operations in O(n) time and O(m log n) work, m being the total number of keys inserted and/or updated. A main application is a parallel implementation of Dijkstra's algorithm for the single-source shortest path problem, which runs in O(n) time and O(m log n) work on a CREW PRAM on graphs with n vertices and m edges. This is a logarithmic factor improvement in the running time compared with previous approaches.
Proceedings of the 6th ACM/SPEC International Conference on Performance Engineering - ICPE '15, 2015
As core counts increase and as heterogeneity becomes more common in parallel computing, we face the prospect of programming hundreds or even thousands of concurrent threads in a single shared-memory system. At these scales, even highly efficient concurrent algorithms and data structures can become bottlenecks, unless they are designed from the ground up with throughput as their primary goal. In this paper, we present three contributions: (1) a characterization of queue designs in terms of modern multi- and many-core architectures, (2) the design of a high-throughput concurrent FIFO queue for many-core architectures that avoids the bottlenecks common in modern queue designs, and (3) a thorough evaluation of concurrent queue throughput across CPU, GPU, and co-processor devices. Our evaluation shows that focusing on throughput, rather than progress guarantees, allows our queue to scale to as much as three orders of magnitude (1000×) faster than lock-free and combining queues on GPU platforms and two times (2×) faster on CPU devices. These results deliver critical insight into the design of data structures for highly concurrent systems: (1) progress guarantees do not guarantee scalability, and (2) allowing an algorithm to block can actually increase throughput.
Computers, IEEE Transactions on, 1988
"Alternative majority voting methods for real-time computing systems," IEEE Trans. Reliability, to be published. T. K. Srikanth and S. Toueg, "Optimal clock synchronization," J. -, "Simulating authenticated broadcasts to derive simple fault-tolerant algorithms," Tech. Rep. 84-623, Dep.
ACM Transactions on Algorithms, 2008
We introduce a framework for reducing the number of element comparisons performed in priority-queue operations. In particular, we give a priority queue which guarantees the worst-case cost of O(1) per minimum finding and insertion, and the worst-case cost of O(log n) with at most log n + O(1) element comparisons per deletion, improving the bound of 2 log n + O(1) known for binomial queues. Here, n denotes the number of elements stored in the data structure prior to the operation in question, and log n equals log_2(max{2, n}). As an immediate application of the priority queue developed, we obtain a sorting algorithm that is optimally adaptive with respect to the inversion measure of disorder, and that sorts a sequence having n elements and I inversions with at most n log(I/n) + O(n) element comparisons.
IEEE Transactions on …, 1992
1996
Drawing ideas from previous authors, we present a new non-blocking concurrent queue algorithm and a new two-lock queue algorithm in which one enqueue and one dequeue can proceed concurrently. Both algorithms are simple, fast, and practical; we were surprised not to find them in the literature. Experiments on a 12-node SGI Challenge multiprocessor indicate that the new non-blocking queue consistently outperforms the best known alternatives; it is the clear algorithm of choice for machines that provide a universal atomic primitive (e.g. compare-and-swap or load-linked/store-conditional). The two-lock concurrent queue outperforms a single lock when several processes are competing simultaneously for access; it appears to be the algorithm of choice for busy queues on machines with non-universal atomic primitives (e.g. test-and-set). Since much of the motivation for non-blocking algorithms is rooted in their immunity to large, unpredictable delays in process execution, we report experimental results both for systems with dedicated processors and for systems with several processes multiprogrammed on each processor.
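The two-lock idea can be sketched in Python as below. This is an illustrative reading of the abstract rather than the authors' code: a linked list with a dummy head node, one lock guarding the head and one guarding the tail, so that an enqueue and a dequeue never contend for the same lock. The class and method names are assumptions, and the sketch relies on CPython's atomic reference assignment for the unsynchronized read of `next`.

```python
import threading


class _Node:
    __slots__ = ("value", "next")

    def __init__(self, value):
        self.value = value
        self.next = None


class TwoLockQueue:
    """FIFO queue in which one enqueue and one dequeue can proceed concurrently."""

    def __init__(self):
        dummy = _Node(None)           # head always points at a dummy node
        self._head = dummy
        self._tail = dummy
        self._head_lock = threading.Lock()
        self._tail_lock = threading.Lock()

    def enqueue(self, value):
        node = _Node(value)
        with self._tail_lock:         # enqueuers only touch the tail
            self._tail.next = node
            self._tail = node

    def dequeue(self):
        with self._head_lock:         # dequeuers only touch the head
            first = self._head.next
            if first is None:
                raise IndexError("dequeue from an empty queue")
            value = first.value
            first.value = None        # `first` becomes the new dummy node
            self._head = first
            return value
```

The dummy node is what keeps the two ends independent: even when the queue holds a single item, `enqueue` modifies only the tail pointer and the last node's `next`, while `dequeue` modifies only the head pointer.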
Mathematical Systems Theory, 1976
We present a data structure, based upon a hierarchically decomposed tree, which enables us to manipulate on-line a priority queue whose priorities are selected from the interval 1, ..., n with a worst-case processing time of O(log log n) per instruction. The structure can be used to obtain a mergeable heap whose time requirements are about as good. Full details are explained based upon an implementation of the structure in a PASCAL program contained in the paper. Work supported by grant CR 62-50, Netherlands Organization for the Advancement of Pure Research (Z.W.O.). Authors: P. van Emde Boas, R. Kaas, and E. Zijlstra.
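The hierarchical decomposition can be illustrated with the textbook form of a van Emde Boas tree, sketched here in Python. This follows the now-standard presentation (a summary structure plus clusters of roughly square-root size, with the minimum kept outside the clusters), not the PASCAL program in the paper. It assumes a universe {0, ..., u-1} with u a power of two, and supports only `insert`, the `min` attribute, and `successor`. All names are illustrative.

```python
class VEB:
    """Textbook van Emde Boas tree over {0, ..., u-1}; u must be a power of two."""

    def __init__(self, u):
        self.u = u
        self.min = None               # the minimum is NOT stored in any cluster
        self.max = None
        if u > 2:
            self.bits = u.bit_length() - 1
            self.low_bits = self.bits // 2
            self.summary = None       # which clusters are non-empty (created lazily)
            self.clusters = {}

    def _high(self, x):
        return x >> self.low_bits

    def _low(self, x):
        return x & ((1 << self.low_bits) - 1)

    def _index(self, high, low):
        return (high << self.low_bits) | low

    def insert(self, x):
        if self.min is None:
            self.min = self.max = x
            return
        if x == self.min or x == self.max:
            return                    # already present
        if x < self.min:
            x, self.min = self.min, x  # new minimum; push the old one down
        if self.u > 2:
            h, l = self._high(x), self._low(x)
            if h not in self.clusters:
                self.clusters[h] = VEB(1 << self.low_bits)
                if self.summary is None:
                    self.summary = VEB(1 << (self.bits - self.low_bits))
                self.summary.insert(h)
            self.clusters[h].insert(l)
        if x > self.max:
            self.max = x

    def successor(self, x):
        """Smallest stored key strictly greater than x, or None."""
        if self.min is not None and x < self.min:
            return self.min
        if self.u <= 2:
            return 1 if (x == 0 and self.max == 1) else None
        h, l = self._high(x), self._low(x)
        cluster = self.clusters.get(h)
        if cluster is not None and l < cluster.max:
            return self._index(h, cluster.successor(l))
        nxt = self.summary.successor(h) if self.summary is not None else None
        if nxt is None:
            return None
        return self._index(nxt, self.clusters[nxt].min)
```

Each operation recurses into a structure whose universe has about half as many bits, which is where the O(log log n) bound comes from. A priority queue built on this sketch would read `min` for find-min and would need a matching delete operation, which is omitted here.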
Journal of Parallel and Distributed Computing, 1990
A simultaneous access priority queue design which handles p accesses in every O(log p) time is presented. A processor is free to perform another task after issuing an insert, whereas it has to wait O(log p) time to receive the response of a delete-min. Compared with all sequential access designs which require O(p) time to process p accesses, our design achieves a significant performance improvement. For a fixed number of priorities, we propose a design which can pipeline accesses in constant time. For both designs, the strict highest-priority-out-first property is maintained.
ACM SIGPLAN Notices, 2014
Many task-parallel applications can benefit from attempting to execute tasks in a specific order, as for instance indicated by priorities associated with the tasks. We present three lock-free data structures for priority scheduling with different trade-offs on scalability and ordering guarantees. First, we propose a basic extension to work-stealing that provides good scalability, but cannot provide any guarantees for task ordering between threads. Next, we present a centralized priority data structure based on k-fifo queues, which provides strong (but still relaxed with regard to a sequential specification) guarantees. The parameter k allows the trade-off between scalability and the required ordering guarantee to be configured dynamically. Third, and finally, we combine both data structures into a hybrid, k-priority data structure, which provides scalability similar to the work-stealing based approach for larger k, while giving strong ordering guarantees for smaller k. We argue for using the hybrid data structure as the best compromise for generic, priority-based task scheduling. We analyze the behavior and trade-offs of our data structures in the context of a simple parallelization of Dijkstra's single-source shortest path algorithm. Our theoretical analysis and simulations show that both the centralized and the hybrid k-priority based data structures can give strong guarantees on the useful work performed by the parallel Dijkstra algorithm. We support our results with experimental evidence on an 80-core Intel Xeon system.