Unit 6 Notes
Our emphasis here will be parallel algorithms, that is, multithreading a single
algorithm so that some of its instructions may be executed simultaneously. Parallelism
can also be applied to scheduling and managing multiple algorithms, each running
concurrently in their own thread and possibly sharing resources, as studied in courses
on operating systems and concurrent and high performance computing.
Concurrency Constructs:
parallel: add to loop construct such as for to indicate each iteration can be
executed in parallel.
spawn: create a parallel subprocess, then keep executing the current process
(parallel procedure call).
sync: wait here until all active parallel threads created by this instance of the
program finish.
These keywords specify opportunities for parallelism without affecting whether (or
not) the corresponding sequential program obtained by removing them is correct. In
other words, if we ignore the parallel keywords the program can be analyzed as a
single threaded program. We exploit this in analysis.
Logical Parallelism
The parallel and spawn keywords do not force parallelism: they just say that it is
permissible. This is logical parallelism. A scheduler will make the decision
concerning allocation to processors. We return to the question of scheduling at the end
of this document, after the appropriate concepts have been introduced.
For illustration, we take a really slow algorithm and make it parallel. (There are much
better ways to compute Fibonacci numbers; this is just for illustration.) Here is the
definition of Fibonacci numbers:
F(0) = 0.
F(1) = 1.
F(i) = F(i−1) + F(i−2), for i ≥ 2.
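The serial pseudocode referred to just below is not reproduced in these notes. As a
stand-in, here is a minimal Python sketch (the name fib is mine) that transcribes the
definition directly; it is deliberately slow, taking time exponential in n:

    def fib(n):
        # direct transcription of the recursive definition above
        if n <= 1:
            return n
        return fib(n - 1) + fib(n - 2)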
Noticing that the recursive calls operate independently of each other, let's see what
improvement we can get by computing the two recursive calls in parallel. This will
illustrate the concurrency keywords and also be an example for analysis:
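The P-Fib pseudocode itself is given as a figure that is not reproduced here. As a rough
sketch, the CLRS-style procedure can be imitated in Python as below, with spawn and
sync modeled by a thread; this shows only the logical structure (Python threads will not
actually speed up this CPU-bound computation):

    import threading

    def p_fib(n):
        if n <= 1:                 # base case (pseudocode lines 1-2)
            return n
        result = {}
        # spawn P-Fib(n-1): the child may run in parallel with the parent
        t = threading.Thread(target=lambda: result.update(x=p_fib(n - 1)))
        t.start()
        y = p_fib(n - 2)           # the parent keeps executing, computing P-Fib(n-2)
        t.join()                   # sync: wait for the spawned call to finish
        return result['x'] + y     # combine the results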
Notice that without the parallel keywords it is the same as the serial program above.
A multithreaded computation can be modeled as a directed acyclic graph G = (V, E): the
vertices are strands (maximal sequences of instructions containing no parallel control),
and the edges indicate dependencies between strands. (An example appears in the figure
discussed below.)
If G has a directed path from u to v they are logically in series; otherwise they
are logically parallel.
A strand with multiple successors means all but one of them must have been
spawned. A strand with multiple predecessors means they join at a sync
statement.
The model can be visualized as exemplified below for the computation DAG for P-
Fib(4):
Vertices (strands) are visualized as circles in the figure.
The rounded rectangles are not part of the formal model, but they help organize
the visualization by collecting together all strands for a given call.
The colors are specific to this example and indicate the corresponding code:
black indicates that the strand is for lines 1-3; grey for line 4; and white for
lines 5-6.
Continuation Edges (u, v) are drawn horizontally and indicate that v is the
successor to u in the sequential procedure.
Call Edges (u, v) point downwards, indicating that u called v as a normal
subprocedure call. In this example they come out of the grey circles.
Spawn Edges (u, v) also point downwards, indicating that u spawned v in
parallel. In this example they come out of the black circles.
Return edges point upwards to indicate the strand executed next after returning
from a normal procedure call, or point to the strand containing the sync when a
spawned procedure returns. In this example they return to the white circles.
Work
T1 = the total time to execute an algorithm on one processor. This is called work in
analogy to work in physics: the total amount of computational work that gets done.
An ideal parallel computer with P processors can do at most P units of work in one
time step. So, in TP time it can do at most P⋅TP work. Since the total work is T1,
P⋅TP ≥ T1, or dividing by P we get the work law:
TP ≥ T1 / P
The work law can be read as saying that the time on P processors can be no better
than the time on one processor divided by P; equivalently, the speedup T1 / TP is at
most P.
Parallelism will not change the asymptotic class of an algorithm: it's not a substitute
for careful design of asymptotically fast algorithms.
Span
T∞ = the time to execute the algorithm on an unlimited number of processors. This
equals the time required for the longest chain of strands that must be executed in
sequence (the critical path in the computation DAG), and is called the span. The span
in our P-Fib example is represented by the shaded edges in the figure.
The span law states that a P-processor ideal parallel computer cannot run faster than
one with an infinite number of processors:
TP ≥ T∞
This is because at some point the span will limit the speedup possible: No matter how
many processors you have, you still must do these strands in sequence, taking the time
they require.
Speedup
The ratio T1 / TP defines how much speedup you get with P processors as compared
to one.
TP ≥ T1 / P, so T1 / TP ≤ P:
one cannot have any more speedup than the number of processors.
When the speedup T1 / TP = Θ(P) we have linear speedup: the speedup is linear in the
number of processors.
Parallelism
The ratio T1 / T∞ of the work to the span gives the (potential) parallelism of the
computation. It can be interpreted in three ways:
Ratio : The average amount of work that can be performed for each step of
parallel execution time.
Upper Bound : the maximum possible speedup that can be achieved on any
number of processors.
Limit: The limit on the possibility of attaining perfect linear speedup. Once the
number of processors exceeds the parallelism, the computation cannot possibly
achieve perfect linear speedup. The more processors we use beyond
parallelism, the less perfect the speedup.
Slackness
The (parallel) slackness is the ratio (T1 / T∞) / P: the factor by which the parallelism
of the computation exceeds the number of processors in the machine. We have three
cases:
If slackness is less than 1 then perfect linear speedup is not possible: you have
more processors than you can make use of.
If slackness is greater than 1, then the work per processor is the limiting
constraint and a scheduler can strive for linear speedup by distributing the work
across more processors.
If slackness is 1, (T1 / T∞) / P = 1 so T1 / T∞ = P: we get perfect linear speedup
with P processors.
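As a small worked example (the numbers are made up, not from the notes): suppose an
algorithm has work T1 = 1000 and span T∞ = 10, so its parallelism is 100. The snippet
below computes the slackness and the best speedup permitted by the work and span laws
for three machine sizes, illustrating the three cases above:

    T1, Tinf = 1000, 10                    # hypothetical work and span
    parallelism = T1 / Tinf                # = 100
    for P in (10, 100, 1000):
        slackness = parallelism / P        # 10.0, 1.0, 0.1
        best_TP = max(T1 / P, Tinf)        # lower bound on TP from the work and span laws
        print(P, slackness, T1 / best_TP)  # best possible speedup: 10, 100, 100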
Analyzing span requires a different approach. (I hope you did the exercises above:
they will make you appreciate the following all the more.)
Analyzing Span
If a set of subcomputations (or the vertices representing them) are in series, the span is
the sum of the spans of the subcomputations. This is like normal sequential analysis
(as was just exemplified above with the sum T(n − 1) + T(n − 2)).
If a set of subcomputations (or the vertices representing them) are in parallel, the span
is the maximum of the spans of the computations. This is where analysis of
multithreaded algorithms differs.
Returning to our example, the span of the parallel recursive calls of P-Fib(n) is
computed by taking the max rather than the sum:
T∞(n) = max(T∞(n−1), T∞(n−2)) + Θ(1) = T∞(n−1) + Θ(1).
This recurrence has solution Θ(n), so the span of P-Fib(n) is Θ(n).
We can now compute the parallelism of P-Fib(n) in general (not just the specific case
of n = 4 that we computed earlier) by dividing its work Θ(Fn) by its span Θ(n), giving
parallelism Θ(Fn / n), which grows dramatically as n grows.
(Of course in this example that is because we chose an inefficient way to compute
Fibonacci numbers, but this was only for illustration. These ideas apply to other well
designed algorithms.)
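The two rules above (sum for series, max for parallel) can be turned into a small
sketch. Assuming, for illustration, that every strand costs one unit of time, the
following Python functions compute the work and span of the P-Fib computation DAG
directly from the recurrences; running it shows the parallelism growing rapidly with n:

    from functools import lru_cache

    @lru_cache(maxsize=None)
    def work(n):
        if n <= 1:
            return 1                              # a single unit-cost strand
        return work(n - 1) + work(n - 2) + 1      # in series: sum, plus Theta(1)

    @lru_cache(maxsize=None)
    def span(n):
        if n <= 1:
            return 1
        return max(span(n - 1), span(n - 2)) + 1  # in parallel: max, plus Theta(1)

    for n in (4, 10, 20, 30):
        print(n, work(n), span(n), round(work(n) / span(n), 1))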
Parallel Loops
So far we have used spawn, but not the parallel keyword, which is used with loop
constructs such as for. Here is an example.
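The example pseudocode (CLRS's MAT-VEC, which computes y = Ax) is given as a figure that
is not reproduced here. Here is a Python sketch of the same structure; the "parallel for"
loops are only marked in comments, and the line numbers in the comments assume the
standard CLRS numbering that the analysis below refers to:

    def mat_vec(A, x):
        n = len(A)                               # line 1: n = A.rows
        y = [0] * n                              # lines 2-4: create and zero y (a parallel for)
        for i in range(n):                       # line 5: parallel for i = 1 to n
            for j in range(n):                   # line 6: (serial) for j = 1 to n
                y[i] = y[i] + A[i][j] * x[j]     # line 7
        return y                                 # line 8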
The parallel for keywords indicate that each iteration of the loop can be executed
concurrently. (Notice that the inner for loop is not parallel; a possible point of
improvement to be discussed.)
It is not realistic to think that all n subcomputations in these loops can be spawned
immediately with no extra work. (For some operations on some hardware up to a
constant n this may be possible; e.g., hardware designed for matrix operations; but we
are concerned with the general case.) How might this parallel spawning be done, and
how does this affect the analysis?
Parallel for spawning can be accomplished by a compiler with a divide and conquer
approach, itself implemented with parallelism. The procedure shown below is called
with Mat-Vec-Main-Loop(A, x, y, n, 1, n). Lines 2 and 3 are the lines originally within
the loop.
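The recursive spawning procedure is also shown only as a figure. Here is a sketch in
Python, using 0-based indices, so the top-level call is mat_vec_main_loop(A, x, y, n, 0,
n - 1) rather than (A, x, y, n, 1, n); spawn and sync appear only as comments, so this
version runs serially but has the same divide-and-conquer shape:

    def mat_vec_main_loop(A, x, y, n, i, ip):
        if i == ip:
            # the body originally inside the loop: one iteration, for row i
            for j in range(n):
                y[i] = y[i] + A[i][j] * x[j]
        else:
            mid = (i + ip) // 2
            # spawn: the two halves are logically parallel
            mat_vec_main_loop(A, x, y, n, i, mid)
            mat_vec_main_loop(A, x, y, n, mid + 1, ip)
            # sync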
The computation DAG is also shown. It appears that a lot of work is being done to
spawn the n leaf node computations, but the increase is not asymptotic.
The work of Mat-Vec is T1(n) = Θ(n2) due to the nested loops in lines 5-7.
Since the recursion tree is a full binary tree, the number of internal nodes is 1 fewer
than the n leaf nodes, so this extra work is Θ(n).
Each leaf node corresponds to one iteration of the loop, and the extra work of recursive
spawning can be amortized against the work of the iterations, so that it contributes
only a constant factor to the overall work.
The span is increased by Θ(lg n) due to the addition of the recursion tree for Mat-Vec-
Main-Loop, which is of height Θ(lg n). In some cases (such as this one), this increase is
washed out by other dominating factors (e.g., the span in this example is dominated
by the nested loops).
Nested Parallelism
Continuing with our example, the span is Θ(n) because even with full utilization of
parallelism the inner for loop still requires Θ(n). Since the work is Θ(n2) the
parallelism is Θ(n2)/Θ(n) = Θ(n). Can we improve on this?
Perhaps we could make the inner for loop parallel as well? Compare the original to
the revised version Mat-Vec':
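The revised pseudocode is not reproduced here; the only change from Mat-Vec is that the
inner loop is also marked parallel. A sketch of Mat-Vec' (the trouble this causes is the
subject of the next section):

    def mat_vec_prime(A, x):
        n = len(A)
        y = [0] * n
        for i in range(n):                       # parallel for, as before
            for j in range(n):                   # now also a parallel for in Mat-Vec'
                y[i] = y[i] + A[i][j] * x[j]     # logically parallel updates to y[i]
        return y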
Race Conditions
A determinacy race occurs when two logically parallel instructions access the same
memory location and at least one of them performs a write. Determinacy races are hard
to detect with empirical testing: many execution sequences would give correct results.
This kind of software bug is consequential: race condition bugs caused the Therac-25
radiation therapy machine to overdose patients, killing three, and caused the North
American Blackout of 2003.
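The "simple example" referred to next is given as a figure; it is along the lines of
CLRS's RACE-EXAMPLE, in which two logically parallel increments of a shared variable can
interleave. A Python sketch (thread scheduling makes the bad interleaving rare in
practice, but it is possible):

    import threading

    x = 0

    def increment():
        global x
        x = x + 1      # the read of x and the write of x are separate steps

    t1 = threading.Thread(target=increment)
    t2 = threading.Thread(target=increment)
    t1.start(); t2.start()
    t1.join(); t2.join()
    print(x)           # usually 2, but 1 is possible if both reads happen before either write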
After you understand that simple example, let's look at our (renamed) matrix-vector
example again:
Exercise: Do you see how yi might be updated differently depending on the order in
which parallel invocations of line 7 (each of which reads the current value of yi and
writes a new one) are executed?
The text next analyzes a multithreaded matrix multiplication built from triply nested
loops: the two outer loops are parallel for loops and the innermost loop is serial (a
sketch follows). Its work is T1(n) = Θ(n3). The span of this algorithm is T∞(n) = Θ(n),
due to the path for spawning the outer and inner parallel loop executions and then the
n executions of the innermost for loop. So the parallelism is T1(n) / T∞(n) = Θ(n3) /
Θ(n) = Θ(n2).
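Here is the sketch promised above of that triply nested multithreaded matrix
multiplication (CLRS calls it P-SQUARE-MATRIX-MULTIPLY). The two outer loops are
parallel for loops in the pseudocode, while the serial innermost loop is what gives the
Θ(n) span:

    def p_square_matrix_multiply(A, B):
        n = len(A)
        C = [[0] * n for _ in range(n)]
        for i in range(n):                       # parallel for in the pseudocode
            for j in range(n):                   # parallel for in the pseudocode
                for k in range(n):               # serial innermost loop
                    C[i][j] += A[i][k] * B[k][j]
        return C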
Here is a parallel version of the divide and conquer algorithm from Chapter 4 of
CLRS (not in these web notes):
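The pseudocode is again given as a figure. As a rough functional sketch of the same idea
in Python, assuming n is an exact power of 2 (the CLRS version works in place, spawning
the eight recursive products and then adding a temporary matrix with nested parallel for
loops):

    def p_matrix_multiply_recursive(A, B):
        n = len(A)
        if n == 1:
            return [[A[0][0] * B[0][0]]]
        h = n // 2
        def quarters(M):  # split M into four h-by-h submatrices
            return ([row[:h] for row in M[:h]], [row[h:] for row in M[:h]],
                    [row[:h] for row in M[h:]], [row[h:] for row in M[h:]])
        A11, A12, A21, A22 = quarters(A)
        B11, B12, B21, B22 = quarters(B)
        # the eight recursive products are spawned (logically parallel) in the pseudocode
        P1 = p_matrix_multiply_recursive(A11, B11)
        P2 = p_matrix_multiply_recursive(A12, B21)
        P3 = p_matrix_multiply_recursive(A11, B12)
        P4 = p_matrix_multiply_recursive(A12, B22)
        P5 = p_matrix_multiply_recursive(A21, B11)
        P6 = p_matrix_multiply_recursive(A22, B21)
        P7 = p_matrix_multiply_recursive(A21, B12)
        P8 = p_matrix_multiply_recursive(A22, B22)
        # sync, then add the pairs (a doubly nested parallel for in the pseudocode)
        def add(X, Y):
            return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]
        C11, C12, C21, C22 = add(P1, P2), add(P3, P4), add(P5, P6), add(P7, P8)
        return ([r1 + r2 for r1, r2 in zip(C11, C12)] +
                [r1 + r2 for r1, r2 in zip(C21, C22)])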
See the text for analysis, which concludes that while the work is still Θ(n3), the span is
reduced to Θ(lg2n). Thus, while the work is the same as the basic algorithm the
parallelism is Θ(n3) / Θ(lg2n), which makes good use of parallel resources.
Parallelizing Merge-Sort
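MERGE-SORT' itself is shown in the text but not in these notes: it is ordinary merge
sort with the first recursive call spawned, a sync, and then the usual serial MERGE. A
rough out-of-place Python sketch, with spawn and sync noted only in comments:

    def merge_sort_prime(A):
        if len(A) <= 1:
            return A
        mid = len(A) // 2
        # spawn: the two recursive sorts are logically parallel
        left = merge_sort_prime(A[:mid])
        right = merge_sort_prime(A[mid:])
        # sync, then an ordinary serial merge
        merged, i, j = [], 0, 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                merged.append(left[i]); i += 1
            else:
                merged.append(right[j]); j += 1
        return merged + left[i:] + right[j:]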
The recurrence for the work MS'1(n) of MERGE-SORT' is the same as the serial version:
The recurrence for the span MS'∞(n) of MERGE-SORT' is based on the fact that the
recursive calls run in parallel, so there is only one n/2 term (the two spans are the
same, so the max is either one):
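The recurrences themselves appear as figures in the text. For reference, following the
standard CLRS analysis they are:

    MS'1(n) = 2 MS'1(n/2) + Θ(n) = Θ(n lg n)    (work: same as serial merge sort)
    MS'∞(n) = MS'∞(n/2) + Θ(n) = Θ(n)           (span: one recursive term plus the serial MERGE)

so the parallelism is MS'1(n) / MS'∞(n) = Θ(lg n).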
This is low parallelism, meaning that even for large input we would not benefit from
having hundreds of processors. How about speeding up the serial MERGE?
Parallelizing Merge
MERGE takes two sorted lists and steps through them together to construct a single
sorted list. This seems intrinsically serial, but there is a clever way to make it parallel.
A divide-and-conquer strategy can rely on the fact that the lists are sorted to break the
lists into four lists, two of which will be merged to form the head of the final list and
the other two merged to form the tail.
1. Choose the longer list to be the first list, T[p1 .. r1] in the figure below.
2. Find the middle element (median) of the first list (x at q1).
3. Use binary search to find the position (q2) of this element if it were to be
inserted in the second list T[p2 .. r2].
4. Recursively merge
o The first list up to just before the median T[p1 .. q1-1] and the second list
up to the insertion point T[p2 .. q2-1].
o The first list from just after the median T[q1+1 .. r1] and the second list
after the insertion point T[q2 .. r2].
5. Assemble the results with the median element placed between them, as shown
below.
The text presents the BINARY-SEARCH pseudocode and analysis of Θ(lg n) worst case;
this should be review for you. It then assembles these ideas into a parallel merge
procedure that merges into a second array Z at location p3 (r3 is not provided as it can
be computed from the other parameters):
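The P-MERGE pseudocode itself is not reproduced here. As a simplified sketch in Python,
merging two already sorted lists out of place (the real P-MERGE merges T[p1 .. r1] and
T[p2 .. r2] into Z starting at p3 and spawns the first recursive call; here bisect_left
plays the role of BINARY-SEARCH):

    from bisect import bisect_left

    def p_merge(T1, T2):
        if len(T1) < len(T2):            # make T1 the longer (first) list
            T1, T2 = T2, T1
        if not T1:
            return []
        q1 = len(T1) // 2
        x = T1[q1]                       # median of the first list
        q2 = bisect_left(T2, x)          # where x would be inserted in the second list
        # spawn: the two sub-merges are logically parallel (run serially here)
        head = p_merge(T1[:q1], T2[:q2])
        tail = p_merge(T1[q1 + 1:], T2[q2:])
        return head + [x] + tail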
Analysis
My main purpose in showing this to you is to see that even apparently serial
algorithms sometimes have a parallel alternative, so we won't get into details, but here
is an outline of the analysis:
The span of P-MERGE is determined by the span of the larger parallel recursive call,
plus the Θ(lg n) binary search. Notice that although we divide the first list in half, it
could turn out that x's insertion point q2 is at the beginning or end of the second list.
Thus (informally), a recursive call can receive as many as 3n/4 of the n elements (at
best we have "chopped off" only 1/4 of them).
The text derives the recurrence shown below; it does not meet the Master Theorem, so
an approach from a prior exercise is used to solve it:
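(For reference, the span recurrence in CLRS is PM∞(n) = PM∞(3n/4) + Θ(lg n), whose
solution is Θ(lg2n); this is the span of P-MERGE.)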
Given 1/4 ≤ α ≤ 3/4 for the unknown division of the elements between the two
recursive calls, the work recurrence turns out to be:
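(In the text's notation the work recurrence is PM1(n) = PM1(αn) + PM1((1 − α)n) + O(lg n).)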
With some more work, PM1(n) = Θ(n) is derived. Thus the parallelism is Θ(n / lg2n)
Some adjustment to the MERGE-SORT' code is needed to use this P-MERGE; see the text.
Further analysis shows that the work for the new sort, P-MERGE-SORT, is PMS1(n) =
Θ(n lg n), and the span PMS∞(n) = Θ(lg3n). This gives parallelism of Θ(n / lg2n), which
is much better than the Θ(lg n) of MERGE-SORT' in terms of the potential use of
additional processors as n grows.
The chapter ends with a comment on coarsening the parallelism by using an ordinary
serial sort once the lists get small. One might consider whether P-MERGE-SORT is still a
stable sort, and choose the serial sort to retain this property if it is desirable.
Scheduling
At the beginning, we noted that we rely on a concurrency platform to determine how
to allocate potentially parallel threads of computation to available processors. This is
the scheduling problem. Scheduling parallel computations is a complex problem, and
sophisticated schedulers have been designed that are beyond what we can discuss
here.
Centralized schedulers are those that have information on the global state of
computation, but must make decisions in real time rather than in batch. A simple
approach to centralized scheduling is a greedy scheduler, which assigns as many
strands to available processors as possible at any given time step. The CLRS text
proves a theorem concerning the performance of a greedy scheduler, with interesting
corollaries:
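The theorem and its corollaries appear as figures in the notes. For reference, the
greedy-scheduler bound in CLRS states that on an ideal parallel computer with P
processors, a greedy scheduler executes a computation with work T1 and span T∞ in time

    TP ≤ T1 / P + T∞

with the corollaries that a greedy schedule is always within a factor of 2 of optimal,
and that when the slackness is large (P much smaller than T1 / T∞) the running time is
approximately T1 / P, giving near-perfect linear speedup.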
The proofs are not difficult to understand: see the text if you are interested. I think we
have said enough here to introduce the concepts of multithreading.
Distributed Algorithms - Introduction
An Introduction to Distributed Algorithms takes up some of the main concepts and
algorithms, ranging from basic to advanced techniques and applications, that underlie
the programming of distributed-memory systems such as computer networks,
networks of workstations, and multiprocessors. Written from the broad perspective of
distributed-memory systems in general, it includes topics such as algorithms for
maximum flow, program debugging, and simulation that do not appear in more
orthodox texts on distributed algorithms. Moving from fundamentals to advances and
applications, ten chapters, with exercises and bibliographic notes, cover a variety of
topics. These include models of distributed computation, information propagation,
leader election, distributed snapshots, network synchronization, self-stabilization,
termination detection, deadlock detection, graph algorithms, mutual exclusion,
program debugging, and simulation.
All of the algorithms are presented in a clear, template-based format for the
description of message-passing computations among the nodes of a connected graph.
Such a generic setting allows the treatment of problems originating from many
different application areas. The main ideas and algorithms are described in a way that
balances intuition and formal rigor: most are preceded by a general intuitive
discussion and followed by formal statements as to correctness, complexity, or other
properties.
The following algorithms build a breadth-first search (BFS) tree in a network. All
assume that there is a designated initiator node that starts the algorithm. At the end of
the algorithm, each node except the initiator has a parent pointer and every node has a
list of children. These are consistent and define a BFS tree, i.e., nodes at distance k
from the initiator appear at level k of the tree.
Asynchronous algorithms
To keep things simple, we'll drop the requirement that a parent learn the IDs of its
children, since this can be tacked on as a separate notification protocol, in which each
child just sends one message to its parent once it figures out who its parent is.
A simple algorithm using explicit distances
It's a very simple algorithm, closely related to Dijkstra's algorithm for shortest paths,
but there is otherwise no particular reason to use it; it is dominated by the O(D) time
and O(DE) message complexity synchronizer-based algorithm described later.
The idea is to run an AsynchronousBroadcast with distances attached. Each node sets
its distance to 1 plus the smallest distance sent by its neighbors and its parent to the
neighbor supplying that smallest distance. A node notifies all its neighbors of its new
distance whenever its distance changes.
In pseudocode:
States: distance (initially 0 for the initiator, ∞ for all other nodes), parent (initially
undefined), internal send buffers
Initiator: send distance to all neighbors
All processes:
    upon receiving d from p:
        if d + 1 < distance:
            distance := d + 1
            parent := p
            send distance to all neighbors
(See LynchBook for a precondition-effect description, which also includes code for
buffering outgoing messages.)
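As a sanity check of the pseudocode, here is a small Python simulation (the function
name and example graph are mine); pending messages are delivered in arbitrary order to
mimic asynchrony, and the resulting distances should match the true BFS distances:

    import random
    from math import inf

    def async_bfs(graph, initiator):
        # graph: dict mapping each node to a list of its neighbors
        distance = {v: inf for v in graph}
        parent = {v: None for v in graph}
        distance[initiator] = 0
        msgs = [(initiator, w, 0) for w in graph[initiator]]   # initiator's initial sends
        while msgs:
            i = random.randrange(len(msgs))                    # deliver some pending message
            p, v, d = msgs.pop(i)
            if d + 1 < distance[v]:
                distance[v] = d + 1
                parent[v] = p
                msgs.extend((v, w, distance[v]) for w in graph[v])
        return distance, parent

    graph = {1: [2, 3], 2: [1, 3, 4], 3: [1, 2], 4: [2]}       # a small example network
    print(async_bfs(graph, 1))   # distances 0, 1, 1, 2 from node 1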
The claim is that after at most O(VE) messages and O(D) time, all distance values are
equal to the length of the shortest path from the initiator to the appropriate node. The
proof is by showing the following
Invariant
distancep is always the length of some path from initiator to p, and any
message sent by p is also the length of some path from initiator to p.
Proof
The second part follows from the first; any message sent equals p's current
value of distance. For the first part, suppose p updates its distance after receiving d
from a neighbor p'; then it sets it to d + 1, where d is the length of some path from the
initiator to p', so d + 1 is the length of that same path extended by adding the p'p edge.
We also need a liveness argument that says that distancep = d(initiator, p) no later
than time d(initiator, p). Note that we can't detect this condition occurring without a
lot of additional work.
The distributed minimum spanning tree (MST) problem involves the construction of
a minimum spanning tree by a distributed algorithm, in a network where nodes
communicate by message passing. It is radically different from the classical sequential
problem, although the most basic approach resembles Borůvka's algorithm. One
important application of this problem is to find a tree that can be used
for broadcasting. In particular, if the cost for a message to pass through an edge in a
graph is significant, an MST can minimize the total cost for a source process to
communicate with all the other processes in the network.
String Matching
Given a text txt[0..n-1] and a pattern pat[0..m-1], write a function search(char pat[], char
txt[]) that prints all occurrences of pat[] in txt[]. You may assume that n > m.
Examples:
pat[] = "TEST"
pat[] = "AABA"
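The example texts and expected outputs are not included above. As an illustration, here
is a sketch of the naive (brute-force) search in Python; the text string and the printed
indices below are my own made-up example, not from the notes:

    def search(pat, txt):
        n, m = len(txt), len(pat)
        # slide the pattern over the text one position at a time
        for i in range(n - m + 1):
            if txt[i:i + m] == pat:
                print("pattern found at index", i)

    search("ABA", "ABABABA")   # made-up example: prints indices 0, 2, and 4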