Lecture 24
Announcements
• Final exam: Dec. 14th, 8:30 AM -- 11:30 AM
• Open book test: any sort of paper-based product, e.g., book,
notes, magazine, old tests.
• No electronic devices
• Problem sets and previous exams are helpful.
Carnegie Mellon
Today
• From process to threads
• Basic thread execution model
• Multi-threading programming
• Hardware support of threads
• Single core
• Multi-core
• Hyper-threading
• Cache coherence
[Figure: one process, two threads. Shared by both threads: read-only code/data, read/write data, run-time heap, shared libraries, and the kernel context (VM structures, descriptor table, brk pointer). Private to each thread: its stack (stack 1, stack 2) and its context (data registers, condition codes, stack pointer SP, program counter PC).]
[Figure: threads T1-T5 running within processes P0 and P1, sharing code, data, and kernel context; thread routines sh, foo, and bar.]
Concurrent Threads
• Two threads are concurrent if their flows overlap in time
• Otherwise, they are sequential
• Examples:
  • Concurrent: A & B, A & C
  • Sequential: B & C
[Figure: execution flows of Threads A, B, and C over time]
hello.c:

/* Uses the CS:APP wrappers Pthread_create/Pthread_join */

int main()
{
    pthread_t tid;
    /* args: thread ID, attributes (NULL), thread routine,
       thread arguments (void *p) */
    Pthread_create(&tid, NULL, thread, NULL);
    /* args: thread ID, return value (void **p) */
    Pthread_join(tid, NULL);
    exit(0);
}

/* Thread routine */
void *thread(void *vargp)
{
    printf("Hello, world!\n");
    return NULL;
}
Execution of hello.c:

Main thread                          Peer thread
call Pthread_create()
Pthread_create() returns
call Pthread_join()                  printf()
Main thread waits for                return NULL;
peer thread to terminate             Peer thread terminates
Pthread_join() returns
exit()   (terminates main thread and any peer threads)
Today
• From process to threads
• Basic thread execution model
• Multi-threading programming
• Hardware support of threads
• Single core
• Multi-core
• Cache coherence
[Figure (repeated): the shared process address space (run-time heap, read/write data, read-only code/data, kernel context) with per-thread stacks and thread contexts.]
Synchronizing Threads
• Shared variables are handy...
• …but introduce the possibility of nasty synchronization errors.
Improper Synchronization
badcnt.c:

/* Global shared variable */
volatile long cnt = 0; /* Counter */

int main(int argc, char **argv)
{
    pthread_t tid1, tid2;
    long niters = 10000;

    Pthread_create(&tid1, NULL, thread, &niters);
    Pthread_create(&tid2, NULL, thread, &niters);
    Pthread_join(tid1, NULL);
    Pthread_join(tid2, NULL);

    /* Check result */
    if (cnt != (2 * 10000))
        printf("BOOM! cnt=%ld\n", cnt);
    else
        printf("OK cnt=%ld\n", cnt);
    exit(0);
}

/* Thread routine */
void *thread(void *vargp)
{
    long i, niters = *((long *)vargp);

    for (i = 0; i < niters; i++)
        cnt++;
    return NULL;
}

Sample runs:
linux> ./badcnt
OK cnt=20000
linux> ./badcnt
BOOM! cnt=13051

cnt should be 20,000. What went wrong?
Concurrent Execution
• Key observation: In general, any sequentially consistent
  interleaving is possible, but some give an unexpected result!

  A correct interleaving (each critical section runs to completion):

  i (thread)   instr_i   %rdx_1   %rdx_2   cnt (shared)
      1          L1         0        -         0   \
      1          U1         1        -         0    } Thread 1 critical section
      1          S1         1        -         1   /
      2          L2         -        1         1   \
      2          U2         -        2         1    } Thread 2 critical section
      2          S2         -        2         2   /

  L_i: movq cnt(%rip), %rdx    (Load cnt)
  U_i: addq $1, %rdx           (Update cnt)
  S_i: movq %rdx, cnt(%rip)    (Store cnt)
  A bad interleaving (Thread 2 loads cnt before Thread 1 stores it):

  i (thread)   instr_i   %rdx_1   %rdx_2   cnt (shared)
      1          L1         0        -         0
      1          U1         1        -         0
      2          L2         -        0         0
      1          S1         1        -         1
      2          U2         -        1         1
      2          S2         -        1         1

  L_i: movq cnt(%rip), %rdx    (Load cnt)
  U_i: addq $1, %rdx           (Update cnt)
  S_i: movq %rdx, cnt(%rip)    (Store cnt)

  Result: cnt = 1 instead of 2; one increment is lost.
Critical Section
• Code section (a sequence of instructions) where no more than one
  thread should be executing concurrently.
• Critical section refers to code, but its intent is to protect data!
• Threads need mutually exclusive access to the critical section. That is,
  the execution of the critical section must be atomic: the instructions in a
  CS are either executed entirely, without interruption, or not executed at all.
Assembly for the counter loop in thread (niters reached through %rdi):

        movq (%rdi), %rcx        # H_i: Head
        testq %rcx, %rcx
        jle .L2
        movl $0, %eax
.L3:
        movq cnt(%rip), %rdx     # L_i: Load cnt   \
        addq $1, %rdx            # U_i: Update cnt  } critical section wrt cnt
        movq %rdx, cnt(%rip)     # S_i: Store cnt  /
        addq $1, %rax            # T_i: Tail
        cmpq %rcx, %rax
        jne .L3
.L2:
• Classic solution:
  • Semaphores/mutexes (Edsger Dijkstra)
• Other approaches:
  • Condition variables
  • Monitors (Java)
• 254/258 discusses these
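As a minimal sketch of the mutex approach with POSIX threads (the names worker and run_twice are illustrative, not from the lecture): the lock/unlock pair turns the load-update-store sequence into an atomic unit.

```c
#include <pthread.h>

static pthread_mutex_t cnt_lock = PTHREAD_MUTEX_INITIALIZER;
static volatile long cnt = 0;

static void *worker(void *vargp)
{
    long i, niters = *((long *)vargp);
    for (i = 0; i < niters; i++) {
        pthread_mutex_lock(&cnt_lock);   /* enter critical section */
        cnt++;                           /* L/U/S run without interference */
        pthread_mutex_unlock(&cnt_lock); /* leave critical section */
    }
    return NULL;
}

/* Illustrative harness: two threads, each incrementing cnt niters times. */
long run_twice(long niters)
{
    pthread_t t1, t2;
    cnt = 0;
    pthread_create(&t1, NULL, worker, &niters);
    pthread_create(&t2, NULL, worker, &niters);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return cnt;
}
```

Unlike badcnt.c, the final count is now always 2 * niters.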
Proper Synchronization
• Define and initialize a mutex for the shared variable cnt:
volatile long cnt = 0; /* Counter */
sem_t mutex; /* Semaphore that protects cnt */
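A corrected badcnt.c can then wrap each increment in P and V operations. The sketch below uses the raw POSIX calls sem_wait/sem_post in place of the CS:APP P/V wrappers; run_counter is an illustrative harness, not from the slides.

```c
#include <pthread.h>
#include <semaphore.h>

volatile long cnt = 0;  /* Counter */
sem_t mutex;            /* Semaphore that protects cnt */

/* Thread routine: each increment is now a critical section */
void *thread(void *vargp)
{
    long i, niters = *((long *)vargp);

    for (i = 0; i < niters; i++) {
        sem_wait(&mutex);  /* P: acquire */
        cnt++;
        sem_post(&mutex);  /* V: release */
    }
    return NULL;
}

/* Illustrative harness: two threads, each incrementing cnt niters times. */
long run_counter(long niters)
{
    pthread_t tid1, tid2;

    cnt = 0;
    sem_init(&mutex, 0, 1);  /* binary semaphore, initially 1 */
    pthread_create(&tid1, NULL, thread, &niters);
    pthread_create(&tid2, NULL, thread, &niters);
    pthread_join(tid1, NULL);
    pthread_join(tid2, NULL);
    sem_destroy(&mutex);
    return cnt;
}
```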
Problem?
• Wouldn’t there be a problem when multiple threads access the
  mutex itself? How do we ensure exclusive access to the mutex?
• Hardware MUST provide mechanisms for atomic accesses to the
  mutex variable.
• Checking the mutex value and setting its value must be an atomic
  unit: they are either performed entirely or not performed at all.
• On x86: the atomic test-and-set instruction.
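To illustrate, here is a minimal spinlock sketch built on C11's atomic_flag, whose atomic_flag_test_and_set compiles down to the kind of atomic test-and-set/exchange instruction mentioned above on x86 (the names spin_lock, spin_unlock, worker_spin, and run_spin are illustrative):

```c
#include <pthread.h>
#include <stdatomic.h>

static atomic_flag flag = ATOMIC_FLAG_INIT;
static volatile long cnt = 0;

/* Atomically set the flag and observe its old value; keep spinning
   while the old value was already set (someone else holds the lock). */
static void spin_lock(void)
{
    while (atomic_flag_test_and_set(&flag))
        ;  /* busy-wait */
}

static void spin_unlock(void)
{
    atomic_flag_clear(&flag);
}

static void *worker_spin(void *vargp)
{
    long i, niters = *((long *)vargp);
    for (i = 0; i < niters; i++) {
        spin_lock();
        cnt++;
        spin_unlock();
    }
    return NULL;
}

/* Illustrative harness: two threads, each incrementing cnt niters times. */
long run_spin(long niters)
{
    pthread_t t1, t2;
    cnt = 0;
    pthread_create(&t1, NULL, worker_spin, &niters);
    pthread_create(&t2, NULL, worker_spin, &niters);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return cnt;
}
```

Because checking and setting the flag happen as one atomic unit, two threads can never both observe the lock as free.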
Deadlock
• Def: A process/thread is deadlocked if and only if it is waiting for
a condition that will never be true
• General to concurrent/parallel programming (threads,
processes)
• Typical scenario:
  • Processes 1 and 2 each need two resources (A and B) to proceed
  • Process 1 acquires A, waits for B
  • Process 2 acquires B, waits for A
  • Both will wait forever!
Example: Tid[0] and Tid[1] both increment cnt, protected by semaphores s0 and s1.

Deadlock-prone: the threads acquire the semaphores in opposite orders.

  Tid[0]:      Tid[1]:
  P(s0);       P(s1);
  P(s1);       P(s0);
  cnt++;       cnt++;
  V(s0);       V(s1);
  V(s1);       V(s0);

Deadlock-free: both threads acquire s0 before s1.

  Tid[0]:      Tid[1]:
  P(s0);       P(s0);
  P(s1);       P(s1);
  cnt++;       cnt++;
  V(s0);       V(s1);
  V(s1);       V(s0);
Amdahl’s Law
• Gene Amdahl (1922 – 2015). Giant in computer architecture
• Captures the difficulty of using parallelism to speed things up
• Amdahl’s Law
  • f: Parallelizable fraction of a program
  • N: Number of processors (i.e., maximal achievable speedup)

                  1
  Speedup = -------------
            (1 - f) + f/N

Amdahl, “Validity of the single processor approach to achieving large scale computing capabilities,” 1967.
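To get a feel for the formula, the sketch below evaluates the speedup numerically (an illustration, not from the slides). Note that even with f = 95% of the program parallelizable, unlimited processors cap the speedup at 1/(1 - f) = 20x.

```c
#include <math.h>

/* Amdahl's Law: speedup of a program whose fraction f is
   parallelizable, run on n processors. */
double speedup(double f, double n)
{
    return 1.0 / ((1.0 - f) + f / n);
}
```

For example, f = 0.5 on 2 processors gives 1 / (0.5 + 0.25) = 1.33x, not 2x: the serial half dominates.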
Today
• From process to threads
• Basic thread execution model
• Multi-threading programming
• Hardware support of threads
• Single core
• Multi-core
• Cache coherence
Any Benefits?
• Can single-core multi-threading provide any performance gains?
• Yes: if Thread A has a cache miss and the pipeline stalls, switch
  to Thread C. This improves overall performance.
[Figure: Threads A, B, and C time-sharing one core; a cache miss in Thread A triggers a switch to another thread]
When to Switch?
• Coarse-grained
  • Event-based, e.g., switch on an L3 cache miss
  • Quantum-based (e.g., every few thousand cycles)
• Fine-grained
  • Cycle by cycle
  • Thornton, “CDC 6600: Design of a Computer,” 1970.
  • Burton Smith, “A pipelined, shared resource MIMD computer,” ICPP 1978.
    The HEP machine. A seminal paper showing that multi-threading can
    avoid branch prediction.
• Either way, the thread context must be saved/restored upon switching.
Fine-Grained Switching
• One big bonus of fine-grained switching: no need for a
  branch predictor!

  Pipeline diagram (stall cycles after a branch):
      irmovq $1, %rax   # Fall Through   F D E M W
  L1: irmovq $4, %rcx   # Target             F D E M W
      irmovq $3, %rax   # Target + 1             F D E
Fine-Grained Switching
• One big bonus of fine-grained switching: no need for a branch predictor!
• But the context-switching overhead would be very high! So use separate
  hardware contexts for each thread (e.g., separate register files).
• GPUs do this (among other things). More later.