CSC 252/452: Computer Organization

Fall 2024: Lecture 24

Instructor: Yanan Guo

Department of Computer Science


University of Rochester
Carnegie Mellon

Announcements
• Final exam: Dec. 14th, 8:30 AM -- 11:30 AM
• Open book test: any sort of paper-based product, e.g., book,
notes, magazine, old tests.
• No electronic devices
• Problem sets and previous exams are helpful.


Today
• From process to threads
• Basic thread execution model
• Multi-threading programming
• Hardware support of threads
• Single core
• Multi-core
• Hyper-threading
• Cache coherence


Programmer's View of A Process


• Process = process context + code, data, and stack

Process context:
• Program context: data registers, condition codes, stack pointer (SP), program counter (PC)
• Kernel context: VM structures, descriptor table, brk pointer

Code, data, and stack (virtual address space, from top down):
• Stack (SP points into it)
• Shared libraries
• Run-time heap (grows up to brk)
• Read/write data
• Read-only code/data (PC points into the code), down to address 0

A Process With Multiple Threads


• Multiple threads can be associated with a process
• Each thread has its own logical control flow
• Each thread shares the same code, data, and kernel context
• Each thread has its own stack for local variables
• but not protected from other threads
• Each thread has its own thread id (TID)
Thread 1 (main thread) and Thread 2 (peer thread) each have their own context (data registers, condition codes, SP1/SP2, PC1/PC2) and their own stack (stack 1, stack 2). They share the code and data (shared libraries, run-time heap, read/write data, read-only code/data, down to address 0) and the kernel context (VM structures, descriptor table, brk pointer).

Logical View of Threads


• Threads associated with a process form a pool of peers
• Unlike processes, which form a tree hierarchy

For example, threads T1–T5 associated with process foo all share the same code, data, and kernel context as equal peers, while processes (P0, P1, several sh instances, foo, bar) form a parent/child tree.

Concurrent Threads
• Two threads are concurrent if their flows overlap in
time
• Otherwise, they are sequential
• Examples (threads A, B, and C; B's and C's lifetimes do not overlap in time):
• Concurrent: A & B, A & C
• Sequential: B & C


Concurrent Thread Execution


• Single-core processor
• Simulates parallelism by time slicing
• Multi-core processor
• Threads can achieve true parallelism
• e.g., run 3 threads on 2 cores



Threads vs. Processes


• How threads and processes are similar
• Each has its own logical control flow
• Each can run concurrently with others (possibly on different cores)
• Each is context switched, controlled by kernel

• How threads and processes are different


• Threads share all code and data (except local stacks)
• Processes (typically) do not
• Threads are less expensive than processes
• Space: threads share one virtual address space (each with its own stack region), while each process has its own virtual address space
• Process control (creating and reaping) is roughly twice as expensive as thread control
• Typical Linux numbers:
• ~20K cycles to create and reap a process
• ~10K cycles (or less) to create and reap a thread

Posix Threads (Pthreads) Interface


• Pthreads: standard interface of ~60 functions that manipulate threads from
C programs
• Creating and reaping threads
• pthread_create()
• pthread_join()
• Determining your thread ID
• pthread_self()
• Terminating threads
• pthread_cancel()
• pthread_exit()
• exit() [terminates all threads], returning from the thread routine [terminates current thread]
• Synchronizing access to shared variables
• pthread_mutex_init
• pthread_mutex_[un]lock


The Pthreads "hello, world" Program


/*
 * hello.c - Pthreads "hello, world" program
 */
#include "csapp.h"

void *thread(void *vargp);

int main()
{
    pthread_t tid;
    Pthread_create(&tid, NULL, thread, NULL);
    /* args: thread ID (&tid), thread attributes (usually NULL),
       thread routine, thread argument (void *) */
    Pthread_join(tid, NULL);
    /* 2nd arg (void **) receives the routine's return value */
    exit(0);
}                                   /* hello.c */

void *thread(void *vargp) /* thread routine */
{
    printf("Hello, world!\n");
    return NULL;
}                                   /* hello.c */

Execution of Threaded “hello, world”


Main thread:
• calls Pthread_create(), which spawns the peer thread and returns
• calls Pthread_join() and waits for the peer thread to terminate
• after Pthread_join() returns, calls exit(), which terminates the main thread and any peer threads

Peer thread (runs concurrently):
• printf("Hello, world!\n")
• return NULL; — the peer thread terminates

Today
• From process to threads
• Basic thread execution model
• Multi-threading programming
• Hardware support of threads
• Single core
• Multi-core
• Cache coherence


Shared Variables in Threaded C Programs


• One great thing about threads is that they can share the same
program variables.
• Question: which variables in a threaded C program are shared?
• Intuitively, the answer is as simple as “global variables are
shared” and “stack variables are private”. Reality is not so simple.

Example Program to Illustrate Sharing


char **ptr; /* global var */

void *thread(void *vargp)
{
    long myid = (long)vargp;
    static int cnt = 0;
    printf("[%ld]: %s (cnt=%d)\n",
           myid, ptr[myid], ++cnt);
    return NULL;
}

int main()
{
    long i;
    pthread_t tid;
    char *msgs[2] = {
        "Hello from foo",
        "Hello from bar"
    };
    ptr = msgs;
    for (i = 0; i < 2; i++)
        pthread_create(&tid,
                       NULL,
                       thread,
                       (void *)i);
    pthread_exit(NULL);
}                                   /* sharing.c */

Where each variable lives, and who references it (main, peer 0, peer 1):
• Main thread stack: i, tid (referenced by main only); msgs (referenced by main and both peers, via ptr)
• Peer thread 0 stack: myid (peer 0 only)
• Peer thread 1 stack: myid (peer 1 only)
• Uninitialized data (.bss): ptr (referenced by main and both peers)
• Initialized data (.data): cnt (referenced by both peers)
(The address space also contains the memory-mapped region for shared libraries, the runtime heap, and program text (.text).)

Example Program to Illustrate Sharing


char **ptr; /* global var */
Main thread stack
void *thread(void *vargp)
{ i tid main
long myid = (long)vargp;
static int cnt = 0; msgs p0 p1 main
printf("[%ld]: %s (cnt=%d)\n",
myid, ptr[myid], ++cnt); Peer thread 0 stack
return NULL; myid p0
}
Peer thread 1 stack
int main() myid p1
{
long i;
pthread_t tid; Memory mapped region
char *msgs[2] = {
for shared libraries
"Hello from foo",
"Hello from bar"
};
ptr = msgs; Runtime heap (malloc)
for (i = 0; i < 2; i++)
Uninitialized data (.bss)
pthread_create(&tid, p0 p1 main
NULL, ptr
thread,
Initialized data (.data)
(void *)i);
cnt p0 p1
pthread_exit(NULL);
} sharing.c Program text (.text) 15
Carnegie Mellon

Synchronizing Threads
• Shared variables are handy...
• …but introduce the possibility of nasty synchronization errors.


Improper Synchronization

/* Global shared variable */
volatile long cnt = 0; /* Counter */

int main(int argc, char **argv)
{
    pthread_t tid1, tid2;
    long niters = 10000;

    Pthread_create(&tid1, NULL, thread, &niters);
    Pthread_create(&tid2, NULL, thread, &niters);
    Pthread_join(tid1, NULL);
    Pthread_join(tid2, NULL);

    /* Check result */
    if (cnt != (2 * 10000))
        printf("BOOM! cnt=%ld\n", cnt);
    else
        printf("OK cnt=%ld\n", cnt);
    exit(0);
}                                   /* badcnt.c */

/* Thread routine */
void *thread(void *vargp)
{
    long i, niters = *((long *)vargp);

    for (i = 0; i < niters; i++)
        cnt++;
    return NULL;
}

linux> ./badcnt
OK cnt=20000
linux> ./badcnt
BOOM! cnt=13051

cnt should be 20,000. What went wrong?

Assembly Code for Counter Loop


C code for counter loop in thread i
for (i = 0; i < niters; i++)
cnt++;

Asm code for thread i:

    movq (%rdi), %rcx
    testq %rcx,%rcx
    jle .L2                  # Hi: Head
    movl $0, %eax
.L3:
    movq cnt(%rip),%rdx      # Li: Load cnt
    addq $1, %rdx            # Ui: Update cnt
    movq %rdx, cnt(%rip)     # Si: Store cnt
    addq $1, %rax
    cmpq %rcx, %rax
    jne .L3                  # Ti: Tail
.L2:


Concurrent Execution
• Key observation: In general, any sequentially consistent
interleaving is possible, but some give an unexpected result!

i (thread)  instr_i  %rdx_1  %rdx_2  cnt (shared)
1           L1       0       -       0   \
1           U1       1       -       0    } Thread 1 critical section
1           S1       1       -       1   /
2           L2       -       1       1   \
2           U2       -       2       1    } Thread 2 critical section
2           S2       -       2       2   /

Li: movq cnt(%rip),%rdx    Ui: addq $1, %rdx    Si: movq %rdx, cnt(%rip)

Concurrent Execution (cont)


• A legal (feasible) but undesired ordering: two threads increment
the counter, but the result is 1 instead of 2

i (thread)  instr_i  %rdx_1  %rdx_2  cnt (shared)
1           L1       0       -       0
1           U1       1       -       0
2           L2       -       0       0
1           S1       1       -       1
2           U2       -       1       1
2           S2       -       1       1

Li: movq cnt(%rip),%rdx    Ui: addq $1, %rdx    Si: movq %rdx, cnt(%rip)

Critical Section
• Code section (a sequence of instructions) where no more than one
thread should be executing concurrently.
• A critical section refers to code, but its intention is to protect data!
• Threads need mutually exclusive access to the critical section. That is,
the execution of the critical section must be atomic: its instructions are
either executed entirely, without interruption, or not executed at all.

    movq (%rdi), %rcx
    testq %rcx,%rcx
    jle .L2                  # Hi: Head
    movl $0, %eax
.L3:                         # critical section wrt cnt:
    movq cnt(%rip),%rdx      #   Li: Load cnt
    addq $1, %rdx            #   Ui: Update cnt
    movq %rdx, cnt(%rip)     #   Si: Store cnt
    addq $1, %rax
    cmpq %rcx, %rax
    jne .L3                  # Ti: Tail
.L2:

Enforcing Mutual Exclusion


• We must coordinate/synchronize the execution of the threads
• i.e., need to guarantee mutually exclusive access for each critical section.

• Classic solution:
• Semaphores/mutex (Edsger Dijkstra)

• Other approaches
• Condition variables
• Monitors (Java)
• CSC 254/258 discusses these


Using Semaphores for Mutual Exclusion


• Basic idea:
• Associate each shared variable (or related set of shared variables) with
a unique variable, called a semaphore, initially 1.
• Every time a thread tries to enter the critical section, it first checks the
semaphore value. If it is still 1, the thread decrements it to 0 (through a
P operation) and enters the critical section. If it is 0, the thread waits.
• Every time a thread exits the critical section, it increments the
semaphore value back to 1 (through a V operation) so that other threads
are again allowed to enter the critical section.
• No more than one thread can be in the critical section at a time.
• Terminology
• A binary semaphore (one whose value can only be 0 or 1) is also
called a mutex.
• Think of the P operation as “locking”, and V as “unlocking”.


Proper Synchronization
• Define and initialize a mutex for the shared variable cnt:

volatile long cnt = 0;  /* Counter */
sem_t mutex;            /* Semaphore that protects cnt */

Sem_init(&mutex, 0, 1); /* mutex = 1 */

• Surround the critical section with P and V:

for (i = 0; i < niters; i++) {
    P(&mutex);
    cnt++;
    V(&mutex);
}                       /* goodcnt.c */

linux> ./goodcnt 10000
OK cnt=20000
linux> ./goodcnt 10000
OK cnt=20000

• Warning: it’s orders of magnitude slower than badcnt.c.

Problem?
• Wouldn’t there be a problem when multiple threads access the
mutex? How do we ensure exclusive access to the mutex itself?
• Hardware MUST provide mechanisms for atomic accesses to the
mutex variable.
• Checking the mutex value and setting it must be one atomic
unit: both steps are performed entirely or not at all.
• On x86: the atomic test-and-set instruction.

for (i = 0; i < niters; i++) {
    P(&mutex);
    cnt++;
    V(&mutex);
}                       /* goodcnt.c */


Deadlock
• Def: A process/thread is deadlocked if and only if it is waiting for
a condition that will never be true
• General to concurrent/parallel programming (threads,
processes)
• Typical scenario
• Processes 1 and 2 each need two resources (A and B) to proceed
• Process 1 acquires A, waits for B
• Process 2 acquires B, waits for A
• Both will wait forever!


Deadlocking With Semaphores


void *count(void *vargp)
{
    int i;
    int id = (int) vargp;
    for (i = 0; i < NITERS; i++) {
        P(&mutex[id]); P(&mutex[1-id]);
        cnt++;
        V(&mutex[id]); V(&mutex[1-id]);
    }
    return NULL;
}

int main()
{
    pthread_t tid[2];
    Sem_init(&mutex[0], 0, 1); /* mutex[0] = 1 */
    Sem_init(&mutex[1], 0, 1); /* mutex[1] = 1 */
    Pthread_create(&tid[0], NULL, count, (void*) 0);
    Pthread_create(&tid[1], NULL, count, (void*) 1);
    Pthread_join(tid[0], NULL);
    Pthread_join(tid[1], NULL);
    printf("cnt=%d\n", cnt);
    exit(0);
}

Per-thread lock order (s0 = mutex[0], s1 = mutex[1]):

Tid[0]:    Tid[1]:
P(s0);     P(s1);
P(s1);     P(s0);
cnt++;     cnt++;
V(s0);     V(s1);
V(s1);     V(s0);

Avoiding Deadlock: acquire shared resources in the same order

Deadlock-prone (threads acquire in different orders):

Tid[0]:    Tid[1]:
P(s0);     P(s1);
P(s1);     P(s0);
cnt++;     cnt++;
V(s0);     V(s1);
V(s1);     V(s0);

Deadlock-free (same acquisition order in both threads):

Tid[0]:    Tid[1]:
P(s0);     P(s0);
P(s1);     P(s1);
cnt++;     cnt++;
V(s0);     V(s1);
V(s1);     V(s0);


Summary of Multi-threading Programming


• Concurrent/parallel threads access shared variables
• Need to protect concurrent accesses to guarantee correctness
• Semaphores (e.g., mutex) provide a simple solution
• Can lead to deadlock if not careful
• Take CSC 254/258 to learn more about avoiding deadlocks
(and about parallel programming in general)


Thread-level Parallelism (TLP)


• Thread-Level Parallelism
• Splitting a task into independent sub-tasks
• Each thread is responsible for a sub-task
• Example: parallel summation of N numbers
• Partition values 0, …, n-1 into t ranges, n/t values per range
• Each of t threads processes one range (sub-task)
• Sum all sub-sums at the end
• Question: if you parallelize your work N ways, do you always get an
N-times speedup?


Why the Sequential Bottleneck?


• Maximum speedup limited by the
sequential portion
• Main cause: Non-parallelizable
operations on data
• Parallel portion is usually not
perfectly parallel as well
• e.g., Synchronization overhead
Each thread:
  loop {
    Compute              // parallel part: N cycles
    P(A)
    Update shared data   // critical section: C cycles, serialized
    V(A)
  }
32
Carnegie Mellon

Amdahl’s Law
• Gene Amdahl (1922 – 2015). Giant in computer architecture
• Captures the difficulty of using parallelism to speed things up
• Amdahl’s Law
• f: Parallelizable fraction of a program
• N: Number of processors (i.e., maximal achievable speedup)

Speedup = 1 / ( (1 - f) + f/N )

• Completely parallelizable (f = 1): Speedup = N


• Completely sequential (f = 0): Speedup = 1
• Mostly parallelizable (f = 0.9, N = 1000): Speedup ≈ 9.9
Amdahl, “Validity of the single processor approach to achieving large scale computing capabilities,” 1967.
33
Carnegie Mellon

Today
• From process to threads
• Basic thread execution model
• Multi-threading programming
• Hardware support of threads
• Single core
• Multi-core
• Cache coherence

34
Carnegie Mellon

Can A Single Core Support Multi-threading?


• Need to multiplex between different threads (time slicing)
Sequential Multi-threaded

Thread A Thread B Thread C

35
Carnegie Mellon

Any benefits?
• Can single-core multi-threading provide any performance gains?
• If Thread A has a cache miss and the pipeline gets stalled,
switch to Thread C. Improves the overall performance.
Thread A Thread B Thread C

Cache
Miss!

36
Carnegie Mellon

When to Switch?
• Coarse grained
• Event based, e.g., switch on L3 cache miss
• Quantum based (every few thousand cycles)
• Fine grained
• Cycle by cycle
• Thornton, “CDC 6600: Design of a Computer,” 1970.
• Burton Smith, “A pipelined, shared resource MIMD computer,” ICPP
1978. The HEP machine. A seminal paper showing that fine-grained
multi-threading can avoid the need for branch prediction.
• Either way, need to save/restore thread context upon
switching.

37
Carnegie Mellon

Fine-Grained Switching
• One big bonus of fine-grained switching: no need for
branch predictor!!

The stalling approach


                                  1  2  3  4  5  6  7  8  9

xorq %rax, %rax                   F  D  E  M  W
jne L1          # Not taken          F  D  E  M  W
                                        Stall Stall
irmovq $1, %rax # Fall Through                F  D  E  M  W
L1: irmovq $4, %rcx # Target                     F  D  E  M  W
irmovq $3, %rax # Target + 1                        F  D  E

38
Carnegie Mellon

Fine-Grained Switching
• One big bonus of fine-grained switching: no need for
branch predictor!!

The branch prediction approach

39
Carnegie Mellon

Fine-Grained Switching
• One big bonus of fine-grained switching: no need for branch
predictor!!
• Context switching overhead would be very high! Use separate
hardware contexts for each thread (e.g., separate register files).
• GPUs do this (among other things). More later.

The fine-grained multi-threading approach


                                  1  2  3  4  5  6  7  8  9

xorq %rax, %rax                   F  D  E  M  W
Inst x from TID=1                    F  D  E  M  W
Inst y from TID=2                       F  D  E  M  W
jne L1          # Not taken                F  D  E  M  W
Inst x+1 from TID=1                           F  D  E  M  W
Inst y+1 from TID=2                              F  D  E  M  W
irmovq $1, %rax # Fall Through                      F  D  E
Inst x+2 from TID=1                                    F  D
Inst y+2 from TID=2                                       F
…
40
Carnegie Mellon
