D ISTRIBUTED S YSTEMS [COMP9243] Lecture 2: System Architecture & Communication
Slide 1
System Architectures Processes & Server Architecture Communication in a Distributed System Communication Abstractions
Slide 3
A RCHITECTURE
System Architecture:
placement of machines placement of software on machines
B UILDING
Slide 2 Two questions:
D ISTRIBUTED S YSTEM
Slide 4
Where to place?:
processing capacity, load balancing communication capacity locality
Where to place the hardware? Where to place the software?
Mapping of services to servers:
Partitioning Replication Caching
A RCHITECTURE
A RCHITECTURE
C LIENT-S ERVER
Software Architecture: Logical organisation and roles of software components
Layered Object-oriented
Request Client
Slide 7
Server Reply
Slide 5
Data-centered Event-based
There is no single best architecture
depends on application requirements and the environement!
Kernel
Kernel
Client-Server from another perspective:
Client
Wait for result
Slide 6
A RCHITECTURAL PATTERNS
Slide 8
Server
Request
Reply
Provide service
Time
C LIENT-S ERVER
C LIENT-S ERVER
Example client-server code in Erlang: % Client code using the increment server client (Server) -> Server ! {self (), 10}, receive {From, Reply} -> io:format ("Result: ~w~n", [Reply]) end. Slide 9 % Server loop for increment server loop () -> receive {From, Msg} -> From ! {self (), Msg + 1}, loop (); stop -> true end. % Initiate the server start_server() -> spawn (fun () -> loop () end).
server(void) { struct sockaddr_in cin, sin; int sd, sd_client; sd = socket(AF_INET,SOCK_STREAM,0); bind(sd,(struct sockaddr *)&sin,sizeof(sin)); listen(sd, queuesize); while (true) { sd_client = accept(sd,(struct sockaddr *)&cin,&addrlen)); recv(sd_client,buffer,sizeof(buffer),0); DoService(buffer); send(sd_client,buffer,strlen(buffer),0); close (sd_client); } close (sd); }
Slide 11
Example client-server code in C: Splitting Functionality:
Client machine User interface User interface User interface Application User interface Application User interface Application Database
client(void) { struct sockaddr_in cin; char buffer[bufsize]; int sd; Slide 10 sd = socket(AF_INET,SOCK_STREAM,0); connect(sd,(void *)&cin,sizeof(cin)); send(sd,buffer,strlen(buffer),0); recv(sd,buffer,bufsize,0); close (sd); } Slide 12
User interface Application Database Application Database Application Database Server machine (a) (b) (c) (d) (e) Database Database
C LIENT-S ERVER
V ERTICAL D ISTRIBUTION (M ULTI - TIER )
V ER TICAL D ISTRIBUTION (M ULTI - TIER )
Front end handling incoming requests
H ORIZONTAL D ISTRIBUTION
Request Client Reply Kernel
App. Server
Request Reply
Dbase Server
Replicated Web servers each containing the same Web pages Requests handled in round-robin fashion Disks
Kernel
Kernel
Slide 13 Three layers of functionality: User interface Processing/Application logic Data
Logically different components on different machines
Slide 15
Internet Internet
Logically equivalent components replicated on different machines
How scalable is this?
How scalable is this?
P EER
Vertical Distribution from another perspective:
request reply
TO
P EER
Peer
Peer
User interface (presentation) Request operation
Wait for result
Kernel
reply request
Kernel
request reply
Return result Wait for data
Slide 14
Application server Request data Database server
Slide 16
Return data
request
Peer
Peer
reply
Kernel Kernel Peer
Time
Kernel
All processes have client and server roles: servent
Why is this special?
H ORIZONTAL D ISTRIBUTION
P EER TO P EER AND OVERLAY N ETWORKS
U NSTRUCTURED OVERLAY P EER
TO
P EER
AND
OVERLAY N ETWORKS
How do peers keep track of all other peers?
static structure: you already know dynamic structure: Overlay Network structured unstructured
Slide 17
Slide 19
Overlay Network:
Application-specic network Addressing Routing Specialised features (e.g., encryption, multicast, etc.) Data stored at random nodes Partial view: nodes list of neighbours Exchange partial views with neighbours to update
Whats a problem with this?
Example:
111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000
S TRUCTURED OVERLAY
Distributed Hash Table:
Actual node
11 00 11 00 11 00 11 00 11 00 11 00 11 00 11 00 11 00 11 00 11 00 11 00 11 00 11 00 11 00 11 00
15 14 13 12 {8,9,10,11,12}
1 {0,1} 2 3
1111111 0000000 1111111 0000000 11111111111111111 00000000000000000 1111111 0000000 11111111111111111 00000000000000000
1111111 0000000 1111111 0000000 1111111111111111 0000000000000000 1111111 0000000 1111111111111111 0000000000000000
{13,14,15}
1111111 0000000 1111111 0000000 1111111111111111 0000000000000000 1111111 0000000 1111111111111111 0000000000000000
111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000
{2,3,4}
Slide 18
111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000
Slide 20
11 10
Associated data keys {5,6,7} 9 8 7 6
1111111 0000000 1111111 0000000 1111111111111111 0000000000000000 1111111 0000000 1111111111111111 0000000000000000
11 00 11 00 11 00 11 00 11 00 11 00 11 00 11 00 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000
1111111 0000000 1111111 0000000 11111111111111111 00000000000000000 1111111 0000000 11111111111111111 00000000000000000
111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000
1111111 0000000 1111111 0000000 11111111111111111 00000000000000000 1111111 0000000 11111111111111111 00000000000000000
1111111 0000000 1111111 0000000 1111111111111111 0000000000000000 1111111 0000000 1111111111111111 0000000000000000
1111111 0000000 1111111 0000000 1111111111111111 0000000000000000 1111111 0000000 1111111111111111 0000000000000000
Nodes have identier and range Data has identier Node is responsible for data that falls in its range Search is routed to appropriate node
Whats a problem with this?
U NSTRUCTURED OVERLAY
H YBRID A RCHITECTURES
10
Collaborative Distributed Systems: Example: BitTorrent
Node downloads chunks of le from many other nodes
H YBRID A RCHITECTURES
Combination of architectures. Examples: Slide 21 Superpeer networks Collaborative distributed systems Edge-server systems Slide 23
Node provides downloaded chunks to other nodes Tracker keeps track of active nodes that have chunks of le Enforce collaboration by penalising selsh nodes
Node 1 File server Node 2
Node 3 Tracker Node 4
What problems does Bit Torrent face?
Superpeer Networks:
Regular peers are clients of superpeers Superpeers are servers for regular peers Superpeers are peers among themselves Superpeers may maintain large index, or act as brokers
Edge-Server Networks:
Servers placed at the edge of the network Servers replicate content Mostly used for content and application distribution Content Distribution Networks
Origin server
Slide 22
Regular peers
ISPs
Slide 24
Internet Enterprise networks
Superpeer network
Replica server Superpeer
What are the challenges?
H YBRID A RCHITECTURES
11
S ERVER D ESIGN
12
S ERVER D ESIGN
Dispatcher thread Request dispatched to a worker thread
C LUSTERED S ERVERS
Server
Worker thread
Logical switch (possibly multiple)
Application/compute servers
Distributed file/database system
Request coming in from the network
Slide 25
Operating system
Slide 27
Client requests
Dispatched request
Model Single-threaded process Threads Finite-state machine
Characteristics No parallelism, blocking system calls Parallelism, blocking system calls Parallelism, non-blocking system calls
First tier Second tier Third tier
S TATEFUL
Stateful:
VS
S TATELESS S ERVERS V IR TUALISATION
Virtual Machines
Keeps persistent information about clients V Improved performance X Expensive crash recovery X Must track clients
Server Guest OS
Server Guest OS
Server Guest OS
Slide 26
Stateless:
Does not keep state of clients soft state design: limited client state V Can change own state without informing clients V No cleanup after crash V Easy to replicate X Increased communication
Slide 28
Virtual Machine Monitor
Host OS
Hardware
What are the benets?
Note: Session state vs. Permanent state
C LUSTERED S ERVERS
13
R EQUEST S WITCHING
14
R EQUEST S WITCHING
Transport layer switch:
Logically a single TCP connection Response Server
Client
Request
Switch
Request (handed off)
Slide 29
Server
Slide 31
C OMMUNICATION
DNS-based:
Round-robin DNS
Application layer switch:
Analyse requests Forward to appropriate server
C ODE M OBILITY
Why move code?
Optimise computation (load balancing) Optmise communication
Weak vs Strong Mobility: Slide 30 Weak transfer only code Strong transfer code and execution segment Sender vs Receiver Intitated migration: Sender Send program to compute server Receiver Download applets What are the challenges of code mobility? Slide 32
Why Communication: Cooperating processes need to communicate.
For synchronisation and control To share data
C OMMUNICATION
15
C OMMUNICATION
16
In a Non-Distributed System: Two approaches to communication:
Shared memory Direct memory access (Threads) Mapped memory (Processes)
In a Non-Distributed System: Two approaches to communication:
Shared memory Direct memory access (Threads) Mapped memory (Processes) Message passing OSs IPC mechanisms
Slide 33
Slide 35
Shared Memory:
Message Passing:
x=12
i=x
Slide 34
Process A
Process B
Slide 36
Shared memory
Process A
Process B
Address space 1
Address space 2
Address space 1
Address space 2
C OMMUNICATION
17
C OMMUNICATION IN A D ISTRIBUTED S YSTEM
18
C OMMUNICATION
IN A
D ISTRIBUTED S YSTEM C OMMUNICATION M ODES
Data oriented vs control oriented communication
Previous slides assumed a uniprocessor or a multiprocessor. In a distributed system (multicomputer) things change: Shared Memory: Slide 37
There is no way to physically share memory
Slide 39
Synchronous vs asynchronous communication Transient vs persistent communication Provider-initiated vs consumer-initiated communication Direct -addressing vs indirect-addressing communication
Message Passing:
Over the network Introduces latencies Introduces higher chances of failure Heterogeneity introduces possible incompatibilities
M ESSAGE PASSING
Basics:
send() receive()
Variations:
Connection oriented vs Connectionless
Data-Oriented vs Control-Oriented Communication: Data-oriented communication
Facilitates data exchange between threads Shared address space, shared memory & message passing
Slide 38
Point-to-point vs Group Synchronous vs Asynchronous Buffered vs Unbuffered Reliable vs Unreliable Message ordering guarantees
Slide 40
Control-oriented communication
Associates a transfer of control with communication Active messages, remote procedure call (RPC) & remote method invocation (RMI)
Data Representation:
Marshalling Endianness
C OMMUNICATION M ODES
19
C OMMUNICATION M ODES
20
Synchronous vs Asynchronous Communication: Synchronous
Sender blocks until message received Often sender blocked until message is processed and a reply received Sender and receiver must be active at the same time Receiver waits for requests, processes them (ASAP), and returns reply Client-Server generally uses synchronous communication
Provider-Initiated vs Consumer-Initiated Communication: Provider-Initiated
Message sent when data is available
Slide 41
Slide 43
Example: notications
Consumer-Initiated
Request sent for data Example: HTTP request
Asynchronous
Sender continues execution after sending message (does not block waiting for reply) Message may be queued if receiver not active Message may be processed later at receivers convenience
When is Synchronous suitable? Asynchronous?
Transient vs Persistent Communication: Transient
Message discarded if cannot be delivered to receiver immediately
Direct-Addressing vs Indirect-Addressing Communication: Direct-Addressing
Message sent directly to receiver Example: HTTP request
Slide 42
Example: HTTP request
Slide 44
Persistent
Message stored (somewhere) until receiver can accept it Example: email
Indirect-Addressing
Message not sent to a particular receiver Example: broadcast, publish/subscribe
Coupling: Time
Coupling: Space
C OMMUNICATION M ODES
21
C OMMUNICATION M ODES
22
Combinations:
A A Accepted B Persistent Asynchronous B Persistent Synchronous A ACK B Transient Synchronous (Receipt Based) Starts processing request Request Received B Transient Synchronous (Delivery Based) Accepted A Request Received B Transient Synchronous (Response Based) Accepted B Transient Asynchronous A Message can be sent only if B is running
M ESSAGE -O RIENTED C OMMUNICATION
Communication models based on message passing Traditional send()/receive() provides:
Asynchronous and Synchronous communication
Slide 45
Slide 47
Transient communication
What more does it provide than send()/receive()?
Persistent communication (Message queues) Hides implementation details Marshalling
Examples?
C OMMUNICATION A BSTRACTIONS
Abstractions above simple message passing make communication easier for the programmer. Provided by higher level APIs Slide 46
Message-Oriented Communication Request-Reply, Remote Procedure Call (RPC) & Remote Method Invocation (RMI) Group Communication Event-based Communication Shared Space
Example: Message Passing Interface (MPI):
Designed for parallel applications Makes use of available underlying network Tailored to transient communication No persistent communication Primitives for all forms of transient communication Group communication
Slide 48
MPI is BIG. Standard reference has over 100 functions and is over 350 pages long!
M ESSAGE -O RIENTED C OMMUNICATION
23
M ESSAGE -O RIENTED C OMMUNICATION
24
MPI primitives:
Basic queue interface:
Slide 49
Slide 51
Message Queue Architecture Example:
M ESSAGE Q UEUING S YSTEMS
Provides:
Persistent communication Message Queues Transfer of messages between queues
Sender A Application Receive queue Message
Application
R2
Slide 50
Model:
Application-specic queues Messages addressed to specic queues Only guarantee delivery to queue. Not when. Message transfer can be in the order of minutes
Slide 52
Send queue Application
R1 Receiver B Application Router
Very similar to email but more general purpose (i.e., enables communication between applications and not just people)
M ESSAGE Q UEUING S YSTEMS
25
M ESSAGE Q UEUING S YSTEMS
26
Examples: IBM MQSeries
Message channels connect queue managers Message Channel Agent (MCA) manages a channel end (sends and receives transport level messages)
R EMOTE P ROCEDURE CALL (RPC)
Idea: Replace I/O oriented message passing model by execution of a procedure call on a remote node [BN84]:
Synchronous - based on blocking messages
Slide 53
Source and destination MCAs must be running to use channel Queue manager responsible for routing
Slide 55
Message-passing details hidden from application Procedure call parameters used to transmit data Client calls local stub which does messaging and marshalling
Java Message Service
Java API to messaging system (e.g., MQ Series) Implementation of own messaging system Provides Point-to-point (usual message queues) and Publish/Subscribe
Confusion of local and remote operations can be dangerous. More on that later...
R EQUEST-R EPLY
Request:
a service data
RPC Implementation:
Client machine Server machine
Client process
Server process Implementation of inc
Slide 54
Reply:
result of executing service data
Slide 56
j = inc(i);
2
proc: "inc" int: val(i)
Client stub
Server stub
j = inc(i); proc: "inc" int: val(i)
Requirement:
Message formatting Protocol
Client OS Message proc: "inc" int: val(i)
Server OS
R EMOTE P ROCEDURE CALL (RPC)
27
R EMOTE P ROCEDURE CALL (RPC)
28
RPC Implementation:
client calls client stub (normal procedure call) client stub packs parameters into message data structure client stub performs send() syscall and blocks kernel transfers message to remote kernel remote kernel delivers to server stub, blocked in receive() server stub unpacks message, calls server (normal proc call) server returns to stub, which packs result into message server stub performs send() syscall kernel delivers to client stub, which unpacks and returns
Example server stub in Erlang: % increment implementation inc (Value) -> Value + 1. % RPC Server dispatch loop server () -> receive {From, inc, Value} -> From ! {self(), inc, inc(Value)} end, server().
Slide 57
Slide 59
Example client stub in Erlang: % Client code using RPC stub client (Server) -> register(server, Server), Result = inc (10), io:format ("Result: ~w~n", [Result]). Slide 58 % RPC stub for the increment server inc (Value) -> server ! {self (), inc, Value}, receive {From, inc, Reply} -> Reply end. Slide 60 Parameter marshalling:
stub must pack (marshal) parameters into message structure message data must be pointer free (by-reference data must be passed by-value) may have to perform other conversions: byte order (big endian vs little endian) oating point format dealing with pointers convert everything to standard (network) format, or message indicates format, receiver converts if necessary stubs may be generated automatically from interface specs
R EMOTE P ROCEDURE CALL (RPC)
29
R EMOTE P ROCEDURE CALL (RPC)
30
Sun RPC - client code: #include <rpc/rpc.h> /* standard RPC include file */ #include "date.h" /* this file is generated by rpcgen */ ... main(int argc, char **argv) { CLIENT *cl; /* RPC handle */ ... cl = clnt_create(argv[1], DATE_PROG, DATE_VERS, "udp"); Slide 63 lresult = bin_date_1(NULL, cl); printf("time on host %s = %ld\n", server, *lresult); sresult = str_date_1(lresult, cl); printf("time on host %s = %s", server, *sresult); clnt_destroy(cl); } /* done with the handle */
Examples of RPC frameworks:
SUN RPC (aka ONC RPC): Internet RFC1050 (V1), RFC1831 (V2) Based on XDR data representation (RFC1014)(RFC1832) Basis of standard distributed services, such as NFS and NIS Distributed Computing Environment (DCE) RPC
Slide 61
XML (data representation) and HTTP (transport) Text-based data stream is easier to debug HTTP simplies integration with web servers and works through rewalls For example, XML-RPC (lightweight) and SOAP (more powerful, but often unnecessarily complex)
Sun RPC - server code: #include <rpc/rpc.h> #include "date.h" /* standard RPC include file */ /* this file is generated by rpcgen */
Sun RPC - interface denition: program DATE_PROG { version DATE_VERS { long BIN_DATE(void) = 1; string STR_DATE(long) = 2; } = 1; } = 0x31234567;
Slide 62
/* /* /* /*
proc num = 1 */ proc num = 2 */ version = 1 */ prog num */
Slide 64
long * bin_date_1() { static long timeval; /* must be static */ long time(); /* Unix function */ timeval = time((long *) 0); return(&timeval); } char ** str_date_1(long *bintime) { static char *ptr; /* must be static */ char *ctime(); /* Unix function */ ptr = ctime(bintime); /* convert to local time */ return(&ptr); /* return the address of pointer */ }
R EMOTE P ROCEDURE CALL (RPC)
31
DYNAMIC B INDING
32
R EMOTE M ETHOD I NVOCATION (RMI) DYNAMIC B INDING
How to locate a service?
Well-known naming service, binder: register(name, version, handle, UID) deregister(name, version, UID) lookup(name, version) (handle, UID) handle is some physical address (IP address, process ID, ...) UID is used to distinguish between servers offering the same service
The transition from Remote Procedure Call (RPC) to Remote Method Invocation (RMI) is a transition from the server metaphor to the object metaphor. Why is this important? Slide 67
RPC: explicit handling of host identication to determine the destination RMI: addressed to a particular object Objects are rst-class citizens Can pass object references as parameters More natural resource management and error handling But still, only a small evolutionary step
Slide 65
A SYNCHRONOUS RPC
Client Call remote procedure Request Wait for result Return from call Reply Time Server Client Wait for acceptance Return from call Accept request Time
T RANSPARENCY CAN B E DANGEROUS
Why is the transparency provided by RPC and RMI dangerous?
Remote operations can fail in different ways Remote operations can have arbitrary latency Remote operations have a different memory access model
Call remote procedure Request
Slide 66
Server
Call local procedure and return results (a)
Call local procedure (b)
Slide 68
Remote operations can involve concurrency in subtle ways
What happens if this is ignored?
Unreliable services and applications Limited scalability Bad performance
When no reply is required When reply isnt needed immediately (2 asynchronous RPCs deferred synchronous RPC)
See A note on distributed computing [WWWK94]
R EMOTE M ETHOD I NVOCATION (RMI)
33
G ROUP -B ASED C OMMUNICATION
34
G ROUP -B ASED C OMMUNICATION
machine B machine C
G OSSIP -B ASED C OMMUNICATION
Technique that relies on epidemic behaviour, e.g. spreading diseases among people. Variant: rumour spreading, or gossiping. When node P receives data item x, it tries to push it to arbitrary node Q. If x is new to Q, then P keeps on spreading x to other nodes. If node Q already has x, P stops spreading x with certain probability. Analogy from real life: Spreading rumours among people.
Slide 69
machine A machine D machine E
Slide 71
Sender performs a single send()
What are the difculties with group communication?
Two kinds of group communication:
Broadcast (messgae sent to everyone) Multicast (message sent to specic group)
E VENT-B ASED C OMMUNICATION
Communication through propagation of events Generally associated with publish/subscribe systems Sender process publishes events Receiver process subscribes to events and receives only the ones it is interested in.
Used for:
Replication of services Replication of data Service discovery
Slide 70
Event notication
Slide 72
Issues:
Reliability Ordering
Component Event delivery Publish Component
Component
Event bus
Example:
IP multicast Flooding
G OSSIP -B ASED C OMMUNICATION
35
S HARED S PACE
36
S HARED S PACE
Distributed Shared Memory:
Shared global address space
10 11 12 13 14 15 16
R EADING L IST
Slide 75 Implementing Remote Procedure Calls A classic paper about the design and implementation of one of the rst RPC systems.
Slide 73
0 9 2 5 1 8 3 10 6 4 12 7 14
11
13
15
16
CPU 1
CPU 2
CPU 3
CPU 4
Tuple Space:
A Write A B Write B T Read T C Look for tuple that matches T A B C Return C (and optionally remove it)
Slide 74
Insert a copy of A
Insert a copy of B B B
A Tuple instance
A JavaSpace
R EADING L IST
37
R EADING L IST
38