CEPH:
A SCALABLE, HIGH-PERFORMANCE
DISTRIBUTED FILE SYSTEM
S. A. Weil, S. A. Brandt, E. L. Miller
D. D. E. Long, C. Maltzahn
U. C. Santa Cruz
OSDI 2006
Paper highlights
Yet another distributed file system using
object storage devices
Designed for scalability
Main contributions
1. Uses hashing to achieve distributed dynamic
metadata management
2. Pseudo-random data distribution function
replaces object lists
System objectives
Excellent performance and reliability
Unparalleled scalability thanks to
Distribution of metadata workload inside
metadata cluster
Use of object storage devices (OSDs)
Designed for very large systems
Petabyte scale (10⁶ gigabytes)
Characteristics of very large systems
Built incrementally
Node failures are the norm
Quality and character of workload changes over
time
SYSTEM OVERVIEW
System architecture
Key ideas
Decoupling data and metadata
Metadata management
Autonomic distributed object storage
System Architecture (I, II)
[Figures: clients, the metadata server cluster, and the cluster of OSDs]
Clients
Export a near-POSIX file system interface
Cluster of OSDs
Store all data and metadata
Communicate directly with clients
Metadata server cluster
Manages the namespace (files + directories)
Security, consistency and coherence
Key ideas
Separate data and metadata management tasks
Metadata cluster does not have object lists
Dynamic partitioning of metadata tasks inside
the metadata cluster
Avoids hot spots
Let OSDs handle file migration and replication
tasks
Decoupling data and metadata
Metadata cluster handles metadata operations
Clients interact directly with OSD for all file I/O
Low-level block allocation is delegated to OSDs
Other OSD-based file systems still require the
metadata cluster to hold object lists
Ceph uses a special pseudo-random data
distribution function (CRUSH)
Old School
[Figure: the client asks the metadata server cluster "File xyz?" and the
cluster replies with where to find the container objects]
Ceph with CRUSH
[Figure: the client asks the metadata server cluster "File xyz?" and the
cluster replies with how to find the container objects; the client then
uses CRUSH and the data provided by the MDS cluster to locate the file
itself]
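A minimal sketch of the contrast above, with hypothetical helper names; a bare hash stands in for CRUSH, which additionally consults the cluster map and placement rules:

import hashlib

# Old school: the MDS must persist a per-file list of object locations.
object_lists = {}   # file_id -> list of OSD ids (stored state)

def locate_old_school(file_id):
    return object_lists[file_id]          # lookup in stored metadata

# Ceph-style: any client computes the locations from the name alone.
# The bare hash below is a stand-in for CRUSH (an assumption, not the
# real algorithm), which also honors the cluster map and placement rules.
def locate_computed(file_id, num_osds, replicas=2):
    h = int.from_bytes(hashlib.sha1(file_id.encode()).digest()[:8], "big")
    return [(h + i) % num_osds for i in range(replicas)]

print(locate_computed("xyz", num_osds=10))   # same answer on every client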
Metadata management
Dynamic Subtree Partitioning
Lets Ceph dynamically share metadata
workload among tens or hundreds of metadata
servers (MDSs)
Sharing is dynamic and based on current access
patterns
Results in near-linear performance scaling in the
number of MDSs
Autonomic distributed object storage
Distributed storage handles data migration and
data replication tasks
Leverages the computational resources of OSDs
Achieves reliable, highly available, scalable object
storage
Reliable implies no data loss
Highly available implies being accessible
almost all the time
THE CLIENT
Performing an I/O
Client synchronization
Namespace operations
Performing an I/O
When a client opens a file
Sends a request to the MDS cluster
Receives the i-node number, the file size, the
striping strategy, and a capability
The capability specifies which operations on the file
are authorized (capabilities are not yet encrypted)
Client uses CRUSH to locate object replicas
Client releases capability at close time
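A hedged sketch of that open/read exchange; the field and method names below (OpenReply, stripe_unit, mds.open) are illustrative assumptions, not the paper's interface:

from dataclasses import dataclass

@dataclass
class Capability:                  # authorizes operations on one file
    inode: int
    allowed_ops: set               # e.g. {"read", "cache"}; not yet encrypted

@dataclass
class OpenReply:                   # what the MDS returns on open()
    inode: int
    size: int
    stripe_unit: int               # striping strategy: bytes per object
    cap: Capability

def open_file(mds, path, mode):
    # One round trip to the MDS cluster; all later file I/O bypasses it.
    return mds.open(path, mode)

def objects_for_range(reply, offset, length):
    # Map a byte range to object numbers via the striping strategy; the
    # client then locates each object's replicas with CRUSH, not the MDS.
    first = offset // reply.stripe_unit
    last = (offset + length - 1) // reply.stripe_unit
    return [(reply.inode, n) for n in range(first, last + 1)]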
Client synchronization (I)
POSIX requires
One-copy serializability
Atomicity of writes
When MDS detects conflicting accesses by
different clients to the same file
Revokes all caching and buffering permissions
Requires synchronous I/O to that file
Client synchronization (II)
Synchronization handled by OSDs
Locks can be used for writes spanning object
boundaries
Synchronous I/O operations have huge latencies
Many scientific workloads do a significant amount
of read-write sharing
POSIX extension lets applications
synchronize their concurrent accesses to a file
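The paper's example of such an extension is an O_LAZY open flag paired with explicit lazyio_propagate / lazyio_synchronize calls; the Python-flavored sketch below is purely illustrative (the real interface is a C library API):

def barrier():
    pass  # stand-in for an MPI-style barrier across the processes

def checkpoint(f, rank, data, slot_size):
    # Each process writes its own disjoint slot of a shared file, so it
    # can safely opt out of synchronous I/O and synchronize explicitly.
    f.seek(rank * slot_size)
    f.write(data)             # buffered: O_LAZY allows client-side caching
    f.lazyio_propagate()      # flush this client's writes to the OSDs
    barrier()                 # wait until every process has propagated
    f.lazyio_synchronize()    # see the other clients' propagated writes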
Namespace operations
Managed by the MDSs
Read and update operations are all synchronously
applied to the metadata
Optimized for common case
readdir returns the contents of the whole directory,
i-node attributes included (as NFS readdirplus does;
sketched below)
Guarantees serializability of all operations
Can be relaxed by application
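A small sketch of that common case, with hypothetical names (readdir_plus, attr_cache); a dict stands in for the client's metadata cache:

attr_cache = {}

def readdir(mds, dirpath):
    entries = mds.readdir_plus(dirpath)          # names + i-node attributes
    for name, attrs in entries:
        attr_cache[f"{dirpath}/{name}"] = attrs  # prime the client cache
    return [name for name, _ in entries]

def stat(path):
    # Common case (e.g. ls -l): served from cache, no extra MDS round trip.
    return attr_cache[path]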
THE MDS CLUSTER
Storing metadata
Dynamic subtree partitioning
Mapping subdirectories to MDSs
Storing metadata
Most requests likely to be satisfied from the MDS in-memory cache
Each MDS logs its update operations in a lazily flushed journal
(sketched below)
Facilitates recovery
Directories
Include the i-nodes of their files
Stored on the OSD cluster
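A minimal sketch of the journaling scheme, assuming a made-up MDSJournal structure rather than the paper's on-disk format:

class MDSJournal:
    def __init__(self):
        self.entries = []             # in-memory tail, lazily flushed to OSDs

    def log(self, update):
        self.entries.append(update)   # cheap sequential append

    def flush(self):
        pass  # write the accumulated tail to the OSD cluster in one large I/O

    def replay(self, apply):
        for update in self.entries:   # recovery path after an MDS crash
            apply(update)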
Dynamic subtree partitioning
Ceph uses a primary-copy approach to cached
metadata management
Ceph adaptively distributes cached metadata
across MDS nodes
Each MDS measures the popularity of the metadata
within its directory hierarchy
Ceph migrates and/or replicates hot spots
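A hedged sketch of popularity-driven migration; the decay constant, threshold, and peer interface (load, adopt_subtree) are assumptions:

DECAY = 0.9
HOT_THRESHOLD = 10_000.0
popularity = {}                       # directory -> decayed access counter

def record_access(directory):
    popularity[directory] = popularity.get(directory, 0.0) + 1.0

def rebalance(peers):
    for directory in list(popularity):
        popularity[directory] *= DECAY              # forget stale load
        if popularity[directory] > HOT_THRESHOLD:
            target = min(peers, key=lambda p: p.load)
            target.adopt_subtree(directory)         # migrate the hot spot
            del popularity[directory]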
Mapping subdirectories to MDSs
DISTRIBUTED OBJECT STORAGE
Data distribution with CRUSH
Replication
Data safety
Recovery and cluster updates
EBOFS
Data distribution with CRUSH (I)
Wanted to avoid storing object addresses in
MDS cluster
Ceph first maps objects into placement groups
(PGs) using a hash function
Placement groups are then assigned to OSDs
using a pseudo-random function (CRUSH)
Clients know that function
Data distribution with CRUSH (II)
To access an object, client needs to know
Its placement group
The OSD cluster map
The object placement rules used by CRUSH
Replication level
Placement constraints
How files are striped
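A sketch of the two-step placement; rendezvous hashing stands in for CRUSH here (an explicit substitution: real CRUSH also walks the hierarchical cluster map and applies the placement rules listed above):

import hashlib

NUM_PGS = 1024    # assumed placement-group count

def placement_group(object_id):
    # Step 1: hash the object name into a placement group.
    digest = hashlib.sha1(object_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PGS

def pg_to_osds(pg, osds, replicas):
    # Step 2: deterministic pseudo-random map from PG to OSDs.
    # Rank every OSD by a hash of (pg, osd) and keep the top `replicas`;
    # every client with the same cluster map computes the same list,
    # so no object list is ever stored.
    def score(osd):
        return hashlib.sha1(f"{pg}:{osd}".encode()).digest()
    return sorted(osds, key=score)[:replicas]

print(pg_to_osds(placement_group("inode123.0"), list(range(24)), 3))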
Replication
Ceph's Reliable Autonomic Distributed Object Store
(RADOS) autonomously manages object replication
The first non-failed OSD in an object's replication list
acts as the primary copy
Applies each update locally
Increments the object's version number
Propagates the update
Data safety
Achieved by update process
1. Primary forwards updates to other replicas
2. Sends an ACK to the client once all replicas have
received the update
Slower but safer
3. Replicas send a final commit once they have
committed the update to disk
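An illustrative primary-copy update path with the two acknowledgements above; class and method names are hypothetical:

class PrimaryOSD:
    def __init__(self, local_store, replicas):
        self.store = local_store
        self.replicas = replicas          # the other OSDs in the PG
        self.version = 0

    def update(self, obj_id, data, client):
        self.version += 1                 # bump the object's version number
        self.store.apply(obj_id, data, self.version)       # apply locally
        for r in self.replicas:           # 1. propagate the update
            r.apply_in_memory(obj_id, data, self.version)
        client.ack()        # 2. every replica holds the update (in memory)
        for r in self.replicas:
            r.flush_to_disk(obj_id)
        client.commit()     # 3. durable: the update survives OSD crashes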
Committing writes
Recovery and cluster updates
RADOS (Reliable Autonomic Distributed Object
Store) monitors OSDs to detect failures
Recovery handled by same mechanism as
deployment of new storage
Entirely driven by individual OSDs
Low-level storage management
Most DFS use an existing local file system to
manage low-level storage
Hard to tell when object updates are
safely committed to disk
Could use journaling or synchronous writes
Big performance penalty
Low-level storage management
Each Ceph OSD manages its local object
storage with EBOFS (Extent and B-Tree based
Object File System)
B-Tree service locates objects on disk
Block allocation is conducted in terms of
extents to keep data compact
Well-defined update semantics
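A toy first-fit extent allocator to illustrate the idea; EBOFS itself indexes objects and free extents with B-trees, for which a plain list stands in here:

free_extents = [(0, 1_000_000)]   # (start_block, length): the whole disk

def allocate(length):
    # Hand out one contiguous (start, length) run instead of many
    # scattered fixed-size blocks, keeping an object's data compact.
    for i, (start, free_len) in enumerate(free_extents):
        if free_len >= length:                        # first fit
            if free_len == length:
                free_extents.pop(i)
            else:
                free_extents[i] = (start + length, free_len - length)
            return (start, length)
    raise OSError("no contiguous extent large enough")

print(allocate(4096))   # -> (0, 4096)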
PERFORMANCE AND SCALABILITY
Want to measure
Cost of updating replicated data
Throughput and latency
Overall system performance
Scalability
Impact of MDS cluster size on latency
Impact of replication (I)
Impact of replication (II)
Transmission times dominate for large synchronized writes
File system performance
Scalability
Switch is saturated at 24 OSDs
Impact of MDS cluster size on latency
Conclusion
Ceph addresses three critical challenges of
modern DFS
Scalability
Performance
Reliability
Achieved through
Reducing the workload of the MDS cluster
CRUSH
Autonomous repairs by the OSDs