Ceph Storage in a World of AI/ML Workloads
Live Webinar
January 30, 2025
10:00 am PT / 1:00 pm ET
Today’s Presenters
▪ Michael Hoard, SNIA, CST Chair
▪ Kyle Bader, IBM Storage, Principal Portfolio Architect
▪ Phil Williams, Canonical, Product Manager
Cloud Storage Technologies (CST) Community
Committed to the adoption, growth and standardization of
intelligent data storage usage in cloud infrastructures.
This encompasses data services, orchestration and
management, as well as the promotion of portability of
data in multi-cloud and hybrid cloud environments.
Learn more at snia.org/cloud
Agenda
▪ AI workloads and lifecycle
▪ Performance needs of Training, Checkpointing and
Inference
▪ Importance of storage in AI infrastructure
▪ Why Ceph?
▪ Increasing storage efficiency
▪ Use cases
▪ Find out more
AI Workloads / Lifecycle
Raw data → Training data → Models → Results, with a retraining loop feeding results back into the pipeline
Training
▪ Usually limited by
▪ Network Bandwidth
▪ Pre-processing
▪ Model architecture
▪ Typical GPU
▪ Up to 4 petaFLOPS (FP8)
▪ 5 GB/s storage throughput recommended per GPU
▪ 20 GB/s recommended per reference system
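As a rough illustration of the sizing arithmetic (a sketch only: the 5 GB/s-per-GPU figure is the recommendation above, and the GPU counts are hypothetical):

```python
# Illustrative sizing arithmetic only. The 5 GB/s-per-GPU figure is the
# recommendation quoted on this slide; the GPU counts are hypothetical.

PER_GPU_GBPS = 5  # recommended storage throughput per GPU (GB/s)

def required_storage_throughput(num_gpus: int) -> int:
    """Aggregate storage throughput (GB/s) needed to keep num_gpus GPUs fed."""
    return num_gpus * PER_GPU_GBPS

for gpus in (4, 8, 32):
    print(f"{gpus:3d} GPUs -> {required_storage_throughput(gpus)} GB/s")
```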
LLMs: Granite 13b Data Pile
▪ GPT3: 45 TB of raw text filtered down to 570 GB, ~300 billion training tokens
▪ References: Granite Foundation Models; The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Checkpointing
Model          Checkpoint Size (GiB)   Estimated Time @ 35 GB/s
Granite 13b    170                     5 s
Llama3 70b     913                     28 s
GPT3 175b      2282                    70 s
Llama3 405b    5281                    162 s
DLRM-2021 1t   13039                   400 s
Checkpoint size estimates based on use of the ADAM optimizer
Reducing checkpointing times: Check-N-Run: a Checkpointing System for Training Deep Learning Recommendation Models
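The estimates above are simply checkpoint size divided by aggregate write throughput; a minimal sketch of that arithmetic (the sizes and the 35 GB/s figure come from the table, the code itself is illustrative):

```python
# Checkpoint stall time is roughly checkpoint size / aggregate write
# throughput. Sizes (GiB) and the 35 GB/s figure come from the table above.

GIB = 2**30        # bytes per GiB
THROUGHPUT = 35e9  # aggregate write throughput in bytes/s (35 GB/s)

checkpoints_gib = {
    "Granite 13b": 170,
    "Llama3 70b": 913,
    "GPT3 175b": 2282,
    "Llama3 405b": 5281,
    "DLRM-2021 1t": 13039,
}

for model, size_gib in checkpoints_gib.items():
    print(f"{model:>13}: {size_gib * GIB / THROUGHPUT:6.1f} s")
```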
Recommendation Systems
[Diagram: application events land in a Lakehouse, flow through feature engineering (filter/label/join), and the resulting feature splits feed serving]
“At Facebook’s datacenter fleet, for example, deep recommendation models consume more than 80% of the machine learning inference cycles and more than 50% of the training cycles.” [2]
[1] Understanding Data Storage and Ingestion for Large-Scale Deep Recommendation Model Training
[2] Check-N-Run: a Checkpointing System for Training Deep Learning Recommendation Models
Event Driven Inference
[Diagram: X-Ray Analysis Automated Pipeline, in which an object event triggers inference, which then updates the object’s metadata]
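A hedged sketch of wiring up such a pipeline with Ceph RGW's S3-compatible bucket notifications via boto3 (the endpoint URLs, credentials, and bucket/topic names are hypothetical placeholders):

```python
import boto3

# Sketch only: wires an RGW bucket notification to a hypothetical inference
# service. Endpoint URLs, credentials, and names are placeholders.
RGW = "http://rgw.example.com:8000"
session = boto3.Session(
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# RGW exposes an SNS-like API for notification topics; this topic pushes
# events to an HTTP endpoint (our hypothetical inference service).
sns = session.client("sns", endpoint_url=RGW, region_name="default")
topic = sns.create_topic(
    Name="xray-inference",
    Attributes={"push-endpoint": "http://inference.example.com/notify"},
)

# Emit an event whenever a new object (e.g. an X-ray image) is uploaded.
s3 = session.client("s3", endpoint_url=RGW)
s3.put_bucket_notification_configuration(
    Bucket="xrays",
    NotificationConfiguration={
        "TopicConfigurations": [{
            "Id": "on-upload",
            "TopicArn": topic["TopicArn"],
            "Events": ["s3:ObjectCreated:*"],
        }]
    },
)
```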
Why is Storage Important?
[Diagram: a processing job reads data from the dataset and writes results back to storage]
Why is Storage Important?
Storage economics
▪ Performance: app expectations, business value
▪ Capacity: storage needs only increase over time
▪ Reliability: data cannot be lost and must be available 24/7/365
▪ Cost: shrinking budgets, rising costs
Why Ceph?
Multi-protocol by default
▪ Block: Linux kernel, QEMU/KVM, NVMe/TCP
▪ File: Linux kernel, NFS, SMB
▪ Object: S3, IAM / STS, NFS
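All three front ends are layered on the same RADOS object store underneath; a minimal sketch of talking to that layer directly with the python3-rados bindings (the config path and pool name are placeholders, and the pool must already exist):

```python
import rados

# Minimal sketch with the python3-rados bindings: block, file and object
# access all sit on the RADOS layer used here. The conffile path and pool
# name are illustrative placeholders; the pool must already exist.
cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    ioctx = cluster.open_ioctx("mypool")
    ioctx.write_full("hello", b"stored once in RADOS")
    print(ioctx.read("hello"))
    ioctx.close()
finally:
    cluster.shutdown()
```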
Why Ceph?
Hardware agnostic
▪ CPU and memory: higher clock speeds, RAM for cluster ops
▪ Network: high bandwidth, low latency
▪ Media: capacity, performance
Why Ceph?
Scale from a few nodes to hundreds
[Diagram: aggregate read throughput doubling from 20 GB/s to 40, 80, and 160 GB/s as the cluster scales]
Learn more: https://ceph.io/en/news/blog/2024/ceph-a-journey-to-1tibps/
Accessibility
▪ Available now
▪ Proven in production
▪ Fully open source
▪ Download from https://ceph.io
▪ Source at https://github.com/ceph
▪ Docs at https://docs.ceph.com/en/squid/
▪ Support
▪ Community mailing list https://ceph.io/en/community/connect/
▪ Supported by a wide ecosystem of vendors and practitioners
▪ No speciality system for AI needed
▪ Just the correct planning and design
Increasing Storage Efficiency
▪ Compression
▪ Reduces TCO
▪ Applies to all storage media
▪ Can lead to CPU overheads
▪ Negatively affecting performance
▪ Hardware Accelerators
▪ On-die or PCIe add-in cards:
▪ Compression (RGW and Bluestore)
▪ SSL
▪ Ceph S3 object compression
▪ >250% increase in write throughput
▪ >180% increase in read throughput
▪ Minimal additional hardware cost
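A back-of-the-envelope sketch of the TCO effect of compression (the 2:1 ratio and cost figure are hypothetical inputs, not measurements from this deck):

```python
# Hypothetical inputs only: illustrates why compression reduces TCO by
# raising effective capacity per dollar. Not figures from this deck.

raw_capacity_tb = 1000   # usable capacity before compression
cost_per_tb = 100.0      # hypothetical hardware cost per usable TB
compression_ratio = 2.0  # hypothetical average compressibility of the data

effective_tb = raw_capacity_tb * compression_ratio
print(f"effective capacity: {effective_tb:.0f} TB")
print(f"effective cost/TB:  ${cost_per_tb / compression_ratio:.2f}")
```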
Example Use Cases with Ceph
▪ Ceph cluster: 4 nodes (node01-node04), each with
  ▪ 2x CPU (32 cores each)
  ▪ 512GB RAM
  ▪ 2x 100GbE (one NIC on the client network, one on the cluster network)
  ▪ 24x TLC RI NVMe
▪ Workers: 4 worker nodes on the client network
  ▪ 4x GPUs
▪ Results: average read 30 GB/s, average write 4.66 GB/s
Network
[Diagram: network topology with low oversubscription]
Where to Learn More About Ceph?
▪ Ceph.io
▪ Ceph Days
▪ Bengaluru, India - 23rd Jan 2025
▪ San Jose, USA - 25th March 2025
▪ London, UK - 4th June 2025
▪ Cephalocon
▪ 2024 hosted by CERN in Geneva, Switzerland - slides and recordings
▪ 2025 TBA
▪ SNIA Educational Library
▪ Ceph: The Linux of Storage Today
Takeaways
▪ GPUs are expensive, high utilisation is paramount for
reducing TCO
▪ Ceph’s approach to scaling helps meet growth demands
▪ Network planning is key to scaling out
▪ Hardware-agnostic Ceph provides flexibility
▪ Pluggable architecture allows for integration with hardware
offload(s)
Q&A
Thank you