Ceph Storage in A World of AI - ML Workloads

The webinar on January 30, 2025, focuses on the role of Ceph Storage in AI/ML workloads, discussing performance needs, storage importance, and use cases. Presenters from SNIA, IBM Storage, and Canonical will cover topics such as AI workload lifecycles, checkpointing, and increasing storage efficiency. The event aims to highlight Ceph's scalability, hardware-agnostic nature, and its significance in optimizing storage for AI applications.


Ceph Storage in a World of

AI/ML Workloads
Live Webinar
January 30, 2025
10:00 am PT / 1:00 pm ET

1 | © SNIA. All Rights Reserved.


Today’s Presenters

▪ Michael Hoard, SNIA, CST Chair
▪ Kyle Bader, IBM Storage, Principal Portfolio Architect
▪ Phil Williams, Canonical, Product Manager

Cloud Storage Technologies (CST) Community

Committed to the adoption, growth, and standardization of intelligent data storage usage in cloud infrastructures. This encompasses data services, orchestration, and management, as well as the promotion of data portability in multi-cloud and hybrid cloud environments.

Learn more at snia.org/cloud

Agenda

▪ AI workloads and lifecycle
▪ Performance needs of training, checkpointing, and inference
▪ Importance of storage in AI infrastructure
▪ Why Ceph?
▪ Increasing storage efficiency
▪ Use cases
▪ Find out more



AI Workloads / Lifecycle

Raw data → Training data → Models → Results (with a retraining loop feeding results back into training)



Training

▪ Usually limited by:
▪ Network bandwidth
▪ Pre-processing
▪ Model architecture

▪ Typical GPU:
▪ Up to 4 petaFLOPS (FP8)
▪ 5 GB/s storage throughput recommended
▪ 20 GB/s recommended per reference system
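As a rough sketch of how these figures combine, assume the per-GPU recommendation scales linearly with GPU count; the 4-GPU count below is an assumption chosen to match the 20 GB/s reference-system figure, not something stated in the deck.

```python
# Sketch: aggregate storage throughput for a training system, assuming
# the per-GPU figure (5 GB/s) scales linearly with GPU count.
PER_GPU_GB_S = 5

def aggregate_gb_s(num_gpus, per_gpu=PER_GPU_GB_S):
    """Aggregate read throughput the storage tier should sustain."""
    return num_gpus * per_gpu

print(aggregate_gb_s(4))  # 20 — matches the reference-system figure
```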



LLMs: Granite 13b Data Pile

[Figure: training-data scale comparison — GPT-3: 45 TB of raw text filtered down to 570 GB, ~300 billion training tokens]

Sources: Granite Foundation Models; The Pile: An 800GB Dataset of Diverse Text for Language Modeling



Checkpointing

Model          Checkpoint Size (GiB)   Estimated Time @ 35 GB/s
Granite 13b    170                     5 s
Llama3 70b     913                     28 s
GPT3 175b      2282                    70 s
Llama3 405b    5281                    162 s
DLRM-2021 1t   13039                   400 s

Checkpoint size estimates based on use of the ADAM optimizer.

Reducing checkpointing times: Check-N-Run: a Checkpointing System for Training Deep Learning Recommendation Models
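The sizes above are consistent with roughly 14 bytes per parameter. The breakdown assumed below (2-byte weights plus a 4-byte master copy and two 4-byte Adam moments) is an inference that reproduces the table's figures, not something the deck states:

```python
# Sketch: estimate checkpoint size and write time. The 14 bytes/parameter
# breakdown (bf16 weights + fp32 master weights + two fp32 Adam moments)
# is an assumption that reproduces the table's figures.
BYTES_PER_PARAM = 2 + 4 + 4 + 4  # weights + master copy + Adam m and v

def checkpoint_gib(params: float) -> float:
    """Checkpoint size in GiB for a model with `params` parameters."""
    return params * BYTES_PER_PARAM / 2**30

def write_seconds(params: float, throughput_bytes_s: float = 35e9) -> float:
    """Time to write the checkpoint at the given storage throughput."""
    return params * BYTES_PER_PARAM / throughput_bytes_s

for name, p in [("Granite 13b", 13e9), ("Llama3 70b", 70e9), ("GPT3 175b", 175e9)]:
    print(f"{name}: {checkpoint_gib(p):.0f} GiB, {write_seconds(p):.0f} s")
```

Under this assumption the estimates round to the table's 170, 913, and 2282 GiB rows.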



Recommendation Systems

[Diagram: application events land in a lakehouse; feature engineering (filter/label/join) produces training splits; features are served to the serving engine]

“At Facebook’s datacenter fleet, for example, deep recommendation models consume more than 80% of the machine learning inference cycles and more than 50% of the training cycles.” [2]

[1] Understanding Data Storage and Ingestion for Large-Scale Deep Recommendation Model Training
[2] Check-N-Run: a Checkpointing System for Training Deep Learning Recommendation Models



Event Driven Inference

[Diagram: an object event triggers inference, and the result is written back by updating the object’s metadata — e.g. an automated X-ray analysis pipeline]
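A minimal sketch of such a handler, assuming notifications arrive in the S3-event format that Ceph RGW can emit; `run_model` is a hypothetical stand-in for the actual inference call:

```python
# Sketch of an event-driven inference handler. The event shape follows
# S3-style bucket notifications; run_model is a hypothetical placeholder.
def run_model(bucket, key):
    # Hypothetical inference call; returns labels to store as metadata.
    return {"finding": "normal", "confidence": "0.97"}

def handle_event(event):
    """Extract the object from an S3-style notification and return the
    metadata update that would be written back to the object."""
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]
    result = run_model(bucket, key)
    # In a real pipeline this would be applied with an S3 client, e.g.
    # copy_object(..., Metadata=result, MetadataDirective="REPLACE").
    return bucket, key, result

event = {"Records": [{"s3": {"bucket": {"name": "xray"},
                             "object": {"key": "scan-001.dcm"}}}]}
print(handle_event(event))
```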



Why is Storage Important?

[Diagram: the dataset is read from storage, processed, and results are written back — storage sits on both sides of every processing step]



Why is Storage Important?
Storage economics

▪ Performance: app expectations; business value
▪ Capacity: storage needs only increase over time
▪ Reliability: data cannot be lost; data must be available 24/7/365
▪ Cost: shrinking budgets; rising costs



Why Ceph?
Multi-protocol by default

▪ Block: Linux kernel, QEMU/KVM, NVMe/TCP
▪ File: Linux kernel, NFS, SMB
▪ Object: S3, IAM / STS, NFS



Why Ceph?
Hardware agnostic

▪ CPU and Memory: higher clock speeds; RAM for cluster ops
▪ Network: high-bandwidth, low-latency
▪ Media: capacity or performance


Why Ceph?
Scale from a few nodes to hundreds

[Diagram: read throughput doubles as nodes are added — 20 GB/s → 40 GB/s → 80 GB/s → 160 GB/s]

Learn more: https://ceph.io/en/news/blog/2024/ceph-a-journey-to-1tibps/



Accessibility

▪ Available now
▪ Proven in production
▪ Fully open source
▪ Download from https://ceph.io
▪ Source at https://github.com/ceph
▪ Docs at https://docs.ceph.com/en/squid/
▪ Support
▪ Community mailing list https://ceph.io/en/community/connect/
▪ Supported by a wide ecosystem of vendors and practitioners
▪ No speciality system for AI needed
▪ Just the correct planning and design



Increasing Storage Efficiency

▪ Compression
▪ Reduces TCO
▪ Applies to all storage media
▪ Can lead to CPU overheads
▪ Negatively affecting performance

▪ Hardware Accelerators
▪ On-die or PCIe add-in cards:
▪ Compression (RGW and Bluestore)
▪ SSL
▪ Ceph S3 object compression
▪ >250% increase in write throughput
▪ >180% increase in read throughput
▪ Minimal additional hardware cost
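As a hedged sketch, BlueStore inline compression is configured per pool, and RGW object compression per placement target; the pool name `mypool` and the `default` zone/placement names below are placeholders:

```shell
# Enable aggressive inline compression on a BlueStore-backed pool.
# "mypool" is a placeholder pool name.
ceph osd pool set mypool compression_mode aggressive
ceph osd pool set mypool compression_algorithm lz4

# RGW/S3 object compression is set on the placement target; the zone
# name, placement id, and algorithm choice here are assumptions.
radosgw-admin zone placement modify --rgw-zone=default \
    --placement-id=default-placement --compression=lz4
```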
Example Use Cases with Ceph

▪ Ceph cluster: 4 nodes (node01–node04), each with:
▪ 2x CPU (32 cores each)
▪ 512 GB RAM
▪ 2x 100GbE NICs (client and cluster networks)
▪ 24x TLC RI NVMe

▪ Workers: worker nodes with 4x GPUs, attached to the client network

▪ Results: average read 30 GB/s, average write 4.66 GB/s

[Diagram: worker nodes reach the Ceph nodes over the client network; a separate cluster network interconnects node01–node04]



Network

[Diagram: network design with low oversubscription between workers and storage nodes]



Where to Learn More About Ceph?

▪ Ceph.io
▪ Ceph Days
▪ Bengaluru, India - 23rd Jan 2025
▪ San Jose, USA - 25th March 2025
▪ London, UK - 4th June 2025
▪ Cephalocon
▪ 2024 hosted by CERN in Geneva, Switzerland - slides and recordings
▪ 2025 TBA
▪ SNIA Educational Library
▪ Ceph: The Linux of Storage Today



Takeaways

▪ GPUs are expensive; high utilisation is paramount for reducing TCO
▪ Ceph’s approach to scaling helps meet growth demands
▪ Network planning is key to scaling out
▪ Hardware-agnostic Ceph provides flexibility
▪ Pluggable architecture allows for integration with hardware offload(s)



Q&A





Thank you
