Ceph Storage in a World of AI/ML Workloads
Live Webinar
January 30, 2025
10:00 am PT / 1:00 pm ET
Today’s Presenters
▪ Michael Hoard, SNIA, CST Chair
▪ Kyle Bader, IBM Storage, Principal Portfolio Architect
▪ Phil Williams, Canonical, Product Manager
Cloud Storage Technologies (CST) Community
Committed to the adoption, growth and standardization of
intelligent data storage usage in cloud infrastructures.
This encompasses data services, orchestration and
management, as well as the promotion of portability of
data in multi-cloud and hybrid cloud environments.
Learn more at snia.org/cloud
Agenda
▪ AI workloads and lifecycle
▪ Performance needs of Training, Checkpointing and
Inference
▪ Importance of storage in AI infrastructure
▪ Why Ceph?
▪ Increasing storage efficiency
▪ Use cases
▪ Find out more
AI Workloads / Lifecycle
Raw data → Training data → Models → Results, with a retraining loop feeding results back into the pipeline
Training
▪ Usually limited by
▪ Network Bandwidth
▪ Pre-processing
▪ Model architecture
▪ Typical GPU
▪ Up to 4 petaFLOPS (FP8)
▪ 5 GB/s storage throughput recommended per GPU
▪ 20 GB/s recommended per reference system
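As a rough illustration of the sizing arithmetic (a sketch only: the 5 GB/s-per-GPU figure is the recommendation above, and the GPU counts are hypothetical):

```python
# Illustrative sizing arithmetic only. The 5 GB/s-per-GPU figure is the
# recommendation quoted on this slide; the GPU counts are hypothetical.

PER_GPU_GBPS = 5  # recommended storage throughput per GPU (GB/s)

def required_storage_throughput(num_gpus: int) -> int:
    """Aggregate storage throughput (GB/s) needed to keep num_gpus GPUs fed."""
    return num_gpus * PER_GPU_GBPS

for gpus in (4, 8, 32):
    print(f"{gpus:3d} GPUs -> {required_storage_throughput(gpus)} GB/s")
```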
LLMs: Granite 13b Data Pile
▪ GPT3: 45 TB of raw text filtered down to 570 GB, ~300 billion training tokens
▪ References: Granite Foundation Models; The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Checkpointing
Model          Checkpoint Size (GiB)   Estimated Time @ 35 GB/s
Granite 13b    170                     5 s
Llama3 70b     913                     28 s
GPT3 175b      2282                    70 s
Llama3 405b    5281                    162 s
DLRM-2021 1t   13039                   400 s
Checkpoint size estimates based on use of the ADAM optimizer
Reducing checkpointing times: Check-N-Run: a Checkpointing System for Training Deep Learning Recommendation Models
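The estimates above are simply checkpoint size divided by aggregate write throughput; a minimal sketch of that arithmetic (the sizes and the 35 GB/s figure come from the table, the code itself is illustrative):

```python
# Checkpoint stall time is roughly checkpoint size / aggregate write
# throughput. Sizes (GiB) and the 35 GB/s figure come from the table above.

GIB = 2**30        # bytes per GiB
THROUGHPUT = 35e9  # aggregate write throughput in bytes/s (35 GB/s)

checkpoints_gib = {
    "Granite 13b": 170,
    "Llama3 70b": 913,
    "GPT3 175b": 2282,
    "Llama3 405b": 5281,
    "DLRM-2021 1t": 13039,
}

for model, size_gib in checkpoints_gib.items():
    print(f"{model:>13}: {size_gib * GIB / THROUGHPUT:6.1f} s")
```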
Recommendation Systems
[Diagram: application events land in a Lakehouse, flow through feature engineering (filter/label/join), and the resulting feature splits feed serving]
“At Facebook’s datacenter fleet, for example, deep recommendation models consume more than 80% of the machine learning inference cycles and more than 50% of the training cycles.” [2]
[1] Understanding Data Storage and Ingestion for Large-Scale Deep Recommendation Model Training
[2] Check-N-Run: a Checkpointing System for Training Deep Learning Recommendation Models
Event Driven Inference
[Diagram: X-Ray Analysis Automated Pipeline, in which an object event triggers inference, which then updates the object’s metadata]
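A hedged sketch of wiring up such a pipeline with Ceph RGW's S3-compatible bucket notifications via boto3 (the endpoint URLs, credentials, and bucket/topic names are hypothetical placeholders):

```python
import boto3

# Sketch only: wires an RGW bucket notification to a hypothetical inference
# service. Endpoint URLs, credentials, and names are placeholders.
RGW = "http://rgw.example.com:8000"
session = boto3.Session(
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# RGW exposes an SNS-like API for notification topics; this topic pushes
# events to an HTTP endpoint (our hypothetical inference service).
sns = session.client("sns", endpoint_url=RGW, region_name="default")
topic = sns.create_topic(
    Name="xray-inference",
    Attributes={"push-endpoint": "http://inference.example.com/notify"},
)

# Emit an event whenever a new object (e.g. an X-ray image) is uploaded.
s3 = session.client("s3", endpoint_url=RGW)
s3.put_bucket_notification_configuration(
    Bucket="xrays",
    NotificationConfiguration={
        "TopicConfigurations": [{
            "Id": "on-upload",
            "TopicArn": topic["TopicArn"],
            "Events": ["s3:ObjectCreated:*"],
        }]
    },
)
```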
Why is Storage Important?
[Diagram: a processing job reads data from the dataset and writes results back to storage]
Why is Storage Important?
Storage economics
▪ Performance: app expectations, business value
▪ Capacity: storage needs only increase over time
▪ Reliability: data cannot be lost and must be available 24/7/365
▪ Cost: shrinking budgets, rising costs
Why Ceph?
Multi-protocol by default
▪ Block: Linux kernel, QEMU/KVM, NVMe/TCP
▪ File: Linux kernel, NFS, SMB
▪ Object: S3, IAM / STS, NFS
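All three front ends are layered on the same RADOS object store underneath; a minimal sketch of talking to that layer directly with the python3-rados bindings (the config path and pool name are placeholders, and the pool must already exist):

```python
import rados

# Minimal sketch with the python3-rados bindings: block, file and object
# access all sit on the RADOS layer used here. The conffile path and pool
# name are illustrative placeholders; the pool must already exist.
cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    ioctx = cluster.open_ioctx("mypool")
    ioctx.write_full("hello", b"stored once in RADOS")
    print(ioctx.read("hello"))
    ioctx.close()
finally:
    cluster.shutdown()
```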
Why Ceph?
Hardware agnostic
▪ CPU and memory: higher clock speeds, RAM for cluster ops
▪ Network: high bandwidth, low latency
▪ Media: capacity, performance
Why Ceph?
Scale from a few nodes to hundreds
[Diagram: aggregate read throughput doubling from 20 GB/s to 40, 80, and 160 GB/s as the cluster scales]
Learn more: https://ceph.io/en/news/blog/2024/ceph-a-journey-to-1tibps/
Accessibility
▪ Available now
▪ Proven in production
▪ Fully open source
▪ Download from https://ceph.io
▪ Source at https://github.com/ceph
▪ Docs at https://docs.ceph.com/en/squid/
▪ Support
▪ Community mailing list https://ceph.io/en/community/connect/
▪ Supported by a wide ecosystem of vendors and practitioners
▪ No speciality system for AI needed
▪ Just the correct planning and design
Increasing Storage Efficiency
▪ Compression
▪ Reduces TCO
▪ Applies to all storage media
▪ Can lead to CPU overheads
▪ Negatively affecting performance
▪ Hardware Accelerators
▪ On-die or PCIe add-in cards:
▪ Compression (RGW and Bluestore)
▪ SSL
▪ Ceph S3 object compression
▪ >250% increase in write throughput
▪ >180% increase in read throughput
▪ Minimal additional hardware cost
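A back-of-the-envelope sketch of the TCO effect of compression (the 2:1 ratio and cost figure are hypothetical inputs, not measurements from this deck):

```python
# Hypothetical inputs only: illustrates why compression reduces TCO by
# raising effective capacity per dollar. Not figures from this deck.

raw_capacity_tb = 1000   # usable capacity before compression
cost_per_tb = 100.0      # hypothetical hardware cost per usable TB
compression_ratio = 2.0  # hypothetical average compressibility of the data

effective_tb = raw_capacity_tb * compression_ratio
print(f"effective capacity: {effective_tb:.0f} TB")
print(f"effective cost/TB:  ${cost_per_tb / compression_ratio:.2f}")
```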
Example Use Cases with Ceph
▪ Ceph cluster: 4 nodes (node01-node04), each with
  ▪ 2x CPU (32 cores each)
  ▪ 512GB RAM
  ▪ 2x 100GbE (one NIC on the client network, one on the cluster network)
  ▪ 24x TLC RI NVMe
▪ Workers: 4 worker nodes on the client network
  ▪ 4x GPUs
▪ Results: average read 30 GB/s, average write 4.66 GB/s
Network
[Diagram: network topology with low oversubscription]
Where to Learn More About Ceph?
▪ Ceph.io
▪ Ceph Days
▪ Bengaluru, India - 23rd Jan 2025
▪ San Jose, USA - 25th March 2025
▪ London, UK - 4th June 2025
▪ Cephalocon
▪ 2024 hosted by CERN in Geneva, Switzerland - slides and recordings
▪ 2025 TBA
▪ SNIA Educational Library
▪ Ceph: The Linux of Storage Today
Takeaways
▪ GPUs are expensive, high utilisation is paramount for
reducing TCO
▪ Ceph’s approach to scaling helps meet growth demands
▪ Network planning is key to scaling out
▪ Hardware-agnostic Ceph provides flexibility
▪ Pluggable architecture allows for integration with hardware
offload(s)
Q&A
Thank you