Skip to content

metal: fix buffer allocation for discrete (non-unified) GPUs#20615

Closed
Scottcjn wants to merge 1 commit intoggml-org:masterfrom
Scottcjn:fix-metal-discrete-gpu
Closed

metal: fix buffer allocation for discrete (non-unified) GPUs#20615
Scottcjn wants to merge 1 commit intoggml-org:masterfrom
Scottcjn:fix-metal-discrete-gpu

Conversation

@Scottcjn
Copy link
Copy Markdown

Summary

Fixes Metal backend crash on Macs with discrete AMD GPUs (2013-2019 Mac Pro, iMac with discrete Radeon).

Before: GGML_ASSERT(buf_src) crash on model load — newBufferWithBytesNoCopy with StorageModeShared returns nil on discrete GPUs.

After: Model loads and runs inference on AMD FirePro D500.

Changes

In ggml-metal-device.m, the set_tensor and get_tensor functions unconditionally use newBufferWithBytesNoCopy with MTLResourceStorageModeShared for temporary blit buffers. This fails on discrete GPUs because they cannot access host memory directly.

The fix checks has_unified_memory (already tracked by the device init) and uses:

  • newBufferWithBytes + StorageModeManaged for set_tensor (copies data to GPU-accessible buffer)
  • newBufferWithLength + StorageModeManaged for get_tensor (allocates GPU-accessible buffer)
  • memcpy after blit completion to copy results back to host pointer

The existing NoCopy+Shared path is preserved for unified memory devices (Apple Silicon) via the has_unified_memory conditional.

Test Hardware

  • 2013 Mac Pro ("trashcan") with 2x AMD FirePro D500 (3GB VRAM each)
  • macOS Monterey 12.7.6
  • MTLGPUFamilyCommon3 (3003)
  • has_unified_memory = false, use_shared_buffers = false

Benchmarks

Model Backend pp128 (t/s) tg32 (t/s)
TinyLlama 1.1B Q4 Metal+BLAS 59.49 7.27
TinyLlama 1.1B Q4 CPU-only 62.50 47.40
Qwen2.5 3B Q4 Metal+BLAS 23.44 2.93
Qwen2.5 3B Q4 CPU-only 20.25 15.49

Metal wins prompt processing by 16% on the larger model. Generation is slower due to PCIe host-device copies for each token — expected behavior for discrete GPUs without unified memory.

Affected Hardware

This enables Metal inference on all Macs with discrete AMD GPUs:

  • Mac Pro 2013 (FirePro D300/D500/D700)
  • iMac 2014-2019 (Radeon Pro 555/560/575/580, Vega 48)
  • MacBook Pro 2015-2019 (Radeon Pro 450/455/460/555X/560X, Vega 16/20)

🤖 Generated with Claude Code

On Macs with discrete GPUs (e.g. 2013 Mac Pro with AMD FirePro D500),
`newBufferWithBytesNoCopy` with `MTLResourceStorageModeShared` returns
nil because discrete GPUs cannot directly access host memory.

This patch conditionally uses copy-based buffer allocation for
non-unified memory devices:

- set_tensor: Use `newBufferWithBytes:options:StorageModeManaged`
  instead of `newBufferWithBytesNoCopy:options:StorageModeShared`
- get_tensor: Use `newBufferWithLength:options:StorageModeManaged`
  and memcpy results back to host after blit completion

Tested on:
- 2013 Mac Pro (trashcan) with 2x AMD FirePro D500 (3GB each)
- macOS Monterey 12.7.6, MTLGPUFamilyCommon3 (3003)
- TinyLlama 1.1B Q4: pp128=59.49 t/s, tg32=7.27 t/s
- Qwen2.5 3B Q4: pp128=23.44 t/s (16% faster than CPU-only)
- Without patch: GGML_ASSERT(buf_src) crash on model load

No impact on unified memory devices (Apple Silicon) — the existing
NoCopy+Shared path is preserved via `has_unified_memory` check.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@Scottcjn Scottcjn requested a review from a team as a code owner March 16, 2026 03:09
@github-actions github-actions bot added ggml changes relating to the ggml tensor library for machine learning Apple Metal https://en.wikipedia.org/wiki/Metal_(API) labels Mar 16, 2026
@ggerganov ggerganov closed this Mar 16, 2026
597226617 pushed a commit to 597226617/trashclaw that referenced this pull request Mar 21, 2026
Added GPU detection and Metal status for discrete AMD GPUs:
- _detect_gpu_info(): Detects discrete vs integrated GPUs on macOS
- /status command: Shows GPU model, type, and Metal support status
- apply-metal-fix.sh: Script to apply StorageModeManaged patch to llama.cpp
- Tests: 12 test functions covering GPU detection, Metal support, error handling

Tested hardware:
- Mac Pro 2013 (AMD FirePro D500/D700)
- iMac 2014-2019 (AMD Radeon)
- MacBook Pro 2015-2019 (AMD Radeon Pro)

Reference: ggml-org/llama.cpp#20615

Wallet: 0xc0A7b634A8b5Ff3a7F310D7cCC786FE3f6270B1f7
597226617 pushed a commit to 597226617/trashclaw that referenced this pull request Mar 27, 2026
- apply-metal-fix.sh: Applies StorageModeManaged fix to llama.cpp
- Reference: ggml-org/llama.cpp#20615
- Supports: Mac Pro 2013, iMac 2014-2019, MacBook Pro 2015-2019
- Wallet: 0xc0A7b634A8b5Ff3a7F310D7cCC786FE3f6270B1f7
Scottcjn pushed a commit to Scottcjn/trashclaw that referenced this pull request Mar 27, 2026
* feat: Add Metal GPU detection and status display (#38)

Added GPU detection and Metal status for discrete AMD GPUs:
- _detect_gpu_info(): Detects discrete vs integrated GPUs on macOS
- /status command: Shows GPU model, type, and Metal support status
- apply-metal-fix.sh: Script to apply StorageModeManaged patch to llama.cpp
- Tests: 12 test functions covering GPU detection, Metal support, error handling

Tested hardware:
- Mac Pro 2013 (AMD FirePro D500/D700)
- iMac 2014-2019 (AMD Radeon)
- MacBook Pro 2015-2019 (AMD Radeon Pro)

Reference: ggml-org/llama.cpp#20615

Wallet: 0xc0A7b634A8b5Ff3a7F310D7cCC786FE3f6270B1f7

* Add /pipe command and enhance generation stats display (#64, #66)

Features implemented:
- /pipe command: Save last assistant response to file
  - Usage: /pipe [filename]
  - Auto-generates timestamp-based filename if not provided
  - Shows file path, size, and line count after saving

- Enhanced /status command: Shows last generation stats
  - Displays [X.X tok/s | XXX tokens | X.Xs] format
  - Shows both session stats and last turn stats

Implementation details:
- Uses existing LAST_ASSISTANT_RESPONSE variable
- Uses existing LAST_GENERATION_STATS and SESSION_STATS
- No external dependencies added
- Fixed global declaration syntax error

Wallet: 孙备

* docs: clarify pyreadline3 is required for Windows readline/tab support

- Changed comment from 'Optional' to 'Required'
- Added note explaining functionality loss without it
- Fixes #69

* feat: Add Metal GPU fix script for discrete AMD GPUs (#38)

- apply-metal-fix.sh: Applies StorageModeManaged fix to llama.cpp
- Reference: ggml-org/llama.cpp#20615
- Supports: Mac Pro 2013, iMac 2014-2019, MacBook Pro 2015-2019
- Wallet: 0xc0A7b634A8b5Ff3a7F310D7cCC786FE3f6270B1f7

---------

Co-authored-by: 孙备 <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Apple Metal https://en.wikipedia.org/wiki/Metal_(API) ggml changes relating to the ggml tensor library for machine learning

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants