metal: fix buffer allocation for discrete (non-unified) GPUs by Scottcjn · Pull Request #20615 · ggml-org/llama.cpp

Scottcjn · 2026-03-16T03:09:48Z

Summary

Fixes Metal backend crash on Macs with discrete AMD GPUs (2013-2019 Mac Pro, iMac with discrete Radeon).

Before: GGML_ASSERT(buf_src) crash on model load — newBufferWithBytesNoCopy with StorageModeShared returns nil on discrete GPUs.

After: Model loads and runs inference on AMD FirePro D500.

Changes

In ggml-metal-device.m, the set_tensor and get_tensor functions unconditionally use newBufferWithBytesNoCopy with MTLResourceStorageModeShared for temporary blit buffers. This fails on discrete GPUs because they cannot access host memory directly.

The fix checks has_unified_memory (already tracked by the device init) and uses:

newBufferWithBytes + StorageModeManaged for set_tensor (copies data to GPU-accessible buffer)
newBufferWithLength + StorageModeManaged for get_tensor (allocates GPU-accessible buffer)
memcpy after blit completion to copy results back to host pointer

The existing NoCopy+Shared path is preserved for unified memory devices (Apple Silicon) via the has_unified_memory conditional.

Test Hardware

2013 Mac Pro ("trashcan") with 2x AMD FirePro D500 (3GB VRAM each)
macOS Monterey 12.7.6
MTLGPUFamilyCommon3 (3003)
has_unified_memory = false, use_shared_buffers = false

Benchmarks

Model	Backend	pp128 (t/s)	tg32 (t/s)
TinyLlama 1.1B Q4	Metal+BLAS	59.49	7.27
TinyLlama 1.1B Q4	CPU-only	62.50	47.40
Qwen2.5 3B Q4	Metal+BLAS	23.44	2.93
Qwen2.5 3B Q4	CPU-only	20.25	15.49

Metal wins prompt processing by 16% on the larger model. Generation is slower due to PCIe host-device copies for each token — expected behavior for discrete GPUs without unified memory.

Affected Hardware

This enables Metal inference on all Macs with discrete AMD GPUs:

Mac Pro 2013 (FirePro D300/D500/D700)
iMac 2014-2019 (Radeon Pro 555/560/575/580, Vega 48)
MacBook Pro 2015-2019 (Radeon Pro 450/455/460/555X/560X, Vega 16/20)

🤖 Generated with Claude Code

On Macs with discrete GPUs (e.g. 2013 Mac Pro with AMD FirePro D500), `newBufferWithBytesNoCopy` with `MTLResourceStorageModeShared` returns nil because discrete GPUs cannot directly access host memory. This patch conditionally uses copy-based buffer allocation for non-unified memory devices: - set_tensor: Use `newBufferWithBytes:options:StorageModeManaged` instead of `newBufferWithBytesNoCopy:options:StorageModeShared` - get_tensor: Use `newBufferWithLength:options:StorageModeManaged` and memcpy results back to host after blit completion Tested on: - 2013 Mac Pro (trashcan) with 2x AMD FirePro D500 (3GB each) - macOS Monterey 12.7.6, MTLGPUFamilyCommon3 (3003) - TinyLlama 1.1B Q4: pp128=59.49 t/s, tg32=7.27 t/s - Qwen2.5 3B Q4: pp128=23.44 t/s (16% faster than CPU-only) - Without patch: GGML_ASSERT(buf_src) crash on model load No impact on unified memory devices (Apple Silicon) — the existing NoCopy+Shared path is preserved via `has_unified_memory` check. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

Added GPU detection and Metal status for discrete AMD GPUs: - _detect_gpu_info(): Detects discrete vs integrated GPUs on macOS - /status command: Shows GPU model, type, and Metal support status - apply-metal-fix.sh: Script to apply StorageModeManaged patch to llama.cpp - Tests: 12 test functions covering GPU detection, Metal support, error handling Tested hardware: - Mac Pro 2013 (AMD FirePro D500/D700) - iMac 2014-2019 (AMD Radeon) - MacBook Pro 2015-2019 (AMD Radeon Pro) Reference: ggml-org/llama.cpp#20615 Wallet: 0xc0A7b634A8b5Ff3a7F310D7cCC786FE3f6270B1f7

- apply-metal-fix.sh: Applies StorageModeManaged fix to llama.cpp - Reference: ggml-org/llama.cpp#20615 - Supports: Mac Pro 2013, iMac 2014-2019, MacBook Pro 2015-2019 - Wallet: 0xc0A7b634A8b5Ff3a7F310D7cCC786FE3f6270B1f7

* feat: Add Metal GPU detection and status display (#38) Added GPU detection and Metal status for discrete AMD GPUs: - _detect_gpu_info(): Detects discrete vs integrated GPUs on macOS - /status command: Shows GPU model, type, and Metal support status - apply-metal-fix.sh: Script to apply StorageModeManaged patch to llama.cpp - Tests: 12 test functions covering GPU detection, Metal support, error handling Tested hardware: - Mac Pro 2013 (AMD FirePro D500/D700) - iMac 2014-2019 (AMD Radeon) - MacBook Pro 2015-2019 (AMD Radeon Pro) Reference: ggml-org/llama.cpp#20615 Wallet: 0xc0A7b634A8b5Ff3a7F310D7cCC786FE3f6270B1f7 * Add /pipe command and enhance generation stats display (#64, #66) Features implemented: - /pipe command: Save last assistant response to file - Usage: /pipe [filename] - Auto-generates timestamp-based filename if not provided - Shows file path, size, and line count after saving - Enhanced /status command: Shows last generation stats - Displays [X.X tok/s | XXX tokens | X.Xs] format - Shows both session stats and last turn stats Implementation details: - Uses existing LAST_ASSISTANT_RESPONSE variable - Uses existing LAST_GENERATION_STATS and SESSION_STATS - No external dependencies added - Fixed global declaration syntax error Wallet: 孙备 * docs: clarify pyreadline3 is required for Windows readline/tab support - Changed comment from 'Optional' to 'Required' - Added note explaining functionality loss without it - Fixes #69 * feat: Add Metal GPU fix script for discrete AMD GPUs (#38) - apply-metal-fix.sh: Applies StorageModeManaged fix to llama.cpp - Reference: ggml-org/llama.cpp#20615 - Supports: Mac Pro 2013, iMac 2014-2019, MacBook Pro 2015-2019 - Wallet: 0xc0A7b634A8b5Ff3a7F310D7cCC786FE3f6270B1f7 --------- Co-authored-by: 孙备 <[email protected]>

Scottcjn requested a review from a team as a code owner March 16, 2026 03:09

github-actions bot added ggml changes relating to the ggml tensor library for machine learning Apple Metal https://en.wikipedia.org/wiki/Metal_(API) labels Mar 16, 2026

ggerganov closed this Mar 16, 2026

Scottcjn mentioned this pull request Mar 16, 2026

Metal GPU acceleration for discrete AMD Macs (15 RTC) Scottcjn/trashclaw#38

Closed

jujujuda mentioned this pull request Mar 16, 2026

AI: Metal GPU acceleration for discrete AMD Macs (15 R Scottcjn/trashclaw#42

Closed

597226617 mentioned this pull request Mar 27, 2026

[BOUNTY #38] Metal GPU acceleration for discrete AMD Macs Scottcjn/trashclaw#178

Merged

7 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

metal: fix buffer allocation for discrete (non-unified) GPUs#20615

metal: fix buffer allocation for discrete (non-unified) GPUs#20615
Scottcjn wants to merge 1 commit intoggml-org:masterfrom
Scottcjn:fix-metal-discrete-gpu

Scottcjn commented Mar 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Scottcjn commented Mar 16, 2026

Summary

Changes

Test Hardware

Benchmarks

Affected Hardware

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants