metal: fix buffer allocation for discrete (non-unified) GPUs#20615
Closed
Scottcjn wants to merge 1 commit intoggml-org:masterfrom
Closed
metal: fix buffer allocation for discrete (non-unified) GPUs#20615Scottcjn wants to merge 1 commit intoggml-org:masterfrom
Scottcjn wants to merge 1 commit intoggml-org:masterfrom
Conversation
On Macs with discrete GPUs (e.g. 2013 Mac Pro with AMD FirePro D500), `newBufferWithBytesNoCopy` with `MTLResourceStorageModeShared` returns nil because discrete GPUs cannot directly access host memory. This patch conditionally uses copy-based buffer allocation for non-unified memory devices: - set_tensor: Use `newBufferWithBytes:options:StorageModeManaged` instead of `newBufferWithBytesNoCopy:options:StorageModeShared` - get_tensor: Use `newBufferWithLength:options:StorageModeManaged` and memcpy results back to host after blit completion Tested on: - 2013 Mac Pro (trashcan) with 2x AMD FirePro D500 (3GB each) - macOS Monterey 12.7.6, MTLGPUFamilyCommon3 (3003) - TinyLlama 1.1B Q4: pp128=59.49 t/s, tg32=7.27 t/s - Qwen2.5 3B Q4: pp128=23.44 t/s (16% faster than CPU-only) - Without patch: GGML_ASSERT(buf_src) crash on model load No impact on unified memory devices (Apple Silicon) — the existing NoCopy+Shared path is preserved via `has_unified_memory` check. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
597226617
pushed a commit
to 597226617/trashclaw
that referenced
this pull request
Mar 21, 2026
Added GPU detection and Metal status for discrete AMD GPUs: - _detect_gpu_info(): Detects discrete vs integrated GPUs on macOS - /status command: Shows GPU model, type, and Metal support status - apply-metal-fix.sh: Script to apply StorageModeManaged patch to llama.cpp - Tests: 12 test functions covering GPU detection, Metal support, error handling Tested hardware: - Mac Pro 2013 (AMD FirePro D500/D700) - iMac 2014-2019 (AMD Radeon) - MacBook Pro 2015-2019 (AMD Radeon Pro) Reference: ggml-org/llama.cpp#20615 Wallet: 0xc0A7b634A8b5Ff3a7F310D7cCC786FE3f6270B1f7
597226617
pushed a commit
to 597226617/trashclaw
that referenced
this pull request
Mar 27, 2026
- apply-metal-fix.sh: Applies StorageModeManaged fix to llama.cpp - Reference: ggml-org/llama.cpp#20615 - Supports: Mac Pro 2013, iMac 2014-2019, MacBook Pro 2015-2019 - Wallet: 0xc0A7b634A8b5Ff3a7F310D7cCC786FE3f6270B1f7
7 tasks
Scottcjn
pushed a commit
to Scottcjn/trashclaw
that referenced
this pull request
Mar 27, 2026
* feat: Add Metal GPU detection and status display (#38) Added GPU detection and Metal status for discrete AMD GPUs: - _detect_gpu_info(): Detects discrete vs integrated GPUs on macOS - /status command: Shows GPU model, type, and Metal support status - apply-metal-fix.sh: Script to apply StorageModeManaged patch to llama.cpp - Tests: 12 test functions covering GPU detection, Metal support, error handling Tested hardware: - Mac Pro 2013 (AMD FirePro D500/D700) - iMac 2014-2019 (AMD Radeon) - MacBook Pro 2015-2019 (AMD Radeon Pro) Reference: ggml-org/llama.cpp#20615 Wallet: 0xc0A7b634A8b5Ff3a7F310D7cCC786FE3f6270B1f7 * Add /pipe command and enhance generation stats display (#64, #66) Features implemented: - /pipe command: Save last assistant response to file - Usage: /pipe [filename] - Auto-generates timestamp-based filename if not provided - Shows file path, size, and line count after saving - Enhanced /status command: Shows last generation stats - Displays [X.X tok/s | XXX tokens | X.Xs] format - Shows both session stats and last turn stats Implementation details: - Uses existing LAST_ASSISTANT_RESPONSE variable - Uses existing LAST_GENERATION_STATS and SESSION_STATS - No external dependencies added - Fixed global declaration syntax error Wallet: 孙备 * docs: clarify pyreadline3 is required for Windows readline/tab support - Changed comment from 'Optional' to 'Required' - Added note explaining functionality loss without it - Fixes #69 * feat: Add Metal GPU fix script for discrete AMD GPUs (#38) - apply-metal-fix.sh: Applies StorageModeManaged fix to llama.cpp - Reference: ggml-org/llama.cpp#20615 - Supports: Mac Pro 2013, iMac 2014-2019, MacBook Pro 2015-2019 - Wallet: 0xc0A7b634A8b5Ff3a7F310D7cCC786FE3f6270B1f7 --------- Co-authored-by: 孙备 <[email protected]>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes Metal backend crash on Macs with discrete AMD GPUs (2013-2019 Mac Pro, iMac with discrete Radeon).
Before:
GGML_ASSERT(buf_src)crash on model load —newBufferWithBytesNoCopywithStorageModeSharedreturns nil on discrete GPUs.After: Model loads and runs inference on AMD FirePro D500.
Changes
In
ggml-metal-device.m, theset_tensorandget_tensorfunctions unconditionally usenewBufferWithBytesNoCopywithMTLResourceStorageModeSharedfor temporary blit buffers. This fails on discrete GPUs because they cannot access host memory directly.The fix checks
has_unified_memory(already tracked by the device init) and uses:newBufferWithBytes+StorageModeManagedfor set_tensor (copies data to GPU-accessible buffer)newBufferWithLength+StorageModeManagedfor get_tensor (allocates GPU-accessible buffer)memcpyafter blit completion to copy results back to host pointerThe existing
NoCopy+Sharedpath is preserved for unified memory devices (Apple Silicon) via thehas_unified_memoryconditional.Test Hardware
MTLGPUFamilyCommon3(3003)has_unified_memory = false,use_shared_buffers = falseBenchmarks
Metal wins prompt processing by 16% on the larger model. Generation is slower due to PCIe host-device copies for each token — expected behavior for discrete GPUs without unified memory.
Affected Hardware
This enables Metal inference on all Macs with discrete AMD GPUs:
🤖 Generated with Claude Code