Neuroimaging is a powerful tool for understanding healthy brain function, pathology, and the consequences of aging, as well as for guiding and evaluating treatment. However, analyzing brain imaging data is compute intensive, and neuroimaging teams make large investments in the computers used for image processing. The objective of this page is to evaluate different architectures for analyzing brain imaging data.
For example, when purchasing a 32-core server, the candidates include the AMD EPYC 9374F (256 MB L3, 3.85-4.1 GHz, 320 W, $4,780) or 9384X (768 MB L3, 3.1-3.5 GHz, 320 W, $5,529), versus the Intel Sapphire Rapids-based Xeon Platinum 8462Y (60 MB cache, 2.8-4.1 GHz, 300 W, $5,945, 2-socket) or 8454H (82.5 MB cache, 2.1-3.7 GHz, 270 W, $6,540, 8-socket). Note the tradeoff between the AMD CPUs: one can choose the faster 9374F or the larger cache of the 9384X. A large cache benefits some tasks but not others, so it is worth evaluating how different vendors (AMD vs Intel) and different tradeoffs (cache versus speed) influence neuroimaging pipelines. Many 3D neuroimaging operations, such as spatial normalization, Gaussian blur, and matrix-based statistics, would seem to benefit from larger caches. However, many working sets are too large to fit into any cache, and the data is often read in a predictable, sequential order. Here we evaluate the Ryzen 7950X3D, which has 8 fast cores and 8 cores with access to a large 3D cache (using the same core architecture as the 9374F and 9384X); cores can be disabled independently, letting us directly measure the impact of cache versus speed on the tools of our field. We also test an Intel i9-12900KS, which uses the same core architecture as the 8462Y and 8454H.
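To illustrate why cache size can matter less than expected for these operations, the sketch below times a Gaussian blur (using SciPy's `gaussian_filter` as a stand-in for a pipeline smoothing step, not any tool benchmarked here) as the volume's working set grows past typical L3 cache sizes. Once the volume no longer fits in cache, throughput is governed mostly by memory bandwidth:

```python
# Minimal sketch: time a 3D Gaussian blur as the working set grows past
# typical L3 cache sizes. gaussian_filter stands in for the smoothing
# step found in many neuroimaging pipelines.
import time
import numpy as np
from scipy.ndimage import gaussian_filter

def blur_time(dim):
    """Seconds to blur a dim**3 float32 volume with sigma of 2 voxels."""
    vol = np.random.rand(dim, dim, dim).astype(np.float32)
    t0 = time.perf_counter()
    gaussian_filter(vol, sigma=2.0)
    return time.perf_counter() - t0

if __name__ == "__main__":
    for dim in (64, 128, 256):  # ~1 MB, ~8 MB, ~64 MB working sets
        mb = dim ** 3 * 4 / 2 ** 20
        print(f"{dim}^3 volume ({mb:.0f} MB): {blur_time(dim):.3f} s")
```

On a large-cache CPU, one would expect the smaller volumes to benefit disproportionately; on realistic full-brain volumes the working set exceeds even a 768 MB L3 once multiple images are in flight.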
Neuroimaging servers tend to use either Intel or AMD CPUs, as these benefit from a complete ecosystem of tools and have access to NVidia GPUs that dramatically accelerate some tasks (e.g. bedpost, eddy, probtrackx). We have previously described the potential for ARM-based Apple CPUs and include the Apple M2 CPU here for reference. Indeed, NVidia is introducing ARM-based CPUs paired with their CUDA GPUs, which may be a long-term game changer, but that is beyond the scope of this page.
While some of the tests below evaluate the influence of multiple threads on processing speed, it is worth emphasizing that most server-based neuroimaging is intentionally single threaded. Most neuroimaging studies acquire data from many individuals, and the data from each individual is processed independently on a single thread. The parallelism therefore comes from analyzing data from multiple individuals simultaneously, which makes the Single Threaded Pipeline sections below most relevant for server purchases.
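This subject-level parallelism can be sketched as follows; `process_subject` is a hypothetical stand-in for a real single-threaded pipeline invocation (e.g. a shell call to an FSL script):

```python
# Sketch of subject-level parallelism: each subject's (single-threaded)
# pipeline runs in its own process, so a many-core server processes many
# subjects at once rather than accelerating any one subject.
from concurrent.futures import ProcessPoolExecutor

def process_subject(subject_id: str) -> str:
    # placeholder for e.g. subprocess.run(["bash", "pipeline.sh", subject_id])
    return f"{subject_id}: done"

if __name__ == "__main__":
    subjects = [f"sub-{i:02d}" for i in range(1, 9)]
    with ProcessPoolExecutor(max_workers=8) as pool:
        for result in pool.map(process_subject, subjects):
            print(result)
```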
Several computers were tested:
- MacBook Pro 16 (2022) with Apple M2 Pro 12-core (8+4), 32 MB L2, maximum bandwidth 200 GB/s, 3.5 GHz, integrated GPU.
- MacBook Pro 14 (2024) with Apple M4 Pro 14-core (10+4), 36 MB L2, maximum bandwidth 273 GB/s, 4.5 GHz, integrated GPU.
- AMD Ryzen 7950X3D 16-core, 16 MB L2, 128 MB L3 (8 fast cores share 32 MB, 8 share 96 MB), 64 GB DDR5-6000, Ubuntu 23.04, 4.2-5.5 GHz, 83.2 GB/s, with a 12 GB NVidia RTX 4070 Ti. We refer to this chip as `7950x3d` when all 16 cores are enabled, `7950f` when only the 8 fast cores are enabled, and `7950c` when only the 8 cores with the extra 3D cache are enabled.
- AMD 7995WX 96-core, 96 MB L2, 384 MB L3, 2.5-5.1 GHz, 332.8 GB/s, with a 24 GB NVidia RTX 4090.
- Intel i9-12900KS 16-core (8 efficiency), 14 MB L2, 30 MB L3, 64 GB DDR4-3200, Ubuntu 23.04, 3.4-5.5 GHz, 51.2 GB/s.
AFNI is a popular software suite for brain imaging: a collection of executables written in C that can be compiled to run natively on Apple Silicon CPUs. AFNI provides a Bigger Speed Test. This benchmark tests 3dDeconvolve, where most of the computation can run in parallel, but where Amdahl's law incurs diminishing returns with an increasing number of threads. For the Ryzen 7950X3D, the faster cores outperform the large-cache cores.
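Amdahl's law itself is easy to sketch: if a fraction p of the work parallelizes across n threads, the predicted speedup is 1/((1-p)+p/n). The value p = 0.95 below is purely illustrative, not a measured property of 3dDeconvolve:

```python
# Amdahl's law: predicted speedup when a fraction p of the work
# parallelizes across n threads. Even at p = 0.95 (illustrative),
# speedup saturates well below the core count.
def amdahl_speedup(p: float, n: int) -> float:
    return 1.0 / ((1.0 - p) + p / n)

if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16, 32):
        print(f"{n:2d} threads: {amdahl_speedup(0.95, n):.2f}x")
```

With p = 0.95, 32 threads yield less than a 13x speedup, which is why adding cores shows rapidly diminishing returns in this benchmark.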
The graph below shows the total time to compute the AFNI_data6 s01.ap.simple script. This is a complete pipeline, analyzing fMRI activity observed in 452 volumes of data. Some programs like 3dClustSim use OpenMP acceleration, while many stages are single-threaded. This is a nice demonstration of real-world performance on a typical dataset.
| Rank | System/CPU | Time (sec) |
|---|---|---|
| 1 | AMD 7950f 8-core (fast) | 252 |
| 2 | AMD 7950c 8-core (big cache) | 273 |
| 3 | AMD 7950x3D 16-core | 283 |
| 4 | Apple M2 Pro 12-core (8 big) | 301 |
| 5 | Intel 12900k 16-core (8 big) | 365 |
FSL is a popular software suite for neuroimaging analysis. The FSL Evaluation and Example Data Suite (FEEDS) is a regression test for the core tools. This suite has expanded over the years to include new tools, so times are specific to the version being tested (here 6.0.5).
| Rank | System/CPU | Time (sec) |
|---|---|---|
| 1 | Apple M4 Pro 14-core (10 big) | 243 |
| 2 | Apple M2 Pro 12-core (8 big) | 345 |
| 3 | AMD 7950x3D 8-core (fast) | 372 |
| 4 | AMD 7950x3D 16-core | 385 |
| 5 | AMD 7950x3D 8-core (big cache) | 396 |
| 6 | Intel 12900k 16-core (8 big) | 546 |
| 7 | AMD 7995WX 96-core | 617 |
One could argue that FEEDS is designed to validate the accuracy of each tool, using relatively small datasets. A more realistic test is the registration dataset from the FSL training course, which computes the complex spatial preprocessing for a typical modern dataset. Note that the course material has changed over time; for a more recent test, see the fMRI pipeline below.
| Rank | System/CPU | Time (sec) |
|---|---|---|
| 1 | Intel 12900k 16-core (8 big) | 706 |
| 2 | AMD 7950f 8-core (fast) | 788 |
| 3 | Apple M2 Pro 12-core (8 big) | 796 |
| 4 | AMD 7950x3D 16-core | 806 |
| 5 | AMD 7950c 8-core (big cache) | 879 |
POSIX Threads (pthreads) is a low-level method for handling parallel threads. Here we test C-Ray, a simple raytracing benchmark by John Tsiombikas that measures parallel floating-point CPU performance. The scene is small enough to fit in cache, so it is not a great measure of typical performance with real-world datasets. On the other hand, it is very portable and has been used for many years, which allows historical comparisons. Specifically, here we provide data for SGI Depot Test 3.
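The same idea, timing a fixed amount of floating-point work split across a varying number of parallel workers, can be sketched in Python (this is only an illustration of the measurement, not a substitute for C-Ray itself):

```python
# Minimal C-Ray-style scaling probe: a fixed floating-point workload
# is divided across N processes and the wall-clock time recorded.
import math
import time
from concurrent.futures import ProcessPoolExecutor

def fp_kernel(iters: int) -> float:
    """Busy floating-point loop standing in for per-ray shading math."""
    s = 0.0
    for i in range(1, iters):
        s += math.sqrt(i) * math.sin(i)
    return s

def timed(workers: int, total: int = 2_000_000) -> float:
    """Wall-clock seconds to complete `total` iterations across workers."""
    chunks = [total // workers] * workers
    t0 = time.perf_counter()
    with ProcessPoolExecutor(max_workers=workers) as pool:
        list(pool.map(fp_kernel, chunks))
    return time.perf_counter() - t0

if __name__ == "__main__":
    for w in (1, 2, 4, 8):
        print(f"{w} workers: {timed(w):.2f} s")
```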
| Rank | System/CPU | Time (msec) | Threads |
|---|---|---|---|
| 1 | AMD 7995WX 96-core | 311 | 200 |
| 2 | Ampere Altra A1 3.0GHz 160-core | 413 | 200 |
| 3 | Apple M4 Pro 14-core (10 big) 4.5GHz | 1011 | 64 |
| 4 | AMD 7950x3D 16-core | 1245 | 120 |
| 5 | Intel 12900k 16-core (8 big) | 1548 | 120 |
| 6 | Apple M2 Pro 12-core (8 big) 3.7GHz | 1640 | 64 |
| 7 | Apple M4 Pro 14-core (10 big) 4.5GHz | 10264 | 1 |
| 8 | Apple M2 Pro 12-core (8 big) 3.7GHz | 14014 | 1 |
| 9 | Intel 12900k 16-core (8 big) | 16257 | 1 |
| 10 | AMD 7995WX 96-core | 17086 | 1 |
| 11 | AMD 7950x3D 16-core | 18691 | 1 |
| 12 | Ampere Altra A1 3.0GHz 160-core | 34563 | 1 |
A handful of neuroimaging tools are dramatically accelerated by using a CUDA-compatible graphics card (GPU) rather than the central processing unit (CPU). This includes the FSL tools Bedpostx, Eddy and Probtrackx. However, these tools are explicitly designed to minimize memory demands and maintain frequent interaction with the CPU. This results in diminishing returns from premium hardware. These observations are specific to the design of these specific FSL tools and should not be generalized to all machine learning applications. To demonstrate the specific behavior of the FSL tools, we run a real-world neuroimaging diffusion weighted imaging workflow.
This benchmark uses a more representative dataset from the Aphasia Recovery Cohort. It includes:
- 43 DWI volumes acquired with anterior-to-posterior (AP) phase encoding
- 43 DWI volumes with reversed phase encoding (PA)
The pipeline performs the following steps:
- Distortion correction with topup and eddy
- Fiber orientation modeling using bedpostx
- Probabilistic tractography with probtrackx to model connections between regions defined by the Harvard-Oxford atlas
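The stages above can be sketched as FSL command lines driven from Python. All file names here (AP_PA_b0.nii.gz, acqparams.txt, subject_dir, and so on) are placeholders for this example; consult the FSL documentation for the exact options your data requires:

```python
# Hedged sketch of the DWI pipeline stages as FSL command lines.
# File names are placeholders, not the actual benchmark inputs.
import subprocess

def dwi_commands():
    """Return the pipeline stages as argument lists."""
    return [
        # 1. Susceptibility distortion correction from the AP/PA b0 pairs
        ["topup", "--imain=AP_PA_b0.nii.gz", "--datain=acqparams.txt",
         "--config=b02b0.cnf", "--out=topup_results"],
        # 2. Eddy current and motion correction (eddy_cuda variants use the GPU)
        ["eddy", "--imain=dwi.nii.gz", "--mask=brain_mask.nii.gz",
         "--acqp=acqparams.txt", "--index=index.txt", "--bvecs=bvecs",
         "--bvals=bvals", "--topup=topup_results", "--out=eddy_corrected"],
        # 3. Fiber orientation modeling (bedpostx_gpu when CUDA is available)
        ["bedpostx", "subject_dir"],
        # 4. Probabilistic tractography between atlas-defined regions
        ["probtrackx2", "--seed=seeds.txt",
         "--samples=subject_dir.bedpostX/merged",
         "--mask=brain_mask.nii.gz", "--out=fdt_paths"],
    ]

if __name__ == "__main__":
    for cmd in dwi_commands():
        subprocess.run(cmd, check=True)  # requires FSL on the PATH
```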
The results are shown below. Apple Silicon CPUs (like the M4 Pro) perform competitively for general CPU-bound tasks. However, Apple hardware does not support CUDA. On CUDA-enabled systems, GPU acceleration leads to massive speedups, especially for tractography and model fitting. In practice, the server‑class systems tested (Epyc, 8480CL) were slower than commodity desktop computers for some stages (e.g., eddy), likely because the servers had turbo boost disabled (maintaining a fixed clock speed regardless of active threads) and used error‑correcting memory. Note that graphics cards with a huge amount of RAM like the H200 require a kludge to run probtrackx.
| System | topup (sec) | eddy (sec) | bedpostx (sec) | probtrackx (sec) | other (sec) | total (sec) |
|---|---|---|---|---|---|---|
| AMD Epyc 9355 H200 | 91 | 253 | 149 | 48 | 26 | 567 |
| AMD Epyc 9454 H100 | 100 | 245 | 163 | 51 | 39 | 598 |
| AMD 7995WX RTX4090 | 103 | 198 | 225 | 47 | 27 | 600 |
| AMD 7950x3D RTX4070 | 84 | 203 | 339 | 77 | 24 | 727 |
| AMD 5975WX RTX4070 Ti Super | 108 | 228 | 339 | 69 | 31 | 775 |
| AMD 7965WX RTX 4000 Ada | 91 | 241 | 453 | 82 | 28 | 895 |
| Intel-8480CL A100 | 154 | 442 | 201 | 73 | 48 | 918 |
| AMD 3900X RTX 3080 Ti | 145 | 337 | 408 | 114 | 44 | 1048 |
| Apple M4 Pro | 93 | 4148 | 1613 | 7357 | 17 | 13228 |
This repository includes a simple benchmark using an end-to-end fMRI processing pipeline with FSL. The pipeline performs all steps required for a single-subject analysis, including spatial undistortion, registration, motion correction, statistical modeling, and temporal filtering. The dataset includes 302 fMRI volumes and uses a fieldmap for spatial undistortion, guided by a T1-weighted anatomical scan. The task involves left and right finger tapping.
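A simple way to collect per-stage timings like those reported in the tables here is to wrap each stage in a timer. The demo stages below are hypothetical placeholders; in the real scripts each callable would invoke an FSL tool:

```python
# Sketch of a per-stage timing harness: run each pipeline stage,
# record wall-clock seconds, and report a breakdown per stage.
import time

def time_stages(stages):
    """stages: list of (name, zero-arg callable); returns {name: seconds}."""
    timings = {}
    for name, fn in stages:
        t0 = time.perf_counter()
        fn()
        timings[name] = time.perf_counter() - t0
    return timings

if __name__ == "__main__":
    demo = [("stage_a", lambda: sum(range(10 ** 6))),
            ("stage_b", lambda: sorted(range(10 ** 5), reverse=True))]
    for name, sec in time_stages(demo).items():
        print(f"{name}: {sec:.3f} s")
```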
| Rank | System/CPU | Time (sec) |
|---|---|---|
| 1 | Apple M4 Pro 14-core (10 big) | 251 |
| 2 | AMD 7995WX 96-core | 305 |
| 3 | AMD Epyc 9355 H200 | 340 |
| 4 | AMD 7950x3D 16-core | 371 |
| 5 | AMD 7965WX RTX 4000 Ada | 395 |
| 6 | AMD Epyc 9454 H100 | 449 |
| 7 | AMD 5975WX 32-core | 602 |
| 8 | Intel-8480CL A100 | 618 |
| 9 | AMD 3900X | 714 |
Artificial Intelligence (AI) models are revolutionizing neuroimaging. While NVidia graphics hardware is the dominant player in generating AI models, many popular tools in our field, like brainchop and some FreeSurfer AI models, can run on graphics hardware from other vendors, such as Apple. This benchmark performs brain extraction on images from the Aphasia Recovery Cohort and the Stroke Outcome Optimization Project using the MindGrab model.
| Rank | System/CPU | Time (sec) |
|---|---|---|
| 1 | AMD Epyc 9355 H200 | 23 |
| 2 | AMD 7995WX 4090 | 25 |
| 3 | AMD 7950x3D 4070 Ti | 27 |
| 4 | AMD 5975WX 4070 Ti Super | 33 |
| 5 | AMD Epyc 9454 H100 | 34 |
| 6 | AMD 7965WX RTX 4000 Ada | 36 |
| 7 | Apple M4 Pro 14-core (10 big) | 43 |
| 8 | Intel-8480CL A100 | 43 |
- In aggregate, AMD's 3D cache does not appear to benefit the tools used in neuroimaging, so the EPYC 9374F is likely to outperform the more expensive 9384X.
- No CPU evaluated dominated the results; each appears competitive. In these tests, single-threaded tasks were run on an otherwise dormant system (allowing unrestricted boost of a single thread), while typical servers will be running many concurrent tasks.
- Empirically, the Apple M2/M4 have excellent single-threaded performance, likely reflecting their prodigious memory bandwidth (among other benefits). Clearly, MacBooks provide a great platform for developing pipelines to be deployed for large datasets on the cloud and servers. Unfortunately, some neuroimaging tools rely on NVidia's CUDA, which is not available on macOS. It will be interesting to see how NVidia ARM-based CPUs with CUDA GPUs compete in this domain.
- In general, many neuroimaging processing tools do not exploit modern multi-core CPUs. These tools are often single-threaded or show rapidly diminishing returns beyond a small number of cores. As a result, the benefit of a server with many CPU cores is not in accelerating the processing of a single subject, but rather in enabling parallel processing of multiple subjects simultaneously. This makes high-core-count systems ideal for batch processing datasets with images from many individuals.
This web page was originally created in 2023 to inform my center's workstation and supercomputer purchases. While the findings remain relevant, much of the hardware described is no longer the latest generation. To keep the benchmarks current, data and scripts for the final three tests are provided, allowing users to evaluate their own systems. If you would like to contribute, feel free to submit a pull request or open an issue to share your results.
To run the benchmarks, you will need FSL and brainchop installed. On Linux systems with an NVIDIA GPU, ensure that both clang and nvcc are installed (`sudo apt install nvidia-cuda-toolkit clang`). Without nvcc, brainchop will fall back to CPU processing instead of using the GPU.
Once the requirements are met, run the following:
```
git clone https://github.com/neurolabusc/CPUsForNeuroimaging
cd ./CPUsForNeuroimaging/bench
python dwi.py
python fmri.py
python ai.py
```

- Anandtech Ryzen 7950x3d review.
- Puget Systems Ryzen 7950x3d for content creation.
- Phoronix EPYC 9684X review
- Tim Dettmers provides insights for selecting a GPU for deep learning.
