Neuroimaging is a powerful tool for understanding healthy brain function, pathology, and the consequences of aging, as well as for guiding and evaluating treatment. However, analyzing brain imaging data is compute intensive, and neuroimaging teams make large investments in the computers used for image processing. The objective of this page is to evaluate different architectures for analyzing brain imaging data.
For example, when purchasing a 32-core server, the candidates include the AMD EPYC 9374F (256 MB L3, 3.85-4.1 GHz, 320 W, $4,780) or 9384X (768 MB L3, 3.1-3.5 GHz, 320 W, $5,529), versus the Intel Sapphire Rapids-based Xeon Platinum 8462Y (60 MB cache, 2.8-4.1 GHz, 300 W, $5,945, 2-socket) or 8454H (82.5 MB cache, 2.1-3.7 GHz, 270 W, $6,540, 8-socket). Note the tradeoff between the AMD CPUs: one can choose the faster 9374F or the larger cache of the 9384X. A large cache benefits some tasks but not others, so it is worth evaluating how different vendors (AMD vs Intel) and different tradeoffs (cache versus speed) influence neuroimaging pipelines. Many 3D neuroimaging operations, such as spatial normalization, Gaussian blur, and matrix-based statistics, would seem to benefit from larger caches. However, many working sets are too large to fit into any cache, and the data is often read in a predictable, sequential order. Here we evaluate the Ryzen 7950X3D, which has 8 fast cores and 8 cores with access to a large 3D cache (using the same core architecture as the 9374F and 9384X); cores can be disabled independently, letting us directly measure the impact of cache versus speed on the tools of our field. We also test an Intel i9-12900KS, which uses the same core architecture as the 8462Y and 8454H.
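To illustrate why cache size can matter less than expected for these operations, the sketch below times a Gaussian blur (using SciPy's `gaussian_filter` as a stand-in for a pipeline smoothing step, not any tool benchmarked here) as the volume's working set grows past typical L3 cache sizes. Once the volume no longer fits in cache, throughput is governed mostly by memory bandwidth:

```python
# Minimal sketch: time a 3D Gaussian blur as the working set grows past
# typical L3 cache sizes. gaussian_filter stands in for the smoothing
# step found in many neuroimaging pipelines.
import time
import numpy as np
from scipy.ndimage import gaussian_filter

def blur_time(dim):
    """Seconds to blur a dim**3 float32 volume with sigma of 2 voxels."""
    vol = np.random.rand(dim, dim, dim).astype(np.float32)
    t0 = time.perf_counter()
    gaussian_filter(vol, sigma=2.0)
    return time.perf_counter() - t0

if __name__ == "__main__":
    for dim in (64, 128, 256):  # ~1 MB, ~8 MB, ~64 MB working sets
        mb = dim ** 3 * 4 / 2 ** 20
        print(f"{dim}^3 volume ({mb:.0f} MB): {blur_time(dim):.3f} s")
```

On a large-cache CPU, one would expect the smaller volumes to benefit disproportionately; on realistic full-brain volumes the working set exceeds even a 768 MB L3 once multiple images are in flight.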
Neuroimaging servers tend to use either Intel or AMD CPUs, as these benefit from a complete ecosystem of tools and have access to NVidia GPUs that dramatically accelerate some tasks (e.g. bedpost, eddy, probtrackx). We have previously described the potential for ARM-based Apple CPUs and include the Apple M2 CPU here for reference. Indeed, NVidia is introducing ARM-based CPUs paired with their CUDA GPUs, which may be a long-term game changer, but that is beyond the scope of this page.
While some of the tests below evaluate the influence of multiple threads on processing speed, it is worth emphasizing that most server-based neuroimaging is intentionally single threaded. Most neuroimaging studies acquire data from many individuals, and the data from each individual is processed independently on a single thread. The parallelism therefore comes from analyzing data from multiple individuals simultaneously, which makes the Single Threaded Pipeline sections below most relevant for server purchases.
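This subject-level parallelism can be sketched as follows; `process_subject` is a hypothetical stand-in for a real single-threaded pipeline invocation (e.g. a shell call to an FSL script):

```python
# Sketch of subject-level parallelism: each subject's (single-threaded)
# pipeline runs in its own process, so a many-core server processes many
# subjects at once rather than accelerating any one subject.
from concurrent.futures import ProcessPoolExecutor

def process_subject(subject_id: str) -> str:
    # placeholder for e.g. subprocess.run(["bash", "pipeline.sh", subject_id])
    return f"{subject_id}: done"

if __name__ == "__main__":
    subjects = [f"sub-{i:02d}" for i in range(1, 9)]
    with ProcessPoolExecutor(max_workers=8) as pool:
        for result in pool.map(process_subject, subjects):
            print(result)
```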
Several computers were tested:
- MacBook Pro 16 (2022) with Apple M2 Pro 12-core (8+4), 32 MB L2, maximum bandwidth 200 GB/s, 3.5 GHz, integrated GPU.
- MacBook Pro 14 (2024) with Apple M4 Pro 14-core (10+4), 36 MB L2, maximum bandwidth 273 GB/s, 4.5 GHz, integrated GPU.
- AMD Ryzen 7950X3D 16-core, 16 MB L2, 128 MB L3 (8 fast cores share 32 MB, 8 share 96 MB), 64 GB DDR5-6000, Ubuntu 23.04, 4.2-5.5 GHz, 83.2 GB/s, with a 12 GB NVidia RTX 4070 Ti. We refer to this chip as `7950x3d` when all 16 cores are enabled, `7950f` when only the 8 fast cores are enabled, and `7950c` when only the 8 cores with the extra 3D cache are enabled.
- AMD 7995WX 96-core, 96 MB L2, 384 MB L3, 2.5-5.1 GHz, 332.8 GB/s, with a 24 GB NVidia RTX 4090.
- Intel i9-12900KS 16-core (8 efficiency), 14 MB L2, 30 MB L3, 64 GB DDR4-3200, Ubuntu 23.04, 3.4-5.5 GHz, 51.2 GB/s.
AFNI is a popular software suite for brain imaging: a collection of executables written in C that can be compiled to run natively on Apple Silicon CPUs. AFNI provides a Bigger Speed Test. This benchmark tests 3dDeconvolve, where most of the computation can run in parallel, but where Amdahl's law incurs diminishing returns with an increasing number of threads. For the Ryzen 7950X3D, the faster cores outperform the large-cache cores.
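Amdahl's law itself is easy to sketch: if a fraction p of the work parallelizes across n threads, the predicted speedup is 1/((1-p)+p/n). The value p = 0.95 below is purely illustrative, not a measured property of 3dDeconvolve:

```python
# Amdahl's law: predicted speedup when a fraction p of the work
# parallelizes across n threads. Even at p = 0.95 (illustrative),
# speedup saturates well below the core count.
def amdahl_speedup(p: float, n: int) -> float:
    return 1.0 / ((1.0 - p) + p / n)

if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16, 32):
        print(f"{n:2d} threads: {amdahl_speedup(0.95, n):.2f}x")
```

With p = 0.95, 32 threads yield less than a 13x speedup, which is why adding cores shows rapidly diminishing returns in this benchmark.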
The graph below shows the total time to compute the AFNI_data6 s01.ap.simple script. This is a complete pipeline, analyzing fMRI activity observed in 452 volumes of data. Some programs like 3dClustSim use OpenMP acceleration, while many stages are single-threaded. This is a nice demonstration of real-world performance on a typical dataset.
| Rank | System/CPU | Time (sec) |
|---|---|---|
| 1 | AMD 7950f 8-core (fast) | 252 |
| 2 | AMD 7950c 8-core (big cache) | 273 |
| 3 | AMD 7950x3D 16-core | 283 |
| 4 | Apple M2 Pro 12-core (8 big) | 301 |
| 5 | Intel 12900k 16-core (8 big) | 365 |
FSL is a popular software suite for neuroimaging analysis. The FSL Evaluation and Example Data Suite (FEEDS) is a regression test for the core tools. This suite has expanded over the years to include new tools, so times are specific to the version being tested (here 6.0.5).
| Rank | System/CPU | Time (sec) |
|---|---|---|
| 1 | Apple M4 Pro 14-core (10 big) | 243 |
| 2 | Apple M2 Pro 12-core (8 big) | 345 |
| 3 | AMD 7950x3D 8-core (fast) | 372 |
| 4 | AMD 7950x3D 16-core | 385 |
| 5 | AMD 7950x3D 8-core (big cache) | 396 |
| 6 | Intel 12900k 16-core (8 big) | 546 |
| 7 | AMD 7995WX 96-core | 617 |
One could argue that FEEDS is designed to validate the accuracy of each tool, using relatively small datasets. A more realistic test is the registration dataset from the FSL training course, which computes the complex spatial preprocessing for a typical modern dataset. Note that the course material has changed over time; for a more recent test, see the fMRI pipeline below.
| Rank | System/CPU | Time (sec) |
|---|---|---|
| 1 | Intel 12900k 16-core (8 big) | 706 |
| 2 | AMD 7950f 8-core (fast) | 788 |
| 3 | Apple M2 Pro 12-core (8 big) | 796 |
| 4 | AMD 7950x3D 16-core | 806 |
| 5 | AMD 7950c 8-core (big cache) | 879 |
POSIX Threads (pthreads) is a low-level method for handling parallel threads. Here we test C-Ray, a simple raytracing benchmark by John Tsiombikas that measures parallel floating-point CPU performance. The scene is small enough to fit in cache, so it is not a great measure of typical performance with real-world datasets. On the other hand, it is very portable and has been used for many years, which allows historical comparisons. Specifically, here we provide data for SGI Depot Test 3.
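The same idea, timing a fixed amount of floating-point work split across a varying number of parallel workers, can be sketched in Python (this is only an illustration of the measurement, not a substitute for C-Ray itself):

```python
# Minimal C-Ray-style scaling probe: a fixed floating-point workload
# is divided across N processes and the wall-clock time recorded.
import math
import time
from concurrent.futures import ProcessPoolExecutor

def fp_kernel(iters: int) -> float:
    """Busy floating-point loop standing in for per-ray shading math."""
    s = 0.0
    for i in range(1, iters):
        s += math.sqrt(i) * math.sin(i)
    return s

def timed(workers: int, total: int = 2_000_000) -> float:
    """Wall-clock seconds to complete `total` iterations across workers."""
    chunks = [total // workers] * workers
    t0 = time.perf_counter()
    with ProcessPoolExecutor(max_workers=workers) as pool:
        list(pool.map(fp_kernel, chunks))
    return time.perf_counter() - t0

if __name__ == "__main__":
    for w in (1, 2, 4, 8):
        print(f"{w} workers: {timed(w):.2f} s")
```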
| Rank | System/CPU | Time (msec) | Threads |
|---|---|---|---|
| 1 | AMD 7995WX 96-core | 311 | 200 |
| 2 | Ampere Altra A1 3.0GHz 160-core | 413 | 200 |
| 3 | Apple M4 Pro 14-core (10 big) 4.5GHz | 1011 | 64 |
| 4 | AMD 7950x3D 16-core | 1245 | 120 |
| 5 | Intel 12900k 16-core (8 big) | 1548 | 120 |
| 6 | Apple M2 Pro 12-core (8 big) 3.7GHz | 1640 | 64 |
| 7 | Apple M4 Pro 14-core (10 big) 4.5GHz | 10264 | 1 |
| 8 | Apple M2 Pro 12-core (8 big) 3.7GHz | 14014 | 1 |
| 9 | Intel 12900k 16-core (8 big) | 16257 | 1 |
| 10 | AMD 7995WX 96-core | 17086 | 1 |
| 11 | AMD 7950x3D 16-core | 18691 | 1 |
| 12 | Ampere Altra A1 3.0GHz 160-core | 34563 | 1 |
A handful of neuroimaging tools are dramatically accelerated by using a CUDA-compatible graphics card (GPU) rather than the central processing unit (CPU). This includes the FSL tools Bedpostx, Eddy and Probtrackx. However, these tools are explicitly designed to minimize memory demands and maintain frequent interaction with the CPU. This results in diminishing returns from premium hardware. These observations are specific to the design of these specific FSL tools and should not be generalized to all machine learning applications. To demonstrate the specific behavior of the FSL tools, we run a real-world neuroimaging diffusion weighted imaging workflow.
This benchmark uses a more representative dataset from the Aphasia Recovery Cohort. It includes:
- 43 DWI volumes acquired with anterior-to-posterior (AP) phase encoding
- 43 DWI volumes with reversed phase encoding (PA)
The pipeline performs the following steps:
- Distortion correction with topup and eddy
- Fiber orientation modeling using bedpostx
- Probabilistic tractography with probtrackx to model connections between regions defined by the Harvard-Oxford atlas
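The stages above can be sketched as FSL command lines driven from Python. All file names here (AP_PA_b0.nii.gz, acqparams.txt, subject_dir, and so on) are placeholders for this example; consult the FSL documentation for the exact options your data requires:

```python
# Hedged sketch of the DWI pipeline stages as FSL command lines.
# File names are placeholders, not the actual benchmark inputs.
import subprocess

def dwi_commands():
    """Return the pipeline stages as argument lists."""
    return [
        # 1. Susceptibility distortion correction from the AP/PA b0 pairs
        ["topup", "--imain=AP_PA_b0.nii.gz", "--datain=acqparams.txt",
         "--config=b02b0.cnf", "--out=topup_results"],
        # 2. Eddy current and motion correction (eddy_cuda variants use the GPU)
        ["eddy", "--imain=dwi.nii.gz", "--mask=brain_mask.nii.gz",
         "--acqp=acqparams.txt", "--index=index.txt", "--bvecs=bvecs",
         "--bvals=bvals", "--topup=topup_results", "--out=eddy_corrected"],
        # 3. Fiber orientation modeling (bedpostx_gpu when CUDA is available)
        ["bedpostx", "subject_dir"],
        # 4. Probabilistic tractography between atlas-defined regions
        ["probtrackx2", "--seed=seeds.txt",
         "--samples=subject_dir.bedpostX/merged",
         "--mask=brain_mask.nii.gz", "--out=fdt_paths"],
    ]

if __name__ == "__main__":
    for cmd in dwi_commands():
        subprocess.run(cmd, check=True)  # requires FSL on the PATH
```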
The results are shown below. Apple Silicon CPUs (like the M4 Pro) perform competitively for general CPU-bound tasks. However, Apple hardware does not support CUDA. On CUDA-enabled systems, GPU acceleration leads to massive speedups, especially for tractography and model fitting. In practice, the server‑class systems tested (Epyc, 8480CL) were slower than commodity desktop computers for some stages (e.g., eddy), likely because the servers had turbo boost disabled (maintaining a fixed clock speed regardless of active threads) and used error‑correcting memory. Note that graphics cards with a huge amount of RAM like the H200 require a kludge to run probtrackx.
| System | topup (sec) | eddy (sec) | bedpostx (sec) | probtrackx (sec) | other (sec) | total (sec) |
|---|---|---|---|---|---|---|
| AMD Epyc 9355 H200 | 91 | 253 | 149 | 48 | 26 | 567 |
| AMD Epyc 9454 H100 | 100 | 245 | 163 | 51 | 39 | 598 |
| AMD 7995WX RTX4090 | 103 | 198 | 225 | 47 | 27 | 600 |
| AMD 7950x3D RTX4070 | 84 | 203 | 339 | 77 | 24 | 727 |
| AMD 5975WX RTX4070 Ti Super | 108 | 228 | 339 | 69 | 31 | 775 |
| AMD 7965WX RTX 4000 Ada | 91 | 241 | 453 | 82 | 28 | 895 |
| Intel-8480CL A100 | 154 | 442 | 201 | 73 | 48 | 918 |
| AMD 3900X RTX 3080 Ti | 145 | 337 | 408 | 114 | 44 | 1048 |
| Apple M4 Pro | 93 | 4148 | 1613 | 7357 | 17 | 13228 |
This repository includes a simple benchmark using an end-to-end fMRI processing pipeline with FSL. The pipeline performs all steps required for a single-subject analysis, including spatial undistortion, registration, motion correction, statistical modeling, and temporal filtering. The dataset includes 302 fMRI volumes and uses a fieldmap for spatial undistortion, guided by a T1-weighted anatomical scan. The task involves left and right finger tapping.
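A simple way to collect per-stage timings like those reported in the tables here is to wrap each stage in a timer. The demo stages below are hypothetical placeholders; in the real scripts each callable would invoke an FSL tool:

```python
# Sketch of a per-stage timing harness: run each pipeline stage,
# record wall-clock seconds, and report a breakdown per stage.
import time

def time_stages(stages):
    """stages: list of (name, zero-arg callable); returns {name: seconds}."""
    timings = {}
    for name, fn in stages:
        t0 = time.perf_counter()
        fn()
        timings[name] = time.perf_counter() - t0
    return timings

if __name__ == "__main__":
    demo = [("stage_a", lambda: sum(range(10 ** 6))),
            ("stage_b", lambda: sorted(range(10 ** 5), reverse=True))]
    for name, sec in time_stages(demo).items():
        print(f"{name}: {sec:.3f} s")
```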
| Rank | System/CPU | Time (sec) |
|---|---|---|
| 1 | Apple M4 Pro 14-core (10 big) | 251 |
| 2 | AMD 7995WX 96-core | 305 |
| 3 | AMD Epyc 9355 H200 | 340 |
| 4 | AMD 7950x3D 16-core | 371 |
| 5 | AMD 7965WX RTX 4000 Ada | 395 |
| 6 | AMD Epyc 9454 H100 | 449 |
| 7 | AMD 5975WX 32-core | 602 |
| 8 | Intel-8480CL A100 | 618 |
| 9 | AMD 3900X | 714 |
Artificial Intelligence (AI) models are revolutionizing neuroimaging. While NVidia graphics hardware is the dominant player in generating AI models, many popular tools in our field, like brainchop and some FreeSurfer AI models, can run on graphics hardware from other vendors, such as Apple. This benchmark performs brain extraction on images from the Aphasia Recovery Cohort and the Stroke Outcome Optimization Project using the MindGrab model.
| Rank | System/CPU | Time (sec) |
|---|---|---|
| 1 | AMD Epyc 9355 H200 | 23 |
| 2 | AMD 7995WX 4090 | 25 |
| 3 | AMD 7950x3D 4070 Ti | 27 |
| 4 | AMD 5975WX 4070 Ti Super | 33 |
| 5 | AMD Epyc 9454 H100 | 34 |
| 6 | AMD 7965WX RTX 4000 Ada | 36 |
| 7 | Apple M4 Pro 14-core (10 big) | 43 |
| 8 | Intel-8480CL A100 | 43 |
- In aggregate, AMD's 3D cache does not appear to benefit the tools used in neuroimaging, so the EPYC 9374F is likely to outperform the more expensive 9384X.
- No CPU evaluated dominated the results; each appears competitive. In these tests, single-threaded tasks were run on an otherwise dormant system (allowing unrestricted boost of a single thread), while typical servers will be running many concurrent tasks.
- Empirically, the Apple M2/M4 have excellent single-threaded performance, likely reflecting their prodigious memory bandwidth (among other benefits). Clearly, MacBooks provide a great platform for developing pipelines to be deployed for large datasets on the cloud and servers. Unfortunately, some neuroimaging tools rely on NVidia's CUDA, which is not available on macOS. It will be interesting to see how NVidia ARM-based CPUs with CUDA GPUs compete in this domain.
- In general, many neuroimaging processing tools do not exploit modern multi-core CPUs. These tools are often single-threaded or show rapidly diminishing returns beyond a small number of cores. As a result, the benefit of a server with many CPU cores is not in accelerating the processing of a single subject, but rather in enabling parallel processing of multiple subjects simultaneously. This makes high-core-count systems ideal for batch processing datasets with images from many individuals.
This web page was originally created in 2023 to inform my center's workstation and supercomputer purchases. While the findings remain relevant, much of the hardware described is no longer the latest generation. To keep the benchmarks current, data and scripts for the final three tests are provided, allowing users to evaluate their own systems. If you would like to contribute, feel free to submit a pull request or open an issue to share your results.
To run the benchmarks, you will need FSL and brainchop installed. On Linux systems with an NVIDIA GPU, ensure that both clang and nvcc are installed (`sudo apt install nvidia-cuda-toolkit clang`). Without nvcc, brainchop will fall back to CPU processing instead of using the GPU.
Once the requirements are met, run the following:
```
git clone https://github.com/neurolabusc/CPUsForNeuroimaging
cd ./CPUsForNeuroimaging/bench
python dwi.py
python fmri.py
python ai.py
```

- Anandtech Ryzen 7950x3d review.
- Puget Systems Ryzen 7950x3d for content creation.
- Phoronix EPYC 9684X review
- Tim Dettmers provides insights for selecting a GPU for deep learning.
