RAM Coffers
Overview
Introduction
RAM Coffers is an open-source architecture for efficient inference of large language models (LLMs), developed by Scott Boudreaux of Elyan Labs.[1] It enables high-performance LLM inference on aging hardware, such as IBM POWER8 systems, through NUMA-distributed weight banking and O(1) lookup routing that together support conditional memory access.[1][2] The project emphasizes sustainability by demonstrating viable AI workloads on otherwise obsolete retro-computing hardware, extending the useful life of such systems for modern tasks.[1][2] The project was initially developed and demonstrated in December 2025, with the GitHub repository published in January 2026. It conceptually predates the similar DeepSeek Engram approach, whose paper (arXiv:2601.07372) was submitted on January 12, 2026.[1][3]
Development History
RAM Coffers was developed as an independent research project by Scott Boudreaux of Elyan Labs, using the GitHub username @Scottcjn. The work originated from efforts to achieve efficient large language model inference on aging hardware, specifically an IBM POWER8 S824 server acquired second-hand for $700 in 2014.[2][4] The project was developed by December 16, 2025, with a demonstration video published on December 17, 2025. The GitHub repository was publicly released on January 19, 2026.[2]
Release and Milestones
The RAM Coffers architecture was developed and documented as of December 16, 2025, by Scott Boudreaux of Elyan Labs, with the GitHub repository (https://github.com/Scottcjn/ram-coffers) published in January 2026. The architecture predated DeepSeek's Engram concept by 27 days; the latter was published on arXiv under identifier 2601.07372 on January 12, 2026.[3][2] A subsequent milestone was an ablation study documented in materials associated with the GRAIL-V submission to CVPR 2026. The study demonstrated a 20% efficiency gain in video generation through emotional prompting, alongside a 23.9% reduction in file size and up to a 33% performance improvement in complex emotional scenes across models such as AnimateDiff and SVD, with results validated using LPIPS metrics.
Technical Architecture
NUMA-Distributed Weight Banking
NUMA-Distributed Weight Banking is the foundational mechanism in RAM Coffers that partitions model weights across Non-Uniform Memory Access (NUMA) nodes according to semantic domains. This approach organizes the model's parameters into distinct "coffers," each assigned to a specific NUMA node and dedicated to a particular knowledge category, such as core general knowledge, science/technology, creative writing/long-context handling, or niche/historical content.[1] The partitioning follows domain-based sharding, where weights are distributed to enable targeted storage and retrieval. In a typical configuration on multi-socket hardware, the coffers are allocated as follows:
- Coffer 0 on NUMA Node 3 (193 GB): Heavy/General (core knowledge)
- Coffer 1 on NUMA Node 1 (183 GB): Science/Tech domain
- Coffer 2 on NUMA Node 0 (119 GB): Creative/Long CTX
- Coffer 3 on NUMA Node 2 (62 GB): Niche/History[1]
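The allocation above can be modeled as a small registry keyed by domain. The following sketch is illustrative only (the `Coffer` class, field names, and 4-dimensional signature vectors are hypothetical, not repository code); it also shows how a query embedding might select a bank by cosine similarity against per-domain signatures, the mechanism the project calls resonance routing:

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Coffer:
    """One NUMA-resident weight bank (hypothetical model, not repository code)."""
    coffer_id: int
    numa_node: int
    capacity_gb: int
    domain: str
    signature: tuple  # placeholder domain-signature embedding

# The allocation described above, with toy 4-dimensional signatures.
COFFERS = [
    Coffer(0, 3, 193, "heavy/general",     (1.0, 0.1, 0.1, 0.1)),
    Coffer(1, 1, 183, "science/tech",      (0.1, 1.0, 0.1, 0.1)),
    Coffer(2, 0, 119, "creative/long-ctx", (0.1, 0.1, 1.0, 0.1)),
    Coffer(3, 2,  62, "niche/history",     (0.1, 0.1, 0.1, 1.0)),
]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def resonant_coffer(query_embedding):
    """Pick the coffer whose domain signature 'resonates' most strongly
    with the query embedding; only that bank's NUMA node is then touched."""
    return max(COFFERS, key=lambda c: cosine(query_embedding, c.signature))
```

Under this layout, a query embedded near the science/tech signature resolves to Coffer 1 on NUMA node 1, so the other three banks incur no memory traffic.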
Resonance Routing
Resonance routing is the dynamic query-to-memory mapping mechanism in RAM Coffers that enables selective activation of relevant weight banks during inference. Incoming queries are first embedded into vector representations, which are then compared against pre-defined domain signatures associated with each memory bank using cosine similarity. This matching process identifies and activates only the most semantically aligned coffers, avoiding unnecessary access to irrelevant portions of the model.[1] The resonance-based approach draws on associative recall principles to route queries efficiently: the cosine similarity score determines the "resonance" strength between the query embedding and coffer signatures, triggering intelligent weight activation for the selected bank(s). This results in targeted retrieval that bypasses full model loading, achieving O(1) lookup performance regardless of model size.[1] By limiting computation to pertinent memory regions, resonance routing substantially reduces latency and memory bandwidth demands compared to conventional LLM inference pipelines. The mechanism operates atop NUMA-distributed weight banks, leveraging their physical separation to minimize cross-node traffic during routing.[1][2]
Non-Bijunctive Pruning
Non-bijunctive pruning is a Hebbian-inspired technique in RAM Coffers that selectively collapses irrelevant paths in the model before fetching full weights.[5] This approach, described as "Hebbian-inspired non-bijunctive path collapse" in the implementation file ggml-intelligent-collapse.h, reduces memory bandwidth requirements by eliminating unnecessary data transfers early in processing.[5] By performing this selective collapse prior to full weight retrieval, non-bijunctive pruning minimizes bandwidth usage.[5] In the RAM Coffers processing flow, the technique is applied during the pse_collapse_prune step, which executes non-bijunctive path selection immediately after coffer activation and before complete weight fetching.[5] This positioning enables efficient pruning in conjunction with resonance routing, as only the activated domain-specific coffers undergo path evaluation and collapse.[5]
Hardware-Specific Optimizations
RAM Coffers includes targeted optimizations for PowerPC-based systems, with particular emphasis on the IBM POWER8 architecture to maximize inference efficiency on legacy hardware.[1] These include the use of Data Cache Block Touch (DCBT) instructions to issue prefetch hints that promote data residency in L2 and L3 caches, reducing memory access latency by proactively loading relevant weights and tensors into faster cache levels.[1] This cache-aware strategy exploits POWER8's hierarchical cache design to minimize stalls during selective weight activation.[1] The architecture also incorporates PSE Entropy Burst, which draws entropy directly from the PowerPC timebase register—a hardware counter providing high-resolution timing information—to inject variability into the generation process and improve output diversity without relying on external sources.[1] Vector Scalar Extension (VSX) optimizations further accelerate compute-intensive operations, particularly in vectorized attention mechanisms, by leveraging POWER8's SIMD capabilities for faster processing of matrix and reduction operations.[1] These low-level adaptations are designed for systems such as the dual 8-core IBM POWER8 S824.[1]
Core Features
Conditional Memory System
The RAM Coffers project introduces a conditional memory paradigm that separates static model knowledge from dynamic computation, enabling more efficient inference by selectively storing and accessing portions of the model weights based on query relevance.[5] Model weights are partitioned across distinct memory banks according to semantic domains, such as core knowledge (often labeled as heavy/general), science/technology, creative writing (including long-context handling), and history (or niche topics).[5] This domain-based organization allows the system to assign specialized portions of the model to dedicated coffers, reflecting natural divisions in knowledge types. For each incoming query, only the relevant memory bank is activated, ensuring that computation draws from the appropriate domain-specific weights rather than loading the entire model.[5] This selective activation minimizes unnecessary data movement and computation. Implemented via NUMA-distributed weight banking, the conditional memory approach achieves its primary design goals of reducing latency through targeted access and alleviating memory pressure by avoiding full-model residency during inference.[5] The paradigm emphasizes efficient separation of persistent knowledge storage from runtime processing, a principle that supports high-performance operation even on constrained hardware configurations.[2]
PSE Entropy Burst
PSE Entropy Burst is a hardware entropy injection mechanism in RAM Coffers that introduces true randomness into the LLM generation process to enhance output diversity.[1] It leverages the PowerPC timebase register as a source of hardware-derived entropy, which is injected during the generation step after coffer activation and path pruning.[1] This approach addresses the limitations of deterministic sampling in large language models by injecting variability at the hardware level, resulting in more varied and creative responses across generations.[1] The technique is implemented in the file pse-entropy-burst.h and forms part of the inference pipeline where entropy is applied from the active NUMA coffer node.[1] By drawing on the PowerPC architecture's native timebase for entropy, PSE Entropy Burst provides a low-overhead method to improve generation quality that is particularly effective on PowerPC systems such as IBM POWER8 hardware.[1] This PowerPC-specific optimization complements the broader conditional memory system by ensuring that diversity enhancements remain computationally efficient without relying on software-based pseudo-random sources.[1]
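A rough sketch of the idea follows, using Python's `time.perf_counter_ns()` as a stand-in for reading the PowerPC timebase register (the real mechanism in pse-entropy-burst.h works at the hardware level; the function names and the perturbation scale here are hypothetical). The low bits of a high-resolution counter are effectively unpredictable, so they can perturb sampling logits without a software PRNG:

```python
import time

def timebase_entropy():
    """Stand-in for a PowerPC timebase read: take the low 16 bits of a
    high-resolution hardware counter, which vary unpredictably."""
    return time.perf_counter_ns() & 0xFFFF

def entropy_burst(logits, scale=1e-3):
    """Inject a small timer-derived perturbation into each logit so that
    repeated generations do not collapse to identical outputs.
    The perturbation is bounded by +/- scale/2."""
    out = []
    for x in logits:
        noise = (timebase_entropy() / 0xFFFF - 0.5) * scale
        out.append(x + noise)
    return out
```

Because the perturbation is bounded and tiny relative to typical logit magnitudes, it diversifies sampling without meaningfully changing the token distribution.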
DCBT Resident Prefetch
DCBT Resident Prefetch is a PowerPC-specific cache optimization in RAM Coffers that uses the Data Cache Block Touch (DCBT) instruction to issue cache hints, proactively loading data into L2 and L3 caches to ensure residency before access during inference.[1] The technique operates by inserting DCBT instructions during the activation of relevant memory coffers, which prompts the hardware to fetch and retain targeted cache lines in higher-level caches, thereby reducing latency from future memory accesses.[1] This process integrates with thread affinity mechanisms to bind execution to the appropriate NUMA node while warming the cache, maximizing the effectiveness of the prefetch hints on PowerPC architectures.[1] By leveraging DCBT's ability to anticipate memory needs without immediate consumption, the optimization tailors cache behavior to the demands of conditional memory access in RAM Coffers, enhancing efficiency on IBM POWER8 systems where it accelerates LLM inference.[1]
Top-K Collapse
Top-K Collapse is a performance optimization in RAM Coffers that accelerates the attention mechanism in large language model inference through VSX-optimized top-k selection and path collapsing. Implemented in the file ggml-topk-collapse-vsx.h, it targets the PowerPC architecture's Vector Scalar Extension (VSX) instruction set to enable efficient vectorized processing.[5][6]
The technique selects the top-k highest-scoring elements—typically attention scores or relevance paths—and collapses less significant ones, thereby reducing the number of computations required during attention. This selective pruning focuses resources on the most impactful connections, limiting the overhead associated with full attention matrix evaluation. By applying VSX intrinsics for parallel comparisons, reductions, and pruning operations, Top-K Collapse exploits hardware parallelism to minimize latency and memory bandwidth usage in attention layers.[5][1]
Top-K Collapse builds on non-bijunctive pruning principles by providing a specialized, attention-specific variant that integrates hardware-aware optimizations for PowerPC systems. This targeted approach enhances inference efficiency without requiring changes to the underlying model architecture, making it particularly suited to the NUMA-distributed conditional memory framework of RAM Coffers.[5]
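The selection-and-collapse step, which applies the non-bijunctive pruning idea to attention scores, can be sketched in scalar Python. This is a conceptual stand-in: the project's header reportedly uses VSX intrinsics for parallel comparisons, whereas `heapq` here is a plain scalar substitute, and the function name is hypothetical:

```python
import heapq
import math

def topk_collapse(scores, k):
    """Keep the k largest scores; collapse the rest to -inf so that they
    contribute nothing after a subsequent softmax. Ties at the cutoff
    are kept, so slightly more than k survivors are possible."""
    if k >= len(scores):
        return list(scores)
    cutoff = heapq.nlargest(k, scores)[-1]
    return [s if s >= cutoff else -math.inf for s in scores]
```

For example, collapsing `[0.1, 0.9, 0.5, 0.3]` with k=2 keeps only the 0.9 and 0.5 entries, so the attention computation can skip the collapsed positions entirely.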
Performance
Benchmarks on POWER8 Hardware
RAM Coffers has been benchmarked on IBM POWER8 hardware, specifically a dual 8-core S824 system equipped with 320 GB of RAM.[5] These benchmarks focus on inference performance using the TinyLlama 1.1B model quantized to Q4_K, with measurements taken in tokens per second at a prompt length of 128 (pp128).[5] The results compare the stock llama.cpp baseline against progressive optimizations:
- Stock llama.cpp: 16.74 tokens/sec[5]
- With POWER8 VSX optimizations: 66.49 tokens/sec[5]
- With additional PSE Collapse: 84.62 tokens/sec[5]
- Full RAM Coffers including DCBT: 147.54 tokens/sec[5]
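Taken at face value, the reported figures imply the following speedups over the stock baseline (simple arithmetic on the numbers above; the configuration labels are shorthand, not benchmark identifiers from the repository):

```python
BASELINE_TPS = 16.74  # stock llama.cpp, tokens/sec (pp128, TinyLlama 1.1B Q4_K)

MEASURED_TPS = {
    "POWER8 VSX optimizations": 66.49,
    "VSX + PSE Collapse": 84.62,
    "Full RAM Coffers (with DCBT)": 147.54,
}

# Speedup of each configuration over the stock baseline.
speedups = {name: round(tps / BASELINE_TPS, 2) for name, tps in MEASURED_TPS.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor}x")
```

That is, roughly 4.0x from VSX alone, 5.1x with PSE Collapse added, and 8.8x for the full stack including DCBT.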
Efficiency Gains in Video Generation
RAM Coffers development led to the discovery of significant efficiency improvements in video generation through the use of "emotional prompting," as detailed in the GRAIL-V paper submitted to CVPR 2026.[1] This technique, involving the incorporation of emotional language in prompts, yielded a 20% overall efficiency gain in video generation tasks.[1] In controlled ablation studies, it produced a 23.9% reduction in file size while maintaining perceptual quality.[1] For complex scenes involving multi-character emotional interactions, performance improvements reached up to 33%.[1] These gains were validated across diffusion-based video models including AnimateDiff and Stable Video Diffusion (SVD), using Learned Perceptual Image Patch Similarity (LPIPS) metrics in a benchmark of 35 matched pairs.[1] The ablation study and associated findings emerged as a byproduct of RAM Coffers testing and were documented in the GRAIL-V submission materials available in the project repository.[1]
Comparisons
Comparison to DeepSeek Engram
RAM Coffers and DeepSeek's Engram both introduce conditional memory mechanisms that decouple static knowledge from dynamic computation in large language model inference, allowing targeted retrieval of relevant parameters while reducing unnecessary processing.[3][2] The developer of RAM Coffers announced and demonstrated concepts in December 2025 (including a related video on December 17, 2025), while the GitHub repository became active with initial commits on January 19, 2026. DeepSeek's Engram paper was published on arXiv (2601.07372) on January 12, 2026.[2][3] While both systems achieve O(1) lookup for efficient conditional memory activation, RAM Coffers emphasizes NUMA-aware resonance routing to selectively activate domain-sharded weight banks, with a focus on commodity and retro hardware such as IBM POWER8 systems.[2] In contrast, Engram employs scalable lookup via hashed N-gram embeddings and deterministic addressing to offload static patterns to host memory, complementing Mixture-of-Experts architectures.[3][7] RAM Coffers focuses on query-based weight routing tailored to non-standard hardware, whereas Engram introduces conditional memory as a distinct sparsity axis optimized for scalable deployment.
Ecosystem and Impact
Integration with llama.cpp
RAM Coffers is built directly on top of llama.cpp, extending the framework's GGML library to incorporate NUMA-aware memory management and specialized inference optimizations.[1] The integration focuses on enhancing llama.cpp's memory handling and routing mechanisms through additions such as multi-bank indexing for distributed weights, resonance-based routing for selective activation, and hardware-specific optimizations targeted at architectures like IBM POWER8.[5] Core integration occurs via several key header files that modify or extend GGML components: ggml-ram-coffers.h implements multi-bank NUMA weight indexing with resonance routing; ggml-coffer-mmap.h manages GGUF model sharding across NUMA nodes; ggml-intelligent-collapse.h introduces path pruning mechanisms; and pse-entropy-burst.h provides hardware entropy injection tailored to PowerPC systems.[5] This approach preserves compatibility with existing GGUF models from the llama.cpp ecosystem, allowing RAM Coffers to serve as a drop-in enhancement for inference without necessitating model conversions.[1]
Sustainability and Proof-of-Antiquity
RAM Coffers enables efficient large language model inference on aging hardware such as IBM POWER8 servers, which can promote the reuse of legacy systems and potentially extend their operational life for AI workloads.[1] The project is developed by Elyan Labs, which also develops the Proof-of-Antiquity (PoA) blockchain protocol implemented in Rustchain. PoA rewards relic machines based on their authenticity, age, and entropy endurance, incentivizing the preservation of vintage computing hardware. By prioritizing older systems in its consensus mechanism, PoA supports digital preservation and encourages the use of existing hardware over newer alternatives.[8] Both projects reflect a focus on retro-computing reuse, demonstrating the potential utility of older hardware in modern applications like AI inference.
Repository Structure and Licensing
RAM Coffers is an open-source project hosted on GitHub at https://github.com/Scottcjn/ram-coffers under the account of Scott Boudreaux (@Scottcjn) of Elyan Labs.[1] The repository is licensed under the Apache-2.0 License.[1] Key files in the repository include:
- ggml-ram-coffers.h (multi-bank indexing and routing)
- ggml-coffer-mmap.h (GGUF model sharding)
- ggml-intelligent-collapse.h (path pruning)
- pse-entropy-burst.h (entropy injection)
- DEEPSEEK_COMPARISON.md (side-by-side comparison with DeepSeek Engram)