The MRI for Large Language Models
OpenBrain provides real-time visualization of neural network internals in GPT-OSS-20B and other Mixture-of-Experts models. Watch how 768 expert networks collaborate across 24 transformer layers as the model processes language and generates responses.
- 🔬 Real-time MoE Visualization: See expert routing decisions as they happen, token by token
- 🧠 Attention Pattern Analysis: Visualize multi-head attention weights and patterns
- ⚡ Production-Ready Architecture: Built for real hardware with GPU optimization
- 📡 WebSocket Streaming: Sub-100ms latency real-time updates
- 📊 Live Performance Metrics: Track GPU memory, throughput, and activation patterns
- 🎯 Educational Interface: Learn how modern AI models actually work inside
- Primary: GPT-OSS-20B (20 billion parameters)
- Fallback: Any Hugging Face Transformers model with MoE layers
- Architecture: 24 Transformer layers with Mixture-of-Experts
- 768 Total Experts (32 experts per layer)
- Top-4 Expert Routing (4 of 32 experts active per layer, i.e. 12.5%; 96 of 768 experts active per token)
- Dynamic Load Balancing across expert networks
- Specialized Expert Functions: Math, language, reasoning, code generation
- 32 Attention Heads per layer (768 total heads)
- 8,192 Token Context window
- Multi-head Self-Attention with real-time weight visualization
- Attention Pattern Analysis showing token-to-token relationships
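The routing figures above (32 experts per layer, top-4 selection, 24 layers) can be checked with a small sketch. This is a minimal illustration in plain Python, not the model's actual router: the toy logits, the `route_top_k` helper, and the choice to renormalize the selected probabilities are all assumptions made for the example.

```python
import math

NUM_LAYERS = 24
EXPERTS_PER_LAYER = 32
TOP_K = 4

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k=TOP_K):
    """Pick the k most probable experts for one token in one layer.

    Returns (expert_indices, weights); weights are renormalized to sum to 1
    (an assumption, common in top-k routers).
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    return top, [probs[i] / mass for i in top]

# Toy logits: expert i gets logit i / 8, so the highest-numbered experts win.
logits = [i / 8 for i in range(EXPERTS_PER_LAYER)]
experts, weights = route_top_k(logits)

# Arithmetic behind the figures quoted in the list above.
active_per_token = TOP_K * NUM_LAYERS          # 96
total_experts = EXPERTS_PER_LAYER * NUM_LAYERS  # 768
active_fraction = TOP_K / EXPERTS_PER_LAYER     # 0.125
```

Each token activates 4 experts in every one of the 24 layers, which is where the 96-of-768 figure comes from.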
- Python: 3.8+ (3.10+ recommended)
- GPU: NVIDIA GPU with 16GB+ VRAM (RTX 4080/4090, A4000+)
- CUDA: 11.8+ or 12.x with compatible drivers
- System RAM: 32GB+ recommended for large models
git clone https://github.com/your-username/openbrain
cd openbrain
python setup.py # Installs everything automatically
# Install PyTorch with CUDA support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# Install OpenBrain dependencies
pip install -r requirements.txt
# Optional: Flash Attention for 2-4x speedup
pip install flash-attn --no-build-isolation
export MODEL_PATH="/path/to/gpt-oss-20b" # Point to your model
python openbrain_server.py # Starts on port 8888
open http://localhost:8888 # View in browser
# If GPT-OSS-20B is not available, will auto-fallback to gpt2
python openbrain_server.py # Automatically handles model fallback
open http://localhost:8888 # View in browser
If the automatic setup fails:
# Install PyTorch with CUDA
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# Install requirements
pip install -r requirements.txt
# Optional: Flash Attention for better performance
pip install flash-attn --no-build-isolation
- MODEL_PATH: Path to GPT-OSS-20B model (required)
- MAX_GPU_MEM_GB: Maximum GPU memory to use (default: "19GB")
- CUDA_VISIBLE_DEVICES: GPU device selection
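As a rough illustration of how these three variables might be read, here is a minimal sketch. The `read_config` helper and the returned dictionary keys are hypothetical and not taken from `openbrain_server.py`; only the variable names and the "19GB" default come from the list above.

```python
import os

def read_config(env=None):
    """Read OpenBrain-style settings from a mapping (defaults to os.environ).

    Hypothetical helper: the real server may parse these differently.
    """
    env = os.environ if env is None else env

    model_path = env.get("MODEL_PATH")
    if not model_path:
        raise ValueError("MODEL_PATH is required")

    # The documented default is the string "19GB"; strip the unit to get a number.
    raw_mem = env.get("MAX_GPU_MEM_GB", "19GB").strip().upper()
    if raw_mem.endswith("GB"):
        raw_mem = raw_mem[:-2]
    max_gpu_mem_gb = int(raw_mem)

    # CUDA_VISIBLE_DEVICES is a comma-separated list; None means "not set".
    raw_devices = env.get("CUDA_VISIBLE_DEVICES")
    devices = None
    if raw_devices is not None:
        devices = [d.strip() for d in raw_devices.split(",") if d.strip()]

    return {
        "model_path": model_path,
        "max_gpu_mem_gb": max_gpu_mem_gb,
        "devices": devices,
    }
```

Passing a plain dictionary instead of relying on `os.environ` keeps the helper easy to test, for example `read_config({"MODEL_PATH": "gpt2"})`.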
The server supports various model formats:
- Hugging Face Transformers format
- 4-bit quantization (BitsAndBytes)
- Flash Attention 2 (when available)
- Multi-GPU distribution
The system registers forward hooks on each MoE layer to capture:
- Router probability distributions
- Selected expert indices
- Routing decisions per token
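import time

class MoEHook:
    def __init__(self, layer_idx):
        self.layer_idx = layer_idx
        self.activations = []  # Collected records, read later by the visualizer

    def __call__(self, module, input, output):
        # Extract routing decisions (assumes the layer output exposes these attributes)
        router_probs = output.router_probs.detach().cpu().numpy()
        selected_experts = output.selected_experts.detach().cpu().numpy()
        # Store for visualization (ExpertActivation is assumed to be defined elsewhere in the server code)
        activation = ExpertActivation(
            layer=self.layer_idx,
            selected_experts=selected_experts.tolist(),
            router_probs=router_probs.tolist(),
            timestamp=time.time(),
        )
        self.activations.append(activation)
Attention hooks capture multi-head attention patterns: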
- Per-head attention weights
- Token-to-token attention maps
- Attention pattern evolution
WebSocket connection streams:
- Generated tokens
- Expert activations per token
- Attention patterns
- Performance metrics
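One way to picture a streamed update is as a JSON message built per generated token. The sketch below follows the token message shape this README documents for the server-to-client direction; the `ExpertActivation` dataclass and the `token_message` helper are illustrative assumptions, and performance metrics are left out because the README does not show their format.

```python
import json
from dataclasses import dataclass, asdict
from typing import List

@dataclass
class ExpertActivation:
    """One routing event: which experts a layer chose for a token (assumed shape)."""
    layer: int
    selected_experts: List[int]
    router_probs: List[float]
    timestamp: float

def token_message(token, expert_activations, attention_activations=None):
    """Serialize one generated token and its activations as a JSON string."""
    return json.dumps({
        "type": "token",
        "token": token,
        "expert_activations": [asdict(a) for a in expert_activations],
        "attention_activations": attention_activations or [],
    })

example = token_message(
    "generated",
    [ExpertActivation(0, [1, 7, 15, 23], [0.4, 0.3, 0.2, 0.1], 1234567890.123)],
)
```

The resulting string can be sent as a WebSocket text frame, one message per token.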
- 4-bit Quantization: Reduces memory usage by ~75%
- Gradient Checkpointing: Trades compute for memory
- Model Sharding: Distributes across multiple GPUs
- Flash Attention 2: 2-4x faster attention computation
- KV Caching: Reuses previous computations
- Mixed Precision: FP16/BF16 for faster computation
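The ~75% figure for 4-bit quantization follows from simple arithmetic on weight storage, assuming FP16 as the baseline. The sketch below counts weights only; quantization constants, activations, and the KV cache add overhead that it does not model, and the `weight_memory_gb` helper is just for illustration.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for model weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

NUM_PARAMS = 20e9  # 20 billion parameters, as stated for GPT-OSS-20B above

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)  # 2 bytes per parameter
int4_gb = weight_memory_gb(NUM_PARAMS, 4)   # half a byte per parameter
reduction = 1 - int4_gb / fp16_gb

print(f"FP16: {fp16_gb:.0f} GB, 4-bit: {int4_gb:.0f} GB, saved: {reduction:.0%}")
```

Under these assumptions the weights drop from about 40 GB to about 10 GB, which is why a 20B-parameter model becomes feasible on a 16GB card.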
- GPU: 16GB VRAM (RTX 4080, A4000)
- CPU: 8+ cores
- RAM: 32GB system memory
- Storage: 100GB for model weights
- GPU: 24GB+ VRAM (RTX 4090, A5000, A6000)
- CPU: 16+ cores
- RAM: 64GB system memory
- Storage: NVMe SSD for model loading
For larger deployments:
export CUDA_VISIBLE_DEVICES=0,1,2,3
python openbrain_server.py
Client → Server:
{
  "type": "generate",
  "prompt": "Your prompt here"
}
Server → Client:
{
  "type": "token",
  "token": "generated",
  "expert_activations": [
    {
      "layer": 0,
      "selected_experts": [1, 7, 15, 23],
      "router_probs": [0.4, 0.3, 0.2, 0.1],
      "timestamp": 1234567890.123
    }
  ],
  "attention_activations": [...]
}
- GET /: Serve main application
- GET /health: Health check and system status
- GET /assets/*: Static file serving
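On the client side, the token messages described above can be folded into a per-layer count of how often each expert was selected. This is a minimal sketch that assumes messages arrive as JSON strings in the documented shape; the `expert_usage` helper is hypothetical and not part of OpenBrain.

```python
import json
from collections import Counter

def expert_usage(raw_messages):
    """Count expert selections per layer across a stream of JSON messages.

    Returns {layer_index: Counter({expert_index: times_selected})}.
    Messages whose type is not "token" are ignored.
    """
    usage = {}
    for raw in raw_messages:
        msg = json.loads(raw)
        if msg.get("type") != "token":
            continue
        for act in msg.get("expert_activations", []):
            usage.setdefault(act["layer"], Counter()).update(act["selected_experts"])
    return usage

# Two toy token messages for layer 0, plus one unrelated message.
stream = [
    json.dumps({"type": "token", "token": "a", "expert_activations": [
        {"layer": 0, "selected_experts": [1, 7, 15, 23]}]}),
    json.dumps({"type": "token", "token": "b", "expert_activations": [
        {"layer": 0, "selected_experts": [1, 2, 15, 30]}]}),
    json.dumps({"type": "status"}),
]
usage = expert_usage(stream)
```

A tally like this is one simple way to see whether routing load is spread evenly or concentrated on a few experts.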
CUDA Out of Memory:
export MAX_GPU_MEM_GB="12GB" # Reduce memory usage
Model Loading Fails:
# Use smaller model for testing
export MODEL_PATH="microsoft/DialoGPT-medium"
WebSocket Connection Issues:
- Check firewall settings
- Verify port 8888 is available
- Try a different browser
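To verify that port 8888 is available before starting the server, one option is to try binding to it. The `port_is_free` helper below is a hypothetical sketch; it checks only the loopback address by default, so adjust `host` if the server binds elsewhere.

```python
import socket

def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can bind to (host, port), False otherwise."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            # Typically "address already in use": another process holds the port.
            return False
    return True

if not port_is_free(8888):
    print("Port 8888 is in use; stop the other process or pick another port.")
```

If the port is taken, a tool such as `lsof -i :8888` (on Linux or macOS) can show which process holds it.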
Enable detailed logging:
export PYTHONPATH=.
python -c "import logging, runpy; logging.basicConfig(level=logging.DEBUG); runpy.run_path('openbrain_server.py', run_name='__main__')"
openbrain/
├── openbrain_server.py # Main server
├── openbrain.html # Frontend
├── requirements.txt # Dependencies
├── setup.py # Installation script
├── assets/ # Static assets
└── README.md # This file
- Fork the repository
- Create feature branch
- Add tests for new functionality
- Submit pull request
# Test with smaller fallback model
export MODEL_PATH="gpt2"
python openbrain_server.py
MIT License - see LICENSE file for details.
If you use OpenBrain in your research:
@software{openbrain2024,
  title={OpenBrain: Real-time MoE Visualization for Large Language Models},
  author={OpenBrain Team},
  year={2024},
  url={https://github.com/your-repo/openbrain}
}
- GPT-OSS-20B model architecture
- Hugging Face Transformers library
- FastAPI for web framework
- PyTorch for deep learning backend
