OpenBrain 🧠

The MRI for Large Language Models

OpenBrain provides real-time visualization of neural network internals in GPT-OSS-20B and other Mixture-of-Experts models. Watch how 768 expert networks collaborate across 24 transformer layers as the model processes language and generates responses.

✨ Features

🔬 Real-time MoE Visualization: See expert routing decisions as they happen, token by token
🧠 Attention Pattern Analysis: Visualize multi-head attention weights and patterns
⚡ Production-Ready Architecture: Built for real hardware with GPU optimization
📡 WebSocket Streaming: Sub-100ms latency real-time updates
📊 Live Performance Metrics: Track GPU memory, throughput, and activation patterns
🎯 Educational Interface: Learn how modern AI models actually work inside

🏗️ Technical Architecture

Model Support

Primary: GPT-OSS-20B (20 billion parameters)
Fallback: Any Hugging Face Transformers model with MoE layers
Architecture: 24 Transformer layers with Mixture-of-Experts

Mixture-of-Experts Configuration

768 Total Experts (32 experts per layer)
Top-4 Expert Routing (4 of 32 experts active per layer, i.e. 12.5%; 96 of 768 experts active per token across all 24 layers)
Dynamic Load Balancing across expert networks
Specialized Expert Functions: Math, language, reasoning, code generation
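
The routing figures above can be checked with a small sketch. It uses plain Python and made-up router logits; the function name `route_top_k` is illustrative and not part of OpenBrain, and renormalizing the selected probabilities is one common convention that may differ from what GPT-OSS-20B actually does.

```python
import math
import random

NUM_LAYERS = 24          # from the architecture description
EXPERTS_PER_LAYER = 32
TOP_K = 4


def route_top_k(logits, k=TOP_K):
    """Return (expert_indices, renormalized_probs) for the k largest logits."""
    # Numerically stable softmax over all experts in the layer.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the k most probable experts, highest first.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept = sum(probs[i] for i in chosen)
    return chosen, [probs[i] / kept for i in chosen]


rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(EXPERTS_PER_LAYER)]  # hypothetical router output
experts, weights = route_top_k(logits)

active_per_token = TOP_K * NUM_LAYERS            # 4 * 24
total_experts = EXPERTS_PER_LAYER * NUM_LAYERS   # 32 * 24
print(experts, [round(w, 3) for w in weights])
print(f"{active_per_token}/{total_experts} experts active = {active_per_token / total_experts:.1%}")
```

The last line prints `96/768 experts active = 12.5%`, matching the configuration listed above.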

Attention Mechanism

32 Attention Heads per layer (768 total heads)
8,192 Token Context window
Multi-head Self-Attention with real-time weight visualization
Attention Pattern Analysis showing token-to-token relationships

🚀 Quick Start

Prerequisites

Python: 3.8+ (3.10+ recommended)
GPU: NVIDIA GPU with 16GB+ VRAM (RTX 4080/4090, A4000+)
CUDA: 11.8+ or 12.x with compatible drivers
System RAM: 32GB+ recommended for large models

Installation Options

Option 1: Automated Setup (Recommended)

git clone https://github.com/your-username/openbrain
cd openbrain
python setup.py  # Installs everything automatically

Option 2: Manual Installation

# Install PyTorch with CUDA support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Install OpenBrain dependencies
pip install -r requirements.txt

# Optional: Flash Attention for 2-4x speedup
pip install flash-attn --no-build-isolation

Running OpenBrain

Cloud Deployment

export MODEL_PATH="/path/to/gpt-oss-20b"  # Point to your model
python openbrain_server.py                # Starts on port 8888
open http://localhost:8888                # View in browser (macOS; use xdg-open on Linux)

Fallback Mode (Smaller Model)

# If GPT-OSS-20B is not available, the server falls back to gpt2 automatically
python openbrain_server.py                # Automatically handles model fallback
open http://localhost:8888                # View in browser

Manual Installation

If the automated setup fails, install the dependencies by hand using the commands under Option 2: Manual Installation above.

Configuration

Environment Variables

MODEL_PATH: Path to the GPT-OSS-20B model (if the model is not available, the server falls back to gpt2; see Fallback Mode)
MAX_GPU_MEM_GB: Maximum GPU memory to use (default: "19GB")
CUDA_VISIBLE_DEVICES: GPU device selection
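
How the server actually parses these variables is not documented here. The following is a minimal sketch of one way to read them, assuming `MAX_GPU_MEM_GB` may be given either as `"19GB"` or as a bare number; the function name `load_config` and the returned keys are hypothetical.

```python
import os


def load_config(env=None):
    """Read OpenBrain-style settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env

    # Accept both "19GB" and "19"; "19GB" is the documented default.
    raw_mem = env.get("MAX_GPU_MEM_GB", "19GB").strip().upper()
    if raw_mem.endswith("GB"):
        raw_mem = raw_mem[:-2]

    devices = env.get("CUDA_VISIBLE_DEVICES")
    return {
        "model_path": env.get("MODEL_PATH"),  # None means "not set"
        "max_gpu_mem_gb": int(raw_mem),
        # Kept as strings because CUDA also accepts device UUIDs here.
        "gpu_ids": devices.split(",") if devices else None,
    }


print(load_config({"MODEL_PATH": "/path/to/gpt-oss-20b", "CUDA_VISIBLE_DEVICES": "0,1"}))
```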

Model Loading

The server supports various model formats:

Hugging Face Transformers format
4-bit quantization (BitsAndBytes)
Flash Attention 2 (when available)
Multi-GPU distribution
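
As a rough illustration of how these options map onto the Hugging Face Transformers API, the sketch below builds the keyword arguments for `from_pretrained`. The helper names `build_load_kwargs` and `load_model` are assumptions, not OpenBrain functions, and whether 4-bit loading or Flash Attention 2 works for a given checkpoint depends on the model and the installed library versions.

```python
def build_load_kwargs(max_gpu_mem="19GB", num_gpus=1, use_flash_attn=False):
    """Collect keyword arguments for from_pretrained (no heavy imports needed)."""
    kwargs = {
        "device_map": "auto",  # let Accelerate place layers across visible GPUs
        "max_memory": {gpu: max_gpu_mem for gpu in range(num_gpus)},
    }
    if use_flash_attn:
        kwargs["attn_implementation"] = "flash_attention_2"
    return kwargs


def load_model(model_path, **options):
    """Load a causal LM in 4-bit; needs transformers, accelerate and bitsandbytes."""
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    quant = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
    return AutoModelForCausalLM.from_pretrained(
        model_path, quantization_config=quant, **build_load_kwargs(**options)
    )


print(build_load_kwargs(num_gpus=2, use_flash_attn=True))
```

Keeping the argument construction separate from the actual load makes it easy to inspect what would be passed before committing GPU memory.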

How It Works

MoE Routing Hooks

The system registers forward hooks on each MoE layer to capture:

Router probability distributions
Selected expert indices
Routing decisions per token

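import time
from dataclasses import dataclass

@dataclass
class ExpertActivation:
    layer: int
    selected_experts: list
    router_probs: list
    timestamp: float

class MoEHook:
    def __init__(self, layer_idx, store):
        self.layer_idx, self.store = layer_idx, store  # store: shared list of activations

    def __call__(self, module, input, output):
        # Extract routing decisions (assumes the layer output exposes these fields)
        router_probs = output.router_probs.detach().cpu().numpy()
        selected_experts = output.selected_experts.detach().cpu().numpy()

        # Store for visualization
        self.store.append(ExpertActivation(
            layer=self.layer_idx,
            selected_experts=selected_experts.tolist(),
            router_probs=router_probs.tolist(),
            timestamp=time.time(),
        ))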
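
Registering such a hook on every MoE layer could look like the sketch below. It relies only on `named_modules()` and `register_forward_hook()`, which PyTorch's `nn.Module` provides; the helper names, the `make_hook` factory, and the `is_moe_layer` predicate are hypothetical, since how OpenBrain identifies MoE layers is not shown here.

```python
def register_moe_hooks(model, make_hook, is_moe_layer):
    """Attach one forward hook per MoE layer and return the removable handles."""
    handles = []
    for name, module in model.named_modules():
        if is_moe_layer(name, module):
            layer_idx = len(handles)  # number MoE layers in traversal order
            handles.append(module.register_forward_hook(make_hook(layer_idx)))
    return handles


def remove_hooks(handles):
    """Detach all hooks, for example when the visualization session ends."""
    for handle in handles:
        handle.remove()
```

Here `make_hook(layer_idx)` is expected to return a callable with the `(module, input, output)` hook signature, and `is_moe_layer(name, module)` decides which submodules to instrument, for instance by matching a module name suffix specific to the model.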

Attention Visualization

Attention hooks capture multi-head attention patterns:

Per-head attention weights
Token-to-token attention maps
Attention pattern evolution
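
A token-to-token attention map for one head is a square matrix whose row i shows how strongly token i attends to each token. The sketch below computes such a map with scaled dot-product attention on toy vectors. In OpenBrain the weights would come from the model's attention hooks; the function name and inputs here are purely illustrative.

```python
import math


def attention_weights(queries, keys, causal=True):
    """Return a seq_len x seq_len map; row i is token i's attention distribution."""
    dim = len(queries[0])
    rows = []
    for i, q in enumerate(queries):
        # Scaled dot product between this query and every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        if causal:
            # A token may not attend to positions after itself.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows


toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three made-up token vectors
for row in attention_weights(toy, toy):
    print([round(w, 3) for w in row])
```

Each row sums to 1, and with the causal mask the entries above the diagonal are zero, which is the triangular shape typically seen in decoder attention heatmaps.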

Real-time Streaming

WebSocket connection streams:

Generated tokens
Expert activations per token
Attention patterns
Performance metrics

Performance Optimization

Memory Management

4-bit Quantization: Reduces weight memory by ~75% relative to FP16
Gradient Checkpointing: Trades compute for memory (only applies when gradients are computed, not to plain inference)
Model Sharding: Distributes the model across multiple GPUs
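
The ~75% figure for 4-bit quantization follows from simple arithmetic on the weights alone. The sketch below works it through for 20 billion parameters; it ignores quantization metadata, activations, and the KV cache, so real usage will be somewhat higher, and the function name is illustrative.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate size of the weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


PARAMS = 20e9  # GPT-OSS-20B, as stated under Model Support

fp16_gb = weight_memory_gb(PARAMS, 16)
int4_gb = weight_memory_gb(PARAMS, 4)
saving = 1 - int4_gb / fp16_gb

print(f"FP16: {fp16_gb:.0f} GB, 4-bit: {int4_gb:.0f} GB, saving: {saving:.0%}")
```

This prints `FP16: 40 GB, 4-bit: 10 GB, saving: 75%`, which is why 4-bit weights can fit on a 16GB GPU while FP16 weights cannot.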

Inference Speed

Flash Attention 2: 2-4x faster attention computation
KV Caching: Reuses previous computations
Mixed Precision: FP16/BF16 for faster computation

Hardware Requirements

Minimum Requirements

GPU: 16GB VRAM (RTX 4080, A4000)
CPU: 8+ cores
RAM: 32GB system memory
Storage: 100GB for model weights

Recommended Requirements

GPU: 24GB+ VRAM (RTX 4090, A5000, A6000)
CPU: 16+ cores
RAM: 64GB system memory
Storage: NVMe SSD for model loading

Multi-GPU Setup

For larger deployments:

export CUDA_VISIBLE_DEVICES=0,1,2,3
python openbrain_server.py

API Reference

WebSocket Messages

Client → Server:

{
    "type": "generate",
    "prompt": "Your prompt here"
}

Server → Client:

{
    "type": "token",
    "token": "generated",
    "expert_activations": [
        {
            "layer": 0,
            "selected_experts": [1, 7, 15, 23],
            "router_probs": [0.4, 0.3, 0.2, 0.1],
            "timestamp": 1234567890.123
        }
    ],
    "attention_activations": [...]
}
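
A client needs to serialize the request and unpack the per-layer routing data from each reply. The following sketch does both with the standard `json` module, using only the fields shown in the messages above; the helper names are hypothetical, and the contents of `attention_activations` are left untouched because their schema is not specified here.

```python
import json


def build_generate_request(prompt):
    """Serialize a Client -> Server 'generate' message."""
    return json.dumps({"type": "generate", "prompt": prompt})


def parse_token_message(raw):
    """Return (token, {layer: [(expert_index, router_prob), ...]})."""
    msg = json.loads(raw)
    if msg.get("type") != "token":
        raise ValueError(f"unexpected message type: {msg.get('type')!r}")
    routing = {
        act["layer"]: list(zip(act["selected_experts"], act["router_probs"]))
        for act in msg.get("expert_activations", [])
    }
    return msg["token"], routing


# Sample reply built from the fields in the Server -> Client message format.
sample = json.dumps({
    "type": "token",
    "token": "generated",
    "expert_activations": [{
        "layer": 0,
        "selected_experts": [1, 7, 15, 23],
        "router_probs": [0.4, 0.3, 0.2, 0.1],
        "timestamp": 1234567890.123,
    }],
})
token, routing = parse_token_message(sample)
print(token, routing)
```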

REST Endpoints

GET /: Serve main application
GET /health: Health check and system status
GET /assets/*: Static file serving

Troubleshooting

Common Issues

CUDA Out of Memory:

export MAX_GPU_MEM_GB="12GB"  # Reduce memory usage

Model Loading Fails:

# Use smaller model for testing
export MODEL_PATH="microsoft/DialoGPT-medium"

WebSocket Connection Issues:

Check firewall settings
Verify that port 8888 is available
Try a different browser
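
One quick way to verify that the port is available is to try binding to it. This is a generic standard-library sketch, not an OpenBrain utility; the function name is made up, and it checks only the loopback interface by default.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if nothing is currently bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False  # typically "address already in use"
        return True


print("port 8888 free:", port_is_free(8888))
```

If this reports `False` while the server is not running, another process is holding the port and should be stopped before OpenBrain is started.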

Debug Mode

Enable detailed logging:

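export PYTHONPATH=.
# Configure root logging at DEBUG, then run the server script in the same process.
# A generic workaround: the server may still apply its own logging settings.
python -c "import logging, runpy; logging.basicConfig(level=logging.DEBUG); runpy.run_path('openbrain_server.py', run_name='__main__')"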

Development

Project Structure

openbrain/
├── openbrain_server.py    # Main server
├── openbrain.html         # Frontend
├── requirements.txt       # Dependencies
├── setup.py              # Installation script
├── assets/               # Static assets
└── README.md            # This file

Contributing

Fork the repository
Create a feature branch
Add tests for new functionality
Submit a pull request

Testing

# Test with smaller fallback model
export MODEL_PATH="gpt2"
python openbrain_server.py

License

MIT License - see LICENSE file for details.

Citation

If you use OpenBrain in your research:

@software{openbrain2024,
    title={OpenBrain: Real-time MoE Visualization for Large Language Models},
    author={OpenBrain Team},
    year={2024},
    url={https://github.com/your-repo/openbrain}
}

Acknowledgments

GPT-OSS-20B model architecture
Hugging Face Transformers library
FastAPI for the web framework
PyTorch for the deep learning backend
