PranavViswanath/brain_gpt

OpenBrain 🧠

The MRI for Large Language Models

OpenBrain provides real-time visualization of neural network internals in GPT-OSS-20B and other Mixture-of-Experts models. Watch how 768 expert networks collaborate across 24 transformer layers as the model processes language and generates responses.

✨ Features

  • 🔬 Real-time MoE Visualization: See expert routing decisions as they happen, token by token
  • 🧠 Attention Pattern Analysis: Visualize multi-head attention weights and patterns
  • ⚡ Production-Ready Architecture: Built for real hardware with GPU optimization
  • 📡 WebSocket Streaming: Real-time updates with sub-100 ms latency
  • 📊 Live Performance Metrics: Track GPU memory, throughput, and activation patterns
  • 🎯 Educational Interface: Learn how modern AI models actually work inside

🏗️ Technical Architecture

Model Support

  • Primary: GPT-OSS-20B (20 billion parameters)
  • Fallback: Any Hugging Face Transformers model with MoE layers
  • Architecture: 24 Transformer layers with Mixture-of-Experts

Mixture-of-Experts Configuration

  • 768 Total Experts (32 experts per layer)
  • Top-4 Expert Routing (4 of 32 experts per layer, i.e. 12.5%, so only 96 of 768 experts are active per token)
  • Dynamic Load Balancing across expert networks
  • Specialized Expert Functions: Math, language, reasoning, code generation
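The routing arithmetic above can be sketched in a few lines. This is a generic top-k router for a single token in a single layer, not OpenBrain's or the model's actual implementation; the function name `route_top_k` and the constants are illustrative assumptions. Renormalizing the softmax over the kept experts gives the same weights as applying softmax to the top-k logits alone.

```python
import math

def route_top_k(router_logits, k=4):
    """Return (expert_indices, renormalized_probs) for one token in one layer."""
    # Softmax over all experts (shift by the max for numerical stability).
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the k most probable experts and renormalize their weights to sum to 1.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept = sum(probs[i] for i in top)
    return top, [probs[i] / kept for i in top]

# Figures taken from the configuration listed above.
EXPERTS_PER_LAYER, NUM_LAYERS, TOP_K = 32, 24, 4
active_per_token = TOP_K * NUM_LAYERS          # 96 experts touched per token
active_fraction = TOP_K / EXPERTS_PER_LAYER    # 0.125, i.e. 12.5%
```

The per-token expert count (96) is simply the top-k value multiplied by the number of layers, since each layer routes independently.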

Attention Mechanism

  • 32 Attention Heads per layer (768 total heads)
  • 8,192 Token Context window
  • Multi-head Self-Attention with real-time weight visualization
  • Attention Pattern Analysis showing token-to-token relationships
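As a rough illustration of what a token-to-token attention map contains, the sketch below computes causal scaled dot-product attention weights for one head in plain Python. It is a textbook formulation under simplifying assumptions (a single head, no sliding window or other model-specific details), and `causal_attention_weights` is a hypothetical helper name.

```python
import math

def causal_attention_weights(queries, keys):
    """softmax(Q K^T / sqrt(d)) with a causal mask, for a single head."""
    d = len(queries[0])
    weights = []
    for i, q in enumerate(queries):
        # Token i may only attend to tokens 0..i.
        scores = [sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        # Pad masked (future) positions with zeros so every row has equal length.
        weights.append([e / total for e in exps] + [0.0] * (len(keys) - i - 1))
    return weights
```

Each row of the result is one query token's distribution over the tokens it can see, which is the kind of matrix a heatmap view would render.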

🚀 Quick Start

Prerequisites

  • Python: 3.8+ (3.10+ recommended)
  • GPU: NVIDIA GPU with 16GB+ VRAM (RTX 4080/4090, A4000+)
  • CUDA: 11.8+ or 12.x with compatible drivers
  • System RAM: 32GB+ recommended for large models

Installation Options

Option 1: Automated Setup (Recommended)

git clone https://github.com/your-username/openbrain
cd openbrain
python setup.py  # Installs everything automatically

Option 2: Manual Installation

# Install PyTorch with CUDA support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Install OpenBrain dependencies
pip install -r requirements.txt

# Optional: Flash Attention for 2-4x speedup
pip install flash-attn --no-build-isolation

Running OpenBrain

Cloud Deployment

export MODEL_PATH="/path/to/gpt-oss-20b"  # Point to your model
python openbrain_server.py                # Starts on port 8888
open http://localhost:8888                # View in browser

Fallback Mode (Smaller Model)

# If GPT-OSS-20B is not available, the server falls back to gpt2 automatically
python openbrain_server.py                # Automatically handles model fallback
open http://localhost:8888                # View in browser
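One way such a fallback could be structured is sketched below. It is an assumption about the control flow, not the server's actual code: `load_with_fallback` is a hypothetical helper, and `load_fn` stands in for whatever loader is used (for example `AutoModelForCausalLM.from_pretrained`, which raises `OSError` when a model cannot be found).

```python
def load_with_fallback(load_fn, primary, fallback="gpt2"):
    """Try `primary` first; if it is unset or fails to load, use `fallback`."""
    if primary:
        try:
            return primary, load_fn(primary)
        except (OSError, ValueError) as err:
            print(f"Could not load {primary!r} ({err}); falling back to {fallback!r}")
    # Errors from loading the fallback itself are deliberately not swallowed.
    return fallback, load_fn(fallback)
```

Returning the name alongside the model lets the frontend show which model is actually being visualized.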

Configuration

Environment Variables

  • MODEL_PATH: Path to the GPT-OSS-20B model (if unset or unavailable, the server falls back to gpt2)
  • MAX_GPU_MEM_GB: Maximum GPU memory to use (default: "19GB")
  • CUDA_VISIBLE_DEVICES: GPU device selection

Model Loading

The server supports various model formats:

  • Hugging Face Transformers format
  • 4-bit quantization (BitsAndBytes)
  • Flash Attention 2 (when available)
  • Multi-GPU distribution

How It Works

MoE Routing Hooks

The system registers forward hooks on each MoE layer to capture:

  • Router probability distributions
  • Selected expert indices
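  • Routing decisions per token

import time
from dataclasses import dataclass

@dataclass
class ExpertActivation:
    layer: int
    selected_experts: list
    router_probs: list
    timestamp: float

class MoEHook:
    def __init__(self, layer_idx):
        self.layer_idx = layer_idx
        self.activations = []  # Filled in as tokens pass through the layer

    def __call__(self, module, input, output):
        # Extract routing decisions; cast to float32 because NumPy has no bfloat16
        router_probs = output.router_probs.detach().float().cpu().numpy()
        selected_experts = output.selected_experts.detach().cpu().numpy()
        # Store for visualization
        self.activations.append(ExpertActivation(
            layer=self.layer_idx,
            selected_experts=selected_experts.tolist(),
            router_probs=router_probs.tolist(),
            timestamp=time.time(),
        ))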
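Once routing records have been captured, they can be aggregated into per-layer expert usage counts, which is the kind of summary a load-balancing view needs. The sketch below assumes each record is a dictionary with `layer` and a flat `selected_experts` list for one token, mirroring the WebSocket message format; `expert_usage` is a hypothetical helper.

```python
from collections import Counter, defaultdict

def expert_usage(activations):
    """Count how often each expert was selected, grouped by layer."""
    usage = defaultdict(Counter)
    for act in activations:
        usage[act["layer"]].update(act["selected_experts"])
    return usage

records = [
    {"layer": 0, "selected_experts": [1, 7, 15, 23]},
    {"layer": 0, "selected_experts": [7, 8, 15, 30]},
    {"layer": 1, "selected_experts": [0, 2, 4, 6]},
]
usage = expert_usage(records)
busiest = usage[0].most_common(2)  # the two most-used experts in layer 0
```

A strongly skewed count in one layer would suggest that a few experts handle most tokens for the given prompt.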

Attention Visualization

Attention hooks capture multi-head attention patterns:

  • Per-head attention weights
  • Token-to-token attention maps
  • Attention pattern evolution
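One simple way to summarize a captured attention map is the entropy of each query token's attention distribution: low entropy means the head focuses on a few tokens, high entropy means it spreads attention broadly. This metric is an illustrative suggestion, not something the source says OpenBrain computes, and the function names are hypothetical.

```python
import math

def attention_entropy(row):
    """Shannon entropy (in bits) of one query token's attention weights."""
    return -sum(p * math.log2(p) for p in row if p > 0)

def mean_head_entropy(attention_map):
    """Average entropy over all query tokens of a single head."""
    return sum(attention_entropy(row) for row in attention_map) / len(attention_map)
```

Tracking this value per head over the course of generation gives a compact view of how attention patterns evolve.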

Real-time Streaming

The WebSocket connection streams:

  • Generated tokens
  • Expert activations per token
  • Attention patterns
  • Performance metrics

Performance Optimization

Memory Management

  • 4-bit Quantization: Reduces weight memory by ~75% relative to FP16
  • Gradient Checkpointing: Trades compute for memory
  • Model Sharding: Distributes across multiple GPUs
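The ~75% figure follows from simple arithmetic on weight storage, sketched below. The estimate covers weights only and ignores activations, the KV cache, and the small overhead of quantization scales, so real savings will be somewhat lower; `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in GB (10^9 bytes); weights only."""
    return num_params * bits_per_param / 8 / 1e9

fp16_gb = weight_memory_gb(20e9, 16)  # 40.0 GB for 20 billion parameters
int4_gb = weight_memory_gb(20e9, 4)   # 10.0 GB
saving = 1 - int4_gb / fp16_gb        # 0.75, i.e. about 75%
```

This also shows why a 16GB GPU needs quantization for a 20-billion-parameter model: the FP16 weights alone would not fit.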

Inference Speed

  • Flash Attention 2: 2-4x faster attention computation
  • KV Caching: Reuses previous computations
  • Mixed Precision: FP16/BF16 for faster computation

Hardware Requirements

Minimum Requirements

  • GPU: 16GB VRAM (RTX 4080, A4000)
  • CPU: 8+ cores
  • RAM: 32GB system memory
  • Storage: 100GB for model weights

Recommended Requirements

  • GPU: 24GB+ VRAM (RTX 4090, A5000, A6000)
  • CPU: 16+ cores
  • RAM: 64GB system memory
  • Storage: NVMe SSD for model loading

Multi-GPU Setup

For larger deployments:

export CUDA_VISIBLE_DEVICES=0,1,2,3
python openbrain_server.py
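When a model is sharded with Hugging Face Transformers, `from_pretrained` accepts `device_map="auto"` together with a `max_memory` dictionary keyed by GPU index. The sketch below builds such a dictionary from the environment settings; whether OpenBrain does exactly this is an assumption, and `build_max_memory` is a hypothetical helper.

```python
def build_max_memory(visible_devices, per_gpu_limit="19GB"):
    """Map each visible GPU's local index to a memory cap."""
    ids = [d for d in visible_devices.split(",") if d.strip()]
    # Inside the process, visible devices are renumbered 0..n-1
    # regardless of their physical IDs.
    return {local_index: per_gpu_limit for local_index in range(len(ids))}

max_memory = build_max_memory("0,1,2,3")
# Could then be passed as:
# from_pretrained(path, device_map="auto", max_memory=max_memory)
```

Note that the keys are local indices, so `CUDA_VISIBLE_DEVICES=2,5` still yields keys 0 and 1.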

API Reference

WebSocket Messages

Client → Server:

{
    "type": "generate",
    "prompt": "Your prompt here"
}

Server → Client:

{
    "type": "token",
    "token": "generated",
    "expert_activations": [
        {
            "layer": 0,
            "selected_experts": [1, 7, 15, 23],
            "router_probs": [0.4, 0.3, 0.2, 0.1],
            "timestamp": 1234567890.123
        }
    ],
    "attention_activations": [...]
}
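A client might build the request and decode token messages along these lines. The field names come from the messages above; the helper names are hypothetical, and the sketch deliberately ignores `attention_activations` because its structure is not specified here.

```python
import json

def make_generate_request(prompt):
    """Serialize a Client → Server 'generate' message."""
    return json.dumps({"type": "generate", "prompt": prompt})

def parse_token_message(raw):
    """Return (token, {layer: [(expert, prob), ...]}) or None for other types."""
    msg = json.loads(raw)
    if msg.get("type") != "token":
        return None
    experts_by_layer = {
        act["layer"]: list(zip(act["selected_experts"], act["router_probs"]))
        for act in msg.get("expert_activations", [])
    }
    return msg["token"], experts_by_layer
```

Pairing each expert index with its router probability up front keeps the rendering code free of index bookkeeping.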

REST Endpoints

  • GET /: Serve main application
  • GET /health: Health check and system status
  • GET /assets/*: Static file serving

Troubleshooting

Common Issues

CUDA Out of Memory:

export MAX_GPU_MEM_GB="12GB"  # Reduce memory usage

Model Loading Fails:

# Use smaller model for testing
export MODEL_PATH="microsoft/DialoGPT-medium"

WebSocket Connection Issues:

  • Check firewall settings
  • Verify port 8888 is available
  • Try a different browser
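To check whether port 8888 is free before starting the server, a small standard-library probe such as the following can help. It is a generic sketch (`port_is_free` is a hypothetical helper) that simply tries to bind the port on the local interface.

```python
import socket

def port_is_free(port, host="127.0.0.1"):
    """Return True if nothing is currently bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False
    return True
```

Calling `port_is_free(8888)` while the server is stopped should return `True`; `False` indicates that another process already holds the port.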

Debug Mode

Enable detailed logging:

export PYTHONPATH=.
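# Configure DEBUG-level logging, then run the server script
python -c "import logging, runpy; logging.basicConfig(level=logging.DEBUG); runpy.run_path('openbrain_server.py', run_name='__main__')"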

Development

Project Structure

openbrain/
├── openbrain_server.py    # Main server
├── openbrain.html         # Frontend
├── requirements.txt       # Dependencies
├── setup.py              # Installation script
├── assets/               # Static assets
└── README.md            # This file

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Submit a pull request

Testing

# Test with smaller fallback model
export MODEL_PATH="gpt2"
python openbrain_server.py

License

MIT License - see LICENSE file for details.

Citation

If you use OpenBrain in your research:

@software{openbrain2024,
    title={OpenBrain: Real-time MoE Visualization for Large Language Models},
    author={OpenBrain Team},
    year={2024},
    url={https://github.com/your-repo/openbrain}
}

Acknowledgments

  • GPT-OSS-20B model architecture
  • Hugging Face Transformers library
  • FastAPI for web framework
  • PyTorch for deep learning backend
