A modular, local-first platform for document analysis and investigative research
Philosophy | Architecture | Features | Quick Start | Security | Production | Shards | Documentation
SHATTERED isn't a product - it's a platform. The shards are the products. Or rather, bundles of shards configured for specific use cases.
Core Principles:
- Build domain-agnostic infrastructure that supports domain-specific applications
- Lower the bar for contribution so non-coders can build custom shards
- Provide utility to people in need, not just those who can pay
- Local-first: Your data never leaves your machine unless you want it to
- Privacy-preserving: No telemetry, no cloud dependencies, full data sovereignty
Every investigative workflow follows the same fundamental pattern:
INGEST --> EXTRACT --> ORGANIZE --> ANALYZE --> ACT
| | | | |
| | | | +-- Export, Generate, Notify
| | | +-- ACH, Contradictions, Patterns, Anomalies
| | +-- Timeline, Graph, Matrix, Provenance
| +-- Entities, Claims, Events, Relationships
+-- Documents, Data, Communications, Records
- Core shards handle INGEST and EXTRACT
- Domain shards handle ORGANIZE and ANALYZE
- Output shards handle ACT
SHATTERED uses the Voltron architectural philosophy: a modular, plug-and-play system where self-contained shards combine into a unified application.
+------------------+
| ArkhamFrame | <-- THE FRAME (immutable core)
| (17 Services) |
+--------+---------+
|
+--------+---------+
| arkham-shell | <-- THE SHELL (UI renderer)
| (React/TypeScript)|
+--------+---------+
|
+--------------------+--------------------+
| | | | |
+----v----+ +--v--+ +-----v-----+ +--v--+ +---v----+
|Dashboard| | ACH | |  Search   | |Graph| |Timeline| <-- SHARDS (25)
+---------+ +-----+ +-----------+ +-----+ +--------+
- Frame is Immutable: Shards depend on the Frame, never the reverse
- No Shard Dependencies: Shards communicate via events, not imports
- Schema Isolation: Each shard gets its own PostgreSQL schema
- Graceful Degradation: Works with or without AI/GPU capabilities
- Event-Driven Architecture: Loose coupling through pub/sub messaging
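The pub/sub principle above can be sketched in a few lines. This is a minimal in-process illustration of the pattern, not the Frame's actual EventBus API — the topic name and handler signature are assumptions:

```python
from collections import defaultdict
from typing import Callable

class EventBus:
    """Minimal pub/sub sketch: shards react to topics, never import each other."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, payload: dict) -> None:
        # Deliver to every handler registered for this topic.
        for handler in self._subscribers[topic]:
            handler(payload)

# A downstream shard reacts to ingest events without importing the ingest shard.
bus = EventBus()
received: list[dict] = []
bus.subscribe("document.ingested", received.append)
bus.publish("document.ingested", {"doc_id": 42, "path": "report.pdf"})
```

Because publishers never name their consumers, a shard can be removed without breaking the ones that remain.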
| Feature | Description |
|---|---|
| AI Junior Analyst | LLM-powered analysis across all shards - anomaly detection, contradiction finding, pattern recognition, credibility assessment, and insight synthesis |
| LLM Summarization | Automatic document and corpus summarization with multiple formats (brief, standard, detailed, executive, key points) |
| Deception Detection | AI-assisted credibility assessment using MOM, POP, MOSES, and EVE checklists |
| Query Expansion | Semantic search enhancement via LLM |
| Devil's Advocate | AI-generated counter-arguments for ACH analysis |
| Technique | Capabilities |
|---|---|
| ACH (Analysis of Competing Hypotheses) | Full matrix analysis, evidence scoring, premortem analysis, cone of plausibility, corpus search integration, scenario planning, devil's advocate mode |
| Contradiction Detection | Automated identification of conflicting claims across documents with severity scoring and resolution tracking |
| Pattern Recognition | Recurring patterns, behavioral patterns, temporal patterns, correlation analysis with statistical significance |
| Anomaly Detection | Statistical anomalies, contextual anomalies, collective anomalies with LLM-powered analysis |
| Credibility Assessment | Source reliability scoring, bias indicators, deception detection checklists |
| Provenance Tracking | Evidence chains, data lineage, audit trails, artifact verification |
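To make the statistical-anomaly row concrete, here is a simplified z-score check — a stand-in sketch for the kind of test the Anomalies shard runs, not its actual implementation:

```python
from statistics import mean, stdev

def zscore_anomalies(values: list[float], threshold: float = 3.0) -> list[float]:
    """Flag values more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]

# A 95 among values near 10 is flagged at a 2-sigma threshold.
print(zscore_anomalies([10, 11, 9, 10, 12, 10, 95], threshold=2.0))  # [95]
```

Contextual and collective anomaly detection build on the same idea but condition on surrounding data rather than the global distribution.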
Graph Analysis - 10+ visualization modes:
| Mode | Description |
|---|---|
| Force-Directed | Interactive network layout with physics simulation |
| Hierarchical | Tree-based layouts (top-down, bottom-up, radial) |
| Circular | Entities arranged in circular patterns |
| Sankey | Flow diagrams showing relationships and quantities |
| Matrix | Adjacency matrix for dense relationship analysis |
| Geographic | Map overlays with Leaflet integration |
| Causal | Cause-and-effect relationship visualization |
| Argumentation | ACH integration showing evidence-hypothesis relationships |
| Link Analysis | i2 Analyst Notebook-style investigation graphs |
| Temporal | Time-based graph evolution |
Graph Analytics:
- Centrality measures (degree, betweenness, closeness, eigenvector, PageRank)
- Community detection algorithms
- Path finding (shortest path, all paths, critical paths)
- Cycle detection
- Component analysis
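Degree centrality, the simplest of the measures above, can be computed directly from an edge list. The entity names here are purely illustrative:

```python
from collections import defaultdict

# Hypothetical entity graph; names are illustrative only.
edges = [
    ("Acme Corp", "J. Smith"),
    ("J. Smith", "Shell LLC"),
    ("Shell LLC", "Acme Corp"),
    ("J. Smith", "Offshore Bank"),
]

adjacency: dict[str, set[str]] = defaultdict(set)
for a, b in edges:
    adjacency[a].add(b)
    adjacency[b].add(a)

n = len(adjacency)
# Degree centrality: fraction of the other nodes each node connects to.
degree_centrality = {node: len(nbrs) / (n - 1) for node, nbrs in adjacency.items()}
print(max(degree_centrality, key=degree_centrality.get))  # J. Smith
```

Betweenness, eigenvector, and PageRank centrality answer different questions (brokerage, influence, random-walk importance) on the same adjacency structure.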
Timeline Analysis:
- Temporal event extraction and visualization
- Date normalization across formats
- Conflict detection for overlapping events
- Phase/period management
- Gap analysis
- Event clustering
| Stage | Capabilities |
|---|---|
| Ingest | Multi-format support (PDF, DOCX, images, HTML, TXT), batch processing, duplicate detection, job queue management |
| OCR | PaddleOCR for standard OCR, Vision LLM for complex documents (supports local Qwen-VL or cloud APIs like GPT-4o), language detection, confidence scoring |
| Parse | 8 chunking strategies, metadata extraction, relations extraction, table detection |
| Embed | Multiple embedding models, batch processing, incremental updates |
| Entity Extraction | spaCy-powered NER (PERSON, ORG, GPE, DATE, etc.), relationship detection, duplicate merging |
| Claim Extraction | Factual claim identification, source attribution, verification status tracking |
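Of the eight chunking strategies in the Parse stage, fixed-size with overlap is the simplest to show. This is a sketch of the technique, not the ChunkService's actual code:

```python
def chunk_fixed(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Fixed-size chunking with overlap so context straddles chunk borders."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

# With size=4 and overlap=1, each chunk repeats the last character of the previous one.
print(chunk_fixed("abcdefghij", size=4, overlap=1))  # ['abcd', 'defg', 'ghij']
```

The semantic and sentence strategies replace the fixed character window with boundaries derived from meaning or punctuation, trading simplicity for better retrieval.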
| Type | Description |
|---|---|
| Semantic Search | Vector similarity using pgvector embeddings |
| Keyword Search | PostgreSQL full-text search with BM25 ranking |
| Hybrid Search | Combined semantic + keyword with configurable weights |
| Similarity Search | Find documents similar to a reference document |
| Faceted Search | Filter by project, document type, date range, entities |
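Hybrid search with configurable weights reduces to a weighted blend of the two backends' scores. A minimal sketch, assuming each backend has already normalized its scores to [0, 1]:

```python
def hybrid_scores(semantic: dict[str, float], keyword: dict[str, float],
                  w_sem: float = 0.7, w_kw: float = 0.3) -> list[tuple[str, float]]:
    """Blend semantic and keyword scores; docs missing from one backend score 0 there."""
    docs = set(semantic) | set(keyword)
    blended = {d: w_sem * semantic.get(d, 0.0) + w_kw * keyword.get(d, 0.0)
               for d in docs}
    return sorted(blended.items(), key=lambda kv: kv[1], reverse=True)

# doc1 wins on semantic weight even though doc2 tops the keyword results.
ranked = hybrid_scores({"doc1": 0.9, "doc2": 0.4}, {"doc2": 1.0, "doc3": 0.8})
print(ranked[0][0])  # doc1
```

Shifting the weights toward `w_kw` favors exact-term matches; shifting toward `w_sem` favors conceptual similarity.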
| Feature | Formats |
|---|---|
| Data Export | JSON, CSV, PDF, DOCX |
| Analytical Reports | Investigation summaries, entity profiles, timeline reports, ACH reports |
| Letters | FOIA requests, complaints, legal correspondence with templates |
| Packets | Complete investigation bundles with versioning and sharing |
| Templates | Jinja2-based template system with placeholder validation |
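Placeholder validation for Jinja2 templates can use Jinja2's own AST inspection. This sketch shows the underlying mechanism, not the Templates shard's API; the letter text is invented:

```python
from jinja2 import Environment, meta

env = Environment()

def undeclared_placeholders(template_source: str) -> set[str]:
    """Return the variable names a template expects before rendering it."""
    return meta.find_undeclared_variables(env.parse(template_source))

src = "Dear {{ recipient }}, your FOIA request {{ request_id }} was received."
print(sorted(undeclared_placeholders(src)))  # ['recipient', 'request_id']
```

Comparing this set against the data supplied for a letter or report catches missing fields before a half-filled document is generated.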
The Frame provides 17 core services available to all shards:
| Service | Description |
|---|---|
| ConfigService | Environment + YAML configuration management |
| ResourceService | Hardware detection, GPU/CPU management, tier assignment |
| StorageService | File/blob storage with categories and lifecycle |
| DatabaseService | PostgreSQL with per-shard schema isolation |
| VectorService | pgvector-based vector storage for embeddings and similarity search |
| LLMService | OpenAI-compatible LLM integration (LM Studio, Ollama, vLLM) |
| ChunkService | 8 text chunking strategies (semantic, sentence, fixed, etc.) |
| EventBus | Pub/sub messaging for inter-shard communication |
| WorkerService | PostgreSQL-based job queues (SKIP LOCKED) with specialized worker pools |
| DocumentService | Document CRUD with content and metadata access |
| EntityService | Entity extraction, relationships, and deduplication |
| ProjectService | Project organization and management |
| ExportService | Multi-format export (JSON, CSV, PDF, DOCX) |
| TemplateService | Jinja2 template rendering and management |
| NotificationService | Email, webhook, and log notifications |
| SchedulerService | APScheduler-based job scheduling |
| AIJuniorAnalystService | LLM-powered cross-shard analysis |
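The SKIP LOCKED pattern that WorkerService relies on is worth seeing in SQL form. The table and column names below are assumptions for illustration, not the Frame's actual schema:

```python
# Atomically claim one pending job. FOR UPDATE SKIP LOCKED lets concurrent
# workers skip rows another transaction has already locked, so no two
# workers ever claim the same job and none of them block each other.
CLAIM_NEXT_JOB_SQL = """
UPDATE jobs
SET status = 'running', started_at = now()
WHERE id = (
    SELECT id FROM jobs
    WHERE queue = %(queue)s AND status = 'pending'
    ORDER BY created_at
    FOR UPDATE SKIP LOCKED
    LIMIT 1
)
RETURNING id, payload;
"""
```

This is why plain PostgreSQL can replace a dedicated broker like Redis: the database itself provides safe concurrent dequeueing.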
| Shard | Description | Key Features |
|---|---|---|
| Dashboard | System monitoring and administration | Service health, database stats, worker management, event log, LLM configuration |
| Projects | Project organization | Project CRUD, document organization, bulk operations |
| Settings | Application configuration | 7 setting categories, import/export, reset capabilities |
| Shard | Description | Key Features |
|---|---|---|
| Ingest | Document ingestion | Multi-format support, batch processing, job queue, duplicate detection |
| Documents | Document management | CRUD operations, content access, metadata, batch operations |
| Parse | Document parsing | 8 chunking strategies, relations extraction, table detection |
| Embed | Vector embeddings | Multiple models, batch processing, incremental updates |
| OCR | Text extraction | PaddleOCR, Vision LLM (local or cloud), language detection, confidence scoring |
| Entities | Entity management | NER extraction, relationships, deduplication, type management |
| Shard | Description | Key Features |
|---|---|---|
| Search | Document search | Semantic, keyword, hybrid search, facets, suggestions |
| Shard | Description | Key Features |
|---|---|---|
| ACH | Analysis of Competing Hypotheses | Matrix analysis, premortem, cone of plausibility, corpus search, scenarios |
| Claims | Claim extraction | Document extraction, verification status, source attribution |
| Credibility | Source assessment | Reliability scoring, bias detection, deception checklists (MOM/POP/MOSES/EVE) |
| Contradictions | Conflict detection | Cross-document analysis, severity scoring, resolution tracking |
| Anomalies | Anomaly detection | Statistical, contextual, collective anomalies, LLM analysis |
| Patterns | Pattern recognition | Recurring, behavioral, temporal, correlation patterns |
| Provenance | Evidence chains | Data lineage, audit trails, artifact verification |
| Summary | Auto-summarization | Multiple summary types, batch processing, auto-summarize on ingest |
| Shard | Description | Key Features |
|---|---|---|
| Graph | Network visualization | 10+ layout modes, analytics, cross-shard integration |
| Timeline | Temporal visualization | Event extraction, date normalization, phases, gap detection |
| Shard | Description | Key Features |
|---|---|---|
| Export | Data export | JSON, CSV, PDF, DOCX, job management |
| Reports | Report generation | Multiple report types, templates, scheduling |
| Letters | Letter generation | FOIA, complaints, legal templates |
| Packets | Investigation bundles | Versioning, sharing, access control |
| Templates | Template management | Jinja2 syntax, versioning, validation |
| Component | Technology |
|---|---|
| Runtime | Python 3.10+ |
| API Framework | FastAPI with async/await |
| Database | PostgreSQL 14+ with pgvector extension |
| Job Queue | PostgreSQL (SKIP LOCKED pattern) |
| Vector Store | pgvector (PostgreSQL extension) |
| Component | Technology |
|---|---|
| Framework | React 18 + TypeScript 5 |
| Build Tool | Vite |
| Styling | TailwindCSS + shadcn/ui |
| Icons | Lucide React |
| State | URL state + local storage |
| Charts | Recharts |
| Maps | Leaflet |
| Component | Options |
|---|---|
| LLM Inference | LM Studio, Ollama, vLLM, OpenAI API |
| NER | spaCy (en_core_web_sm/lg/trf) |
| OCR | PaddleOCR, Vision LLM (Qwen-VL local, or cloud GPT-4o/Claude) |
| Embeddings | sentence-transformers, OpenAI |
- Python 3.10+
- Node.js 18+ (for local UI development only)
- PostgreSQL 14+ with pgvector extension
# Clone the repository
git clone https://github.com/yourusername/SHATTERED.git
cd SHATTERED
# Install the Frame
cd packages/arkham-frame
pip install -e .
# Install all shards (or select specific ones)
for dir in ../arkham-shard-*/; do
pip install -e "$dir"
done
# Install spaCy model
python -m spacy download en_core_web_sm
# Install UI dependencies
cd ../arkham-shard-shell
npm install

Create a .env file or set environment variables:
# Required - PostgreSQL with pgvector extension
DATABASE_URL=postgresql://user:pass@localhost:5432/shattered
# Optional - LLM Integration (OpenAI-compatible endpoint)
LLM_ENDPOINT=http://localhost:1234/v1
LLM_API_KEY=your-api-key
# Optional - Embedding Model (default: all-MiniLM-L6-v2)
EMBED_MODEL=all-MiniLM-L6-v2
# Optional - Vision LLM for OCR
VLM_ENDPOINT=http://localhost:1234/v1
# Optional - Auth (required for production)
AUTH_SECRET_KEY=generate-with-openssl-rand-hex-32

# Terminal 1: Start the Frame API (auto-discovers installed shards)
python -m uvicorn arkham_frame.main:app --host 127.0.0.1 --port 8100
# Terminal 2 (optional): Start the UI for development
cd packages/arkham-shard-shell
npm run dev
# Workers start automatically with the Frame
# Background jobs use PostgreSQL SKIP LOCKED pattern

# Copy environment template
cp .env.example .env
# Generate a secure auth key
python -c "import secrets; print('AUTH_SECRET_KEY=' + secrets.token_urlsafe(32))"
# Add the output to your .env file
# Start all services (PostgreSQL + App)
docker compose up -d
# Access the application
open http://localhost:8100

The Docker setup includes:
- PostgreSQL 14 with pgvector extension pre-installed
- All shards and the UI bundled in a single container
- Automatic database migrations on startup
- No external dependencies (Redis, Qdrant not required)
First-time Setup: When you first access the application, you'll be prompted to create an admin account. This sets up your tenant and initial credentials.
SHATTERED includes built-in authentication and multi-tenant support.
- Start the application using Docker or manual installation
- Navigate to the app - you'll be redirected to the setup wizard
- Create your tenant - enter organization name and admin credentials
- Log in with your new admin account
| Role | Capabilities |
|---|---|
| Admin | Full access: user management, settings, audit logs |
| Analyst | Read/write access to all analysis features |
| Viewer | Read-only access to documents and analyses |
Add these to your .env file:
# REQUIRED: Generate with: python -c "import secrets; print(secrets.token_urlsafe(32))"
AUTH_SECRET_KEY=your-secure-random-key
# Optional: JWT token lifetime (default: 3600 seconds = 1 hour)
JWT_LIFETIME_SECONDS=3600
# Optional: Rate limiting
RATE_LIMIT_DEFAULT=100/minute
RATE_LIMIT_UPLOAD=20/minute
RATE_LIMIT_AUTH=10/minute
# Optional: CORS origins (comma-separated, defaults to localhost)
CORS_ORIGINS=https://your-domain.com

Admins can manage users at Settings → Users:
- Create new users with email, password, and role
- Edit user roles and display names
- Deactivate/reactivate accounts
- View user activity
All security-relevant actions are logged:
- User creation, updates, deletion
- Role changes
- Login attempts (coming soon)
View the audit log at Settings → Audit Log (admin only).
For production deployments with HTTPS, SHATTERED includes Traefik integration.
- A domain name pointing to your server
- Ports 80 and 443 open for Let's Encrypt verification
- Docker and Docker Compose installed
# 1. Configure environment
cp .env.example .env
# Edit .env and set:
# - AUTH_SECRET_KEY (generate a secure key)
# - DOMAIN=your-domain.com
# - [email protected]
# 2. Create certificate storage
mkdir -p traefik
touch traefik/acme.json
chmod 600 traefik/acme.json
# 3. Start with HTTPS
docker compose -f docker-compose.yml -f docker-compose.traefik.yml up -d

- Automatic HTTPS via Let's Encrypt (auto-renewing)
- HTTP → HTTPS redirect for all traffic
- Security headers (HSTS, CSP, X-Frame-Options)
- Modern TLS (TLS 1.2+ only, strong ciphers)
# Check HTTPS is working
curl -I https://your-domain.com
# Check HTTP redirects
curl -I http://your-domain.com
# Should return 301 redirect to HTTPS
# Check security headers
curl -I https://your-domain.com | grep -i "strict-transport"

To enable the Traefik dashboard:
# Generate password hash
htpasswd -nb admin your-password
# Add to .env
TRAEFIK_DASHBOARD=true
TRAEFIK_DASHBOARD_AUTH=admin:$apr1$... # output from htpasswd
# Access at https://traefik.your-domain.com

SHATTERED is 100% air-gap capable when configured correctly. Deploy in isolated networks with no internet access.
- PostgreSQL 14+ with pgvector extension
- Local LLM server (LM Studio, Ollama, or vLLM)
- Pre-cached embedding models
On a connected machine:
# Start the application
docker compose up -d
# Navigate to Settings → ML Models
# Download desired embedding models using the UI
# Models are cached in: ~/.cache/huggingface/hub

Copy the cache directory to your air-gapped system.
# .env for air-gapped deployment
DATABASE_URL=postgresql://user:pass@localhost:5432/shattered
# Enable offline mode (prevents model download attempts)
ARKHAM_OFFLINE_MODE=true
# Custom model cache location (if different from default)
ARKHAM_MODEL_CACHE=/path/to/huggingface/hub
# Local LLM endpoint
LLM_ENDPOINT=http://localhost:1234/v1
# Vision LLM for OCR (optional)
VLM_ENDPOINT=http://localhost:1234/v1

| Server | Default Endpoint | Notes |
|---|---|---|
| LM Studio | http://localhost:1234/v1 | GUI-based, easy setup |
| Ollama | http://localhost:11434/v1 | CLI-based, many models |
| vLLM | http://localhost:8000/v1 | High-performance serving |
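Because all three servers speak the OpenAI wire format, a single request builder covers them; only the base URL changes. A stdlib-only sketch — the model name is illustrative, and the `api_key` default reflects that local servers typically ignore it:

```python
import json
import urllib.request

def chat_completion_request(endpoint: str, model: str, prompt: str,
                            api_key: str = "not-needed-locally") -> urllib.request.Request:
    """Build a POST against any OpenAI-compatible /chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
    }
    return urllib.request.Request(
        f"{endpoint.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
        method="POST",
    )

req = chat_completion_request("http://localhost:1234/v1", "qwen2.5-7b-instruct",
                              "Summarize this memo.")
# urllib.request.urlopen(req) would send it -- requires a running server.
```

Swapping LM Studio for Ollama or vLLM means changing only the endpoint argument, which is why `LLM_ENDPOINT` is the sole required setting.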
| Feature | Air-Gap Status | Notes |
|---|---|---|
| Document Processing | Full | PDF, DOCX, images, etc. |
| OCR | Full | PaddleOCR works offline |
| Vision LLM OCR | Full | Requires local VLM (e.g., Qwen-VL) |
| Embeddings | Full | Pre-cache models first |
| Semantic Search | Full | pgvector is fully local |
| Entity Extraction | Full | spaCy models are local |
| LLM Analysis | Full | Requires local LLM server |
| Geo View | Limited | Requires internet for map tiles* |
*The Geo View tab in the Graph page fetches map tiles from OpenStreetMap. For full air-gap operation:
- Avoid using the Geo View tab (all other graph views work offline)
- For advanced users: set up a local tile server with offline OpenStreetMap data
After deployment, verify no external connections:
# Monitor established network connections (Linux)
ss -tun | grep ESTAB
# Or use netstat
netstat -an | grep ESTABLISHED
# The only connections should be to:
# - localhost (PostgreSQL, LLM server)
# - Your local network (if applicable)

SHATTERED supports diverse investigative workflows:
- Social Media Analysis: Archive posts, extract entities, map networks
- FOIA Tracking: Request templates, deadline tracking, response analysis
- Source Verification: Credibility assessment, claim verification, contradiction detection
- Publication Prep: Claim extraction, citation tracing, fact-check reports
- Tenant Defense: Violation chronology, housing code matching, evidence packets
- Employment Rights: Incident documentation, labor law elements, EEOC prep
- Consumer Protection: Warranty extraction, demand letters, small claims prep
- Case Building: Timeline construction, evidence organization, pattern identification
- Chronic Illness Management: Lab results parsing, symptom tracking, treatment analysis
- Insurance Appeals: Denial tracking, appeal letters, medical necessity documentation
- Diagnosis Research: Test organization, symptom progression, specialist prep
- Government Oversight: Meeting minutes parsing, vote tracking, promise vs action analysis
- Campaign Finance: Donor identification, bundling detection, money flow mapping
- Policy Analysis: Document comparison, stakeholder mapping, impact assessment
- Fraud Detection: Benford analysis, transaction anomalies, duplicate detection
- Investment Research: SEC filings analysis, financial statement parsing, news tracking
- Audit Support: Evidence chains, provenance tracking, documentation verification
- Structured Analysis: ACH matrices, alternative hypotheses, scenario planning
- Link Analysis: Entity relationships, network mapping, path analysis
- Temporal Analysis: Event timelines, pattern detection, prediction support
- Use `arkham-shard-ach` as a reference implementation
- Follow the manifest schema in `docs/shard_manifest_schema_prod.md`
- Implement the `ArkhamShard` interface from the Frame
- No direct shard imports - use events for inter-shard communication
- Add comprehensive tests
packages/arkham-shard-{name}/
+-- pyproject.toml # Package definition with entry point
+-- shard.yaml # Manifest (navigation, events, capabilities)
+-- README.md # Documentation
+-- arkham_shard_{name}/
+-- __init__.py # Exports {Name}Shard class
+-- shard.py # Shard implementation
+-- api.py # FastAPI routes
+-- models.py # Pydantic models (optional)
+-- services/ # Business logic (optional)
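A shard's `shard.py` roughly follows this shape. The method names below are a plausible sketch — the authoritative `ArkhamShard` interface lives in the Frame and may differ:

```python
class MyShard:
    """Minimal shard skeleton: receives the Frame, subscribes to events."""

    name = "myshard"

    def __init__(self) -> None:
        self.frame = None

    async def initialize(self, frame) -> None:
        # Store the Frame handle and subscribe to events rather than
        # importing other shards directly (per the no-imports rule).
        self.frame = frame
        frame.events.subscribe("document.ingested", self.on_document)

    async def on_document(self, payload: dict) -> None:
        ...  # react to newly ingested documents

    async def shutdown(self) -> None:
        ...  # release resources, unsubscribe if needed
```

The entry point declared in `pyproject.toml` is what lets the Frame auto-discover installed shards at startup.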
# Run all tests
pytest
# Run specific shard tests
pytest packages/arkham-shard-ach/tests/
# Type checking
mypy packages/arkham-frame/
mypy packages/arkham-shard-ach/
# Linting
ruff check packages/

All shards expose REST APIs that are auto-documented via FastAPI:
from fastapi import APIRouter, Depends
from arkham_frame import get_frame

router = APIRouter(prefix="/api/myshard", tags=["myshard"])

@router.get("/items")
async def list_items(frame=Depends(get_frame)):
    # Access frame services
    db = frame.db
    events = frame.events
    llm = frame.llm  # Optional service
    return {"items": [...]}

| Document | Description |
|---|---|
| SECURITY.md | Security best practices and deployment guide |
| CLAUDE.md | Project guidelines and development standards |
| docs/voltron_plan.md | Architecture deep-dive |
| docs/shard_manifest_schema_prod.md | Production manifest schema |
| packages/arkham-frame/README.md | Frame services documentation |
| packages/arkham-shard-shell/README.md | UI shell documentation |
Each shard has its own README with API documentation, events, and usage examples.
| Metric | Value |
|---|---|
| Lines of Code | ~217,000 |
| Total Packages | 26 (25 shards + shell) |
| Frame Services | 17 |
| API Endpoints | 400+ |
| Graph Visualization Modes | 10+ |
| Chunking Strategies | 8 |
| Infrastructure | PostgreSQL-only (pgvector + SKIP LOCKED) |
- PostgreSQL-only architecture - Eliminated Redis and Qdrant dependencies
- pgvector integration - Native PostgreSQL vector search
- AI Junior Analyst integration across all analysis shards
- Full ACH implementation with premortem, cone of plausibility, corpus search
- Link Analysis mode (i2-style) for graph visualization
- Deception detection with MOM/POP/MOSES/EVE checklists
- Evidence chain provenance tracking
- Shared template system for exports
If you find SHATTERED useful, consider supporting development:
Contributions welcome! See CLAUDE.md for project guidelines.
Ways to contribute:
- Bug Reports: Open issues with reproduction steps
- Feature Requests: Describe your use case
- Code: Follow the shard development guidelines
- Documentation: Help improve guides and examples
- Bundles: Create pre-configured shard bundles for specific use cases
MIT License - Copyright (c) 2025-2026 Justin McHugh
See LICENSE for details.
SHATTERED - Break documents into pieces. Reassemble the truth.
Built for journalists, investigators, advocates, and anyone seeking truth in documents.
