Anti-Fragile Distributed Order Engine
A financial-grade order processing system that guarantees zero data loss, even during crashes.
Features • Architecture • Quick Start • API • Chaos Testing
Most e-commerce systems fail silently. An order is placed, the payment goes through, but a network hiccup loses the confirmation email. Or worse—inventory goes negative because two users bought the "last item" simultaneously.
This project solves these problems the way banks do: with proper financial-grade patterns that guarantee consistency even when services crash mid-operation.
Built for engineers who understand that "it works on my machine" isn't good enough.
Every financial movement creates paired entries. When money moves, we record both sides:
User Account: DEBIT $99.99
Merchant Account: CREDIT $99.99
The books always balance. If they don't, something is very wrong—and we'll know immediately.
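A minimal sketch of the paired-entry idea, assuming amounts are stored as integer cents and the journal lives in memory. The `Entry` and `Ledger` types and the account names are hypothetical illustrations, not the project's actual ledger package.

```go
package main

import (
	"errors"
	"fmt"
)

// Entry is one side of a financial movement. Amounts are integer cents
// (an assumption of this sketch) so that sums are exact.
type Entry struct {
	Account string
	Debit   int64
	Credit  int64
}

// Ledger is a hypothetical in-memory journal; a real one lives in the database.
type Ledger struct {
	entries []Entry
}

// ErrNonPositiveAmount rejects transfers of zero or negative amounts.
var ErrNonPositiveAmount = errors.New("amount must be positive")

// Transfer records both sides of a movement together, so the journal never
// holds a debit without its matching credit.
func (l *Ledger) Transfer(from, to string, cents int64) error {
	if cents <= 0 {
		return ErrNonPositiveAmount
	}
	l.entries = append(l.entries,
		Entry{Account: from, Debit: cents},
		Entry{Account: to, Credit: cents},
	)
	return nil
}

// Balanced reports whether total debits equal total credits.
func (l *Ledger) Balanced() bool {
	var debits, credits int64
	for _, e := range l.entries {
		debits += e.Debit
		credits += e.Credit
	}
	return debits == credits
}

func main() {
	var l Ledger
	if err := l.Transfer("user-123", "merchant-001", 9999); err != nil {
		fmt.Println("transfer failed:", err)
		return
	}
	fmt.Println("balanced:", l.Balanced())
}
```

Because the only way to write to the journal is `Transfer`, an unbalanced ledger can only come from a bug or data corruption, which is what makes the balance check useful as an alarm.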
What happens when 1,000 users click "Buy" on the last iPhone simultaneously?
With a naive implementation: negative inventory, overselling, angry customers, chargebacks.
With Redlock: exactly one user wins. The rest get a graceful "out of stock" message before their payment is even attempted.
lock, err := redlock.Acquire(ctx, "inventory:iphone-15", 10*time.Second)
if err != nil {
	return ErrItemUnavailable
}
defer redlock.Release(ctx, lock)
// Safe to modify inventory here

The classic distributed systems problem:
- Save order to database ✓
- Send confirmation to Kafka ← Kafka crashes here
- User never gets email, data is inconsistent
Our solution: save the event in the same database transaction as the order. A background worker reads this "outbox" table and publishes to Kafka. If Kafka is down, events queue up. When it recovers, everything catches up.
Zero. Data. Loss.
We don't just hope the system is resilient—we prove it. The included chaos testing suite randomly kills services and verifies:
- Ledger still balances
- No orders stuck in limbo
- All events eventually delivered
┌─────────────────────────────────────┐
│ Load Balancer │
└─────────────────┬───────────────────┘
│
┌─────────────────▼───────────────────┐
│ API Server │
│ • Rate Limiting │
│ • Request Validation │
│ • Prometheus Metrics │
└─────────────────┬───────────────────┘
│
┌─────────────────────────────────┼─────────────────────────────────┐
│ │ │
┌──────────▼──────────┐ ┌───────────▼───────────┐ ┌───────────▼───────────┐
│ Order Service │ │ Inventory Service │ │ Payment Service │
│ │ │ │ │ │
│ • Saga Orchestrator│ │ • Redlock Locking │ │ • Provider Abstraction│
│ • State Machine │ │ • Reservation System │ │ • Retry with Backoff │
│ • Compensation │ │ • Expiry Handling │ │ • Refund Support │
└──────────┬──────────┘ └───────────┬───────────┘ └───────────┬───────────┘
│ │ │
└─────────────────────────────────┼─────────────────────────────────┘
│
┌─────────────────▼───────────────────┐
│ Ledger Service │
│ • Double-Entry Accounting │
│ • Balance Calculation │
│ • Consistency Validation │
└─────────────────┬───────────────────┘
│
┌─────────────────────────────────┼─────────────────────────────────┐
│ │ │
┌──────────▼──────────┐ ┌───────────▼───────────┐ ┌───────────▼───────────┐
│ PostgreSQL │ │ Redis Cluster │ │ Kafka │
│ │ │ (3 nodes) │ │ │
│ • Orders │ │ • Distributed Locks │ │ • Event Streaming │
│ • Ledger │ │ • Caching │ │ • Async Processing │
│ • Outbox │ │ │ │ │
└─────────────────────┘ └───────────────────────┘ └───────────────────────┘
- Docker & Docker Compose
- Go 1.21+ (for local development)
- Make (optional, but recommended)
# Clone the repo
git clone https://github.com/yourusername/go-resilient-commerce.git
cd go-resilient-commerce
# Start everything
make docker-up
# Wait for services to be healthy, then run migrations
make migrate
# Seed some test products
make seed

That's it. You now have:
# Create an order
curl -X POST http://localhost:8080/api/v1/orders \
-H "Content-Type: application/json" \
-d '{
"user_id": "user-123",
"currency": "USD",
"items": [
{"product_id": "prod-001", "quantity": 1}
],
"shipping_address": {
"street": "123 Main St",
"city": "San Francisco",
"state": "CA",
"country": "USA",
"postal_code": "94102"
}
}'

The saga orchestrator will:
- Create the order
- Reserve inventory (with Redlock)
- Process payment
- Update ledger (double-entry)
- Publish events to Kafka (via outbox)
The result is all-or-nothing. If any step fails, the steps that already completed are automatically compensated.
| Method | Endpoint | Description |
|---|---|---|
| POST | /api/v1/orders | Create new order |
| GET | /api/v1/orders/{id} | Get order by ID |
| GET | /api/v1/orders?user_id=... | Get user's orders |
| POST | /api/v1/orders/{id}/cancel | Cancel order |

| Method | Endpoint | Description |
|---|---|---|
| GET | /api/v1/products | List all products |
| GET | /api/v1/products/{id} | Get product details |

| Method | Endpoint | Description |
|---|---|---|
| GET | /api/v1/accounts/{id}/balance | Get account balance |
| GET | /api/v1/accounts/{id}/transactions | Get transaction history |
| GET | /api/v1/ledger/validate | Validate ledger consistency |

| Method | Endpoint | Description |
|---|---|---|
| GET | /health | Health check |
| GET | /ready | Readiness probe |
| GET | /metrics | Prometheus metrics |

This simulates 1,000 users simultaneously trying to buy the last item:
make chaos-race

Expected result: exactly one successful purchase. The other 999 get ErrInsufficientInventory.
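To make the expected outcome concrete, here is a single-process sketch of the same race. It is not the project's chaos test: a `sync.Mutex` plays the role that the distributed lock plays across processes, and `Inventory`, `Reserve`, and `Race` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrInsufficientInventory is returned when no stock is left.
var ErrInsufficientInventory = errors.New("insufficient inventory")

// Inventory is an in-memory stand-in; the mutex plays the role that the
// distributed lock plays across processes in a real deployment.
type Inventory struct {
	mu    sync.Mutex
	stock map[string]int
}

// Reserve checks and decrements stock under the lock, so the
// check-then-decrement sequence cannot interleave between buyers.
func (inv *Inventory) Reserve(productID string) error {
	inv.mu.Lock()
	defer inv.mu.Unlock()
	if inv.stock[productID] < 1 {
		return ErrInsufficientInventory
	}
	inv.stock[productID]--
	return nil
}

// Race starts n concurrent buyers and returns how many succeeded.
func Race(inv *Inventory, productID string, n int) int {
	var wg sync.WaitGroup
	var mu sync.Mutex
	wins := 0
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if inv.Reserve(productID) == nil {
				mu.Lock()
				wins++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return wins
}

func main() {
	inv := &Inventory{stock: map[string]int{"prod-001": 1}}
	fmt.Println("successful purchases:", Race(inv, "prod-001", 1000))
}
```

Without the lock in `Reserve`, two goroutines could both read a stock of 1 before either decrements it, which is the overselling scenario the race test is designed to catch.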
This randomly kills services while orders are being processed:
make chaos

The script:
- Creates orders continuously
- Randomly kills the payment service or worker
- Restarts killed services after 2 seconds
- After 60 seconds, validates:
  - Ledger still balances
  - No orphaned orders
  - All outbox events eventually processed
make chaos-ledger

If this ever fails in production, you have a serious problem. (It won't fail.)
go-resilient-commerce/
├── cmd/
│ ├── api/ # REST API server
│ ├── worker/ # Outbox event processor
│ └── chaos/ # Chaos testing CLI
├── internal/
│ ├── api/ # HTTP layer
│ ├── config/ # Configuration
│ ├── domain/ # Business domain
│ ├── inventory/ # Inventory + Redlock
│ ├── ledger/ # Double-entry accounting
│ ├── order/ # Saga orchestration
│ ├── outbox/ # Transactional outbox
│ ├── payment/ # Payment processing
│ └── platform/ # Infrastructure
├── migrations/ # SQL migrations
├── deployments/ # Docker, Prometheus, Grafana
├── scripts/ # Utility scripts
└── tests/ # Integration & benchmark tests
make build # Build all binaries
make test # Run unit tests
make test-integration # Run integration tests
make lint # Run linter
make fmt # Format code
make docker-up # Start infrastructure
make docker-down # Stop infrastructure
make help # Show all commands

# Install dependencies
make deps
# Copy and configure environment
cp .env.example .env
# Run API server
make run-api
# In another terminal, run the worker
make run-worker

| Decision | Why |
|---|---|
| Standard library HTTP | Zero framework overhead. Maximum performance. |
| Decimal for money | Floating-point arithmetic causes financial calculation errors. Always use decimal. |
| UUID everywhere | Sequential IDs leak information and enable enumeration attacks. |
| Serializable isolation | For financial operations, we accept the performance hit for correctness. |
| Saga over 2PC | Two-phase commit doesn't scale. Sagas with compensation do. |
| Outbox over direct publish | Kafka being down shouldn't lose orders. Ever. |
| Redlock over single Redis | A single Redis node is a single point of failure for locking. Redlock only grants a lock when a majority of independent nodes agree. |
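The "Decimal for money" row is easy to demonstrate. The sketch below uses integer cents as one simple exact representation; the `Cents` type is a hypothetical illustration, and the project's own decimal type may work differently.

```go
package main

import "fmt"

// Cents stores money as an integer number of cents. This is one simple
// exact representation; it is an assumption of this sketch, not the
// project's decimal type.
type Cents int64

// String formats the amount as dollars and cents, e.g. $99.99.
func (c Cents) String() string {
	sign := ""
	if c < 0 {
		sign = "-"
		c = -c
	}
	return fmt.Sprintf("%s$%d.%02d", sign, c/100, c%100)
}

// SumFloat adds amount to itself n times using float64 arithmetic.
func SumFloat(amount float64, n int) float64 {
	var total float64
	for i := 0; i < n; i++ {
		total += amount
	}
	return total
}

// SumCents adds amount to itself n times using exact integer arithmetic.
func SumCents(amount Cents, n int) Cents {
	var total Cents
	for i := 0; i < n; i++ {
		total += amount
	}
	return total
}

func main() {
	// Ten payments of ten cents should total exactly one dollar.
	fmt.Println("float64 total is exactly 1.00:", SumFloat(0.10, 10) == 1.0)
	fmt.Println("integer total is exactly 1.00:", SumCents(10, 10) == 100)
	fmt.Println("formatted:", SumCents(10, 10))
}
```

The float64 comparison fails because 0.10 has no exact binary representation, so each addition carries a small rounding error. Integer or decimal arithmetic avoids that class of error.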
Benchmarks run on an M1 MacBook Pro:
BenchmarkDecimalAdd 50000000 25.3 ns/op 0 B/op 0 allocs/op
BenchmarkDecimalMultiply 50000000 31.2 ns/op 0 B/op 0 allocs/op
BenchmarkOrderValidation 10000000 112.0 ns/op 0 B/op 0 allocs/op
Under load testing with 1000 concurrent users:
- Orders/sec: ~2,500 (limited by PostgreSQL)
- P99 latency: <50ms
- Error rate: 0% (excluding expected inventory conflicts)
The included Grafana dashboard shows:
- Orders per second by status (created, fulfilled, cancelled)
- Latency percentiles (P50, P95, P99)
- Outbox queue depth (should be near zero)
- Inventory lock success rate
- Ledger consistency status
If the outbox queue grows or ledger consistency fails, you'll know immediately.
See CONTRIBUTING.md for guidelines.
Short version:
- Fork it
- Create your feature branch
- Write tests
- Make sure `make test` and `make lint` pass
- Open a PR
MIT License. See LICENSE for details.
Built with 💪 for engineers who take reliability seriously.