
The making of an
Ultra Low Latency Trading System
With Java and Golang

Yucong Sun, Staff Software Engineer
Jonathan Ting, Senior Software Engineer
Exchange @ Coinbase
● Takeaways
○ General architecture of an Exchange
○ State of the Art
○ Learnings from optimizing the legacy system
Planetary view of an Exchange
Most users would/should not interact with an Exchange directly.

[Diagram: users trade through a brokerage product API; market makers and trading firms connect to the Exchange directly via its API and market data feeds.]
Orbital view of an Exchange
Order Management System (OMS): balance, risk, margin/liquidations
Matching Engine: order book
API: FIX, HTTP
Market Data: FIX, WebSocket

Hot path: balance check, order matching
Warm path: settlement
Auxiliary: market data feed
Assembly Lines of an Exchange

[Diagram: two trading system processes consume the same ordered stream of orders and emit identical streams of trade events.]
Exhibit A:
Coinbase Derivatives Exchange
https://www.coinbase.com/derivatives
Trading System Logic Isn’t Complex
Hot Path

Submit & match incoming orders against resting orders (the ‘book’)

Public venue - no complex trading relationships

Other logic (timers, admin requests, state)

Affects trading logic, so it must be sequenced with all other events

Trading system assigns IDs to state

Single threaded
Trading System as Deterministic State Machine

State₀ + Input₀ => State₁ + Output₁, ALWAYS

Can snapshot/restore/replay to get to live state (a minimal sketch follows)

Determinism is Tricky!
- Data structure iteration order
- No randomness
- Behavior changes
  - Old input => old behavior
  - Feature flagging

[Diagram: a live trading system at T=100 is equivalent to an old snapshot taken at T=50 replayed with the inputs from T=50..100.]
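To make the snapshot/replay property concrete, here is a minimal Go sketch (my illustration, not the production engine): the same inputs applied to the same starting state always produce the same state, so a restored snapshot plus the old inputs reproduces the live system.

package main

import "fmt"

type Order struct {
	ID    uint64 // IDs are assigned by the state machine itself
	Price int64
	Qty   int64
}

// Engine is a deterministic state machine: no wall clock, no randomness,
// and a slice (not a map) so iteration order is stable.
type Engine struct {
	nextID uint64
	book   []Order
}

// Apply is the only mutation path: State_n + Input_n => State_n+1.
func (e *Engine) Apply(price, qty int64) Order {
	e.nextID++
	o := Order{ID: e.nextID, Price: price, Qty: qty}
	e.book = append(e.book, o)
	return o
}

// Snapshot copies the full state so it can be restored and replayed later.
func (e *Engine) Snapshot() *Engine {
	cp := &Engine{nextID: e.nextID}
	cp.book = append([]Order(nil), e.book...)
	return cp
}

func main() {
	inputs := [][2]int64{{100, 5}, {101, 3}, {99, 7}}

	live := &Engine{}
	live.Apply(inputs[0][0], inputs[0][1]) // inputs up to T=50
	restored := live.Snapshot()            // old snapshot at T=50

	for _, in := range inputs[1:] { // old inputs T=50..100
		live.Apply(in[0], in[1])
		restored.Apply(in[0], in[1])
	}
	// Replay converges on exactly the live state.
	fmt.Println(live.nextID == restored.nextID, len(live.book) == len(restored.book))
}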
Fault Tolerance with RAFT

Aeron Cluster: a high performance RAFT implementation

App has to be deterministic & single threaded

Consensus is batched & pipelined with the application

System throughput = 1 / app processing time

[Diagram: the trading system leader replicates its input log to followers; when the leader fails, a follower takes over.]

What is RAFT? Visualize it here


Persisted RAFT Log

The cluster persists the RAFT log (the input) to disk, as per the protocol

The Aeron Archive API allows replicating the RAFT log for backup

[Diagram: the trading system cluster replicates the RAFT log over UDP to a backup host.]
Replicated RAFT Log
Audit - Upload to cold storage

Logging - Replay & Send to ELK outside hot path

Debugging - Reproduce bugs locally

Fixing - Backfill missing events

Testing - CI/CD replay to avoid regressions


Replicating For Replay

Replicate input, not output

Hot path - multicast output
Other consumers - replicate input & fan out

Output is larger & unbounded: 1 order => a potentially cascading set of events

[Diagram: (1) the trading system replicates its RAFT log to an Aeron Archive over UDP; (2) the archive records the input; (3) a local copy runs the trading logic; (4) it replays the resulting trading events to services over IPC.]
Replicating For Scalability

Binary tree replication

Network Latency bound by log(n)

Bandwidth usage bounded


Entire Hot Path

[Diagram: the order gateway sends an Order Submit to the trading system over UDP and receives an Order Ack.]

1) Parse & validate Order Submit
2) Send request to trading system
3) RAFT consensus
4) Matching algorithm
5) Send order events to gateway
6) Translate Order Ack

= 4 network hops (~20μs) + processing

RTT outliers < 100 μs
RTT medians < 50 μs
Trading system processing times ~ 1 μs
300k/s peak throughput


Hardware Environment for CDE
Colocated in a datacenter with customers

Commodity hardware

❖ Intel Optane drives
  - Faster than enterprise SSDs
  - We can fsync() if needed without too much penalty

❖ Low latency switches
  - 350ns cut-through forwarding
  - Real-time packet capture without a latency hit

❖ Isolated NICs for low latency & bulk traffic


Exhibit B:
Onto the (AWS) Cloud
Cloud
Cons
- Less control over the hardware environment
- Need to maintain both DC and AWS deployments, toolchains, configs…

Pros
- Codification, collaboration
- Good enough performance
- Personal environments
Challenge with Compute/Storage
Machine families: t, m, c, r, z, with suffixes n (enhanced networking), d (local disk)
- Recommendation: https://instances.vantage.sh/

Storage
- EBS vs instance storage

Orchestration
- Recommendation: Nomad
Challenge with AWS Networking
Is there a good switch on AWS?
- Cut-through: < 0.5 μs
- Store & forward: 5 μs - 50 μs
Secrets with AWS Networks
● Understand the spine-leaf networking architecture
  ○ Regions, AZs, sub-AZs, racks
  ○ Avoid load balancers
● Cluster placement groups
  ○ Capacity reservations
● Bad apples (individual hosts with outlier latency)

https://www.xkyle.com/Measuring-AWS-Region-and-AZ-Latency/
Numbers On AWS

RTT outliers < 1 ms
RTT medians < 300 μs
Trading system processing ~ 1 μs

= 10 network hops (~250μs) + processing


Exhibit C:
Deep Dive on Performance Tuning
Fast Memory Access

Memory-Local Data Structures
  Cache locality outweighs O(n)

Primitive-Friendly Data Structures
  No Map<Integer>, avoid boxing/unboxing
  Deserialize from memory directly into primitives

Represent Strings as 2 Longs (see the sketch below)
  128 bits => 18 7-bit chars (ASCII) | 21 6-bit (alphanumeric) | 25 5-bit (alphabetic) | 32 4-bit (hex)

No Allocation on Hot Path
  Object pooling
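A minimal Go sketch of the 7-bit packing (my illustration; the talk's Java version is not shown): up to 18 ASCII characters fit in two uint64 words, so symbol comparisons become two integer compares and no string object is allocated on the hot path.

package main

import "fmt"

// pack squeezes up to 18 ASCII chars (7 bits each) into two uint64 words:
// chars 0..8 into lo, chars 9..17 into hi.
func pack(s string) (lo, hi uint64) {
	for i := 0; i < len(s) && i < 18; i++ {
		c := uint64(s[i]) & 0x7F
		if i < 9 {
			lo |= c << (7 * uint(i))
		} else {
			hi |= c << (7 * uint(i-9))
		}
	}
	return lo, hi
}

// unpack reverses pack; a zero char terminates the string.
func unpack(lo, hi uint64) string {
	b := make([]byte, 0, 18)
	for i := 0; i < 18; i++ {
		var c uint64
		if i < 9 {
			c = (lo >> (7 * uint(i))) & 0x7F
		} else {
			c = (hi >> (7 * uint(i-9))) & 0x7F
		}
		if c == 0 {
			break
		}
		b = append(b, byte(c))
	}
	return string(b)
}

func main() {
	aLo, aHi := pack("BTC-USD")
	bLo, bHi := pack("BTC-USD")
	fmt.Println(aLo == bLo && aHi == bHi) // equality is two integer compares
	fmt.Println(unpack(aLo, aHi))         // "BTC-USD"
}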
Small Messages

Simple Binary Encoding (SBE)
Byte Alignment Matters (see the decode sketch below)
FPGA Deserialization
Order Fields By Size
VarData / Enums / Bitsets at End
Add Padding If Necessary

<types>
    <enum name="Side" encodingType="uint8">
        <validValue name="BUY">0</validValue>
        <validValue name="SELL">1</validValue>
    </enum>
    <type name="ClientOrderId" primitiveType="char" length="32"/>
</types>

<sbe:message name="Order" id="1">
    <field name="orderId" id="1" type="int64"/>
    <field name="price" id="2" type="int64"/>
    <field name="quantity" id="3" type="int32"/>
    <field name="side" id="4" type="Side"/>
    <field name="clientOrderId" id="5" type="ClientOrderId"/>
</sbe:message>
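To make the byte-alignment point concrete, a hand-rolled Go sketch (my illustration, not generated SBE code; the offsets follow the Order schema above): every field is read at a fixed offset straight out of the receive buffer, with the largest fields first so each stays naturally aligned.

package main

import (
	"encoding/binary"
	"fmt"
)

// Fixed offsets for the Order message above: two int64s, an int32, a uint8
// enum, then the 32-byte char field at the end.
const (
	offOrderID  = 0
	offPrice    = 8
	offQuantity = 16
	offSide     = 20
	offClOrdID  = 21
	msgLen      = offClOrdID + 32
)

func main() {
	buf := make([]byte, msgLen)
	binary.LittleEndian.PutUint64(buf[offOrderID:], 42)
	binary.LittleEndian.PutUint64(buf[offPrice:], 5_000_000)
	binary.LittleEndian.PutUint32(buf[offQuantity:], 10)
	buf[offSide] = 0 // Side.BUY
	copy(buf[offClOrdID:], "client-1")

	// Decode directly into primitives: no intermediate structs, no allocation.
	orderID := int64(binary.LittleEndian.Uint64(buf[offOrderID:]))
	price := int64(binary.LittleEndian.Uint64(buf[offPrice:]))
	qty := int32(binary.LittleEndian.Uint32(buf[offQuantity:]))
	fmt.Println(orderID, price, qty, buf[offSide])
}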


Java Challenges - Warmup

~10k invocations of a function => JIT compilation

Regulated exchange - cannot “warm up” our code

Azul Zulu Prime JVM - ReadyNow!
  Caches and persists the JIT profile + optimizations
  Pre-train new releases with multiple replays of PROD logs

Fast initial orders, removes JIT compilation jitter


Java Challenges - Garbage Collection

“Stop The World” GC - all application threads stalled

Java 8 - Concurrent Mark Sweep

Azul Zulu Prime JVM - pauseless garbage collection (the Azul C4 collector)
Network Optimizations

Multicast
  Consensus
  Output to order and market data gateways

Aeron - High Performance Messaging
  Reliable transport over UDP
  Per-channel settings
  Congestion & flow control

Socket buffers - data in flight should ideally equal the bandwidth-delay product (e.g. 10 Gbit/s × a 50 μs RTT ≈ 62.5 KB in flight)
MTU - jumbo frames (9k) for batching
Network Optimizations

[Diagram: kernel receive path NIC → Driver → Softirq → Application; send path NIC → Driver → Qdisc → Application.]

Aeron point-to-point: sending as fast as possible on AWS
Linux Kernel Bypass

[Diagram: NIC → Application directly, skipping the kernel.]

Read from the network card directly from user space

Decreases the median, drastically reduces outliers

OpenOnload in the data center w/ SolarFlare NICs
DPDK in the cloud - Aeron support (premium)

              Mean    Max      Throughput
non-DPDK      38μs    1897μs   80MB/s
DPDK          28μs    515μs    500MB/s
Medians Good, Outliers Spiky

Weeks before launch

OS scheduling delay / context switches: where are CPU cycles going when they are not running your hot threads?

/proc/interrupts - per-CPU hardware interrupt counts
/proc/softirqs - per-CPU softirq counts
/proc/sched_debug - task running time per CPU

/proc/<tid>/schedstat - time on CPU | time on runqueue | # of time slices:
4200925624037 12872240906155 780539850
4200966662712 12872278642290 780547937
4201007606214 12872323980891 780556132
4201046361274 12872441023508 780564249

perf - get thread runtime or counts individually on a given CPU:
# perf record -e "sched:sched_stat_runtime" -C <core id>
# perf script | awk '{print $1 }' | sort | uniq -c
     15 kworker/3:1H-kb
      1 kworker/3:2-cgr
      3 perf
      1 rcu_sched
  12356 sender
Recommendation: Netdata

A nice visual, holistic view of the system:
- per-CPU interrupts / softirqs / utilization
- network, memory, disk, filesystem


OS Scheduling
Pin hot threads to hardcoded CPUs (taskset, sched_setaffinity) - see the sketch after this list
  Prevents context switching & cache misses

Isolate hot CPUs or prioritize threads (isolcpus, taskset, cpusets, nice, chrt)
  Prevents other user threads from taking CPU time

Busy-spin hot threads to monopolize the CPU (and for polling)

Set affinities for hardware interrupts, kernel workqueues, etc.
  Hardware interrupts - use tuna, or set /proc/irq/<irq#>/smp_affinity
  Softirq kernel params - rcu_nocbs, nohz_full
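A minimal Go sketch of thread pinning (assumes Linux and the golang.org/x/sys/unix package; the CPU number is an arbitrary example). Production setups would combine this with isolcpus/cpusets so nothing else is scheduled on the core.

package main

import (
	"fmt"
	"runtime"

	"golang.org/x/sys/unix"
)

// pinToCPU wires the calling goroutine to one OS thread, then wires that
// thread to a single CPU - the in-process equivalent of taskset.
func pinToCPU(cpu int) error {
	runtime.LockOSThread()
	var set unix.CPUSet
	set.Zero()
	set.Set(cpu)
	return unix.SchedSetaffinity(0, &set) // 0 = current thread
}

func main() {
	if err := pinToCPU(3); err != nil {
		fmt.Println("pin failed:", err)
		return
	}
	// A hot loop would busy-spin here, never migrating off its core.
}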
Other Tuning

NUMA locality
  If you have multiple CPU sockets, one is closer to the NIC and memory
  Layout matters - lock hot threads to that CPU / memory NUMA node

Hyperthreading
  Turn it off (or isolate the corresponding logical CPU)
  More available L1/L2 cache without it
Exhibit D:
Applying the Learnings to Improve the Legacy System
Where the real fun begins…
Fun with Microservices

[Diagram: a tangle of services - API-FIX, OEGW, DB, Admin, Feed Proxy, REST Gateway, Clearing, API-FEED, API2, API Jobs, GoJobs, Clearing-Core, Trading Engine.]

Solution: Another dashboard???
Life of a Request
Tracing a single order placement request from start to finish

[Diagram: (1) Client → FIX, (2) FIX → OEGW, (3) OEGW → DB, (4) OEGW → Trading Engine, (5) Trading Engine → Feed Proxy, (6) Feed Proxy → Client.]
E2E Order Placement Latency

[Diagram: a per-stage latency graph - wait for previous order to finish; FIX-to-OEGW RPC (2); OEGW RPC handler: DB connection hold (3), SQL (client-side and server-side); TE RPC (4): RAFT replication, FSM; send to Feed Proxy (5); FP receive; wait for feed message (6). Graph everything!]

Beware:
- Client-side view vs server-side view
- E2E view vs per-unit view
- Tracing sampling
Happy Path: min/p50

~1200μs: elevated, but not that outrageous (vs ~50μs in the datacenter)

Infra inefficiencies - 1000μs → 600μs
- Compute/storage
- Network latency
- Cross-AZ traffic
- Load balancer
- fsync()s

Per-operation cost - 30μs vs 1μs
- Full native, no warmup issue
- Allocations, pointers
- Metrics recording / logging

Do you know how often your Datadog metrics call is sending a UDP packet out?
Is it just misplaced fsync()s?

FAST: FIX Receive → RAFT → Balance Check → Order Process → FIX Send
  (batched fsync on Optane, or no fsync() here)

SLOW: FIX Receive → RAFT → Balance Check → Order Process → FIX Send
  (with a DB write and a non-batched fsync() in the path)

fsync() costs ~500μs to 1ms on AWS hardware (a batching sketch follows)
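A minimal Go sketch of the fast pattern (my illustration; the file name and records are arbitrary): write the whole batch, then pay for one fsync(), instead of one fsync() per record.

package main

import (
	"bufio"
	"log"
	"os"
)

func main() {
	f, err := os.OpenFile("raft.log", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// SLOW: f.Sync() after every record costs ~500μs-1ms each on AWS.
	// FAST: buffer a batch, flush once, sync once.
	w := bufio.NewWriter(f)
	for _, rec := range []string{"order-1\n", "order-2\n", "order-3\n"} {
		if _, err := w.WriteString(rec); err != nil {
			log.Fatal(err)
		}
	}
	if err := w.Flush(); err != nil { // hand the batch to the kernel
		log.Fatal(err)
	}
	if err := f.Sync(); err != nil { // one fsync amortized over the batch
		log.Fatal(err)
	}
}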


Pointer & Memory Allocations in Golang
Heap escape analysis (-gcflags "-m") flags, among others (demo below):
- Sending pointers, or values containing pointers, to channels
- Storing pointers, or values containing pointers, in a slice, like []*string
- Backing arrays of slices that get reallocated because an append would exceed their capacity
- Calling methods on an interface type

Passing a small struct by value can be ~8x faster than passing it by pointer, which moves it to the heap (x86_64 has a 64-byte cache line).

https://segment.com/blog/allocation-efficiency-in-high-performance-go-services/
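A small demo of the rules above (my example; build with go build -gcflags "-m" to see the compiler's escape decisions):

package main

type Point struct{ X, Y int64 }

var sink = make(chan *Point, 1)

// The returned pointer outlives the frame: "-m" reports it escapes to the heap.
func newPoint() *Point { return &Point{1, 2} }

// Sending a pointer to a channel forces a heap allocation too.
func publish() { sink <- &Point{3, 4} }

// A small struct passed by value stays on the stack - it fits in a couple of
// registers and never touches the allocator or the GC.
func sum(p Point) int64 { return p.X + p.Y }

func main() {
	_ = newPoint()
	publish()
	_ = sum(Point{5, 6})
}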
Unhappy Path: p99/max

P99 ~4ms, Max 362ms

WTF is going on…
- GC pause?
- Scheduling delays?
- Non-FIFO behaviors?
Is Golang GC really an issue?

https://malloc.se/blog/zgc-jdk16
https://www.azul.com/sites/default/files/images/c4_paper_acm.pdf
https://go.dev/blog/ismmkeynote
https://tip.golang.org/doc/gc-guide
Hint: Goroutine explosion by gRPC
Golang gRPC unary requests create a new goroutine for every request by default; this causes starvation of any background goroutines and leads to tail latencies. (A hedged workaround sketch follows.)

34041 goroutines???
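One mitigation (my sketch, not what the team shipped; the cap of 64 is arbitrary): a unary server interceptor that bounds how many request goroutines run at once, so background goroutines still get scheduled.

package main

import (
	"context"
	"net"

	"google.golang.org/grpc"
)

// sem caps concurrently-running unary handlers.
var sem = make(chan struct{}, 64)

func limitConcurrency(
	ctx context.Context,
	req interface{},
	info *grpc.UnaryServerInfo,
	handler grpc.UnaryHandler,
) (interface{}, error) {
	sem <- struct{}{}        // acquire; blocks when the cap is reached
	defer func() { <-sem }() // release
	return handler(ctx, req)
}

func main() {
	lis, err := net.Listen("tcp", ":50051")
	if err != nil {
		panic(err)
	}
	s := grpc.NewServer(grpc.UnaryInterceptor(limitConcurrency))
	// register services here, then:
	_ = s.Serve(lis)
}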
Hint: Goroutine scheduler delay
A goroutine is not your good old thread:

- The Go scheduler multiplexes goroutines onto OS threads

- GOMAXPROCS = num CPUs by default

- Remember: only GOMAXPROCS goroutines will run at the same time
Visualizing how API-FIX works

[Diagram: FIX sessions (Session1: S1 BTC-USD, S1 ETH-USD; Session2; …) are shuffled across goroutines; the network poller, TCP connections and GRPC.Invoke calls all fight over the same contended resource: runnable goroutine slots.]

REMEMBER: only GOMAXPROCS goroutines will run at any given time
Visualizing how OEGW works

[Diagram: FIX messages arrive over TCP; 1 RPC = 1 goroutine; the shuffled DB connection pool and TCP connections are the contended resources; goroutine scheduling is random, not FIFO.]

https://github.com/golang/go/issues/31708
Visualizing how Trading Engine works

[Diagram: inputCh → raftChan → ApplyCh → OutCh; RAFT, the network poller and the TCP connections all compete for goroutine slots.]

REMEMBER: only GOMAXPROCS goroutines will run at any given time
Mitigations: spinning important goroutines

Blocking read:

select {
case item := <-ch:
	_ = item // process item
}

Busy-spinning read:

for {
	select {
	case item := <-ch:
		_ = item // process item
	default:
		// busy spinning
		continue
	}
}

Note: the Golang scheduler will force-preempt long-running goroutines every 10ms.

Challenges: can't spin too much, as you will run out of CPU and cause starvation.

runtime.LockOSThread()
Mitigations: always batch when using channels

for {
	select {
	case item := <-bufCh: // first read
		items := make([]int, 0, 20)
		items = append(items, item)
		// remaining: grab outstanding messages while you are there
	drain:
		for i := 0; i < 19; i++ {
			select {
			case item := <-bufCh:
				items = append(items, item)
			default:
				break drain
			}
		}
		_ = items // process items
	default:
		continue // don't forget spinning!
	}
}

Why does this work?
- Avoids scheduler delays
- Better cache locality
Realization: Golang is optimized for throughput
Most facilities in Golang and Linux introduce an element of randomness to optimize for throughput, not latency:

- Go encourages you (and libraries) to spawn ad-hoc goroutines everywhere

- No goroutine priorities, and the scheduler is randomized and work stealing

Writing low latency code in Golang is not easy, but then again it's not easy anywhere else either.

Recommendation: use gRPC in streaming mode, not unary mode! (A sketch of why follows.)
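Why streaming helps, sketched with plain channels standing in for gRPC (my illustration, not the real API): unary mode spawns a goroutine per request, while a stream is one long-lived goroutine draining requests in FIFO order, keeping the goroutine count flat and the scheduler predictable.

package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	reqs := make(chan int, 1024)
	var wg sync.WaitGroup

	// "Streaming" style: one goroutine owns the whole request stream.
	wg.Add(1)
	go func() {
		defer wg.Done()
		for r := range reqs {
			_ = r * 2 // handle each request strictly in arrival order
		}
	}()

	for i := 0; i < 1000; i++ {
		reqs <- i // "unary" style would have spawned ~1000 goroutines here
	}
	close(reqs)
	wg.Wait()
	fmt.Println("goroutines:", runtime.NumGoroutine()) // stays small
}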


Latency Cost Rankings

…“let's add this part or the process step in case we need it”… the most common error of a smart engineer is to optimize the thing that should not exist…
- Elon Musk on engineering, interviewed by Tim Dodd

< 1μs         kernel syscall overhead
~ 1μs         optimized application logic cost
~ 5μs         kernel context switching cost
~ 5μs         per network hop on low latency hardware
~ 25μs        per network hop on AWS hardware
~ 30μs        per message unoptimized application logic cost
~ 50-100μs    RT kernel scheduler delay [0]
< 100μs       fsync on Optane
~ 250μs       Golang GC pauses
~ 1ms         fsync on AWS instance storage
~ N ms        non-RT kernel scheduler delay [0]
~ N-NN ms     Golang scheduler delays

[0] https://bristot.me/files/research/papers/ecrts2020/slides.pdf
