The making of an
Ultra Low Latency Trading System
With Java and Golang
Yucong Sun, Staff Software Engineer
Jonathan Ting, Senior Software Engineer
Exchange @ Coinbase
● Takeaways
○ General architecture of an Exchange
○ State of the Art
○ Learnings from optimizing the legacy system
Planetary view of an Exchange
Most users would/should not interact with an Exchange directly
[Diagram] Users trade through a Brokerage product; Market Makers / Trading Firms connect to the Exchange directly via its API and Market Data feeds.
Orbital view of an Exchange
Order Management System (OMS): balances, risk, margin/liquidations
Matching Engine: order book
API: FIX, HTTP
Market Data: FIX, WebSocket
Hot path: balance check, order matching
Warm path: settlement
Auxiliary: market data feed
Assembly Lines of an Exchange
[Diagram] Orders flow into a trading system process and come out as trade events; Process 1 and Process 2 run the same assembly line over the same orders.
Exhibit A:
Coinbase Derivatives Exchange
https://www.coinbase.com/derivatives
Trading System Logic Isn't Complex
Hot Path
- Submit & match incoming orders against resting orders (the 'book')
- Public market - no complex trading relationships
Other logic (timers, admin requests, state)
- Affects trading logic, so it must be sequenced with every other event
Trading system assigns IDs to state
Single threaded
Trading System as Deterministic State Machine
State₀ + Input₀ => State₁ + Output₁ ALWAYS
Can snapshot/restore/replay to get to live state
Determinism is Tricky!
- Data structure iteration order
- No randomness
- Behavior changes: old input => old behavior (feature flagging)
[Diagram] Old Snapshot (T=50) + Old Input (T=50..100) is equivalent to live state at T=100
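The deterministic contract is small enough to sketch. Below is a minimal Go sketch of the idea (names and types are illustrative, not the actual engine): the engine is a pure, single-threaded function of (state, input), so an old snapshot plus the inputs logged after it always reproduces the live state.

// Illustrative sketch of a deterministic, single-threaded state machine.
package engine

// Input is one sequenced entry from the RAFT log: an order, a timer tick,
// or an admin request. Everything that can affect state goes through here.
type Input struct {
	Seq     uint64
	Payload []byte
}

// Output is whatever applying one input produced (acks, fills, market data).
type Output struct {
	Seq    uint64
	Events [][]byte
}

// Engine holds all trading state. Apply must not read wall clocks, random
// sources, or anything else outside (state, input).
type Engine struct {
	seq uint64
	// ... order books, balances, ID counters assigned by the engine itself ...
}

// Apply: State(n) + Input(n) => State(n+1) + Output(n+1), always.
func (e *Engine) Apply(in Input) Output {
	e.seq = in.Seq
	// deterministic matching / admin / timer logic goes here
	return Output{Seq: in.Seq}
}

// Replay rebuilds live state: Snapshot(T=50) + Inputs(T=50..100) == live state at T=100.
func Replay(snap *Engine, log []Input) *Engine {
	for _, in := range log {
		snap.Apply(in)
	}
	return snap
}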
Fault Tolerance with RAFT
Aeron Cluster - high performance RAFT implementation
- App has to be deterministic & single threaded
- Consensus is batched & pipelined with the application
- System throughput = 1 / app processing time
[Diagram] Leader and follower trading systems; if the leader fails, a follower takes over
What is RAFT? Visualize it here
Persisted RAFT Log
Cluster persists the RAFT log (input) to disk, as per the protocol
Aeron Archive API allows replicating the RAFT log for backup
[Diagram] Trading system cluster replicates its RAFT log over UDP to a backup
Replicated RAFT Log
- Audit - upload to cold storage
- Logging - replay & send to ELK outside the hot path
- Debugging - reproduce bugs locally
- Fixing - backfill missing events
- Testing - CI/CD replay to avoid regressions
Replicating For Replay
Replicate input, not output
- Hot path - multicast the output
- Everything else - replicate the input (RAFT log) & fan out over UDP / IPC
- Output is larger & unbounded: 1 order => potentially a cascading set of events
[Diagram] The trading system replicates its RAFT log over UDP to an Aeron Archive; a local trading system replays the log, re-running the trading logic to record and reproduce the trading events delivered over IPC to downstream services
Replicating For Scalability
Binary tree replication
- Network latency bounded by log(n)
- Bandwidth usage bounded
Entire Hot Path
[Diagram] Order Gateway <-> Trading System over UDP: Order Submit in, Order Ack out
1) Parse & validate Order Submit
2) Send request to trading system
3) RAFT consensus
4) Matching algorithm
5) Send order events to gateway
6) Translate Order Ack
RTT medians < 50 μs = 4 network hops (~20 μs) + processing
RTT outliers < 100 μs
Trading system processing time ~ 1 μs
300k/s peak throughput
Hardware Environment for CDE
Colocated in datacenter with customers
Commodity hardware
❖ Intel Optane Drives
Faster than enterprise SSDs
We can fsync if needed without too much penalty
❖ Low Latency Switches
350ns cut-through forwarding
Real-time packet capture without latency hit
Isolated NICs for low latency & bulk traffic
Exhibit B:
Onto the (AWS) Cloud
Cons
- Less control over the hardware environment
- Need to maintain both DC and AWS deployments, toolchains, configs…
Pros
- Codification, collaboration
- Good enough performance
- Personal environments
Challenge with Compute/Storage
Compute
- Machine families: t, m, c, r, z; suffixes: N, D
- Recommend: https://instances.vantage.sh/
Storage
- EBS vs Instance Storage
Orchestration
- Recommendation: Nomad
Challenge with AWS networking
Is there a good switch on AWS?
- Cut-through: <0.5us
- Store & forward: 5us - 50us
Secrets with AWS Networks
● Understand spine-leaf networking architecture
○ Regions, AZs, sub-AZs, racks
○ Avoid load balancers
● Cluster placement groups
○ Capacity reservations
● Bad apples - some instances perform worse than others; detect and replace them
https://www.xkyle.com/Measuring-AWS-Region-and-AZ-Latency/
Numbers On AWS
RTT outliers < 1 ms
RTT medians < 300 μs (10 network hops, ~250 μs)
Trading System Processing ~ 1 μs
Exhibit C:
Deep Dive on Performance Tuning
Fast Memory Access
Memory Local Data Structures
- Cache locality outweighs O(n)
Primitive Friendly Data Structures
- No Map<Integer>; avoid boxing/unboxing
- Deserialize from memory directly into primitives
Represent Strings as 2 Longs (see the sketch below)
- 128 bits => 18 7-bit chars (ASCII) | 21 6-bit (alphanumeric) | 25 5-bit (alphabetic) | 32 4-bit (hex)
No Allocation on Hot Path
- Object pooling
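A rough Go sketch of the "strings as 2 longs" trick, assuming a 62-symbol alphanumeric alphabet at 6 bits per character (the production encoding will differ): up to 21 characters fit in 126 of the 128 bits, so a client order ID can be stored, hashed, and compared as two plain integers with no allocation on the hot path.

// Sketch: pack a short alphanumeric string into two uint64s (6 bits/char,
// codes 1..62, 0 = padding). Up to 21 chars fit in 126 of the 128 bits.
package ids

import "strings"

const alphabet = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

// Pack returns (hi, lo) holding s, or ok=false if s is too long or contains
// characters outside the alphabet.
func Pack(s string) (hi, lo uint64, ok bool) {
	if len(s) > 21 {
		return 0, 0, false
	}
	for i := 0; i < len(s); i++ {
		idx := strings.IndexByte(alphabet, s[i])
		if idx < 0 {
			return 0, 0, false
		}
		// 128-bit left shift by 6, then OR in the 6-bit code (idx+1).
		hi = hi<<6 | lo>>58
		lo = lo<<6 | uint64(idx+1)
	}
	return hi, lo, true
}

// Unpack is the inverse of Pack (used off the hot path, e.g. for logging).
func Unpack(hi, lo uint64) string {
	var buf [21]byte
	n := 0
	for hi != 0 || lo != 0 {
		code := lo & 0x3F // always 1..62 here, since codes are never 0
		buf[n] = alphabet[code-1]
		n++
		// 128-bit right shift by 6.
		lo = lo>>6 | hi<<58
		hi >>= 6
	}
	// Characters come out last-first; reverse in place.
	for i, j := 0, n-1; i < j; i, j = i+1, j-1 {
		buf[i], buf[j] = buf[j], buf[i]
	}
	return string(buf[:n])
}

Two such longs also sit next to the other order fields in a handful of cache lines, which is the point of the "primitive friendly" bullet above.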
Small Messages
- Simple Binary Encoding (SBE)
- Byte alignment matters
- FPGA deserialization
- Order fields by size
- VarData / enums / bitsets at the end
- Add padding if necessary

<types>
  <enum name="Side" encodingType="uint8">
    <validValue name="BUY">0</validValue>
    <validValue name="SELL">1</validValue>
  </enum>
  <type name="ClientOrderId" primitiveType="char" length="32"/>
</types>
<sbe:message name="Order" id="1">
  <field name="orderId" id="1" type="int64"/>
  <field name="price" id="2" type="int64"/>
  <field name="quantity" id="3" type="int32"/>
  <field name="side" id="4" type="Side"/>
  <field name="clientOrderId" id="5" type="ClientOrderId"/>
</sbe:message>
Java Challenges - Warmup
10k function invocations => JIT compilation
Regulated Exchange - Cannot “warm up” our code
Azul Zulu Prime JVM - ReadyNow!
Cache and Persist JIT Profile + Optimizations
Pre-train new releases with multiple replays of PROD logs
Fast initial orders, remove JIT compilation jitter
Java Challenges - Garbage Collection
“Stop The World” GC - All Application Threads Stalled
Java 8 - Concurrent Mark Sweep
Azul Zulu Prime JVM - Pauseless Garbage Collector
Azul C4 Garbage Collector
Network Optimizations
Multicast
- Consensus
- Output to order and market data gateways
Aeron - High Performance Messaging
- Reliable transport over UDP
- Per-channel settings
- Congestion & flow control
Socket buffers - amount of data in flight should ideally equal the bandwidth-delay product (e.g. 10 Gbit/s x 100 μs RTT ≈ 125 KB)
MTU - jumbo frames (9 KB) for batching
Network Optimizations
Linux Kernel Bypass
- Read from the network card directly from user space
- Decreases the median, drastically reduces outliers
- OpenOnload in the data center w/ SolarFlare NICs
- DPDK in the cloud - Aeron support (premium)
[Diagram] Normal path: NIC -> driver -> softirq/qdisc -> application; with kernel bypass: NIC -> application

Aeron point-to-point, sending as fast as possible on AWS:
            Mean    Max      Throughput
non-DPDK    38 μs   1897 μs  80 MB/s
DPDK        28 μs   515 μs   500 MB/s
Medians Good, Outliers Spiky
Weeks before launch
OS scheduling delay / context switches
Where are CPU cycles going when they're not running your hot threads?
- /proc/interrupts - per-CPU hardware interrupt counts
- /proc/softirqs - per-CPU softirq counts
- /proc/sched_debug - task running time per CPU
- /proc/<tid>/schedstat - time on CPU, time on runqueue, # of time slices:
  4200925624037 12872240906155 780539850
  4200966662712 12872278642290 780547937
  4201007606214 12872323980891 780556132
  4201046361274 12872441023508 780564249
- perf - get thread runtime or counts individually on a given CPU:
  # perf record -e "sched:sched_stat_runtime" -C <core id>
  # perf script | awk '{print $1}' | sort | uniq -c
     15 kworker/3:1H-kb
      1 kworker/3:2-cgr
      3 perf
      1 rcu_sched
  12356 sender
Recommendation: Netdata
- A nice visual, holistic view of the system
- Per-CPU interrupts/softirqs/utilization
- Network, memory, disk, filesystem
OS Scheduling
Pin hot threads to hardcoded CPUs (taskset, sched_setaffinity) - see the sketch below
Prevents context switching & cache misses
Isolate hot CPUs or prioritize threads (isolcpus, taskset, cpusets, nice, chrt)
Prevent other user threads from taking CPU time
Busy-spin hot threads to monopolize CPU (and for polling)
Set affinities to hardware interrupts, kernel workqueues, etc.
Hardware interrupts - use tuna, or set /proc/irq/<irq#>/smp_affinity
Softirq kernel params - rcu_nocbs, nohz_full
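For the Go services the pinning can be done in-process. A minimal, Linux-only sketch assuming golang.org/x/sys/unix; core 3 stands in for whichever core you isolated. The Java hot path would typically rely on taskset or a thread-affinity library instead.

// Sketch: pin the current goroutine's OS thread to one isolated core.
package main

import (
	"fmt"
	"runtime"

	"golang.org/x/sys/unix"
)

// pinTo locks the calling goroutine to its OS thread and sets that thread's
// CPU affinity to a single core (pid 0 = calling thread).
func pinTo(core int) error {
	runtime.LockOSThread()
	var set unix.CPUSet
	set.Zero()
	set.Set(core)
	return unix.SchedSetaffinity(0, &set)
}

func main() {
	if err := pinTo(3); err != nil {
		fmt.Println("pin failed:", err)
		return
	}
	// ... busy-spin the hot loop here, polling its input channel / ring buffer ...
}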
Other Tuning
NUMA locality
If you have multiple CPU sockets, one is closer to NIC and memory
Layout matters - lock hot threads to that CPU / Memory NUMA node
Hyperthreading
Turn it off (or isolate corresponding logical CPU)
More available L1/L2 cache without it
Exhibit D:
Apply the learnings to improve
The Legacy System
Where the real fun begins…
Fun with MicroServices
[Diagram] A sprawl of services: API-FIX, OEGW, Feed Proxy, DB, Admin, Rest Gateway, Clearing, API-FEED, API2, API, Jobs, GoJobs, Clearing-Core, Trading Engine
Solution: Another dashboard???
Life of a request
Tracing a single order placement request from start to finish
[Diagram] (1) Client -> FIX, (2) FIX -> OEGW, (3) OEGW -> DB, (4) OEGW -> Trading Engine, (5) Trading Engine -> Feed Proxy, (6) feed message back to the client
E2E Order placement latency
[Trace] One order placement, span by span: FIX queue broadcaster -> wait for previous order to finish -> FIX-to-OEGW RPC (2) -> OEGW RPC handler (DB connection hold, SQL) (3) -> TE RPC (4) -> TE RPC handler (RAFT replication, FSM) -> send to Feed Proxy (5) -> FP receive -> wait for feed msg (6)
Beware:
- Client-side view vs server-side view
- E2E view vs per-unit view
- Tracing sampling
Happy Path: min/p50
~1200us: elevated, but not that outrageous
Infra inefficiencies - 1000us -> 600us (vs 50us)
- Compute/Storage
- Network latency
- Cross-AZ traffic
- Load balancer
- fsync()s
Per-operation cost - 30us vs 1us
- Full native, no warmup issue
- Allocations, pointers
- Metrics recording / logging - do you know how often your datadog metrics call sends a UDP packet out?
Is it just misplaced fsync()s?
FAST: FIX Receive -> RAFT -> Balance Check -> Order Process -> FIX Send (batched fsync on Optane, or no fsync() here)
SLOW: FIX Receive -> RAFT -> Balance Check -> Order Process -> DB (non-batched fsync() here) -> FIX Send
fsync() costs ~500us to 1ms on AWS hardware
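A minimal sketch of the batching idea (illustrative only, not the actual persistence code): drain whatever is already queued into one write and pay a single fsync per batch instead of one per order.

// Sketch: amortize fsync() by flushing a batch of records with a single Sync().
package wal

import (
	"bufio"
	"os"
)

// writeLoop drains queued records into one batch per fsync. Batch size 256
// is illustrative.
func writeLoop(f *os.File, records <-chan []byte) error {
	w := bufio.NewWriter(f)
	for rec := range records {
		if _, err := w.Write(rec); err != nil {
			return err
		}
		// Grab anything else already waiting and put it in the same batch.
	drain:
		for i := 0; i < 256; i++ {
			select {
			case more, ok := <-records:
				if !ok {
					break drain
				}
				if _, err := w.Write(more); err != nil {
					return err
				}
			default:
				break drain
			}
		}
		if err := w.Flush(); err != nil {
			return err
		}
		if err := f.Sync(); err != nil { // one fsync per batch, not per record
			return err
		}
	}
	return nil
}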
Pointer & Memory Allocations In Golang
Heap escape analysis (-gcflags "-m")
- Sending pointers, or values containing pointers, to channels
- Storing pointers, or values containing pointers, in a slice, like []*string
- Backing arrays of slices that get reallocated because an append would exceed their capacity
- Calling methods on an interface type
Passing a small struct by value can be ~8x faster than passing a pointer and moving it to the heap (x86_64 has a 64-byte cache line)
https://segment.com/blog/allocation-efficiency-in-high-performance-go-services/
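A small runnable sketch (illustrative types, not the production code) to feed to go build -gcflags "-m" and watch the compiler's escape decisions:

// Run with: go build -gcflags "-m" ./...
// The compiler prints which of these values escape to the heap.
package main

// Order fits comfortably in one 64-byte cache line when passed by value.
type Order struct {
	ID    int64
	Price int64
	Qty   int32
	Side  uint8
}

var sink *Order

// Stays on the stack: small struct passed and returned by value.
func fillByValue(o Order) Order {
	o.Qty--
	return o
}

// Escapes: the address outlives the call (compiler reports "moved to heap: o").
func storePointer(o Order) {
	sink = &o
}

// Escapes: values whose pointers are sent to a channel must live on the heap.
func sendPointer(ch chan *Order, o Order) {
	ch <- &o
}

func main() {
	ch := make(chan *Order, 1)
	o := Order{ID: 1, Price: 10000, Qty: 10}
	o = fillByValue(o)
	storePointer(o)
	sendPointer(ch, o)
	<-ch
}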
Unhappy Path: p99/max
P99 ~4ms, max 362ms
WTF is going on…
- GC pauses?
- Scheduling delays?
- Non-FIFO behaviors?
Is Golang GC really an issue?
https://malloc.se/blog/zgc-jdk16
https://www.azul.com/sites/default/files/images/c4_paper_acm.pdf
https://go.dev/blog/ismmkeynote
https://tip.golang.org/doc/gc-guide
Hint: Goroutine explosion by GRPC
Golang gRPC unary requests default to creating a new goroutine for every request; this causes starvation of any background goroutines and leads to tail latencies.
[Screenshot] 34041 goroutines???
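One mitigation, shown here only as a hedged sketch and not necessarily what was shipped (the stronger recommendation later is streaming RPCs), is to bound how many unary handlers run at once with a server interceptor so request handling cannot starve the background goroutines. The limit of 64 is illustrative.

// Sketch: cap concurrent unary handlers with a counting semaphore.
package main

import (
	"context"

	"google.golang.org/grpc"
)

func limitUnary(limit int) grpc.UnaryServerInterceptor {
	sem := make(chan struct{}, limit) // counting semaphore
	return func(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo,
		handler grpc.UnaryHandler) (interface{}, error) {
		select {
		case sem <- struct{}{}:
			defer func() { <-sem }()
			return handler(ctx, req)
		case <-ctx.Done():
			return nil, ctx.Err()
		}
	}
}

func main() {
	srv := grpc.NewServer(grpc.UnaryInterceptor(limitUnary(64)))
	_ = srv // register services and Serve() as usual
}

Note this caps concurrent handler work rather than goroutine creation itself: grpc-go still spawns a goroutine per incoming unary request.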
Hint: Goroutine scheduler delay
A goroutine is not your good old thread
- The Go scheduler, not the OS, decides when it runs
- GOMAXPROCS = number of CPUs
- Remember: only GOMAXPROCS goroutines will run at the same time
Visualizing how API-FIX works
[Diagram] Each client FIX session (S1 BTC-USD, S1 ETH-USD, …) is multiplexed onto shared TCP connections; every message becomes a GRPC.Invoke, so goroutines are shuffled across the network poller and contend for shared resources.
REMEMBER: only GOMAXPROCS goroutines will run at any given time
Visualizing how OEGW works
[Diagram] FIX messages arrive over TCP; each GRPC.Invoke is handled as 1 RPC = 1 goroutine, all contending for the shared DB connection pool and outbound TCP connections. Goroutine wake-up order is random - not FIFO.
https://github.com/golang/go/issues/31708
Visualizing how Trading Engine works
[Diagram] inputCh -> raftChan -> RAFT -> ApplyCh -> OutCh, fed by TCP connections through the network poller.
REMEMBER: only GOMAXPROCS goroutines will run at any given time
Mitigations: spinning important goroutines

Blocking receive:
    select {
    case item := <-ch:
        // process item
    }

Busy-spinning receive:
    for {
        select {
        case item := <-ch:
            // process item
        default:
            // busy spinning
            continue
        }
    }

Note: the Golang scheduler will force-preempt long-running goroutines every 10ms
runtime.LockOSThread() - keep the spinning goroutine wired to its own OS thread
Challenges: can't spin too much, or you will run out of CPU and cause starvation
Mitigations: Always batch when using channels

    for {
        select {
        case item := <-bufCh: // first read
            items := make([]int, 0, 20)
            items = append(items, item)
            // grab outstanding messages while you are there
        Remaining:
            for i := 0; i < 19; i++ {
                select {
                case item := <-bufCh:
                    items = append(items, item)
                default:
                    break Remaining
                }
            }
            // process items
        default:
            continue // don't forget spinning!
        }
    }

Why does this work?
- Avoids scheduler delays
- Better cache locality
Realization: Golang is optimized for throughput
Most facilities in Golang/Linux introduce an element of randomness to optimize for throughput, not latency:
- Go encourages you (and libraries) to spawn ad-hoc goroutines everywhere
- No goroutine priorities; the scheduler is randomized and work stealing
Writing low latency code in Golang is not easy, but then again it's not easy anywhere else either.
Recommendation: use GRPC in streaming mode, not unary mode!
Is it just misplaced fsync()s?
FAST: FIX Receive -> RAFT -> Balance Check -> Order Process -> FIX Send (batched fsync on Optane, or no fsync() here)
SLOW: FIX Receive -> RAFT -> Balance Check -> Order Process -> DB (non-batched fsync() here) -> FIX Send
fsync() costs ~500us to 1ms on AWS hardware
Latency Cost Rankings
 <1us       kernel syscall overhead
 ~1us       optimized application logic cost
 ~5us       kernel context switching cost
 ~5us       per network hop on low latency hardware
 ~25us      per network hop on AWS hardware
 ~30us      per message unoptimized application logic cost
 ~50-100us  RT kernel scheduler delay [0]
 <100us     fsync on Optane
 ~250us     Golang GC pauses
 ~1ms       fsync on AWS Instance Storage
 ~N ms      non-RT kernel scheduler delay [0]
 ~N-NN ms   Golang scheduler delays

…"let's add this part or the process step in case we need it"… the most common error of a smart engineer is to optimize the thing that should not exist…
- Elon Musk on engineering, interviewed by Tim Dodd

[0] https://bristot.me/files/research/papers/ecrts2020/slides.pdf