The making of an
Ultra Low Latency Trading System
With Java and Golang
Yucong Sun, Staff Software Engineer
Jonathan Ting, Senior Software Engineer
Exchange @ Coinbase
● Takeaways
○ General architecture of an Exchange
○ State of the Art
○ Learnings from optimizing the legacy system
Planetary view of an Exchange
Most users would/should not interact with an Exchange directly
[Diagram] Users trade through a Brokerage product; Market Makers / Trading Firms connect to the Exchange directly via its API and Market Data feeds.
Orbital view of an Exchange
Order Management System (OMS): balances, risk, margin/liquidations
Matching Engine: order book
API: FIX, HTTP
Market Data: FIX, WebSocket
Hot path: balance check, order matching
Warm path: settlement
Auxiliary: market data feed
Assembly Lines of an Exchange
[Diagram] Orders flow into a trading system process and come out as trade events; Process 1 and Process 2 run the same assembly line over the same orders.
Exhibit A:
Coinbase Derivatives Exchange
https://www.coinbase.com/derivatives
Trading System Logic Isn't Complex
Hot Path
- Submit & match incoming orders against resting orders (the 'book')
- Public market - no complex trading relationships
Other logic (timers, admin requests, state)
- Affects trading logic, so it must be sequenced with every other event
Trading system assigns IDs to state
Single threaded
Trading System as Deterministic State Machine
State₀ + Input₀ => State₁ + Output₁ ALWAYS
Can snapshot/restore/replay to get to live state
Determinism is Tricky!
- Data structure iteration order
- No randomness
- Behavior changes: old input => old behavior (feature flagging)
[Diagram] Old Snapshot (T=50) + Old Input (T=50..100) is equivalent to live state at T=100
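The deterministic contract is small enough to sketch. Below is a minimal Go sketch of the idea (names and types are illustrative, not the actual engine): the engine is a pure, single-threaded function of (state, input), so an old snapshot plus the inputs logged after it always reproduces the live state.

// Illustrative sketch of a deterministic, single-threaded state machine.
package engine

// Input is one sequenced entry from the RAFT log: an order, a timer tick,
// or an admin request. Everything that can affect state goes through here.
type Input struct {
	Seq     uint64
	Payload []byte
}

// Output is whatever applying one input produced (acks, fills, market data).
type Output struct {
	Seq    uint64
	Events [][]byte
}

// Engine holds all trading state. Apply must not read wall clocks, random
// sources, or anything else outside (state, input).
type Engine struct {
	seq uint64
	// ... order books, balances, ID counters assigned by the engine itself ...
}

// Apply: State(n) + Input(n) => State(n+1) + Output(n+1), always.
func (e *Engine) Apply(in Input) Output {
	e.seq = in.Seq
	// deterministic matching / admin / timer logic goes here
	return Output{Seq: in.Seq}
}

// Replay rebuilds live state: Snapshot(T=50) + Inputs(T=50..100) == live state at T=100.
func Replay(snap *Engine, log []Input) *Engine {
	for _, in := range log {
		snap.Apply(in)
	}
	return snap
}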
Fault Tolerance with RAFT
Aeron Cluster - high performance RAFT implementation
- App has to be deterministic & single threaded
- Consensus is batched & pipelined with the application
- System throughput = 1 / app processing time
[Diagram] Leader and follower trading systems; if the leader fails, a follower takes over
What is RAFT? Visualize it here
Persisted RAFT Log
Cluster persists the RAFT log (input) to disk, as per the protocol
Aeron Archive API allows replicating the RAFT log for backup
[Diagram] Trading system cluster replicates its RAFT log over UDP to a backup
Replicated RAFT Log
- Audit - upload to cold storage
- Logging - replay & send to ELK outside the hot path
- Debugging - reproduce bugs locally
- Fixing - backfill missing events
- Testing - CI/CD replay to avoid regressions
Replicating For Replay
Replicate input, not output
- Hot path - multicast the output
- Everything else - replicate the input (RAFT log) & fan out over UDP / IPC
- Output is larger & unbounded: 1 order => potentially a cascading set of events
[Diagram] The trading system replicates its RAFT log over UDP to an Aeron Archive; a local trading system replays the log, re-running the trading logic to record and reproduce the trading events delivered over IPC to downstream services
Replicating For Scalability
Binary tree replication
- Network latency bounded by log(n)
- Bandwidth usage bounded
Entire Hot Path
[Diagram] Order Gateway <-> Trading System over UDP: Order Submit in, Order Ack out
1) Parse & validate Order Submit
2) Send request to trading system
3) RAFT consensus
4) Matching algorithm
5) Send order events to gateway
6) Translate Order Ack
RTT medians < 50 μs = 4 network hops (~20 μs) + processing
RTT outliers < 100 μs
Trading system processing time ~ 1 μs
300k/s peak throughput
Hardware Environment for CDE
Colocated in datacenter with customers
Commodity hardware
❖ Intel Optane Drives
Faster than enterprise SSDs
We can fsync if needed without too much penalty
❖ Low Latency Switches
350ns cut-through forwarding
Real-time packet capture without latency hit
Isolated NICs for low latency & bulk traffic
Exhibit B:
Onto the (AWS) Cloud
Cons
- Less control over the hardware environment
- Need to maintain both DC and AWS deployments, toolchains, configs…
Pros
- Codification, collaboration
- Good enough performance
- Personal environments
Challenge with Compute/Storage
Compute
- Machine families: t, m, c, r, z; suffixes: N, D
- Recommend: https://instances.vantage.sh/
Storage
- EBS vs Instance Storage
Orchestration
- Recommendation: Nomad
Challenge with AWS networking
Is there a good switch on AWS?
- Cut-through: <0.5us
- Store & forward: 5us - 50us
Secrets with AWS Networks
● Understand spine-leaf networking architecture
○ Regions, AZs, sub-AZs, racks
○ Avoid load balancers
● Cluster placement groups
○ Capacity reservations
● Bad apples - some instances perform worse than others; detect and replace them
https://www.xkyle.com/Measuring-AWS-Region-and-AZ-Latency/
Numbers On AWS
RTT outliers < 1 ms
RTT medians < 300 μs (10 network hops, ~250 μs)
Trading System Processing ~ 1 μs
Exhibit C:
Deep Dive on Performance Tuning
Fast Memory Access
Memory Local Data Structures
- Cache locality outweighs O(n)
Primitive Friendly Data Structures
- No Map<Integer>; avoid boxing/unboxing
- Deserialize from memory directly into primitives
Represent Strings as 2 Longs (see the sketch below)
- 128 bits => 18 7-bit chars (ASCII) | 21 6-bit (alphanumeric) | 25 5-bit (alphabetic) | 32 4-bit (hex)
No Allocation on Hot Path
- Object pooling
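A rough Go sketch of the "strings as 2 longs" trick, assuming a 62-symbol alphanumeric alphabet at 6 bits per character (the production encoding will differ): up to 21 characters fit in 126 of the 128 bits, so a client order ID can be stored, hashed, and compared as two plain integers with no allocation on the hot path.

// Sketch: pack a short alphanumeric string into two uint64s (6 bits/char,
// codes 1..62, 0 = padding). Up to 21 chars fit in 126 of the 128 bits.
package ids

import "strings"

const alphabet = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

// Pack returns (hi, lo) holding s, or ok=false if s is too long or contains
// characters outside the alphabet.
func Pack(s string) (hi, lo uint64, ok bool) {
	if len(s) > 21 {
		return 0, 0, false
	}
	for i := 0; i < len(s); i++ {
		idx := strings.IndexByte(alphabet, s[i])
		if idx < 0 {
			return 0, 0, false
		}
		// 128-bit left shift by 6, then OR in the 6-bit code (idx+1).
		hi = hi<<6 | lo>>58
		lo = lo<<6 | uint64(idx+1)
	}
	return hi, lo, true
}

// Unpack is the inverse of Pack (used off the hot path, e.g. for logging).
func Unpack(hi, lo uint64) string {
	var buf [21]byte
	n := 0
	for hi != 0 || lo != 0 {
		code := lo & 0x3F // always 1..62 here, since codes are never 0
		buf[n] = alphabet[code-1]
		n++
		// 128-bit right shift by 6.
		lo = lo>>6 | hi<<58
		hi >>= 6
	}
	// Characters come out last-first; reverse in place.
	for i, j := 0, n-1; i < j; i, j = i+1, j-1 {
		buf[i], buf[j] = buf[j], buf[i]
	}
	return string(buf[:n])
}

Two such longs also sit next to the other order fields in a handful of cache lines, which is the point of the "primitive friendly" bullet above.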
Small Messages
- Simple Binary Encoding (SBE)
- Byte alignment matters
- FPGA deserialization
- Order fields by size
- VarData / enums / bitsets at the end
- Add padding if necessary

<types>
  <enum name="Side" encodingType="uint8">
    <validValue name="BUY">0</validValue>
    <validValue name="SELL">1</validValue>
  </enum>
  <type name="ClientOrderId" primitiveType="char" length="32"/>
</types>
<sbe:message name="Order" id="1">
  <field name="orderId" id="1" type="int64"/>
  <field name="price" id="2" type="int64"/>
  <field name="quantity" id="3" type="int32"/>
  <field name="side" id="4" type="Side"/>
  <field name="clientOrderId" id="5" type="ClientOrderId"/>
</sbe:message>
Java Challenges - Warmup
10k function invocations => JIT compilation
Regulated Exchange - Cannot “warm up” our code
Azul Zulu Prime JVM - ReadyNow!
Cache and Persist JIT Profile + Optimizations
Pre-train new releases with multiple replays of PROD logs
Fast initial orders, remove JIT compilation jitter
Java Challenges - Garbage Collection
“Stop The World” GC - All Application Threads Stalled
Java 8 - Concurrent Mark Sweep
Azul Zulu Prime JVM - Pauseless Garbage Collector
Azul C4 Garbage Collector
Network Optimizations
Multicast
- Consensus
- Output to order and market data gateways
Aeron - High Performance Messaging
- Reliable transport over UDP
- Per-channel settings
- Congestion & flow control
Socket buffers - amount of data in flight should ideally equal the bandwidth-delay product (e.g. 10 Gbit/s x 100 μs RTT ≈ 125 KB)
MTU - jumbo frames (9 KB) for batching
Network Optimizations
Linux Kernel Bypass
- Read from the network card directly from user space
- Decreases the median, drastically reduces outliers
- OpenOnload in the data center w/ SolarFlare NICs
- DPDK in the cloud - Aeron support (premium)
[Diagram] Normal path: NIC -> driver -> softirq/qdisc -> application; with kernel bypass: NIC -> application

Aeron point-to-point, sending as fast as possible on AWS:
            Mean    Max      Throughput
non-DPDK    38 μs   1897 μs  80 MB/s
DPDK        28 μs   515 μs   500 MB/s
Medians Good, Outliers Spiky
Weeks before launch
OS scheduling delay / context switches
Where are CPU cycles going when they're not running your hot threads?
- /proc/interrupts - per-CPU hardware interrupt counts
- /proc/softirqs - per-CPU softirq counts
- /proc/sched_debug - task running time per CPU
- /proc/<tid>/schedstat - time on CPU, time on runqueue, # of time slices:
  4200925624037 12872240906155 780539850
  4200966662712 12872278642290 780547937
  4201007606214 12872323980891 780556132
  4201046361274 12872441023508 780564249
- perf - get thread runtime or counts individually on a given CPU:
  # perf record -e "sched:sched_stat_runtime" -C <core id>
  # perf script | awk '{print $1}' | sort | uniq -c
     15 kworker/3:1H-kb
      1 kworker/3:2-cgr
      3 perf
      1 rcu_sched
  12356 sender
Recommendation: Netdata
- A nice visual, holistic view of the system
- Per-CPU interrupts/softirqs/utilization
- Network, memory, disk, filesystem
OS Scheduling
Pin hot threads to hardcoded CPUs (taskset, sched_setaffinity) - see the sketch below
Prevents context switching & cache misses
Isolate hot CPUs or prioritize threads (isolcpus, taskset, cpusets, nice, chrt)
Prevent other user threads from taking CPU time
Busy-spin hot threads to monopolize CPU (and for polling)
Set affinities to hardware interrupts, kernel workqueues, etc.
Hardware interrupts - use tuna, or set /proc/irq/<irq#>/smp_affinity
Softirq kernel params - rcu_nocbs, nohz_full
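For the Go services the pinning can be done in-process. A minimal, Linux-only sketch assuming golang.org/x/sys/unix; core 3 stands in for whichever core you isolated. The Java hot path would typically rely on taskset or a thread-affinity library instead.

// Sketch: pin the current goroutine's OS thread to one isolated core.
package main

import (
	"fmt"
	"runtime"

	"golang.org/x/sys/unix"
)

// pinTo locks the calling goroutine to its OS thread and sets that thread's
// CPU affinity to a single core (pid 0 = calling thread).
func pinTo(core int) error {
	runtime.LockOSThread()
	var set unix.CPUSet
	set.Zero()
	set.Set(core)
	return unix.SchedSetaffinity(0, &set)
}

func main() {
	if err := pinTo(3); err != nil {
		fmt.Println("pin failed:", err)
		return
	}
	// ... busy-spin the hot loop here, polling its input channel / ring buffer ...
}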
Other Tuning
NUMA locality
If you have multiple CPU sockets, one is closer to NIC and memory
Layout matters - lock hot threads to that CPU / Memory NUMA node
Hyperthreading
Turn it off (or isolate corresponding logical CPU)
More available L1/L2 cache without it
Exhibit D:
Apply the learnings to improve
The Legacy System
Where the real fun begins…
Fun with MicroServices
[Diagram] A sprawl of services: API-FIX, OEGW, Feed Proxy, DB, Admin, Rest Gateway, Clearing, API-FEED, API2, API, Jobs, GoJobs, Clearing-Core, Trading Engine
Solution: Another dashboard???
Life of a request
Tracing a single order placement request from start to finish
[Diagram] (1) Client -> FIX, (2) FIX -> OEGW, (3) OEGW -> DB, (4) OEGW -> Trading Engine, (5) Trading Engine -> Feed Proxy, (6) feed message back to the client
E2E Order placement latency
[Trace] One order placement, span by span: FIX queue broadcaster -> wait for previous order to finish -> FIX-to-OEGW RPC (2) -> OEGW RPC handler (DB connection hold, SQL) (3) -> TE RPC (4) -> TE RPC handler (RAFT replication, FSM) -> send to Feed Proxy (5) -> FP receive -> wait for feed msg (6)
Beware:
- Client-side view vs server-side view
- E2E view vs per-unit view
- Tracing sampling
Happy Path: min/p50
~1200us: elevated, but not that outrageous
Infra inefficiencies - 1000us -> 600us (vs 50us)
- Compute/Storage
- Network latency
- Cross-AZ traffic
- Load balancer
- fsync()s
Per-operation cost - 30us vs 1us
- Full native, no warmup issue
- Allocations, pointers
- Metrics recording / logging - do you know how often your datadog metrics call sends a UDP packet out?
Is it just misplaced fsync()s?
FAST: FIX Receive -> RAFT -> Balance Check -> Order Process -> FIX Send (batched fsync on Optane, or no fsync() here)
SLOW: FIX Receive -> RAFT -> Balance Check -> Order Process -> DB (non-batched fsync() here) -> FIX Send
fsync() costs ~500us to 1ms on AWS hardware
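A minimal sketch of the batching idea (illustrative only, not the actual persistence code): drain whatever is already queued into one write and pay a single fsync per batch instead of one per order.

// Sketch: amortize fsync() by flushing a batch of records with a single Sync().
package wal

import (
	"bufio"
	"os"
)

// writeLoop drains queued records into one batch per fsync. Batch size 256
// is illustrative.
func writeLoop(f *os.File, records <-chan []byte) error {
	w := bufio.NewWriter(f)
	for rec := range records {
		if _, err := w.Write(rec); err != nil {
			return err
		}
		// Grab anything else already waiting and put it in the same batch.
	drain:
		for i := 0; i < 256; i++ {
			select {
			case more, ok := <-records:
				if !ok {
					break drain
				}
				if _, err := w.Write(more); err != nil {
					return err
				}
			default:
				break drain
			}
		}
		if err := w.Flush(); err != nil {
			return err
		}
		if err := f.Sync(); err != nil { // one fsync per batch, not per record
			return err
		}
	}
	return nil
}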
Pointer & Memory Allocations In Golang
Heap escape analysis (-gcflags "-m")
- Sending pointers, or values containing pointers, to channels
- Storing pointers, or values containing pointers, in a slice, like []*string
- Backing arrays of slices that get reallocated because an append would exceed their capacity
- Calling methods on an interface type
Passing a small struct by value can be ~8x faster than passing a pointer and moving it to the heap (x86_64 has a 64-byte cache line)
https://segment.com/blog/allocation-efficiency-in-high-performance-go-services/
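A small runnable sketch (illustrative types, not the production code) to feed to go build -gcflags "-m" and watch the compiler's escape decisions:

// Run with: go build -gcflags "-m" ./...
// The compiler prints which of these values escape to the heap.
package main

// Order fits comfortably in one 64-byte cache line when passed by value.
type Order struct {
	ID    int64
	Price int64
	Qty   int32
	Side  uint8
}

var sink *Order

// Stays on the stack: small struct passed and returned by value.
func fillByValue(o Order) Order {
	o.Qty--
	return o
}

// Escapes: the address outlives the call (compiler reports "moved to heap: o").
func storePointer(o Order) {
	sink = &o
}

// Escapes: values whose pointers are sent to a channel must live on the heap.
func sendPointer(ch chan *Order, o Order) {
	ch <- &o
}

func main() {
	ch := make(chan *Order, 1)
	o := Order{ID: 1, Price: 10000, Qty: 10}
	o = fillByValue(o)
	storePointer(o)
	sendPointer(ch, o)
	<-ch
}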
Unhappy Path: p99/max
P99 ~4ms, max 362ms
WTF is going on…
- GC pauses?
- Scheduling delays?
- Non-FIFO behaviors?
Is Golang GC really an issue?
https://malloc.se/blog/zgc-jdk16
https://www.azul.com/sites/default/files/images/c4_paper_acm.pdf
https://go.dev/blog/ismmkeynote
https://tip.golang.org/doc/gc-guide
Hint: Goroutine explosion by GRPC
Golang gRPC unary requests default to creating a new goroutine for every request; this causes starvation of any background goroutines and leads to tail latencies.
[Screenshot] 34041 goroutines???
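One mitigation, shown here only as a hedged sketch and not necessarily what was shipped (the stronger recommendation later is streaming RPCs), is to bound how many unary handlers run at once with a server interceptor so request handling cannot starve the background goroutines. The limit of 64 is illustrative.

// Sketch: cap concurrent unary handlers with a counting semaphore.
package main

import (
	"context"

	"google.golang.org/grpc"
)

func limitUnary(limit int) grpc.UnaryServerInterceptor {
	sem := make(chan struct{}, limit) // counting semaphore
	return func(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo,
		handler grpc.UnaryHandler) (interface{}, error) {
		select {
		case sem <- struct{}{}:
			defer func() { <-sem }()
			return handler(ctx, req)
		case <-ctx.Done():
			return nil, ctx.Err()
		}
	}
}

func main() {
	srv := grpc.NewServer(grpc.UnaryInterceptor(limitUnary(64)))
	_ = srv // register services and Serve() as usual
}

Note this caps concurrent handler work rather than goroutine creation itself: grpc-go still spawns a goroutine per incoming unary request.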
Hint: Goroutine scheduler delay
A goroutine is not your good old thread
- The Go scheduler, not the OS, decides when it runs
- GOMAXPROCS = number of CPUs
- Remember: only GOMAXPROCS goroutines will run at the same time
Visualizing how API-FIX works
[Diagram] Each client FIX session (S1 BTC-USD, S1 ETH-USD, …) is multiplexed onto shared TCP connections; every message becomes a GRPC.Invoke, so goroutines are shuffled across the network poller and contend for shared resources.
REMEMBER: only GOMAXPROCS goroutines will run at any given time
Visualizing how OEGW works
[Diagram] FIX messages arrive over TCP; each GRPC.Invoke is handled as 1 RPC = 1 goroutine, all contending for the shared DB connection pool and outbound TCP connections. Goroutine wake-up order is random - not FIFO.
https://github.com/golang/go/issues/31708
Visualizing how Trading Engine works
[Diagram] inputCh -> raftChan -> RAFT -> ApplyCh -> OutCh, fed by TCP connections through the network poller.
REMEMBER: only GOMAXPROCS goroutines will run at any given time
Mitigations: spinning important goroutines

Blocking receive:
    select {
    case item := <-ch:
        // process item
    }

Busy-spinning receive:
    for {
        select {
        case item := <-ch:
            // process item
        default:
            // busy spinning
            continue
        }
    }

Note: the Golang scheduler will force-preempt long-running goroutines every 10ms
runtime.LockOSThread() - keep the spinning goroutine wired to its own OS thread
Challenges: can't spin too much, or you will run out of CPU and cause starvation
Mitigations: Always batch when using channels

    for {
        select {
        case item := <-bufCh: // first read
            items := make([]int, 0, 20)
            items = append(items, item)
            // grab outstanding messages while you are there
        Remaining:
            for i := 0; i < 19; i++ {
                select {
                case item := <-bufCh:
                    items = append(items, item)
                default:
                    break Remaining
                }
            }
            // process items
        default:
            continue // don't forget spinning!
        }
    }

Why does this work?
- Avoids scheduler delays
- Better cache locality
Realization: Golang is optimized for throughput
Most facilities in Golang/Linux introduce an element of randomness to optimize for throughput, not latency:
- Go encourages you (and libraries) to spawn ad-hoc goroutines everywhere
- No goroutine priorities; the scheduler is randomized and work stealing
Writing low latency code in Golang is not easy, but then again it's not easy anywhere else either.
Recommendation: use GRPC in streaming mode, not unary mode!
Is it just misplaced fsync()s?
FAST: FIX Receive -> RAFT -> Balance Check -> Order Process -> FIX Send (batched fsync on Optane, or no fsync() here)
SLOW: FIX Receive -> RAFT -> Balance Check -> Order Process -> DB (non-batched fsync() here) -> FIX Send
fsync() costs ~500us to 1ms on AWS hardware
Latency Cost Rankings
 <1us       kernel syscall overhead
 ~1us       optimized application logic cost
 ~5us       kernel context switching cost
 ~5us       per network hop on low latency hardware
 ~25us      per network hop on AWS hardware
 ~30us      per message unoptimized application logic cost
 ~50-100us  RT kernel scheduler delay [0]
 <100us     fsync on Optane
 ~250us     Golang GC pauses
 ~1ms       fsync on AWS Instance Storage
 ~N ms      non-RT kernel scheduler delay [0]
 ~N-NN ms   Golang scheduler delays

…"let's add this part or the process step in case we need it"… the most common error of a smart engineer is to optimize the thing that should not exist…
- Elon Musk on engineering, interviewed by Tim Dodd

[0] https://bristot.me/files/research/papers/ecrts2020/slides.pdf