High-Performance Managed Languages Guide

This document discusses high performance managed languages. It begins by setting context around whether managed or native languages are best for building low-latency applications. It then covers topics like runtime optimization techniques in just-in-time compilers like profile-guided optimizations and inlining. The document also discusses garbage collection techniques like generational collection and thread-local allocation buffers. It aims to show that with the right optimizations, managed languages can provide good performance.

High Performance Managed Languages

Martin Thompson - @mjpt777

Really, what is your preferred platform for building HFT applications?

Why do you build low-latency applications on a GC’ed platform?

Agenda

1. Let’s set some Context
2. Runtime Optimisation
3. Garbage Collection
4. Algorithms & Design

Some Context

Let’s be clear: a Managed Runtime is not always the best choice…

Latency Arbitrage?

Two questions…
Why build on a Managed Runtime?
Can managed languages provide good performance?

We need to follow the evidence…

Are native languages faster?

Time?
Skills & Resources?
What can, or should, be outsourced?

CPU vs Memory Performance

How much time to perform an addition operation on 2 integers?
1 CPU cycle, < 1ns

Average time in ns/op to sum all longs in a 1GB array?

Access Pattern Benchmark

Benchmark                 Mode    Score    Error   Units
testSequential            avgt    0.832 ±  0.006   ns/op
testRandomPage            avgt    2.703 ±  0.025   ns/op
testDependentRandomPage   avgt    7.102 ±  0.326   ns/op
testRandomHeap            avgt   19.896 ±  3.110   ns/op
testDependentRandomHeap   avgt   89.516 ±  4.573   ns/op

Sequential access: ~1 ns/op. Really??? Less than 1ns per operation?
Random walk per OS page: ~3 ns/op
Data-dependent walk per OS page: ~7 ns/op
Random heap walk: ~20 ns/op
Data-dependent heap walk: ~90 ns/op

Then ADD 40+ ns/op for NUMA access on a server!!!!

Data Dependent Loads aka “Pointer Chasing”!!!

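
The difference between independent and data-dependent loads can be sketched in a few lines of Java. This is not the benchmark behind the numbers above: the class name, the method names, the array size and the seed are all assumptions made for illustration, and a real measurement would need a harness such as JMH. The sketch only shows the two access shapes. In the first loop every address is known up front; in the second, the next index is the value that was just loaded.

```java
import java.util.Random;

// Hypothetical sketch, not the original benchmark code.
final class AccessPatternSketch {

    // Independent loads: each address is known in advance, so loads can overlap
    // and the hardware prefetcher can stream the data in.
    static long sumSequential(long[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++) {
            sum += data[i];
        }
        return sum;
    }

    // Data-dependent loads ("pointer chasing"): the next index is the value just
    // loaded, so each load has to complete before the next one can be issued.
    static long sumDependent(long[] next, int start) {
        long sum = 0;
        int index = start;
        for (int i = 0; i < next.length; i++) {
            long value = next[index];
            sum += value;
            index = (int) value;
        }
        return sum;
    }

    // Builds a random permutation that forms one single cycle (Sattolo's algorithm),
    // so a walk of next.length steps touches every slot exactly once.
    static long[] randomCycle(int size, long seed) {
        long[] next = new long[size];
        for (int i = 0; i < size; i++) {
            next[i] = i;
        }
        Random random = new Random(seed);
        for (int i = size - 1; i > 0; i--) {
            int j = random.nextInt(i); // 0 <= j < i, never i itself
            long tmp = next[i];
            next[i] = next[j];
            next[j] = tmp;
        }
        return next;
    }

    public static void main(String[] args) {
        long[] next = randomCycle(1 << 20, 42L); // size is an arbitrary assumption
        System.out.println(sumSequential(next));
        System.out.println(sumDependent(next, 0));
    }
}
```

Both methods visit every element once and return the same total, so any timing gap between them comes from the access pattern rather than from the amount of work.
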
Performance 101

1. Memory is transported in Cachelines
2. Memory is managed in OS Pages
3. Memory is pre-fetched on predictable access patterns

Runtime Optimisation

Runtime JIT

1. Profile guided optimisations
2. Bets can be taken and later revoked

Branches
void foo()
{
    // code (Block A)

    if (condition)
    {
        // code (Block B)
    }

    // code (Block C)
}

Source order: Block A, Block B, Block C
Generated layout when the profile shows the branch is rarely taken: Block A, Block C, Block B

Subtle Branches

int result = (i > 7) ? a : b;

CMOV vs Branch Prediction?
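
Whether the ternary above becomes a conditional move or a predicted branch is the JIT's decision, typically informed by the profile. For comparison, here is a minimal sketch of a hand-written branch-free select. The class and method names are hypothetical, and the mask trick is a swapped-in illustration, not something the slides prescribe.

```java
// Hypothetical sketch of a ternary and a manual branch-free equivalent.
final class SelectSketch {

    // Plain ternary: the JIT chooses between a branch and a conditional move.
    static int selectTernary(int i, int a, int b) {
        return (i > 7) ? a : b;
    }

    // Branch-free select using a mask.
    // The subtraction is done in long so it cannot overflow for any int i.
    static int selectMasked(int i, int a, int b) {
        int mask = (int) ((7L - i) >> 63); // -1 (all ones) when i > 7, otherwise 0
        return (a & mask) | (b & ~mask);
    }
}
```

A branch-free form always pays for evaluating both sides, so it tends to help only when the condition is hard to predict. With a well-predicted condition the plain branch is usually at least as good.
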

Method/Function Inlining
void foo()
{
    // code (Block A)

    bar();

    // code (Block B)
}

Generated code after inlining: Block A, body of bar(), Block B

i-cache & code bloat?

“Inlining is THE optimisation.”

- Cliff Click

Bounds Checking

void foo(int[] array, int length)
{
    // code

    for (int i = 0; i < length; i++)
    {
        bar(array[i]);
    }

    // code
}

Bounds Checking

void foo(int[] array)
{
    // code

    for (int i = 0; i < array.length; i++)
    {
        bar(array[i]);
    }

    // code
}
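
The two loop shapes above can be put side by side in a runnable form. This is a minimal sketch with hypothetical names. It only demonstrates why the check exists (the separate `length` parameter can be wrong); it does not prove what any particular JIT emits. The usual reasoning is that a limit of `array.length` makes it straightforward for the compiler to show every index is in range, whereas an unrelated `length` parameter requires extra proof or a retained check.

```java
// Hypothetical sketch of the two loop shapes from the slides.
final class BoundsCheckSketch {

    // The limit is an independent parameter: nothing in the signature ties it to
    // the array, so array[i] may be out of range and the runtime has to guard it.
    static long sumWithLength(int[] array, int length) {
        long sum = 0;
        for (int i = 0; i < length; i++) {
            sum += array[i];
        }
        return sum;
    }

    // The limit is array.length: i is in range by construction, which is the
    // pattern a JIT can most easily recognise when removing per-access checks.
    static long sumWithArrayLength(int[] array) {
        long sum = 0;
        for (int i = 0; i < array.length; i++) {
            sum += array[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3, 4};
        System.out.println(sumWithArrayLength(data));
        System.out.println(sumWithLength(data, data.length));
        try {
            sumWithLength(data, data.length + 1); // wrong length supplied by the caller
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("bounds check fired");
        }
    }
}
```
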
Subtype Polymorphism
void draw(Shape[] shapes)
{
    for (int i = 0; i < shapes.length; i++)
    {
        shapes[i].draw();
    }
}

void bar(Shape shape)
{
    bar(shape.isVisible());
}

Class Hierarchy Analysis & Inline Caching
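
An inline cache can be pictured as a cheap type guard in front of an inlined method body, with the ordinary virtual call kept as the fallback. The sketch below writes that shape out by hand. It is a conceptual illustration only: a JIT does this in generated machine code, and the `Circle` and `Hidden` classes, the field names and the method names are all hypothetical.

```java
// Hypothetical types for illustration; only Shape.isVisible() appears in the slides.
interface Shape {
    boolean isVisible();
}

final class Circle implements Shape {
    final boolean visible;

    Circle(boolean visible) {
        this.visible = visible;
    }

    public boolean isVisible() {
        return visible;
    }
}

final class Hidden implements Shape {
    public boolean isVisible() {
        return false;
    }
}

final class InlineCacheSketch {

    // Ordinary virtual dispatch at the call site.
    static int countVisibleVirtual(Shape[] shapes) {
        int count = 0;
        for (int i = 0; i < shapes.length; i++) {
            if (shapes[i].isVisible()) {
                count++;
            }
        }
        return count;
    }

    // Hand-written picture of a monomorphic inline cache: guard on the type the
    // profile saw most often, run the inlined body, otherwise fall back.
    static int countVisibleGuarded(Shape[] shapes) {
        int count = 0;
        for (int i = 0; i < shapes.length; i++) {
            Shape shape = shapes[i];
            boolean visible;
            if (shape.getClass() == Circle.class) {
                visible = ((Circle) shape).visible; // "inlined" body of Circle.isVisible()
            } else {
                visible = shape.isVisible();        // fallback: normal virtual call
            }
            if (visible) {
                count++;
            }
        }
        return count;
    }
}
```

If class hierarchy analysis shows that only one implementation of `Shape` is loaded, even the guard can be dropped. That is one of the bets that has to be revoked if another implementation is loaded later.
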
Garbage Collection

Generational Garbage Collection

“Only the good die young.”

- Billy Joel

[Diagram: heap layout]
Young/New Generation: Eden (containing TLABs), Survivor 0, Survivor 1, Virtual
Old Generation: Tenured, Virtual

Modern Hardware (Intel Sandy Bridge EP)

[Diagram: two sockets, each with cores C1 ... Cn, per-core L1 and L2 caches, a shared L3, a memory controller (MC) with DRAM, PCI-e 3 (40X) IO, and QPI links between the sockets]

Registers/Buffers: <1ns
L1: ~4 cycles, ~1ns
L2: ~12 cycles, ~3ns
L3: ~40 cycles, ~15ns
L3 (dirty hit): ~60 cycles, ~20ns
QPI: ~40ns
DRAM: ~65ns

* Assumption: 3GHz Processor

Broadwell EX – 24 cores & 60MB L3 Cache

Thread Local Allocation Buffers

Young/New Generation: Eden, containing TLABs

• Affords locality of reference
• Avoid false sharing
• Can have NUMA aware allocation
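
The idea behind a TLAB can be sketched as bump-pointer allocation inside a chunk that only one thread uses, with a synchronised step needed only when the chunk runs out. The model below is an assumption-laden simplification: addresses are plain `long` offsets, and `Eden`, `Tlab`, `claim` and `allocate` are hypothetical names, not the API of any JVM.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical model of TLAB-style allocation; not how any specific JVM is coded.
final class TlabSketch {

    // Shared Eden space: chunks are handed out with a compare-and-set on one pointer.
    static final class Eden {
        private final AtomicLong top = new AtomicLong(0);
        private final long capacity;

        Eden(long capacity) {
            this.capacity = capacity;
        }

        // Returns the start offset of a new chunk, or -1 when Eden is exhausted.
        long claim(long size) {
            while (true) {
                long current = top.get();
                long next = current + size;
                if (next > capacity) {
                    return -1;
                }
                if (top.compareAndSet(current, next)) {
                    return current;
                }
            }
        }
    }

    // One buffer per thread: allocation is an unsynchronised pointer bump.
    static final class Tlab {
        private final Eden eden;
        private final long chunkSize;
        private long top; // next free offset in the current chunk
        private long end; // end of the current chunk (top == end means no chunk yet)

        Tlab(Eden eden, long chunkSize) {
            this.eden = eden;
            this.chunkSize = chunkSize;
        }

        // Returns the offset of the new object, or -1 when the request cannot be met.
        long allocate(long size) {
            if (size > chunkSize) {
                return -1; // a real runtime would allocate large objects outside the TLAB
            }
            if (top + size > end) {
                long start = eden.claim(chunkSize); // the only contended step
                if (start < 0) {
                    return -1; // a real runtime would trigger a young collection here
                }
                top = start;
                end = start + chunkSize;
            }
            long address = top;
            top += size;
            return address;
        }
    }
}
```

Because consecutive allocations by one thread land next to each other and never inside another thread's chunk, the sketch also shows where the locality and false-sharing benefits listed above come from.
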
Object Survival

Young/New Generation: Eden (containing TLABs), Survivor 0, Survivor 1, Virtual

• Aging Policies
• Compacting Copy
• NUMA Interleave
• Fast Parallel Scavenging
• Only the survivors require work
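
A toy model of a young-generation scavenge may help connect the aging policy to the copying step. Everything here is a simplification under stated assumptions: reachability is supplied as a flag instead of being traced from roots, spaces are Java lists, and the names `Obj`, `scavenge` and `tenuringThreshold` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model of a copying young collection with an aging policy.
final class ScavengeSketch {

    static final class Obj {
        final String name;
        final boolean reachable; // assumed known; a real collector traces from roots
        int age;                 // number of young collections survived so far

        Obj(String name, boolean reachable, int age) {
            this.name = name;
            this.reachable = reachable;
            this.age = age;
        }
    }

    final List<Obj> survivors = new ArrayList<>(); // the "to" survivor space
    final List<Obj> promoted = new ArrayList<>();  // objects moved to the old generation

    // Copies live objects out of Eden and the "from" survivor space, then
    // reclaims both source spaces wholesale.
    void scavenge(List<Obj> eden, List<Obj> fromSurvivor, int tenuringThreshold) {
        copyLive(eden, tenuringThreshold);
        copyLive(fromSurvivor, tenuringThreshold);
        eden.clear();
        fromSurvivor.clear();
    }

    private void copyLive(List<Obj> space, int tenuringThreshold) {
        for (Obj obj : space) {
            if (!obj.reachable) {
                continue; // skipped here; a tracing collector would never reach it at all
            }
            obj.age++;
            if (obj.age >= tenuringThreshold) {
                promoted.add(obj);  // old enough: promote
            } else {
                survivors.add(obj); // copied compactly, in visit order
            }
        }
    }
}
```

The cost of the collection is proportional to the number of objects copied, which is the sense in which only the survivors require work.
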
Object Promotion

Young/New Generation: Eden (containing TLABs), Survivor 0, Survivor 1, Virtual
Old Generation: Tenured, Virtual

• Concurrent Collection
• String Deduplication
Compacting Collections

Compacting Collections – Depth first copy

OS Pages and cache lines?

G1 – Concurrent Compaction

[Diagram: heap divided into equal regions, each labelled E (Eden), S (Survivor), O (Old) or H (Humongous), or left unused]

Azul Zing C4
True Concurrent Compacting Collector

Where next for GC?
Object Inlining/Aggregation

GC vs Manual Memory Management

Not easy to pick clear winner…

Managed GC                  Native
• GC Implementation         • Malloc Implementation
• Card Marking              • Arena/pool contention
• Read/Write Barriers       • Bin Wastage
• Object Headers            • Fragmentation
• Background Overhead       • Debugging Effort
  in CPU and Memory         • Inter-thread costs

Algorithms & Design

What is most important to performance?
• Avoiding cache misses
• Strength Reduction
• Avoiding duplicate work
• Amortising expensive operations
• Mechanical Sympathy
• Choice of Data Structures
• Choice of Algorithms
• API Design
• Overall Design
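
Of the items above, strength reduction is the easiest to show in a few lines: an expensive operation is replaced by a cheaper one that gives the same result under a known constraint. The sketch below maps a sequence number to a slot in a buffer. The names are hypothetical, and the masked version is only valid when the capacity is a power of two and the sequence is non-negative.

```java
// Hypothetical sketch of strength reduction: remainder replaced by a bit mask.
final class StrengthReductionSketch {

    // General form: works for any positive capacity, but uses integer division.
    static int slotWithModulo(long sequence, int capacity) {
        return (int) (sequence % capacity);
    }

    // Reduced form: a single AND.
    // Assumes capacity is a power of two and sequence >= 0.
    static int slotWithMask(long sequence, int capacity) {
        return (int) (sequence & (capacity - 1));
    }

    // Guard that callers can use once, at construction time, to check the assumption.
    static boolean isPowerOfTwo(int value) {
        return value > 0 && (value & (value - 1)) == 0;
    }
}
```

Checking the power-of-two constraint once when the structure is created, and not on every access, is also a small example of amortising an expensive operation.
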

In a large codebase it is really difficult to do everything well

It also takes some “uncommon” disciplines such as: profiling, telemetry, modelling…

“If I had more time, I would have written a shorter letter.”

- Blaise Pascal

The story of Aeron

Aeron is an interesting lesson in “time to performance”

Lots of others exist, such as the C# Roslyn compiler

Time spent on Mechanical Sympathy vs Debugging Pointers???

Immutable Data & Concurrency
Functional Programming

In Closing …

What does the future hold?

Remember Assembly vs Compiled Languages

What about the issues of footprint, startup time, GC pauses, etc. ???

Questions?
Blog: http://mechanical-sympathy.blogspot.com/
Twitter: @mjpt777

“Any intelligent fool can make things bigger, more complex, and more violent. It takes a touch of genius, and a lot of courage, to move in the opposite direction.”

- Albert Einstein
