Sony Computer Entertainment Europe Research & Development Division
Pitfalls of Object Oriented Programming
Tony Albrecht, Technical Consultant, Developer Services
What I will be covering
A quick look at Object Oriented (OO) programming
A common example
Optimisation of that example
Summary
Slide 2
Object Oriented (OO) Programming
What is OO programming?
a programming paradigm that uses "objects" (data structures consisting of data fields and methods) together with their interactions to design applications and computer programs. (Wikipedia)
Includes features such as
Data abstraction
Encapsulation
Polymorphism
Inheritance
Slide 3
What's OOP for? OO programming allows you to think about problems in terms of objects and their interactions. Each object is (ideally) self-contained
Contains its own code and data. Defines an interface to its code and data.
Each object can be perceived as a black box.
Slide 4
Objects If objects are self contained then they can be
Reused.
Maintained without side effects.
Used without understanding internal implementation/representation.
This is good, yes?
Slide 5
Are Objects Good? Well, yes... and no. First, some history.
Slide 6
A Brief History of C++
C++ development started: 1979
Slide 7
A Brief History of C++
Named C++: 1983
Slide 8
A Brief History of C++
First commercial release: 1985
Slide 9
A Brief History of C++
Release of v2.0: 1989
Slide 10
A Brief History of C++
Release of v2.0 (1989) added multiple inheritance, abstract classes, static member functions, const member functions and protected members.
Slide 11
A Brief History of C++
Standardised: 1998
Slide 12
A Brief History of C++
Updated: 2003
Slide 13
A Brief History of C++
C++0x: 2009
Slide 14
So what has changed since 1979?
Many more features have been added to C++.
CPUs have become much faster.
Transition to multiple cores.
Memory has become faster.
[Link]
Slide 15
CPU performance
Slide 16
Computer Architecture: A Quantitative Approach, by John L. Hennessy, David A. Patterson, Andrea C. Arpaci-Dusseau
CPU/Memory performance
Slide 17
Computer Architecture: A Quantitative Approach, by John L. Hennessy, David A. Patterson, Andrea C. Arpaci-Dusseau
What has changed since 1979? One of the biggest changes is that memory access speeds are far slower (relatively)
1980: RAM latency ~ 1 cycle 2009: RAM latency ~ 400+ cycles
What can you do in 400 cycles?
Slide 18
What has this to do with OO?
OO classes encapsulate code and data. So, an instantiated object will generally contain all data associated with it.
Slide 19
My Claim
With modern HW (particularly consoles), excessive encapsulation is BAD.
Data flow should be fundamental to your design (Data Oriented Design)
Slide 20
Consider a simple OO Scene Tree
Base Object class: contains general data
Node: container class
Modifier: updates transforms
Drawable/Cube: renders objects
Slide 21
Object
Each object
Maintains bounding sphere for culling
Has transform (local and world)
Dirty flag (optimisation)
Pointer to parent
Slide 22
Objects
[Diagram: class definition and memory layout; each square is 4 bytes]
Slide 23
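The original slide shows the class only as a memory-layout diagram (each square = 4 bytes). Below is a minimal sketch of what such an Object class might look like; all type and member names are assumptions based on the fields listed (bounding spheres, local/world transforms, dirty flag, parent pointer).

```cpp
#include <cassert>

// Sketch only: all names are assumptions, not the presentation's actual code.
struct Matrix4 { float m[16]; };      // 4x4 transform: 64 bytes
struct Sphere  { float x, y, z, r; }; // centre + radius: 16 bytes

class Object {
public:
    Object() : m_Parent(nullptr), m_Dirty(true) {}
    virtual ~Object() {}

protected:
    Sphere  m_Bounds;         // bounding sphere for culling (local space)
    Sphere  m_WorldBounds;    // bounding sphere (world space)
    Matrix4 m_Transform;      // local transform
    Matrix4 m_WorldTransform; // world transform
    Object* m_Parent;         // pointer to parent
    bool    m_Dirty;          // dirty flag (optimisation)
};
```

Note how much of the object is bulky matrix data: two 64-byte matrices alone span a 128-byte cache line.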
Nodes
Each Node is an object, plus
Has a container of other objects
Has a visibility flag.
Slide 24
Nodes
[Diagram: class definition and memory layout]
Slide 25
Consider the following code
Update the world transform and world space bounding sphere for each object.
Slide 26
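The code on these slides is shown as an image; this is a hedged reconstruction of the kind of traversal being described, including the dirty-flag early-out that the next slides analyse. To keep the sketch self-contained, the transform is reduced to a single scale factor and the bounding sphere to a radius; all names are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Sphere { float radius; }; // stand-in bounding sphere: radius only

class Object {
public:
    explicit Object(float r) : m_Bounds{r}, m_WorldBounds{r}, m_Dirty(true) {}
    virtual ~Object() {}

    float WorldRadius() const { return m_WorldBounds.radius; }

    // Leaf objects return their transformed bounding sphere. The !m_Dirty
    // early-out is the branch that later slides show to be a pessimisation.
    virtual const Sphere& GetWorldBoundingSphere(float worldScale) {
        if (!m_Dirty) return m_WorldBounds;
        m_WorldBounds.radius = m_Bounds.radius * worldScale;
        m_Dirty = false;
        return m_WorldBounds;
    }

protected:
    Sphere m_Bounds;      // local-space bounding sphere
    Sphere m_WorldBounds; // world-space bounding sphere
    bool   m_Dirty;       // dirty flag
};

class Node : public Object {
public:
    Node() : Object(0.0f) {}
    void AddObject(Object* o) { m_Objects.push_back(o); }

    // Update the world transform, then expand this node's world-space
    // bounding sphere by each child's (one virtual call per child).
    void Update(float parentScale, float localScale) {
        const float world = parentScale * localScale;
        for (std::size_t i = 0; i < m_Objects.size(); ++i) {
            const Sphere& s = m_Objects[i]->GetWorldBoundingSphere(world);
            if (s.radius > m_WorldBounds.radius)
                m_WorldBounds.radius = s.radius;
        }
    }

private:
    std::vector<Object*> m_Objects; // container of child objects
};
```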
Consider the following code
Leaf nodes (objects) return transformed bounding spheres
Slide 27
Consider the following code
Leaf nodes (objects) return transformed bounding spheres. What's wrong with this code?
Slide 28
Consider the following code
If m_Dirty == false then we get a branch misprediction, which costs 23 or 24 cycles.
Slide 29
Consider the following code
Calculation of the world bounding sphere takes only 12 cycles.
Slide 30
Consider the following code
So using a dirty flag here (in the case where it is false) is actually slower than not using one.
Slide 31
Let's illustrate cache usage
[Diagram: Main Memory and the L2 Cache; each cache line is 128 bytes]
Slide 32
Cache usage
parentTransform is already in the cache (somewhere)
Slide 33
Cache usage
Assume this is a 128-byte boundary (the start of a cache line)
Slide 34
Cache usage
Load m_Transform into the cache
Slide 35
Cache usage
m_WorldTransform is stored via the cache (write-back)
Slide 36
Cache usage
Next it loads m_Objects
Slide 37
Cache usage
Then a pointer is pulled from somewhere else (memory managed by the std::vector)
Slide 38
Cache usage
vtable pointer loaded into the cache
Slide 39
Cache usage
Look up the virtual function
Slide 40
Cache usage
Then branch to that code (load in instructions)
Slide 41
Cache usage
New code checks the dirty flag, then sets the world bounding sphere
Slide 42
Cache usage
The Node's world bounding sphere is then expanded
Slide 43
Cache usage
Then the next Object is processed
Slide 44
Cache usage
The first object costs at least 7 cache misses
Slide 45
Cache usage
Subsequent objects cost at least 2 cache misses each
Slide 46
The Test
11,111 nodes/objects in a tree, 5 levels deep
Every node being transformed
Hierarchical culling of tree
Render method is empty
Slide 47
Performance
This is the time taken just to traverse the tree!
Slide 48
Why is it so slow?
~22ms
Slide 49
Look at GetWorldBoundingSphere()
Slide 50
Samples can be a little misleading at the source code level
Slide 51
if(!m_Dirty) comparison
Slide 52
Stalls due to the load 2 instructions earlier
Slide 53
Similarly with the matrix multiply
Slide 54
Some rough calculations
Branch Mispredictions: 50,421 @ 23 cycles each ~= 0.36ms
Slide 55
Some rough calculations
L2 cache misses: 36,345 @ 400 cycles each ~= 4.54ms
Slide 56
From Tuner, ~ 3 L2 cache misses per object
These cache misses are mostly sequential (more than 1 fetch from main memory can happen at once)
Code / cache miss / code / cache miss / code
Slide 57
Slow memory is the problem here
How can we fix it? And still keep the same functionality and interface?
Slide 58
The first step
Use homogenous, sequential sets of data
Slide 59
Homogeneous Sequential Data
Slide 60
Generating Contiguous Data
Use custom allocators
Minimal impact on existing code
Allocate contiguously:
Nodes
Matrices
Bounding spheres
Slide 61
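The slide doesn't show the allocator itself; here is a minimal bump-allocator sketch of the idea (all names are assumptions): reserve one contiguous slab up front and hand out aligned chunks from it, so nodes, matrices and bounding spheres allocated together end up adjacent in memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of a custom allocator: a bump allocator over one contiguous slab.
class ArenaAllocator {
public:
    explicit ArenaAllocator(std::size_t bytes) : m_Buffer(bytes), m_Offset(0) {}

    // Returns 'bytes' bytes aligned to 'align' (a power of two), or nullptr
    // when the arena is exhausted. Allocations are never freed individually.
    void* Alloc(std::size_t bytes, std::size_t align) {
        std::size_t p = (m_Offset + align - 1) & ~(align - 1); // align up
        if (p + bytes > m_Buffer.size()) return nullptr;
        m_Offset = p + bytes;
        return m_Buffer.data() + p;
    }

private:
    std::vector<std::uint8_t> m_Buffer; // one contiguous slab
    std::size_t m_Offset;               // bump pointer
};
```

Routing the scene-graph classes' allocations through such an arena (for example via an overridden operator new) is one way to get the "minimal impact on existing code" the slide mentions.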
Performance
19.6ms -> 12.9ms
35% faster just by moving things around in memory!
Slide 62
What next?
Process data in order
Use implicit structure for hierarchy
Minimise to-and-fro between nodes
Group logic to optimally use what is already in cache
Remove regularly called virtuals
Slide 63
Hierarchy
We start with a parent Node
Slide 64
Hierarchy
Which has children nodes
Slide 65
Hierarchy
And they have a parent
Slide 66
Hierarchy
And they have children
Slide 67
Hierarchy
And they all have parents
Slide 68
Hierarchy
A lot of this information can be inferred
Slide 69
Hierarchy
Use a set of arrays, one per hierarchy level
Slide 70
Hierarchy
Parent has 2 children (:2)
Children have 4 children each (:4)
Slide 71
Hierarchy
Ensure nodes and their data are contiguous in memory
Slide 72
Make the processing global rather than local
Pull the updates out of the objects
No more virtuals
Easier to understand too: all the code is in one place
Slide 73
Need to change some things
OO version: update transform top down and expand WBS bottom up
Slide 74
Update transform
Slide 75
Update transform
Slide 76
Update transform and world bounding sphere
Slide 77
Add bounding sphere of child
Slide 78
Update transform and world bounding sphere
Slide 79
Add bounding sphere of child
Slide 80
Update transform and world bounding sphere
Slide 81
Add bounding sphere of child
Slide 82
Hierarchical bounding spheres pass info up
Transforms cascade down
Data use and code is striped
Processing is alternating
Slide 83
Conversion to linear
To do this with a flat hierarchy, break it into 2 passes:
Update the transforms and bounding spheres (top down)
Expand bounding spheres (bottom up)
Slide 84
Transform and BS updates
For each node at each level (top down)
{
    multiply world transform by parent's transform
    transform wbs by world transform
}
Slide 85
Update bounding sphere hierarchies
For each node at each level (bottom up)
{
    add wbs to parent's wbs
    cull wbs against frustum
}
Slide 86
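The two pseudocode passes above can be sketched over per-level contiguous arrays. To keep the example small, transforms are reduced to a single scale factor and bounding spheres to a radius; the NodeData layout and all names are assumptions, not the presentation's actual code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One entry per node; each hierarchy level is one contiguous array, and the
// parent/child links are implicit via childCount (the ":2"/":4" counts).
struct NodeData {
    float localScale;   // stand-in for the local transform
    float worldScale;   // stand-in for the world transform
    float radius;       // local bounding-sphere radius
    float worldRadius;  // world bounding-sphere radius
    int   childCount;   // children owned in the next level down
};

using Level = std::vector<NodeData>;

// Pass 1 (top down): combine each node's transform with its parent's,
// then transform the bounding sphere by the world transform.
void UpdateTransforms(std::vector<Level>& levels) {
    for (std::size_t l = 1; l < levels.size(); ++l) {
        std::size_t child = 0;
        for (const NodeData& parent : levels[l - 1]) {
            for (int c = 0; c < parent.childCount; ++c, ++child) {
                NodeData& n = levels[l][child];
                n.worldScale  = parent.worldScale * n.localScale;
                n.worldRadius = n.radius * n.worldScale;
            }
        }
    }
}

// Pass 2 (bottom up): expand each parent's world sphere by its children's.
// (Frustum culling would be folded into this loop, as on the slide.)
void ExpandBoundingSpheres(std::vector<Level>& levels) {
    if (levels.empty()) return;
    for (std::size_t l = levels.size() - 1; l > 0; --l) {
        std::size_t child = 0;
        for (NodeData& parent : levels[l - 1]) {
            for (int c = 0; c < parent.childCount; ++c, ++child)
                parent.worldRadius = std::max(
                    parent.worldRadius, levels[l][child].worldRadius);
        }
    }
}
```

Each pass walks two adjacent arrays strictly in order, which is what makes the cache behaviour (and the prefetching later on) predictable.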
Update Transform and Bounding Sphere
How many children nodes to process
Slide 87
Update Transform and Bounding Sphere
For each child, update transform and bounding sphere
Slide 88
Update Transform and Bounding Sphere
Note the contiguous arrays
Slide 89
So, what's happening in the cache?
[Diagram: parent node, children nodes, parent data and children's data in main memory, above the unified L2 cache; the children's data is not yet needed]
Slide 90
Load parent and its transform
Slide 91
Load child transform and set world transform
Slide 92
Load child BS and set WBS
Slide 93
Load child BS and set WBS
The next child is calculated with no extra cache misses!
Slide 94
Load child BS and set WBS
The next 2 children incur 2 cache misses in total
Slide 95
Prefetching
Because all data is linear, we can predict what memory will be needed in ~400 cycles and prefetch it
Slide 96
Tuner scans show about 1.7 cache misses per node. But these misses are much more frequent:
Code / cache miss / cache miss / code
Less stalling
Slide 97
Performance
19.6 -> 12.9 -> 4.8ms
Slide 98
Prefetching
Data accesses are now predictable
Can use prefetch (dcbt) to warm the cache
Data streams can be tricky
Many reasons for stream termination
Easier to just use dcbt blindly (look ahead x number of iterations)
Slide 99
Prefetching example
Prefetch a predetermined number of iterations ahead
Ignore incorrect prefetches
Slide 100
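The slide's example uses the PPU's dcbt instruction; as a portable sketch of the same "blind look-ahead" idea, GCC/Clang's __builtin_prefetch can stand in. The look-ahead distance of 16 iterations and the function name are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Blind prefetch: every iteration, touch the cache line a fixed number of
// iterations ahead. Incorrect or redundant prefetches are just wasted hints,
// so no stream bookkeeping is needed.
float SumWithPrefetch(const std::vector<float>& data) {
    const std::size_t kAhead = 16; // assumed look-ahead distance
    float total = 0.0f;
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (i + kAhead < data.size())
            __builtin_prefetch(&data[i + kAhead]); // dcbt on the PPU
        total += data[i];
    }
    return total;
}
```

The distance should be tuned so the prefetched line arrives roughly when it is needed (on the order of the ~400-cycle main-memory latency quoted earlier).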
Performance
19.6 -> 12.9 -> 4.8 -> 3.3ms
Slide 101
A Warning on Prefetching
This example makes very heavy use of the cache
This can affect other threads' use of the cache
Multiple threads with heavy cache use may thrash the cache
Slide 102
The old scan
~22ms
Slide 103
The new scan
~16.6ms
Slide 104
Up close
~16.6ms
Slide 105
Looking at the code (samples)
Slide 106
Performance counters
Branch mispredictions: 2,867 (cf. 47,000)
L2 cache misses: 16,064 (cf. 36,000)
Slide 107
In Summary
Just reorganising data locations was a win
Data + code reorganisation = dramatic improvement
+ prefetching = even more WIN
Slide 108
OO is not necessarily EVIL
Be careful not to design yourself into a corner
Consider data in your design
Can you decouple data from objects? Code from objects?
Be aware of what the compiler and HW are doing
Slide 109
It's all about the memory
Optimise for data first, then code.
Memory access is probably going to be your biggest bottleneck
Simplify systems
KISS: easier to optimise, easier to parallelise
Slide 110
Homogeneity
Keep code and data homogeneous
Avoid introducing variations
Don't test for exceptions; sort by them
Not everything needs to be an object
If you must have a pattern, then consider using Managers
Slide 111
Remember
You are writing a GAME
You have control over the input data
Don't be afraid to preformat it, drastically if need be
Design for specifics, not generics (generally).
Slide 112
Data Oriented Design Delivers
Better performance
Better realisation of code optimisations
Often simpler code
More parallelisable code
Slide 113
The END
Questions?
Slide 114