Advanced Backtesting Paper
HangukQuant*
Abstract
Testing trading strategies is an essential part of the quant tool-kit. By definition, researching and
testing ideas/strategies in the quantitative space is synonymous with prototyping: ease of expressing
our ideas and correctness of code are prioritised. As a result, quantitative researchers often perform analysis
in a high-level language such as Python, abstracting away the marginal concerns of memory management,
garbage collection, object typing and compilation, while leaving quantitative engineers to do the ‘dirty’
work in a low-level language such as C++. This comes at the cost of speed, often running up to 100 times
slower than it ought to! However, many are still reluctant to ‘leave’ the Python community - there is
excellent community support and a vast universe of useful libraries for data scientists and quants. In this
paper we see how we can write simple, readable code in Python with ‘C’ performance. The techniques
discussed herein can be extrapolated to other quantitative tasks, particularly where computationally
demanding tasks are at hand. Our focus is on single-machine (as opposed to cluster), CPU (as opposed
to GPU) optimizations for faster computation (as opposed to memory/RAM efficiency).
Contents
1 Introduction
2 Our Results
3.3.1 FX-Calculator
3.4 Time-Series Long-Biased Momentum in Equities With Ex-Post Risk Management for Crash Risk Factor
4 Benchmarking
4.3 Strat1
4.4 Strat2
4.5 Strat3
5.2 Strat1
5.3 Strat2
5.4 Strat3
6.1.2 Replacing Function Calls with More Suitable Choices
10.3 Back to Numpy - All Functions are Equal, But Some are More Equal than Others
13 Conclusions
1 Introduction
As we often do in our papers, we do not beat about the bush. We cover technical details as we go through
the examples. We ‘learn by doing’, instead of rambling on about theory. Where the theory is well known in the
literature and easily referenced, we refer readers to the source - there is no point in wasting time
re-iterating another blog post. As before, the same caveat applies: this is the result of weekly
musings, so there are bound to be mistakes. Please email all errors/issues to [email protected] or tweet
me at @HangukQuant.
For most traders (non-MM/HFT firms), ‘instantaneous’ is not a necessity - but we still require efficiency.
In smaller groups of traders (1 ∼ 5 people), members often have to be equipped with skills in both
quantitative research and trade execution. ‘Efficiency’ refers not only to trading latency - it also
refers to the efficiency of the team as a whole: the ability of the team to pipeline between data, alpha and
trade management. If you spend a week prototyping an idea and then another week implementing it in a
‘fast’ language, that is an efficiency cost. Programming in a homogeneous framework tremendously speeds
up the quant process, and is significantly easier to manage. A well-written Python script can run only fractions
of a second slower than a finely-tuned C program; in most cases this is not a concern, and the team should then
favor the more efficient solution, which is to adopt a homogeneous framework with a ‘simpler’ interface.
The downside is that many quantitative researchers do not know how to write efficient, readable Python
code. The idea they want to test takes an hour to run when it should only take a couple of minutes. Such
issues seriously hamper the efficiency of the team as a whole, and we should avoid them whenever possible.
2 Our Results
We manage to achieve speedups of one to two orders of magnitude with the techniques discussed (runtimes in seconds):
Version     Strat 1   Strat 2   Strat 3
Original    116       -         81
Improved    10        -         5
As with most of our other papers, we will introduce the sample alphas and preliminary code, then iteratively
improve upon them to arrive at a better solution. Most readers should know: ‘we can do better’ is implicit
in all iterations of our code, our work and our life. We always start with the imperfect and strive to be better.
We show the tools, processes and intuitions behind the differences between the iterations. For concreteness,
we will not work with contrived examples. Let’s begin right away.
We are not particularly concerned about the attractiveness of the alphas - although some of them appear
to have fairly attractive performance, we caution readers against taking them ‘as is’. We take no responsibility
for their correctness, viability or profitability in trading markets. Trading is a risky operation.
Figure 1: equity curve of our alpha set, with its Sharpe ratio.
Let’s create a mini-utility library that acts as a stub for our database and other utilities. Relevant data can
be obtained from YahooFinance, Oanda or other sources. Data retrieval is discussed and implemented in
previous papers by HangukQuant - we will not meddle with it here.
dirtylib is the library that we will call to retrieve our equities and FX data. Our dataset consists
of stocks and fx-pairs from 2010.01.01 ∼ 2020.01.01, for no particular reason. Our pickled data is a tuple
of an array of instruments in our universe and a dictionary dfs of the corresponding OHLCV(AC)
dataframes for each instrument. We do a simple cleanup of null dataframes and empty dataframes. Next,
we label the currency that each instrument is denominated in, using % to mean ‘denominated in’. For equity
tickers such as AAPL, we use the ticker AAPL%USD to indicate denomination in USD. For fx-pairs, we label
with the last 3 characters, such that EURJPY → EURJPY%JPY, USDJPY → USDJPY%JPY, EURUSD
→ EURUSD%USD and so on. We assume the base currency of our simulator is USD. We will see how this
nomenclature is useful later.
4 def get_equities_dict():
5     (instruments, dfs) = get_cleaned("equities_temp.pickle")
6     renamed = {}
7     for i in instruments:
8         renamed[i + "%USD"] = dfs[i]
9     return list(renamed.keys()), renamed
10
11 def get_fx_dict():
12     (instruments, dfs) = get_cleaned("fx_temp.pickle")
13     renamed = {}
14     for i in instruments:
15         if len(i) == 6:
16             renamed[i + "%" + i[3:]] = dfs[i]
17         else:
18             renamed[i + "%USD"] = dfs[i]
19     return list(renamed.keys()), renamed
20
Listing 1: dirtylib.py
3.2 Equity Skewness Idiosyncratic Premia
For a detailed explanation of the skewness premia, refer to the literature. This is one that I have lightly
touched upon many times before - it is simple and intuitive to understand. We will explain it at a very high
level: people love lotteries, even though they are negative-EV products. The return distribution of a lottery
is positively skewed. If there is a systematic preference for positively skewed return assets, then these assets
must be overpriced. Our trading strategy is therefore short positive skew and long negative skew.
The return skewness can be decomposed into idiosyncratic skew and systematic skew, the latter being the skew
of the index/benchmark itself. In order to remove the beta effect, instead of computing skew on the
absolute returns of the asset, we use a variant where we compute skew on the residuals of individual
returns regressed on market returns. We adopt a volatility targeting approach at two levels: a)
individual position sizes are taken inversely proportional to their 30d rolling volatility of close-to-close
returns relative to the other traded positions, and b) overall positions are scaled relative to the strategy's
historical volatility to achieve a target portfolio volatility of 20% annualized. Our choice of variant is
simply to demonstrate a task that is rather computationally demanding - we need to do this on a rolling
basis, such that the regressions and skew computations are rolled every day. Perhaps it is better to illustrate
with code.
Listing 2: strat1.py
We hope to run a test with some code such as the following (we picked the first 200 instruments from our equities
universe to save time...):
1 if __name__ == "__main__":
2     (instruments, dfs) = get_equities_dict()
3     alpha = Alpha(
4         instruments=instruments[:200],
5         dfs=dfs,
6         configs={
7             "start": dlib.TRADE_START,
8             "end": dlib.TRADE_END,
9             "longsize": 50,
10             "shortsize": 50
11         }
12     )
13     portfolio_df = alpha.run_simulation()
Listing 3: strat1.py
which means we need to create an Alpha class and implement the function run_simulation.
Listing 4: strat1.py
and the meat of our testing engine (bear through the 99 lines of code here, we will make it a lot cleaner
as we go...):
1 def run_simulation(self):
2     """
3     Settings
4     """
5     portfolio_vol = 0.20
6     trade_datetime_range = pd.date_range(
7         start=self.get_trade_datetime_range()[0],
8         end=self.get_trade_datetime_range()[1],
9         freq="D"
10     )
11
12     """
13     Compute Metas
14     """
15     self.compute_metas(index=trade_datetime_range)
16
17     """
18     Initialisations
19     """
20     portfolio_df = self.init_portfolio_settings(trade_range=trade_datetime_range)
21

23         date = portfolio_df.loc[i, "datetime"]
24         strat_scalar = 2
25

28         if i != 0:
29             date_prev = portfolio_df.loc[i - 1, "datetime"]
30             day_pnl, nominal_ret = backtest_utils.get_pnl_stats(
31                 date=date,
32                 prev=date_prev,
33                 portfolio_df=portfolio_df,
34                 instruments=self.instruments,
35                 idx=i,
36                 historicals=self.dfs,
37                 close_col="adj_close"
38             )
39
40         strat_scalar = self.get_strat_scaler(
41             portfolio_df=portfolio_df,
42             lookback=30,
43             portfolio_vol=portfolio_vol,
44             idx=i,
45             default=strat_scalar
46         )
47         portfolio_df.loc[i, "strat scalar"] = strat_scalar
48
49         """
50         Compute Alpha
51         """
52         alpha_scores = {}
53         for inst in eligibles:
54             from sklearn.linear_model import LinearRegression
55             from sklearn.preprocessing import PolynomialFeatures
56             from scipy.stats import skew
57

66         alpha_scores = {k: v for k, v in sorted(alpha_scores.items(), key=lambda pair: pair[1])}
67         alphalong = list(alpha_scores.keys())[-self.configs["longsize"]:]
68         alphashort = list(alpha_scores.keys())[:self.configs["shortsize"]]
69

74         nominal_tot = 0
75         for inst in eligibles:
76             forecast = 1 if inst in alphalong else 0
77             forecast = -1 if inst in alphashort else forecast
78             vol_target = 1 / (self.configs["longsize"] + self.configs["shortsize"]) \
79                 * portfolio_df.loc[i, "capital"] \
80                 * portfolio_vol \
81                 / np.sqrt(253)
82             dollar_vol = self.dfs[inst].loc[date, "vol"] * self.dfs[inst].loc[date, "adj_close"]
83             position = strat_scalar * forecast * vol_target / dollar_vol
84             portfolio_df.loc[i, inst + " units"] = position
85             nominal_tot += abs(position * self.dfs[inst].loc[date, "adj_close"])
86
Listing 5: strat1.py
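The per-instrument skew computation itself (lines 57-65 above) is elided in this extract. As a rough illustration of the rolling residual-skew idea described in the previous section, a minimal sketch might look like the following (the 60-observation window, the function name, and the use of a single trailing window rather than a full rolling loop are assumptions for illustration only):

import numpy as np
from sklearn.linear_model import LinearRegression
from scipy.stats import skew

def residual_skew(inst_rets, mkt_rets, window=60):
    # Skew of the residuals from regressing instrument returns on market returns
    # over the trailing `window` observations (window length is an assumption).
    y = np.asarray(inst_rets[-window:]).reshape(-1, 1)
    x = np.asarray(mkt_rets[-window:]).reshape(-1, 1)
    model = LinearRegression().fit(x, y)
    residuals = (y - model.predict(x)).flatten()
    return skew(residuals)

In the actual backtest something like this would be evaluated for every instrument on every trading day, which is exactly the computational burden the rest of the paper sets out to reduce.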
In order: we initialise the target portfolio volatility at 20% and create a date range spanning the start to the
end of our chosen period. We then use the compute_metas function to compute the metadata required for
the backtest, and then, for each day: a) get a list of dynamically tradable assets, b) get a position sizing scalar,
c) get the pnl, d) calculate updated positions, e) set positions inversely proportional to their historical volatility
with the capital divided among the actively traded instruments, and f) compute the target weights and number of
units to hold in the portfolio.
2     def get_trade_datetime_range(self):
3         return (self.configs["start"], self.configs["end"])
4

33         portfolio_df.loc[0, "ewma"] = 0.001
34         return portfolio_df
35
Listing 6: strat1.py
Throughout our backtest, we will need volatility statistics for risk management, close-to-close returns
to compute pnl, a list of eligible stocks to be traded on a particular date, and the ‘index/benchmark’ returns
(computed as the aggregated mean of all the returns in the universe). A ticker is considered eligible to be
traded if there are 61 continuous days of data for the skew computation AND if there is movement in the stock
price over the last 5 entries with non-negative prices.
Note that the order of the statements in compute_metas is non-trivial, in particular with respect to the
statement self.dfs[inst] = df.join(self.dfs[inst]). The statement df = pd.DataFrame(index=index)
creates a daily series of dates (including weekends, holidays and so on), while the dataframes referenced by
self.dfs[inst] have an index that is not guaranteed to be a continuous series (this depends on your
data source). Most likely, the data source will contain only rows where there were samples to take, so that
weekends and exchange holidays have no entries. The ‘join’ operation will therefore create gaps in the joined
dataframe. We want to compute volatility and active universes before we forward fill and backward fill our
data. This subtlety is important.
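To make the ordering concrete, here is a minimal, self-contained sketch (the dates and prices are made up; only the adj_close column name matches the scripts):

import numpy as np
import pandas as pd

# Vendor-style price series: business days only, no weekend rows.
prices = pd.DataFrame(
    {"adj_close": [100.0, 101.0, 99.5, 100.5, 102.0]},
    index=pd.bdate_range("2020-01-06", periods=5),
)

# Continuous daily calendar, as created by pd.DataFrame(index=index) in compute_metas.
daily = pd.DataFrame(index=pd.date_range("2020-01-04", "2020-01-12", freq="D"))

joined = daily.join(prices)  # weekend/holiday rows become NaN gaps
rets = -1 + joined["adj_close"] / joined["adj_close"].shift(1)

# Computing the rolling volatility BEFORE filling lets the NaN rows drop out of the estimate;
# filling first would inject artificial zero-return days and bias the estimate downwards.
vol_before_fill = rets.rolling(3, min_periods=2).std()
filled = joined["adj_close"].ffill().bfill()
vol_after_fill = (-1 + filled / filled.shift(1)).rolling(3, min_periods=2).std()
print(vol_before_fill.iloc[-1], vol_after_fill.iloc[-1])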
Finally, we want to implement the method that gets our portfolio scalar,
1 def get_strat_scaler(self, portfolio_df, lookback, portfolio_vol, idx, default):
2     ann_realized_vol = np.sqrt(portfolio_df.loc[idx - 1, "ewma"] * 252)
3     return portfolio_vol / ann_realized_vol
Listing 7: strat1.py
where the ewma references the forward volatility estimate of our trading strategy’s nominal returns on an
unleveraged portfolio. We discussed this volatility targeting formulation in our previous papers [1]. Recall
that the individual positions are computed as follows:
vol_target = 1 \
/ (self.configs["longsize"] + self.configs["shortsize"]) \
* portfolio_df.loc[i, "capital"] \
* portfolio_vol \
/ np.sqrt(253)
dollar_vol = self.dfs[inst].loc[date, "vol"] * self.dfs[inst].loc[date, "adj_close"]
position = strat_scalar * forecast * vol_target / dollar_vol
The value portfolio_df.loc[i, "capital"] * portfolio_vol / np.sqrt(253) is the volatility in dollar terms
rescaled to daily granularity for the entire portfolio. Hence, the variable vol_target is the volatility in dollar
terms for a single position, which is divided by the asset volatility (also in dollar terms). strat_scalar is then
the portfolio scalar that increases position sizing to make up for the volatility loss attributed to diversification.
We will improve this later to achieve more flexibility in our risk management.
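Putting the pieces together, the per-instrument sizing implied by the code above can be written as (a direct transcription of the listing, with $N$ the total number of long and short slots, $C$ the current capital, $\sigma_{\text{tgt}} = 20\%$ the annualized target, $\sigma_i$ the 30d rolling volatility and $P_i$ the adjusted close):

$$\text{position}_i \;=\; \text{strat\_scalar} \times f_i \times \underbrace{\frac{1}{N}\cdot\frac{C\,\sigma_{\text{tgt}}}{\sqrt{253}}}_{\texttt{vol\_target}} \;\Big/\; \underbrace{\sigma_i\,P_i}_{\texttt{dollar\_vol}}, \qquad f_i \in \{-1, 0, 1\}.$$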
Last but not least, we need to implement the functionality to get our pnl each day:
1 def get_pnl_stats(date, prev, portfolio_df, instruments, idx, historicals, close_col="adj_close"):
2     day_pnl = 0
3     nominal_ret = 0
4     for inst in instruments:
5         units_held = portfolio_df.loc[idx - 1, inst + " units"]
6         if units_held != 0:
7             delta_price = historicals[inst].loc[date, "adj_close"] - historicals[inst].loc[prev, "adj_close"]
8             inst_pnl = units_held * delta_price
9             day_pnl += inst_pnl
10             nominal_ret += portfolio_df.loc[idx - 1, inst + " w"] * historicals[inst].loc[date, "ret"]
11

17     portfolio_df.loc[idx, "ewma"] = 0.06 * (nominal_ret ** 2) + 0.94 * portfolio_df.loc[idx - 1, "ewma"] \
18         if nominal_ret != 0 else portfolio_df.loc[idx - 1, "ewma"]
19     return day_pnl, nominal_ret
Note the line that performs the EWMA volatility forecast, which we use when computing the volatility
scalar. It is computed on unleveraged portfolio returns, hence we need to use the variable nominal_ret. For
a brief discussion on this, refer to the paper on vol & leverage targeting [1].
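For reference, the update in the listing is a standard EWMA variance recursion with decay $\lambda = 0.94$ (matching the 0.06/0.94 weights in the code):

$$\hat{\sigma}^2_t = \lambda\,\hat{\sigma}^2_{t-1} + (1-\lambda)\,r_t^2, \qquad \lambda = 0.94,$$

where $r_t$ is the unleveraged (nominal) portfolio return; the annualized figure used in get_strat_scaler is then $\sqrt{252\,\hat{\sigma}^2_t}$.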
We are done - this is our first strategy. Most of the strategies follow this format: a risk ‘controller’ and
an alpha ‘controller’, with the two interacting to churn out signals. The biggest differences between
trading strategies are how these ‘controllers’ vary in their logic and implementation. We look to
expand on this by introducing some more variants of strategies that we might encounter in the wild.
The first strategy was a cross-sectional, inverse-volatility-risk strategy trading the skewness premia on US
equities. For the next strategy, we want to expand beyond that and trade FX/products based in other currencies. We
consider a very simple signal that goes long the largest drawdowns over a rolling 12-day period and shorts the
smallest drawdowns. In order to do this, we need to implement an FX calculator - let’s add this functionality
to our backtest_utils.
3.3.1 FX-Calculator
This function takes in an instrument (recall that we labelled each instrument with its denomination currency
using the % sign) and returns the value of a single unit of that instrument in the base currency. Corresponding to the
condition:
if instrument[-3:] == "USD":
    return dfs[instrument].loc[date, "adj_close"]
an instrument denominated in ‘USD’, such as AAPL%USD or EURUSD%USD, has a value (in base currency
USD) worth the price itself. The next condition covers an instrument such as EURAUD%AUD, which is worth
EURAUD%AUD * AUDUSD%USD, and the final condition covers the remaining quote convention.
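The implementation itself is not reproduced in this extract; a minimal sketch of a unit_base_value helper consistent with the description above might look like the following (the handling of the two non-USD branches and the exact pair lookups are assumptions):

def unit_base_value(instrument, dfs, date):
    # Sketch: value of one unit of `instrument` expressed in the base currency (USD).
    denom = instrument.split("%")[-1]  # denomination currency from our % labelling
    price = dfs[instrument].loc[date, "adj_close"]
    if denom == "USD":
        return price  # already denominated in the base currency
    if denom + "USD%USD" in dfs:  # e.g. EURAUD%AUD -> multiply by AUDUSD%USD (assumed lookup)
        return price * dfs[denom + "USD%USD"].loc[date, "adj_close"]
    # otherwise the pair is quoted the other way, e.g. EURJPY%JPY -> divide by USDJPY%JPY (assumed)
    return price / dfs["USD" + denom + "%" + denom].loc[date, "adj_close"]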
This affects our calculation of the nominal value of a single unit of a non-USD-denominated product, and
shall be reflected in our code. Accordingly, the ‘dollar volatility’ computation is also affected, and we need to
make the relevant adjustments. Before demonstrating the code right away, we shall also introduce another
variant to our risk allocator - suppose we no longer want to take relative position sizing proportional to 1/vol.
We may want to adopt a risk variant that is agnostic to relative volatilities, such that the a priori ‘risk’ estimate
is a unit constant. Or, we may want to adopt a risk variant that takes positions inversely related to
their volatilities, but not proportionately so - penalising positions that become too large (a result of volatility
compression). Either way, we want to implement position sizing that is inversely related to our estimate
of relative risk (whether that estimate is constant or variable is another issue) - anything else would be a
foolish decision.
In FX (and other) markets, there is often a compression of volatility followed by aggressive movement
(breakouts). The compression of volatility would - in an inverse volatility framework - cause us to position
aggressively in the asset in order to achieve the same amount of relative risk, in volatility terms, with respect to the
other assets. However, this can be risky, as volatility has a mean-reverting component; the state of
affairs can change and a big move in the low-volatility asset may occur. Getting caught wrong-footed in these
positions can be career-ending. These situations can occur naturally in markets, or under the influence of a
third party (such as central bank FX pegging). Let’s do a case study in a two-asset universe, using EURCHF
and USDJPY as our examples.
We see immediately that there are jumps in the EURCHF closing prices, while USDJPY prices have a
smoother path. The Swiss National Bank pegged the Swiss franc to the euro on Sept 6, 2011, and scrapped
the peg on Jan 15, 2015 - the flash crash is evident in the price chart. Let’s take a look at their 30-day rolling
volatilities, and their inverses (since we take positions inversely related to them).
We see there is some correlation, but EURCHF is significantly jumpier - and since the relative sizing scalar
is the ratio of their volatilities, let’s take a look at the ratio of the rolling volatilities:
Figure 4: EURCHF USDJPY ratio of rolling inverse volatilities
where we see that we often have a position in EURCHF that is 5 ∼ 20 times larger than the position
in USDJPY! Recalling the mean-reverting nature of volatility, this position sizing might not be astute, and
in this two-asset scenario the entire strategy pnl distribution is dominated by our EURCHF directional bets
instead of the premia targeted. There is lopsided nominal exposure, even though on a historical volatility
basis we are market neutral.
Additionally, we know that alpha is theoretical while trading costs are real. We want to reduce trading
costs as much as possible, and jumpiness introduces positional change, which is costly. Let’s take a look at
the 1-day delta of the rolling inverse volatilities:
Figure 5: EURCHF USDJPY rolling inverse volatilities deltas
which suggests that from one day to the next we might increase our position in a held asset by up to 400%.
We want to penalise extremely low volatilities, while maintaining relative position sizing based on
perceived risk - consider the order of growth of some monotonic functions that ‘punish’ larger values:
Figure 6: Sqrt and log(base) order of growth.
where we can tune the ‘punish’ factor for large positions by varying the log base. Let’s now suppose
we choose the base 1.3, and re-view the ratio of their rolling log-inverse-volatilities:
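Concretely, the transform can be computed with the change-of-base formula; the snippet below mirrors the loginvvol/invrisk columns used in the scripts later in the paper (wrapping it in a helper function is our own choice):

import numpy as np

def log_inverse_vol(vol_series, base=1.3):
    # log_base(1 / vol) via the change-of-base formula: ln(1 / vol) / ln(base)
    return np.log(1.0 / vol_series) / np.log(base)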
Figure 7: EURCHF USDJPY ratio of rolling log inverse volatilities
which might give a more palatable relative sizing between the two assets, and their delta-1 ratios:
Figure 8: EURCHF USDJPY rolling log inverse volatilities deltas
which gives significantly less position turnover during volatility jumps. This simple exercise leads us
to different variants of relative risk sizing; which behavior the system should have comes down to trader
preference - it is entirely possible that the trader wants aggressive relative allocations based on volatility,
but he should then be aware that it is a ’feature, not a bug’. Note that in real systems it is often
preferable to remove pegged currencies from the universe under consideration entirely, but the same issue
applies to illiquid names, and a punishing component of this kind is a relatively common feature, albeit with
implementation differences.
Summing up the last two sections, with the FX calculator and the custom risk variant, we may want to
adjust the code logic computing position sizing, such that it looks something like:
position = strat_scalar \
    * forecast \
    * vol_target \
    * self.dfs[inst].loc[date, "loginvvol"] \
    / backtest_utils.unit_base_value(instrument=inst, dfs=self.dfs, date=date)
where unit_base_value is the value of one unit of that instrument expressed in base currency terms, and
loginvvol is our revised estimate of risk (or whatever else it might be). However, the adjustments create a
minor complication: our portfolio scalar strat_scalar was computed on unleveraged portfolio returns
and compensates for the volatility loss from diversification. In our log-risk variant the volatility loss is
no longer purely due to diversification, since each position now carries a different scaling factor. We need to
compute the strategy scalar from realised (leveraged) portfolio returns, so we make the following adjustments:
1 def get_strat_scaler(self, portfolio_df, lookback, portfolio_vol, idx, default):
2     ann_realized_vol = np.sqrt(portfolio_df.loc[idx - 1, "ewma"] * 252)
3     return portfolio_vol / ann_realized_vol * portfolio_df.loc[idx - 1, "ewstrat"]  # <<
1 def get_pnl_stats(date, prev, portfolio_df, instruments, idx, historicals, close_col="adj_close"):
2     day_pnl = 0
3     nominal_ret = 0
4     for inst in instruments:
5         units_held = portfolio_df.loc[idx - 1, inst + " units"]
6         if units_held != 0:
7             inst_pnl = units_held * unit_base_value(instrument=inst, dfs=historicals, date=prev) * historicals[inst].loc[date, "ret"]
8             day_pnl += inst_pnl
9             nominal_ret += portfolio_df.loc[idx - 1, inst + " w"] * historicals[inst].loc[date, "ret"]
10
and we replace the relevant parts of our logic wherever the currency conversion should take place.
Note that I have also added statements to the eligibility check (the ret.shift(-1) bounds visible in the listing below).
This is an obvious look-ahead bias, but I have included it for a couple of reasons: a) my backtest is not only meant to
measure exactly how profitable the strategy has been in the past, but also to test whether the mean-reversion effect
exists in isolation of anomalies, and b) it turns out my data source was corrupted and had random numbers sampled
on some days. Obviously, if the strategy were specifically trading tails then this would affect the viability of
my hypothesis checking, but in this case it does not.
We make the relevant changes to account for FX conversions and varying risk estimates, and our script
for strategy 2 looks something like this (note how most of the other code remains the same!):
1 import numpy as np
2 import pandas as pd
3

10 class Alpha():
11

17     def get_trade_datetime_range(self):
18         return (self.configs["start"], self.configs["end"])
19

28             self.dfs[inst]["ret"] = -1 + self.dfs[inst]["adj_close"] / self.dfs[inst].shift(1)["adj_close"]
29             self.dfs[inst]["sampled"] = self.dfs[inst]["adj_close"] != self.dfs[inst].shift(1)["adj_close"]
30             self.dfs[inst]["active"] = self.dfs[inst]["sampled"].rolling(5).apply(lambda x: int(np.any(x))).fillna(0)
31
32             self.dfs[inst]["dd"] = (1 + self.dfs[inst]["ret"]).rolling(window=12).apply(np.prod) / (1 + self.dfs[inst]["ret"]).rolling(window=12).apply(np.prod).cummax() - 1
33             self.dfs[inst]["eligible"] = \
34                 (~np.isnan(self.dfs[inst]["dd"])) \
35                 & self.dfs[inst]["active"] \
36                 & (self.dfs[inst]["vol"] > 0) \
37                 & (self.dfs[inst]["adj_close"] > 0) \
38                 & (self.dfs[inst]["ret"].shift(-1) < 0.30) \
39                 & (self.dfs[inst]["ret"].shift(-1) > -0.30)
40

64             end=self.get_trade_datetime_range()[1],
65             freq="D"
66         )
67
68         """
69         Compute Metas
70         """
71         self.compute_metas(index=trade_datetime_range)
72
73         """
74         Initialisations
75         """
76         portfolio_df = self.init_portfolio_settings(trade_range=trade_datetime_range)
77

84             if i != 0:
85                 date_prev = portfolio_df.loc[i - 1, "datetime"]
86                 day_pnl, nominal_ret = backtest_utils.get_pnl_stats(
87                     date=date,
88                     prev=date_prev,
89                     portfolio_df=portfolio_df,
90                     instruments=self.instruments,
91                     idx=i,
92                     historicals=self.dfs,
93                     close_col="adj_close"
94                 )
95
96             strat_scalar = self.get_strat_scaler(
97                 portfolio_df=portfolio_df,
98                 lookback=30,
99                 portfolio_vol=portfolio_vol,
100                 idx=i,
101                 default=strat_scalar
102             )
103
104             portfolio_df.loc[i, "ewstrat"] = 0.06 * strat_scalar + 0.94 * portfolio_df.loc[i - 1, "ewstrat"] \
105                 if nominal_ret != 0 else portfolio_df.loc[i - 1, "ewstrat"]
106

109             alpha_scores = {}
110             for inst in eligibles:
111                 alpha_scores[inst] = -1 * self.dfs[inst].loc[date, "dd"]
112

121             nominal_tot = 0
122             for inst in eligibles:
123                 forecast = 1 if inst in alphalong else 0
124                 forecast = -1 if inst in alphashort else forecast
125                 vol_target = 1 / (self.configs["longsize"] + self.configs["shortsize"]) \
126                     * portfolio_df.loc[i, "capital"] \
127                     * portfolio_vol \
128                     / np.sqrt(253)
129                 position = strat_scalar * forecast * vol_target * self.dfs[inst].loc[date, "loginvvol"] / backtest_utils.unit_base_value(instrument=inst, dfs=self.dfs, date=date)
130                 portfolio_df.loc[i, inst + " units"] = position
131                 nominal_tot += abs(position * backtest_utils.unit_base_value(instrument=inst, dfs=self.dfs, date=date))
132

144         portfolio_df.to_csv("strat2.csv")
145         return portfolio_df
146
That’s great - now we have two strategies; let’s introduce the third and final one.
In this strategy, we take a time-series momentum strategy and introduce a slight change to the risk
management, to help illustrate some of the variables involved in strategy management. Unlike the other
strategies, we do not know beforehand how many instruments are long/short, or the size of our forecasts. In
essence, we have a forecast that is non-binary (not just 1/-1 or 0) and can take any value within a range. Since we
have generalised the portfolio scalar computation, the volatility targeting component still works as desired even
if we vary our forecast’s range. We define the moving averages ‘f’, ‘m’, ’s’, ’ss’ to mean fast, medium, slow
and super slow, and our alpha computation is as follows:
self.dfs[inst]["smaf"] = self.dfs[inst]["adj_close"].rolling(10).mean()
self.dfs[inst]["smam"] = self.dfs[inst]["adj_close"].rolling(30).mean()
self.dfs[inst]["smas"] = self.dfs[inst]["adj_close"].rolling(100).mean()
self.dfs[inst]["smass"] = self.dfs[inst]["adj_close"].rolling(300).mean()
self.dfs[inst]["alphas"] = 0.0 + \
    (self.dfs[inst]["smaf"] > self.dfs[inst]["smam"]) + \
    (self.dfs[inst]["smaf"] > self.dfs[inst]["smas"]) + \
    (self.dfs[inst]["smaf"] > self.dfs[inst]["smass"]) + \
    (self.dfs[inst]["smam"] > self.dfs[inst]["smas"]) + \
    (self.dfs[inst]["smam"] > self.dfs[inst]["smass"])
where the 0.0 casts the boolean pandas Series objects into float values. Our forecast values are all non-negative,
such that we only go long assets, and there are no cut-offs - we simply trade larger sizes for assets with larger
forecasts and smaller sizes for those with smaller forecasts.
We also add a new component to the risk management - a crash-risk filter - which prevents us from trading when
even ‘pigs are flying’. That is, we only want to trade our proposed positions if the market is not euphoric, and we
do this by setting a filter that says ‘only trade when less than 60% of the assets in our universe possess short-term momentum’.
num_trend = 0
for inst in eligibles:
    num_trend = num_trend + 1 \
        if self.dfs[inst].loc[date, "smaf"] > self.dfs[inst].loc[date, "smam"] \
        else num_trend

if len(eligibles) > 0 and num_trend / len(eligibles) > 0.60:
    for inst in eligibles:
        portfolio_df.loc[i, "{} units".format(inst)] = 0
        portfolio_df.loc[i, "{} w".format(inst)] = 0
    nominal_tot = 0
Note that this could be some other risk controller, such as a cap on leverage or whatever else the trader
can dream of. The test script would look something like this:
1 import numpy as np
2 import pandas as pd
3

11 class Alpha():
12
13     def __init__(self, instruments, dfs, configs):
14         self.instruments = instruments
15         self.dfs = dfs
16         self.configs = configs
17
18     def get_trade_datetime_range(self):
19         return (self.configs["start"], self.configs["end"])
20

33             self.dfs[inst]["smaf"] = self.dfs[inst]["adj_close"].rolling(10).mean()
34             self.dfs[inst]["smam"] = self.dfs[inst]["adj_close"].rolling(30).mean()
35             self.dfs[inst]["smas"] = self.dfs[inst]["adj_close"].rolling(100).mean()
36             self.dfs[inst]["smass"] = self.dfs[inst]["adj_close"].rolling(300).mean()
37             self.dfs[inst]["alphas"] = 0.0 + \
38                 (self.dfs[inst]["smaf"] > self.dfs[inst]["smam"]) + \
39                 (self.dfs[inst]["smaf"] > self.dfs[inst]["smas"]) + \
40                 (self.dfs[inst]["smaf"] > self.dfs[inst]["smass"]) + \
41                 (self.dfs[inst]["smam"] > self.dfs[inst]["smas"]) + \
42                 (self.dfs[inst]["smam"] > self.dfs[inst]["smass"])
43

52     def init_portfolio_settings(self, trade_range):
53         portfolio_df = pd.DataFrame(index=trade_range).reset_index().rename(columns={"index": "datetime"})
54         portfolio_df.loc[0, "capital"] = 10000
55         portfolio_df.loc[0, "ewma"] = 0.001
56         portfolio_df.loc[0, "ewstrat"] = 1
57         return portfolio_df
58

79         """
80         Compute Metas
81         """
82         self.compute_metas(index=trade_datetime_range)
83
84         """
85         Initialisations
86         """
87         portfolio_df = self.init_portfolio_settings(trade_range=trade_datetime_range)
88
92

95             if i != 0:
96                 date_prev = portfolio_df.loc[i - 1, "datetime"]
97                 day_pnl, nominal_ret = backtest_utils.get_pnl_stats(
98                     date=date,
99                     prev=date_prev,
100                     portfolio_df=portfolio_df,
101                     instruments=self.instruments,
102                     idx=i,
103                     historicals=self.dfs,
104                     close_col="adj_close"
105                 )
106

115             portfolio_df.loc[i, "ewstrat"] = 0.06 * strat_scalar + 0.94 * portfolio_df.loc[i - 1, "ewstrat"] \
116                 if nominal_ret != 0 else portfolio_df.loc[i - 1, "ewstrat"]
117

120             forecasts = {}
121             for inst in eligibles:
122                 forecasts[inst] = self.dfs[inst].loc[date, "alphas"]
123

128             nominal_tot = 0
129             for inst in eligibles:
130                 forecast = forecasts[inst]
131                 vol_target = 1 / (len(eligibles)) \
132                     * portfolio_df.loc[i, "capital"] \
133                     * portfolio_vol \
134                     / np.sqrt(253)
135                 position = strat_scalar * forecast * vol_target * self.dfs[inst].loc[date, "loginvvol"] / backtest_utils.unit_base_value(instrument=inst, dfs=self.dfs, date=date)
136                 portfolio_df.loc[i, inst + " units"] = position
137                 nominal_tot += abs(position * backtest_utils.unit_base_value(instrument=inst, dfs=self.dfs, date=date))
138

145             num_trend = 0
146             for inst in eligibles:
147                 num_trend = num_trend + 1 if self.dfs[inst].loc[date, "smaf"] > self.dfs[inst].loc[date, "smam"] else num_trend
148             if len(eligibles) > 0 and num_trend / len(eligibles) > 0.60:
149                 for inst in eligibles:
150                     portfolio_df.loc[i, "{} units".format(inst)] = 0
151                     portfolio_df.loc[i, "{} w".format(inst)] = 0
152                 nominal_tot = 0
153

172         }
173     )
174     portfolio_df = alpha.run_simulation()
Note that we ‘roll forward’ volatility estimates when our strategy takes no trades after filtering. This
prevents the volatility estimate from ‘decaying’ and causing unnecessarily large positions when no trades have been
taken for a long time. This is akin to risk management in event-driven trades. Notably, we would not
achieve the annualized target volatility measured over calendar days, but we would achieve it when measured over the
number of actively traded days. Accordingly, the Sharpe computations filter out inactive days, since
we assume no risk on those days.
If you have understood the three strategies, we are now done! Congratulations - we will now focus on
improving this set of grossly inefficient scripts by adopting better design choices and analysing their CPU
loads.
4 Benchmarking
Before we make any rash decisions and start rewriting chunks of code, it may first be wise to ensure that we
have a way of checking that the rewritten code performs the same logic as our initial proposal - and that it
is in fact getting better without introducing errors. At this point, you MUST have some form of version control. Using
git, committing (or tagging) a snapshot would create a version of our current code, and we can later reference that
version by checking it out (a sketch follows below)
- do remember that the tools provided by git are tremendously useful to any programmer - we urge readers
who have not yet adopted version control in their development to check it out.
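The exact commands are not reproduced in this extract; the workflow meant here is presumably along these lines (the tag name v0-baseline is an arbitrary choice):

Command>>
git add -A && git commit -m "baseline before optimization"
git tag v0-baseline
# ...later, to revisit or compare against that version:
git checkout v0-baseline
git diff v0-baseline -- strat3.py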
You should also adopt some form of testing to make sure that the system performs as expected. Tests can be
implemented in a very granular fashion, such as unit testing, or as higher-level, lower-overhead (but
reduced-safety) integration testing. Here I again urge readers to check out the literature on software testing,
and arrive at an approach that suits your profile. You can even make testing the controller of your development cycle,
as in TDD (test-driven development).
We do a super light-weight (virtually non-existent) integration test here: we verify that the
modifications performed by the iterative improvements are correct by comparing the terminal
wealth of our backtests pre and post change (note that this should agree to at least a fair number of
decimal places).
In order to compare the efficiency of their CPU usage, we also time our functions - we can create a timer in
dirtylib as follows:
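The timer itself is not shown in this extract; a minimal sketch consistent with how it is used below (the wrapper name measure_time matches the reference in the profiling section; the print format is an assumption):

import time
from functools import wraps

def timeme(func):
    # Print the wall-clock time taken by each call to the wrapped function.
    @wraps(func)
    def measure_time(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        print(func.__name__, "took", time.time() - start, "seconds")
        return result
    return measure_time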
We import this into the stratx.py files and add the simple decorator as follows (here is a great writeup
on decorators: link, Real Python):
1 @timeme
2 def run_simulation(self):
3     """
4     Settings
5     """
6     portfolio_vol = 0.20
7     trade_datetime_range = pd.date_range(
8         start=self.get_trade_datetime_range()[0],
9         end=self.get_trade_datetime_range()[1],
10         freq="D"
11     )
12     ...
13     ...
and we should get the time taken by run_simulation printed every time we run our script.
The runtime depends on the size of the asset universe and the number of assets under consideration
for trading each day. Let’s go ahead and add the decorators to our three strategies, fix their configurations
and run the scripts again to get their terminal wealth.
First, we give a preview of our results and what we will be rewarded with in the remainder of our work (runtimes in seconds):

Version     Strat 1   Strat 2   Strat 3
Original    116       -         81
Improved    10        -         5
4.3 Strat1
Note that we have already changed strat1's strat_scalar to the generalised version, taking into consideration both
the exponentially weighted historical strategy scalars and the volatility of leveraged portfolio returns.
Terminal Wealth:
10269.606843
When we are profiling, to save time we will use a reduced problem set with the following configurations:
Terminal Wealth:
3179.454924
4.4 Strat2
"shortsize": 20
}
)
Terminal Wealth:
43342.458658
4.5 Strat3
As with strategy 1, to save time we use the reduced problem set in our code profiling:
}
)
Terminal Wealth:
19242.269661
We want to modify our code structure so that it is easier to maintain. A noticeable feature of our trading scripts
is that they share a lot of common functionality - they mostly differ in the computation and management
of alpha signals; the rest is the same. This is well suited to a Parent-Child relationship in Object-Oriented
Programming. We refer the details to other literature, and show a simple example for demonstration.
Here we consider a simple example: we have a parent who, when he meets someone, goes ‘yo’. His
son is cooler, so the son goes ‘yo wassup’ instead.
1 class Parent () :
2
6 def yo ( self ) :
7 print ( " yo " )
8
17 def yo ( self ) :
18 print ( " yo wassup " )
19
20 son = Son ()
21 son . yo () # yo wassup
22 son . testing () # yo wassup
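The elided lines of the listing are not shown in this extract; a minimal runnable reconstruction consistent with the printed comments is given below (defining testing on Parent so that it calls self.yo() is an assumption, but it is the usual way to obtain the ‘yo wassup’ output via method overriding):

class Parent():

    def yo(self):
        print("yo")

    def testing(self):
        # assumed: defined only on Parent; self.yo() dispatches to the child's override
        self.yo()

class Son(Parent):

    def yo(self):
        print("yo wassup")

son = Son()
son.yo()        # yo wassup
son.testing()   # yo wassup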
We see how the call trace works in this example: super() refers to the parent class Parent, and the child
Son overrides (see the literature on method overriding) the function ‘yo’, so when calling ‘yo’ on a
Son object we use Son’s implementation. For methods that are not overridden, we use the parent implementation,
found by walking up the inherited classes and searching for the implemented functionality. We shall restructure our
code to ‘collect’ similar features in a generic parent Alpha class to make our code more manageable. Our
parent class should have enough flexibility to encompass the logic of all three strategies. For instance, since
we now need to accommodate both cross-sectional strategies trading a fixed number of longs/shorts and time-series
strategies with a dynamic number of traded names, we should introduce a variable num_trading (the number of names
currently traded) to divide our portfolio capital.
Additionally, like we did in our simple example, we want to implement ‘default’ functionality in the
parent class.
1 import numpy as np
2 import pandas as pd
3

11 class Alpha():
12

19     def get_trade_datetime_range(self):
20         return (self.configs["start"], self.configs["end"])
21

29             self.dfs[inst]["vol"] = (-1 + self.dfs[inst]["adj_close"] / self.dfs[inst].shift(1)["adj_close"]).rolling(30).std()
30             self.dfs[inst] = df.join(self.dfs[inst])
31             self.dfs[inst] = self.dfs[inst].fillna(method="ffill").fillna(method="bfill")
32             self.dfs[inst]["ret"] = -1 + \
33                 self.dfs[inst]["adj_close"] / self.dfs[inst].shift(1)["adj_close"]
34             self.dfs[inst]["sampled"] = self.dfs[inst]["adj_close"] \
35                 != self.dfs[inst].shift(1)["adj_close"]
36             self.dfs[inst]["active"] = self.dfs[inst]["sampled"] \
37                 .rolling(5) \
38                 .apply(lambda x: int(np.any(x))) \
39                 .fillna(0)
40         pass
41

68             vol_target = 1 / num_trading \
69                 * self.portfolio_df.loc[idx, "capital"] \
70                 * portfolio_vol \
71                 / np.sqrt(253)
72             position = strat_scalar * \
73                 forecast * \
74                 vol_target * \
75                 self.dfs[inst].loc[date, "invrisk"] / \
76                 backtest_utils.unit_base_value(instrument=inst, dfs=self.dfs, date=date)
77             self.portfolio_df.loc[idx, inst + " units"] = position
78             nominal_tot += abs(position * backtest_utils.unit_base_value(instrument=inst, dfs=self.dfs, date=date))
79         return nominal_tot
80

88     @timeme
89     def run_simulation(self):
90         """
91         Settings
92         """
93         portfolio_vol = 0.20
94         trade_datetime_range = pd.date_range(
95             start=self.get_trade_datetime_range()[0],
96             end=self.get_trade_datetime_range()[1],
97             freq="D"
98         )
99
100         """
101         Compute Metas
102         """
103         self.compute_metas(index=trade_datetime_range)
104
105         """
106         Initialisations
107         """
108         self.portfolio_df = self.init_portfolio_settings(trade_range=trade_datetime_range)
109

116             if i != 0:
117                 date_prev = self.portfolio_df.loc[i - 1, "datetime"]
118                 day_pnl, nominal_ret = backtest_utils.get_pnl_stats(
119                     date=date,
120                     prev=date_prev,
121                     portfolio_df=self.portfolio_df,
122                     instruments=self.instruments,
123                     idx=i,
124                     historicals=self.dfs,
125                     close_col="adj_close"
126                 )
127

152                 index=i, date=date, nominal_tot=nominal_tot)
153
Essentially, we have ‘factored out’ common functionality and generalised some features. Our Alpha
class asks that any child inheriting it implements the eligibility check and the inverse
risk estimate for each asset. Additionally, the child needs to implement the compute_forecasts functionality, which
is the alpha signal generator for that strategy. If the strategy requires additional information to
calculate the alpha signals, we can do the calculation in the compute_metas method and assign it to
object variables. We implement default post-risk management, which is to do nothing; unique risk
management is achieved by overriding the parent’s method.
5.2 Strat1
Our strategy implementations are now significantly simplified. We show them as follows:
1 import numpy as np
2 import pandas as pd
3

20             .rolling(61) \
21             .apply(lambda x: ~np.any(np.isnan(x))) \
22             .fillna(0)
23         super().compute_metas(index)
24         retagg = pd.DataFrame(index=index)
25         for inst in self.instruments:
26             print(inst)
27             self.dfs[inst]["invrisk"] = 1 / self.dfs[inst]["vol"]
28             self.dfs[inst]["eligible"] = \
29                 self.dfs[inst]["alive"].astype(int) & \
30                 self.dfs[inst]["active"].astype(int) & \
31                 (self.dfs[inst]["vol"] > 0).astype(int) & \
32                 (self.dfs[inst]["adj_close"] > 0).astype(int)
33

54         alpha_scores = {
55             k: v for k, v in sorted(alpha_scores.items(), key=lambda pair: pair[1])
56         }
57         alphalong = list(alpha_scores.keys())[-self.configs["longsize"]:]
58         alphashort = list(alpha_scores.keys())[:self.configs["shortsize"]]
59
60         forecasts = {}
61         for inst in eligibles:
62             forecast = 1 if inst in alphalong else 0
63             forecast = -1 if inst in alphashort else forecast
64             forecasts[inst] = forecast
65
66         return forecasts, self.configs["longsize"] + self.configs["shortsize"]
This is much cleaner than before - if we want to create a new strategy, we do not have to repeat the same
∼ 150 or so lines of code. We just need to implement the two pieces of functionality that are characteristic of the
strategy itself.
5.3 Strat2
1 import numpy as np
2 import pandas as pd
3

29             & (self.dfs[inst]["adj_close"] > 0) \
30             & (self.dfs[inst]["ret"].shift(-1) < 0.30) \
31             & (self.dfs[inst]["ret"].shift(-1) > -0.30)
32

38         alpha_scores = {
39             k: v for k, v in sorted(alpha_scores.items(), key=lambda pair: pair[1])
40         }
41         alphalong = list(alpha_scores.keys())[-self.configs["longsize"]:]
42         alphashort = list(alpha_scores.keys())[:self.configs["shortsize"]]
43
44         forecasts = {}
45         for inst in eligibles:
46             forecast = 1 if inst in alphalong else 0
47             forecast = -1 if inst in alphashort else forecast
48             forecasts[inst] = forecast
49
50         return forecasts, self.configs["longsize"] + self.configs["shortsize"]
5.4 Strat3
And now suppose we want unique behavior in risk management (such as the crash-risk-factor filtering); we
can override the parent method, which will then be called from the parent’s run_simulation function instead
of the default implementation.
1 import numpy as np
2 import pandas as pd
3

56             self.portfolio_df.loc[index, "{} units".format(inst)] = 0
57             self.portfolio_df.loc[index, "{} w".format(inst)] = 0
58         nominal_tot = 0
59         return nominal_tot
Now, this really helps to make optimizing our code easier - optimizations that we make to the Alpha class
are applied to all of our strategies, and we do not need to go into the individual files to make the changes.
Additionally, since the functionality in the Alpha class is the most commonly called, perhaps it is wise to focus
our optimizations on making the Alpha class CPU-efficient first?
Now that we have cleaned up the code, we can verify that we have not ‘disturbed’ the logic by running
the simulations again and checking that we end up with the same terminal wealth. We did this, and we are all good to
go. Let’s introduce another toolset: suppose you have generated two output files, say ‘strat1_prechange.csv’
and ’strat1_postchange.csv’. You can test whether the output files generated by the new code differ
by running the following in the terminal (fc on Windows instead of diff):
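The exact invocation is not shown; it is presumably along the lines of:

Command>>
diff strat1_prechange.csv strat1_postchange.csv

(or fc strat1_prechange.csv strat1_postchange.csv on Windows)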
and this should give no outputs - meaning each line is the same!
There are two versions of reality - one that is fake (the one that we perceive) and one that is real (the one
that is). We may hypothesize about which lines or logic in our code cause the inefficiencies, but
we will not know for real what is actually making the code inefficient without quantitative probes.
We are, after all, systematic traders. We want to systematically improve our code efficiency too, hierarchically
from the top-level view down to the granular. We also want to ‘attack’ the biggest pain points, so
that we get the most bang for our buck in terms of effort relative to efficiency gain.
In this section, we will demonstrate some effective tools for taking a deeper look at our code. We
arbitrarily choose Strat3 to start profiling and reducing efficiency bottlenecks.
To get a general view of the run-time, we can use the unix time command:
Command>>
/usr/bin/time --verbose -o temp.txt python3 strat3.py
Out>>
Command being timed: "python3 strat3.py"
User time (seconds): 82.18
System time (seconds): 1.57
Percent of CPU this job got: 101%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:22.29
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 204588
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 54753
Voluntary context switches: 156
Involuntary context switches: 57484
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Minor page faults are caused when a program accesses new memory pages. Memory allocation by the kernel is lazy,
so the first access to a page forces the kernel to pause our logic to map it in. This can slow a
program down substantially, since we also lose the data held in the CPU caches and registers (we discuss this later!).
We see no major page faults, and the CPU was working (albeit inefficiently) at maximum. Note that the
’-o’ flag specifies the output file the statistics are written to. We can use the statistics generated by this
command to gauge how well our program is using the CPU, whether we are idling on I/O, whether we are doing
costly disk seeks, and whether we are using our cores efficiently. Readers can look up what each of these statistics means,
but we will move on to another tool that allows us to profile our function calls’ resource usage.
6.1 Profiling, Iterating, Rewriting...
Command:
python3 -m cProfile -o temp.txt strat3.py
...
import pstats
p = pstats.Stats("temp.txt")
p.strip_dirs().sort_stats("cumulative").print_stats(20)
We use the cProfile tool, a profiler for our Python programs. We write the output to the temp.txt file. After
generating the statistics, we use the pstats module to parse the statistics file and generate useful results. We
want to see the 20 functions in our script that took the most time.
ncalls   tottime  percall  cumtime  percall  filename:lineno(function)
7306     0.023    0.000    25.906   0.004    alpha.py:51(compute_eligibles)
1        0.004    0.004    25.514   25.514   strat3.py:18(compute_metas)
1        0.003    0.003    25.214   25.214   alpha.py:23(compute_metas)
180      0.001    0.000    25.016   0.139    rolling.py:529(_apply)
180      0.001    0.000    25.014   0.139    rolling.py:434(_apply_blockwise)
180      0.001    0.000    25.013   0.139    rolling.py:415(_apply_series)
The first four lines of the output can be ignored - they are just the entry point to our call on run_simulation:
the measure_time wrapper wraps our simulation function, and the rest are built-in Python functions invoked when
kicking off our process.
Interestingly, we see that a total of 101.260 seconds was spent inside the __getitem__ function in pandas'
indexing.py. Even though each call rounds to 0.000 seconds, we repeat it 1765412 times. This is
followed by the __setitem__ function - what could possibly be causing this?
p.print_callers(20, "__getitem__")
and we get
100588 0.524 5.799 strat3.py:42(compute_forecasts)
201176 1.033 11.272 strat3.py:48(post_risk_management)
We see which parts of our logic were calling these functions (ignore the line annotations; they do
not correspond to the line labelling in our code sections). We see that most of the time was spent inside the
unit_base_value function in backtest_utils.py. Let’s recall its implementation:
There should not be anything particularly taxing about the logic here, but it turns out that the
dataframe.loc[a, b] operation is expensive!
Referring to the pandas documentation [2], we should access single scalar values with the ‘.at’ accessor instead of
‘.loc’, which is meant for rows and slices. For more interested readers, we again refer to other literature; here is an
interesting discussion: (link: stackoverflow). Now, let’s go ahead and change all the ‘.loc’ operations where we
access or write scalars to the ‘.at’ accessor. Rerunning the operation, we get:
down from 81.3337390422821 seconds! That’s almost half of the runtime shaved off by a quick fix found from
profiling our code. Let’s rerun the profiling with the commands:
&&
import pstats
p = pstats.Stats("temp.txt")
p.strip_dirs().sort_stats("cumulative").print_stats(20)
we see that the cumulative time spent accessing dataframe cell values has decreased significantly. Before
we do our victory lap - there is much more work to be done.
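To make the swap concrete, here is a minimal, self-contained sketch of the scalar-access change described above (illustrative only; absolute timings will vary by machine):

import time
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(3000, 5), columns=list("abcde"))

start = time.time()
for i in range(len(df)):
    _ = df.loc[i, "c"]   # label-based scalar access through .loc
loc_time = time.time() - start

start = time.time()
for i in range(len(df)):
    _ = df.at[i, "c"]    # .at is optimised for single scalar access
at_time = time.time() - start

print(".loc: %.3fs, .at: %.3fs" % (loc_time, at_time))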
6.1.3 Note on Profilers
We will note that as we add increasingly granular profilers, our timer on run_simulation will report longer
and longer runtimes. This is because it also measures the time taken for the profiler itself to
do its work - we can remove the profilers later and use the original benchmark to check that we are going in the right
direction!
For more on graphical visualisation of cProfile’s statistics file, see snakeviz. (link : snakeviz)
The previous table, if anything, demonstrated that by switching a scalar access that is called many times
to a more efficient method, we can reduce our runtimes - but the next target on the table lies in the
compute_metas function. In fact, we see that strat3's compute_metas took 18.576 seconds, but that was
only marginally greater than the time taken in the compute_metas of alpha.py, which suggests that
most of the bottleneck sits in the parent compute_metas function.
Let’s verify this. Luckily, as we mentioned, Python has an amazing community of developers; one tool,
line_profiler, written by Robert Kern, allows us to profile each individual line. Let’s install the line_profiler module with:
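(The install command itself is not shown in this extract; it is presumably the usual pip invocation.)

Command>>
pip install line_profiler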
In order to tell line_profiler which function to profile, we decorate it with ‘@profile’. It should look
as simple as this:
1 @profile
2 def compute_metas ( self , index ) :
3 super () . compute_metas ( index )
4 for inst in self . instruments :
5 print ( inst )
6 self . dfs [ inst ][ " invrisk " ] = np . log (1 / self . dfs [ inst ][ " vol " ]) / np . log (1.3)
7 self . dfs [ inst ][ " smaf " ] = self . dfs [ inst ][ " adj_close " ]. rolling (10) . mean ()
8 self . dfs [ inst ][ " smam " ] = self . dfs [ inst ][ " adj_close " ]. rolling (30) . mean ()
9 self . dfs [ inst ][ " smas " ] = self . dfs [ inst ][ " adj_close " ]. rolling (100) . mean ()
10 self . dfs [ inst ][ " smass " ] = self . dfs [ inst ][ " adj_close " ]. rolling (300) . mean ()
11 self . dfs [ inst ][ " alphas " ] = 0.0 + \
12 ( self . dfs [ inst ][ " smaf " ] > self . dfs [ inst ][ " smam " ]) + \
13 ( self . dfs [ inst ][ " smaf " ] > self . dfs [ inst ][ " smas " ]) + \
14 ( self . dfs [ inst ][ " smaf " ] > self . dfs [ inst ][ " smass " ]) + \
15 ( self . dfs [ inst ][ " smam " ] > self . dfs [ inst ][ " smas " ]) + \
16 ( self . dfs [ inst ][ " smam " ] > self . dfs [ inst ][ " smass " ])
17
18 self . dfs [ inst ][ " eligible " ] = \
19 (~ np . isnan ( self . dfs [ inst ][ " smass " ]) ) \
20 & self . dfs [ inst ][ " active " ] \
21 & ( self . dfs [ inst ][ " vol " ] > 0) \
22 & ( self . dfs [ inst ][ " adj_close " ] > 0) \
23 & ( self . dfs [ inst ][ " ret " ]. shift ( -1) < 0.30) \
24 & ( self . dfs [ inst ][ " ret " ]. shift ( -1) > -0.30)
When running the script, we invoke it through kernprof, where ‘-l’ is the line-by-line flag and ‘-v’ is a
verbosity flag for printing to console:
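(The command line is not reproduced here; given the flags described, it is presumably:)

Command>>
kernprof -l -v strat3.py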
30 30 8550.0 285.0 0.0 (self.dfs[inst]["smaf"] > self.dfs[inst
31 30 8152.0 271.7 0.0 (self.dfs[inst]["smaf"] > self.dfs[inst
32 30 6436.0 214.5 0.0 (self.dfs[inst]["smam"] > self.dfs[inst
33 30 6382.0 212.7 0.0 (self.dfs[inst]["smam"] > self.dfs[inst
34
35 30 16898.0 563.3 0.1 self.dfs[inst]["eligible"] = \
36 180 36314.0 201.7 0.2 (~np.isnan(self.dfs[inst]["smass"])) \
37 30 2552.0 85.1 0.0 & self.dfs[inst]["active"] \
38 30 6456.0 215.2 0.0 & (self.dfs[inst]["vol"] > 0) \
39 30 6031.0 201.0 0.0 & (self.dfs[inst]["adj_close"] > 0) \
40 30 12133.0 404.4 0.1 & (self.dfs[inst]["ret"].shift(-1) < 0.
41 30 10065.0 335.5 0.0 & (self.dfs[inst]["ret"].shift(-1) > -0
Note that we have intentionally let the output run off the page margins - you already know how these lines
are implemented anyway. We verify that 98% of the time was in fact spent in the parent compute_metas, so let’s
profile that instead by moving the @profile decorator to the parent file. We get the following output:
33 30 22129.0 737.6 0.1 self.dfs[inst] = self.dfs[inst].fillna(method="
34 30 40410.0 1347.0 0.2 self.dfs[inst]["ret"] = -1 + self.dfs[inst]["ad
35 30 35012.0 1167.1 0.2 self.dfs[inst]["sampled"] = self.dfs[inst]["adj
36 30 23116166.0 770538.9 99.1 self.dfs[inst]["active"] = self.dfs[inst]["samp
37 1 1.0 1.0 0.0 pass
It turns out that 99% of the time in this function was spent on the line

self.dfs[inst]["active"] = self.dfs[inst]["sampled"] \
    .rolling(5) \
    .apply(lambda x: int(np.any(x))) \
    .fillna(0)
That's interesting - we might not have guessed that just by eyeballing the code, but the quantitative probe
makes it obvious what slowed us down. Had we spent hours making all the other lines virtually 'instantaneous',
we would have reduced the run time by at most 1%!
For more advanced programmers: we can inspect the bytecode instructions the Python interpreter executes by
disassembling functions with the dis module. A good rule of thumb is that shorter bytecode corresponds to
faster runtimes. Let's take a simple look:
import numpy as np
import dis
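The function bodies being disassembled were not reproduced above; judging from the bytecode that follows, they would have looked roughly like this sketch (function names are illustrative):

def sum_inplace(arr):
    sum = 0
    for i in arr:
        sum += i          # compiles to INPLACE_ADD

def sum_binary(arr):
    sum = 0
    for i in arr:
        sum = sum + i     # compiles to BINARY_ADD

def sum_numpy(arr):
    sum = np.sum(arr)     # a single method call into numpy's compiled routine

dis.dis(sum_inplace)
dis.dis(sum_binary)
dis.dis(sum_numpy)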
13 0 LOAD_CONST 1 (0)
2 STORE_FAST 1 (sum)
14 4 LOAD_FAST 0 (arr)
6 GET_ITER
>> 8 FOR_ITER 12 (to 22)
10 STORE_FAST 2 (i)
15 12 LOAD_FAST 1 (sum)
14 LOAD_FAST 2 (i)
16 INPLACE_ADD
18 STORE_FAST 1 (sum)
20 JUMP_ABSOLUTE 8
>> 22 LOAD_CONST 0 (None)
24 RETURN_VALUE
------------------------------------------------------
18 0 LOAD_CONST 1 (0)
2 STORE_FAST 1 (sum)
19 4 LOAD_FAST 0 (arr)
6 GET_ITER
>> 8 FOR_ITER 12 (to 22)
10 STORE_FAST 2 (i)
20 12 LOAD_FAST 1 (sum)
14 LOAD_FAST 2 (i)
16 BINARY_ADD
18 STORE_FAST 1 (sum)
20 JUMP_ABSOLUTE 8
>> 22 LOAD_CONST 0 (None)
24 RETURN_VALUE
------------------------------------------------------
23 0 LOAD_GLOBAL 0 (np)
2 LOAD_METHOD 1 (sum)
4 LOAD_FAST 0 (arr)
6 CALL_METHOD 1
8 STORE_FAST 1 (sum)
10 LOAD_CONST 0 (None)
12 RETURN_VALUE
In other cases this is not as useful - the bytecodes are identical even though we know the two functions
behave differently at runtime:
import dis
import pandas as pd

def func1():
    df = pd.DataFrame()
    df.loc[0, "hi"]

def func2():
    df = pd.DataFrame()
    df.at[0, "hi"]
13 0 LOAD_GLOBAL 0 (pd)
2 LOAD_METHOD 1 (DataFrame)
4 CALL_METHOD 0
6 STORE_FAST 0 (df)
14 8 LOAD_FAST 0 (df)
10 LOAD_ATTR 2 (loc)
12 LOAD_CONST 1 ((0, ’hi’))
14 BINARY_SUBSCR
16 POP_TOP
18 LOAD_CONST 0 (None)
20 RETURN_VALUE
---------------------------------------------------------------
16 0 LOAD_GLOBAL 0 (pd)
2 LOAD_METHOD 1 (DataFrame)
4 CALL_METHOD 0
6 STORE_FAST 0 (df)
17 8 LOAD_FAST 0 (df)
10 LOAD_ATTR 2 (at)
12 LOAD_CONST 1 ((0, ’hi’))
14 BINARY_SUBSCR
16 POP_TOP
18 LOAD_CONST 0 (None)
20 RETURN_VALUE
Numba is a just-in-time (JIT) compiler designed to compile numpy-oriented code via the LLVM compiler at
runtime. Unlike 'C', this is done on the fly, so the compilation cycle is hidden from the programmer behind
simple decorators that signal to Numba that we want a function compiled. Ahead-of-time (AOT) compilers, as
opposed to JIT, compile our source code into binary machine code before runtime - a famous example is the gcc
compiler. A static binary is created to run on your machine and usually gives the best speedups, but every
code change requires recompiling to produce a new binary. This is quite a hassle! JIT compiles automatically
at runtime - meaning our code runs slower on the first pass due to compilation overhead - but requires little
to no effort on the developer's part. Note that numba decorators cannot just be slapped on for a 'quick win'!
It is important to think about where the speedup is coming from.
Numba allows us to do things like releasing the GIL and static type inference. This means that computations
with many intermediate variables (especially in loops!) benefit the most from compilation. We introduce
these concepts in the later section on computing and memory units, Section 8. Vectorized logic with few
intermediate variables does not benefit as much. The speedup of vectorized numpy code under numba can be
attributed to a number of things - memory contiguity, data bus throughput, vectorized (SIMD) instructions,
CPU caching, compilation and GIL unlocking. These are separate (but related) concepts that improve the
efficiency of our code, and we should be aware of when each kicks in.
Let's make a jitted function here, using the argument nopython=True to indicate that we want to compile in
nopython mode. Functions with this decorator throw an Exception when the code accesses Python objects such
as lists or dictionaries, as opposed to standard jitted functions, which silently fall back to Python object
mode when the types are incompatible with compilation. Note that we can use inspect_types to see the datatype
of each object if we cannot get Numba to compile our jitted code in nopython mode, as well as the error
messages thrown. We can use assert statements to check at runtime that the new functionality and the old
logic are equivalent, and we can later disable asserts at runtime using interpreter flags. Our demonstration is
just a peek into Numba's suite - we encourage readers to take a deeper dive on their own. Let's take a look:
23     # assert(self.dfs[inst]["sampled"].rolling(5).apply(numba_any, engine="numba", raw=True).equals(
24     #     self.dfs[inst]["sampled"].rolling(5).apply(lambda x: int(np.any(x)))))
25
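The jitted helper referenced in the commented assertion is not shown in full at this point; a minimal sketch consistent with its usage in the listings is:

import numpy as np

def numba_any(x):
    # x arrives as a raw numpy window because of raw=True; with engine="numba",
    # pandas JIT-compiles this function through Numba
    return int(np.any(x))

# usage, as in the listing:
# df["sampled"].rolling(5).apply(numba_any, engine="numba", raw=True).fillna(0)

For standalone use it can equivalently be decorated with @jit(nopython=True), as done in a later listing.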
Re-running the line profiler on the modified function (output truncated here), we see that the time spent in
the rolling call has dropped from 99% to 87.4% of the function's runtime, bringing us down from
42.09293055534363 seconds by making use of Numba's just-in-time compilation. Although it still takes up the
majority of the time in compute_metas, the rolling apply is an inherently expensive operation and we have
managed to speed it up considerably. Running cProfile again, we get the output
1258912 5.583 0.000 7.213 0.000 datetimes.py:700(_maybe_cast_for_get_loc)
361372 0.601 0.000 6.628 0.000 indexing.py:2277(__setitem__)
3653 0.360 0.000 6.376 0.002 alpha.py:87(set_weights)
7306 0.256 0.000 6.292 0.001 alpha.py:62(<listcomp>)
7306 0.232 0.000 5.960 0.001 alpha.py:63(<listcomp>)
361372 1.325 0.000 5.507 0.000 indexing.py:2228(__setitem__)
3652 0.339 0.000 5.338 0.001 backtest_utils.py:71(get_pnl_stats)
1261203 1.112 0.000 4.748 0.000 base.py:3585(get_loc)
The function compute_metas has now been relegated out of the most 'demanding' spots for our CPU attention -
we can shift our focus to speeding up other parts of the code logic. Let's rerun this on the original problem
set to make sure that our edits have been sound and effective: the terminal wealth matches, and the run time
compares favourably against the original 1249.9761579036713 seconds recorded in Section 4.5. We have obtained
roughly a 300% speedup.
Applying the same jit technique to our strategy 2 in the drawdown computations with:
self.dfs[inst]["dd"] = \
(1 + self.dfs[inst]["ret"]).rolling(window=12).apply(np.prod, raw=True, engine="numba") \
/ (1 + self.dfs[inst]["ret"]).rolling(window=12).apply(np.prod, raw=True, engine="numba").cummax() \
- 1
We have effectively shown how to first profile the overall CPU requirements of the different functions in a
script run, and then see which functions demand our attention. We may optimize these functions either by
making them run more efficiently or by calling them less often altogether. Next we see how we might address
the pain points by reducing function calls using vectorized operations.
8 Vector Operations, CPUs and Memory Fragmentation
Programming in Python carefully hides memory management from us. We do not need to know where the variable
x = 1 is stored in memory, how it is retrieved, or how it is edited. Is storing 1 and 1.0 any different?
Where is it stored? Let's briefly introduce the memory interactions - our script runs thanks to the CPU
performing mathematical operations on data retrieved from memory, which can be thought of as consisting of
RAM and the hard drive. There are data buses that transfer data from the memory units to the computing
units, but since transferring data over a slow bus is costly, the CPU itself has cache registers that store
data closer to the actual 'calculator' that does things like flip bits to give us mathematical outputs.
The computing unit's speed depends on how many instructions it can run per cycle (IPC) and on how many such
cycles run in a second (clock speed, measured in Hz). A single instruction can operate on a single data unit,
or on multiple data units at once (vectorization), without extra overhead! As transistor technology becomes
increasingly complex, advances in CPU design have become slower and slower (refer to the debate on Moore's
law for interested readers). These units run machine code instructions, which are assembled from assembly
code, which is in turn compiled from high-level languages such as 'C'. Languages can be interpreted and run
on virtual machines, like Python, take the full compilation life cycle, like 'C', or a mix of both, like
Java. Compilers and modern CPUs perform smart optimizations such as branch prediction, out-of-order execution
and prefetching data they predict our program will need into registers, so that it is already in cache when
needed.
There are non-trivial difficulties when multiple processes share the same memory units - cache registers
need to be mapped to the process they are serving, and two processes writing to the same memory on hardware
can corrupt each other. Some of this is handled by the kernel, and the rest needs to be handled by the code -
Python makes it 'safer' by using a global interpreter lock (GIL), which ensures that within a single Python
process only one thread executes in the interpreter at a time, regardless of the number of cores used. Each
thread competes for the GIL, and to deal with this we can either explicitly spawn more processes (so that
each process has its own GIL) or release the GIL altogether where it is safe to do so.
This is why threads in Python are only used for I/O-heavy tasks - when it is equivalently performant to do
'nothing' while we wait for a result. Threads are more lightweight concurrency tools than processes, but
threads do not have their own 'space'. Multiple threads running in the same process are GIL-bound and take
turns to execute!
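As a toy illustration of this rule of thumb (threads for I/O waits, processes for CPU-bound work) - a sketch, not taken from the backtester:

import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def io_task(_):
    # I/O-bound stand-in: the GIL is released while we sleep/wait
    time.sleep(0.1)
    return 1

def cpu_task(n):
    # CPU-bound stand-in: holds the GIL for the whole computation in pure Python
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=8) as pool:    # threads: fine for I/O waits
        list(pool.map(io_task, range(8)))
    with ProcessPoolExecutor(max_workers=4) as pool:   # processes: one GIL per worker
        list(pool.map(cpu_task, [10**6] * 4))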
8.2 Memory Units
The memory units (whether cache registers, RAM or hard disks) differ in how 'close' they are to the CPU and
in their read/write throughput. The speed also depends on how close or far apart the requested data is stored
in physical memory. Reading sequentially is significantly cheaper than jumping between memory addresses, and
this can be influenced by how the programmer decides to store data (for instance through our choice of data
types). Memory is typically arranged hierarchically, such that data we think we will need more often/soon is
placed closer to the CPU. Imagine you own a hundred pairs of shoes: you would place your favorite pair by the
door and dump the ugly ones in the closet - this reduces the search time when leaving the house. Accessing
different tiers of memory and transferring data around requires a data bus, and buses have different
throughputs based on their physical properties, such as bus width and bus frequency (similar in concept to
CPU IPC and Hz). Instructions for vectorized operations need all the data to be in the CPU when they are
executed, so a larger bus width favors vectorization when data is stored in contiguous memory blocks, while
bus frequency helps when we jump between memory addresses. Using data types such as numpy arrays, which
force our data to be kept close together, makes vectorized operations quicker and increases the underlying
memory transfer throughput. Although Python makes our life simpler by abstracting away lower-level details
such as garbage collection (freeing up memory), this means we effectively lose control over where our data
is stored and freed - and some of it is therefore stored sub-optimally. Data that we often need together can
become fragmented, which means we cannot transfer it to the CPU in 'one go'. These are the inherent issues
in native Python that we keep in mind and attempt to address in our work.
Note that on Linux we can profile our scripts and probe statistics on cycles, instructions and cache
hit/miss rates using tools such as perf. We will not go into the details here - more advanced programmers
can dive into this in other literature. These statistics give us invaluable insight into low-level behaviour:
how our application interacts with the CPU and memory.
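A hedged example of such a probe (Linux only, requires the perf tooling; available event names vary by CPU):

perf stat -e cycles,instructions,cache-references,cache-misses python3 strat1.py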
Python does not natively support vectorization - its lists store pointers/references to the data instead of
the data itself, so the data bus cannot transfer the required data units from memory in one go, and Python
bytecode (similar in spirit to assembly language) is not optimized for vectorization. While store-by-reference
can be useful, especially when working with dynamic datatypes whose memory requirements we know little about,
it is unnecessary when working with homogeneous data such as batch numerical computations! The underlying
hardware does perform some tricks to mitigate cache misses - such as pre-emptively moving data from RAM into
the cache when it predicts that data related to the current logic will be accessed. Think about increment
counters in a for loop! This is why grouping similar logic together, by simply refactoring our code, can
improve performance even if we perform the same instructions. We should also group overheads such as disk
interactions and memory allocations (C does this explicitly with malloc) away from our hot logic - these
disrupt the flow of the program and 'dirty' our cache registers.
There are workarounds to Python's native data structures - using modules such as array and numpy, we can ask
the OS to store data in contiguous chunks of memory. Numpy has underlying optimizations written in 'C' to
exploit vectorized instruction sets, with better memory management that gives the CPU the data it needs when
it needs it. Each vectorized numpy instruction performs more computation per instruction, and moving chunks
of data over the data bus to the CPU together reduces cache misses! We get a double benefit: memory
contiguity (leading to higher cache hit rates), which in turn enables vectorization (leading to fewer total
instructions required).
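A small illustration of the contrast - a sketch, with timings and item sizes that will vary by machine and build:

import timeit
import numpy as np

xs = list(range(1_000_000))   # a list of pointers to boxed Python int objects
arr = np.arange(1_000_000)    # one contiguous block of fixed-size integers

print(arr.flags["C_CONTIGUOUS"], arr.itemsize, arr.nbytes)  # True, 8, 8000000 on a typical 64-bit build
print(timeit.timeit(lambda: sum(xs), number=10))            # bytecode loop, one object at a time
print(timeit.timeit(lambda: arr.sum(), number=10))          # vectorized C loop over contiguous memory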
The previous profiling still tells us that operations such as __getitem__ take up a significant amount of
the overall CPU usage. This is because we make 1644983 calls to it through per-cell lookups such as '.at'
and '.loc'. Even though '.at' is relatively more efficient, we know that we often need the same data variable
for all the assets in the universe; we could fetch it more efficiently by pre-arranging the data into formats
that anticipate the access patterns.
Most of the difficulty in vectorising our code lies in the non-linear logic. Numpy has excellent support for
linear algebra and for handling matrices of numbers, but where conditional logic is needed, some trickery is
required. For instance, we want to set the weights and units held for non-eligible assets to 0. We do this
separately in our non-vectorized code, because dealing with non-eligible assets gives us a headache: including
assets with degenerate volatility would cause a division-by-zero error, and the corresponding position size
would be infinite; division by a non-existent null value would result in a NaN. We want to avoid these
annoyances in our signal calculation by filtering them out beforehand.
Here are some quirks of the numpy array and how it deals with invalid operations (arr1 here is some float array along the lines of np.array([1., 0., np.nan, 2.])):
>>> arr1 / 0
<stdin>:1: RuntimeWarning: divide by zero encountered in divide
<stdin>:1: RuntimeWarning: invalid value encountered in divide
array([inf, nan, nan, inf])
It turns out we can do something quite clever with this - if we have an array of flags marking whether each
asset in the universe is 'eligible' or 'not eligible' for consideration, we can use the boolean array as a
bit mask by converting it to 1s and 0s, and then multiply or divide our signal vector through by the mask.
After the operations, our vector contains the desired results at the indexes corresponding to eligible
instruments and NaN (or inf) values for the ineligible ones; a final pass casts the NaN and other invalid
positions to 0, which corresponds to zero units held in ineligible assets!
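A hedged sketch of the trick on toy arrays (variable names are illustrative; np.where over np.isfinite is one of several ways to perform the final cast, np.nan_to_num with explicit posinf/neginf arguments being another):

import numpy as np

eligible = np.array([True, False, True, False])
vol      = np.array([0.02, 0.00, 0.05, np.nan])    # degenerate vol on the ineligible names
signal   = np.array([1.0, 1.0, -1.0, 1.0])

mask = eligible.astype(float)                      # 1.0 for eligible, 0.0 for ineligible
raw  = signal / vol * mask                         # ineligible entries become nan/inf
positions = np.where(np.isfinite(raw), raw, 0.0)   # cast nan/inf back to zero units held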
Perhaps it is better to illustrate the changes at this point with a concrete code implementation. Let's
make some changes to our code based on the previous discussions:
(1.) Group administrative logic computations at the beginning of run_simulation and store them in homogeneous
dataframes for contiguous access.
(3.) Unzip the data needed for each day using iterrows(), iterating through the dataframes together by
unrolling them on a shared index (more on row access later!).
(4.) Maintain state for the previous units held, FX-adjusted closing prices and weight vectors, so that we do
not have to use '.loc' to access them when computing pnl.
1 import numpy as np
2 import pandas as pd
3 from numba import jit
4
12 class Alpha () :
13
20     def get_trade_datetime_range(self):
21         return (self.configs["start"], self.configs["end"])
22
34     df = pd.DataFrame(index=index)
35     self.dfs[inst]["vol"] = (-1 + self.dfs[inst]["adj_close"] / self.dfs[inst].shift(1)["adj_close"]).rolling(30).std()
36     self.dfs[inst] = df.join(self.dfs[inst])
37     self.dfs[inst] = self.dfs[inst].fillna(method="ffill").fillna(method="bfill")
38     self.dfs[inst]["ret"] = -1 + self.dfs[inst]["adj_close"] / self.dfs[inst].shift(1)["adj_close"]
39     self.dfs[inst]["sampled"] = self.dfs[inst]["adj_close"] != self.dfs[inst].shift(1)["adj_close"]
40
41     # assert(self.dfs[inst]["sampled"].rolling(5).apply(numba_any, engine="numba", raw=True).equals(
42     #     self.dfs[inst]["sampled"].rolling(5).apply(lambda x: int(np.any(x)))))
43     self.dfs[inst]["active"] = self.dfs[inst]["sampled"].rolling(5).apply(numba_any, engine="numba", raw=True).fillna(0)
44     vols.append(self.dfs[inst]["vol"])
45     rets.append(self.dfs[inst]["ret"])
46     actives.append(self.dfs[inst]["active"])
47     closes.append(self.dfs[inst]["adj_close"])
48
71 pass
72
92 def set_positions (
93 self ,
94 portfolio_i ,
95 forecasts ,
96 eligibles ,
97 num_trading ,
98 portfolio_vol ,
99 strat_scalar ,
100 invrisk_row ,
101 baseclose_row
102 ):
103 vol_target = 1 \
104 / num_trading \
105 * self . portfolio_df . at [ portfolio_i , " capital " ] \
106 * portfolio_vol \
107 / np . sqrt (253)
108 positions = eligibles \
109 * strat_scalar \
110 * vol_target \
111 * forecasts \
112 * invrisk_row . values \
113 / baseclose_row . values
114
124 @timeme
125 def run_simulation(self):
126     """
127     Settings
128     """
129     portfolio_vol = 0.20
130     trade_datetime_range = pd.date_range(
131         start=self.get_trade_datetime_range()[0],
132         end=self.get_trade_datetime_range()[1],
133         freq="D"
134     )
135
136     """
137     Compute Metas
138     """
139     self.compute_metas(index=trade_datetime_range)
140
141     """
142     Initialisations
143     """
144     self.portfolio_df = self.init_portfolio_settings(trade_range=trade_datetime_range)
145
157 ( ret_i , ret_row ) ,\
158 ( active_i , active_row ) ,\
159 ( baseclose_i , baseclose_row ) , \
160 ( eligibles_i , eligibles_row ) ,\
161 ( invrisk_i , invrisk_row ) ,\
162 in zip (\
163 self . portfolio_df . iterrows () ,\
164 self . voldf . iterrows () ,\
165 self . retdf . iterrows () ,\
166 self . activedf . iterrows () ,\
167 self . baseclosedf . iterrows () ,\
168 self . eligiblesdf . iterrows () ,\
169 self . invriskdf . iterrows () ,\
170 ):
171
172 strat_scalar = 2
173 eligibles = eligibles_row
174
175 if portfolio_i != 0:
176
199
Let's make the adjustments in our strategy files (note that compute_forecasts is still suboptimally written!):
1 import numpy as np
2 import pandas as pd
3 from numba import jit
4
23 eligibles = []
24 for inst in self . instruments :
25 self . dfs [ inst ][ " alive " ] = self . dfs [ inst ][ " adj_close " ]\
26 . rolling (61) . apply ( numba_notanynan , engine = " numba " , raw = True ) . fillna (0)
27 super () . compute_metas ( index )
28 retagg = pd . DataFrame ( index = index )
29 for inst in self . instruments :
30 print ( inst )
31 self . dfs [ inst ][ " eligible " ] = \
32 self . dfs [ inst ][ " alive " ]. astype ( int ) &\
33 self . dfs [ inst ][ " active " ]. astype ( int ) &\
34 ( self . dfs [ inst ][ " vol " ] > 0) . astype ( int ) &\
35 ( self . dfs [ inst ][ " adj_close " ] > 0) . astype ( int )
36
44 self . eligiblesdf = pd . concat ( eligibles , axis =1)
45 self . eligiblesdf . columns = self . instruments
46
52 alpha_scores = {}
53 temp = eligibles_row . values
54 for i in range ( len ( self . instruments ) ) :
55 if temp [ i ]:
56 date_start = self . portfolio_df . at [ portfolio_i - 60 , " datetime " ]
57 x = self . dfs [ " mkt " ][ " ret " ]. loc [ date_start : date ]. values . reshape (( -1 , 1) )
58 y = self . dfs [ self . instruments [ i ]][ " ret " ]. loc [ date_start : date ]
59         model = LinearRegression().fit(x, y)
60 ypred = model . predict ( x )
61 residuals = y - ypred
62 alpha_scores [ self . instruments [ i ]] = -1 * skew ( residuals )
63
retagg[inst] = self.dfs[inst]["ret"]
where we are creating a dataframe from partial results by appending each column on every iteration of the
loop.
1 import ...
2
7
46 return forecasts , self . configs [ " longsize " ] + self . configs [ " shortsize " ]
6 # @profile
7 def compute_metas ( self , index ) :
8 super () . compute_metas ( index )
9 alphas , eligibles , trenders = [] , [] , []
10 for inst in self . instruments :
11 print ( inst )
12 self . dfs [ inst ][ " smaf " ] = self . dfs [ inst ][ " adj_close " ]. rolling (10) . mean ()
13 self . dfs [ inst ][ " smam " ] = self . dfs [ inst ][ " adj_close " ]. rolling (30) . mean ()
14 self . dfs [ inst ][ " smas " ] = self . dfs [ inst ][ " adj_close " ]. rolling (100) . mean ()
15 self . dfs [ inst ][ " smass " ] = self . dfs [ inst ][ " adj_close " ]. rolling (300) . mean ()
16 self . dfs [ inst ][ " alphas " ] = 0.0 + \
17 ( self . dfs [ inst ][ " smaf " ] > self . dfs [ inst ][ " smam " ]) + \
18 ( self . dfs [ inst ][ " smaf " ] > self . dfs [ inst ][ " smas " ]) + \
19 ( self . dfs [ inst ][ " smaf " ] > self . dfs [ inst ][ " smass " ]) + \
20 ( self . dfs [ inst ][ " smam " ] > self . dfs [ inst ][ " smas " ]) + \
21 ( self . dfs [ inst ][ " smam " ] > self . dfs [ inst ][ " smass " ])
22
41 self . trendersdf . columns = self . instruments
42 self . eligiblesdf . astype ( ’ int8 ’)
43
Let's compare their run times pre- and post-vectorization (reduced problem set for strategies 1 and 2):
PRE
Strategy 1: @timeme: run_simulation took 115.60257959365845 seconds
Strategy 2: @timeme: run_simulation took 69.975656747818 seconds
Strategy 3: @timeme: run_simulation took 31.71785879135132 seconds
POST
Strategy 1: @timeme: run_simulation took 105.40305471420288 seconds
Strategy 2: @timeme: run_simulation took 10.909979820251465 seconds
Strategy 3: @timeme: run_simulation took 11.441979169845581 seconds
We see that vectorization and each of the incremental tricks give us an immense speedup, but for Strategy 1
our CPU efficiency seems 'stuck'.
Like numpy, the pandas data store also lives in contiguous memory blocks, making manipulation (relatively)
faster than with Python list objects, which store elements by reference. While it is built on top of numpy
arrays, pandas adds its own suite of functionality and some useful data structures, such as NaN bit masks
for null-aware integer/boolean arrays and so on. This is why our dataframe function calls do not complain
as much (with Exceptions, for instance) when we operate on invalid cells. There are some quirks - for
instance, introducing null elements into integer or boolean arrays promotes the Series to a float type, and
we can cast boolean arrays to integer arrays with tricks such as adding 0 to the boolean Series. There are
also other subtleties, such as how the data is stored in memory - for instance 'C'-like row-major formats
for numpy arrays and 'Fortran'-like column-major formats for some dataframes. We will not go into the gritty
details - pandas has done an excellent job of abstracting the user away from these lower-level
implementations. Interested readers should, of course, dive deeper: there are optimizations to be found in
understanding the pandas internals!
There are many ways to perform the same operation in pandas, but their efficiency depends on how the internal
logic is implemented. Let's consider a common operation: applying some function either to a row of data or to
a rolling column, reducing it to a scalar. The naive approach iterates explicitly and, in each iteration,
accesses the row using '.loc' (which we know to be expensive!). Note that numpy arrays need homogeneous
datatypes, so taking column slices returns a view (non-copy), while accessing rows of heterogeneous datatypes
gives us a pandas copy. This is time-consuming and very inefficient. We can be more Pythonic with 'iterrows',
which makes use of generators (generators also use less space!) and is both more compute- and memory-efficient -
although we still create Series copies, we no longer have to do lookups by indices; 'iterrows' signals that
we want to access the rows in sequence. A row-wise or rolling 'apply' does the job more efficiently still,
without creating intermediate Python variables (although Series objects are still copied), while passing the
raw values (raw=True, or dropping to '.values') indicates that we want to work directly on the underlying
numpy datatypes. Since we are then manipulating numpy, we can use compilation with numba, which gives us the
ability to use further tricks such as releasing the GIL by setting nogil in nopython compilation mode on
jitted functions:
or
if we cython-ized it.
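Since the individual snippets are not reproduced here, a compressed sketch of that ladder of options (illustrative column names, a toy row-wise reduction) might read:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.normal(0, 0.01, size=(1000, 2)), columns=["a", "b"])

# 1. '.loc' lookups inside a Python loop - expensive per-row indexing
out = [df.loc[i, "a"] + df.loc[i, "b"] for i in df.index]

# 2. iterrows - a generator of (index, Series) pairs, no index lookups
out = [row["a"] + row["b"] for _, row in df.iterrows()]

# 3. apply - pandas drives the loop internally (Series copies are still made)
out = df.apply(lambda row: row["a"] + row["b"], axis=1)

# 4. raw numpy values - operate directly on the underlying ndarray
out = df.values.sum(axis=1)

# 5. numba - a compiled rolling reduction on raw windows (nogil can be enabled via engine_kwargs)
def window_mean(window):
    return window.mean()

rolled = df["a"].rolling(20).apply(window_mean, raw=True, engine="numba")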
COMMAND
python3 -m cProfile -o temp.txt strat1.py
&&
import pstats
p = pstats.Stats("temp.txt")
p.strip_dirs().sort_stats("cumulative").print_stats(20)
OUTPUT
134224183 function calls (132495513 primitive calls) in 143.278 seconds
146211 0.500 0.000 29.979 0.000 indexing.py:1169(_getitem_axis)
142558 0.564 0.000 27.751 0.000 indexing.py:1207(_get_slice_axis)
2230525/1675369 3.495 0.000 23.314 0.000 {built-in method numpy.core...
142558 0.894 0.000 22.972 0.000 base.py:495(_validate_data)
285116 3.511 0.000 22.730 0.000 validation.py:619(check_array)
97050 0.237 0.000 16.688 0.000 common.py:55(new_method)
96928 0.206 0.000 15.841 0.000 series.py:5637(_arith_method)
71279 0.237 0.000 15.778 0.000 validation.py:949(check_X_y)
96928 0.888 0.000 15.494 0.000 base.py:1286(_arith_method)
142558 0.508 0.000 15.431 0.000 datetimes.py:715(slice_indexer)
142558 0.308 0.000 14.695 0.000 base.py:6238(slice_indexer)
71279 1.238 0.000 14.345 0.000 _stats_py.py:968(skew)
142558 0.695 0.000 14.323 0.000 base.py:6441(slice_locs)
6 alpha_scores = {}
7 temp = eligibles_row . values
8 for i in range ( len ( self . instruments ) ) :
9 if temp [ i ]:
10 date_start = self . portfolio_df . at [ portfolio_i - 60 , " datetime " ]
11 x = self . dfs [ " mkt " ][ " ret " ]. loc [ date_start : date ]. values . reshape (( -1 , 1) )
12 y = self . dfs [ self . instruments [ i ]][ " ret " ]. loc [ date_start : date ]
13         model = LinearRegression().fit(x, y)
14 ypred = model . predict ( x )
15 residuals = y - ypred
16 alpha_scores [ self . instruments [ i ]] = -1 * skew ( residuals )
17
18 alpha_scores = { k : v for k , v in sorted ( alpha_scores . items () , key = lambda pair : pair [1]) }
19 alphalong = list ( alpha_scores . keys () ) [ - self . configs [ " longsize " ]:]
20 alphashort = list ( alpha_scores . keys () ) [: self . configs [ " shortsize " ]]
21 forecaster = lambda inst : 1 if inst in alphalong else ( -1 if inst in alphashort else 0)
22 forecasts = np . array ([ forecaster ( inst ) for inst in self . instruments ])
23 return forecasts , self . configs [ " longsize " ] + self . configs [ " shortsize " ]
Recall that it is better code design to separate the data-processing logic, due to CPU caches. Additionally,
'.loc' operations are inefficient compared to rolling '.apply' functions. Could we improve the design by
moving this into the meta computations? Recall also that we have made repetitive calls of the form
retagg[inst] = self.dfs[inst]["ret"]
building a dataframe from partial results. You might have received a warning from pandas that your dataframe
is 'highly fragmented'. It turns out that each such insertion writes into new memory containing the original
dataframe plus a new column! We repeatedly create new dataframe objects and ask the kernel for more memory
space - we should build up the pieces in a list and concatenate them in one go instead. We will fix that here.
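A hedged sketch of the difference, on made-up data (names are illustrative):

import numpy as np
import pandas as pd

index = pd.date_range("2015-01-01", periods=2500, freq="D")
instruments = [f"inst{i}" for i in range(50)]
rets = {inst: pd.Series(np.random.normal(0, 0.01, len(index)), index=index) for inst in instruments}

# fragmented: every column insertion may reallocate and copy the growing frame
retagg = pd.DataFrame(index=index)
for inst in instruments:
    retagg[inst] = rets[inst]

# better: collect the pieces and concatenate once
retagg = pd.concat([rets[inst].rename(inst) for inst in instruments], axis=1)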
Let's iterate and come up with successively better compute_metas functions. We need to unroll two dataframes
simultaneously to get our predictor and response time series for the regression - but this is not supported
by the pandas rolling toolset. We can use an extension to numpy called numpy_ext that allows us to roll over
multiple arrays together.
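A minimal sketch of rolling over two aligned arrays with numpy_ext (the rolling_apply call is the library's documented entry point; the regression inside the window is illustrative):

import numpy as np
from numpy_ext import rolling_apply

def window_beta(x, y):
    # slope of y on x over one rolling window, via the normal equations
    xc = x - x.mean()
    denom = (xc * xc).sum()
    return np.nan if denom == 0 else (xc * (y - y.mean())).sum() / denom

mkt_ret  = np.random.normal(0, 0.01, 500)
inst_ret = 0.8 * mkt_ret + np.random.normal(0, 0.005, 500)

betas = rolling_apply(window_beta, 60, mkt_ret, inst_ret)  # one value per 60-observation window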
21 eligibles . append ( self . dfs [ inst ][ " eligible " ])
22 returns . append ( self . dfs [ inst ][ " ret " ])
23
61
62         return strat1_compute_forecasts(
63 self . instruments ,
64 self . skewsdf . loc [ date ]. values ,
65 portfolio_i ,
66 eligibles_row . values . astype ( np . int32 ) ,
67 self . configs [ " longsize " ] ,
68 self . configs [ " shortsize " ])
Can we do better? Taking a look at its source code (link: scikit-learn Github source code), we see that
scikit-learn performs a significant amount of input validation before getting to the actual fit! If we can
guarantee that our input is correct - which we can do ourselves with lower overhead - we can skip the
validation drag.
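The paper's hand-rolled version is not reproduced in full here; a sketch of the idea - fitting the same single-factor regression with np.linalg.lstsq on inputs we have already validated ourselves - might look like:

import numpy as np

def fast_residuals(x, y):
    # x, y: 1-D numpy arrays of equal length, validated upstream by us
    A = np.vstack([x, np.ones(len(x))]).T    # design matrix [x, 1]
    (m, c), *_ = np.linalg.lstsq(A, y, rcond=None)
    yhat = m * x + c
    return y - yhat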
With scikit-learn's validation skipped, the run takes
@timeme: run_simulation took 27.103562355041504 seconds
Can we do better still? We can do the same with the skew computation - see the scipy documentation on its
skew implementation (link: scipy skew implementation):
12     m1 = np.mean(residuals)
13     m2 = 1 / len(residuals) * np.sum(np.power(residuals - m1, 2))
14     m3 = 1 / len(residuals) * np.sum(np.power(residuals - m1, 3))
15     if m2 == 0:
16         return np.nan
17     s = m3 / (m2 ** 1.5)
18
19     return s
takes
We have already done the hard work of removing 'jit'-unfriendly calls by creating our custom regression and
skew functions. There are many intermediate variables here, such as A, m, c, yhat..., which are dynamically
typed: the datatype a name refers to can change midway through the function's execution, making it difficult
for the VM to optimize at the machine level how our code runs. However, in our executions, each call of
these functions creates the same datatypes! We do not need Python's dynamic lookup machinery - we can compile
this with JIT in nopython mode, or with Cython. Note that the dynamic mechanism does a lot of work in the
background at runtime - for a simple operation such as incrementing an integer counter, it has to check
whether the add function is implemented and, if it is not, walk up the inheritance hierarchy looking for a
handler. This is stripped away with static typing, allowing us to do the computation on bits and machine
code instead of Python object (objectmode) manipulations.
3600/225 0.005 0.000 5.679 0.025 compiler_machinery.py:272(check)
59611 0.637 0.000 5.190 0.000 series.py:323(__init__)
20 0.040 0.002 4.897 0.245 numpy_ext.py:374(rolling_apply)
We see that compute_forecasts has dropped out of the top 20 most demanding tasks, replaced by compute_metas -
but the latter costs only 7.405 seconds, compared with the 124.664 seconds of our original version of
compute_forecasts. Now that our code is not (grossly) inefficient, let's revert to the full problem set on
strategy 1 and strategy 3 - our concern is how the problem scales, since that is what costs us the most
waiting time in the first place. We will optimize on the full problem set from here on...
PRE
Strategy 1: @timeme: run_simulation took 1337.7771863937378 seconds
Strategy 2: @timeme: run_simulation took 225.85177397727966 seconds
Strategy 3: @timeme: run_simulation took 1249.9761579036713 seconds
POST
Strategy 1: @timeme: run_simulation took 26.82828116416931 seconds
Strategy 2: @timeme: run_simulation took 10.909979820251465 seconds
Strategy 3: @timeme: run_simulation took 18.638883590698242 seconds
Unfortunately, when profiling after this last optimization, the run times look quite similar among the top 20
contestants demanding our attention. Additionally, most of the function calls are in code not written by us -
such as the compiler files. The high-level profiling no longer gives us enough granularity, so let's step
back into kernprof and profile run_simulation line by line.
COMMAND
python3 -m kernprof -l -v strat1.py (annotate @profile on run_simulation)
OUTPUT
Function: run_simulation at line 113
116 Settings
117 """
118 1 3.0 3.0 0.0 portfolio_vol = 0.20
119 2 448.0 224.0 0.0 trade_datetime_range = pd.date_range(
120 1 3.0 3.0 0.0 start=self.get_trade_datetime_range()[0],
121 1 2.0 2.0 0.0 end=self.get_trade_datetime_range()[1],
122 1 1.0 1.0 0.0 freq="D"
123 )
124
125 """
126 Compute Metas
127 """
128 1 22900215.0 22900215.0 49.6 self.compute_metas(index=trade_datetime_range)
129
130 """
131 Initialisations
132 """
133 1 4564.0 4564.0 0.0 self.portfolio_df = self.init_portfolio_setting
134
135 1 1.0 1.0 0.0 date_prev = None
136 1 1.0 1.0 0.0 baseclose_prev = None
137 1 1.0 1.0 0.0 units_held, weights_held = [], []
138 """
139 self.eligiblesdf = eligiblesdf
140 self.invriskdf = invriskdf]
141 self.anyinformation required by compute_forecas
142 NOTE: needs to be implemented by the child alph
143 """
144
145 3654 3516396.0 962.3 7.6 for (portfolio_i, portfolio_row),\ ------------
146 3653 12216.0 3.3 0.0 (vol_i, vol_row),\
147 3653 9511.0 2.6 0.0 (ret_i, ret_row),\
148 3653 8177.0 2.2 0.0 (active_i, active_row),\
149 3653 4924.0 1.3 0.0 (baseclose_i, baseclose_row), \
150 3653 4866.0 1.3 0.0 (eligibles_i, eligibles_row),\
151 3653 8675.0 2.4 0.0 (invrisk_i, invrisk_row),\
152 2 3.0 1.5 0.0 in zip(\
153 1 2.0 2.0 0.0 self.portfolio_df.iterrows(),\
154 1 2.0 2.0 0.0 self.voldf.iterrows(),\
155 1 2.0 2.0 0.0 self.retdf.iterrows(),\
156 1 2.0 2.0 0.0 self.activedf.iterrows(),\
157 1 3.0 3.0 0.0 self.baseclosedf.iterrows(),\
158 1 2.0 2.0 0.0 self.eligiblesdf.iterrows(),\
159 1 2.0 2.0 0.0 self.invriskdf.iterrows(),\
160 ):
161
162 3653 5464.0 1.5 0.0 strat_scalar = 2
163 3653 8118.0 2.2 0.0 eligibles = eligibles_row
164
165 3653 5381.0 1.5 0.0 if portfolio_i != 0:
166
167 7304 3935557.0 538.8 8.5 day_pnl, nominal_ret = backtest_utils.g
168 3652 6081.0 1.7 0.0 portfolio_df=self.portfolio_df,
169 3652 5737.0 1.6 0.0 last_weights=weights_held[-1],
170 3652 5219.0 1.4 0.0 last_units=units_held[-1],
171 3652 4756.0 1.3 0.0 idx=portfolio_i,
172 3652 4733.0 1.3 0.0 baseclose_row=baseclose_prev,
173 3652 4747.0 1.3 0.0 ret_row=ret_row
174 )
175
176 7304 155474.0 21.3 0.3 strat_scalar = self.get_strat_scaler(
177 3652 5068.0 1.4 0.0 lookback=30,
178 3652 4895.0 1.3 0.0 portfolio_vol=portfolio_vol,
179 3652 4737.0 1.3 0.0 idx=portfolio_i,
180 3652 4728.0 1.3 0.0 default=strat_scalar
181 )
182
183 7304 157674.0 21.6 0.3 self.portfolio_df.at[portfolio_i, "ewst
184 4849 25407.0 5.2 0.1 if nominal_ret != 0 else self.portf
185
186 3653 110092.0 30.1 0.2 self.portfolio_df.at[portfolio_i, "strat sc
187
188 3653 4832729.0 1322.9 10.5 forecasts, num_trading = self.compute_forec
189 7306 4719349.0 646.0 10.2 positions, nominal_tot = self.set_positions
190 3653 103556.0 28.3 0.2 capital=self.portfolio_df.at[portfolio_
191 3653 5523.0 1.5 0.0 portfolio_i=portfolio_i,
192 3653 4916.0 1.3 0.0 forecasts=forecasts,
193 3653 4782.0 1.3 0.0 eligibles=eligibles,
194 3653 4891.0 1.3 0.0 num_trading=num_trading,
195 3653 4831.0 1.3 0.0 portfolio_vol=portfolio_vol,
196 3653 4848.0 1.3 0.0 strat_scalar=strat_scalar,
197 3653 4866.0 1.3 0.0 invrisk_row=invrisk_row,
198 3653 4838.0 1.3 0.0 baseclose_row=baseclose_row
199 )
200
201 3653 264251.0 72.3 0.6 weights = self.set_weights(nominal_tot, pos
202 7306 18554.0 2.5 0.0 nominal_tot, positions, weights = self.post
203 3653 5057.0 1.4 0.0 index=portfolio_i,
204 3653 5071.0 1.4 0.0 date=ret_i,
205 3653 4879.0 1.3 0.0 eligibles=eligibles,
206 3653 4830.0 1.3 0.0 nominal_tot=nominal_tot,
207 3653 4894.0 1.3 0.0 positions=positions,
208 3653 4754.0 1.3 0.0 weights=weights
209 )
210
211 3653 181738.0 49.8 0.4 self.portfolio_df.at[portfolio_i, "nominal"
212 3653 121393.0 33.2 0.3 self.portfolio_df.at[portfolio_i, "leverage
213 3653 130809.0 35.8 0.3 self.portfolio_df.at[portfolio_i, "nomi
214
215 3653 4783644.0 1309.5 10.4 print(self.portfolio_df.loc[portfolio_i]) -
216 3653 8099.0 2.2 0.0 date_prev = portfolio_i
217 3653 7350.0 2.0 0.0 units_held.append(positions)
218 3653 6181.0 1.7 0.0 weights_held.append(weights)
219 3653 13016.0 3.6 0.0 baseclose_prev = baseclose_row
220
221 1 3.0 3.0 0.0 return self.portfolio_df
The pain points are compute_metas, dataframe unzipping, get_pnl_stats, compute_forecasts, set_positions -
and, surprisingly, the print statement, which takes up 10% of our runtime! We only need the last print
statement to verify that the terminal wealth is invariant after our code edits. Printing is a blocking
operation that stops our logic flow - we can improve performance by buffering and printing in chunks, or by
removing unnecessary print statements. Moving the print statement for the last date entry outside the loop
(to see only the final terminal wealth entry), we get a runtime of 23.934306859970093 seconds, down from
26.82828116416931 seconds. Not bad...
15     return strat1_compute_forecasts(
16 self . instruments ,
17 self . skewsdf . loc [ date ]. values ,
18 portfolio_i ,
19 eligibles_row . values . astype ( np . int32 ) ,
20 self . configs [ " longsize " ] ,
21 self . configs [ " shortsize " ]
Cython is a tool for compiling Python code to C, compatible with both numpy arrays and ordinary Python code.
It allows us to create code that is compiled into static binary machine code and used like regular functions
in our logic - but in order to prevent callbacks into the Python virtual machine we need to make it 'C'-like,
for instance by statically typing our variables. The binary machine code contains the optimizations run by
the 'C' compilers, for instance those performed by gcc. At run time, this has the potential to run at 'pure
C' level without being Python-aware. 'C'-like arrays are supported, and numpy supports this by implementing
the memoryview interface. While iterating over numpy arrays in native Python is slow, we can compile the loop
to make it efficient! We need to write .pyx code, build a binary extension with setup.py, and import it as
if it were any other Python module. Let's try writing a compiled version of our compute_forecasts, which we
saw to be slow.
1  # strat1.py
2  def compute_forecasts(self, portfolio_i, date, eligibles_row):
3      return cythonlib.strat1_compute_forecasts(
4          self.instruments,
5          self.skewsdf.loc[date].values,
6          portfolio_i,
7          eligibles_row.values.astype(np.int32),
8          self.configs["longsize"],
9          self.configs["shortsize"])
10
11 # cythonlib.pyx
12 def strat1_compute_forecasts(instruments, skews, portfolio_i, eligibles_row, longsize, shortsize):
13     alpha_scores = {}
14     temp = eligibles_row
15     for i in range(len(eligibles_row)):
16         if temp[i]:
17             alpha_scores[instruments[i]] = -1 * skews[i]
18     alpha_scores = {k: v for k, v in sorted(alpha_scores.items(), key=lambda pair: pair[1])}
19     alphalong = list(alpha_scores.keys())[-longsize:]
20     alphashort = list(alpha_scores.keys())[:shortsize]
21     forecaster = lambda inst: 1 if inst in alphalong else (-1 if inst in alphashort else 0)
22     forecasts = np.array([forecaster(inst) for inst in instruments])
23     return forecasts, longsize + shortsize
24
30 setup(
31     ext_modules=cythonize(
32         "cythonlib.pyx",
33         compiler_directives={"language_level": 3}
34     ),
35     include_dirs=[numpy.get_include()]
36 )
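Building the extension module then follows the standard Cython workflow, something along these lines:

python3 setup.py build_ext --inplace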
This takes roughly as long as before - it didn't really help us much; on repeated trials, the Cython runtime
is sometimes even slower! (Note that overall runtime depends on many other factors, such as CPU load from
other running processes - it is difficult to compare the efficiency of two programs whose execution times
differ only marginally.)
We can annotate our Cython code and see how 'C'-compatible it is using Cython's annotation mode.
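A typical invocation, assuming the module name from the listing above, would be:

cython -a cythonlib.pyx   # emits cythonlib.html with line-by-line annotations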
In the generated report, increasingly yellow lines indicate that at runtime we have to call back into the
Python stack. We see that we have gained no speedup - because the Cython code is essentially doing the same
thing, calling back into the virtual machine. It has no knowledge of the data types; we are still dynamically
typing and manipulating Python objects.
10.3 Back to Numpy - All Functions are Equal, But Some are More Equal than
Others
Before we try to come up with a statically typed function to prevent VM callbacks, let's see if we can first
make it efficient with some vectorized numpy trickery. It turns out this actually takes longer, at
25.925340414047241 seconds, up from 23.934306859970093 seconds!
6      shorts = np.take(eligible_args, argsort_alphas[:self.configs["shortsize"]]).astype("int32")
7      longs = np.take(eligible_args, argsort_alphas[-self.configs["longsize"]:]).astype("int32")
8      forecasts = np.zeros(len(eligibles_row))
9      for i in shorts:
10         forecasts[i] = -1
11     for i in longs:
12         forecasts[i] = 1
13     return forecasts, self.configs["longsize"] + self.configs["shortsize"]
About 40% of the time was taken accessing the skew values from our dataframes - this is a cost of maintaining
the generality of the alpha logic, so unfortunately we cannot do better than getting the values via '.loc',
since the parent alpha cannot anticipate what data the signal logic needs to access. 'np.where(eligibles_row
== 1)' also costs us a surprising amount of compute, which we address later. A surprisingly significant
amount of time, ~16%, was spent iterating over the shorts and longs to assign binary values! It turns out
that iterating over numpy values and accessing them individually is not very efficient (see the discussion
on looping over numpy arrays - link: stackoverflow).
Let's delegate this task to a cythonized function (recall that dereferencing numpy values in pure Python has
overhead in address computations, while in 'C' we can access items directly via address offsets - we know how
much space each element of a homogeneous array takes!):
def caster(int[:] shorts, int[:] longs, double[:] forecasts):
    for i in range(len(shorts)):
        forecasts[shorts[i]] = -1
    for i in range(len(longs)):
        forecasts[longs[i]] = 1
    return forecasts
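One practical detail: the typed memoryviews dictate the dtypes we must hand in from the Python side. A hedged usage sketch, reusing the shorts/longs arrays from the earlier listing, would be:

forecasts = np.zeros(len(eligibles_row), dtype=np.float64)   # double[:] expects float64
forecasts = np.asarray(cythonlib.caster(
    shorts.astype(np.int32),    # int[:] expects a C int buffer (int32 on common platforms)
    longs.astype(np.int32),
    forecasts,
))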
89 # forecasts[i] = -1
90 # for i in longs:
91 # forecasts[i] = 1
92 3653 27173.0 7.4 0.9 forecasts = cythonlib.caster(shorts, longs, for
93 3653 4582.0 1.3 0.2 return forecasts, self.configs["longsize"] + se
we now spend just 1% of the time setting up the binary-cast array. The new runtime squeezes down from
22.466408014297485 seconds, and the time spent as a percentage of the total compute in run_simulation is now
7.5%, down from 10.5% - a marginal improvement. However, across multiple experiments the runtimes are almost
identical - possibly due to the overhead of converting the Cython object back into numpy objects. Profiling
Cython code is difficult because it runs outside the Python call trace in machine-compiled code, so we should
always verify that changes have made tangible improvements. Since the performance is only marginally
different, and pure Python is more readable, we chose to stick with our original solution, which takes
22.466408014297485 seconds to run. Note that there is also an efficiency trade-off when code becomes harder
to maintain, and the compilation cycle is a developer overhead.
In the previous profiling we saw that dataframe unzipping was costly - we remove the unzipping of
self.voldf.iterrows(), self.activedf.iterrows()
since we are not referencing any data from them.
We also try optimizations that could use multiple cores in the compute_metas function:
1  import multiprocessing as mp
2  @staticmethod
3  def doinst(instdf, index):
4      @jit(nopython=True)
5      def numba_any(x):
6          return int(np.any(x))
7
17     ...
18
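The dispatch itself is elided above; a hedged sketch of how the per-instrument work might be farmed out across cores (names mirror the listing; this is not the paper's exact code) is:

# inside compute_metas: send each instrument's dataframe to a worker process
with mp.Pool(processes=mp.cpu_count()) as pool:
    results = pool.starmap(
        Alpha.doinst,
        [(self.dfs[inst], index) for inst in self.instruments],
    )
# pickling the dataframes to and from the workers is part of the overhead
# that ends up outweighing the multi-core gain on this problem size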
but when we kernprof it, it turns out that the process actually runs slower! It seems the overhead of
allocating memory and the extra kernel interactions outweigh the multi-core utilization. We come to the
grinding realization that incremental improvements are quite difficult once our code is fairly efficient -
we have to fight for scraps.
Recall that the functions get_pnl_stats and set_positions were rather costly (each costing about 10% of the
CPU time in run_simulation), so let's profile both functions:
103 3652 99693.0 27.3 2.7 portfolio_df.at[idx, "capital ret"] = capital_ret
104 7304 143375.0 19.6 3.8 portfolio_df.at[idx, "ewma"] = 0.06 * (capital_ret*
105 4849 22599.0 4.7 0.6 if nominal_ret != 0 else portfolio_df.at[idx -
106
107 3652 2590.0 0.7 0.1 return day_pnl, nominal_ret
That's interesting! It should set off alarm bells in our head - each function has one fairly innocent-looking
statement taking up most of the compute. Doing a Python object type inspection at runtime, it turns out that
in each of those scenarios at least one of the variables was a pandas.Series object instead of a numpy array.
Numpy arrays are quicker! Let's give that a quick fix by rebinding the rows to numpy arrays immediately after
unzipping, using '.values':
...
for (portfolio_i, portfolio_row), \
    (ret_i, ret_row), \
    (baseclose_i, baseclose_row), \
    (eligibles_i, eligibles_row), \
    (invrisk_i, invrisk_row), \
    in zip(
        self.portfolio_df.iterrows(),
        self.retdf.iterrows(),
        self.baseclosedf.iterrows(),
        self.eligiblesdf.iterrows(),
        self.invriskdf.iterrows(),
    ):
    portfolio_row = portfolio_row.values
    ret_row = ret_row.values
    baseclose_row = baseclose_row.values
    eligibles_row = eligibles_row.values
    invrisk_row = invrisk_row.values
    ...
This brings the runtime down from 22.466408014297485 seconds by ensuring that we are manipulating the
'efficient' datatypes. We are almost done! Let's just eke out some more performance - we return to kernprof:
==============================================================
117 @profile
118 def run_simulation(self):
119 """
120 Settings
121 """
122 1 2.0 2.0 0.0 portfolio_vol = 0.20
123 2 565.0 282.5 0.0 trade_datetime_range = pd.date_range(
124 1 4.0 4.0 0.0 start=self.get_trade_datetime_range()[0],
125 1 2.0 2.0 0.0 end=self.get_trade_datetime_range()[1],
126 1 2.0 2.0 0.0 freq="D"
127 )
128
129 """
130 Compute Metas
131 """
132 1 4123290.0 4123290.0 27.6 self.compute_metas(index=trade_datetime_range)
133
134 """
135 Initialisations
136 """
137 1 5334.0 5334.0 0.0 self.portfolio_df = self.init_portfolio_setting
138
139 1 2.0 2.0 0.0 date_prev = None
140 1 1.0 1.0 0.0 baseclose_prev = None
141 1 1.0 1.0 0.0 units_held, weights_held = [], []
142 """
143 self.eligiblesdf = eligiblesdf
144 self.invriskdf = invriskdf]
145 self.anyinformation required by compute_forecas
146 NOTE: needs to be implemented by the child alph
147 """
148
149 3654 2765613.0 756.9 18.5 for (portfolio_i, portfolio_row),\
150 3653 7320.0 2.0 0.0 (ret_i, ret_row),\
151 3653 5294.0 1.4 0.0 (baseclose_i, baseclose_row), \
152 3653 5234.0 1.4 0.0 (eligibles_i, eligibles_row),\
153 3653 5047.0 1.4 0.0 (invrisk_i, invrisk_row),\
154 2 4.0 2.0 0.0 in zip(\
155 1 3.0 3.0 0.0 self.portfolio_df.iterrows(),\
156 1 4.0 4.0 0.0 self.retdf.iterrows(),\
157 1 3.0 3.0 0.0 self.baseclosedf.iterrows(),\
158 1 2.0 2.0 0.0 self.eligiblesdf.iterrows(),\
159 1 2.0 2.0 0.0 self.invriskdf.iterrows(),\
160 ):
161
162 3653 40297.0 11.0 0.3 portfolio_row = portfolio_row.values
163 3653 19793.0 5.4 0.1 ret_row = ret_row.values
164 3653 17726.0 4.9 0.1 baseclose_row = baseclose_row.values
165 3653 17115.0 4.7 0.1 eligibles_row = eligibles_row.values
166 3653 16941.0 4.6 0.1 invrisk_row = invrisk_row.values
167
168 3653 5904.0 1.6 0.0 strat_scalar = 2
169
170 3653 5522.0 1.5 0.0 if portfolio_i != 0:
171
172 7304 1194867.0 163.6 8.0 day_pnl, nominal_ret = backtest_utils.g
173 3652 6202.0 1.7 0.0 portfolio_df=self.portfolio_df,
174 3652 5612.0 1.5 0.0 last_weights=weights_held[-1],
175 3652 5311.0 1.5 0.0 last_units=units_held[-1],
176 3652 4787.0 1.3 0.0 idx=portfolio_i,
177 3652 4805.0 1.3 0.0 baseclose_row=baseclose_prev,
178 3652 4862.0 1.3 0.0 ret_row=ret_row
179 )
180
181 7304 147291.0 20.2 1.0 strat_scalar = self.get_strat_scaler(
182 3652 5115.0 1.4 0.0 lookback=30,
183 3652 4830.0 1.3 0.0 portfolio_vol=portfolio_vol,
184 3652 5012.0 1.4 0.0 idx=portfolio_i,
185 3652 4924.0 1.3 0.0 default=strat_scalar
186 )
187
188 7304 163212.0 22.3 1.1 self.portfolio_df.at[portfolio_i, "ewst
189 4182 14071.0 3.4 0.1 if nominal_ret != 0 else self.portf
190
191 3653 111158.0 30.4 0.7 self.portfolio_df.at[portfolio_i, "strat sc
192
193 3653 4892665.0 1339.4 32.7 forecasts, num_trading = self.compute_forec
194 7306 483575.0 66.2 3.2 positions, nominal_tot = self.set_positions
195 3653 100961.0 27.6 0.7 capital=self.portfolio_df.at[portfolio_
196 3653 5455.0 1.5 0.0 portfolio_i=portfolio_i,
197 3653 5105.0 1.4 0.0 forecasts=forecasts,
198 3653 4862.0 1.3 0.0 eligibles=eligibles_row,
199 3653 4881.0 1.3 0.0 num_trading=num_trading,
200 3653 4922.0 1.3 0.0 portfolio_vol=portfolio_vol,
201 3653 4998.0 1.4 0.0 strat_scalar=strat_scalar,
202 3653 4989.0 1.4 0.0 invrisk_row=invrisk_row,
203 3653 4954.0 1.4 0.0 baseclose_row=baseclose_row
204 )
205
206 3653 225311.0 61.7 1.5 weights = self.set_weights(nominal_tot, pos
207 7306 19426.0 2.7 0.1 nominal_tot, positions, weights = self.post
208 3653 5074.0 1.4 0.0 index=portfolio_i,
209 3653 5036.0 1.4 0.0 date=ret_i,
210 3653 5066.0 1.4 0.0 eligibles=baseclose_row,
211 3653 4867.0 1.3 0.0 nominal_tot=nominal_tot,
212 3653 4954.0 1.4 0.0 positions=positions,
213 3653 5014.0 1.4 0.0 weights=weights
214 )
215
216 3653 175155.0 47.9 1.2 self.portfolio_df.at[portfolio_i, "nominal"
217 3653 118534.0 32.4 0.8 self.portfolio_df.at[portfolio_i, "leverage
218 3653 128820.0 35.3 0.9 self.portfolio_df.at[portfolio_i, "nomi
219
220 3653 5751.0 1.6 0.0 date_prev = portfolio_i
221 3653 7630.0 2.1 0.1 units_held.append(positions)
222 3653 5925.0 1.6 0.0 weights_held.append(weights)
223 3653 6466.0 1.8 0.0 baseclose_prev = baseclose_row
224
225 1 2213.0 2213.0 0.0 print(self.portfolio_df.loc[portfolio_i])
226
227 1 2.0 2.0 0.0 return self.portfolio_df
Compared to the list append operations, we see that '.at' is significantly more expensive (about 10 times).
Counting the '.at' operations that our functions make all together, we have more than 10 of them! If each
costs 1% of CPU, then cumulatively they take more than 10% of our total execution runtime. We can do better
here by maintaining stateful lists - we often only require the 'last' capital or the 'last' strategy scalar.
Recall that we already did this for the last positions held and for our weight vector for portfolio
allocations. Let's clean up all these '.at' operations by maintaining a list of each quantity and accessing
the 'last' entry with
list[-1]
4 nominal_ret = np . dot ( last_weights , ret_row )
5 capital_ret = nominal_ret * leverages [ -1]
6
Cleaning up our code everywhere takes some time, and we swap the order of a few lines to maintain logical
correctness (note the swapping of the pnl and strategy-scalar calculations):
18     """
19     Compute Metas
20     """
21     self.compute_metas(index=trade_datetime_range)
22
23     """
24     Initialisations
25     """
26     capital, ewma, ewstrat, self.portfolio_df = self.init_portfolio_settings(trade_range=trade_datetime_range)
27
28     date_prev = None
29     baseclose_prev = None
30     capitals = [capital]
31 ewmas = [ ewma ]
32 ewstrats = [ ewstrat ]
33 nominalss = []
34 leverages = []
35 strat_scalars = []
36 units_held , weights_held = [] , []
37 """
38 self . eligiblesdf = eligiblesdf
39 self . invriskdf = invriskdf ]
40 self . any informa tion required by c o m p u t e _ f o r e c a s t s
41 """
42
62 strat_scalar = 2
63
64 if portfolio_i != 0:
65         strat_scalar = self.get_strat_scaler(
66 lookback =30 ,
67 portfolio_vol = portfolio_vol ,
68 idx = portfolio_i ,
69 default = strat_scalars [ -1] ,
70 ewmas = ewmas ,
71 ewstrats = ewstrats
72 )
73
74         day_pnl, nominal_ret = backtest_utils.get_pnl_stats(
75 portfolio_df = self . portfolio_df ,
76 last_weights = weights_held [ -1] ,
77 last_units = units_held [ -1] ,
78 idx = portfolio_i ,
79 baseclose_row = baseclose_prev ,
80 ret_row = ret_row ,
81 capitals = capitals ,
82 leverages = leverages ,
83 ewmas = ewmas
84 )
85
115 nominalss . append ( nominal_tot )
116 leverages . append ( nominal_tot / capitals [ -1])
117 units_held . append ( positions )
118 weights_held . append ( weights )
119
120 if verbose :
121 capital_ser = pd . Series (
122 data = capitals , index = trade_datetime_range , name = " capital " )
123 stratscal_ser = pd . Series (
124 data = strat_scalars , index = trade_datetime_range , name = " strat_scalar " )
125 nominals_ser = pd . Series (
126 data = nominalss , index = trade_datetime_range , name = " nominal_tot " )
127 leverages_ser = pd . Series (
128 data = leverages , index = trade_datetime_range , name = " leverages " )
129     units = pd.DataFrame(
130         data=units_held, index=trade_datetime_range, columns=[inst + " units" for inst in self.instruments])
131     weights = pd.DataFrame(
132         data=weights_held, index=trade_datetime_range, columns=[inst + " w" for inst in self.instruments])
133 self . portfolio_df = pd . concat ([
134 units ,
135 weights ,
136 stratscal_ser ,
137 nominals_ser ,
138 leverages_ser ,
139 capital_ser ,
140 ] , axis =1)
141 print ( self . portfolio_df )
142 else :
143 print ( capitals )
144 return self . portfolio_df
This brings us down from 15.73388385772705 seconds. The compute time of the final section, reconstructing
our portfolio dataframe, is actually quite cheap:
232 1 984.0 984.0 0.0 stratscal_ser = pd.Series(data=strat_scalar
233 1 1219.0 1219.0 0.0 nominals_ser = pd.Series(data=nominalss, in
234 1 834.0 834.0 0.0 leverages_ser = pd.Series(data=leverages, i
235 1 66953.0 66953.0 0.5 units = pd.DataFrame(data=units_held, index
236 1 62550.0 62550.0 0.5 weights = pd.DataFrame(data=weights_held, i
237 3 3505.0 1168.3 0.0 self.portfolio_df = pd.concat([
238 1 2.0 2.0 0.0 units,
239 1 2.0 2.0 0.0 weights,
240 1 1.0 1.0 0.0 stratscal_ser,
241 1 2.0 2.0 0.0 nominals_ser,
242 1 2.0 2.0 0.0 leverages_ser,
243 1 2.0 2.0 0.0 capital_ser,
244 1 2.0 2.0 0.0 ], axis=1)
245 1 295655.0 295655.0 2.3 print(self.portfolio_df)
246 else:
247 print(capitals)
248 1 4.0 4.0 0.0 return self.portfolio_df
With all the above techniques applied, let’s run the other strategies:
We have not covered all of the tools available to the engineer for writing performant Python code. Here we
enumerate some of the more well-known/common tools that we believe are worth looking into and that fit
quantitative number crunching well (we might cover them in future papers!): OpenMP, NumExpr, Dask,
Swifter, PyPy. Some of these tools can be combined with the solutions discussed earlier to hypercharge
our code - such as Numba and OpenMP, Cython and OpenMP, Dask/Swifter and Numba, et cetera.
Other topics you might be interested in: Operating System Schedulers, Distributed Systems,
Bytecode and Assembly Language, Hyperthreading.
13 Conclusions
We have introduced some of the 'quick-and-easy' ways Python developers can speed up their programs. Note
that this is just the tip of the iceberg: there are always more ways to squeeze extra compute out of your
system, as well as specialised engineering you can do for your operating system's architecture to make your
code fly - from high-level languages down to inspecting assembly and the machine code running on silicon!
We have worked with relatively manageable datasets - different optimizations, such as distributed algorithms
and data structures, may be required when working with huge datasets.
We started with profiling, finding the quick wins and 'attacking where it hurts'. We focused on making our
code performant with a pure Python approach; then, as we hit the language's performance bounds, we
circumvented them with compilation approaches in Numpy, Numba and Cython. Lastly, our optimizations gave
diminishing returns, such that ever greater amounts of refactoring were required for comparable gains - a
good sign that our code is getting decently fast.
We have also not covered in detail the depths of concurrency in Python - most of the speedups introduced
have been runtime optimizations that unlock what Python hides away from programmers to deal with thorny
issues such as memory management. I hope that for many beginner (and some intermediate-level) programmers
this was a peek into the world of optimization in programming, and that it piques your interest in the path
towards better programming standards. In the future, we hope to introduce more on concurrency and on writing
parallelism into our systems - with discussions on inter-process communication, semaphores and memory
management!
References