Broken Backtests

... and what to do about them

Steven E. Pav
(former quant)

Former applied mathematician.
Quant Programmer & Quant Strategist 2007-2015 at two small ML-based hedge funds.
Almost pure quant funds, ML-based, in U.S. ("single name") equities and volatility futures.
Tried many approaches to finding alpha:
- ML-based: SVMs, random forests, GP.
- traditional techniques: plain old linear regression.
Terrifying feeling of "what am I doing here?"
- How do you write and validate and debug a backtest simulator?
- No open source code available at the time.
- Most academics probably write backtest code from scratch.
- Papers with implicit backtests are a priori suspect, more so if the strategy is complex.

What makes a profitable strategy?
- Need prediction of future price movements.
But also:
- Turn the predictions into trades.
- No, really. You need to turn the predictions into trades.
- Eliminate or reduce exposure to certain risks.
- Control trade costs. (market impact, commissions, short financing.)
Hard to estimate the effects of the different moving parts separately.
So simulate your trading historically: A backtest.
Backtesting basically implies systematic strategies.
Backtest to decide how much (if any) to deploy in a strategy.

A backtest probably should:
- simulate the environment in which you act (presents point-in-time data, accepts orders).
- simulate the reactions of the world (fills, commissions, corporate actions, etc.).
- translate in an obvious way to a real trading strategy.
- provide a guarantee of time safety.
Creating a good backtesting environment requires:
- Software engineering: balance time safety, computational efficiency & developer sanity.
- Domain knowledge and data: How do corporate actions work? How should you simulate fills?
- Great statistical powers: How do you interpret the results? How do you avoid overfitting?
- Good intuition and sleuthing abilities: What new thing is broken?
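
One way to get a (partial) guarantee of time safety is to make the simulator, not the strategy, own the clock. The sketch below is a minimal illustration of that idea, not a production design; the names `PointInTimeData` and `run_backtest` are hypothetical, and fills, costs and corporate actions are deliberately left out.

```python
import numpy as np


class PointInTimeData:
    """Wraps a price array and only reveals bars up to the simulation clock."""

    def __init__(self, prices):
        self._prices = np.asarray(prices, dtype=float)
        self._now = -1  # index of the latest bar the strategy may see

    def advance(self):
        """Move the clock forward one bar; return False if no data is left."""
        if self._now + 1 >= len(self._prices):
            return False
        self._now += 1
        return True

    def history(self, as_of=None):
        """Return a copy of prices up to `as_of` (default: now); refuse the future."""
        as_of = self._now if as_of is None else as_of
        if as_of > self._now:
            raise LookupError(f"time travel: bar {as_of} requested at bar {self._now}")
        return self._prices[: as_of + 1].copy()


def run_backtest(prices, strategy):
    """The position chosen at bar t earns the price change from t to t+1."""
    prices = np.asarray(prices, dtype=float)
    data = PointInTimeData(prices)
    pnl = []
    for t in range(len(prices) - 1):
        data.advance()
        position = strategy(data)  # the strategy only ever sees `data`
        pnl.append(position * (prices[t + 1] - prices[t]))
    return np.array(pnl)
```

A strategy that asks for a bar beyond the clock fails loudly instead of quietly producing a great Sharpe. Handing out copies keeps the strategy from mutating the simulator's data.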

Use Bayes' Rule:
- Devising a consistently profitable trading strategy is known to be hard.
  (The EMH posits that it is essentially impossible.)
- Bugs are easy to make. A good programmer will make several a day.
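If your backtest looks profitable (event B), what is the probability that the strategy is really profitable (event A)? In odds form, the posterior odds scale as the prior odds times the likelihood ratio:
\[\mathcal{O}\left(\left.A\right|B\right) \propto \mathcal{O}\left(A\right) \Lambda\left(\left.B\right|A\right).\]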
If you are exploring a new asset class, using a new fill simulator, using new code, testing a new strategy, or reading a paper, and the backtest looks great, it's probably a bug.
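
A worked example of the odds form of Bayes' Rule, with made-up numbers: a 1% prior that a new strategy is really profitable, a 90% chance that a really profitable strategy backtests well, and a 20% chance that an unprofitable one backtests well anyway (bugs, time travel, overfitting). All three inputs are assumptions for illustration only.

```python
def posterior_probability(prior_prob, p_good_given_real, p_good_given_not_real):
    """Posterior P(really profitable | backtest looks good) via the odds form."""
    prior_odds = prior_prob / (1.0 - prior_prob)
    likelihood_ratio = p_good_given_real / p_good_given_not_real
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


# Hypothetical inputs: 1% prior, 90% hit rate, 20% false-positive rate.
p = posterior_probability(0.01, 0.9, 0.2)
print(f"{p:.3f}")  # prints 0.043
```

Even with a fairly generous likelihood ratio of 4.5, the good-looking backtest only lifts the probability from 1% to about 4%.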

Examples:

A paper from March 2012 claimed a Sharpe of 3.5 / sqrt(yr) and 500% annual returns, using monthly trading with the signal delayed a month.
Three-day-old tweets give you a Sharpe of around 9 / sqrt(yr) trading on the DJIA index.

The most common error in backtests is time travel: use of future information in simulations.
Time travel is easy to simulate, but hard to implement!
Time travel occurs for many reasons:
- Using crude tools.
- Backfill and survivorship bias.
- Representation of corporate actions: dividends, splits, spinoffs, mergers, warrants.
- Think-os and code boo boos.

Inclusion/exclusion of a company in data may be a form of time travel.
A classic survivorship bias: trading historically on today's S&P500 universe of stocks.
Similarly, data vendors often backfill data for companies.
- You can test for this, or just ask them!
Vendors (or you) do weird things to deal with mergers.
Takeaway: be careful with universe construction.
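
A toy simulation of the survivorship effect, under simplifying assumptions that are mine rather than from any real data set: every stock has zero-drift log returns, and a stock counts as 'delisted' if its price ever falls below half its starting value. The function name and parameters are hypothetical.

```python
import numpy as np


def survivorship_demo(n_stocks=2000, n_days=1250, sigma=0.02,
                      delist_below=0.5, seed=0):
    """Compare mean total log return of all stocks vs. the survivors only."""
    rng = np.random.default_rng(seed)
    # Zero-drift daily log returns: nobody has any real edge here.
    log_ret = rng.normal(0.0, sigma, size=(n_days, n_stocks))
    rel_price = np.exp(np.cumsum(log_ret, axis=0))  # price relative to start
    # A stock 'survives' if it never fell below the delisting threshold.
    # (Delisted paths keep running in the simulation, for simplicity.)
    survived = rel_price.min(axis=0) >= delist_below
    total_log_ret = log_ret.sum(axis=0)
    return {
        "all": float(total_log_ret.mean()),
        "survivors": float(total_log_ret[survived].mean()),
        "fraction_surviving": float(survived.mean()),
    }
```

Backtesting on the survivors alone reports a positive average return even though every stock was generated with zero drift; the 'edge' comes entirely from how the universe was built.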

Corporate actions are notoriously time-leaky.
We represent asset returns as a single time series; in reality, they branch across time.
Corporate actions are just hard to model.
For example, (back-)adjusted closes: a portfolio weighted inversely proportional to adjusted close has a time-travel 'arb'.

[Figure: plot for the AAPL example.]
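
A sketch of how back-adjustment can leak the future, using splits only. The setup is a toy of my own construction: stocks start at 50, follow zero-mean log returns, and split 2-for-1 whenever the close exceeds 100. Back-adjusted closes divide each raw close by all later split ratios, so a low adjusted close today partly reflects splits (that is, price rises) that have not happened yet.

```python
import numpy as np


def inverse_price_portfolio_returns(prices, ret):
    """Weights at close t are proportional to 1/prices[t]; they earn ret[t+1]."""
    w = 1.0 / prices[:-1]
    w = w / w.sum(axis=1, keepdims=True)
    return (w * ret[1:]).sum(axis=1)


def adjusted_close_demo(n_stocks=300, n_days=1000, sigma=0.02, seed=0):
    rng = np.random.default_rng(seed)
    # Shareholder (split-neutral) simple returns, close to close.
    ret = np.expm1(rng.normal(0.0, sigma, size=(n_days, n_stocks)))
    raw = np.empty((n_days, n_stocks))      # closes as seen on the day
    split = np.ones((n_days, n_stocks))     # split ratio applied on day t
    price = np.full(n_stocks, 50.0)
    for t in range(n_days):
        price = price * (1.0 + ret[t])
        splitting = price > 100.0           # hypothetical 2-for-1 split rule
        split[t, splitting] = 2.0
        price = np.where(splitting, price / 2.0, price)
        raw[t] = price
    cum = np.cumprod(split, axis=0)
    future_factor = cum[-1] / cum           # product of split ratios AFTER day t
    adjusted = raw / future_factor          # back-adjusted closes
    return {
        "raw": float(inverse_price_portfolio_returns(raw, ret).mean()),
        "adjusted": float(inverse_price_portfolio_returns(adjusted, ret).mean()),
    }
```

Weighting by the inverse of the raw close uses only information available on the day. Weighting by the inverse of the adjusted close tilts toward stocks that will split later, and in this toy that tilt should show up as a higher mean return.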

A common mistake: align returns to features for training ML models,
then forget that the fitted model is timestamped to the returns, not to the features.
A warning sign: the more often I retrain, the better my model!
(Often excused as 'time freshness'.)
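
A minimal walk-forward sketch that respects the timestamp of the returns. It assumes `X[t]` is known at the close of bar `t` and `y_fwd[t]` is the return from `t` to `t + horizon`, so row `t` only becomes usable for training at bar `t + horizon`. The function name, the plain least-squares model and `min_train` are illustrative choices, not the author's method.

```python
import numpy as np


def walk_forward_predictions(X, y_fwd, horizon, min_train=50):
    """Predict at each bar t using only rows whose forward return is realized."""
    X = np.asarray(X, dtype=float)
    y_fwd = np.asarray(y_fwd, dtype=float)
    n = len(X)
    preds = np.full(n, np.nan)
    for t in range(n):
        last = t - horizon  # last row whose forward return is known at bar t
        if last + 1 < min_train:
            continue
        beta, *_ = np.linalg.lstsq(X[: last + 1], y_fwd[: last + 1], rcond=None)
        preds[t] = X[t] @ beta
    return preds
```

A cheap check for time safety: perturb everything that is not yet known at some bar `t0` (features after `t0`, forward returns from row `t0 - horizon + 1` on) and confirm that predictions up to `t0` do not change.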

In reality, orders might not get (fully) filled, or might get a bad price.
Hard to simulate given coarse data, like daily bars.
There is 'market impact' where your order affects your fill price.
- Bigger orders lead to bigger impact.
- Decent theoretical models but with uncertain parameters.
- Fitting the parameters is tricky: you only observe one history.
- Impact models often ignore other factors (like the Market).
Fill simulation should introduce a large band of uncertainty around your simulations.
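
One way to carry that uncertainty through a simulation is to price each fill under a range of impact coefficients instead of a single one. The sketch below assumes a square-root impact model (cost proportional to daily volatility times the square root of participation), which is a common choice but not one named in these slides; the coefficient range 0.5 to 1.5 is a placeholder, not a calibrated value.

```python
import math


def sqrt_impact_bps(order_shares, daily_volume, daily_vol_bps, coef):
    """Assumed square-root model: cost = coef * vol * sqrt(|order| / volume)."""
    participation = abs(order_shares) / daily_volume
    return coef * daily_vol_bps * math.sqrt(participation)


def fill_price_band(mid_price, order_shares, daily_volume, daily_vol_bps,
                    coef_low=0.5, coef_high=1.5):
    """Return (optimistic, pessimistic) fill prices for a signed order."""
    side = 1.0 if order_shares > 0 else -1.0  # buys pay up, sells receive less
    band = []
    for coef in (coef_low, coef_high):
        cost_bps = sqrt_impact_bps(order_shares, daily_volume, daily_vol_bps, coef)
        band.append(mid_price * (1.0 + side * cost_bps / 1e4))
    return tuple(band)


# Buy 10,000 shares at a mid of 100, with 1,000,000 shares daily volume
# and 200 bps daily volatility: costs of 10 bps and 30 bps.
low, high = fill_price_band(100.0, 10_000, 1_000_000, 200.0)
```

Running the backtest at both ends of the band gives a range of outcomes; a strategy that is only profitable at the optimistic end deserves suspicion.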

Two forms of overfitting:
- Having an overly optimistic estimate of out-of-sample performance.
- Choosing a suboptimal strategy by having too much freedom.
Two forms or one form?

First kind of overfitting is like 'estimation after selection'.
For example:
- generate 1000 'random' strategies,
- backtest them all,
- pick best one based on maximal in-sample Sharpe,
- estimate the out-of-sample Sharpe of that strategy?
But not entirely a technical problem.
Usually attacked by elaborate 'in-sample' vs. 'out-of-sample' schemes.
In reality, there is 'in-sample' and 'trading-real-money.'
- You ignore data available to you now at your own risk.
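
The four steps above can be simulated directly. In this sketch every 'strategy' is pure noise with zero true Sharpe, so any in-sample Sharpe of the winner is selection bias. The sample sizes (one year in-sample, five years out-of-sample, 1% daily volatility) are arbitrary assumptions.

```python
import numpy as np


def annualized_sharpe(returns, periods_per_year=252):
    """Sample Sharpe in units of 1/sqrt(yr), computed along axis 0."""
    return np.sqrt(periods_per_year) * returns.mean(axis=0) / returns.std(axis=0, ddof=1)


def selection_bias_demo(n_strategies=1000, n_in=252, n_out=1260, seed=0):
    rng = np.random.default_rng(seed)
    # Daily returns with zero mean: no strategy has any true edge.
    in_sample = rng.normal(0.0, 0.01, size=(n_in, n_strategies))
    out_sample = rng.normal(0.0, 0.01, size=(n_out, n_strategies))
    sr_in = annualized_sharpe(in_sample)
    best = int(np.argmax(sr_in))  # pick the winner on in-sample Sharpe
    sr_out = annualized_sharpe(out_sample[:, best])
    return float(sr_in[best]), float(sr_out)
```

With one year of data, each Sharpe estimate has a standard error of roughly 1 / sqrt(yr), so the best of 1000 noise strategies typically shows an in-sample Sharpe around 3, while its out-of-sample Sharpe is just another draw centered on zero.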

The classical overfit problem: too many parameters cause poor live performance.
Applies to portfolio optimization in a subtle way.
"A perfectly rational agent should not be harmed by addition of choices."
There are no perfectly rational systematic strategies.
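
A sketch of how extra choices can hurt a systematic strategy, using a plug-in Markowitz portfolio as a stand-in (my choice of example; the slides do not specify one). Two assets have a true daily mean of 0.1 and unit variance, and we add assets with zero true mean. An agent who knew the truth would ignore the extras; the plug-in estimator cannot.

```python
import numpy as np


def plugin_markowitz_true_sharpe(n_noise_assets, n_obs=500, n_trials=50, seed=0):
    """Average TRUE (daily) Sharpe of the sample-based Markowitz portfolio."""
    rng = np.random.default_rng(seed)
    # Hypothetical truth: two useful assets, the rest pure noise; covariance = I.
    mu = np.concatenate([np.full(2, 0.1), np.zeros(n_noise_assets)])
    sharpes = []
    for _ in range(n_trials):
        x = rng.normal(mu, 1.0, size=(n_obs, len(mu)))
        mu_hat = x.mean(axis=0)
        cov_hat = np.cov(x, rowvar=False)
        w = np.linalg.solve(cov_hat, mu_hat)  # plug-in Markowitz weights
        # True Sharpe of w; the true covariance is the identity.
        sharpes.append(w @ mu / np.sqrt(w @ w))
    return float(np.mean(sharpes))


few = plugin_markowitz_true_sharpe(n_noise_assets=0)
many = plugin_markowitz_true_sharpe(n_noise_assets=100)
```

The best achievable daily Sharpe is 0.1 * sqrt(2), about 0.141, in both cases. With only the two useful assets the plug-in portfolio should land near that bound; with 100 extra noise assets and the same 500 observations it should fall well short, because estimation error in the extra weights swamps the signal.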