Testing Cryptocurrency Coin Price Series for Momentum, Mean Reversion, Random Walk

In this post we will use a modified version of the relative strength index to test for the presence of persistent momentum, mean reversion and random walk properties in coin price series.

The Relative Strength Index (RSI) was developed by J. Welles Wilder and published in his 1978 book, New Concepts in Technical Trading Systems, and in the June 1978 issue of Commodities magazine (now Futures magazine) (Wikipedia).

The RSI is calculated as follows. For each period, the upward change U and downward change D are:

U = close_today - close_yesterday (upward change)
D = close_yesterday - close_today (downward change)

If the close is higher than the previous close, D = 0. If the close is lower than the previous close, U = 0.

An (n) period exponential moving average is calculated over the U and D values.

The ratio of these averages is the relative strength or relative strength factor:
RS = EMA(U,n) / EMA(D,n)
with n denoting the look back period.

The final RS ratio is converted into an index which oscillates between 0 and 100:

RSI = 100 - 100 / (1 + RS)
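
For illustration only (this is not the code used later in the post), the arithmetic above can be sketched in a few lines of R. The rsi_n helper below is a hypothetical function; in practice the TTR package provides a ready made RSI() implementation.

# Minimal sketch of an n period RSI (Wilder style smoothing with alpha = 1/n)
rsi_n <- function(close, n = 14) {
  change <- diff(close)
  U <- pmax(change, 0)   # upward changes, otherwise 0
  D <- pmax(-change, 0)  # downward changes, otherwise 0
  ema <- function(x, n) {
    out <- numeric(length(x))
    out[1] <- x[1]
    for (i in 2:length(x)) out[i] <- out[i - 1] + (x[i] - out[i - 1]) / n
    out
  }
  rs <- ema(U, n) / ema(D, n)
  c(NA, 100 - 100 / (1 + rs))  # pad so the output aligns with the close vector
}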

In its classic form Wilder recommended a smoothing period of 14. Furthermore, a period of 14 days has been tested on the S&P500 (Adrian Ţăran-Moroşan, 2011). The author of that paper selects RSI extreme signal levels of 30 and 70: sell the market short over 70 and buy the market below 30. The test yielded poor results; however, there may be room for improvement in the look back period that was selected. There is no consideration of the S&P500 half life of mean reversion. For example, if the S&P500 typically mean reverts over a period of less than 14 days, the classic 14 period look back would mask most signals. An optimal look back period equal to the half life of mean reversion may have yielded better results.

In this post we will examine a short term RSI with an arbitrarily chosen look back period of 2 days (RSI2). We select RSI extreme signal levels of 20, 70 and 85. The momentum strategy will buy the market when RSI2 closes over 70 and will exit the trade when RSI2 drops below 85. Conversely, the mean reversion strategy will buy the market when RSI2 is below 20 and will sell when RSI2 is over 70.
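
As a rough sketch (not the screening script itself), the rules above could be expressed in R as follows. The file path, the close column name and the rsi_n helper from the sketch above are assumptions.

# Hypothetical example of the RSI2 entry/exit rules applied to one downloaded coin file
df <- read.csv("D:/R Projects/Final Scripts/BitCoin Scripts/data/BTC.csv", header = TRUE)
df$rsi2 <- rsi_n(df$close, 2)                 # 2 period RSI
df$momo.enter <- ifelse(df$rsi2 > 70, 1, 0)   # momentum: buy when RSI2 closes over 70
df$momo.exit  <- ifelse(df$rsi2 < 85, 1, 0)   # momentum: exit when RSI2 drops below 85
df$mr.enter   <- ifelse(df$rsi2 < 20, 1, 0)   # mean reversion: buy when RSI2 closes below 20
df$mr.exit    <- ifelse(df$rsi2 > 70, 1, 0)   # mean reversion: sell when RSI2 closes over 70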

In essence, the two RSI models above test for momentum (buy high and sell higher) and for mean reversion (buy low and sell high). Series where neither the momentum model nor the mean reversion model yields good results may be considered random walks, and random walk series are generally difficult to predict.

Persistent price anomalies may exist in price series. If this holds true then the models above may show medium / long term stability of predictable returns.

Next, we download 1568 daily coin price series to .csv format using the crypto compare API. Using R we achieve this with the following code:

# Crypto Compare Coin Data
# Download using R
# Andrew Bannerman 11.29.2017

library(jsonlite)
library(data.table)
library(ggplot2)

# Obtain coin list
response = fromJSON("https://www.cryptocompare.com/api/data/coinlist")
df = as.data.frame(data.table::rbindlist(response$Data, fill=TRUE))
# Write coin list to .csv
write.csv(df,file="D:/R Projects/Final Scripts/BitCoin Scripts/coin_list.csv")

# Load Tickers From Coin List
read.data <- read.csv("D:/R Projects/Final Scripts/BitCoin Scripts/coin_list.csv", header=TRUE, stringsAsFactors = FALSE)
tickers <- c(read.data$Symbol)
# Remove all * at end of file names
tickers <- gsub("[*]", "", tickers)

# Obtain API (information only)
#cc <- fromJSON("https://min-api.cryptocompare.com/")
#str(cc)

# Download Daily Data
# Loop for downloading all .csvs
# Save each coin close plot
# Add error catching to loop
# Add completion time of each iteration
i=1
for (i in 1:length(tickers)) {
  tryCatch({
  next.coin <- tickers[i] # next coin in ticker vector
coin_daily <- fromJSON(paste0("https://min-api.cryptocompare.com/data/histoday?fsym=",next.coin,"&tsym=USD&allData=true&e=CCCAGG"))
df <- data.frame("time" = coin_daily$Data$time, "close" = coin_daily$Data$close,"open" = coin_daily$Data$open,"high" = coin_daily$Data$high,"low" = coin_daily$Data$low,"volumefrom" = coin_daily$Data$volumefrom,"volumeto" = coin_daily$Data$volumeto)
# Save plot output to folder
# Set path, set png plot size
# ggplot 2 to create line plot
mytitle = paste0(next.coin)
graphics.off() # Close graphic device before next plot
p <- ggplot(df, aes(time)) +
  theme_classic()+
  geom_line(aes(y=close), colour="red") +
  ggtitle(mytitle, subtitle = "") +
  labs(x="Time",y="Close")+
  theme(plot.title = element_text(hjust=0.5),plot.subtitle =element_text(hjust=0.5))
ggsave(path="D:/R Projects/Final Scripts/BitCoin Scripts/daily_plot",paste(next.coin,".png"))
# Save each file to .csv
write.csv(df,paste0(file="D:/R Projects/Final Scripts/BitCoin Scripts/data/",next.coin,".csv"))
ptm0 <- proc.time()
Sys.sleep(0.1)
ptm1=proc.time() - ptm0
time=as.numeric(ptm1[3])
cat('\n','Iteration',i,'took', time, "seconds to complete")
  }, error = function(e) { print(paste("i =", i, "failed:")) })
}

This coin universe will be used to test for mean reversion, momentum and random walk using RSI2.

An R script was setup to run the RSI2 momentum and mean reversion model on each coin series. No effort was made to restrict the starting date of each series, rather the entire price series was tested in each instance.

The end of day closing prices were used for testing. To avoid look ahead bias, when an RSI2 signal was triggered the entry was placed on the exchange at the next day's market open. Subsequently, a sell signal was closed out on the same day as the signal. In real life, the script would run 15 minutes before the close and any conditions met would be closed out prior to the end of the day. In this back test, due to my absence of higher frequency coin data, this has not been included. The results are for guidance only, not exact specifics. However, despite the 15 minute discrepancy, the test should still serve as a basis for revealing mean reversion, momentum and random walk series.
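
A minimal sketch of that shift, assuming a hypothetical 0/1 signal column named df$signal (the pairs trading back test later in this post does the same thing with lag()):

# Shift the signal forward one row so it is acted on the following day
df$signal.lagged <- c(NA, head(df$signal, -1))  # today's row now holds yesterday's signal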

No effort has been made to optimize model parameters on in-sample and out-of-sample data, and no statistical testing for mean reversion or momentum, nor any half life of mean reversion calibration, has been performed. Such tests may aid in selecting optimal look back periods after the initial RSI2 mean reversion / momentum screening, and may also help avoid the effects of look ahead bias and over optimization.

First, the top 50 momentum price series results are provided in table 1.0. To refresh, a momentum buy signal was generated when RSI2 closed over 70 and a sell signal generated when RSI2 crossed below 85. Note these parameters have been arbitrarily chosen. Data is sorted by top momentum cumulative returns.

momo_coin
Table 1.0 – Table sorted by highest momentum cumulative return

Lastly, the top 50 mean reversion price series results are provided in table 1.1. To refresh, a mean reversion buy signal was generated when RSI2 closed below 20 and a sell signal generated when RSI2 crossed above 70. Data is sorted by top mean reversion cumulative returns.

mean_rev_coin
Table 1.1 – Table sorted by highest mean reversion cumulative return

The RSI2 momentum, mean reversion and random walk screening revealed a total of 202 momentum series, where momentum outperformed buy and hold; a total of 1086 mean reversion series, where mean reversion outperformed buy and hold; and a total of 340 random walk series, where buy and hold outperformed both the momentum and mean reversion RSI2 models. Based on this test, it would appear that mean reversion is the dominant anomaly within the coin price series.

Rplot161

The series obtained from the above RSI2 screening should be further selected for in-sample optimization and out of sample testing. The below figure shows an example of how this process may be conducted; it is typically referred to as the hold out method:

optimize
Example of hold out method for conducting in-sample optimization and out-of-sample testing

For each in-sample period (orange), the model parameters are optimized and subsequently selected for forward out of sample testing (green). At the end of the first out of sample test, the in-sample period is then re-set by the fixed in-sample look back period. The model parameters are once again optimized and used for subsequent out of sample testing. This process is repeated for the entire sample period.
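
A rough sketch of that walk forward loop in R, where the window lengths and the optimise/test placeholders are assumptions rather than values used in this post:

# Hold out method sketch: fixed in-sample window, rolling out-of-sample window
n <- nrow(df)
is_len  <- 500   # in-sample length in days (assumption)
oos_len <- 250   # out-of-sample length in days (assumption)
start <- 1
while (start + is_len + oos_len - 1 <= n) {
  is_idx  <- start:(start + is_len - 1)                       # optimize parameters on this slice
  oos_idx <- (start + is_len):(start + is_len + oos_len - 1)  # test the chosen parameters on this slice
  # params <- optimise_model(df[is_idx, ]); test_model(df[oos_idx, ], params)  # hypothetical helpers
  start <- start + oos_len                                    # roll the window forward
}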

To conclude, the RSI2 may be a simple method to test for momentum (buy high and sell higher) and for mean reversion (buy low and sell high). These attributes may point to an underlying anomaly which can be exploited using simple models. The RSI look back period of 2 was arbitrarily chosen, but it may serve the purpose of revealing momentum or mean reversion series.

Furthermore, RSI2 may not be the correct look back to use for the specific series under examination. This is because each series displays a different rate of diffusion away from the mean. Those with momentum properties move away from the mean at a higher rate than mean reversion series, and at a faster rate than a random walk. Mean reversion series exhibit different variances around the mean and also different speeds of reversion to the mean. The former relates directly to how much profit can be achieved from a mean reverting series; the latter relates to how long it takes for the series to revert to the mean. A mean reversion speed of 9 months may not be suitable for a shorter term trader, not to mention that it may not be the best deployment of capital.

For further research, the speed of mean reversion may be calculated using the Ornstein-Uhlenbeck formula, which is presented in a previous post. An optimal look back period for a mean reversion series would be equal to the half life of mean reversion (Ernest P. Chan, 2013).

In closing, it should also be noted that there is a large benefit to using statistical testing to confirm the presence of momentum, mean reversion and random walk. This is because statistical tests cover all data points in a sample, whereas a pre-selected model like the above RSI2 only considers days when the RSI2 trigger was active, thus eliminating many days from the sample (Ernest P. Chan, 2013).

A statistical test for confirming momentum, mean reversion and random walk is the Hurst exponent. This is presented in two previous posts:
Hurst exponent in R
Modelling the Hurst exponent in R
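
For quick screening, a rough diffusion based estimate of the Hurst exponent can be sketched in a few lines of R. This is a simplified estimator, not the code from those posts:

# Approximate Hurst exponent from the scaling of lagged differences
# H near 0.5 suggests a random walk, H < 0.5 mean reversion, H > 0.5 trending / momentum
hurst_approx <- function(x, lags = 2:20) {
  tau <- sapply(lags, function(l) sqrt(mean(diff(x, lag = l)^2)))
  fit <- lm(log(tau) ~ log(lags))
  unname(coef(fit)[2])  # slope of log(tau) vs log(lag)
}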

The augmented Dickey-Fuller test may be used to confirm the presence of mean reversion, where one may reject the null hypothesis of a unit root.
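
As a quick sketch, the same urca call used later in this post can be applied to a single price or spread series (the df$close column is an assumption):

# ADF test sketch: a test statistic more negative than the critical values rejects the unit root null
library(urca)
summary(ur.df(df$close, type = "drift", lags = 1))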

Good trading!
Andrew

Free Coin Data: Batch download coin data using R

We will use the crypto compare API. The data is free. We will download daily data.

Required packages

library(jsonlite)
library(data.table)
library(ggplot2)

Next download the full coin list from crypto compare and save the list to a data frame

# Obtain coin list
response = fromJSON("https://www.cryptocompare.com/api/data/coinlist")
df = as.data.frame(data.table::rbindlist(response$Data, fill=TRUE))

The response comes back as a list; for ease we will save it to a .csv.

# Write coin list to .csv
write.csv(df,file="D:/R Projects/Final Scripts/BitCoin Scripts/coin_list.csv")

Load the list of symbols from the .csv file and save the symbols in a vector

# Load Tickers From Coin List
read.data <- read.csv("D:/R Projects/Final Scripts/BitCoin Scripts/coin_list.csv", header=TRUE, stringsAsFactors = FALSE)
tickers <- c(read.data$Symbol)

Clean the list

# Those symbols with * at the end of their name
# Remove *
tickers <- gsub("[*]", "", tickers)

Next we write a loop to iterate through the ticker symbol vector and download each symbol to .csv format. As an extra bonus, we will also plot the close price and save the plot to our output folder.

# Download Daily Data 
# Loop for downloading all .csvs 
# Save each coin close plot
# Add error catching to loop
# Add completion time of each iteration
i=1
for (i in 1:length(tickers)) {
  tryCatch({
  next.coin <- tickers[i] # next coin in ticker vector
coin_daily <- fromJSON(paste0("https://min-api.cryptocompare.com/data/histoday?fsym=",next.coin,"&tsym=USD&allData=true&e=CCCAGG"))
df <- data.frame("time" = coin_daily$Data$time, "close" = coin_daily$Data$close,"open" = coin_daily$Data$open,"high" = coin_daily$Data$high,"low" = coin_daily$Data$low,"volumefrom" = coin_daily$Data$volumefrom,"volumeto" = coin_daily$Data$volumeto)
# Save plot output to folder
# Set path, set png plot size
# ggplot 2 to create line plot 
mytitle = paste0(next.coin)
graphics.off() # Close graphic device before next plot
p <- ggplot(df, aes(time)) +
  theme_classic()+
  geom_line(aes(y=close), colour="red") +
  ggtitle(mytitle, subtitle = "") +
  labs(x="Time",y="Close")+
  theme(plot.title = element_text(hjust=0.5),plot.subtitle =element_text(hjust=0.5))
ggsave(path="D:/R Projects/Final Scripts/BitCoin Scripts/daily_plot",paste(next.coin,".png"))
# Save each file to .csv
write.csv(df,paste0(file="D:/R Projects/Final Scripts/BitCoin Scripts/data/",next.coin,".csv"))
ptm0 <- proc.time()
Sys.sleep(0.1)  
ptm1=proc.time() - ptm0
time=as.numeric(ptm1[3])
cat('\n','Iteration',i,'took', time, "seconds to complete")
  }, error = function(e) { print(paste("i =", i, "failed:")) })
}

Tail Risk – Downside Analysis (S&P500)

Tail Risk. The enemy for long term investing with leverage.

Let's express the 1 day returns in a box plot.

Rplot78

# Maximum Loss and Gain 
min(df$spx.clret)
max(df$spx.clret)
# Location of min/max
min.loc <- which.min(df$spx.clret)
max.loc <- which.max(df$spx.clret)
# Date of Min 
min <- df$Date[min.loc]
max <- df$Date[max.loc]
min
max
> min
[1] "1987-10-19"
> max
[1] "1933-03-15"
> min(df$spx.clret)
[1] -0.2289973
> max(df$spx.clret)
[1] 0.1536613

The maximum one day loss was approximately -22.9% (-0.2289973), which occurred on October 19th 1987, the stock market crash. The largest 1 day gain was approximately +15.4% (0.1536613), which occurred on March 15th 1933.

If using leverage we obtain a 1 day loss of approximately -45.8% (2x leverage) or -68.7% (3x leverage). These one day ‘shocks’ are what is known as tail risk. It means that one could be down by the above magnitude in a single trading day.

If we check the frequency of 1 day losses exceeding ‘n’ loss %:

lsoses

Rplot88

Again, if using leverage, these losses multiply by their leverage factor. Since 1928, there were for example 8 occurrences of one day losses greater than 20%. With 3x leverage such a one day loss increases to greater than 60%.
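
A hedged sketch of how such a frequency count might be produced, assuming the df$spx.clret return column from the earlier snippet:

# Count one day losses worse than each threshold, unleveraged and at 3x leverage
thresholds <- c(-0.05, -0.10, -0.15, -0.20)
sapply(thresholds, function(t) sum(df$spx.clret <= t, na.rm = TRUE))      # unleveraged
sapply(thresholds, function(t) sum(3 * df$spx.clret <= t, na.rm = TRUE))  # 3x leverage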

Furthermore, we also face bear market declines which again pose another issue when holding long term with leverage.

dd

Inclusive of the 1929 bear market, peak to trough draw downs can be in excess of 50%. The top of 2008 to the bottom in 2009 is around 56%. Again, if we buy and hold a leveraged product, we will see declines 2x or 3x this amount.

So these are the ‘issues’ with buying and holding with leverage. Is there a solution?

One method of exploration is to use a simple moving average as a low pass filter. Using monthly bars, a simple moving average tends to reduce overall volatility.

Rplot83

Areas in red are cash periods or periods below the 12sma.
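
A minimal sketch of the monthly 12 sma filter, assuming a data frame named monthly with a Close column of month end prices (the column names and the use of TTR::SMA are assumptions):

# Monthly 12 sma low pass filter: long above the sma, cash below
library(TTR)
monthly$sma12  <- SMA(monthly$Close, n = 12)
monthly$ret    <- c(0, diff(monthly$Close) / head(monthly$Close, -1))  # monthly close to close returns
monthly$signal <- ifelse(monthly$Close > monthly$sma12, 1, 0)          # 1 = invested, 0 = cash
monthly$signal[is.na(monthly$signal)] <- 0                             # treat the sma warm up period as cash
monthly$signal <- c(0, head(monthly$signal, -1))                       # act on the signal the following month
monthly$strat  <- monthly$signal * monthly$ret
monthly$equity <- cumprod(1 + monthly$strat)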

The below plot shows the maximum draw down for any given simple moving average look back from 1928 to present day. The strategy moves the investor to cash when price is below the simple moving average:

Rplot80

Over the same period, buy and hold draw down is approximately 89.8%. Depending on the sma used, we reduce draw downs to roughly the 40 to 50% area.

The optimal monthly simple moving average ranked by cumulative returns 1928 to present day.

Rplot79

If we perhaps consider a ‘modern market’, 1950 onwards:

Overall draw down reduces from the buy and hold figure of approximately 56.9% to the 25/30% area, depending on the simple moving average used.

Rplot81

And the optimal look back for the monthly simple moving average:

Rplot82

The message is pretty clear. Using a simple moving average as a ‘low pass filter’ often reduces the volatility of simple buy and hold. This strategy does well to avoid both bear market declines, post 1999 and post 2007.

The overall results of the monthly 12 sma, 1988 to present:

One day shocks like the 1987 stock crash can yield a 1 day draw down, using 2x or 3x leverage, of up to approximately 46% or 69% respectively.

Post 1987 stock crash we see the maximum draw downs for varying simple moving averages. Note that short term simple moving averages remain invested during bear market declines:

Rplot86

The question going forward: are one day shocks like the 1987 stock crash possible again in the future? Is there anything structural that can change this? Does HFT provide liquidity or will it evaporate during times of distress?

A maximum draw down of 19% since 1987 using the low pass filter approach seems fairly easy to bear. One could, from 1987 to present, use a 2x or 3x leverage factor and increase gains SUBSTANTIALLY whilst matching a typical buy and hold draw down of 56%.

Leverage and risk is the trade off. If one believes the next 40 years will be as rosy as the previous, one may wish to take this bet. Not to mention, will the S&P500 even remain the ‘king’ market and continue to provide the same level of returns as the past 100 years?

No one knows 🙂 If the risk profile suits, then that’s up to you. Otherwise, a normal buy and hold investing approach using a simple moving average filter will avoid large negative compounding periods.

Long over 12 monthly sma:

Rplot87

The frequency of heavy 1 day losses reduces versus buy and hold. This is partly due to avoiding the majority of bear market declines.

Rplot89

I believe the message is this: regardless of what you google about holding 3x leveraged products, it should be studied whether the ETF is built soundly, i.e. has minimal tracking error, as there are many ETFs with poor design.

Next, are you comfortable with, or can you stomach, the sickening draw downs? There will no doubt be spells of volatility and this is likely to continue.

Perhaps there are LESS VOLATILE instruments with smoother equity curves, or perhaps less volatile strategies with smoother equity curves. With little volatility one may add a leverage multiplier at will.

Basic Data Handling With Python

I have recently been looking for live execution platforms for placing a strategy into production. I have found zipline for Python, and with the intention of using zipline as a live execution platform I figured it would be prudent to pick up some Python.

The learning curve moving from R to Python doesn't look that steep, and in this post I will cover some basic data handling using Python. We will:

1. Load Data, view data, check formats, convert integer date to Date format
2. Plot price series
3. Calculate cumulative returns / Annualized return, Annualized sharpe ratio
4. Calculate rolling mean, standard deviation and z-score
5. Perform linear regression by preparing two series, joining two data sets by common date and running the ordinary least squares regression

# Python Code For General working of data
# Andrew Bannerman 11.4.2017

# Import library
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.graphics.regressionplots import abline_plot  # For plotting regression line

# Read Data
df = pd.read_csv("Your Data Dir/MASTER_DATA_DUMP/SPY.csv")

# Print First 10 Lines Of Data
print(df.head(10))

# Print First 10 Rows Of Specific Data Frame Column
print(df.Close.head(10))

# Print Last 10 Lines of Data
print(df.tail(10))

# Print Last 10 Lines of Specific Data Frame Column
print(df.Close.tail(10))

# Check Data Frame Types
print(type(df))

# Check Type Of 1 Column
print(df['Close'].dtype)
print(df['Date'].dtype)

# Convert yyyymmdd to date format
df['Date'] = pd.to_datetime(df['Date'].astype(str), format ='%Y%m%d')
print(df['Date'].dtype) # Check date is in date format
print(df.Date.head(10)) # Print head of date column

# Create Plot Of SPY Close Prices
df.plot(x='Date',y='Close')
plt.xlabel('Date')
plt.ylabel('Close Price')
plt.title('SPY Close 1993 To Present')
plt.legend(loc='upper left')
plt.show()

Figure_1

# Calculate % Differences of Close Column
# Calculate Cumulative Return
# use Dataframe.shift(1) to move series by -1 (yesterday)
df['ret'] = (df.Close - df.Close.shift(1)) / df.Close.shift(1)
df['cum_ret'] = (1 + df.ret).cumprod()  # Cumulative return
print(df.cum_ret.iloc[-1])  # Print last row of cumulative returns (use iloc[-1])

# Plot Cumulative Returns
df.plot(x='Date',y='cum_ret')
plt.xlabel('Date')
plt.ylabel('Cumulative Returns')
plt.title('SPY Cumulative Returns')
plt.legend(loc='upper left')
plt.show()

Figure_1-1

# Calculate Annualized Sharpe Ratio
sharpe_ratio = np.sqrt(252) * (df.ret.mean() / df.ret.std())
print("Annualized Share Ratio is",sharpe_ratio)

# Calculate Annualized Return
time_between = (df.Date.iloc[-1] - df.Date.iloc[0]).days
print(time_between)
cagr = (df.Close.iloc[-1] / df.Close.iloc[0]) ** (365/time_between) - 1
print("Annualized Return is",cagr)

# Calculate Maximum Draw Down
lookback = 252
rolling_max = df['Close'].rolling(lookback, min_periods=1).max()
daily_drawdown = df['Close']/rolling_max - 1
daily_drawdown.plot()
plt.ylabel('Draw Down %')
plt.title('252 Day Rolling Draw Down')
plt.legend(loc='upper left')
plt.show()

Figure_1-2

# Calculate Rolling Statistics
# Calculate Rolling Z-Score
lookback = 12
df['moving_average'] = df.Close.rolling(lookback).mean()  # Rolling Moving Average
df['rolling_std_dev'] = df.Close.rolling(lookback).std()  # Rolling stdev
df['rolling_zscore'] = (df.Close - df.moving_average) / df.rolling_std_dev  # Zscore

# Plot Rolling Mean And Stdev
df.plot(x='Date', y=['Close','moving_average'])
plt.title('SPY Close With Moving Average')
plt.show()
df.plot(x='Date', y=['rolling_zscore'])
plt.title('N Period Look back Rolling Zscore')
plt.show()

Figure_1-3

Figure_1-4

# Linear Regression Two Series
SPX = pd.read_csv("Your Data Dir/MASTER_DATA_DUMP/$SPX.csv")
VIX = pd.read_csv("Your Data Dir/MASTER_DATA_DUMP/$VIX.csv")

# Prepare For Join By Common Date
# Convert yyyymmdd to date format
SPX['Date'] = pd.to_datetime(SPX['Date'].astype(str), format ='%Y%m%d')
VIX['Date'] = pd.to_datetime(VIX['Date'].astype(str), format ='%Y%m%d')

# Join Two Data Series By Common Date
merge_df = pd.merge(left=SPX, left_on='Date', right=VIX, right_on='Date')
print(merge_df.tail(10))

# Scatter Plot of SPX and VIX
merge_df.plot(x='Close_x',y='Close_y', kind='scatter')
plt.title('Scatter Plot VIX and SPX')
plt.show()

Figure_1_5

# Perform Linear Regression On Close Prices
model = sm.OLS(merge_df.Close_y, merge_df.Close_x)
results = model.fit()
print(results.summary())

ols

Pairs Trading – Testing For Cointegration, ADF (unit root), Johansen Test & Half Life of Mean Reversion

Note: I back test this 100% incorrectly. I also wish to add scaling of the independent variable per the hedge ratio correctly. I am leaning towards working in share prices and will update accordingly. What is correct is all the statistical testing; however, the back test is 100% incorrect. I also run the Johansen test on the stationary series obtained from the hedge ratio through linear regression, which is an unnecessary step, as we can use the hedge ratio from linear regression to scale the independent variable position sizes. I will get to the revisions when I can.

A statistical arbitrage opportunity exists between a cointegrated pair when the price relationship between the pair moves out of equilibrium. This short term disequilibrium creates a trading opportunity where one enters a short position in one series and an equally weighted long position in the other series. The aim is that the two series will eventually converge and the trade can be closed out with a profit.

Essentially, between two cointegrated pairs the spread is mean reverting. The time of mean reversion is one factor to consider. How long will it take for the period of disequilibrium to revert to the mean, and will it ever revert to the mean? Cointegration can and does break down. So one question to answer would be: how do I know if the trade will converge, and when is it time to exit with a loss?

The half life of mean reversion, which is calculated on the spread, can help answer this question. If historically a pair of stocks A and B has been shown to mean revert within ‘n’ days, and this threshold is exceeded by a large margin, OR the trade meets a pre-determined stop loss %, OR a stop loss measured in standard deviations, the trade can be closed out with a loss.

A pair of stocks can each have a linear trend; however, we can form a stationary series by estimating the hedge ratio with ordinary least squares regression. In practice we run two OLS regressions: one with stock A as the independent variable and one with stock B as the independent variable. We then pick the regression with the most significant test statistic.
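
A hedged sketch of that selection step, using the urca package and the EWA_Close / EWC_Close column names that appear in the code below, and picking the direction whose spread gives the more negative ADF statistic:

# Run OLS both ways, build each spread, and compare ADF test statistics
library(urca)
ols_a <- lm(EWC_Close ~ EWA_Close + 0, data = df)  # EWA as independent variable
ols_b <- lm(EWA_Close ~ EWC_Close + 0, data = df)  # EWC as independent variable
spread_a <- df$EWC_Close - coef(ols_a)[1] * df$EWA_Close
spread_b <- df$EWA_Close - coef(ols_b)[1] * df$EWC_Close
adf_a <- ur.df(spread_a, type = "drift", lags = 1)
adf_b <- ur.df(spread_b, type = "drift", lags = 1)
c(EWA_indep = adf_a@teststat[1], EWC_indep = adf_b@teststat[1])  # choose the more negative statistic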

First let's get a feel for the data and plot the closing prices of EWA and EWC. I want to use these two series and this time period to see if we can replicate Ernie Chan's study in his book, Algorithmic Trading: Winning Strategies and Their Rationale.

Rplot72
EWA and EWC visually appear to have a high correlation. The Pearson correlation coefficient is 0.9564724. If two series are correlated it essentially means they move in the same direction most of the time. However, correlation tells you nothing about the nature of the spread between the two series, or whether they will converge (mean revert) or continue to diverge. With cointegration, we form a linear relationship between stock A and stock B where the distance between A and B is somewhat fixed and is estimated to mean revert. In essence, we measure the ‘distance’ between A and B in terms of standard deviations and a trade would be taken when ‘n’ standard deviations is met. A stop loss may also be placed outwith the normal sigma range, ensuring the trade is exited if the relationship breaks down.

A scatter plot shows the variation around the ols regression fit:

# Scatter plot of price series
# plot x,y
plot(df$EWA_Close, df$EWC_Close, main="Scatterplot EWA and EWC With OLS Regression Fit",
     xlab="EWA Close", ylab="EWC Close ", pch=19)
abline(lm(df$EWC_Close~df$EWA_Close), col="red") # regression line (y~x)

Rplot71

We perform ordinary least squares regression using EWA as the independent variable (x) and EWC as the dependent variable (y). Note we set the intercept to 0 so that the regression fit provides only the hedge ratio between EWA and EWC.


# Ordinary least squares regression
# EWA independant
ols <- lm(df$EWC_Close~df$EWA_Close+0) # regression line (y~x)
summary(ols)
beta <- coef(ols)[1]

resid <- as.data.frame(df$EWC_Close - beta * df$EWA_Close)
resid <- data.frame(resid,"Date" = df$Date)
colnames(resid)[1] <- "residuals"
plot(resid$Date,resid$residual, col = 'navyblue', xlab = 'Date', ylab = 'residuals', type="l", main=" EWA,EWC Residuals - EWA Independant Variable - No Intercept",cex.main=1)

Rplot75

In the above, we used ordinary least squares regression to estimate the hedge ratio (beta) and we form the spread by subtracting beta * EWA close from EWC close. Note we use the closing prices and not the daily returns. By using the closing prices we form a hedge ratio that provides us with the number of shares to go long/short for each ETF.

The augmented Dickey-Fuller test is used to determine whether a pair is non-stationary or stationary (trending or non-trending). The purpose of the test is to reject the null hypothesis of a unit root. If we have a unit root it means we have a non-stationary series. We wish to have a stationary series so that we may trade it efficiently using mean reversion.


# urca ADF test
library(urca)
  summary(ur.df(resid$residual, type = "drift", lags = 1))

###############################################
# Augmented Dickey-Fuller Test Unit Root Test #
############################################### 

Test regression drift 

Call:
lm(formula = z.diff ~ z.lag.1 + 1 + z.diff.lag)

Residuals:
     Min       1Q   Median       3Q      Max
-1.17542 -0.17346 -0.00447  0.16173  1.27800 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.0006657  0.0072252   0.092 0.926599
z.lag.1     -0.0176743  0.0053087  -3.329 0.000892 ***
z.diff.lag  -0.1751998  0.0254683  -6.879 8.82e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2781 on 1495 degrees of freedom
Multiple R-squared:  0.04101,	Adjusted R-squared:  0.03973
F-statistic: 31.97 on 2 and 1495 DF,  p-value: 2.548e-14

Value of test-statistic is: -3.3293 5.5752 

Critical values for test statistics:
      1pct  5pct 10pct
tau2 -3.43 -2.86 -2.57
phi1  6.43  4.59  3.78

The ADF test statistic is -3.3293, which is below the 5% critical value of -2.86. We may therefore reject the unit root hypothesis with 95% confidence, i.e. we have a stationary relationship. If we flip the independent variable from EWA to EWC and re-run OLS and the augmented Dickey-Fuller test, we obtain a test statistic of -3.3194, which is also below the 5% critical value of -2.86, again giving 95% confidence that the series is stationary. In this case we choose the most negative / most statistically significant test statistic, which is the one with EWA as the independent variable.

Next, to determine the number of shares we need to purchase / short, we may use the Johansen test. The eigenvectors determine the weights for each ETF.

# Johansen test
# Eigen vectors are the number of shares
# Test is cointegration test
coRes=ca.jo(data.frame(df$EWA_Close, df$EWC_Close),type="trace",K=2,ecdet="none", spec="longrun")
summary(coRes)
slot(coRes,"V") # The matrix of eigenvectors, normalised with respect to the first variable.

######################
# Johansen-Procedure #
###################### 

Test type: trace statistic , with linear trend 

Eigenvalues (lambda):
[1] 0.010146673 0.002667727

Values of teststatistic and critical values of test:

          test 10pct  5pct  1pct
r <= 1 |  4.00  6.50  8.18 11.65
r = 0  | 19.28 15.66 17.95 23.52

Eigenvectors, normalised to first column:
(These are the cointegration relations)

                df.EWA_Close.l2 df.EWC_Close.l2
df.EWA_Close.l2       1.0000000       1.0000000
df.EWC_Close.l2      -0.7772904       0.6080793

Weights W:
(This is the loading matrix)

               df.EWA_Close.l2 df.EWC_Close.l2
df.EWA_Close.d     0.004309716    -0.003029255
df.EWC_Close.d     0.028971878    -0.002649008

The eigenvectors form the weights for each ETF. Simply put, if we allocate $10,000 to each ETF, we go long EWA with 10,000 * 1 / EWA close shares, and we take a position of 10,000 * -0.7772904 / EWC close shares in EWC (the negative sign indicating a short). Thus our weightings form a stationary portfolio.
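
For illustration, translating the eigenvector weights into share counts might look like the sketch below. The $10,000 allocation is the example figure from the text; rounding to whole shares is an assumption.

# Convert the normalised eigenvector into long/short share counts
capital <- 10000
ewa_shares <- round(capital *  1.0000000 / tail(df$EWA_Close, 1))   # long EWA
ewc_shares <- round(capital * -0.7772904 / tail(df$EWC_Close, 1))   # short EWC (negative = short)
ewa_shares; ewc_shares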

In order to determine how long the cointegrated relationship takes to revert to the mean, we can regress the daily change of the spread against the previous day's spread minus its mean; the half life of mean reversion is then -log(2) divided by the regression coefficient.

 

# Calculate half life of mean reversion (residuals)
# Calculate yt-1 and (yt-1-yt)
# pull residuals to a vector
residuals <- c(resid$residual)
y.lag <- c(residuals[2:length(residuals)], 0) # Set vector to lag -1 day
y.lag <- y.lag[1:length(y.lag)-1] # As shifted vector by -1, remove anomalous element at end of vector
residuals <- residuals[1:length(residuals)-1] # Make vector same length as vector y.lag
y.diff <- residuals - y.lag # Subtract todays close from yesterdays close
y.diff <- y.diff [1:length(y.diff)-1] # Make vector same length as vector y.lag
prev.y.mean <- y.lag - mean(y.lag) # Subtract yesterdays close from the mean of lagged differences
prev.y.mean <- prev.y.mean [1:length(prev.y.mean )-1] # Make vector same length as vector y.lag
final.df <- data.frame(y.diff,prev.y.mean) # Create final data frame

# Linear Regression With Intercept
result <- lm(y.diff ~ prev.y.mean, data = final.df)
half_life <- -log(2)/coef(result)[2]
half_life

# Linear Regression With No Intercept
result = lm(y.diff ~ prev.y.mean + 0, data = final.df)
half_life1 = -log(2)/coef(result)[1]
half_life1

# Print general linear regression statistics
summary(result)

The half life of mean reversion is 31.62271 days.

We will now form a trading strategy and calculate a rolling z-score over the residuals with a look back set equal to the obtained half life.

# Make z-score of residuals
# Set look back equal to half life
colnames(resid)
resid$mean <- SMA(resid[,"residuals"], round(half_life))
resid$stdev <- runSD(resid[,"residuals"], n = round(half_life), sample = TRUE, cumulative = FALSE)
resid$zscore <- apply(resid[,c('residuals', 'mean','stdev')], 1, function(x) { (x[1]-x[2]) / x[3] })

# Convert all NA to 0
resid[is.na(resid)] <- 0

# Place residuals data frame to original (df)
df <- data.frame(df,resid)

# Plot z-score of residuals
plot(resid$zscore,type="l", main="Rolling Z-Score - Look Back Set Equal To Half Life")

Rplot77

Note the range of standard deviations for the spread is between -3 and 3.

We can now perform a back test to see how well the strategy would have done. For the sake of simplicity we will not perform in-sample optimization with out of sample testing. We will also not consider the share amounts of the weightings or commissions.

###################
# Back Test Script
###################

# Create Trading Logic
# When z-score crossed below 0 buy EWA and sell short EWC per eigenvectors
df$enter.long <- ifelse(df$zscore <= 0, 1,0)   # for long EWA below mean
df$exit.long <- ifelse(df$zscore == 0, 1,0)    # for long EWA below mean
df$enter.short <- ifelse(df$zscore <= 0 , -1,0) # for short EWC below mean
df$exit.short <- ifelse(df$zscore == 0 , -1,0)  # for short EWC below mean

df$enter.short.1 <- ifelse(df$zscore >= 0, -1,0)   # for short EWA above mean
df$exit.short.1 <- ifelse(df$zscore == 0, -1,0)    # for short EWA above mean
df$enter.long.1 <- ifelse(df$zscore >= 0 , 1,0) # for long EWC above mean
df$exit.long.1 <- ifelse(df$zscore == 0 , 1,0)  # for long EWC above mean

# Calculate close to close returns
df$ewa.clret <- ROC(df$EWA_Close, type = c("discrete"))
df$ewa.clret[1] <- 0
df$ewc.clret <- ROC(df$EWC_Close, type = c("discrete"))
df$ewc.clret[1] <- 0

# Long EWA below mean
df <- df %>%
  dplyr::mutate(ewa.long = ifelse(enter.long ==  1, 1,
                                  ifelse(exit.long == 1, 0, 0)))

# Short EWC above mean
df <- df %>%
  dplyr::mutate(ewc.short = ifelse(enter.short ==  -1, 1,
                                  ifelse(exit.short == -1, 0, 0)))

# Short EWA above mean
df <- df %>%
  dplyr::mutate(ewa.short = ifelse(enter.short.1 ==  -1, 1,
                                  ifelse(exit.short.1 == 1, 0, 0)))

# Long EWC above mean
df <- df %>%
  dplyr::mutate(ewc.long = ifelse(enter.long.1 ==  1, 1,
                                   ifelse(exit.long.1 == -1, 0, 0)))

# lag signal by one forward day to signal entry next day
df$ewa.long <- lag(df$ewa.long,1) # Note k=1 implies a move *forward*
df$ewc.short <- lag(df$ewc.short,1) # Note k=1 implies a move *forward*
df$ewa.short <- lag(df$ewa.short,1) # Note k=1 implies a move *forward*
df$ewc.long <- lag(df$ewc.long,1) # Note k=1 implies a move *forward*

# Calculate equity curves
df$equity.long <- apply(df[,c('ewa.long', 'ewa.clret')], 1, function(x) { x[1]*x[2] })
df$equity.short <- apply(df[,c('ewc.short', 'ewc.clret')], 1, function(x) { x[1]*x[2]})
df$equity.long.1 <- apply(df[,c('ewa.short', 'ewa.clret')], 1, function(x) { x[1]*x[2] })
df$equity.short.1 <- apply(df[,c('ewc.long', 'ewc.clret')], 1, function(x) { x[1]*x[2]})

# Combine signals
df$combined <- apply(df[,c('equity.long', 'equity.short','equity.long.1','equity.short.1')], 1, function(x) { x[1]+x[2]+x[3]+x[4] })

# Pull select columns from data frame to make XTS whilst retaining format
xts1 = xts(df$equity.long, order.by=as.POSIXct(df$Date, format="%Y-%m-%d"))
xts2 = xts(df$equity.short, order.by=as.POSIXct(df$Date, format="%Y-%m-%d"))
xts3 = xts(df$equity.long.1, order.by=as.POSIXct(df$Date, format="%Y-%m-%d"))
xts4 = xts(df$equity.short.1, order.by=as.POSIXct(df$Date, format="%Y-%m-%d"))
xts5 = xts(df$combined, order.by=as.POSIXct(df$Date, format="%Y-%m-%d")) 

# Join XTS together
compare <- cbind(xts1,xts2,xts3,xts4,xts5)

# Use the PerformanceAnalytics package for trade statistics
colnames(compare) <- c("EWA Long","EWC Short","EWA Short","EWC Long","Combined")
charts.PerformanceSummary(compare,main="EWC,EWA Pair", wealth.index=TRUE, colorset=rainbow12equal)
performance.table <- rbind(table.AnnualizedReturns(compare), maxDrawdown(compare), CalmarRatio(compare),table.DownsideRisk(compare))
drawdown.table <- rbind(table.Drawdowns(compare))
#dev.off()
#logRets <- log(cumprod(1+compare))
#chart.TimeSeries(logRets, legend.loc='topleft', colorset=rainbow12equal)

print(performance.table)
print(drawdown.table)

Rplot76

One thing to note is that for the whole sample period, from April 26th 2006 to April 9th 2012, we use the exact same hedge ratio to form the stationary series. This is likely not the correct procedure; we also use the full sample to estimate the half life of mean reversion. As we expect the relationship between series A and B to change over time, I believe it would be better practice to use a fixed rolling window to estimate the hedge ratio and the mean reversion half life. When a new signal is generated it would then use the hedge ratio / half life derived from the rolling look back period.
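
As a rough sketch of that idea (the 250 day window length is an assumption, not a tested value):

# Rolling hedge ratio over a fixed look back window
window <- 250
df$roll.beta <- NA
for (i in window:nrow(df)) {
  idx <- (i - window + 1):i
  fit <- lm(EWC_Close ~ EWA_Close + 0, data = df[idx, ])
  df$roll.beta[i] <- coef(fit)[1]
}
df$roll.spread <- df$EWC_Close - df$roll.beta * df$EWA_Close  # spread formed with the rolling beta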

It should also be noted that we do not quite match Ernie Chan's test from his book. I am using Alpha Vantage as a source of price data and there is a difference in price between Ernie's EWA and the Alpha Vantage EWA. There are also differences between the urca Johansen test and the one Ernie uses from the jplv7 package. I will attempt to address these issues in upcoming posts.

This serves as a good introduction for me and others to the topic of pairs trading.

 

Full R code below. 

# Test for cointegration
# Download daily price data from Alpha Vantage API
# Plot two series
# Perform Linear Regression
# Plot Residuals
# ADF test for unit root
# Half life of mean reversion
# zscore look back equal to half life
# back test long stock A, short Stock B
# Andrew Bannerman 11.1.2017

require(alphavantager)
require(urca)
require(lattice)
require(lubridate)
require(dplyr)
require(magrittr)
require(TTR)
require(zoo)
require(data.table)
require(xts)
require(PerformanceAnalytics)

# Download price series from Alpha Vantage
# install.packages('alphavantager')
# Key
#https://www.alphavantage.co/query?function=TIME_SERIES_DAILY_ADJUSTED&symbol=EWA&apikey=6RSYX9BPXKZVXUS9&outputsize=full&datatype=csv
av_api_key("your_api_key")
print(av_api_key())
#EWC <- av_get(symbol = "GLD", av_fun = "TIME_SERIES_DAILY_ADJUSTED", outputsize = "full")
#EWA <- av_get(symbol = "SLV", av_fun = "TIME_SERIES_DAILY_ADJUSTED", outputsize = "full")
#bit coin EWC <- av_get(symbol = "BTC", av_fun = "DIGITAL_CURRENCY_DAILY", outputsize = "full", market="CNY")

EWA <- read.csv("D:/R Projects/Final Scripts/Cointegration Scripts/data/daily_adjusted_EWA.csv", header=TRUE,stringsAsFactors=FALSE)
EWC <- read.csv("D:/R Projects/Final Scripts/Cointegration Scripts/data/daily_adjusted_EWC.csv", header=TRUE,stringsAsFactors=FALSE)
str(EWA)
# Convert timestamp to Date format
EWA$timestamp <- ymd(EWA$timestamp)
EWC$timestamp <- ymd(EWC$timestamp)

# Sort
EWA <- arrange(EWA,timestamp)
EWC <- arrange(EWC,timestamp)

# Merge Data Frames By date
df <- merge(EWA,EWC, by='timestamp')
head(df)
# Rename Columns
colnames(df)[1] <- "Date"
colnames(df)[6] <- "EWA_Close"
colnames(df)[14] <- "EWC_Close"

# Subset by date
df <- subset(df, Date >= as.POSIXct("2006-04-26") & Date <= as.POSIXct("2012-04-09"))  

# Scatter plot of price series
# plot x,y
plot(df$EWA_Close, df$EWC_Close, main="Scatterplot EWA and EWC With OLS Regression Fit",
     xlab="EWA Close", ylab="EWC Close ", pch=19)
abline(lm(df$EWC_Close~df$EWA_Close), col="red") # regression line (y~x)

# Line plot of series
require(ggplot2)
require(scales)

ggplot(df, aes(Date)) +
  theme_classic()+
  geom_line(aes(y=EWA_Close), colour="red") +
  geom_line(aes(y=EWC_Close), colour="blue") +
  scale_x_date(breaks = date_breaks("years"), labels = date_format("%Y"))+
  ggtitle("EWC, EWC Close", subtitle = "2006-04-26  to 2012-04-09") +
  labs(x="Year",y="EWA,EWC Close")+
  theme(plot.title = element_text(hjust=0.5),plot.subtitle =element_text(hjust=0.5))+
  annotate("text", label = "EWA", x = as.Date("2006-04-26"), y = 22, color = "blue")+
  annotate("text", label = "EWC", x = as.Date("2006-04-26"), y = 15, color = "red")

# Find corrlation coefficient
# method = pearson, spearman, kendall
EWA.EWC.cor <- cor(df$EWA_Close, df$EWC_Close, use="complete.obs", method="pearson") 

# Ordinary least squares regression
# EWA independant
ols <- lm(df$EWC_Close~df$EWA_Close+0) # regression line (y~x)
summary(ols)
beta <- coef(ols)[1]

resid <- as.data.frame(df$EWC_Close - beta * df$EWA_Close)
resid <- data.frame(resid,"Date" = df$Date)
colnames(resid)[1] <- "residuals"
plot(resid$Date,resid$residual, col = 'navyblue', xlab = 'Date', ylab = 'residuals', type="l", main=" EWA,EWC Residuals - EWA Independant Variable - No Intercept",cex.main=1)

# urca ADF test
library(urca)
  summary(ur.df(resid$residual, type = "drift", lags = 1))

# Johansen test
# Eigen vectors are the number of shares
# Test is cointegration test
coRes=ca.jo(data.frame(df$EWA_Close, df$EWC_Close),type="trace",K=2,ecdet="none", spec="longrun")
summary(coRes)
slot(coRes,"V") # The matrix of eigenvectors, normalised with respect to the first variable.
slot(coRes,"Vorg") # The matrix of eigenvectors, such that \hat V'S_{kk}\hat V = I.

# Calculate half life of mean reversion (residuals)
# Calculate yt-1 and (yt-1-yt)
# pull residuals to a vector
residuals <- c(resid$residual)
y.lag <- c(residuals[2:length(residuals)], 0) # Set vector to lag -1 day
y.lag <- y.lag[1:length(y.lag)-1] # As shifted vector by -1, remove anomalous element at end of vector
residuals <- residuals[1:length(residuals)-1] # Make vector same length as vector y.lag
y.diff <- residuals - y.lag # Subtract todays close from yesterdays close
y.diff <- y.diff [1:length(y.diff)-1] # Make vector same length as vector y.lag
prev.y.mean <- y.lag - mean(y.lag) # Subtract yesterdays close from the mean of lagged differences
prev.y.mean <- prev.y.mean [1:length(prev.y.mean )-1] # Make vector same length as vector y.lag
final.df <- data.frame(y.diff,prev.y.mean) # Create final data frame

# Linear Regression With Intercept
result <- lm(y.diff ~ prev.y.mean, data = final.df)
half_life <- -log(2)/coef(result)[2]
half_life

# Linear Regression With No Intercept
result = lm(y.diff ~ prev.y.mean + 0, data = final.df)
half_life1 = -log(2)/coef(result)[1]
half_life1

# Print general linear regression statistics
summary(result)

# Make z-score of residuals
# Set look back equal to half life
colnames(resid)
half_life = 31
resid$mean <- SMA(resid[,"residuals"], round(half_life))
resid$stdev <- runSD(resid[,"residuals"], n = round(half_life), sample = TRUE, cumulative = FALSE)
resid$zscore <- apply(resid[,c('residuals', 'mean','stdev')], 1, function(x) { (x[1]-x[2]) / x[3] })

# Convert all NA to 0
resid[is.na(resid)] <- 0

# Place residuals data frame to original (df)
df <- data.frame(df,resid)

# Plot z-score of residuals
plot(resid$zscore,type="l", main="Rolling Z-Score - Look Back Set Equal To Half Life", col="darkblue")

###################
# Back Test Script
###################

# Create Trading Logic
# When z-score crosses below -0.7, buy EWA and sell short EWC per the eigenvectors
df$enter.long <- ifelse(df$zscore <= -0.7, 1,0)   # for long EWA below mean
df$exit.long <- ifelse(df$zscore == 0, 1,0)    # for long EWA below mean
df$enter.short <- ifelse(df$zscore <= -0.7 , -1,0) # for short EWC below mean
df$exit.short <- ifelse(df$zscore == 0 , -1,0)  # for short EWC below mean

df$enter.short.1 <- ifelse(df$zscore >= 0.7, -1,0)   # for short EWA above mean
df$exit.short.1 <- ifelse(df$zscore == 0, -1,0)    # for short EWA above mean
df$enter.long.1 <- ifelse(df$zscore >= 0.7 , 1,0) # for long EWC above mean
df$exit.long.1 <- ifelse(df$zscore == 0 , 1,0)  # for long EWC above mean

# Calculate close to close returns
df$ewa.clret <- ROC(df$EWA_Close, type = c("discrete"))
df$ewa.clret[1] <- 0
df$ewc.clret <- ROC(df$EWC_Close, type = c("discrete"))
df$ewc.clret[1] <- 0

# Long EWA below mean
df <- df %>%
  dplyr::mutate(ewa.long = ifelse(enter.long ==  1, 1,
                                  ifelse(exit.long == 1, 0, 0)))

# Short EWC above mean
df <- df %>%
  dplyr::mutate(ewc.short = ifelse(enter.short ==  -1, 1,
                                  ifelse(exit.short == -1, 0, 0)))

# Short EWA above mean
df <- df %>%
  dplyr::mutate(ewa.short = ifelse(enter.short.1 ==  -1, 1,
                                  ifelse(exit.short.1 == 1, 0, 0)))

# Long EWC above mean
df <- df %>%
  dplyr::mutate(ewc.long = ifelse(enter.long.1 ==  1, 1,
                                   ifelse(exit.long.1 == -1, 0, 0)))

# lag signal by one forward day to signal entry next day
df$ewa.long <- lag(df$ewa.long,1) # Note k=1 implies a move *forward*
df$ewc.short <- lag(df$ewc.short,1) # Note k=1 implies a move *forward*
df$ewa.short <- lag(df$ewa.short,1) # Note k=1 implies a move *forward*
df$ewc.long <- lag(df$ewc.long,1) # Note k=1 implies a move *forward*

# Calculate equity curves
df$equity.long <- apply(df[,c('ewa.long', 'ewa.clret')], 1, function(x) { x[1]*x[2] })
df$equity.short <- apply(df[,c('ewc.short', 'ewc.clret')], 1, function(x) { x[1]*x[2]})
df$equity.long.1 <- apply(df[,c('ewa.short', 'ewa.clret')], 1, function(x) { x[1]*x[2] })
df$equity.short.1 <- apply(df[,c('ewc.long', 'ewc.clret')], 1, function(x) { x[1]*x[2]})

tail(df,500)

# Combine signals
df$combined <- apply(df[,c('equity.long', 'equity.short','equity.long.1','equity.short.1')], 1, function(x) { x[1]+x[2]+x[3]+x[4] })

# Pull select columns from data frame to make XTS whilst retaining format
xts1 = xts(df$equity.long, order.by=as.POSIXct(df$Date, format="%Y-%m-%d"))
xts2 = xts(df$equity.short, order.by=as.POSIXct(df$Date, format="%Y-%m-%d"))
xts3 = xts(df$equity.long.1, order.by=as.POSIXct(df$Date, format="%Y-%m-%d"))
xts4 = xts(df$equity.short.1, order.by=as.POSIXct(df$Date, format="%Y-%m-%d"))
xts5 = xts(df$combined, order.by=as.POSIXct(df$Date, format="%Y-%m-%d")) 

# Join XTS together
compare <- cbind(xts1,xts2,xts3,xts4,xts5)

# Use the PerformanceAnalytics package for trade statistics
colnames(compare) <- c("EWA Long","EWC Short","EWA Short","EWC Long","Combined")
charts.PerformanceSummary(compare,main="EWC,EWA Pair", wealth.index=TRUE, colorset=rainbow12equal)
performance.table <- rbind(table.AnnualizedReturns(compare), maxDrawdown(compare), CalmarRatio(compare),table.DownsideRisk(compare))
drawdown.table <- rbind(table.Drawdowns(compare))
#dev.off()
#logRets <- log(cumprod(1+compare))
#chart.TimeSeries(logRets, legend.loc='topleft', colorset=rainbow12equal)

print(performance.table)
print(drawdown.table)