Stock GVP – Mean Reverting Series

Let us explore the ticker symbol GVP. We will test for mean reversion with the Hurst exponent and calculate the half life of mean reversion.

First, let's plot the daily closing prices:

library(ggplot2)
ggplot(new.df, aes(x = Date, y = Close))+
geom_line()+
labs(title = "GVP Close Prices", subtitle = "19950727 to 20170608")+
theme(plot.title = element_text(hjust=0.5),plot.subtitle = element_text(hjust=0.5,size=9), plot.caption = element_text(size=7))

[Figure: GVP close prices, 19950727 to 20170608]

Let's run the Hurst exponent test for mean reversion over the entire history of GVP. For this test we use short term lags of 2 to 20 days (see the Hurst Exponent in R post for an explanation).

# Hurst Exponent
# Andrew Bannerman
# 8.11.2017

require(lubridate)
require(dplyr)
require(magrittr)
require(zoo)
require(lattice)

# Data path
data.dir <- "D:/R Projects"
output.dir <- "D:/R Projects"
data.read.spx <- paste(data.dir,"GVP.csv",sep="/")

# Read data
read.spx <- read.csv(data.read.spx,header=TRUE, sep=",",skip=0,stringsAsFactors=FALSE)

# Convert Values To Numeric
cols <-c(3:8)
read.spx[,cols] %<>% lapply(function(x) as.numeric(as.character(x)))

# Convert Date Column [1]
read.spx$Date <- ymd(read.spx$Date)

# Make new data frame
new.df <- data.frame(read.spx)

# Subset Date Range
#new.df <- subset(new.df, Date >= "2000-01-06" & Date <= "2017-08-06")
#new.df <- subset(new.df, Date >= as.Date("2017-01-07") ) 

#Create lagged variables
lags <- 2:20

# Function for finding differences in lags. Todays Close - 'n' lag period
getLAG.DIFF <- function(lagdays) {
  function(new.df) {
    c(rep(NA, lagdays), diff(new.df$Close, lag = lagdays, differences = 1, arithmetic = TRUE, na.pad = TRUE))
  }
}
# Create a matrix to put the lagged differences in
lag.diff.matrix <- matrix(nrow=nrow(new.df), ncol=0)

# Loop for filling it
for (i in lags) {
  lag.diff.matrix <- cbind(lag.diff.matrix, getLAG.DIFF(i)(new.df))
}

# Rename columns
colnames(lag.diff.matrix) <- sapply(lags, function(n)paste("lagged.diff.n", n, sep=""))

# Bind to existing dataframe
new.df <-  cbind(new.df, lag.diff.matrix)
head(new.df)

# Calculate Variances of 'n period' differences
variance.vec <- apply(new.df[,9:ncol(new.df)], 2, function(x) var(x, na.rm=TRUE))

# Linear regression of log variances vs log lags
log.linear <- lm(formula = log(variance.vec) ~ log(lags))
# Print general linear regression statistics
summary(log.linear)
# Plot log of variance 'n' lags vs log time
xyplot(log(variance.vec) ~ log(lags),
       main="GVP log variance of price diff Vs log time lags",
       xlab = "Time",
       ylab = "Logged Variance 'n' lags",
       grid = TRUE,
       type = c("p","r"),col.line = "red",
       abline = list(h = 0))

hurst.exponent = coef(log.linear)[2]/2
hurst.exponent

[Figure: GVP log variance of price differences vs log time lags]

[Figure: linear regression summary output]

If we divide the log(lags) coefficient by 2 we obtain the Hurst exponent of 0.4598435.

Remember, an H value less than 0.5 = mean reversion.

H = 0.5 = random walk

H > 0.5 = momentum.

Great.

Let's apply a simple linear strategy to see how it performs over this series. We set up a rolling z-score, buy when the z-score crosses below 0 and sell when it crosses back over 0. We use an arbitrarily chosen look back of 10 days for this.

Here are the results:

[Figure: compounded growth of $1, 10 day look back]

The above plot shows the compounded growth of $1. Since 1995, $1 has grown to over $800, a gain of over 79,900%.
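The strategy script itself is not shown in this post, so here is a minimal sketch of the idea rather than the exact code behind the plot. It assumes we are long whenever the rolling z-score is below 0 and flat otherwise, and that we trade on the bar after the signal. The prices are simulated, and the names rolling_zscore, signal, position and equity are made up for illustration.

```r
# Sketch only: simulated prices stand in for the GVP closes.
set.seed(42)
close <- 10 * exp(cumsum(rnorm(500, sd = 0.01)))   # hypothetical price series

# Rolling z-score: (today's close - rolling mean) / rolling sd
rolling_zscore <- function(x, lookback) {
  z <- rep(NA_real_, length(x))
  for (i in lookback:length(x)) {
    window <- x[(i - lookback + 1):i]
    z[i] <- (x[i] - mean(window)) / sd(window)
  }
  z
}

lookback <- 10
z <- rolling_zscore(close, lookback)

# Long (1) while the z-score is below 0, flat (0) otherwise
signal <- ifelse(!is.na(z) & z < 0, 1, 0)

# Assumption: act on the next bar, so today's position is yesterday's signal
position <- c(0, signal[-length(signal)])

# Daily returns of the series and of the strategy
daily.ret <- c(0, diff(close) / close[-length(close)])
strategy.ret <- position * daily.ret

# Compounded growth of $1 and the percentage gain
equity <- cumprod(1 + strategy.ret)
pct.gain <- (tail(equity, 1) - 1) * 100

# The arithmetic quoted above: $1 growing to $800 is a 79,900% gain
(800 / 1 - 1) * 100
```

Swapping the simulated close vector for new.df$Close and changing lookback is all that is needed to try other look back periods.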

Next let's calculate the half life of mean reversion. We do this with linear regression. For the dependent variable we use the price difference between today's close and yesterday's close. For the independent variable we use yesterday's close minus the mean of the lagged closes. The half life is then -log(2) divided by the slope coefficient.

Note that we use the previous 100 days of data for this test:

# Calculate y(t-1) and (y(t) - y(t-1))
# random.data holds the closes under test (the previous 100 days)
n <- length(random.data)
y.lag <- random.data[2:n]                                 # Series shifted by one day
random.data <- random.data[1:(n - 1)]                     # Drop last element so both vectors have the same length
y.diff <- random.data - y.lag                             # One day price difference
y.diff <- y.diff[1:(length(y.diff) - 1)]                  # Adjust length of vector
prev.y.mean <- y.lag - mean(y.lag)                        # Shifted close minus the mean of the shifted closes
prev.y.mean <- prev.y.mean[1:(length(prev.y.mean) - 1)]   # Adjust length of vector
final.df <- data.frame(y.diff = y.diff, prev.y.mean = prev.y.mean)  # Create final data frame

# Linear Regression With Intercept
result <- lm(y.diff ~ prev.y.mean, data = final.df)
half_life <- -log(2)/coef(result)[2]
half_life

We obtain a half life of 4.503093 days.
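As a sanity check on the regression, here is a small sketch on a simulated AR(1) series where we know the answer in advance. It is not the GVP data: the coefficient phi = 0.85, the series length and the variable names are all assumptions chosen for illustration. With y(t) = phi * y(t-1) + noise, the slope of the regression should land close to phi - 1, so the half life should be close to -log(2) / (phi - 1).

```r
# Sketch only: simulated mean reverting series, oldest observation first
set.seed(1)
n <- 5000
phi <- 0.85                       # assumed AR(1) coefficient
y <- numeric(n)
for (t in 2:n) y[t] <- phi * y[t - 1] + rnorm(1)

y.prev <- y[1:(n - 1)]            # y(t-1)
dy <- y[2:n] - y.prev             # y(t) - y(t-1)

# Regress the one day change on the demeaned lagged level
fit <- lm(dy ~ I(y.prev - mean(y.prev)))
lambda <- unname(coef(fit)[2])    # should be close to phi - 1 = -0.15

half.life <- -log(2) / lambda
half.life
-log(2) / (phi - 1)               # value implied by the simulation settings
```

A negative slope is what makes the half life positive. A slope at or above zero means the regression found no mean reversion and the half life is not meaningful.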

Next let's see whether setting our linear strategy look back period equal to the half life improves results. The original look back period of 10 days was chosen arbitrarily. The result of a look back of 4.5, rounded to 5 days, is below:

[Figure: compounded growth of $1, 5 day look back]

From 1995 to roughly the present day the result did not improve significantly, but looking at the plot we see a large uptick in the equity curve from 2013 onwards. Let's subset our data to include only post 2013 data and re-run both the 10 day and the 5 day look back to see whether optimizing with the mean reversion half life brings a benefit.

First, the result of the arbitrarily chosen 10 day look back:

[Figure: compounded growth of $1, 10 day look back, post 2013]

We see that $1 has grown to $8, a 700% increase.

Next, the look back of 4.5, rounded to 5 days, derived from the mean reversion half life calculation:

[Figure: compounded growth of $1, 5 day look back, post 2013]

With the look back set equal to the mean reversion half life, rounded to 5 days, $1 has grown to over $15, an increase of over 1,400%.

Let's run the Hurst exponent on both periods, the first from 1995 to 2013 and the second from 2013 to roughly the present day:

1st test: We see H = 0.4601632
2nd: We see H = 0.4230494

So the Hurst exponent becomes more mean reverting post 2013. If we test >= 2016 and >= 2017 we see:
H = 0.3890816 and 0.2759805 respectively.
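To repeat the test over several date ranges without editing the script each time, the calculation can be wrapped in a function. The sketch below is one way to do it. The helper name hurst_variance, the simulated data frame sim.df and the 2013-01-01 cut-off are assumptions for illustration, and the simulated series is a plain random walk on calendar days, not GVP.

```r
# Hypothetical helper: variance scaling Hurst estimate for a vector of closes
hurst_variance <- function(close, lags = 2:20) {
  # Variance of the 'n' day price differences for each lag
  variances <- sapply(lags, function(k) var(diff(close, lag = k)))
  # Slope of log variance vs log lag, divided by 2
  fit <- lm(log(variances) ~ log(lags))
  unname(coef(fit)[2]) / 2
}

# Simulated stand-in for new.df (Date and Close columns only)
set.seed(7)
dates <- seq(as.Date("1995-07-27"), by = "day", length.out = 8000)
sim.df <- data.frame(Date = dates, Close = 20 + cumsum(rnorm(8000, sd = 0.1)))

# Logical masks for the sub-periods we want to test
periods <- list(
  "1995 to 2013" = sim.df$Date <  as.Date("2013-01-01"),
  "2013 onwards" = sim.df$Date >= as.Date("2013-01-01")
)

# One Hurst estimate per sub-period
h.by.period <- sapply(periods, function(keep) hurst_variance(sim.df$Close[keep]))
h.by.period
```

Because the simulated series is a random walk, both estimates should sit somewhere around 0.5. On real data the same call with new.df$Date and new.df$Close gives the per-period values.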

Next let's choose a random time frame between 1995 and 2013.

For the period 2000 to 2003, H = 0.5198083, which is closer to a random walk.

If we look at the period 2003 to 2008 we have an H value of 0.4167166, which is more mean reverting. However, this H value of 0.41 is actually lower than the post 2013 H value of 0.4230494. So in this case the H value on its own did not predict the size of the gains.

This might be caused by other factors: frequency of trades, price range, fluctuations, etc.

Note that this post is largely theoretical: no commissions are included in any of the trades. It demonstrates how statistical tools can be combined with a back test.


Hurst Exponent in R

The Hurst exponent is a statistical test of whether a series is mean reverting, trending or in geometric Brownian motion. Using the Hurst exponent, a time series can be categorized as follows:

Hurst Values < 0.5 = mean reverting

Hurst Values = 0.5 = geometric Brownian motion

Hurst Values > 0.5 = trending

The Hurst exponent falls in a range of 0 to 1. Values closer to 0 signal stronger mean reversion and values closer to 1 signal stronger trending behavior.

Using R, we can calculate the hurst exponent:

# Hurst Exponent
# Andrew Bannerman
# 8.11.2017

require(lubridate)
require(dplyr)
require(magrittr)
require(zoo)
require(lattice)

# Data path
data.dir <- "G:/R Projects"
output.dir <- "G:/R Projects"
data.read.spx <- paste(data.dir,"SPY.csv",sep="/")

# Read data
read.spx <- read.csv(data.read.spx,header=TRUE, sep=",",skip=0,stringsAsFactors=FALSE)

# Convert Values To Numeric
cols <-c(3:8)
read.spx[,cols] %<>% lapply(function(x) as.numeric(as.character(x)))

# Convert Date Column [1]
read.spx$Date <- ymd(read.spx$Date)

# Make new data frame
new.df <- data.frame(read.spx)

# Subset Date Range
new.df <- subset(new.df, Date >= "2000-01-06" & Date <= "2017-08-06")

#Create lagged variables

lags <- 2:20

# Function for finding differences in lags. Todays Close - 'n' lag period
getLAG.DIFF <- function(lagdays) {
  function(new.df) {
    c(rep(NA, lagdays), diff(new.df$Close, lag = lagdays, differences = 1, arithmetic = TRUE, na.pad = TRUE))
  }
}
# Create a matrix to put the lagged differences in
lag.diff.matrix <- matrix(nrow=nrow(new.df), ncol=0)

# Loop for filling it
for (i in 2:20) {
  lag.diff.matrix <- cbind(lag.diff.matrix, getLAG.DIFF(i)(new.df))
}

# Rename columns
colnames(lag.diff.matrix) <- sapply(2:20, function(n)paste("lagged.diff.n", n, sep=""))

# Bind to existing dataframe
new.df <-  cbind(new.df, lag.diff.matrix)

# Calculate Variances of 'n period' differences
variance.vec <- apply(new.df[,9:ncol(new.df)], 2, function(x) var(x, na.rm=TRUE))

# Linear regression of log variances vs log lags
log.linear <- lm(formula = log(variance.vec) ~ log(lags))
# Print general linear regression statistics
summary(log.linear)
# Plot log of variance 'n' lags vs log time
xyplot(log(variance.vec) ~ log(lags),
       main="SPY Daily Price Differences Variance vs Time Lags",
       xlab = "Time",
       ylab = "Logged Variance 'n' lags",
       grid = TRUE,
       type = c("p","r"),col.line = "red",
       abline = list(h = 0))

hurst.exponent = coef(log.linear)[2]/2
hurst.exponent

# Write output to file
write.csv(new.df,file="G:/R Projects/hurst.csv")

A little explanation of what is actually going on here:

1. First we compute the lagged differences in close prices for the SPY. For lag 2 we take today’s SPY close minus the SPY close 2 days ago. We do this for each lag in 2:20, so for lag 3 we take today’s SPY close minus the SPY close 3 days ago, and so on through lag 20. Each difference rolls through the entire series. This is evident with head(new.df):

 > head(new.df)
           Date Ticker     Open     High      Low    Close  Volume Open.Interest lagged.diff.n2 lagged.diff.n3 lagged.diff.n4 lagged.diff.n5
1753 2000-01-06    SPY 139.2124 141.0819 137.3430 137.3430 6245656           138             NA             NA             NA             NA
1754 2000-01-07    SPY 139.8979 145.3193 139.6486 145.3193 8090507           146             NA             NA             NA             NA
1755 2000-01-10    SPY 145.8178 146.4410 144.5715 145.8178 5758617           146        8.47488             NA             NA             NA
1756 2000-01-11    SPY 145.3816 145.6932 143.0760 143.8861 7455732           144       -1.43326        6.54310             NA             NA
1757 2000-01-12    SPY 144.1976 144.1976 142.4528 142.6398 6932185           143       -3.17808       -2.67956        5.29680             NA
1758 2000-01-13    SPY 144.0730 145.3193 142.8267 144.5715 5173588           145        0.68547       -1.24631       -0.74779        7.22857

There are leading NAs, and their number depends on the lag period used. Each column then rolls through the series taking the lagged differences.

2. We now have all of our lagged differences for lags 2:20 (or any other range chosen).

3. For each ‘n’ lag period we then compute the variance of that lagged difference over its whole length. We can see this by printing the variance vector:

 > variance.vec
 lagged.diff.n2  lagged.diff.n3  lagged.diff.n4  lagged.diff.n5  lagged.diff.n6  lagged.diff.n7  lagged.diff.n8  lagged.diff.n9 lagged.diff.n10
       4.288337        6.065315        7.823918        9.552756       11.155789       12.702647       14.185067       15.724892       17.180618
lagged.diff.n11 lagged.diff.n12 lagged.diff.n13 lagged.diff.n14 lagged.diff.n15 lagged.diff.n16 lagged.diff.n17 lagged.diff.n18 lagged.diff.n19
      18.651980       20.167477       21.854415       23.647368       25.289570       26.751552       28.403188       30.110954       31.620225
lagged.diff.n20
      33.130844

This shows us the variance for each of our lagged differences from 2 to 20.

4. Next we plot the log variance vs the log lags.

# Linear regression of log variances vs log lags
log.linear <- lm(formula = log(variance.vec) ~ log(lags))

# Plot log of variance 'n' lags vs log time
xyplot(log(variance.vec) ~ log(lags),
       main="SPY Daily Price Differences Variance vs Time Lags",
       xlab = "Log Lags",
       ylab = "Logged Variance 'n' lags",
       grid = TRUE,
       type = c("p","r"),col.line = "red",
       abline = list(h = 0))

[Figure: SPY daily price differences variance vs time lags]

5. The hurst exponent is log(lags) estimate / 2 (the slope / 2)
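Why the slope is divided by 2: the method rests on the assumption that the variance of the lagged differences scales as a power of the lag. Writing z(t) for the series under test (here the close price) and τ for the lag, the relation can be sketched as follows, where c is just the intercept of the regression line:

```latex
\operatorname{Var}\bigl(z(t+\tau) - z(t)\bigr) \propto \tau^{2H}
\quad\Longrightarrow\quad
\log \operatorname{Var}(\tau) = 2H \,\log \tau + c
\quad\Longrightarrow\quad
H = \frac{\text{slope}}{2}
```

So regressing log variance on log lag gives a slope of 2H, and halving it recovers H. For a random walk the variance grows linearly with the lag, which gives a slope of 1 and H = 0.5.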

[Figure: Hurst exponent output]

For date range: “2000-01-06” to “2017-08-06” at our chosen lags of 2:20 days:

The SPY Hurst exponent is 0.443483, which is mean reverting.
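A quick way to build some trust in the method is to run it on series where we know what to expect. The sketch below does this on two simulated series: a random walk, which should come out near 0.5, and a stationary AR(1) series, which should come out well below 0.5. The seed, the AR coefficient of 0.5 and the helper name hurst_from_lags are assumptions for illustration only.

```r
# Sketch only: simulated series, not SPY
set.seed(123)
n <- 5000
lags <- 2:20

random.walk <- cumsum(rnorm(n))                                  # expect H near 0.5
mean.reverting <- as.numeric(arima.sim(list(ar = 0.5), n = n))   # expect H well below 0.5

# Same steps as above: lagged differences, variances, log-log slope / 2
hurst_from_lags <- function(x, lags) {
  v <- sapply(lags, function(k) var(diff(x, lag = k)))
  unname(coef(lm(log(v) ~ log(lags)))[2]) / 2
}

h.rw <- hurst_from_lags(random.walk, lags)
h.mr <- hurst_from_lags(mean.reverting, lags)
c(random.walk = h.rw, mean.reverting = h.mr)
```

If the random walk estimate drifts far from 0.5, that is a hint the sample is too short or the lag range is too wide for a stable estimate.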

Another method is to compute a rolling simple hurst exponent over a rolling ‘n’ day period.

The calculation for simple Hurst:

# Function For Simple Hurst Exponent
x <- new.df$Close # set x variable

simpleHurst <- function(y){
  sd.y <- sd(y)
  m <- mean(y)
  y <- y - m
  max.y <- max(cumsum(y))
  min.y <- min(cumsum(y))
  RS <- (max.y - min.y)/sd.y
  H <- log(RS) / log(length(y))
  return(H)
}
simpleHurst(x) # Obtain Hurst exponent for entire series
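To see what the function does step by step, here is the same calculation written out on a tiny made up vector. The five values are chosen only so the arithmetic is easy to follow by hand. They are not prices from the data set.

```r
# Toy series, chosen only for illustration
y <- c(1, 2, 4, 3, 5)

dev <- y - mean(y)                  # deviations from the mean: -2 -1  1  0  2
cum.dev <- cumsum(dev)              # cumulative deviations:    -2 -3 -2 -2  0
R <- max(cum.dev) - min(cum.dev)    # range of the cumulative deviations: 0 - (-3) = 3
S <- sd(y)                          # sample standard deviation: sqrt(2.5)
RS <- R / S                         # rescaled range

H <- log(RS) / log(length(y))       # about 0.398
H
```

This is the rescaled range (R/S) idea applied to a single window: the range of the cumulative deviations, scaled by the standard deviation, compared with the window length on a log scale.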

What we can do is apply the simple Hurst function with rollapply in R over an ‘n’ day rolling look back period. We do this with the getHURST function defined below:

# Hurst Exponent
# Andrew Bannerman
# 8.11.2017

require(lubridate)
require(dplyr)
require(magrittr)
require(zoo)
require(ggplot2)

# Data path
data.dir <- "G:/R Projects"                                # Enter the directory of your S&P500 data here. Use / between folder names, not \
data.read.spx <- paste(data.dir,"SPY.csv",sep="/")

# Read data to read.spx data frame
read.spx <- read.csv(data.read.spx,header=TRUE, sep=",",skip=0,stringsAsFactors=FALSE)

# Make dataframe
new.df <- data.frame(read.spx)

# Convert Values To Numeric
cols <-c(3:8)
new.df[,cols] %<>% lapply(function(x) as.numeric(as.character(x)))

#Convert Date Column [1]
new.df$Date <- ymd(new.df$Date)

# Use for subsetting by date
new.df <- subset(new.df, Date >= "2000-01-06" & Date <= "2017-08-06")    # Change date ranges
#new.df <- subset(new.df, Date >= as.Date("1980-01-01"))                                     # Choose start date to present

# Function For Simple Hurst Exponent
x <- new.df$Close # set x variable

simpleHurst <- function(y){
  sd.y <- sd(y)
  m <- mean(y)
  y <- y - m
  max.y <- max(cumsum(y))
  min.y <- min(cumsum(y))
  RS <- (max.y - min.y)/sd.y
  H <- log(RS) / log(length(y))
  return(H)
}
simpleHurst(x) #Obtain Hurst exponent for entire series

# Calculate rolling hurst exponent for different 'n' periods
getHURST <- function(rolldays) {
  function(new.df) {
    rollapply(new.df$Close,
              width = rolldays,               # width of rolling window
              FUN = simpleHurst,
              fill = NA,
              align = "right")
  }
}
# Create a matrix to put the roll hurst in
roll.hurst.matrix <- matrix(nrow=nrow(new.df), ncol=0)

# Loop for filling it
for (i in 2:252) {
  roll.hurst.matrix <- cbind(roll.hurst.matrix, getHURST(i)(new.df))
}

# Rename columns
colnames(roll.hurst.matrix) <- sapply(2:252, function(n)paste("roll.hurst.n", n, sep=""))

# Bind to existing dataframe
new.df <-  cbind(new.df, roll.hurst.matrix)

# Line Plot of rolling hurst
ggplot(data=new.df, aes(x = Date)) +
  geom_line(aes(y = roll.hurst.n5), colour = "black") +
labs(title="Hurst Exponent - Rolling 5 Days") +
  labs(x="Date", y="Hurst Exponent")  

# Plot Roll Hurst Histogram
qplot(new.df$roll.hurst.n5,
      geom="histogram",
      binwidth = 0.005,
      main = "Simple Hurst Exponent - Rolling 5 Days",
      fill=I("grey"),
      col=I("black"),
      xlab = "Hurst Exponent")

# Plot S&P500 Close
ggplot(data=new.df, aes(x = Date)) +
  geom_line(aes(y = Close), colour = "darkblue") +
  ylab(label="S&P500 Close") +
  xlab("Date") +
  labs(title="S&P500") 

# Write output to file
write.csv(new.df,file="G:/R Projects/hurst.roll.csv")

This calculates the simple Hurst exponent over an ‘n’ day look back period. As we can see from the plotted histograms, shorter time frames for the SPY show Hurst exponents < 0.50, and as we extend the window to longer time frames the SPY moves into the trending category, with Hurst values closer to 1.

[Figures: histograms of the rolling simple Hurst exponent for 3 day, 5 day, 10 day, 6 month and 1 year windows]

Using this information we may design (fit) a model which captures the nature of the series under examination. In this case it would make sense to build mean reversion models on short time periods for SPY and develop trending or momentum models for the longer time frames.
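To compare look back windows with numbers rather than by eye, one option is to summarise each rolling series with its median. The sketch below does this in base R on a simulated price series, so it runs without the SPY file or the zoo package. The helper names simple_hurst and rolling_hurst, the window list and the simulated data are assumptions for illustration, and the output says nothing about SPY itself.

```r
# Same calculation as simpleHurst above, repeated so the sketch is self-contained
simple_hurst <- function(y) {
  cum.dev <- cumsum(y - mean(y))
  RS <- (max(cum.dev) - min(cum.dev)) / sd(y)
  log(RS) / log(length(y))
}

# Right aligned rolling window with leading NAs
rolling_hurst <- function(x, width) {
  out <- rep(NA_real_, length(x))
  for (i in width:length(x)) {
    out[i] <- simple_hurst(x[(i - width + 1):i])
  }
  out
}

# Simulated stand-in for the SPY closes
set.seed(2017)
close <- 100 * exp(cumsum(rnorm(1500, sd = 0.01)))

# Median rolling Hurst for a few look back windows (in trading days)
windows <- c(5, 10, 126, 252)
medians <- sapply(windows, function(w) median(rolling_hurst(close, w), na.rm = TRUE))
names(medians) <- paste0("n", windows)
medians
```

Running the same loop on new.df$Close gives one number per window, which makes it easier to pick the window lengths where a mean reversion or a momentum model might be worth testing.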

References
Algorithmic Trading: Winning Strategies and Their Rationale – May 28, 2013, by Ernie Chan