S&P500 Seasonal Study + Other Commodities

We study whether there is seasonality in the S&P500. We perform the procedure below on data from 1928 to the present day (10.11.2017):

1. Calculate daily spread of closing prices
2. Group daily spread by month
3. Calculate mean of each month

We simply compute the spread of the close to close values. We do not use % returns here, simply today's close minus yesterday's close for every day in the series.

Next we group all days by their month. We then compute the mean for each grouped month.
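
For reference, the three steps can be sketched compactly with dplyr and lubridate. The data frame name prices and its Date/Close columns are illustrative only; the full code used for this post is at the bottom:

library(dplyr)
library(lubridate)

monthly.mean <- prices %>%
  mutate(close.diff = c(NA, diff(Close)),                    # 1. daily close to close spread
         mymonth = lubridate::month(Date)) %>%               # 2. group label: calendar month
  group_by(mymonth) %>%
  summarise(mean.spread = mean(close.diff, na.rm = TRUE))    # 3. mean spread for each month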

The results for the S&P500 are below:

[Figures: Rplot126, Rplot127 - S&P500 mean daily spread per month]

The old adage… ‘Sell In May And Go Away!’ seems to be true.
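
One way to put a rough number on the adage (a sketch reusing the monthly summary data frame mean that is built in the full code at the bottom of this post):

# Average of the monthly mean spreads inside and outside the 'Sell In May' window
sell.in.may <- mean(mean$mean[mean$group.mymonth %in% 5:9])            # May through September
rest.of.year <- mean(mean$mean[mean$group.mymonth %in% c(1:4, 10:12)]) # October through April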

Other ETFs:

[Figure: Rplot134 - DIA mean daily spread per month]

DIA follows mostly the same seasonal pattern as the S&P500.

[Figure: Rplot130 - Crude Oil mean daily spread per month]

The best months for Crude Oil seem to be from Feb through June.

[Figure: Rplot135 - Natural Gas mean daily spread per month]

Natural Gas has its worst months in July and August.

[Figure: Rplot131 - Gold mean daily spread per month]

Best months for Gold look to be Jan/Feb and August.

[Figure: Rplot132 - Silver mean daily spread per month]

Silver follows a similar seasonal pattern to Gold.

Commodities tend to exhibit seasonal supply and demand fluctuations, which show up consistently in the mean plots above and, with a bit of googling, may be explained.

In another post we will test for seasonal strategies which will attempt to exploit the above seasonal trends.

Full R Code below:

# S&P500 Seasonal Study 
# Calculate daily price spreads
# Group by month 
# Average each monthly group 

require(lubridate)
require(dplyr)
require(magrittr)
require(TTR)
require(zoo)
require(data.table)
require(xts)
require(ggplot2)
require(ggthemes)

# Data path
data.dir <- "C:/Stock Market Analysis/Market Data/MASTER_DATA_DUMP"
data.read.spx <- paste(data.dir,"$SPX.csv",sep="/")

# Read data
read.spx <- read.csv(data.read.spx,header=TRUE, sep=",",skip=0,stringsAsFactors=FALSE)

# Convert Values To Numeric 
cols <-c(3:8)
read.spx[,cols] %<>% lapply(function(x) as.numeric(as.character(x)))

# Convert Date Column [1] to Date format 
read.spx$Date <- ymd(read.spx$Date)

# Subset Date
#read.spx <- subset(read.spx, Date >= as.Date("1960-01-01") ) 

# Compute daily price differences 
# We replicate NA 1 time in order to maintain correct positioning of differences
# Within the data frame
read.spx$close.diff <- c(rep(NA, 1), diff(read.spx$Close, lag = 1, differences = 1, arithmetic = TRUE, na.pad = TRUE))

# Group each daily difference by month
group <- read.spx %>% dplyr::mutate(mymonth = lubridate::month(Date)) %>% group_by(mymonth) 
read.spx <- data.frame(read.spx,group$mymonth)
read.spx <- arrange(read.spx,group.mymonth)

# Duplicate df
for.mean <- data.frame(read.spx)

# Perform mean
mean <- for.mean %>%
  group_by(group.mymonth) %>%
  summarise(mean = mean(close.diff, na.rm = TRUE))

# Confidence
jan <- subset(read.spx, group.mymonth  == 1)
feb <- subset(read.spx, group.mymonth  == 2)
mar <- subset(read.spx, group.mymonth  == 3)
apr <- subset(read.spx, group.mymonth  == 4)
may <- subset(read.spx, group.mymonth  == 5)
jun <- subset(read.spx, group.mymonth  == 6)
jul <- subset(read.spx, group.mymonth  == 7)
aug <- subset(read.spx, group.mymonth  == 8)
sep <- subset(read.spx, group.mymonth  == 9)
oct <- subset(read.spx, group.mymonth  == 10)
nov <- subset(read.spx, group.mymonth  == 11)
dec <- subset(read.spx, group.mymonth  == 12)
jan.t.test <- t.test(jan$close.diff, conf.level = 0.95,na.rm = TRUE)
jan.t.test$estimate

# Jan Plot 
hist(jan$close.diff,main="Jan Mean - Normal Distribution",xlab="Mean")

# Plot 
ggplot(mean, aes(group.mymonth, mean)) +
  geom_col()+
  theme_classic()+
  scale_x_continuous(breaks = seq(0, 12, by = 1))+
  ggtitle("UNG - Mean Daily Spead Per Month", subtitle = "2007 To Present") +
  labs(x="Month",y="Mean Daily Spread Per Month")+
  theme(plot.title = element_text(hjust=0.5),plot.subtitle =element_text(hjust=0.5))

ggplot(mean, aes(group.mymonth, mean)) +
  geom_line()+
  theme_bw() +
  scale_x_continuous(breaks = seq(0, 12, by = 1))+
  scale_y_continuous(breaks = seq(-0.15, 0.30, by = 0.02))+
  ggtitle("Mean Daily Spead Per Month", subtitle = "1928 To Present") +
  labs(x="Month",y="Mean Daily Spread Per Month")+
  theme(plot.title = element_text(hjust=0.5),plot.subtitle =element_text(hjust=0.5))+
  geom_rect(aes(xmin=4.5,xmax=9,ymin=-Inf,ymax=Inf),alpha=0.1,fill="#CC6666")+
  geom_rect(aes(xmin=1,xmax=4.5,ymin=-Inf,ymax=Inf),alpha=0.1,fill="#66CC99")+
  geom_rect(aes(xmin=9,xmax=12,ymin=-Inf,ymax=Inf),alpha=0.1,fill="#66CC99")

# Write output to file
write.csv(read.spx,file="C:/R Projects/seasonal.csv")


Half life of Mean Reversion – Ornstein-Uhlenbeck Formula for Mean-Reverting Process

Ernie Chan proposes a method to calculate the speed of mean reversion. He proposes adjusting the ADF (augmented Dickey-Fuller test) formula from discrete time to differential form. This takes the shape of the Ornstein-Uhlenbeck formula for a mean reverting process. Ornstein Uhlenbeck Process – Wikipedia

dy(t) = (λy(t − 1) + μ)dt + dε

Where dε is some Gaussian noise. Chan goes on to mention that using the discrete ADF formula below:

Δy(t) = λy(t − 1) + μ + βt + α1Δy(t − 1) + … + αkΔy(t − k) + εt

and performing a linear regression of Δy(t) against y(t − 1) provides λ which is then used in the first equation. However, the advantage of writing the formula in differential form is it allows an analytical solution for the expected value of y(t).

E[y(t)] = y0 exp(λt) − μ/λ (1 − exp(λt))

Mean reverting series exhibit negative λ. Conversely positive λ means the series doesn’t revert back to the mean.

When λ is negative, the value of the price decays exponentially to the value −μ/λ, with the half-life of decay equal to −log(2)/λ. See references.
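
As a quick numeric check of the half-life formula, using the λ value reported from the SPY regression later in this post:

lambda <- -0.06165            # slope obtained from the regression below
half.life <- -log(2)/lambda
half.life                     # ~11.24 days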

We can perform the regression of the daily price change against the demeaned lagged price with the R code below on the SPY price series. For this test we will use a look back period of 100 days rather than the entire price series (1993 inception to present). If we used all of the data, we would be including how long it takes to recover from bear markets. For trading purposes, we wish to use a shorter sample of data in order to produce a more meaningful statistical test.

The procedure:
1. Lag the SPY close by one day
2. Compute the daily difference: today's close minus yesterday's close
3. Demean the lagged close: yesterday's close minus mean(lagged closes)
4. Perform a linear regression of (today's close minus yesterday's close) ~ (demeaned lagged close)
5. Take the regression slope λ and compute -log(2)/λ
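
The vector random.data in the code below holds the SPY close prices over the chosen look back window. As a hypothetical construction (assuming SPY daily bars are loaded into a data frame read.spx with a Close column, as in the other posts):

random.data <- tail(read.spx$Close, 100)   # last 100 SPY closing prices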

# Calculate y(t-1) and the daily differences
# random.data is assumed to hold the SPY closing prices over the 100 day look back window
y.lag <- c(random.data[2:length(random.data)], 0) # Shift the series by one day
y.lag <- y.lag[1:(length(y.lag) - 1)] # Remove the padded element at the end of the shifted vector
random.data <- random.data[1:(length(random.data) - 1)] # Make vector the same length as y.lag
y.diff <- random.data - y.lag # Daily price difference
y.diff <- y.diff[1:(length(y.diff) - 1)] # Trim to match lengths
prev.y.mean <- y.lag - mean(y.lag) # Demean the lagged closes
prev.y.mean <- prev.y.mean[1:(length(prev.y.mean) - 1)] # Trim to match lengths
final.df <- data.frame(y.diff, prev.y.mean) # Create final data frame for the regression

# Linear Regression With Intercept
result <- lm(y.diff ~ prev.y.mean, data = final.df)
half_life <- -log(2)/coef(result)[2]
half_life

# Linear Regression With No Intercept
result = lm(y.diff ~ prev.y.mean + 0, data = final.df)
half_life1 = -log(2)/coef(result)[1]
half_life1

# Print general linear regression statistics
summary(result)

[Figures: regress, regress.. - linear regression output]

Observing the output of the above regression we see that the slope is negative, indicating a mean reverting process. We see from summary(result) that λ is -0.06165 and when we perform -log(2)/λ we obtain a mean reversion half life of 11.24267 days.

11.24267 days is the half life of mean reversion, which means we anticipate the series to fully revert to the mean by 2 × the half life, or 22.48534 days. However, to trade mean reversion profitably we need not exit exactly at the mean each time. Essentially, if a trade extends beyond 22 days we may suspect a short term or permanent regime shift. One may insulate against such outcomes by setting a 'time stop'.

The obtained 11.24267 day half life is short enough for an interday trading horizon. If we obtained a longer half life we might be waiting a long time for the series to revert back to the mean. Once we determine that the series is mean reverting we can attempt to trade it profitably with a simple linear model using a look back period equal to the half life. In a previous post we explored a simple linear zscore model: https://flare9xblog.wordpress.com/2017/09/24/simple-linear-strategy-for-sp500/

The lookback period of 11 days was originally obtained using a 'brute force' approach (maybe luck). An optimal look back period of 11 days produced the best result for the SPY.

It was noted during optimization of the above strategy that adjusting the look back away from 11 days, in either direction, decreased performance.

We illustrate the effect of moving the look back period shorter and longer than the obtained half life. For simplicity, we will use the total cumulative returns for comparison:

[Figures: 10, 11, 12 - cumulative returns for look back periods of 10, 11 and 12 days]

We see that a look back of 11 days produced the highest cumulative compounded returns.

Ernie Chan goes on to address the question 'why bother with statistical testing?'. The answer lies in the fact that specific trading rules only trigger when their conditions are met and therefore tend to skip over data. Statistical testing includes data that a model may skip over and thus produces results with higher statistical significance.

Furthermore, once we confirm a series is mean reverting we can be reasonably assured that some profitable trading strategy exists, not necessarily the particular strategy that we just back tested.

References
Algorithmic Trading: Winning Strategies and Their Rationale – May 28, 2013, by Ernie Chan

Modelling The Hurst Exponent

One of the purposes of using the Hurst Exponent is to determine whether a price series is trending (momentum), a random walk, or mean reverting. If we know this, we may 'fit' a model to capture the nature of the series.

The Hurst exponent is categorized as:
H <0.5 = mean reverting
H == 0.5 = random walk 
H >0.5 = momentum

Editable parameters:
mu = mean # Change mean value
eta = theta # Try decreasing theta for less mean reversion, increase for more mean reversion
sigma = standard deviation # Change the height of the peaks and valleys with standard deviation

# Create OU simulation
OU.sim <- function(T = 1000, mu = 0.75, eta = 0.3, sigma = 0.05){
  P_0 = mu # Starting price is the mean
  P = rep(P_0,T)
  for(i in 2:T){
    P[i] = P[i-1] + eta * (mu - P[i-1]) + sigma * rnorm(1) * P[i-1]
  }
  return(P)
}

# Plot
plot(OU.sim(), type="l", main="Mean Reversion Sim")

# Save plot to data frame
plot.df <- data.frame(OU.sim())
plot(plot.df$OU.sim.., type="l",main="Mean Reversion Sim")

[Figure: Rplot05 - simulated mean reverting series]

Looks pretty mean reverting.

We stored the simulation in a data frame, so let's run the Hurst exponent to see which H value we obtain.

# Hurst Exponent (varying lags)
require(magrittr)
require(zoo)
require(lattice)

#Create lagged variables
lags <- 2:20

# Function for finding differences in lags. Todays Close - 'n' lag period
getLAG.DIFF <- function(lagdays) {
  function(plot.df) {
    c(rep(NA, lagdays), diff(plot.df$OU.sim.., lag = lagdays, differences = 1, arithmetic = TRUE, na.pad = TRUE))
  }
}
# Create a matrix to put the lagged differences in
lag.diff.matrix <- matrix(nrow=nrow(plot.df), ncol=0)

# Loop for filling it
for (i in lags) {
  lag.diff.matrix <- cbind(lag.diff.matrix, getLAG.DIFF(i)(plot.df))
}

# Rename columns
colnames(lag.diff.matrix) <- sapply(lags, function(n)paste("lagged.diff.n", n, sep=""))

# Bind to existing dataframe
plot.df <-  cbind(plot.df, lag.diff.matrix)

# Calculate Variances of 'n period' differences
variance.vec <- apply(plot.df[,2:ncol(plot.df)], 2, function(x) var(x, na.rm=TRUE))

# Linear regression of log variances vs log lags
log.linear <- lm(formula = log(variance.vec) ~ log(lags))
# Print general linear regression statistics
summary(log.linear)
# Plot log of variance 'n' lags vs log time
xyplot(log(variance.vec) ~ log(lags),
       main="SPY Daily Price Differences Variance vs Time Lags",
       xlab = "Time",
       ylab = "Logged Variance 'n' lags",
       grid = TRUE,
       type = c("p","r"),col.line = "red",
       abline=(h = 0)) 

hurst.exponent = coef(log.linear)[2]/2
hurst.exponent

We obtain a Hurst exponent of 0.1368407, which is significantly mean reverting.

Let's change some of the parameters of the simulation to create a moderately mean reverting series. We can alter theta: if we change eta = 0.3 to eta = 0.04 we obtain this output:

[Figure: Rplot06 - simulated series with eta = 0.04]

It looks less mean reverting than the first series, with H = 0.4140561. This is below 0.50 and is still considered mean reverting.

Let us test the SPY from 1993 (inception) to present (9.23.2017) to see what the H value is. The chart below is a linear regression of the log variance of the SPY lagged differences against the log of the time lags. The Hurst exponent is the slope / 2 (code included).

[Figure: Rplot07 - SPY log variance vs log lags regression]

The Hurst exponent for the SPY daily bars on time lags 2:20 is 0.4378202. We know that price series display different characteristics over varying time frames. If we simply plot the SPY daily closes:

[Figure: Rplot10 - SPY daily closes]

Observing the long term trend we see that the series looks more trending, or momentum driven. We already tested a 2:20 day lag period, which gave an H value of 0.4378202, and if we place the lags from 6 months to one and a half years (126:378 trading days) we see that H = 0.6096454, which is on the momentum side of the scale.

So far – it is as expected.

What does a random series look like?

We can create this using randn from the ramify package. We simply cumsum each randomly generated data point and add a small positive drift to make it a trending series.

# Plot Random Walk With A Trend
require(ramify)
random.walk = cumsum(randn(10000)+0.025)
plot(random.walk, type="l", main="Random Walk")

# Random Walk Data Frame
random.df <- data.frame(cumsum(randn(10000)+0.03))
colnames(random.df)[1] <- "random"
plot(random.df$random, type="l", main="Random Walk")

[Figure: Rplot09 - random walk with drift]

The H for this series (lags 2:20) is 0.4999474, which rounds to 0.50, a random walk.

It would seem, based on these statistical tests, that the Hurst exponent is reasonably accurate in reflecting the nature of a series. It should be noted that different lags capture different regimes. Lags 2:20 exhibit stronger mean reversion; over a 6 month to 1.5 year horizon (lags 126:378) the market exhibited stronger momentum with H = 0.6096454; at lags 50:100 it is close to a random walk at H = 0.5093078. What does this mean? Not only must we optimize models, we must also optimize time frames.
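
To make comparing lag windows quicker, the variance-of-lagged-differences regression used above can be wrapped in a small helper. This is a sketch; the function name hurst.range and its close argument are introduced here for illustration only:

# Hurst exponent from the variance of 'n' lag price differences over any lag window
hurst.range <- function(close, lags = 2:20) {
  variance.vec <- sapply(lags, function(n) var(diff(close, lag = n), na.rm = TRUE))
  log.linear <- lm(log(variance.vec) ~ log(lags))
  unname(coef(log.linear)[2] / 2)                # slope of the log-log regression divided by 2
}

# Example usage (on the simulated series and on SPY closes):
# hurst.range(plot.df$OU.sim.., lags = 2:20)
# hurst.range(new.df$Close, lags = 126:378)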

To recap we:

1. Created a mean reverting price series with mu = 0.75, eta = 0.3, sigma = 0.05
2. Saved the output to a data frame and used the Hurst calculation (linear regression of the log variance of lagged price differences vs log lags) over a 2:20 lag range to obtain the H value. See this post for more information on the Hurst exponent calculation: https://flare9xblog.wordpress.com/2017/08/11/hurst-exponent-in-r/
3. The result was significantly mean reverting, as expected.
4. Tested SPY closes from 1993 to 9.23.2017. On a lag range of 2:20 the series was mean reverting, and on a 6 month to 1.5 year horizon the series was more momentum driven. This was as expected.
5. Created a random set of numbers and added a small drift to each data point to create a random walk with a trend. We obtained an H value of 0.50 (rounded), which is as expected.

The parameters for the simulated series can be edited to change the characteristics and the Hurst exponent can be calculated on each output. Try making the series more mean reverting or less mean reverting and the H value should adjust accordingly.

Full R code below:


# Modelling different price series 
# Mean reverison, random and momentum 
# Andrew Bannerman 9.24.2017

# Create OU simulation
# mu = mean
# eta = theta # Try decreasing theta for less mean reversion, increase for more mean reversion
# sigma = standard deviation # Change the height of the peaks and valleys with standard deviation
OU.sim <- function(T = 1000, mu = 0.75, eta = 0.04, sigma = 0.05){
  P_0 = mu # Starting price is the mean
  P = rep(P_0,T)
  for(i in 2:T){
    P[i] = P[i-1] + eta * (mu - P[i-1]) + sigma * rnorm(1) * P[i-1]
  }
  return(P)
}

# Plot
plot(OU.sim(), type="l", main="Mean Reversion Sim")

# Save plot to data frame 
plot.df <- data.frame(OU.sim())
plot(plot.df$OU.sim.., type="l",main="Mean Reversion Sim")

# Hurst Exponent Mean Reversion (varying lags)
require(magrittr)
require(zoo)
require(lattice)

#Create lagged variables
lags <- 2:20

# Function for finding differences in lags. Todays Close - 'n' lag period
getLAG.DIFF <- function(lagdays) {
  function(plot.df) {
    c(rep(NA, lagdays), diff(plot.df$OU.sim.., lag = lagdays, differences = 1, arithmetic = TRUE, na.pad = TRUE))
  }
}
# Create a matrix to put the lagged differences in
lag.diff.matrix <- matrix(nrow=nrow(plot.df), ncol=0)

# Loop for filling it
for (i in 2:20) {
  lag.diff.matrix <- cbind(lag.diff.matrix, getLAG.DIFF(i)(plot.df))
}

# Rename columns
colnames(lag.diff.matrix) <- sapply(2:20, function(n)paste("lagged.diff.n", n, sep=""))

# Bind to existing dataframe
plot.df <-  cbind(plot.df, lag.diff.matrix)

# Calculate Variances of 'n period' differences
variance.vec <- apply(plot.df[,2:ncol(plot.df)], 2, function(x) var(x, na.rm=TRUE))

# Linear regression of log variances vs log lags
log.linear <- lm(formula = log(variance.vec) ~ log(lags))  
# Print general linear regression statistics  
summary(log.linear) 
# Plot log of variance 'n' lags vs log time  
xyplot(log(variance.vec) ~ log(lags),         
       main="SPY Daily Price Differences Variance vs Time Lags",        
       xlab = "Time",        
       ylab = "Logged Variance 'n' lags",       
       grid = TRUE,        
       type = c("p","r"),col.line = "red",        
       abline=(h = 0)) 

hurst.exponent = coef(log.linear)[2]/2
hurst.exponent

# Write output to file
#write.csv(new.df,file="G:/R Projects/hurst.csv")

  # Plot Random Walk With A Trend
  require(ramify)
  random.walk = cumsum(randn(10000)+0.025)
  plot(random.walk, type="l", main="Random Walk")
  
  # Random Walk Data Frame 
  random.df <- data.frame(cumsum(randn(10000)+0.03))
  colnames(random.df)[1] <- "random"
  plot(random.df$random, type="l", main="Random Walk")

# Hurst Exponent Random Walk (varying lags)
require(magrittr)
require(zoo)
require(lattice)

#Create lagged variables
lags <- 2:20

# Function for finding differences in lags. Todays Close - 'n' lag period
getLAG.DIFF <- function(lagdays) {
  function(random.df) {
    c(rep(NA, lagdays), diff(random.df$random, lag = lagdays, differences = 1, arithmetic = TRUE, na.pad = TRUE))
  }
}
# Create a matrix to put the lagged differences in
lag.diff.matrix <- matrix(nrow=nrow(random.df), ncol=0)

# Loop for filling it
for (i in 2:20) {
  lag.diff.matrix <- cbind(lag.diff.matrix, getLAG.DIFF(i)(random.df))
}

# Rename columns
colnames(lag.diff.matrix) <- sapply(2:20, function(n)paste("lagged.diff.n", n, sep=""))

# Bind to existing dataframe
random.df <-  cbind(random.df, lag.diff.matrix)

# Calculate Variances of 'n period' differences
variance.vec <- apply(random.df[,2:ncol(random.df)], 2, function(x) var(x, na.rm=TRUE))

# Linear regression of log variances vs log lags
log.linear <- lm(formula = log(variance.vec) ~ log(lags))  
# Print general linear regression statistics  
summary(log.linear) 
# Plot log of variance 'n' lags vs log time  
xyplot(log(variance.vec) ~ log(lags),         
       main="SPY Daily Price Differences Variance vs Time Lags",        
       xlab = "Time",        
       ylab = "Logged Variance 'n' lags",       
       grid = TRUE,        
       type = c("p","r"),col.line = "red",        
       abline=(h = 0)) 

hurst.exponent = coef(log.linear)[2]/2
hurst.exponent

References
Algorithmic Trading: Winning Strategies and Their Rationale – May 28, 2013, by Ernie Chan

Simple S&P500 Linear Strategy

From our Hurst exponent test in a previous post we determined that the S&P500 is a mean reverting series on an interday time frame. For this back test we have created a very simple linear model to 'fit' the nature of the series.

We use a rolling 'n' day zscore and we test using a look back of 11 days; for some reason a lookback of 11 days provided the best results. For the sake of simplicity we will test over the full sample period and we will not include any commissions or slippage.

We will use the SPY for this back test from inception to present.

Here are the parameters:
1. Rolling lookback period of z-score = 11 days
2. Entry level = < -0.2
3. Exit Time = 4 days

As you can see the parameters are kept to a minimum. The main goal here is to fit a model to the existing anomaly rather than force-fit a model to the data.
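
As a minimal sketch of the signal construction (the data frame name spy and its Close column are illustrative; the full back test code is further below), the rolling z-score and entry level look like this:

library(TTR)
lookback <- 11
zscore <- (spy$Close - SMA(spy$Close, lookback)) /
          runSD(spy$Close, lookback, cumulative = FALSE)
signal <- ifelse(zscore < -0.2, 1, 0)            # long signal when the z-score drops below -0.2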

[Figures: 11, result - equity curve and performance statistics for the 11-day look back]

As we can see, we obtain better performance than buy and hold. We have about a 21% better drawdown than buy and hold and roughly a 10% annualized return from 1993 to present. Notice the time invested is 48%, which means we have capital available for other strategies. To offset the drawdowns even further we could add another uncorrelated strategy, but as a base this is not a bad attempt to capture the mean reversion nature of the S&P500 on an interday basis. Note this might not be the most stellar performance, but it's a good example to illustrate the train of thought.

Some questions to ponder:

1. What happens if we 'fit' a mean reversion model to S&P500 daily data from 1929 to 1980?
2. What happens if we fit a momentum model to S&P500 daily data from 1929 to 1980?
3. What happens if we fit a momentum model for S&P500 daily data from 1993 to present?

Those questions, good readers, I shall leave for you to answer!

The rules of this system are:
1. When the zscore crosses below -0.2, buy the next day's open
2. After 4 days have passed, sell all; if another signal is generated while in the existing trade we take that next signal also

Notes about the rules of the backtest:
1. We calculate S&P500 close-to-close returns and also open-to-close returns. We do this in order to give the first day of each trade an open-to-close return (at the end of a day a signal is generated, we then buy the next day's open, so the return for the first day in a trade is an open-to-close return; from day 2 onwards the returns are close-to-close returns).
2. To avoid look-ahead bias, after a signal is generated we lag the signal series forward one day so that we are simulating buying the NEXT day's open (see the sketch below).
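
Below is a minimal sketch of this return bookkeeping. The data frame spy with Open/Close columns and the 0/1 signal vector are hypothetical names, not the exact objects used in the full code further down:

library(dplyr)
library(TTR)
ocret <- (spy$Close - spy$Open) / spy$Open       # open-to-close return (used on the entry day)
clret <- ROC(spy$Close, type = "discrete")       # close-to-close return (used on later days)
clret[1] <- 0
position <- dplyr::lag(signal, 1, default = 0)   # signal acted on at the NEXT day's open
first.day <- position == 1 & dplyr::lag(position, 1, default = 0) == 0
day.ret <- ifelse(position == 0, 0, ifelse(first.day, ocret, clret))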

All parameters can be optimized and entry and exit rules adjusted. Some ideas could be:

1. For the exit, exit at a specific zscore value, i.e. buy below -0.20 and sell when the zscore crosses back over 0 (see the sketch after this list)
2. For the exit, ignore repeat signals while in an existing trade
3. Exit as soon as the price is higher than our entry
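
A minimal sketch of exit idea 1, using the zoo package that is already loaded elsewhere on this blog. The vector zscore is assumed to be the rolling z-score from the sketch further above (a hypothetical name), so this is an illustration rather than the exact code of the post:

library(zoo)
enter <- ifelse(zscore < -0.2, 1, 0)             # entry: z-score drops below -0.20
exit <- ifelse(zscore > 0, 1, 0)                 # exit: z-score crosses back above 0
state <- ifelse(enter == 1, 1, ifelse(exit == 1, 0, NA))
signal <- na.locf(state, na.rm = FALSE)          # hold the position between entry and exit
signal[is.na(signal)] <- 0                       # flat before the first entry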

Different rules produce slightly different results. I may go into some of these different selling rules and the code for them in future posts! For now this is a good start 🙂

# Calculate rolling z-score of SPY close price
# Sell at fixed n day hold
# 8.13.2017
# Andrew Bannerman
require(lubridate)
require(dplyr)
require(magrittr)
require(TTR)
require(zoo)
require(data.table)
require(xts)
require(PerformanceAnalytics)

# Data path
data.dir <- "D:/R Projects"
data.read.spx <- paste(data.dir,"SPY.csv",sep="/")

# Read data
read.spx <- read.csv(data.read.spx,header=TRUE, sep=",",skip=0,stringsAsFactors=FALSE)

# Make dataframe
new.df <- data.frame(read.spx)
tail(new.df)
# Convert Values To Numeric
cols <-c(3:8)
new.df[,cols] %<>% lapply(function(x) as.numeric(as.character(x)))

#Convert Date Column [1]
new.df$Date <- ymd(new.df$Date)

# Use TTR package to create rolling SMA n day moving average
# Create function and loop in order to repeat the desired number of SMAs for example 2:30
getSMA <- function(numdays) {
function(new.df) {
SMA(new.df[,"Close"], numdays) # Calls TTR package to create SMA
}
}
# Create a matrix to put the SMAs in
sma.matrix <- matrix(nrow=nrow(new.df), ncol=0)

# Loop for filling it
for (i in 2:60) {
sma.matrix <- cbind(sma.matrix, getSMA(i)(new.df))
}

# Rename columns
colnames(sma.matrix) <- sapply(2:60, function(n)paste("close.sma.n", n, sep=""))

# Bind to existing dataframe
new.df <- cbind(new.df, sma.matrix)

# Use TTR package to create rolling Standard Deviation
# Create function and loop in order to repeat the desired number of Stdev for example 2:30
getSD <- function(numdays) {
function(new.df) {
runSD(new.df$Close, numdays, cumulative = FALSE) # Calls TTR package to create SMA
}
}
# Create a matrix to put the SMAs in
sd.matrix <- matrix(nrow=nrow(new.df), ncol=0)

# Loop for filling it
for (i in 2:60) {
sd.matrix <- cbind(sd.matrix, getSD(i)(new.df))
}

# Rename columns
colnames(sd.matrix) <- sapply(2:60, function(n)paste("close.sd.n", n, sep=""))

# Bind to existing dataframe
new.df <- cbind(new.df, sd.matrix)

# Use base R to work out the rolling z-score (Close - rolling mean) / rolling stdev
# for each look back length n in 2:60
for (n in 2:60) {
  new.df[[paste0("close.zscore.n", n)]] <-
    (new.df$Close - new.df[[paste0("close.sma.n", n)]]) / new.df[[paste0("close.sd.n", n)]]
}

# Convert all NA to 0
new.df[is.na(new.df)] <- 0

# Calculate quartiles, where close is relation to range (Close – High) / (High – Low)
#new.df$quartile <- apply(new.df[,c('Close', 'Low', 'High')], 1, function(x) { (x[1]-x[2])/(x[3]-x[2])} )

# Calculate Returns from open to close
new.df$ocret <- apply(new.df[,c('Open', 'Close')], 1, function(x) { (x[2]-x[1])/x[1]} )

# Calculate Close-to-Close returns
new.df$clret <- ROC(new.df$Close, type = c("discrete"))
new.df$clret[1] <- 0

# Name indicators
indicator <- new.df$close.zscore.n12

# Create Long Signals
new.df$signal <- ifelse(indicator < -.2, 1, 0)

######## Fixed 'n' day hold logic ##########
# Variable for loop
indicator <- new.df$signal

# Create Vector From Signal
signal.1 <- c(indicator)

# Set variable for number of days to hold
n.day <- 4

# Loop for fixed 'n' day hold
res <- NULL
while (length(res) < length(signal.1)) {
if (signal.1[length(res)+1] == 1) {
res <- c(res, rep(1,n.day))
} else {
res <- c(res, 0)
}
}
res <- res[1:length(signal.1)]
new.df <- data.frame(new.df,response = res)

# lag signal by one forward day to signal entry next day
new.df$response <- dplyr::lag(new.df$response, 1)
new.df$response[is.na(new.df$response)] <- 0

# Build the daily equity curve: open-to-close return on the first day of each
# trade, close-to-close returns on subsequent days, 0 when flat
new.df <- new.df %>%
  dplyr::mutate(RunID = data.table::rleid(response)) %>%
  group_by(RunID) %>%
  mutate(equity.curve = ifelse(response == 0, 0,
                               ifelse(row_number() == 1, ocret, clret))) %>%
  ungroup() %>%
  select(-RunID)

# Pull select columns from data frame to make XTS whilst retaining formats
xts1 = xts(new.df$equity.curve, order.by=as.Date(new.df$Date, format="%m/%d/%Y"))
xts2 = xts(new.df$clret, order.by=as.Date(new.df$Date, format="%m/%d/%Y"))

# Join XTS together
compare <- cbind(xts1,xts2)

# Use the PerformanceAnalytics package for trade statistics
# install.packages("PerformanceAnalytics")
require(PerformanceAnalytics)
colnames(compare) <- c("Mean Reversion","Buy And Hold")
charts.PerformanceSummary(compare,main="Cumulative Returns", wealth.index=TRUE, colorset=rainbow12equal)
#png(filename="20090606_rsi2_performance_updated.png", 720, 720)
performance.table <- rbind(table.AnnualizedReturns(compare), maxDrawdown(compare), CalmarRatio(compare),table.DownsideRisk(compare))
drawdown.table <- rbind(table.Drawdowns(compare))
#dev.off()
# Log of cumulative returns (optional, for log-scale equity curve plots)
logRets <- log(cumprod(1 + compare))

# Per-trade percent return for multi day hold trades:
# (last day's close - first day's open) / first day's open within each trade
new.df <- new.df %>%
  dplyr::mutate(RunID = data.table::rleid(response)) %>%
  group_by(RunID) %>%
  dplyr::mutate(perc.output = ifelse(response == 0, 0,
                                     ifelse(row_number() == n(),
                                            (last(Close) - first(Open))/first(Open), 0))) %>%
  ungroup() %>%
  select(-RunID)

# All NA to 0
new.df[is.na(new.df)] <- 0
# Win / Loss %
# 1 Day Hold Trades
winning.trades <- sum(new.df$equity.curve > 0, na.rm=TRUE)
losing.trades <- sum(new.df$equity.curve < 0, na.rm=TRUE)
total.days <- NROW(new.df$equity.curve)
# Multi Day Hold Trades
multi.winning.trades <- sum(new.df$perc.output > 0, na.rm=TRUE)
multi.losing.trades <- sum(new.df$perc.output < 0, na.rm=TRUE)
multi.total.days <- NROW(new.df$perc.output)
multi.losing.trades
# % Time Invested (Same column for 1 day and multi hold trades)
time.invested <- (winning.trades + losing.trades) / total.days
winning.trades + losing.trades
winning.trades
losing.trades

# Calcualte win loss %
# 1 Day Hold Trades
total <- winning.trades + losing.trades
win.percent <- winning.trades / total
loss.percent <- losing.trades / total
# Multi Day Hold Trades
multi.total <- multi.winning.trades + multi.losing.trades
multi.win.percent <- multi.winning.trades / multi.total
multi.loss.percent <- multi.losing.trades / multi.total
# Calculate Consecutive Wins Loss
# 1 Day Hold Trades
remove.zero <- new.df[-which(new.df$equity.curve == 0 ), ] # removing rows 0 values
consec.wins <- max(rle(sign(remove.zero$equity.curve))[[1]][rle(sign(remove.zero$equity.curve))[[2]] == 1])
consec.loss <- max(rle(sign(remove.zero$equity.curve))[[1]][rle(sign(remove.zero$equity.curve))[[2]] == -1])

# Multi Day Hold Trades
multi.remove.zero <- new.df[-which(new.df$perc.output == 0 ), ] # removing rows 0 values
multi.consec.wins <- max(rle(sign(multi.remove.zero$perc.output))[[1]][rle(sign(multi.remove.zero$perc.output))[[2]] == 1])
multi.consec.loss <-max(rle(sign(multi.remove.zero$perc.output))[[1]][rle(sign(multi.remove.zero$perc.output))[[2]] == -1])

# Calculate Summary Statistics
# 1 Day Hold Trades OR all days if multi holding
average.trade <- mean(new.df$equity.curve)
average.win <- mean(new.df$equity.curve[new.df$equity.curve > 0])
average.loss <- mean(new.df$equity.curve[new.df$equity.curve < 0])
median.win <- median(new.df$equity.curve[new.df$equity.curve > 0])
median.loss <- median(new.df$equity.curve[new.df$equity.curve < 0])
max.gain <- max(new.df$equity.curve)
max.loss <- min(new.df$equity.curve)
win.loss.ratio <- winning.trades / abs(losing.trades)
summary <- cbind(winning.trades,losing.trades,win.percent,loss.percent,win.loss.ratio,time.invested,average.trade,average.win,average.loss,median.win,median.loss,consec.wins,consec.loss,max.gain,max.loss)
summary <- as.data.frame(summary)
colnames(summary) <- c("Winning Trades","Losing Trades","Win %","Loss %","Win Loss Ratio","Time Invested","Average Trade","Average Win","Average Loss","Median Gain","Median Loss","Consec Wins","Consec Loss","Maximum Win","Maximum Loss")
print(summary)

# Multi Day Hold Trades
multi.average.trade <- mean(new.df$perc.output)
multi.average.win <- mean(new.df$perc.output[new.df$perc.output > 0])
multi.average.loss <- mean(new.df$perc.output[new.df$perc.output < 0])
multi.median.win <- median(new.df$perc.output[new.df$perc.output > 0])
multi.median.loss <- median(new.df$perc.output[new.df$perc.output < 0])
multi.win.loss.ratio <- multi.average.win / abs(multi.average.loss)
multi.max.gain <- max(new.df$perc.output)
multi.max.loss <- min(new.df$perc.output)
multi.summary <- cbind(multi.winning.trades,multi.losing.trades,multi.win.percent,multi.loss.percent,multi.win.loss.ratio,time.invested,multi.average.trade,multi.average.win,multi.average.loss,multi.median.win,multi.median.loss,multi.consec.wins,multi.consec.loss,multi.max.gain,multi.max.loss)
multi.summary <- as.data.frame(multi.summary)
colnames(multi.summary) <- c("Winning Trades","Losing Trades","Win %","Loss %","Win Loss Ratio","Time Invested","Average Trade","Average Win","Average Loss","Median Gain","Median Loss","Consec Wins","Consec Loss","Maximum Win","Maximum Loss")
print(multi.summary)
print(performance.table)
print(drawdown.table)
table.Drawdowns(xts1, top=10)
charts.PerformanceSummary(compare,main="Cumulative Returns",wealth.index=TRUE,colorset=rainbow12equal)

# Write output to file
write.csv(new.df,file="D:/R Projects/spy_mean.csv")

R – Multi Day Hold Trading Logic – Replacing a for loop to back test multi hold day trading rules

I wanted to expand on some trading logic written over at FOSS trading blog. Joshua demonstrates how to back test 1 day hold strategies. Here is an example of one of his back test scripts.

RSI2 Back Test Script

http://blog.fosstrading.com/2009/04/testing-rsi2-with-r.html

We can look at one of the trading rules from the above back test script:

# Create the long (up) and short (dn) signals
sigup <- ifelse(rsi < 10, 1, 0)
sigdn <- ifelse(rsi > 90, -1, 0)

This is basically saying: if RSI is below 10 go long; any time it is not below 10, get out (meaning you can't hold a position from RSI 10 all the way up to 90, for example). For going short, we short when RSI is over 90 and any time RSI is not over 90 we are not short (meaning we can't short above 90 and hold all the way down to 10, for example).

The above long and short rules are essentially designed mostly for short term trading or 1 day trading hold times.

Let's expand on the above example and create a multi day trading rule.

We will use R and use a dummy data set to simulate an indicator:


# Random Indicator
any.indicator <- c(runif(1500, min=0, max=100))   #create random numbers between 0 and 100, create 1500 data points between that range
df <- data.frame(any.indicator)   # place the vector above into a data frame

# Create Entry and Exit Rule
# Ifelse statement (print 1 else 0)
# We want to buy when any.indicator is below 10, when below 10 we want to buy so signal.enter will = 1
# We want to exit our trade when any.indicator is over 90, when over  90 we want to sell so signal.exit will = 1
# This sets the boundary for our multi day hold
df$signal.enter <- ifelse(any.indicator < 10, 1,0)  # Enter when indicator is below 10
df$signal.exit <- ifelse(any.indicator > 90, 1,0)   # Exit when indicator is above 90

# Generate multi day hold trading logic using a for loop
# This will find the first 1 in df$signal.enter and keep printing 1 on each row
# of the data frame until we meet a df$signal.exit == 1, so that we can back test
# multi day hold trades. Once we exit, we print 0's until we meet another
# df$signal.enter == 1.

df$signal[[1]] = ifelse(df$signal.enter[[1]] == (1), 1, 0)

for (i in 2:nrow(df)){
  df$signal[i] = ifelse(df$signal.enter[i] == (1), 1,
                        ifelse(df$signal.exit[i] == (1), 0,
                               df$signal[i-1]))
}

The code above uses a for loop to create entries and exits for multi day hold trades. In this example we are buying when any.indicator is below 10 and selling when it is over 90. Or we could short when it is above 90 and cover the short when it reaches 10. The for loop above becomes quite slow on large data sets. We can keep the 'vectorized theme' of R and use dplyr to replace the loop, with a little help from zoo::na.locf to carry the position forward between entry and exit. This speeds up the code.

# Dplyr solution: zoo::na.locf carries the last entry/exit state forward
library(dplyr)
df <- df %>%
  dplyr::mutate(signal = ifelse(signal.enter == 1, 1, ifelse(signal.exit == 1, 0, NA)),
                signal = zoo::na.locf(signal, na.rm = FALSE),
                signal = ifelse(is.na(signal), 0, signal))

The vectorized version produces the same result as the for loop above and runs much faster on larger data sets.

Full code with the dplyr solution replacing the for loop for multi day trading rules:


# Create random indicator
any.indicator <- c(runif(1500, min=0, max=100))
df <- data.frame(any.indicator)

# Ifelse statement (print 1 else 0)
# Create entry and exit signals
df$signal.enter <- ifelse(any.indicator < 10, 1,0)  # create enter signal
df$signal.exit <- ifelse(any.indicator > 90, 1,0) # create exit signal

# Multi day hold trading rules
library(dplyr)
library(zoo)

df <- df %>%
  dplyr::mutate(signal = ifelse(signal.enter == 1, 1, ifelse(signal.exit == 1, 0, NA)),
                signal = zoo::na.locf(signal, na.rm = FALSE),   # carry the position forward until exit
                signal = ifelse(is.na(signal), 0, signal))      # flat before the first entry

Hurst Exponent in R

The Hurst Exponent is a statistical testing method which tests whether a series is mean reverting, trending or a geometric Brownian motion. Using the Hurst exponent, a time series can be categorized as follows:

Hurst Values < 0.5 = mean reverting

Hurst Values = 0.5 = geometric Brownian motion

Hurst Values > 0.5 = trending

The Hurst exponent falls in a range between 0 and 1, where values closer to 0 signal stronger mean reversion and values closer to 1 signal stronger trending behavior.

Using R, we can calculate the hurst exponent:

# Hurst Exponent
# Andrew Bannerman
# 8.11.2017

require(lubridate)
require(dplyr)
require(magrittr)
require(zoo)
require(lattice)

# Data path
data.dir <- "G:/R Projects"
output.dir <- "G:/R Projects"
data.read.spx <- paste(data.dir,"SPY.csv",sep="/")

# Read data
read.spx <- read.csv(data.read.spx,header=TRUE, sep=",",skip=0,stringsAsFactors=FALSE)

# Convert Values To Numeric
cols <-c(3:8)
read.spx[,cols] %<>% lapply(function(x) as.numeric(as.character(x)))

# Convert Date Column [1]
read.spx$Date <- ymd(read.spx$Date)

# Make new data frame
new.df <- data.frame(read.spx)

# Subset Date Range
new.df <- subset(new.df, Date >= "2000-01-06" & Date <= "2017-08-06")

#Create lagged variables

lags <- 2:20

# Function for finding differences in lags. Todays Close - 'n' lag period
getLAG.DIFF <- function(lagdays) {
  function(new.df) {
    c(rep(NA, lagdays), diff(new.df$Close, lag = lagdays, differences = 1, arithmetic = TRUE, na.pad = TRUE))
  }
}
# Create a matrix to put the lagged differences in
lag.diff.matrix <- matrix(nrow=nrow(new.df), ncol=0)

# Loop for filling it
for (i in 2:20) {
  lag.diff.matrix <- cbind(lag.diff.matrix, getLAG.DIFF(i)(new.df))
}

# Rename columns
colnames(lag.diff.matrix) <- sapply(2:20, function(n)paste("lagged.diff.n", n, sep=""))

# Bind to existing dataframe
new.df <-  cbind(new.df, lag.diff.matrix)

# Calculate Variances of 'n period' differences
variance.vec <- apply(new.df[,9:ncol(new.df)], 2, function(x) var(x, na.rm=TRUE))

# Linear regression of log variances vs log lags
log.linear <- lm(formula = log(variance.vec) ~ log(lags))  
# Print general linear regression statistics
summary(log.linear)
# Plot log of variance 'n' lags vs log time
xyplot(log(variance.vec) ~ log(lags),
       main="SPY Daily Price Differences Variance vs Time Lags",
       xlab = "Time",
       ylab = "Logged Variance 'n' lags",
       grid = TRUE,
       type = c("p","r"), col.line = "red",
       abline=(h = 0))

hurst.exponent = coef(log.linear)[2]/2
hurst.exponent 

# Write output to file
#write.csv(new.df,file="G:/R Projects/hurst.csv")

For a little explanation of what is actually going on here:

1. First we compute the lagged difference in close prices for the SPY. We do this by taking today's SPY close minus the SPY close 2 days ago (the 2 day lag). We do this for each lag in 2:20, so for lag 3 we take today's SPY close minus the SPY close 3 days ago, and repeat the process through to lag 20. This rolls through the entire series, as is evident with head(new.df):

 > head(new.df)
           Date Ticker     Open     High      Low    Close  Volume Open.Interest lagged.diff.n2 lagged.diff.n3 lagged.diff.n4 lagged.diff.n5
1753 2000-01-06    SPY 139.2124 141.0819 137.3430 137.3430 6245656           138             NA             NA             NA             NA
1754 2000-01-07    SPY 139.8979 145.3193 139.6486 145.3193 8090507           146             NA             NA             NA             NA
1755 2000-01-10    SPY 145.8178 146.4410 144.5715 145.8178 5758617           146        8.47488             NA             NA             NA
1756 2000-01-11    SPY 145.3816 145.6932 143.0760 143.8861 7455732           144       -1.43326        6.54310             NA             NA
1757 2000-01-12    SPY 144.1976 144.1976 142.4528 142.6398 6932185           143       -3.17808       -2.67956        5.29680             NA
1758 2000-01-13    SPY 144.0730 145.3193 142.8267 144.5715 5173588           145        0.68547       -1.24631       -0.74779        7.22857

There are leading NA’s depending on which lag period we used. This then rolls through the series taking the lagged differences.

2. Afterwards we have all of our lagged differences from 2:20 (or any other chosen range).

3. For each 'n' lag period we then compute the variance of that lagged difference series, taken over the full length of the series. We can see this by printing the variance vector:

 > variance.vec
 lagged.diff.n2  lagged.diff.n3  lagged.diff.n4  lagged.diff.n5  lagged.diff.n6  lagged.diff.n7  lagged.diff.n8  lagged.diff.n9 lagged.diff.n10
       4.288337        6.065315        7.823918        9.552756       11.155789       12.702647       14.185067       15.724892       17.180618
lagged.diff.n11 lagged.diff.n12 lagged.diff.n13 lagged.diff.n14 lagged.diff.n15 lagged.diff.n16 lagged.diff.n17 lagged.diff.n18 lagged.diff.n19
      18.651980       20.167477       21.854415       23.647368       25.289570       26.751552       28.403188       30.110954       31.620225
lagged.diff.n20
      33.130844

This shows us the variance for each of our lagged differences from 2 to 20.

4. Next we plot the log of the variance of the 'n' lag differences against the log of the lags and fit a linear regression:

# Linear regression of log variances vs log lags
log.linear <- lm(formula = log(variance.vec) ~ log(lags))

# Plot log of variance 'n' lags vs log time
xyplot(log(variance.vec) ~ log(lags),
       main="SPY Daily Price Differences Variance vs Time Lags",
       xlab = "Log Lags",
       ylab = "Logged Variance 'n' lags",
       grid = TRUE,
       type = c("p","r"),col.line = "red",
       abline=(h = 0))

[Figure: Rplot30 - log variance of 'n' lag differences vs log lags]

5. The Hurst exponent is the log(lags) coefficient estimate divided by 2 (the slope / 2).

[Figure: hurst - regression output]

For date range: “2000-01-06” to “2017-08-06” at our chosen lags of 2:20 days:

the SPY Hurst exponent is 0.443483, which is mean reverting.

Another method is to compute a simple Hurst exponent over a rolling 'n' day period.

The calculation for simple Hurst:

# Function For Simple Hurst Exponent
x <- new.df$Close # set x variable

simpleHurst <- function(y){
  sd.y <- sd(y)
  m <- mean(y)
  y <- y - m
  max.y <- max(cumsum(y))
  min.y <- min(cumsum(y))
  RS <- (max.y - min.y)/sd.y
  H <- log(RS) / log(length(y))
  return(H)
}
simpleHurst(x) # Obtain Hurst exponent for entire series

What we can do is apply the simpleHurst function using rollapply in R over an 'n' day rolling look back period. We do this using our getHURST function:

# Hurst Exponent
# Andrew Bannerman
# 8.11.2017

require(lubridate)
require(dplyr)
require(magrittr)
require(zoo)
require(ggplot2)

# Data path
data.dir <- "G:/R Projects"                                #Enter your directry here of you S&p500 data.. you need / between folder names not \
data.read.spx <- paste(data.dir,"SPY.csv",sep="/")

# Read data to read.spx data frame
read.spx <- read.csv(data.read.spx,header=TRUE, sep=",",skip=0,stringsAsFactors=FALSE)

# Make dataframe
new.df <- data.frame(read.spx)

# Convert Values To Numeric
cols <-c(3:8)
new.df[,cols] %<>% lapply(function(x) as.numeric(as.character(x)))

#Convert Date Column [1]
new.df$Date <- ymd(new.df$Date)

# Use for subsetting by date
new.df <- subset(new.df, Date >= "2000-01-06" & Date <= "2017-08-06")    # Change date ranges
#new.df <- subset(new.df, Date >= as.Date("1980-01-01"))                                     # Choose start date to present

# Function For Simple Hurst Exponent
x <- new.df$Close # set x variable

simpleHurst <- function(y){
  sd.y <- sd(y)
  m <- mean(y)
  y <- y - m
  max.y <- max(cumsum(y))
  min.y <- min(cumsum(y))
  RS <- (max.y - min.y)/sd.y
  H <- log(RS) / log(length(y))
  return(H)
}
simpleHurst(x) #Obtain Hurst exponent for entire series

# Calculate rolling Hurst exponent for different 'n' periods
getHURST <- function(rolldays) {
  function(new.df) {
    rollapply(new.df$Close,
              width = rolldays,               # width of rolling window
              FUN = simpleHurst,
              fill = NA,
              align = "right")
  }
}
# Create a matrix to put the roll hurst in
roll.hurst.matrix <- matrix(nrow=nrow(new.df), ncol=0)

# Loop for filling it
for (i in 2:252) {
  roll.hurst.matrix <- cbind(roll.hurst.matrix, getHURST(i)(new.df))
}

# Rename columns
colnames(roll.hurst.matrix) <- sapply(2:252, function(n)paste("roll.hurst.n", n, sep=""))

# Bind to existing dataframe
new.df <-  cbind(new.df, roll.hurst.matrix)

# Line Plot of rolling hurst
ggplot(data=new.df, aes(x = Date)) +
  geom_line(aes(y = roll.hurst.n5), colour = "black") +
labs(title="Hurst Exponent - Rolling 5 Days") +
  labs(x="Date", y="Hurst Exponent")  

# Plot Roll Hurst Histogram
qplot(new.df$roll.hurst.n5,
      geom="histogram",
      binwidth = 0.005,
      main = "Simple Hurst Exponent - Rolling 5 Days",
      fill=I("grey"),
      col=I("black"),
      xlab = "Hurst Exponent")

# Plot S&P500 Close
ggplot(data=new.df, aes(x = Date)) +
  geom_line(aes(y = Close), colour = "darkblue") +
  ylab(label="S&P500 Close") +
  xlab("Date") +
  labs(title="S&P500") 

# Write output to file
write.csv(new.df,file="G:/R Projects/hurst.roll.csv")

This calculates the simple Hurst exponent over an 'n' day look back period. As we can see from the plotted histograms, shorter time frames for the SPY show Hurst exponents < 0.50, and if we extend to longer time frames the SPY moves toward the trending category, with Hurst values closer to 1.

[Figures: hurst 3, hurst5, hurst10, hurst 6 mo, hurst 1 year - rolling simple Hurst exponent histograms for different look back windows]

Using this information we may design (fit) a model which captures the nature of the series under examination. In this case it would make sense to build mean reversion models on short time periods for SPY and develop trending or momentum models for the longer time frames.

References
Algorithmic Trading: Winning Strategies and Their Rationale – May 28, 2013, by Ernie Chan

R – SPY ETF Data Clean Up

In the first post we looked at how to load data in R, view data, convert formats and sort by date. In this post we will work with a data set from State Street Global Advisors, the provider of the SPY ETF. The data is quite awkward, as dates are not in a typical, easily read date format. Dates are stored in Day - Jul - Year format, which means we have to parse the month as text. The numerical values are also stored as characters, and any arithmetic performed on these columns would result in 0 (within Excel). It should also be noted that if the original file is opened in Excel, the data is heavily merged and awkward to work with.

In this post we will:
1. Load a .xls file into R
2. Convert specific columns to numerical formats
3. Parse date Day – Jul – Year format and convert it to yyyymmdd
4. Change the date order
5. Remove rows using a function

Lets begin!

You can download the SPY ETF data used in this example from their website (google).

In the last post we worked with a .csv file. This time round the data is in .xls format. We can work with .xls in R. For loading the .xls file in R we can use the readxl package.

To install readxl:

install.packages('readxl')
library(readxl)

Like last time we set our data directory and our file name; this time the file format is .xls.

# Data path
data.dir <- "C:/your/source/dir/here" #remember the forward /
data.file1 <- paste(data.dir,"SPY_HistoricalNav.xls",sep="/")

We use the readxl package to read the .xls file as a data.frame using the read_excel command:

# Read data
read.spy <- read_excel(data.file1, skip=3)

Now that our data is loaded as a data.frame we can check the formats of our file using str() command:

> str(read.spy)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	3452 obs. of  9 variables:
 $ Date              : chr  "25-Jul-2017" "24-Jul-2017" "21-Jul-2017" "20-Jul-2017" ...
 $ Nav               : chr  "247.364738" "246.64531" "246.901703" "246.991417" ...
 $ Shares Outstanding: chr  "980382116.0" "975232116.0" "961932116.0" "962982116.0" ...
 $ Total Net Assets  : chr  "242511965719.74" "240536427437.52" "237502677431.15" "23784

Note the date is in character format stored as Day-Jul-2017. The month is stored as text rather than a traditional numerical value. The historical NAV, shares outstanding and total net assets columns are also stored as characters. If we were to perform arithmetic on these columns as they are, the results would be wrong (in Excel the result is simply 0). We therefore proceed to change the Date column to the correct date format and the other columns to numerical format.

We can convert to numerical format as below with the help of magrittr package.

# Convert Nav `Shares Outstanding` `Total Net Assets` columns to numeric format
# Using magrittr
cols <-c(2:4)
read.spy[,cols] %<>% lapply(function(x) as.numeric(as.character(x)))

If we look at the date format with head(read.spy) you will notice the format is 25-Jul-2017: the month is written as text (Jul, Jan, etc.). We wish to parse this with the parse_date_time function from the lubridate package:

# Convert date from Day - Jan - Year format to readable date format
spy.date <- parse_date_time(x = read.spy$Date,
                            orders = c("d B Y"),
                            locale = "eng")

Notice orders = c("d B Y"): the B in the middle signifies that the month is written in text form. Thus we can parse the date format and then convert it to yyyymmdd format with the following line:

# Convert dates to YYYMMDD
cspydate <- as.character(as.Date(spy.date, "%m/%b/%Y"), format = "%Y%m%d")

Notice we placed the parsed dates in the spy.date variable (see above) and stored the converted dates in the cspydate variable. We wish to join the final yyyymmdd dates stored in cspydate back into our original data frame. We do this as below:

# Make new dataframe with yyyymmdd and original data
new.df.spy <- data.frame(cspydate,read.spy)

Quick note here: if we head(new.df.spy) we will see that cspydate was added to our original data frame with the column name cspydate. We now want to drop the old original Date column, which is stored in column [2]. After it's dropped we rename cspydate to Date as below:

# Drop original Date column from dataframe
final.df.spy <- new.df.spy[,-2]

# Rename Column Name to Date
colnames(final.df.spy)[1] <- "Date"

Next we will change the order of our data

# Sort data.frame from 'newest to oldest' to 'oldest to newest'
# Using plyr package
final.df.spy <- arrange(final.df.spy, Date)

If we tail(final.df.spy) we will see that there are rows with NA values. I want to remove these from the bottom of the data frame using the following function:

# Remove NA rows at bottom of data frame
# Note Requires changing row numbers
removeRows <- function(rowNum, data) {
  newData <- data[-rowNum, , drop = FALSE]
  rownames(newData) <- NULL
  newData
}
final.df.spy <- removeRows(3441:3452, final.df.spy)

Note this requires us to change the row numbers as new data is added. There is likely a more elegant way to remove rows from the bottom of a data frame so that it is robust to new rows of data being added. If you know a better way, please drop a message in the comment box.
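
One possible more robust alternative (a sketch, not the approach used in the original code) is to drop any rows containing NA values instead of hard-coding row numbers:

# Keep only fully populated rows (drops the NA disclaimer rows at the bottom of the file)
final.df.spy <- final.df.spy[complete.cases(final.df.spy), ]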

With our data frame now clean, we can then export as .csv

# Write dataframes to csv file
write.csv(final.df.spy,file='C:/R Projects/spy_clean.csv', row.names = FALSE)

In upcoming posts we will touch on working with loops, where we can perform the same functions above on multiple files. For example, we can process the above commands on every .xls historical ETF data file available from State Street Global Advisors. The benefit of the loop is that we reduce the lines of code that have to be written.

The succinct code is below:

#Load in State Street Global Advisors .xls
#Clean data, convert date, convert to numerical, sort by date, remove NA values at bottom of data frame
#Export as .csv
#Andrew Bannerman
#7.26.2017

require(readxl)
require(lubridate)
require(magrittr)
require(plyr)

# Data path
data.dir <- "G:/R Projects"
data.file1 <- paste(data.dir,"SPY_HistoricalNav.xls",sep="/")

# Read data
read.spy <- read_excel(data.file1, skip=3)

# Convert Nav `Shares Outstanding` `Total Net Assets` columns to numeric format
# Using magrittr
cols <-c(2:4)
read.spy[,cols] %<>% lapply(function(x) as.numeric(as.character(x)))

# Convert date from Day - Jan - Year format to readable date format
spy.date <- parse_date_time(x = read.spy$Date,
                            orders = c("d B Y"),
                            locale = "eng")

# Convert dates to YYYMMDD
cspydate <- as.character(as.Date(spy.date, "%m/%b/%Y"), format = "%Y%m%d")

# Make new dataframe with yyyymmdd and original data
new.df.spy <- data.frame(cspydate,read.spy)

# Drop original Date column from dataframe
final.df.spy <- new.df.spy[,-2]

# Rename Column Name to Date
colnames(final.df.spy)[1] <- "Date"

# Sort data.frame from 'newest to oldest' to 'oldest to newest'
# Using plyr package
final.df.spy <- arrange(final.df.spy, Date)

# Remove NA rows at bottom of data frame
# Note Requires changing row numbers
removeRows <- function(rowNum, data) {
  newData <- data[-rowNum, , drop = FALSE]
  rownames(newData) <- NULL
  newData
}
final.df.spy <- removeRows(3441:3452, final.df.spy)

# Write dataframes to csv file
write.csv(final.df.spy,file='C:/R Projects/spy_clean.csv', row.names = FALSE)