Julia – Build any time resolution using 1 minute data

Reliable data makes for more accurate models. It is not the end of the world if there are minor discrepancies, although the data does need to be representative in order to build models and make good assumptions.

Common data errors tend to be found at market closing times. We want the auction price, not the last price: the last print might be some fluff trade for 1 lot. We want the real close, the auction close, which increases the accuracy of the daily close data.

To achieve this we can build any resolution from 1 minute bars and sample them at whichever time stamps we wish. For example, if we want more accurate daily close data using the 15:15 auction close for ES, we simply take the 1 minute close at the 15:15 time stamp each day.

If we want to build models that avoid a period of volatility, say the last 15 minutes of trade, we may instead sample the 15:00 time stamp each day.

So, in order to have more control over how the data is created, I wrote the code attached to this blog post.

If we build 15 minute data, we may sample the 1 minute close price at each :00, :15, :30 and :45 time increment.

For 30 minute data, we may sample the 1 minute close price at each :00 and :30 time increment.
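
As a rough sketch of the idea (not the full script linked below), assuming the 1 minute bars sit in a DataFrame with :Date, :Time and :Close columns (the name bars_1m is illustrative), the sampling is just a filter on the minute stamps:

using DataFrames, Dates

# daily close sampled at the 15:15 ES auction close
daily_close = filter(row -> row.Time == Time(15, 15), bars_1m)

# 15 minute bars: keep the 1 minute close at each :00, :15, :30, :45 mark
bars_15m = filter(row -> minute(row.Time) % 15 == 0, bars_1m)

# 30 minute bars: keep the 1 minute close at each :00 and :30 mark
bars_30m = filter(row -> minute(row.Time) % 30 == 0, bars_1m)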

# Handling missing data

Where missing data was encountered it was dealt with by forward filling the closest prior value. Say the time stamps run: 09:01, 09:02, 09:03, 09:05, 09:07, 09:08, so that 09:04 and 09:06 are missing. To fill the missing 09:04 we forward fill the 09:03 data point, and to fill the missing 09:06 we forward fill 09:05.
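
A minimal sketch of that kind of forward fill (the use of missing for absent minutes and the example values are illustrative, not the exact code from the post):

# forward fill (na.locf style): carry the last observed value into missing slots
function forward_fill(x::AbstractVector)
    out = copy(x)
    for i in 2:length(out)
        if ismissing(out[i])
            out[i] = out[i-1]
        end
    end
    return out
end

# closes for 09:01 ... 09:08 with 09:04 and 09:06 missing
forward_fill([2900.0, 2900.25, 2900.5, missing, 2901.0, missing, 2901.5, 2901.25])
# 09:04 takes the 09:03 value, 09:06 takes the 09:05 value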

This methodology seems consistent with how TradeStation builds their larger resolution data from 1 minute data, although if a data point is simply too far away TS has an ignore feature, so that point would be missed entirely in the new time resolution sample. (After studying further, TS forward or back fills missing data if it falls on the same day; I am now looking to make this edit as it makes complete sense!)

Despite this, I feel the procedure deployed and the low frequency of missing data make the data set good enough to build models on.

# Data Accuracy

A point to note pertaining to data accuracy: when building models on OHLC price data it makes sense to form statistical distributions by altering the original reported OHLC prices by some % of the day's ATR over many iterations. That way we may observe how sensitive the models are to changes in the reported OHLC prices. In the absence of expensive bid/ask data this is a good alternative.
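
A hedged sketch of that idea, where ohlc (an n x 4 matrix of Open, High, Low, Close), atr, pct and backtest are all illustrative names:

# shift each day's OHLC bar by a random fraction of that day's ATR
function perturb_ohlc(ohlc::AbstractMatrix, atr::AbstractVector; pct = 0.25)
    noise = (2 .* rand(size(ohlc, 1)) .- 1) .* pct .* atr  # uniform draw in [-pct*ATR, +pct*ATR]
    return ohlc .+ noise                                   # same shift applied to O, H, L and C of each day
end

# run the model over many perturbed copies and study the distribution of results, e.g.
# results = [backtest(perturb_ohlc(ohlc, atr)) for _ in 1:1000]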

# How to have confidence

The data of course has to be representative of, and accurate to, how prices unfolded in real time; artificial signals would ensue if the data were of terrible quality. For this reason, by forming distributions we may land in some place of confidence. If a trading model can take a hammering over the random % ATR adjustments of the day's prices, future confidence can be gained: it suggests the model is fit to signal and that noise/errors in OHLC reporting do not throw it off and make it redundant.

The code for building any resolution may be found on my GitHub, with the steps as follows:

  1. Load ES daily data to index the days of trading. This removes all market holiday closures.
  2. Create a date and time index and join the original 1 minute data to the date/time index (sketched after this list).
  3. Forward fill missing data. The inner for loop acts like na.locf.
  4. Find half day market closures. Here we load 30 minute data and loop through looking for early closures and a late market open (once, Sep 11, 2002).
  5. Save market holiday dates and times and loop through the series, filling in the holiday times with “” or 0.0.
  6. Reduce to remove all 0 and “” entries.
  7. Build any time frame resolution: 10 min, 15 min, 30 min etc.
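
For steps 1 and 2, a rough sketch of the date/time index and join (assuming es_daily and bars_1m DataFrames; the names and layout are illustrative) might look like:

using DataFrames, Dates

# step 1: trading days taken from the daily data (market holidays drop out)
trading_days = unique(es_daily.Date)

# step 2: full Date x Time grid, one stamp per minute, then join the raw 1 minute bars onto it
minutes = [Time(0, 0) + Minute(m) for m in 0:1439]
grid = DataFrame(Date = repeat(trading_days, inner = length(minutes)),
                 Time = repeat(minutes, outer = length(trading_days)))
joined = leftjoin(grid, bars_1m, on = [:Date, :Time])
sort!(joined, [:Date, :Time])  # minutes with no trade now show as missing, ready for step 3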

Hope you enjoy and can save time vs. downloading each and every time resolution 🙂

See the bar plot of the number of errors in the re-sampled data vs. TradeStation:

[Bar plot: number of discrepancies per field, re-sampled data vs. TradeStation]

julia> out_all
6×2 DataFrames.DataFrame
│ Row │ x1    │ x2 │
├─────┼───────┼────┤
│ 1   │ Date  │ 0  │
│ 2   │ Time  │ 0  │
│ 3   │ Open  │ 58 │
│ 4   │ High  │ 9  │
│ 5   │ Low   │ 13 │
│ 6   │ Close │ 0  │

There are 58 discrepancies involving the open price. With a total of 67387 30 minute bars in the example, this represents 0.0861% of the total sample.

TradeStation had missing data around the opening 1 minute prices. I forward filled the previous day's close in this case, where it would be more accurate to backfill the next available quote on the same day; I will likely update this soon. There were other special closes, such as 1 minute silences, which I didn't account for in my data set. A list may be found here:
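
For reference, the same-day backfill mentioned above might look something like this sketch (dates and close are aligned vectors with missing for absent minutes; the names are illustrative):

# back fill the next available quote, but only within the same trading day
function backfill_same_day(dates::AbstractVector, close::AbstractVector)
    out = copy(close)
    for i in length(out)-1:-1:1
        if ismissing(out[i]) && dates[i] == dates[i+1]
            out[i] = out[i+1]
        end
    end
    return out
end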

Nyse-Closings.pdf

Thanks for reading.

Estimating the Hurst Exponent Using R/S Range

In this post we will estimate the Hurst exponent using the R/S method. The Hurst exponent measures the long range memory of a time series (more below). If a series has no memory, i.e. if each point in time is independent from previous points in time, then it is said to be more of a random process. Examples of random processes are Markov processes, Brownian motion and white noise.

A series which trends has autocorrelation, where one point in time is correlated with a lagged version of itself. The autocorrelation of a series may be calculated with the following formula:

$$ r_k = \frac{\sum_{i=1}^{N-k} (Y_i - \bar{Y})(Y_{i+k} - \bar{Y})}{\sum_{i=1}^{N} (Y_i - \bar{Y})^2} $$

In English, the top part of the equation (numerator) is the covariance, where $Y_i$ is the series and $Y_{i+k}$ some lagged version of itself. The bottom part of the equation (denominator) is the variance.

So we simply calculate the covariance of the series and a lagged version of itself and divide by the variance.
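
In code, a sketch of this calculation for any lag k is essentially one line (close below is a placeholder for whichever price series you use):

using Statistics

# lag-k autocorrelation: covariance of the series with its lagged self, divided by the variance
autocorr(y::AbstractVector, k::Int) = cov(y[1+k:end], y[1:end-k]) / var(y)

# e.g. autocorr(close, 1), autocorr(close, 5), autocorr(close, 20)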

We may plot the autocorrelation of the S&P 500 futures contract close prices.
source: https://www.quandl.com/data/CHRIS/CME_SP1-S-P-500-Futures-Continuous-Contract-1-SP1-Front-Month

I cannot validate the accuracy of this data, but it is good enough for illustration purposes.

[Figure: auto_cor.png, autocorrelation of S&P 500 futures close prices]

This is far from random: close prices have a high degree of correlation with lagged versions of themselves.

Now we may estimate the Hurst exponent over an S&P futures return series using the rescaled range analysis method (R/S). This is the original method of Harold Edwin Hurst, who devised the formula to measure the mean reversion tendency of the Nile river.

We shall estimate H for a 1 bar return series and a 100 bar return series.

The R/S calculation steps are depicted on Wikipedia.

The Hurst exponent is more of an estimate than an absolute calculation. Corresponding H values signal:

H less than 0.5 = anti-persistent behavior, mean reverting
H equal to 0.5 = random process
H greater than 0.5 = persistent behavior, trending

The procedure below calculates the R/S range and estimates the Hurst exponent by regressing log(R/S) against log(n), where n is the block size (lag):

1. Calculate a return series. We will estimate the Hurst exponent on a 1 bar return series. The thesis here is that it will be close to a random process if each bar really is independent from the last (no autocorrelation), so H = 0.5. If we pick a longer return series such as 100 bars, we would expect to see higher H values and higher autocorrelation coefficients.
2. The Hurst exponent is estimated by regression of a power law, log(R/S) vs log(n), the slope being the Hurst exponent. For that reason logarithmically spaced block sizes were chosen: [8, 16, 32, 64, 128, 256, 512, 1024].
3. Calculate the mean for each block size, in this case 1:8, 2:9 etc., or i-n+1:i. Do this for each block size.
4. For said block size, subtract the mean from each value in the return series.
5. Sum the deviations from the mean.
6. Find the maximum and minimum of the summed deviations from the mean.
7. Find R, the range of the summed deviations from the mean, by subtracting the minimum from the maximum.
8. Calculate the standard deviation of the deviations from the mean.
9. Calculate R/S, the rescaled range, by dividing R by the standard deviation of the deviations from the mean.
10. After rolling through the series calculating the R/S value for each block size, take the mean R/S value per block size, so that for each lag (block size) we have one mean value.
11. Perform a regression of log2(mean R/S) vs log2(n). The slope is the Hurst exponent. The procedure above is detailed below using the Julia language:


####################################################################################
# Rolling Hurst
####################################################################################

# statistics functions used below (mean, var, cov, cor)
using Statistics

# initialize outputs (one column is appended per lag)
log_n_out = Array{Float64}(undef, length(d_es_close), 0)
out_RS = Array{Float64}(undef, length(d_es_close), 0)

# Set lags (a range, or specific logarithmically spaced block sizes)
# lags = 100:200   # range of lags; comment out where necessary
lags = [8,16,32,64,128,256,512,1024]
# Specify return series lag
n_lag = 2000

# loop over each block size (lag)
for (c, n) in enumerate(lags)
    # Calculate returns of the series
    #n_lag = lags[c] # set n_lag = 1 for 1 day returns
    rets = zeros(length(d_es_close))
    for i in n_lag+1:length(d_es_close)
        rets[i] = (d_es_close[i] / d_es_close[i-n_lag]) - 1
    end
    #rets = d_es_close
    # Rolling mean over each block of width n
    mean_out = zeros(length(rets))
    for i = n:length(rets)
        mean_out[i] = mean(rets[i-n+1:i])
    end
    # Deviations from the rolling mean
    dev_mean_out = zeros(length(mean_out))
    for i = n:length(mean_out)
        dev_mean_out[i] = rets[i] - mean_out[i]
    end
    # Rolling sum of the deviations from the mean
    sum_out = zeros(length(dev_mean_out))
    for i = n:length(dev_mean_out)
        sum_out[i] = sum(dev_mean_out[i-n+1:i])
    end
    # Maximum and minimum of the summed deviations over the block
    max_out = zeros(length(sum_out))
    for i = n:length(sum_out)
        max_out[i] = maximum(sum_out[i-n+1:i])
    end
    min_out = zeros(length(sum_out))
    for i = n:length(sum_out)
        min_out[i] = minimum(sum_out[i-n+1:i])
    end
    # R = range of the summed deviations, max - min
    R_out = zeros(length(dev_mean_out))
    for i = n:length(dev_mean_out)
        R_out[i] = max_out[i] - min_out[i]
    end
    # Rolling standard deviation of the deviations from the mean
    stdev_out = zeros(length(rets))
    for i = n:length(rets)
        stdev_out[i] = sqrt(var(dev_mean_out[i-n+1:i]))
    end
    # Rescaled range R/S (range divided by the standard deviation)
    RS_out = zeros(length(R_out))
    for i = n:length(R_out)
        RS_out[i] = R_out[i] / stdev_out[i]
    end
    # log2 of the block size n (left at zero until a full block is available)
    index = fill(n, length(rets))
    log_n = zeros(length(rets))
    for i = n:length(index)
        log_n[i] = log2(index[i])
    end

    # append this lag's columns to the output matrices
    # (declared global so the loop also works when run as a script)
    global log_n_out = hcat(log_n_out, log_n)
    global out_RS = hcat(out_RS, RS_out) # re-scaled range
end

# access dims of the matrix
# row ,col
#log_rs_out[20,:]

# Expected value of R/S for each block size:
# rolling mean of each lag's R/S column over a window of n rows
expected_RS = zeros(size(out_RS,1), size(out_RS,2))
n = 100
for j = 1:size(out_RS,2)      # loop on column dimension (lags)
    for i = n:size(out_RS,1)  # loop on row dimension
        expected_RS[i,j] = mean(out_RS[i-n+1:i,j])
    end
end

# log2 of the expected R/S values
expected_log_RS = zeros(size(out_RS,1), size(out_RS,2))
for j = 1:size(expected_log_RS,2)  # loop on column dimension
    for i = n:size(expected_log_RS,1) # loop on row dimension
        expected_log_RS[i,j] = log2(expected_RS[i,j])
    end
end

    # Regression slope of log(n) and expected value of log(R/S)
    # x = log(n), y = log(R/S)
    b_slope = zeros(size(out_RS,1))
    A_intercept = zeros(size(out_RS,1))

    for i = n:size(out_RS,1)
        xi = log_n_out[i,:]  # grab the row for varying lags of log_n
        yi = expected_log_RS[i,:] # grab the row for varying lags of r/s value
        # Mean of X (Mx)
        Mx = mean(xi)
        # Mean of Y (My)
        My = mean(yi)
        # Standard deviation of X (sX)
        sX = sqrt(var(xi))
        # Standard Deviation of Y (sY)
        sY = sqrt(var(yi))
        # Correlation of x and y  (r)
        r = cor(xi,yi)
        # slope (b) = r * (sY/sX)
        b = r * (sY/sX)
        # find intercept A = My - b*Mx
        A = My - (b * Mx)

        # store outputs
        b_slope[i] = b
        A_intercept[i] = A
    end

Using the code above we may plot the Hurst exponent for a 1 bar return series.

[Figure: rolling Hurst exponent for a 1 bar return series]

If we have a random process, successive points in time are independent of each other. We see Hurst values fluctuating around the 0.50 area, with H around 0.75 in 2009 during a strong trending period.

We may see the effect of analyzing a longer return stream, let's say a 100 bar return series:

[Figure: H_close.png, rolling Hurst exponent for a 100 bar return series]

We see Hurst values closer to 1 for a 100 bar return series.

Furthermore, to see how the Hurst exponent and autocorrelation are related, we may calculate a rolling autocorrelation:

# Rolling Autocorrelation
# Sliding window of width n, autocorrelation at lag k
using Statistics  # mean, var, cov

mean_out = zeros(size(d_es_close,1))
var_out = zeros(size(d_es_close,1))
cov_out = zeros(size(d_es_close,1))
autocor_out = zeros(size(d_es_close,1))
devmean_out = zeros(size(d_es_close,1))
n = 1000     # window width
k = 1        # autocorrelation lag
n_lag = 100  # return series lag

# Calculate the n_lag bar return series
rets = zeros(size(d_es_close,1))
for i in n_lag+1:length(d_es_close)
    rets[i] = (d_es_close[i] / d_es_close[i-n_lag]) - 1
end

# lag the series by k (pad the front with zeros)
lagged_series = [zeros(k); rets[1:end-k]]

# Calculate the mean of the rolling  sample
for i = n:size(rets,1)
    mean_out[i] = mean(rets[i-n+1:i])
end

# Calculate deviations from the mean
for i = n:size(rets,1)
    devmean_out[i] = rets[i] - mean_out[i]
end

# Calculate rolling variance
for i = n:size(rets,1)
    var_out[i] = var(rets[i-n+1:i])
end

# Calculate rolling covariance
for i = n:size(rets,1)
    if i+k >= size(rets,1)
        break
    end
    cov_out[i] = cov(rets[i-n+1:i],lagged_series[i-n+1:i])
end

# Rolling Autocorrelation
for i =n:size(rets,1)
    autocor_out[i] = cov_out[i] / var_out[i]
end

For a 1 bar return series at lag k = 1:

[Figure: auto_cor_1_bar, rolling autocorrelation of the 1 bar return series]

We see very little correlation from one point in time to the next. The respective Hurst exponent is in the 0.40 to 0.55 range.

And the autocorrelation for a 100 bar return series at lag k = 1:

[Figure: auto_cor_100_bar.png, rolling autocorrelation of the 100 bar return series]

We see strong correlation for the 100 bar return series, with correlation coefficients greater than 0.96 and respective Hurst exponents closer to 1.0.

So what's the point of all of this?

As depicted above, information pertaining to the (non) randomness of a series may be of some benefit when designing models to fit to the data. The autocorrelation / Hurst exponent may be used to complement existing strategies, signalling the type of regime the market is currently in, and may add to strategies which better suit current conditions.

As a note of interest, on a 1 bar daily return series for the SPY ETF at lags 2:20:

[Figure: H_two_twenty, rolling Hurst exponent of SPY daily 1 bar returns, lags 2:20]

Values of H are below 0.50 and in mean reversion territory. Let's shorten the lags one more time, to 2:10.

[Figure: H_two_ten, rolling Hurst exponent of SPY daily 1 bar returns, lags 2:10]

We see significantly mean reverting H values. At holding periods below 20 bars, a mean reversion model would be best suited.

A way I like to think of this is: how quickly does something diffuse from the mean? A mean reverting series is almost stuck in the mud, with prices centered around a mean. I like to think of it as a compressed coil or, in terms of the fractal dimension, the roughness of a surface, where mean reversion is most certainly rough. On the other side, a strong trend has a faster rate of diffusion from the mean or, in the fractal dimension, a smooth surface. The S&P 500 is a blend of this action: on a shorter holding period there is a mean reversion tendency (rough), but on a longer time horizon the market displays trending behavior (smooth).

In closing, and to recap: we see how the Hurst exponent and the autocorrelation of a series are related to each other. Random processes show no dependence from one point in time to a successive point in time, having no correlation and a Hurst value of 0.5. On the other side of this, a series showing persistent behavior, a trending series, displayed high correlation coefficients where successive points in time were correlated with each other, with H values closer to 1.

Code may be found on my GitHub.