Reliable data makes for more accurate models. It is not the end of the world if there are minor discrepancies although data does need to be representative to build models and make good assumptions.
Common data errors are known to be found at market closing times. We want the auction price not the last price. Last price might be some fluff trade with 1 lot. We want the real close or the auction close. This increases the accuracy of the daily close data.
To achieve this we can create any resolution using 1 minute bars and sample them at which ever time resolution one wishes. For example. If we want to create more accurate daily close data using the auction close price at 15:15 for ES. We may simply sample every 1 minute close at time stamp of 15:15.
If we want to build models and avoid a period of volatility, the last 15 minutes of trade we may sample the time at every 15:00 time stamp.
So in order to have more control over the creation of data I created the code attached to this blog post.
If we build 15 minute data. We may sample the 1 minute close price at each :00, :15, :30, :45 time increment.
For 30 minute data. We may sample the 1 minute close price at each :00 and :30 time increment.
# Handling missing data
Where missing data was experienced this was dealt with by forward filling the closest value. If there was a time stamp as follows: 09:01, 09:02, 09:03, 09:05, 09:07, 09:08. Where 09:04 and 09:06 are missing. To fill missing 09:04 we forward fill 09:03 data points and to fill missing 09:06 we forward fill 09:05.
This methodology seems consistent with how tradestation builds their larger resolution data from 1 minute data. Although if the data point is simply too far away, TS has an ignore feature (after studying further TS forward or back fill missing data if on the same day, I am since now looking to make this edit as it makes complete sense!). So they would miss the point 100% in the new time resolution sample.
Despite this, I feel the procedure deployed and the low frequency of missing data makes the data set good enough to build models on.
# Data Accuracy
A point to note pertaining to data accuracy. When building models on OHLC price data it makes sense to form statistical distributions by altering the original reported OHLC price by some % of the days ATR over many iterations. Thus we may observe how sensitive the models are to changes in the reported OHLC prices. In the absence of expensive bid ask data this is a good alternative.
# How to have confidence
The data of course has to be representative and accurate to how it unfolded in real time. Artificial signals would ensue if the data was of terrible quality. For this reason by forming distributions we may land in some place of confidence. If a trading model can take a hammering over the days random % ATR adjustment. Future confidence can be gained and suggests the model is fit to signal where noise/errors in OHLC reporting does not throw off the model and make it redundant.
The code for building any resolution may be found on my github with the steps as:
- Load ES daily data to index the days of trading. This removes all market holiday closures.
- Create a date and time index and join original 1 minute data to the date/time index.
- Forward fill missing data. For loop within acts like na.locf.
- Find half day market closures. Here we load 30 minute data and loop through looking for early closures and late market open (1x, sep 11, 2002)
- Save market holiday dates and times and loop through the series filling in the holiday times with “” or 0.0
- Reduce to remove all 0 and “”
- Build any time frame resolution. 10min, 15 min, 30 min etc…
Hope you enjoy and can save time vs. downloading each and every time resolution 🙂
See bar plot of number of errors in re-sampled vs Tradestation:
│ Row │ x1 │ x2 │
│ 1 │ Date │ 0 │
│ 2 │ Time │ 0 │
│ 3 │ Open │ 58 │
│ 4 │ High │ 9 │
│ 5 │ Low │ 13 │
│ 6 │ Close │ 0 │
58 discrepancies involved with the open price. There are a total of 67387 30 min bars in the example. This represents 0.0861% of the total sample.
Tradestation had missing data around the open 1 min prices. I forward filled the previous day close in this case. Where it would be more accurate to backfill the next available quote on the same day. I will likely update this pretty soon. There were other special closes such as 1 minute silences which I didn’t account for in my data set. A list may be found:
Thanks for reading.