Julia – Build any time resolution using 1 minute data

Reliable data makes for more accurate models. Minor discrepancies are not the end of the world, but the data does need to be representative in order to build models and make good assumptions.

Common data errors are often found around market closing times. We want the auction price, not the last price: the last price might be some fluff trade with 1 lot. We want the real close, the auction close. This increases the accuracy of the daily close data.

To achieve this we can build any resolution from 1 minute bars and sample them at whichever time stamps we wish. For example, if we want to create more accurate daily close data using the 15:15 auction close price for ES, we simply sample the 1 minute close at every 15:15 time stamp.

If we instead want to build models that avoid a period of volatility, the last 15 minutes of trade, we may sample at every 15:00 time stamp.

So, in order to have more control over the creation of the data, I created the code attached to this blog post.

To build 15 minute data, we sample the 1 minute close price at each :00, :15, :30 and :45 time increment.

For 30 minute data, we sample the 1 minute close price at each :00 and :30 time increment.
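Since the full Julia code is linked further down, here is the sampling idea sketched in R with toy data; the column layout and prices are assumptions for illustration:

# Toy 1 minute bars (assumed layout); sample the close at fixed time stamps
es.1min <- data.frame(
  Time  = c("15:00", "15:01", "15:14", "15:15", "15:29", "15:30"),
  Close = c(2470.25, 2470.50, 2471.00, 2471.25, 2470.75, 2471.50),
  stringsAsFactors = FALSE
)
minute <- as.integer(substr(es.1min$Time, 4, 5))

daily.close <- es.1min[es.1min$Time == "15:15", ]       # 15:15 auction close
bars.15min  <- es.1min[minute %in% c(0, 15, 30, 45), ]  # 15 minute sampling
bars.30min  <- es.1min[minute %in% c(0, 30), ]          # 30 minute sampling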

# Handling missing data

Where missing data was experienced, it was dealt with by forward filling the closest prior value. Suppose the time stamps run 09:01, 09:02, 09:03, 09:05, 09:07, 09:08, where 09:04 and 09:06 are missing. To fill the missing 09:04 we forward fill the 09:03 data points, and to fill the missing 09:06 we forward fill 09:05.
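As a rough illustration in R (toy inputs; the attached code does this in Julia), joining the raw data to a complete minute index and carrying the last value forward reproduces the fill described above:

# Join raw 1 minute data to a complete minute index, then forward fill
library(zoo)

full.index <- data.frame(Time = sprintf("09:%02d", 1:8), stringsAsFactors = FALSE)
raw <- data.frame(Time  = c("09:01", "09:02", "09:03", "09:05", "09:07", "09:08"),
                  Close = c(1.25, 1.50, 1.75, 2.00, 2.25, 2.50),
                  stringsAsFactors = FALSE)

joined <- merge(full.index, raw, by = "Time", all.x = TRUE)  # 09:04, 09:06 become NA
joined$Close <- na.locf(joined$Close)  # 09:04 takes 09:03's value, 09:06 takes 09:05's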

This methodology seems consistent with how TradeStation builds its larger resolution data from 1 minute data, although if a data point is simply too far away, TS has an ignore feature, so it would miss the point entirely in the new time resolution sample. (After studying further: TS forward or back fills missing data if it is on the same day. I am now looking to make this edit, as it makes complete sense!)

Despite this, I feel the procedure deployed, together with the low frequency of missing data, makes the data set good enough to build models on.

# Data Accuracy

A point to note pertaining to data accuracy: when building models on OHLC price data, it makes sense to form statistical distributions by altering the original reported OHLC prices by some % of the day's ATR over many iterations. We may then observe how sensitive the models are to changes in the reported OHLC prices. In the absence of expensive bid/ask data, this is a good alternative.
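As a hedged sketch of that idea (the variable names, toy prices and the +/- 10% band are my assumptions, not taken from the attached code):

# Jitter each reported OHLC price by a random % of the day's ATR, many times,
# to build a distribution of plausible price sets to re-test a model against
set.seed(1)
ohlc <- c(Open = 2470.00, High = 2475.25, Low = 2465.50, Close = 2472.75)
atr  <- 12.5  # that day's ATR (assumed)
pct  <- 0.10  # maximum jitter as a fraction of ATR (assumed)

jitter.ohlc <- function() ohlc + runif(length(ohlc), -pct, pct) * atr
samples <- t(replicate(1000, jitter.ohlc()))  # 1000 x 4 matrix of jittered OHLC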

# How to have confidence

The data of course has to be representative of, and accurate to, how trading unfolded in real time; artificial signals would ensue if the data were of terrible quality. For this reason, by forming distributions we may land in some place of confidence. If a trading model can take a hammering through the random % ATR adjustments of the daily prices, future confidence can be gained: it suggests the model is fit to signal, and that noise/errors in OHLC reporting do not throw the model off and make it redundant.

The code for building any resolution may be found on my GitHub, with the steps as follows:

  1. Load ES daily data to index the days of trading. This removes all market holiday closures.
  2. Create a date and time index and join the original 1 minute data to the date/time index.
  3. Forward fill missing data. The for loop within acts like R's na.locf.
  4. Find half day market closures. Here we load 30 minute data and loop through looking for early closures and a late market open (once, Sep 11, 2002).
  5. Save the market holiday dates and times and loop through the series, filling the holiday times with "" or 0.0.
  6. Reduce to remove all 0 and "" entries.
  7. Build any time frame resolution: 10 min, 15 min, 30 min, etc.

Hope you enjoy and can save time vs. downloading each and every time resolution 🙂

See the number of errors in the re-sampled data vs. TradeStation (shown as a bar plot in the original post; the counts are below):

julia> out_all
6×2 DataFrames.DataFrame
│ Row │ x1    │ x2 │
├─────┼───────┼────┤
│ 1   │ Date  │ 0  │
│ 2   │ Time  │ 0  │
│ 3   │ Open  │ 58 │
│ 4   │ High  │ 9  │
│ 5   │ Low   │ 13 │
│ 6   │ Close │ 0  │

58 discrepancies involve the open price. There are a total of 67,387 30 minute bars in the example, so this represents 0.0861% of the total sample.

TradeStation had missing data around the opening 1 minute prices. I forward filled the previous day's close in this case, where it would be more accurate to back fill the next available quote on the same day; I will likely update this soon. There were other special closes, such as 1 minute silences, which I didn't account for in my data set. A list may be found here:

Nyse-Closings.pdf

Thanks for reading.

R – SPY ETF Data Clean Up

In the first post we looked at how to load data into R, view it, convert formats and sort by date. In this post we will work with a data set from State Street Global Advisors, the provider of the SPY ETF. The data is quite awkward: dates are not in a typical, easily read date format but are stored as Day-Month-Year (e.g. 25-Jul-2017), which means we have to parse the month as text. The numerical values are also stored as characters, and any arithmetic performed on these columns would result in 0 (within Excel). It should also be noted that if the original file is opened in Excel, the data is heavily merged and awkward to work with.

In this post we will:
1. Load a .xls file into R
2. Convert specific columns to numerical formats
3. Parse the Day-Month-Year date format and convert it to yyyymmdd
4. Change the date order
5. Remove rows using a function

Let's begin!

You can download the SPY ETF data used in this example from the State Street Global Advisors website (google).

In the last post we worked with a .csv file. This time around the data is in .xls format, which we can also work with in R. For loading the .xls file into R we can use the readxl package.

To install readxl:

install.packages('readxl')
library(readxl)

Like last time we set our data directory and our file name; this time the file format is .xls.

# Data path
data.dir <- "C:/your/source/dir/here" #remember the forward /
data.file <- paste(data.dir,"SPY_HistoricalNav.xls",sep="/")

We use the readxl package to read the .xls file as a data.frame using the read_excel command:

# Read data
read.spy <- read_excel(data.file, skip=3)

Now that our data is loaded as a data.frame, we can check the column formats using the str() command:

> str(read.spy)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	3452 obs. of  9 variables:
 $ Date              : chr  "25-Jul-2017" "24-Jul-2017" "21-Jul-2017" "20-Jul-2017" ...
 $ Nav               : chr  "247.364738" "246.64531" "246.901703" "246.991417" ...
 $ Shares Outstanding: chr  "980382116.0" "975232116.0" "961932116.0" "962982116.0" ...
 $ Total Net Assets  : chr  "242511965719.74" "240536427437.52" "237502677431.15" "23784

Note the date is in character format, stored as Day-Jul-2017; the month is stored as text rather than a traditional numerical value. The historical NAV, shares outstanding and total net assets columns are also stored as characters. If we were to perform any arithmetic on these columns, R would throw a non-numeric argument error (and in Excel, as noted above, the result would be 0). We therefore proceed to change the Date column to the correct date format and the other columns to numeric format.

We can convert to numerical format as below with the help of the magrittr package.

# Convert Nav `Shares Outstanding` `Total Net Assets` columns to numeric format
# Using magrittr
cols <-c(2:4)
read.spy[,cols] %<>% lapply(function(x) as.numeric(as.character(x)))
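If you prefer base R without the magrittr pipe-assign, an equivalent (same effect) would be:

# Base R equivalent of the %<>% line above: convert columns 2:4 to numeric
read.spy[, cols] <- lapply(read.spy[, cols], function(x) as.numeric(as.character(x)))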

If we look at the date format with head(read.spy), you will notice the format is 25-Jul-2017: the month is written as text (Jul, Jan, etc.). We wish to parse this with the parse_date_time function from the lubridate package:

# Convert date from Day-Month-Year format to a readable date format
spy.date <- parse_date_time(x = read.spy$Date,
                            orders = c("d B Y"),
                            locale = "eng")

Notice orders = c("d B Y"): the B in the middle signifies that the month is written in text form. Thus we can parse the date format, and we can then convert it to yyyymmdd format with the following line:

# Convert dates to YYYYMMDD
cspydate <- format(as.Date(spy.date), "%Y%m%d")

Notice we placed the parsed dates in the spy.date variable (yes, look back up) and stored the date conversion in the cspydate variable. We now wish to join the final yyyymmdd dates stored in the cspydate variable back into our original data frame. We do this as below:

# Make new dataframe with yyyymmdd and original data
new.df.spy <- data.frame(cspydate,read.spy)

A quick note here: if we head(new.df.spy) we will see that cspydate was added to our original data frame with the column name cspydate. We now want to drop our old original Date column, which is stored in column [2]. After it is dropped, we rename cspydate to Date as below:

# Drop original Date column from dataframe
final.df.spy <- new.df.spy[,-2]

# Rename Column Name to Date
colnames(final.df.spy)[1] <- "Date"

Next we will change the order of our data:

# Sort data.frame from 'newest to oldest' to 'oldest to newest'
# Using plyr package
final.df.spy <- arrange(final.df.spy, Date)

If we tail(final.df.spy) we will see that there are rows with NA values. I want to remove these from the bottom of the data frame using the following function:

# Remove NA rows at bottom of data frame
# Note Requires changing row numbers
removeRows <- function(rowNum, data) {
  newData <- data[-rowNum, , drop = FALSE]
  rownames(newData) <- NULL
  newData
}
final.df.spy <- removeRows(3441:3452, final.df.spy)

Note this requires us to change the row numbers on a daily basis. There is likely a more elegant way to remove rows from the bottom of a data frame so that it is robust to new rows of data being added; one option is sketched below. If you know a better way, please drop a message in the comment box.
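For what it is worth, one more robust option, assuming the junk rows are exactly those containing NA values, is to keep only the complete rows:

# Drop any row containing NA values; no hard-coded row numbers needed
final.df.spy <- final.df.spy[complete.cases(final.df.spy), ]
rownames(final.df.spy) <- NULL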

With our data frame now clean, we can then export it as .csv:

# Write dataframes to csv file
write.csv(final.df.spy,file='C:/R Projects/spy_clean.csv', row.names = FALSE)

In upcoming posts we will touch on working with loops, where we can perform the same functions above on multiple files. For example, we can run the above commands on every .xls historical ETF data file available from State Street Global Advisors; a rough sketch follows. The benefit of the loop is that we reduce the lines of code that have to be written.
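As a rough preview, and with hypothetical file names, such a loop might look like:

# Hypothetical sketch: apply the same clean up to every .xls file in data.dir
files <- list.files(data.dir, pattern = "\\.xls$", full.names = TRUE)
for (f in files) {
  df <- read_excel(f, skip = 3)
  # ... same date parsing / numeric conversion / sorting steps as above ...
  out <- file.path(data.dir, paste0(tools::file_path_sans_ext(basename(f)), "_clean.csv"))
  write.csv(df, file = out, row.names = FALSE)
}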

The succinct code is below:

# Load State Street Global Advisors .xls
# Clean data, convert date, convert to numeric, sort by date, remove NA values at bottom of data frame
# Export as .csv
# Andrew Bannerman
# 7.26.2017

require(readxl)
require(magrittr)
require(plyr)
require(lubridate) # needed for parse_date_time

# Data path
data.dir <- "G:/R Projects"
data.file1 <- paste(data.dir,"SPY_HistoricalNav.xls",sep="/")

# Read data
read.spy <- read_excel(data.file1, skip=3)

# Convert Nav `Shares Outstanding` `Total Net Assets` columns to numeric format
# Using magrittr
cols <-c(2:4)
read.spy[,cols] %<>% lapply(function(x) as.numeric(as.character(x)))

# Convert date from Day-Month-Year format to a readable date format
spy.date <- parse_date_time(x = read.spy$Date,
                            orders = c("d B Y"),
                            locale = "eng")

# Convert dates to YYYYMMDD
cspydate <- format(as.Date(spy.date), "%Y%m%d")

# Make new dataframe with yyyymmdd and original data
new.df.spy <- data.frame(cspydate,read.spy)

# Drop original Date column from dataframe
final.df.spy <- new.df.spy[,-2]

# Rename Column Name to Date
colnames(final.df.spy)[1] <- "Date"

# Sort data.frame from 'newest to oldest' to 'oldest to newest'
# Using plyr package
final.df.spy <- arrange(final.df.spy, Date)

# Remove NA rows at bottom of data frame
# Note Requires changing row numbers
removeRows <- function(rowNum, data) {
  newData <- data[-rowNum, , drop = FALSE]
  rownames(newData) <- NULL
  newData
}
final.df.spy <- removeRows(3441:3452, final.df.spy)

# Write dataframes to csv file
write.csv(final.df.spy,file='C:/R Projects/spy_clean.csv', row.names = FALSE)

R – Load .csv Files Into R. Change Data Formats. Change Date Order. Export as .csv file.

If you are new to R, or want to make the move from Excel to working with R, then this post and subsequent posts should be useful.

The goal of this post is to walk step by step through loading data into R from a .csv file. Once the data is loaded, we will look at its format and make the correct conversions. This will make our data usable with other packages such as ggplot2 for plotting charts, or zoo and xts for time series. We will also change the order of our data by sorting by date. We will then export the worked data to .csv format.

In this post we will:

  1. Load .csv file from local computer
  2. Read .csv file as a data frame
  3. View the data using head, tail and print commands
  4. Check the format of each column within the data frame
  5. Convert the date column from 'character' to the correct 'Date' format
  6. Order the data from 'newest to oldest' to 'oldest to newest', i.e. the newest line of data at the bottom of the data set
  7. Save output to .csv file.

I will assume you have downloaded and installed R and also RStudio. RStudio is optional, but I find it a nice R GUI that I like to use.

First we are going to load our data into R. I am going to assume you have some .csv data that perhaps came from a subscription data provider or from the internet. If you do not have a .csv file in mind, you may download the UltraPro ETF historical NAV data as an example file to work with (google).

In this example I will be using the 3x ETF UPRO historical NAV / assets under management .csv file, which I downloaded from the ProShares website.

The following code sets a data directory to point to on our hard drive and specifies the .csv file name.

# Data path
data.dir <- "G:/R Projects"
data.read.upro <- paste(data.dir,"UPRO-historical_nav.csv",sep="/")

One note here about R and folder directories: notice the forward slashes are in this format /, whereas a traditional path on your computer is written the other way around with backslashes, such as C:\data\dir. If you have errors loading data into R, this may be the culprit. Just change your slashes to the / position.

Now I want to read the .csv file using the read.csv() command:

# Read data
read.upro <- read.csv(data.read.upro,header=TRUE, sep=",",skip=0,stringsAsFactors=FALSE)

header=TRUE because we have a header. sep=',' for comma delimited data. skip=0 means we do not skip any lines before reading the header. In the UPRO .csv the header is on line 1, so I do not wish to skip any lines when reading the data. If, for example, we had a file with the header on line (row) 4, we would set skip=3, which would then read the header on line 4.

stringsAsFactors=FALSE means any strings in the original .csv will be read in as characters. Without it, our strings would be imported as factors, which would then need to be converted to the correct format.
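For example, with a hypothetical file whose header sits on row 4:

# Hypothetical file: three junk lines above the header, so skip=3
df <- read.csv("C:/your/data/dir/file.csv", header = TRUE, sep = ",",
               skip = 3, stringsAsFactors = FALSE)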

The read.csv function above reads the data into R as a data frame.

# Check class
class(read.upro)
[1] "data.frame"

Let's check to see if our data loaded correctly. We can view our data in a few ways using head, tail and print. head views the top portion of our data, as below:

head(read.upro)
        Date            ProShares.Name Ticker     NAV Prior.NAV NAV.Change.... NAV.Change.....1
1 2009-06-23 ProShares UltraPro S&P500   UPRO 6.66675   6.66675        0.00000           0.0000
2 2009-06-24 ProShares UltraPro S&P500   UPRO 6.79750   6.66670        1.96199           0.1308
3 2009-06-25 ProShares UltraPro S&P500   UPRO 7.23425   6.79745        6.42594           0.4368
4 2009-06-26 ProShares UltraPro S&P500   UPRO 7.20500   7.23430       -0.40501          -0.0293
5 2009-06-29 ProShares UltraPro S&P500   UPRO 7.40250   7.20500        2.74115           0.1975
6 2009-06-30 ProShares UltraPro S&P500   UPRO 7.21500   7.40250       -2.53293          -0.1875
  Shares.Outstanding..000. Assets.Under.Management
1                  1200.01                 8000180
2                  1200.01                 8157082
3                  1200.01                 8681187
4                  1200.01                 8646086
5                  1200.01                 8883089
6                  3600.01                25974087

This reads the first (top) 6 lines. We can specify to read more lines by:

head(read.upro,12)
         Date            ProShares.Name Ticker     NAV Prior.NAV NAV.Change.... NAV.Change.....1
1  2009-06-23 ProShares UltraPro S&P500   UPRO 6.66675   6.66675        0.00000           0.0000
2  2009-06-24 ProShares UltraPro S&P500   UPRO 6.79750   6.66670        1.96199           0.1308
3  2009-06-25 ProShares UltraPro S&P500   UPRO 7.23425   6.79745        6.42594           0.4368
4  2009-06-26 ProShares UltraPro S&P500   UPRO 7.20500   7.23430       -0.40501          -0.0293
5  2009-06-29 ProShares UltraPro S&P500   UPRO 7.40250   7.20500        2.74115           0.1975
6  2009-06-30 ProShares UltraPro S&P500   UPRO 7.21500   7.40250       -2.53293          -0.1875
7  2009-07-01 ProShares UltraPro S&P500   UPRO 7.31075   7.21495        1.32780           0.0958
8  2009-07-02 ProShares UltraPro S&P500   UPRO 6.67325   7.31075       -8.72004          -0.6375
9  2009-07-06 ProShares UltraPro S&P500   UPRO 6.72250   6.67320        0.73878           0.0493
10 2009-07-07 ProShares UltraPro S&P500   UPRO 6.32325   6.72255       -5.93971          -0.3993
11 2009-07-08 ProShares UltraPro S&P500   UPRO 6.30325   6.32325       -0.31629          -0.0200
12 2009-07-09 ProShares UltraPro S&P500   UPRO 6.36825   6.30325        1.03121           0.0650
   Shares.Outstanding..000. Assets.Under.Management
1                   1200.01                 8000180
2                   1200.01                 8157082
3                   1200.01                 8681187
4                   1200.01                 8646086
5                   1200.01                 8883089
6                   3600.01                25974087
7                   3600.01                26318788
8                   3600.01                24023780
9                   3600.01                24201081
10                  3600.01                22763776
11                  3600.01                22691776
12                  3600.01                22925776

This reads the first (top) 12 lines.

If we want to look at the bottom of the data we can do this:

tail(read.upro)
           Date            ProShares.Name Ticker    NAV Prior.NAV NAV.Change....
2015 2017-06-22 ProShares UltraPro S&P500   UPRO 105.33    105.48       -0.14221
2016 2017-06-23 ProShares UltraPro S&P500   UPRO 105.81    105.33        0.45571
2017 2017-06-26 ProShares UltraPro S&P500   UPRO 105.90    105.81        0.08506
2018 2017-06-27 ProShares UltraPro S&P500   UPRO 103.32    105.90       -2.43626
2019 2017-06-28 ProShares UltraPro S&P500   UPRO 106.08    103.32        2.67131
2020 2017-06-29 ProShares UltraPro S&P500   UPRO 103.34    106.08       -2.58296
     NAV.Change.....1 Shares.Outstanding..000. Assets.Under.Management
2015            -0.15                     8450               890038500
2016             0.48                     8450               894094500
2017             0.09                     8650               916035000
2018            -2.58                     8500               878220000
2019             2.76                     8500               901680000
2020            -2.74                     8500               878390000

This is reading the bottom 6 lines. Again if we want to view more of the bottom lines we can simply:

tail(read.upro,20)

Which will read the bottom 20 lines.

If we want to view the entire data frame we can use print:

print(read.upro)

Note: if the data set, like in this case, is very large, it will max out the terminal. The default is about 111 lines.

To extend the max print we can use the following:

options(max.print=1000000)

This extends the limit of the print function.

Now that covers general viewing of the data. Next we want to check which formats our data is in. The data we loaded in this example is the 3x ETF UPRO from the ProShares UltraPro ETF family. This is time series data: each data point = 1 day, and we have a Date column as well as numerical values.

As you recall from earlier, we read the .csv file with the read.csv() command. This reads the data into R as a data frame.

We can check the format of our data as below using str() command:

str(read.upro)
'data.frame':	2020 obs. of  9 variables:
 $ Date                    : chr  "06/29/2017" "06/28/2017" "06/27/2017" "06/26/2017" ...
 $ ProShares.Name          : chr  "ProShares UltraPro S&P500" "ProShares UltraPro S&P500" "ProShares UltraPro S&P500" "ProShares UltraPro S&P500" ...
 $ Ticker                  : chr  "UPRO" "UPRO" "UPRO" "UPRO" ...
 $ NAV                     : num  103 106 103 106 106 ...
 $ Prior.NAV               : num  106 103 106 106 105 ...
 $ NAV.Change....          : num  -2.583 2.6713 -2.4363 0.0851 0.4557 ...
 $ NAV.Change.....1        : num  -2.74 2.76 -2.58 0.09 0.48 -0.15 -0.19 -2.18 2.63 0.08 ...
 $ Shares.Outstanding..000.: num  8500 8500 8500 8650 8450 8450 8450 8450 8500 8500 ...
 $ Assets.Under.Management : num  8.78e+08 9.02e+08 8.78e+08 9.16e+08 8.94e+08 ...

What the str() command has told us:

The data is in a data.frame.

The Date, ProShares.Name and Ticker columns are all stored as character (thanks to stringsAsFactors=FALSE during the reading of the .csv).

Our numbers are in numerical format. All good, apart from the Date column. If we want to plot our data with ggplot2 (or other packages) or create a time series object, we are going to run into formatting issues with the date being in character format. The easiest way to change the format is to use a package called lubridate.

We can install lubridate with the following command:

install.packages("lubridate")

We then need to load the package into R using:

 library(lubridate)

# Convert date column [1] using lubridate package
read.upro$Date <- mdy(read.upro$Date)

We chose the mdy order because within our data the Date column is ordered month-day-year. If the dates were in another order, we might change it to ymd, which would cater to a year-month-day order.
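A quick illustration with hypothetical dates (lubridate already loaded above):

# Each lubridate parser matches the order of the components in the input
mdy("06/29/2017")  # month-day-year -> "2017-06-29"
ymd("2017/06/29")  # year-month-day -> "2017-06-29"
dmy("29-06-2017")  # day-month-year -> "2017-06-29"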

Now, after running the lubridate date conversion, we can check the format of the Date column:

 str(read.upro$Date)
 Date[1:2020], format: "2017-06-29" "2017-06-28" "2017-06-27" "2017-06-26" "2017-06-23" "2017-06-22" ...

We now have our Date format as Date.

We also notice that the default date ordering from the UltraPro historical NAV .csv is 'newest to oldest': the newest data is always at the top of the data set. We can change this to 'oldest to newest' with the help of the plyr package. Like last time, we can install the plyr package with:

install.packages("plyr")

Next we load the package into R with:

 library(plyr) 

Now we can sort the date to ‘oldest to newest’:

# Sort data.frame from 'newest to oldest' to 'oldest to newest'
# Using plyr package
read.upro <- arrange(read.upro, Date)

We can verify the data is ‘oldest to newest’ using the tail() command:

tail(read.upro)
           Date            ProShares.Name Ticker    NAV Prior.NAV NAV.Change....
2015 2017-06-22 ProShares UltraPro S&P500   UPRO 105.33    105.48       -0.14221
2016 2017-06-23 ProShares UltraPro S&P500   UPRO 105.81    105.33        0.45571
2017 2017-06-26 ProShares UltraPro S&P500   UPRO 105.90    105.81        0.08506
2018 2017-06-27 ProShares UltraPro S&P500   UPRO 103.32    105.90       -2.43626
2019 2017-06-28 ProShares UltraPro S&P500   UPRO 106.08    103.32        2.67131
2020 2017-06-29 ProShares UltraPro S&P500   UPRO 103.34    106.08       -2.58296
     NAV.Change.....1 Shares.Outstanding..000. Assets.Under.Management
2015            -0.15                     8450               890038500
2016             0.48                     8450               894094500
2017             0.09                     8650               916035000
2018            -2.58                     8500               878220000
2019             2.76                     8500               901680000
2020            -2.74                     8500               878390000

Voila!

We can then save our data frame to an output .csv file. The following code achieves this using the write.csv() command.

# Write data.frame to .csv file
 write.csv(read.upro,file='C:/your/data/dir/clean_data.csv', row.names = FALSE)

Note: row.names=FALSE is optional. By default, the export to .csv numbers each row 1, 2, 3, 4, 5, 6, etc. If we do not want the row numbers as part of our exported .csv file, we set row.names=FALSE.

In this example our numbers are already in numerical format. However, if we come across a situation where our numbers are not in numerical format, we can also convert them easily with R. I will get into this in the next post, where a particular data set from an ETF provider has numbers in character format. We will work at cleaning that data set up and reiterate some of the commands we used today.

See you next time.

Succinct code below:

# Read UPRO 3x ETF .csv Source: ProShares website
# Use lubridate to convert date format
# Use plyr to sort date
# You can install packages via: install.packages("packageName")
# install.packages(c("lubridate","plyr"))

require(lubridate)
require(plyr)

# Data path
data.dir <- "G:/R Projects"
data.read.upro <- paste(data.dir,"UPRO-historical_nav.csv",sep="/")

# Read data
read.upro <- read.csv(data.read.upro,header=TRUE, sep=",",skip=0,stringsAsFactors=FALSE)

# Convert date column [1] using lubridate package
read.upro$Date <- mdy(read.upro$Date)

# Sort data.frame from 'newest to oldest' to 'oldest to newest'
# Using plyr package
read.upro <- arrange(read.upro, Date)

# Write data.frame to .csv file
write.csv(read.upro,file='G:/R Projects/clean.csv', row.names = FALSE)

## Additional commands used to check and work with the data
head(read.upro) #Print top (head)
head(read.upro,20) #Print top 20 lines
tail(read.upro) #Print bottom (tail)
tail(read.upro,20) #Print bottom 20 lines
print(read.upro) #Print entire data set to R terminal
options(max.print=1000000) #Extend limits of R print
class(read.upro) #Check class
str(read.upro) #Check formats
str(read.upro$Assets.Under.Management) #Checks format of one particular column