R – SPY ETF Data Clean Up

In the first post we looked at how to load data in R, view data and convert formats and sort by date. In this post we will work with a data set from the Global Street Advisors which the SPY ETF belongs too. The data is quite awkward as dates are not in typical date format which can easily be read. Dates are stored in Day – Jul – Year format which means we have to parse the month as text. The numerical values are also stored as characters and any arithmetic performed on these columns would result in 0 (within excel). It should also be noted that if this original file is opened in excel, the data is highly merged and awkward to work with.

In this post we will:
1. Load a .xls file into R
2. Convert specific columns to numerical formats
3. Parse date Day – Jul – Year format and convert it to yyyymmdd
4. Change the date order
5. Remove rows using a function

Lets begin!

You can download the SPY ETF data used in this example from their website (google).

In the last post we worked with a .csv file. This time round the data is in .xls format. We can work with .xls in R. For loading the .xls file in R we can use the readxl package.

To install readxl:

install.packages('readxl')
library(readlxl)

Like last time we set out data dir and our file name, this time the file format is .xls.

# Data path
data.dir <- "C:/your/source/dir/here" #remember the forward /
data.file <- paste(data.dir,"SPY_HistoricalNav.xls",sep="/")

We use the readlxl package to read the .xls file as a data.frame using read_excel command:

# Read data
read.spy <- read_excel(data.file1, skip=3)

Now that our data is loaded as a data.frame we can check the formats of our file using str() command:

> str(read.spy)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	3452 obs. of  9 variables:
 $ Date              : chr  "25-Jul-2017" "24-Jul-2017" "21-Jul-2017" "20-Jul-2017" ...
 $ Nav               : chr  "247.364738" "246.64531" "246.901703" "246.991417" ...
 $ Shares Outstanding: chr  "980382116.0" "975232116.0" "961932116.0" "962982116.0" ...
 $ Total Net Assets  : chr  "242511965719.74" "240536427437.52" "237502677431.15" "23784

Note the date is in character format stored as Day-Jul-2017. The month is stored as text versus a traditional numerical value to signify the month. The historical NAV, shares outstanding and total net assets column are stored as characters. If we were to perform any arithmetic on these formats the result would be 0 as adding a character to a character = 0. We therefore proceed to change the Date column to the correct date format and change the other columns to numerical format.

We can convert to numerical format as below with the help of magrittr package.

# Convert Nav `Shares Outstanding` `Total Net Assets` columns to numeric format
# Using magrittr
cols <-c(2:4)
read.spy[,cols] %<>% lapply(function(x) as.numeric(as.character(x)))

If we look at the date format with head(read.spy) you will notice the format is: 25-Jul-2017 – notice the month is written with text, Jul, Jan etc etc… we wish to parse this with the parse_date_time function:

# Convert date from Day - Jan - Year format to readable date format
spy.date <- parse_date_time(x = read.spy$Date,
                            orders = c("d B Y"),
                            locale = "eng")

Notice orders = c(“d B Y”),the B in the middle signifies that the month is written in text form. Thus we can parse the date format and we can then convert to yyyymmdd format with the following line:

# Convert dates to YYYMMDD
cspydate <- as.character(as.Date(spy.date, "%m/%b/%Y"), format = "%Y%m%d")

Notice we placed the parsing of the date format in spy.date variable (yes look back up) and also our date conversion was stored in the cspydate variable. We wish to join the final output yyyymmdd dates stored in the cspydate variable back into our original data frame. We do this as below:

# Make new dataframe with yyyymmdd and original data
new.df.spy <- data.frame(cspydate,read.spy)

Quick note here – if we head(new.df.spy) we will see that the cspydate was added to our original data frame with the column name cspydate. We want to now drop our old original Date column which is stored in column [2]. After its dropped we then rename cspydate to Date as below:

# Drop original Date column from dataframe
final.df.spy <- new.df.spy[,-2]

# Rename Column Name to Date
colnames(final.df.spy)[1] <- "Date"

Next we will change the order of our data

# Sort data.frame from 'newest to oldest' to 'oldest to newest'
# Using plyr package
final.df.spy <- arrange(final.df.spy, Date)

If we tail(final.df.spy) we will see that there is rows with NA values. I want to remove these from the bottom of the data frame using the following function:

# Remove NA rows at bottom of data frame
# Note Requires changing row numbers
removeRows <- function(rowNum, data) {
  newData <- data[-rowNum, , drop = FALSE]
  rownames(newData) <- NULL
  newData
}
final.df.spy <- removeRows(3441:3452, final.df.spy)

Note this requires us to change the row numbers on a daily basis. There is likely a more elegant way to remove rows from the bottom of a data frame so its more robust to new rows of data being added. If know a better way please drop a message in the comment box.

With our data frame now clean, we can then export as .csv

# Write dataframes to csv file
write.csv(final.df.spy,file='C:/R Projects/spy_clean.csv', row.names = FALSE)

In upcoming posts we will touch on working with loops where we can perform the same functions above on multiple files. For example, we can process the above commands on every .xls historical ETF data available from Global Street Advisors. The benefit of the loop is that we reduce the lines of code that is to be written.

The succinct code is below:

#Load in Global Street Advsitors .xls
#Clean Data, convert date, convert to numerical, sort by date, remove NA vales at bottom of data frame
#Export as .csv
#Andrew Bannerman
#7.26.2017

require(readxl)
require(magrittr)
require(plyr)

# Data path
data.dir <- "G:/R Projects"
data.file1 <- paste(data.dir,"SPY_HistoricalNav.xls",sep="/")

# Read data
read.spy <- read_excel(data.file1, skip=3)

# Convert Nav `Shares Outstanding` `Total Net Assets` columns to numeric format
# Using magrittr
cols <-c(2:4)
read.spy[,cols] %<>% lapply(function(x) as.numeric(as.character(x)))

# Convert date from Day - Jan - Year format to readable date format
spy.date <- parse_date_time(x = read.spy$Date,
                            orders = c("d B Y"),
                            locale = "eng")

# Convert dates to YYYMMDD
cspydate <- as.character(as.Date(spy.date, "%m/%b/%Y"), format = "%Y%m%d")

# Make new dataframe with yyyymmdd and original data
new.df.spy <- data.frame(cspydate,read.spy)

# Drop original Date column from dataframe
final.df.spy <- new.df.spy[,-2]

# Rename Column Name to Date
colnames(final.df.spy)[1] <- "Date"

# Sort data.frame from 'newest to oldest' to 'oldest to newest'
# Using plyr package
final.df.spy <- arrange(final.df.spy, Date)

# Remove NA rows at bottom of data frame
# Note Requires changing row numbers
removeRows <- function(rowNum, data) {
  newData <- data[-rowNum, , drop = FALSE]
  rownames(newData) <- NULL
  newData
}
final.df.spy <- removeRows(3441:3452, final.df.spy)

# Write dataframes to csv file
write.csv(final.df.spy,file='C:/R Projects/spy_clean.csv', row.names = FALSE)

Author: Andrew Bannerman

Integrity Inspector. Quantitative Analysis is a favorite past time.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s