Hurst Exponent in R

The Hurst exponent is a statistical measure that indicates whether a time series is mean reverting, trending, or behaving like geometric Brownian motion. Using the Hurst exponent, a time series can be categorized as follows:

Hurst Values < 0.5 = mean reverting

Hurst Value = 0.5 = geometric Brownian motion

Hurst Values > 0.5 = trending

The Hurst exponent falls within a range of 0 to 1, where values closer to 0 signal stronger mean reversion and values closer to 1 signal stronger trending behavior.

Using R, we can calculate the Hurst exponent:

# Hurst Exponent
# Andrew Bannerman
# 8.11.2017

require(lubridate)
require(dplyr)
require(magrittr)
require(zoo)
require(lattice)

# Data path
data.dir <- "G:/R Projects"
output.dir <- "G:/R Projects"
data.read.spx <- paste(data.dir,"SPY.csv",sep="/")

# Read data
read.spx <- read.csv(data.read.spx,header=TRUE, sep=",",skip=0,stringsAsFactors=FALSE)

# Convert Values To Numeric
cols <-c(3:8)
read.spx[,cols] %<>% lapply(function(x) as.numeric(as.character(x)))

# Convert Date Column [1]
read.spx$Date <- ymd(read.spx$Date)

# Make new data frame
new.df <- data.frame(read.spx)

# Subset Date Range
new.df <- subset(new.df, Date >= "2000-01-06" & Date <= "2017-08-06")

#Create lagged variables

lags <- 2:20

# Function for finding differences in lags: today's close minus the close 'n' days ago
getLAG.DIFF <- function(lagdays) {
  function(new.df) {
    c(rep(NA, lagdays), diff(new.df$Close, lag = lagdays, differences = 1))
  }
}
# Create a matrix to put the lagged differences in
lag.diff.matrix <- matrix(nrow=nrow(new.df), ncol=0)

# Loop for filling it
for (i in 2:20) {
  lag.diff.matrix <- cbind(lag.diff.matrix, getLAG.DIFF(i)(new.df))
}

# Rename columns
colnames(lag.diff.matrix) <- sapply(2:20, function(n)paste("lagged.diff.n", n, sep=""))

# Bind to existing dataframe
new.df <-  cbind(new.df, lag.diff.matrix)

# Calculate Variances of 'n period' differences
variance.vec <- apply(new.df[,9:ncol(new.df)], 2, function(x) var(x, na.rm=TRUE))

# Linear regression of log variances vs log lags
log.linear <- lm(formula = log(variance.vec) ~ log(lags))

# Print general linear regression statistics
summary(log.linear)

# Plot log of variance 'n' lags vs log lags
xyplot(log(variance.vec) ~ log(lags),
       main = "SPY Daily Price Differences Variance vs Time Lags",
       xlab = "Log Lags",
       ylab = "Logged Variance 'n' lags",
       grid = TRUE,
       type = c("p", "r"), col.line = "red",
       abline = (h = 0))

hurst.exponent <- coef(log.linear)[2] / 2
hurst.exponent

# Write output to file
write.csv(new.df, file = "G:/R Projects/hurst.csv")

For a little explanation of what is actually going on here:

1. First we compute the lagged difference in close prices for the SPY: today's SPY close minus the close 2 days ago, which gives the price difference over that 2-day lag. We do this for each lag in 2:20, so for lag 3 we take today's SPY close minus the close 3 days ago, and repeat the process through lag 20. This rolls through the entire series, which is evident with head(new.df):

 > head(new.df)
           Date Ticker     Open     High      Low    Close  Volume Open.Interest lagged.diff.n2 lagged.diff.n3 lagged.diff.n4 lagged.diff.n5
1753 2000-01-06    SPY 139.2124 141.0819 137.3430 137.3430 6245656           138             NA             NA             NA             NA
1754 2000-01-07    SPY 139.8979 145.3193 139.6486 145.3193 8090507           146             NA             NA             NA             NA
1755 2000-01-10    SPY 145.8178 146.4410 144.5715 145.8178 5758617           146        8.47488             NA             NA             NA
1756 2000-01-11    SPY 145.3816 145.6932 143.0760 143.8861 7455732           144       -1.43326        6.54310             NA             NA
1757 2000-01-12    SPY 144.1976 144.1976 142.4528 142.6398 6932185           143       -3.17808       -2.67956        5.29680             NA
1758 2000-01-13    SPY 144.0730 145.3193 142.8267 144.5715 5173588           145        0.68547       -1.24631       -0.74779        7.22857

There are leading NAs depending on the lag period used. The calculation then rolls through the series taking the lagged differences.
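As a toy illustration of this padding (using a made-up price vector, not the SPY data), the lag-2 difference lines up with the original series like so:

```r
# Toy price vector to illustrate the lagged-difference padding
x <- c(10, 11, 13, 16, 20, 25)

# Lag-2 difference: each value minus the value 2 positions earlier,
# padded with leading NAs so it aligns with the original series
lag.n <- 2
lagged.diff <- c(rep(NA, lag.n), diff(x, lag = lag.n))
lagged.diff
# NA NA  3  5  7  9
```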

2. After this step we have all of our lagged differences from 2 to 20 (or any other chosen range).

3. For each 'n' lag period we then compute the variance of that particular lagged difference series, over its full length. We can see this by printing the variance vector:

 > variance.vec
 lagged.diff.n2  lagged.diff.n3  lagged.diff.n4  lagged.diff.n5  lagged.diff.n6  lagged.diff.n7  lagged.diff.n8  lagged.diff.n9 lagged.diff.n10
       4.288337        6.065315        7.823918        9.552756       11.155789       12.702647       14.185067       15.724892       17.180618
lagged.diff.n11 lagged.diff.n12 lagged.diff.n13 lagged.diff.n14 lagged.diff.n15 lagged.diff.n16 lagged.diff.n17 lagged.diff.n18 lagged.diff.n19
      18.651980       20.167477       21.854415       23.647368       25.289570       26.751552       28.403188       30.110954       31.620225
lagged.diff.n20
      33.130844

This shows us the variance for each of our lagged differences from 2 to 20.

4. Next we plot the log variance vs the log lags.

# Linear regression of log variances vs log lags
log.linear <- lm(formula = log(variance.vec) ~ log(lags))

# Plot log of variance 'n' lags vs log lags
xyplot(log(variance.vec) ~ log(lags),
       main="SPY Daily Price Differences Variance vs Time Lags",
       xlab = "Log Lags",
       ylab = "Logged Variance 'n' lags",
       grid = TRUE,
       type = c("p","r"),col.line = "red",
       abline=(h = 0))


5. The Hurst exponent is the log(lags) coefficient estimate divided by 2 (the slope / 2).


For the date range "2000-01-06" to "2017-08-06" at our chosen lags of 2:20 days, the SPY Hurst exponent is 0.443483, which is mean reverting.
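As a sanity check of the slope / 2 relationship, we can run the same variance-of-lagged-differences regression on a simulated random walk, whose Hurst exponent should come out near 0.5 (a sketch on synthetic data, not the SPY series):

```r
set.seed(42)
rw <- cumsum(rnorm(10000))   # simulated random walk

# Variance of the 'n'-lag differences for each lag
lags <- 2:20
variance.vec <- sapply(lags, function(n) var(diff(rw, lag = n)))

# Slope of log(variance) vs log(lag), divided by 2
log.linear <- lm(log(variance.vec) ~ log(lags))
coef(log.linear)[2] / 2      # close to 0.5 for a random walk
```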

Another method is to compute a simple Hurst exponent over a rolling 'n' day look-back period.

The calculation for simple Hurst:

# Function For Simple Hurst Exponent
x <- new.df$Close # set x variable

simpleHurst <- function(y){
  sd.y <- sd(y)
  m <- mean(y)
  y <- y - m
  max.y <- max(cumsum(y))
  min.y <- min(cumsum(y))
  RS <- (max.y - min.y)/sd.y
  H <- log(RS) / log(length(y))
  return(H)
}
simpleHurst(x) # Obtain Hurst exponent for entire series
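To get a feel for what simpleHurst returns, we can feed it synthetic series (the function is repeated here so the snippet is self-contained). White-noise increments come out roughly around 0.5, while a random-walk level series reads as strongly trending:

```r
# Simple rescaled-range Hurst estimate, as defined above
simpleHurst <- function(y){
  sd.y <- sd(y)
  m <- mean(y)
  y <- y - m
  max.y <- max(cumsum(y))
  min.y <- min(cumsum(y))
  RS <- (max.y - min.y)/sd.y
  H <- log(RS) / log(length(y))
  return(H)
}

set.seed(7)
simpleHurst(rnorm(2000))          # white-noise increments: roughly 0.5
simpleHurst(cumsum(rnorm(2000)))  # random-walk levels: much closer to 1
```

Note that the result depends on whether the function is handed increments or level series, which is worth keeping in mind when applying it to Close prices.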

We can apply the simpleHurst function over an 'n' day rolling look-back period using rollapply in R; we do this with our getHURST function:

# Hurst Exponent
# Andrew Bannerman
# 8.11.2017

require(lubridate)
require(dplyr)
require(magrittr)
require(zoo)
require(ggplot2)

# Data path
data.dir <- "G:/R Projects"                                # Enter the directory of your S&P 500 data here; use / between folder names, not \
data.read.spx <- paste(data.dir,"SPY.csv",sep="/")

# Read data to read.spx data frame
read.spx <- read.csv(data.read.spx,header=TRUE, sep=",",skip=0,stringsAsFactors=FALSE)

# Make dataframe
new.df <- data.frame(read.spx)

# Convert Values To Numeric
cols <-c(3:8)
new.df[,cols] %<>% lapply(function(x) as.numeric(as.character(x)))

#Convert Date Column [1]
new.df$Date <- ymd(new.df$Date)

# Use for subsetting by date
new.df <- subset(new.df, Date >= "2000-01-06" & Date <= "2017-08-06")    # Change date ranges
#new.df <- subset(new.df, Date >= as.Date("1980-01-01"))                                     # Choose start date to present

# Function For Simple Hurst Exponent
x <- new.df$Close # set x variable

simpleHurst <- function(y){
  sd.y <- sd(y)
  m <- mean(y)
  y <- y - m
  max.y <- max(cumsum(y))
  min.y <- min(cumsum(y))
  RS <- (max.y - min.y)/sd.y
  H <- log(RS) / log(length(y))
  return(H)
}
simpleHurst(x) #Obtain Hurst exponent for entire series

# Calculate rolling Hurst exponent for different 'n' periods
getHURST <- function(rolldays) {
  function(new.df) {
    rollapply(new.df$Close,
              width = rolldays,               # width of rolling window
              FUN = simpleHurst,
              fill = NA,
              align = "right")
  }
}
# Create a matrix to put the roll hurst in
roll.hurst.matrix <- matrix(nrow=nrow(new.df), ncol=0)

# Loop for filling it
for (i in 2:252) {
  roll.hurst.matrix <- cbind(roll.hurst.matrix, getHURST(i)(new.df))
}

# Rename columns
colnames(roll.hurst.matrix) <- sapply(2:252, function(n)paste("roll.hurst.n", n, sep=""))

# Bind to existing dataframe
new.df <-  cbind(new.df, roll.hurst.matrix)

# Line Plot of rolling hurst
ggplot(data=new.df, aes(x = Date)) +
  geom_line(aes(y = roll.hurst.n5), colour = "black") +
labs(title="Hurst Exponent - Rolling 5 Days") +
  labs(x="Date", y="Hurst Exponent")  

# Plot Roll Hurst Histogram
qplot(new.df$roll.hurst.n5,
      geom="histogram",
      binwidth = 0.005,
      main = "Simple Hurst Exponent - Rolling 5 Days",
      fill=I("grey"),
      col=I("black"),
      xlab = "Hurst Exponent")

# Plot S&P500 Close
ggplot(data=new.df, aes(x = Date)) +
  geom_line(aes(y = Close), colour = "darkblue") +
  ylab(label="S&P500 Close") +
  xlab("Date") +
  labs(title="S&P500") 

# Write output to file
write.csv(new.df,file="G:/R Projects/hurst.roll.csv")

This calculates the simple Hurst exponent over an 'n' day look-back period. As we can see from the plotted histograms, over shorter time frames the SPY shows Hurst exponents below 0.50, and as we extend to longer time frames the SPY fits the trending category, with Hurst values closer to 1.

[Histograms of the rolling simple Hurst exponent for 3-day, 5-day, 10-day, 6-month and 1-year windows]

Using this information we may design (fit) a model which captures the nature of the series under examination. In this case it would make sense to build mean reversion models on short time periods for SPY and develop trending or momentum models for the longer time frames.

References
Algorithmic Trading: Winning Strategies and Their Rationale – May 28, 2013, by Ernie Chan

R – SPY ETF Data Clean Up

In the first post we looked at how to load data into R, view data, convert formats and sort by date. In this post we will work with a data set from State Street Global Advisors, the provider of the SPY ETF. The data is quite awkward, as dates are not stored in a typical date format which can easily be read: they are stored in Day-Jul-Year format, which means we have to parse the month as text. The numerical values are also stored as characters, and any arithmetic performed on these columns would result in 0 (within Excel). It should also be noted that if the original file is opened in Excel, the data is heavily merged and awkward to work with.

In this post we will:
1. Load a .xls file into R
2. Convert specific columns to numerical formats
3. Parse date Day – Jul – Year format and convert it to yyyymmdd
4. Change the date order
5. Remove rows using a function

Lets begin!

You can download the SPY ETF data used in this example from their website (google).

In the last post we worked with a .csv file. This time around the data is in .xls format. We can work with .xls files in R; to load them we can use the readxl package.

To install readxl:

install.packages('readxl')
library(readxl)

Like last time we set our data dir and our file name; this time the file format is .xls.

# Data path
data.dir <- "C:/your/source/dir/here" #remember the forward /
data.file <- paste(data.dir,"SPY_HistoricalNav.xls",sep="/")

We use the readxl package to read the .xls file into a data.frame using the read_excel function:

# Read data
read.spy <- read_excel(data.file, skip=3)

Now that our data is loaded as a data.frame we can check the formats of our file using str() command:

> str(read.spy)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	3452 obs. of  9 variables:
 $ Date              : chr  "25-Jul-2017" "24-Jul-2017" "21-Jul-2017" "20-Jul-2017" ...
 $ Nav               : chr  "247.364738" "246.64531" "246.901703" "246.991417" ...
 $ Shares Outstanding: chr  "980382116.0" "975232116.0" "961932116.0" "962982116.0" ...
 $ Total Net Assets  : chr  "242511965719.74" "240536427437.52" "237502677431.15" "23784

Note the date is in character format, stored as Day-Jul-2017: the month is stored as text rather than a traditional numerical value. The historical NAV, shares outstanding and total net assets columns are also stored as characters. Any arithmetic on these columns will fail while they are characters (in R, adding a character to a character throws an error). We therefore proceed to change the Date column to the correct date format and change the other columns to numerical format.

We can convert to numerical format as below with the help of the magrittr package.

# Convert Nav `Shares Outstanding` `Total Net Assets` columns to numeric format
# Using magrittr
cols <-c(2:4)
read.spy[,cols] %<>% lapply(function(x) as.numeric(as.character(x)))
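A minimal sketch of what this conversion does, on a toy character column (made-up values, not the actual file):

```r
# Toy data frame with a numeric column stored as character
df <- data.frame(Nav = c("247.36", "246.65"), stringsAsFactors = FALSE)

# Convert to numeric so arithmetic works
df$Nav <- as.numeric(df$Nav)
mean(df$Nav)   # 247.005
```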

If we look at the date format with head(read.spy) you will notice the format is 25-Jul-2017: the month is written as text (Jul, Jan, etc.). We parse this with the parse_date_time function:

# Convert date from Day - Jan - Year format to readable date format
spy.date <- parse_date_time(x = read.spy$Date,
                            orders = c("d B Y"),
                            locale = "eng")

Notice orders = c("d B Y"): the B in the middle signifies that the month is written in text form. Thus we can parse the date format, and we can then convert to yyyymmdd format with the following line:

# Convert dates to YYYYMMDD
cspydate <- as.character(as.Date(spy.date), format = "%Y%m%d")
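The same text-month parse can be sketched in base R with as.Date, where %b matches a month name such as Jul (this assumes an English locale):

```r
# Parse a Day-Month-Year string with a text month, then reformat as yyyymmdd
d <- as.Date("25-Jul-2017", format = "%d-%b-%Y")
format(d, "%Y%m%d")   # "20170725"
```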

Notice we stored the parsed dates in the spy.date variable (yes, look back up) and the converted dates in the cspydate variable. We now wish to join the final yyyymmdd dates stored in cspydate back into our original data frame. We do this as below:

# Make new dataframe with yyyymmdd and original data
new.df.spy <- data.frame(cspydate,read.spy)

Quick note here: if we head(new.df.spy) we will see that cspydate was added to our original data frame under the column name cspydate. We now want to drop the old original Date column, which is stored in column [2]. After it is dropped we rename cspydate to Date as below:

# Drop original Date column from dataframe
final.df.spy <- new.df.spy[,-2]

# Rename Column Name to Date
colnames(final.df.spy)[1] <- "Date"

Next we will change the order of our data

# Sort data.frame from 'newest to oldest' to 'oldest to newest'
# Using plyr package
final.df.spy <- arrange(final.df.spy, Date)

If we tail(final.df.spy) we will see that there are rows with NA values. We want to remove these from the bottom of the data frame using the following function:

# Remove NA rows at bottom of data frame
# Note Requires changing row numbers
removeRows <- function(rowNum, data) {
  newData <- data[-rowNum, , drop = FALSE]
  rownames(newData) <- NULL
  newData
}
final.df.spy <- removeRows(3441:3452, final.df.spy)

Note this requires us to change the row numbers whenever new rows of data are added. There is likely a more elegant way to remove rows from the bottom of a data frame so that it is robust to new data being added; if you know a better way, please drop a message in the comment box.
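One more robust alternative (a sketch, not from the original post) is complete.cases, which drops every row containing an NA without hard-coded row numbers:

```r
# Toy data frame with trailing NA rows, mimicking the bottom of the file
df <- data.frame(Date = c("20170627", "20170628", NA, NA),
                 Nav  = c(103.32, 106.08, NA, NA),
                 stringsAsFactors = FALSE)

# Keep only rows with no missing values
df.clean <- df[complete.cases(df), ]
nrow(df.clean)   # 2
```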

With our data frame now clean, we can then export as .csv

# Write dataframes to csv file
write.csv(final.df.spy,file='C:/R Projects/spy_clean.csv', row.names = FALSE)

In upcoming posts we will touch on working with loops, where we can perform the same functions above on multiple files. For example, we can run the above commands on every .xls historical ETF data file available from State Street Global Advisors. The benefit of the loop is that it reduces the lines of code that have to be written.

The succinct code is below:

#Load in State Street Global Advisors .xls
#Clean data, convert date, convert to numeric, sort by date, remove NA values at bottom of data frame
#Export as .csv
#Andrew Bannerman
#7.26.2017

require(readxl)
require(magrittr)
require(plyr)

# Data path
data.dir <- "G:/R Projects"
data.file1 <- paste(data.dir,"SPY_HistoricalNav.xls",sep="/")

# Read data
read.spy <- read_excel(data.file1, skip=3)

# Convert Nav `Shares Outstanding` `Total Net Assets` columns to numeric format
# Using magrittr
cols <-c(2:4)
read.spy[,cols] %<>% lapply(function(x) as.numeric(as.character(x)))

# Convert date from Day - Jan - Year format to readable date format
spy.date <- parse_date_time(x = read.spy$Date,
                            orders = c("d B Y"),
                            locale = "eng")

# Convert dates to YYYYMMDD
cspydate <- as.character(as.Date(spy.date), format = "%Y%m%d")

# Make new dataframe with yyyymmdd and original data
new.df.spy <- data.frame(cspydate,read.spy)

# Drop original Date column from dataframe
final.df.spy <- new.df.spy[,-2]

# Rename Column Name to Date
colnames(final.df.spy)[1] <- "Date"

# Sort data.frame from 'newest to oldest' to 'oldest to newest'
# Using plyr package
final.df.spy <- arrange(final.df.spy, Date)

# Remove NA rows at bottom of data frame
# Note Requires changing row numbers
removeRows <- function(rowNum, data) {
  newData <- data[-rowNum, , drop = FALSE]
  rownames(newData) <- NULL
  newData
}
final.df.spy <- removeRows(3441:3452, final.df.spy)

# Write dataframes to csv file
write.csv(final.df.spy,file='C:/R Projects/spy_clean.csv', row.names = FALSE)

R – Load .csv Files Into R. Change Data Formats. Change Date Order. Export as .csv file.

If you are new to R, or want to make the move from Excel to working with R, then this post and subsequent posts should be useful.

The goal of this post is to walk step by step through loading data into R from a .csv file. Once the data is loaded, we will look at its format and make the correct conversions. This will make our data usable with other packages such as ggplot2 for plotting charts, or time series packages such as zoo and xts. We will also change the order of our data by sorting by date. We will then export our worked data to .csv format.

In this post we will:

  1. Load .csv file from local computer
  2. Read .csv file as a data frame
  3. View the data using head, tail and print commands
  4. Check the format of each column within the data frame
  5. Convert date column from ‘character’ to correct ‘Date’ format
  6. Order the data from 'newest to oldest' to 'oldest to newest', i.e. the newest line of data at the bottom of the data set
  7. Save output to .csv file.

I will assume you have downloaded and installed R and also RStudio. RStudio is optional, but I find it a nice R GUI client that I like to use.

First we are going to load our data into R. I am going to assume you have some .csv data that perhaps came from a subscription data provider or from the internet. If you do not have a .csv file in mind, you may download the ProShares UltraPro ETF historical NAV data as an example file to work with (google).

In this example I will be using the historical NAVs / assets under management .csv file for the 3x ETF UPRO, which I downloaded from the ProShares website.

The following code sets a data directory to point to on our hard drive and specifies the .csv file name.

# Data path
data.dir <- "G:/R Projects"
data.read.upro <- paste(data.dir,"UPRO-historical_nav.csv",sep="/")

One note here about R and folder directories: notice the slashes are forward slashes (/), whereas a traditional path on your computer uses backslashes, such as C:\data\dir. If you have errors loading data into R, this may be the culprit; just change your slashes to /.

Now I want to read the .csv file. Using the read.csv() command:

# Read data
read.upro <- read.csv(data.read.upro,header=TRUE, sep=",",skip=0,stringsAsFactors=FALSE)

header=TRUE because we have a header; sep=',' for comma delimited. skip=0 tells R not to skip any lines before reading the header. In the UPRO .csv the header is on line 1, so I do not wish to skip any lines when reading the data. If, for example, we had a file with the header on line (row) 4, we would set skip=3, which would then read the header on line 4.

stringsAsFactors=FALSE means any strings in the original .csv will be read in as characters. Without it (the default in R versions before 4.0), strings would import as factors, which would then need to be converted to the correct format.

The read.csv function above reads the data into R as a data frame.

 # Check the class
 class(read.upro)
 [1] "data.frame"

Let's check to see if our data loaded correctly. We can view our data in a few ways using head, tail and print. head views the top portion of our data, as below:

head(read.upro)
        Date            ProShares.Name Ticker     NAV Prior.NAV NAV.Change.... NAV.Change.....1
1 2009-06-23 ProShares UltraPro S&P500   UPRO 6.66675   6.66675        0.00000           0.0000
2 2009-06-24 ProShares UltraPro S&P500   UPRO 6.79750   6.66670        1.96199           0.1308
3 2009-06-25 ProShares UltraPro S&P500   UPRO 7.23425   6.79745        6.42594           0.4368
4 2009-06-26 ProShares UltraPro S&P500   UPRO 7.20500   7.23430       -0.40501          -0.0293
5 2009-06-29 ProShares UltraPro S&P500   UPRO 7.40250   7.20500        2.74115           0.1975
6 2009-06-30 ProShares UltraPro S&P500   UPRO 7.21500   7.40250       -2.53293          -0.1875
  Shares.Outstanding..000. Assets.Under.Management
1                  1200.01                 8000180
2                  1200.01                 8157082
3                  1200.01                 8681187
4                  1200.01                 8646086
5                  1200.01                 8883089
6                  3600.01                25974087

This reads the first (top) 6 lines. We can specify to read more lines by:

head(read.upro,12)
         Date            ProShares.Name Ticker     NAV Prior.NAV NAV.Change.... NAV.Change.....1
1  2009-06-23 ProShares UltraPro S&P500   UPRO 6.66675   6.66675        0.00000           0.0000
2  2009-06-24 ProShares UltraPro S&P500   UPRO 6.79750   6.66670        1.96199           0.1308
3  2009-06-25 ProShares UltraPro S&P500   UPRO 7.23425   6.79745        6.42594           0.4368
4  2009-06-26 ProShares UltraPro S&P500   UPRO 7.20500   7.23430       -0.40501          -0.0293
5  2009-06-29 ProShares UltraPro S&P500   UPRO 7.40250   7.20500        2.74115           0.1975
6  2009-06-30 ProShares UltraPro S&P500   UPRO 7.21500   7.40250       -2.53293          -0.1875
7  2009-07-01 ProShares UltraPro S&P500   UPRO 7.31075   7.21495        1.32780           0.0958
8  2009-07-02 ProShares UltraPro S&P500   UPRO 6.67325   7.31075       -8.72004          -0.6375
9  2009-07-06 ProShares UltraPro S&P500   UPRO 6.72250   6.67320        0.73878           0.0493
10 2009-07-07 ProShares UltraPro S&P500   UPRO 6.32325   6.72255       -5.93971          -0.3993
11 2009-07-08 ProShares UltraPro S&P500   UPRO 6.30325   6.32325       -0.31629          -0.0200
12 2009-07-09 ProShares UltraPro S&P500   UPRO 6.36825   6.30325        1.03121           0.0650
   Shares.Outstanding..000. Assets.Under.Management
1                   1200.01                 8000180
2                   1200.01                 8157082
3                   1200.01                 8681187
4                   1200.01                 8646086
5                   1200.01                 8883089
6                   3600.01                25974087
7                   3600.01                26318788
8                   3600.01                24023780
9                   3600.01                24201081
10                  3600.01                22763776
11                  3600.01                22691776
12                  3600.01                22925776

This reads the first (top) 12 lines.

If we want to look at the bottom of the data we can do this:

tail(read.upro)
           Date            ProShares.Name Ticker    NAV Prior.NAV NAV.Change....
2015 2017-06-22 ProShares UltraPro S&P500   UPRO 105.33    105.48       -0.14221
2016 2017-06-23 ProShares UltraPro S&P500   UPRO 105.81    105.33        0.45571
2017 2017-06-26 ProShares UltraPro S&P500   UPRO 105.90    105.81        0.08506
2018 2017-06-27 ProShares UltraPro S&P500   UPRO 103.32    105.90       -2.43626
2019 2017-06-28 ProShares UltraPro S&P500   UPRO 106.08    103.32        2.67131
2020 2017-06-29 ProShares UltraPro S&P500   UPRO 103.34    106.08       -2.58296
     NAV.Change.....1 Shares.Outstanding..000. Assets.Under.Management
2015            -0.15                     8450               890038500
2016             0.48                     8450               894094500
2017             0.09                     8650               916035000
2018            -2.58                     8500               878220000
2019             2.76                     8500               901680000
2020            -2.74                     8500               878390000

This is reading the bottom 6 lines. Again if we want to view more of the bottom lines we can simply:

tail(read.upro,20)

Which will read the bottom 20 lines.

If we want to view the entire data frame we can use print:

print(read.upro)

Note: if the data set is very large, as in this case, print will flood the terminal and R will truncate the output once it reaches the max.print limit.

To extend the max print we can use the following:

options(max.print=1000000)

This extends the limit of the print function.

That covers general viewing of the data. Next we want to check which formats our data is in. The data we loaded in this example is the 3x ETF UPRO from the ProShares UltraPro family. This is time series data: each data point represents one day, and we have a Date column as well as numerical values.

As you recall from earlier, we read the .csv file with the read.csv() command. This reads the data into R as a data frame.

We can check the format of our data as below using str() command:

str(read.upro)
'data.frame':	2020 obs. of  9 variables:
 $ Date                    : chr  "06/29/2017" "06/28/2017" "06/27/2017" "06/26/2017" ...
 $ ProShares.Name          : chr  "ProShares UltraPro S&P500" "ProShares UltraPro S&P500" "ProShares UltraPro S&P500" "ProShares UltraPro S&P500" ...
 $ Ticker                  : chr  "UPRO" "UPRO" "UPRO" "UPRO" ...
 $ NAV                     : num  103 106 103 106 106 ...
 $ Prior.NAV               : num  106 103 106 106 105 ...
 $ NAV.Change....          : num  -2.583 2.6713 -2.4363 0.0851 0.4557 ...
 $ NAV.Change.....1        : num  -2.74 2.76 -2.58 0.09 0.48 -0.15 -0.19 -2.18 2.63 0.08 ...
 $ Shares.Outstanding..000.: num  8500 8500 8500 8650 8450 8450 8450 8450 8500 8500 ...
 $ Assets.Under.Management : num  8.78e+08 9.02e+08 8.78e+08 9.16e+08 8.94e+08 ...

What the str() command has told us is:

The data is in a data.frame.

Date column is stored as  character (thanks to stringsAsFactors=FALSE during reading of .csv)

ProShares.Name is stored as character (thanks to stringsAsFactors=FALSE during reading of .csv)

Ticker column is character (thanks to stringsAsFactors=FALSE during reading of .csv)

Our numbers are in numerical format; all good, apart from the Date column. If we want to plot our data with ggplot2 (or other packages) or create a time series object, we are going to run into formatting issues with the date being in character format. The easiest way to change the format is to use a package called lubridate.

We can install lubridate with the following command:

install.packages("lubridate")

We then need to load the package into R using:

 library(lubridate)

# Convert date column [1] using lubridate package
read.upro$Date <- mdy(read.upro$Date)

We chose the mdy order because within our data the Date column is ordered month - day - year. If it were in another order, we would change the function accordingly; ymd, for example, caters to a year - month - day order.
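A quick contrast of the two orders (both calls parse to the same Date object):

```r
library(lubridate)

mdy("06/29/2017")  # month-day-year input
ymd("2017-06-29")  # year-month-day input
# both return the Date 2017-06-29
```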

Now after running lubridate for date conversion we can check the format of the Date column by:

 str(read.upro$Date)
 Date[1:2020], format: "2017-06-29" "2017-06-28" "2017-06-27" "2017-06-26" "2017-06-23" "2017-06-22" ...

We now have our Date format as Date.

We also notice that the default date ordering from the ProShares historical NAV .csv is 'newest to oldest': the newest data is always at the top of the data set. We can change this to 'oldest to newest' with the help of the plyr package. Like last time, we can install the plyr package with:

install.packages("plyr")

Next we load the package into R with:

 library(plyr) 

Now we can sort the date to ‘oldest to newest’:

# Sort data.frame from 'newest to oldest' to 'oldest to newest'
# Using plyr package
read.upro <- arrange(read.upro, Date)

We can verify the data is ‘oldest to newest’ using the tail() command:

tail(read.upro)
           Date            ProShares.Name Ticker    NAV Prior.NAV NAV.Change....
2015 2017-06-22 ProShares UltraPro S&P500   UPRO 105.33    105.48       -0.14221
2016 2017-06-23 ProShares UltraPro S&P500   UPRO 105.81    105.33        0.45571
2017 2017-06-26 ProShares UltraPro S&P500   UPRO 105.90    105.81        0.08506
2018 2017-06-27 ProShares UltraPro S&P500   UPRO 103.32    105.90       -2.43626
2019 2017-06-28 ProShares UltraPro S&P500   UPRO 106.08    103.32        2.67131
2020 2017-06-29 ProShares UltraPro S&P500   UPRO 103.34    106.08       -2.58296
     NAV.Change.....1 Shares.Outstanding..000. Assets.Under.Management
2015            -0.15                     8450               890038500
2016             0.48                     8450               894094500
2017             0.09                     8650               916035000
2018            -2.58                     8500               878220000
2019             2.76                     8500               901680000
2020            -2.74                     8500               878390000

Voila!

We can then save our data.frame to an output.csv file. The following code achieves this using the write.csv() command.

# Write data.frame to .csv file
 write.csv(read.upro,file='C:/your/data/dir/clean_data.csv', row.names = FALSE)

Note: row.names=FALSE. This is optional. Default export to .csv numbers each row from 1,2,3,4,5,6 etc… If we do not want the row numbers as part of our exported .csv file we set row.names=FALSE.

In this example our numbers are already in numerical format. However, if we come across a situation where our numbers are not numeric, we can also convert them easily in R. I will get into this in the next post, where a particular data set from an ETF provider has numbers in character format. We will work at cleaning that data set up next and reiterate some of the commands we used today.

See you next time.

Succinct code below:

# Read UPRO 3x ETF .csv Source: proshares website
# Use lubridate to convert date format
# Use plyr to sort date
# You can install packages via: install.packages("packageName")
# install.packages(c("lubridate","plyr"))

require(lubridate)
require(plyr)

# Data path
data.dir <- "G:/R Projects"
data.read.upro <- paste(data.dir,"UPRO-historical_nav.csv",sep="/")

# Read data
read.upro <- read.csv(data.read.upro,header=TRUE, sep=",",skip=0,stringsAsFactors=FALSE)

# Convert date column [1] using lubridate package
read.upro$Date <- mdy(read.upro$Date)

# Sort data.frame from 'newest to oldest' to 'oldest to newest'
# Using plyr package
read.upro <- arrange(read.upro, Date)

# Write data.frame to .csv file
write.csv(read.upro,file='G:/R Projects/clean.csv', row.names = FALSE)

## Additional commands used to check and work with the data
head(read.upro) #Print top (head)
head(read.upro,20) #Print top 20 lines
tail(read.upro) #Print bottom (tail)
tail(read.upro,20) #Print bottom 20 lines
print(read.upro) #Print entire data set to R terminal
options(max.print=1000000) #Extend limits of R print
class(read.upro) #Check class
str(read.upro) #Check formats
str(read.upro$Assets.Under.Management) #Checks format of one particular column