R – Load .csv Files Into R. Change Data Formats. Change Date Order. Export as .csv file.

If you are new to R or want to make a move from excel to working with R then these posts and subsequent posts should be useful.

The goal of this post is to walk through step by step loading data into R from a .csv file. Once the data is loaded. We will look at the format of the data and make the correct conversions. This will make our data usable for working with other packages such as ggplot2 for plotting charts or for timeseries  such as ZOO or XTS. We will also change the order of our data by sorting by date. We will then export our worked data to .csv format.

In this post we will:

  1. Load .csv file from local computer
  2. Read .csv file as a data frame
  3. View the data using head, tail and print commands
  4. Check the format of each column within the data frame
  5. Convert date column from ‘character’ to correct ‘Date’ format
  6. Order the date from ”newest to oldest’ to ‘oldest to newest’. Ie newest line of data bottom of data set
  7. Save output to .csv file.

I will assume you have downloaded and installed R and also Rstudio. Rstudio is optional but I find its a nice GUI R client that I like to use.

First we are going to load our data in R. I am going to assume you have some .csv data that perhaps came from subscription data providers or from the internet. If you do not have a .csv file in mind. You may perhaps download UltraPRO ETF historical NAV data for an example file to work with (google).

In this example I will be using the 3x ETF UPRO historical Navs / Assets under management .csv file which I downloaded from their website.

The following code sets a data directory to point to on our hard drive and specifies the .csv file name.

# Data path
data.dir <- "G:/R Projects"
data.read.upro <- paste(data.dir,"UPRO-historical_nav.csv",sep="/")

One note here about R and folder directory’s notice the forward slashes are in this format / versus if you look at a traditional path on your computer it will be the other way around \ such as: C:\data\dir. If you have errors loading data into R, this may be the culprit. Just change your slashes to the / position.

Now I want to read the .csv file. Using the read.csv() command:

# Read data
read.upro <- read.csv(data.read.upro,header=TRUE, sep=",",skip=0,stringsAsFactors=FALSE)

header=TRUE because we have a header. sep=’,’ for comma delimited. The skip=0 tells us that we do not want to skip any lines to read the header. In the UPRO .csv the header is on line 1 so I do not wish to skip any lines in reading the data. If for example we have a file with a header on line (row) 4 . We would set skip=3. This will then read the header on line 4.

stringsAsFactors=FALSE means any strings in the original .csv will be read in as characters. If we do not use this, then our strings would import as factors which then would need to be converted to the correct format.

The read.csv function above reads the data into R as a data frame.

 # Data path
 class(read.upro)
 [1] "data.frame" 

Lets check to see if our data loaded correctly. We view our data in a few ways using head, tail and print. The head views the top portion of our data as below:

head(read.upro)
        Date            ProShares.Name Ticker     NAV Prior.NAV NAV.Change.... NAV.Change.....1
1 2009-06-23 ProShares UltraPro S&P500   UPRO 6.66675   6.66675        0.00000           0.0000
2 2009-06-24 ProShares UltraPro S&P500   UPRO 6.79750   6.66670        1.96199           0.1308
3 2009-06-25 ProShares UltraPro S&P500   UPRO 7.23425   6.79745        6.42594           0.4368
4 2009-06-26 ProShares UltraPro S&P500   UPRO 7.20500   7.23430       -0.40501          -0.0293
5 2009-06-29 ProShares UltraPro S&P500   UPRO 7.40250   7.20500        2.74115           0.1975
6 2009-06-30 ProShares UltraPro S&P500   UPRO 7.21500   7.40250       -2.53293          -0.1875
  Shares.Outstanding..000. Assets.Under.Management
1                  1200.01                 8000180
2                  1200.01                 8157082
3                  1200.01                 8681187
4                  1200.01                 8646086
5                  1200.01                 8883089
6                  3600.01                25974087

This reads the first (top) 6 lines. We can specify to read more lines by:

head(read.upro,12)
         Date            ProShares.Name Ticker     NAV Prior.NAV NAV.Change.... NAV.Change.....1
1  2009-06-23 ProShares UltraPro S&P500   UPRO 6.66675   6.66675        0.00000           0.0000
2  2009-06-24 ProShares UltraPro S&P500   UPRO 6.79750   6.66670        1.96199           0.1308
3  2009-06-25 ProShares UltraPro S&P500   UPRO 7.23425   6.79745        6.42594           0.4368
4  2009-06-26 ProShares UltraPro S&P500   UPRO 7.20500   7.23430       -0.40501          -0.0293
5  2009-06-29 ProShares UltraPro S&P500   UPRO 7.40250   7.20500        2.74115           0.1975
6  2009-06-30 ProShares UltraPro S&P500   UPRO 7.21500   7.40250       -2.53293          -0.1875
7  2009-07-01 ProShares UltraPro S&P500   UPRO 7.31075   7.21495        1.32780           0.0958
8  2009-07-02 ProShares UltraPro S&P500   UPRO 6.67325   7.31075       -8.72004          -0.6375
9  2009-07-06 ProShares UltraPro S&P500   UPRO 6.72250   6.67320        0.73878           0.0493
10 2009-07-07 ProShares UltraPro S&P500   UPRO 6.32325   6.72255       -5.93971          -0.3993
11 2009-07-08 ProShares UltraPro S&P500   UPRO 6.30325   6.32325       -0.31629          -0.0200
12 2009-07-09 ProShares UltraPro S&P500   UPRO 6.36825   6.30325        1.03121           0.0650
   Shares.Outstanding..000. Assets.Under.Management
1                   1200.01                 8000180
2                   1200.01                 8157082
3                   1200.01                 8681187
4                   1200.01                 8646086
5                   1200.01                 8883089
6                   3600.01                25974087
7                   3600.01                26318788
8                   3600.01                24023780
9                   3600.01                24201081
10                  3600.01                22763776
11                  3600.01                22691776
12                  3600.01                22925776

This reads the first (top) 12 lines.

If we want to look at the bottom of the data we can do this:

tail(read.upro)
           Date            ProShares.Name Ticker    NAV Prior.NAV NAV.Change....
2015 2017-06-22 ProShares UltraPro S&P500   UPRO 105.33    105.48       -0.14221
2016 2017-06-23 ProShares UltraPro S&P500   UPRO 105.81    105.33        0.45571
2017 2017-06-26 ProShares UltraPro S&P500   UPRO 105.90    105.81        0.08506
2018 2017-06-27 ProShares UltraPro S&P500   UPRO 103.32    105.90       -2.43626
2019 2017-06-28 ProShares UltraPro S&P500   UPRO 106.08    103.32        2.67131
2020 2017-06-29 ProShares UltraPro S&P500   UPRO 103.34    106.08       -2.58296
     NAV.Change.....1 Shares.Outstanding..000. Assets.Under.Management
2015            -0.15                     8450               890038500
2016             0.48                     8450               894094500
2017             0.09                     8650               916035000
2018            -2.58                     8500               878220000
2019             2.76                     8500               901680000
2020            -2.74                     8500               878390000

This is reading the bottom 6 lines. Again if we want to view more of the bottom lines we can simply:

tail(read.upro,20)

Which will read the bottom 20 lines.

If we want to view the entire data frame we can use print:

print(read.upro)

Note: if the data set, like in this case is very large, it will max out the terminal. The default is about 111 lines.

To extend the max print we can use the following:

options(max.print=1000000)

This extends the limit of the print function.

Now that is general viewing of the data. Next we want to check which formats our data is in. The data we loaded in this example is the 3x ETF UPRO from the UltraPRO etf family. This is considered time series data. Each data point = 1 day. And we have a Date column as well as numerical values.

As you recall from earier we read the .csv file with the read.csv() command. This reads the data into R as a data frame.

We can check the format of our data as below using str() command:

str(read.upro)
'data.frame':	2020 obs. of  9 variables:
 $ Date                    : chr  "06/29/2017" "06/28/2017" "06/27/2017" "06/26/2017" ...
 $ ProShares.Name          : chr  "ProShares UltraPro S&P500" "ProShares UltraPro S&P500" "ProShares UltraPro S&P500" "ProShares UltraPro S&P500" ...
 $ Ticker                  : chr  "UPRO" "UPRO" "UPRO" "UPRO" ...
 $ NAV                     : num  103 106 103 106 106 ...
 $ Prior.NAV               : num  106 103 106 106 105 ...
 $ NAV.Change....          : num  -2.583 2.6713 -2.4363 0.0851 0.4557 ...
 $ NAV.Change.....1        : num  -2.74 2.76 -2.58 0.09 0.48 -0.15 -0.19 -2.18 2.63 0.08 ...
 $ Shares.Outstanding..000.: num  8500 8500 8500 8650 8450 8450 8450 8450 8500 8500 ...
 $ Assets.Under.Management : num  8.78e+08 9.02e+08 8.78e+08 9.16e+08 8.94e+08 ...

What the str() command has told us is:

The data is in a data.frame.

Date column is stored as  character (thanks to stringsAsFactors=FALSE during reading of .csv)

ProShares.Name is stored as character (thanks to stringsAsFactors=FALSE during reading of .csv)

Ticker column is character (thanks to stringsAsFactors=FALSE during reading of .csv)

Our numbers are in numerical format. All good. Apart from date column. If we want to plot our data with ggplot2 (or others) or create a time series object we are going to run into formatting issues with the date being in character format. The easiest way to change the format is to use a package called lubridate.

We can install lubridate with the following command:

install.packages("lubridate")

we then need to load the package into R using:

 library(lubridate)

# Convert date column [1] using lubridate package
read.upro$Date <- mdy(read.upro$Date)

we chose the mdy order because within our data our Date column is ordered by  month – day – year. If we had it in other orders, we may change it to ymd which would cater to a year – month – day order.

Now after running lubridate for date conversion we can check the format of the Date column by:

 str(read.upro$Date)
 Date[1:2020], format: "2017-06-29" "2017-06-28" "2017-06-27" "2017-06-26" "2017-06-23" "2017-06-22" ...

We now have our Date format as Date.

We also notice that the default date ordering from the UltraPRO historical NAV .csv is sorted ‘newest to oldest’. This means the newest data is always at the top of the data set. We can change this to sort ‘oldest to newest’ with the help of the plyr package. Like last time we can install the plyr package with:

install.packages("plyr")

Next we load the package into R with:

 library(plyr) 

Now we can sort the date to ‘oldest to newest’:

# Sort data.frame from 'newest to oldest' to 'oldest to newest'
# Using plyr package
read.upro <- arrange(read.upro, Date)

We can verify the data is ‘oldest to newest’ using the tail() command:

tail(read.upro)
           Date            ProShares.Name Ticker    NAV Prior.NAV NAV.Change....
2015 2017-06-22 ProShares UltraPro S&P500   UPRO 105.33    105.48       -0.14221
2016 2017-06-23 ProShares UltraPro S&P500   UPRO 105.81    105.33        0.45571
2017 2017-06-26 ProShares UltraPro S&P500   UPRO 105.90    105.81        0.08506
2018 2017-06-27 ProShares UltraPro S&P500   UPRO 103.32    105.90       -2.43626
2019 2017-06-28 ProShares UltraPro S&P500   UPRO 106.08    103.32        2.67131
2020 2017-06-29 ProShares UltraPro S&P500   UPRO 103.34    106.08       -2.58296
     NAV.Change.....1 Shares.Outstanding..000. Assets.Under.Management
2015            -0.15                     8450               890038500
2016             0.48                     8450               894094500
2017             0.09                     8650               916035000
2018            -2.58                     8500               878220000
2019             2.76                     8500               901680000
2020            -2.74                     8500               878390000

Voila!

We can then save our data.frame to an output.csv file. The following code achieves this using the write.csv() command.

# Write data.frame to .csv file
 write.csv(read.upro,file='C:/your/data/dir/clean_data.csv', row.names = FALSE)

Note: row.names=FALSE. This is optional. Default export to .csv numbers each row from 1,2,3,4,5,6 etc… If we do not want the row numbers as part of our exported .csv file we set row.names=FALSE.

In this example our numbers are already in numerical format. However, if we came across a situation where our numbers were not in numerical format. We can also convert them easily with R. I will get into this on the next post where a particular data set from an ETF provider has numbers in character format. We will work at cleaning this data set up next and reiterate some the commands we used today.

See you next time.

Succinct code below:

# Read UPRO 3x ETF .csv Source: proshares website
# Use lubridate to convert date format
# Use plyr to sort date
# You can install packages via: install.packages("packageName")
# install.packages(c("lubridate","plyr"))

require(lubridate)
require(plyr)

# Data path
data.dir <- "G:/R Projects"
data.read.upro <- paste(data.dir,"UPRO-historical_nav.csv",sep="/")

# Read data
read.upro <- read.csv(data.read.upro,header=TRUE, sep=",",skip=0,stringsAsFactors=FALSE)

# Convert date column [1] using lubridate package
read.upro$Date <- mdy(read.upro$Date)

# Sort data.frame from 'newest to oldest' to 'oldest to newest'
# Using plyr package
read.upro <- arrange(read.upro, Date)

# Write data.frame to .csv file
write.csv(read.upro,file='G:/R Projects/clean.csv', row.names = FALSE)

## Additional commands used to check and work with the data
head(read.upro) #Print top (head)
head(read.upro,20) #Print top 20 lines
tail(read.upro) #Print bottom (tail)
tail(read.upro,20) #Print bottom 20 lines
print(read.upro) #Print entire data set to R terminal
options(max.print=1000000) #Extend limits of R print
class(read.upro) #Check class
str(read.upro) #Check formats
str(read.upro$Assets.Under.Management) #Checks format of one particular column

Author: Andrew Bannerman

Integrity Inspector. Quantitative Analysis is a favorite past time.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s