Thursday, September 12, 2013

Only Load Data If Not Already Open in R

At the top of a script, I often check whether a dataset is already loaded into R. This is particularly helpful with a large file that I don't want to read repeatedly, whether I'm using the same dataset across multiple R scripts or re-running one script while changing its code.

To check whether an object with a given name already exists, we can use the exists function from the base package. Wrapping the read.csv call in an if statement then makes the file load only when no object with that name exists. Note that exists only checks the name; it does not verify that the object holds the data you expect.


if(!exists("largeData")) {
  largeData <- read.csv("huge-file.csv",
    header = TRUE)
}
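
One caveat worth sketching: by default, exists also searches attached packages, so a name such as df (a function in the stats package) counts as existing even when no dataset has been loaded. The following is a minimal sketch, not part of the original post, that restricts the check to the global environment. The helper name load_once is hypothetical, and a small temporary file stands in for huge-file.csv.

```r
# exists() looks through attached packages too. In a default R session
# this finds the function stats::df, not a data frame you loaded.
exists("df")

# Hypothetical helper (an assumption, not from the original post):
# read a CSV only when the global environment has no object of that name.
load_once <- function(name, path, ...) {
  if (!exists(name, envir = globalenv(), inherits = FALSE)) {
    assign(name, read.csv(path, ...), envir = globalenv())
  }
  invisible(get(name, envir = globalenv()))
}

# A small temporary file stands in for "huge-file.csv".
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, label = c("a", "b", "c")),
          tmp, row.names = FALSE)

load_once("largeData", tmp)  # first call reads the file
unlink(tmp)                  # delete the file to show it is not needed again
load_once("largeData", tmp)  # object already exists, so nothing is read
```

Setting inherits = FALSE keeps the lookup from falling through to attached packages, which matters mostly when the dataset name is short or generic.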

You will probably also find it useful to set the colClasses argument of read.csv or read.table. Declaring each column's type up front helps the file load faster, because R does not have to infer the types, and ensures your data arrive in the right format. For example:


if(!exists("largeData")) {
  largeData <- read.csv("huge-file.csv",
    header = TRUE,
    colClasses = c("factor", "integer", "character", "integer", 
      "integer", "character"))
}
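
If you only care about the types of a few columns, colClasses can also be given as a named vector, and the columns you leave out are converted automatically as usual. Here is a rough sketch under that assumption. The column names id, group, and score are made up for illustration, and a temporary file again stands in for huge-file.csv.

```r
# Build a small stand-in file; the column names are illustrative only.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3,
                     group = c("a", "b", "a"),
                     score = c(1.5, 2, 3.25)),
          tmp, row.names = FALSE)

if (!exists("largeData")) {
  # Only the named columns are forced to a class; "score" is left
  # for read.csv to convert on its own.
  largeData <- read.csv(tmp,
    header = TRUE,
    colClasses = c(id = "integer", group = "factor"))
}

# Inspect the resulting class of each column.
sapply(largeData, class)
```

Naming the columns also keeps the call from depending on column order, which the positional form above does.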


--
This post is one part of my series on dealing with large datasets.

1 comment:

  1. I don't read CSV files directly into my analysis scripts, because there is always a lot of cleaning up to do before the data are usable. For each dataset (i.e., CSV file) I have a "LoadAndClean.R" script that reads in the file and performs the data-cleaning operations. I wrap this script in the same if statement:

    if(!exists("BLAH")) {
      source("~/Directory/LoadAndClean.R")
    }
