Thursday, September 26, 2013

Perform a Function on Each File in R

Sometimes you might have several data files and want to use R to perform the same function across all of them. Or maybe you have multiple files and want to systematically combine them into one file without having to open each file and manually copy the data out.

Fortunately, it's not complicated to use R to systematically iterate across files.

Finding or Choosing the Names of Data Files

There are multiple ways to find or choose the names of the files you want to analyze.

You can explicitly state the file names or you can get R to find any files with a particular extension.

Explicitly Stating File Names

fileNames <- c("sample1.csv", "sample2.csv")

Finding Files with a Specific Extension

In this case, we use Sys.glob from the base package to find all files matching the wildcard pattern "*.csv".

fileNames <- Sys.glob("*.csv")

Iterating Across All Files

We'll start with a loop and then we can add whatever functions we want to the inside of the loop:

for (fileName in fileNames) {

  # read data:
  sample <- read.csv(fileName,
    header = TRUE,
    sep = ",")

  # add more stuff here

}

For example, we could add one to every "Widgets" value in each file and overwrite the old data with the new data:

for (fileName in fileNames) {

  # read old data:
  sample <- read.csv(fileName,
    header = TRUE,
    sep = ",")

  # add one to every Widgets value in this file:
  sample$Widgets <- sample$Widgets + 1
  
  # overwrite old data with new data:
  write.table(sample, 
    fileName,
    append = FALSE,
    quote = FALSE,
    sep = ",",
    row.names = FALSE,
    col.names = TRUE)

}

Or we could do the same thing, but create a new copy of each file:

extension <- "csv"

fileNames <- Sys.glob(paste("*.", extension, sep = ""))

fileNumbers <- seq_along(fileNames)

for (fileNumber in fileNumbers) {

  newFileName <-  paste("new-", 
    sub(paste("\\.", extension, sep = ""), "", fileNames[fileNumber]), 
    ".", extension, sep = "")

  # read old data:
  sample <- read.csv(fileNames[fileNumber],
    header = TRUE,
    sep = ",")

  # add one to every Widgets value in this file:
  sample$Widgets <- sample$Widgets + 1
  
  # write the modified data to the new file:
  write.table(sample, 
    newFileName,
    append = FALSE,
    quote = FALSE,
    sep = ",",
    row.names = FALSE,
    col.names = TRUE)

}

In the above example, we used the paste and sub functions from the base package to automatically create new file names based on the old file names.
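
For instance, here is roughly what that renaming step does to one of the file names from the earlier example, "sample1.csv" (shown on its own just for illustration):

extension <- "csv"
oldFileName <- "sample1.csv"

# strip the ".csv" extension, prepend "new-", and put the extension back:
newFileName <- paste("new-",
  sub(paste("\\.", extension, sep = ""), "", oldFileName),
  ".", extension, sep = "")

newFileName
# [1] "new-sample1.csv"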

Or we could instead use each dataset to create an entirely new dataset, where each row is based on data from one file:


fileNames <- Sys.glob("*.csv")

for (fileName in fileNames) {

  # read original data:
  sample <- read.csv(fileName,
    header = TRUE,
    sep = ",")

  # create new data based on contents of original file:
  allWidgets <- data.frame(
    File = fileName,
    Widgets = sum(sample$Widgets))
  
  # write new data to separate file:
  write.table(allWidgets, 
    "Output/sample-allSamples.csv",
    append = TRUE,
    sep = ",",
    row.names = FALSE,
    col.names = FALSE)

}

In the above example, data.frame is used to create a new data row based on each data file. Then the append option of write.table is set to TRUE so that row can be added to the other rows created from other data files.
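
Two practical notes on this approach: write.table won't create the Output directory for you, and because append = TRUE keeps adding rows, re-running the loop will duplicate them, so delete the output file first if you need a fresh start. Also, since col.names = FALSE, the combined file has no header row, so you can supply the column names when reading it back in. A minimal sketch, assuming the loop above has already run:

# create the output directory before running the loop (no warning if it already exists):
dir.create("Output", showWarnings = FALSE)

# read the combined file back in, supplying the column names ourselves:
allSamples <- read.csv("Output/sample-allSamples.csv",
  header = FALSE,
  col.names = c("File", "Widgets"))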

Those are just a few examples of how you can use R to perform the same function(s) on a large number of files without having to open and edit each one manually. I'm sure you can think of more uses.

All the files are available on GitHub. You can see how eachFile.R, eachfile-newNames.R, and eachFile-append.R each do something different to the sample datasets.

Thursday, September 19, 2013

Truncate by Delimiter in R

Sometimes you only need to analyze part of each value stored in a vector. In this example, we have a data frame of patents, where each patent has been assigned to one or more patent classes. Let's say that we want to analyze the dataset based on only the first patent class listed for each patent.

patents <- data.frame(
  patent = 1:30,
  class = c("405", "33/209", "549/514", "110", "540", "43", 
  "315/327", "540", "536/514", "523/522", "315", 
  "138/248/285", "24", "365", "73/116/137", "73/200", 
  "252/508", "96/261", "327/318", "426/424/512", 
  "75/423", "430", "416", "536/423/530", "381/181", "4", 
  "340/187", "423/75", "360/392/G9B", "524/106/423"))

We can use regular expressions to truncate each element of the vector just before the first "/".

grep, grepl, sub, gsub, regexpr, gregexpr, and regexec are all functions in the base package that allow you to use regular expressions within each element of a character vector. sub and gsub allow you to replace within each element of the vector. sub replaces the first match within each element, while gsub replaces all matches within each element. In this case, we want to remove everything from the first "/" on, and we want to replace it with nothing. Here's how we can use sub to do that:

patents$primaryClass <- sub("/.*", "", patents$class)

> table(patents$primaryClass)

110 138  24 252 315 327  33 340 360 365 381   4 405 416 423 426  43 430 523 524 
  1   1   1   1   2   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
536 540 549  73  75  96 
  2   2   1   2   1   1 
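
To see the difference between sub and gsub, compare what each does to a class string that contains more than one "/" (this is only for illustration; the truncation above needs only sub):

sub("/.*", "", "138/248/285")   # sub removes everything from the first "/" on
# [1] "138"

gsub("/", "-", "138/248/285")   # gsub replaces every "/", not just the first
# [1] "138-248-285"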


--
This post is one part of my series on Text to Columns.


Thursday, September 12, 2013

Only Load Data If Not Already Open in R

I often find it beneficial to check to see whether or not a dataset is already loaded into R at the beginning of a file. This is particularly helpful when I'm dealing with a large file that I don't want to load repeatedly, and when I might be using the same dataset with multiple R scripts or re-running the same script while making changes to the code.

To check to see if an object with that name is already loaded, we can use the exists function from the base package. We can then wrap our read.csv command with an if statement to cause the file to only load if an object with that name is not already loaded.


if(!exists("largeData")) {
  largeData <- read.csv("huge-file.csv",
    header = TRUE)
}
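
One caveat: exists only checks whether an object with that name is in your workspace, not whether it still matches the file on disk. If huge-file.csv changes, remove the object so the next run reloads it:

# force a reload on the next run by removing the cached object:
rm(largeData)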

You will probably also find it useful to use the "colClasses" option of read.csv or read.table to help the file load faster and make sure your data are in the right format. For example:


if(!exists("largeData")) {
  largeData <- read.csv("huge-file.csv",
    header = TRUE,
    colClasses = c("factor", "integer", "character", "integer", 
      "integer", "character"))
}


--
This post is one part of my series on dealing with large datasets.

Thursday, September 5, 2013

Using colClasses to Load Data More Quickly in R

Specifying a colClasses argument to read.table or read.csv can save time on importing data, while also saving you the step of specifying classes for each variable later.

For example, an 893 MB file took 441 seconds to load without colClasses, but only 268 seconds with colClasses. The system.time function in the base package can help you check your own times.

Without specifying colClasses:


   user  system elapsed 
441.224   8.200 454.155 

When specifying colClasses:

   user  system elapsed 
268.036   6.096 284.099 

The classes you can specify are: factor, character, integer, numeric, logical, complex, and Date. Dates in the form %Y-%m-%d or %Y/%m/%d will import correctly; dates in other formats can be read in as character and converted after import, as sketched below.

system.time(largeData <- read.csv("huge-file.csv",
  header = TRUE,
  colClasses = c("character", "character", "complex", 
    "factor", "factor", "character", "integer", 
    "integer", "numeric", "character", "character",
    "Date", "integer", "logical")))

If there aren't any classes that you want to change from their defaults, you can read in the first few rows, let R determine the classes from that sample, and then use those classes to import the full file:

sampleData <- read.csv("huge-file.csv", header = TRUE, nrows = 5)
classes <- sapply(sampleData, class)
largeData <- read.csv("huge-file.csv", header = TRUE, colClasses = classes)
str(largeData)

If you aren't concerned about the time it takes to read the data file, but instead just want the classes to be correct on import, you have the option of only specifying certain classes:


smallData <- read.csv("small-file.csv", 
 header = TRUE,
 colClasses=c("variableName"="character"))

> class(smallData$variableName)
[1] "character"

--
This post is one part of my series on dealing with large datasets.