Maps and Apps

Thursday, December 19, 2013

Make All Polygons the Same Shade in Leaflet

The Quick Start tutorial shows us how to change the color of a polygon:

var circle = L.circle([51.508, -0.11], 500, {
    color: 'red',
    fillColor: '#f03',
    fillOpacity: 0.5
}).addTo(map);

But what if we want to change the color and style of all (or a set of the) polygons?

First we can define the style:

var defaultStyle = {
  color: 'green',
  fillOpacity: 0.2
};

And then we can just add that style to our polygons:

var polygon = L.polygon([
    [51.509, -0.08],
    [51.503, -0.06],
    [51.51, -0.047]
]).setStyle(defaultStyle).addTo(map);

var circle = L.circle([51.508, -0.11], 500).setStyle(defaultStyle).addTo(map);

The full code (based on the Quick Start tutorial) is available in a gist.

Thursday, December 12, 2013

Add Spaces Between Citations in LaTeX

I had an issue where LaTeX wasn't adding spaces between references when I had multiple references in the same place.

So this:

of their own \cite{moretti:2012,saxenian:1996,casper:2007}.

was compiling as this:

of their own [Moretti, 2012,Saxenian, 1996,Casper, 2007].

To fix the problem, I simply added the space option for the cite package.

\usepackage[space]{cite}

And now it looks as it should:

of their own [Moretti, 2012, Saxenian, 1996, Casper, 2007].

If you have spaces and don't want them, you can instead use the nospace option to remove the space between each citation.

References

http://texdoc.net/texmf-dist/doc/latex/cite/cite.pdf

Thursday, December 5, 2013

Check if a Variable Exists in R

If you use attach, it is easy to tell if a variable exists. You can simply use exists to check:

> attach(df)
> exists("varName")
[1] TRUE

However, if you don't use attach (and I find you generally don't want to), this simple solution doesn't work.

> detach(df)
> exists("df$varName")
[1] FALSE

Instead of using exists, you can use %in% or any from the base package to determine if a variable is defined in a data frame:

> "varName" %in% names(df)
[1] TRUE
> any(names(df) == "varName")
[1] TRUE

Or to determine if a variable is defined in a matrix:

> "varName" %in% colnames(df)
[1] TRUE
> any(colnames(df) == "varName")
[1] TRUE
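
As a minimal sketch of the checks above, the helper below wraps the colnames test so the same call works for a data frame or a matrix. The helper name hasVariable and the sample objects are hypothetical, made up only for illustration.

```r
# hypothetical sample objects, made up for this sketch
df <- data.frame(varName = 1:3, other = c("a", "b", "c"))
m <- matrix(1:6, nrow = 3,
  dimnames = list(NULL, c("varName", "other")))

# hypothetical helper: TRUE if `name` is a column of `x`
# (colnames works for both data frames and matrices)
hasVariable <- function(x, name) {
  name %in% colnames(x)
}

hasVariable(df, "varName")  # column exists in the data frame
hasVariable(m, "varName")   # column exists in the matrix
hasVariable(df, "missing")  # no such column
```

Using colnames rather than names keeps one code path for both object types, since names returns NULL for a matrix.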

Thursday, October 24, 2013

Table as an Image in R

Usually, it's best to keep tables as text, but if you're making a lot of graphics, it can be helpful to be able to create images of tables.

PNG table

Creating the Table

After loading the data, let's first use this trick to put line breaks between the levels of the effect variable. Depending on your data, you may or may not need or want to do this.

library(OIdata)
data(birds)
library(gridExtra)

# line breaks between words for levels of birds$effect:
levels(birds$effect) <- gsub(" ", "\n", levels(birds$effect))

Next let's make our table:

xyTable <- table(birds$sky, birds$effect)

Now we can create an empty plot, center our table in it, and use the grid.table function from the gridExtra package to display the table and choose a font size.

plot.new()
grid.table(xyTable,
  # change font sizes:
  gpar.coltext = gpar(cex = 1.2),
  gpar.rowtext = gpar(cex = 1.2))

Now you can view and save the image just like any other plot.

The code is available in a gist.

Thursday, October 17, 2013

Line Breaks Between Words in Axis Labels in ggplot in R

Sometimes when plotting factor variables in R, the graphics can look pretty messy thanks to long factor levels. If the level attributes have multiple words, there is an easy fix to this that often makes the axis labels look much cleaner.

Without Line Breaks

Here's the messy looking example:

No line breaks in axis labels

And here's the code for the messy looking example:

library(OIdata)
data(birds)
library(ggplot2)

ggplot(birds,
  aes(x = effect,
    y = speed)) +
geom_boxplot()

With Line Breaks

We can use regular expressions to add line breaks to the factor levels by substituting any spaces with line breaks:

library(OIdata)
data(birds)
library(ggplot2)

levels(birds$effect) <- gsub(" ", "\n", levels(birds$effect))
ggplot(birds,
  aes(x = effect,
    y = speed)) +
geom_boxplot()

Line breaks in axis labels

Just one line made the plot look much better, and it will carry over to other plots you make as well. For example, you could create a table with the same variable.
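
To see what that one line does without loading any data, here is a small sketch on a made-up factor. The level names are hypothetical and are not the actual levels of birds$effect.

```r
# hypothetical factor with multi-word levels
effect <- factor(c("No damage", "Caused damage", "No damage"))

# replace every space in the level names with a line break
levels(effect) <- gsub(" ", "\n", levels(effect))

# the values keep their meaning; only the labels change
levels(effect)
table(effect)
```

Because only the levels are rewritten, every plot or table built from the factor afterwards picks up the new labels.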

Horizontal Boxes

Here we can see the difference in a box plot with horizontal boxes. It's up to you to decide which style looks better:

No line breaks in axis labels

Line breaks in axis labels

library(OIdata)
data(birds)
library(ggplot2)

levels(birds$effect) <- gsub(" ", "\n", levels(birds$effect))
ggplot(birds,
  aes(x = effect,
    y = speed)) +
geom_boxplot() + 
coord_flip()

Just a note: if you're not using ggplot, the multi-line axis labels might overflow into the graph.

The code is available in a gist.

Citations and Further Reading

http://regex.learncodethehardway.org/

Thursday, October 10, 2013

Custom Legend in R

This particular custom legend was designed with three purposes:

To effectively bin values based on a theoretical minimum and maximum value for that variable (e.g. -1 and 1 or 0 and 100)
To use a different interval notation than the default
To handle NA values

Even though this particular legend was designed with those needs, it should be simple to extrapolate from that to build legends based on other criteria.

Standard Legend

For this post, I'll be assuming you've looked through the Oregon map tutorial or have other experience making legends in R. If not, you'll probably want to check that link out. It's an awesome tutorial.

Let's start by creating a map with a standard legend, and then move on to customization.

First, we'll load the packages we need and the data from OIdata:

library(OIdata)
library(maps) # provides the map function used below
library(RColorBrewer)
library(classInt)

# load state data from OIdata package:
data(state)

Next we want to set some constants. This will save us a bunch of typing and will make the code easier to read, especially once we start creating a custom legend. Also, it will allow us to easily change the values if we want a different number of bins or a different min and max.

In this example, we're assuming we have a theoretical minimum and maximum and want to determine our choropleth bins based on that.

nclr <- 8 # number of bins
min <- 0 # theoretical minimum
max <- 100 # theoretical maximum
breaks <- (max - min) / nclr
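
As a rough sketch of what these constants imply, the snippet below builds the bin edges and assigns a few made-up values to bins. It uses base R's findInterval as a stand-in, so the treatment of values that fall exactly on an edge is an assumption and may differ from classInt. The variable names are hypothetical, chosen to avoid masking the base min and max functions.

```r
nBins <- 8       # number of bins
lower <- 0       # theoretical minimum
upper <- 100     # theoretical maximum
binWidth <- (upper - lower) / nBins

# nBins + 1 edges: 0, 12.5, 25, ..., 100
edges <- seq(lower, upper, binWidth)

# hypothetical values, including both extremes and an NA
values <- c(0, 12.5, 50, 99.9, 100, NA)

# bin index for each value; the last bin is closed on the right
# so that the theoretical maximum lands in bin 8 rather than 9
bin <- findInterval(values, edges, rightmost.closed = TRUE)
bin
```

The NA stays NA, which is why the custom legend later needs a separate entry for it.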

Next, we'll set up our choropleth colors (this should look familiar from the Oregon tutorial):

# set up colors:
plotclr <- brewer.pal(nclr, "Oranges")
plotvar <- state$coal
class <- classIntervals(plotvar,
  nclr,
  style = "fixed",
  fixedBreaks = seq(min, max, breaks))
colcode <- findColours(class, 
  plotclr)

And now let's map the data:

# map data:
map("state", # base
  col = "gray80",
  fill = TRUE,
  lty = 0)
map("state", # data
  col = colcode,
  fill = TRUE,
  lty = 0,
  add = TRUE)
map("state", # border
  col = "gray",
  lwd = 1.4,
  lty = 1,
  add = TRUE)

And finally let's add our default legend:

legend("bottomleft", # position
  legend = names(attr(colcode, "table")), 
  title = "Percent",
  fill = attr(colcode, "palette"),
  cex = 0.56,
  bty = "n") # border

Here's the output of this code (see map-standard-legend.R in the gist):

Percent of power coming from coal sources (standard legend)

Custom Legend

Next we want to add a few lines here and there to enhance the legend.

For starters, let's deal with NA values. We don't have any in this particular dataset, but if we did, we would have seen they were left as the base color of the map and not included in the legend.

After our earlier code setting up the colors, we should add the color for NAs. It's important that these lines go after all the other setup code, or the wrong colors will be mapped.

# set up colors:
plotclr <- brewer.pal(nclr, "Oranges")
plotvar <- state$coal
class <- classIntervals(plotvar,
  nclr,
  style = "fixed",
  fixedBreaks = seq(min, max, breaks))
colcode <- findColours(class, 
  plotclr)
NAColor <- "gray80"
plotclr <- c(plotclr, NAColor)

We also want to let the map know to have our NA color as the default color, so the map will use that instead of having those areas be transparent:

# map data:
map("state", # base
  col = NAColor,
  fill = TRUE,
  lty = 0)
map("state", # data
  col = colcode,
  fill = TRUE,
  lty = 0,
  add = TRUE)
map("state", # border
  col = "gray",
  lwd = 1.4,
  lty = 1,
  add = TRUE)

Next, we want to set up the legend text. For all but the last interval, we want it to say i ≤ n < (i + breaks). The last interval should be i ≤ n ≤ (i + breaks). This can be accomplished with:

# set legend text:
legendText <- c()
for(i in seq(min, max - (max - min) / nclr, (max - min) / nclr)) {
  if (i == max(seq(min, max - (max - min) / nclr, (max - min) / nclr))) {
    legendText <- c(legendText, paste(round(i,3), "\u2264 n \u2264", round(i + (max - min) / nclr,3)))
  } else
    legendText <- c(legendText, paste(round(i,3), "\u2264 n <", round(i + (max - min) / nclr,3))) 
}

But we also want to include NAs in the legend, so we need to add a line:

# set legend text:
legendText <- c()
for(i in seq(min, max - (max - min) / nclr, (max - min) / nclr)) {
  if (i == max(seq(min, max - (max - min) / nclr, (max - min) / nclr))) {
    legendText <- c(legendText, paste(round(i,3), "\u2264 n \u2264", round(i + (max - min) / nclr,3)))
    if (!is.na(NAColor)) legendText <- c(legendText, "NA")
  } else
    legendText <- c(legendText, paste(round(i,3), "\u2264 n <", round(i + (max - min) / nclr,3))) 
}
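
The same labels can also be built without a loop. This is only a sketch of an alternative, not the code used for the map above; the function name makeLegendText and its arguments are hypothetical.

```r
# hypothetical helper: vectorized version of the legend text loop
makeLegendText <- function(lower, upper, nBins, includeNA = TRUE) {
  edges <- seq(lower, upper, length.out = nBins + 1)
  from <- round(head(edges, -1), 3)  # lower edge of each bin
  to <- round(tail(edges, -1), 3)    # upper edge of each bin

  # every bin is open on the right except the last one
  upperSign <- c(rep("<", nBins - 1), "\u2264")

  text <- paste(from, "\u2264 n", upperSign, to)
  if (includeNA) text <- c(text, "NA")
  text
}

legendText <- makeLegendText(0, 100, 8)
legendText
```

With 8 bins from 0 to 100 this gives nine entries: eight interval labels plus "NA", matching the nine colors in plotclr.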

And finally we need to add the legend to the map:

legend("bottomleft", # position
  legend = legendText, 
  title = "Percent",
  fill = plotclr,
  cex = 0.56,
  bty = "n") # border

The new map (see map-new-legend.R) meets all three criteria we started with, which the standard legend did not.

Percent of power coming from coal sources (custom legend)

Code is available in a gist.

Thursday, October 3, 2013

How to Set Up Your App on Shiny Beta Hosting

This is a short tutorial on how to create a web version of a Shiny app you've built once you've received access to the Shiny hosting beta.

SSH: Setting Up Directories

To connect to your account, input this line into a terminal (obviously input your own username instead of "username"):

ssh username@spark.rstudio.com

It will prompt you for your password. Enter the password you received in the email.

Next, create a folder to put your first app in. Like the email says, you want to put this at ~/ShinyApps/myapp/. The "myapp" folder can be named whatever you want, but the "ShinyApps" folder needs to have that name.

mkdir ShinyApps
mkdir ShinyApps/myapp

SSH: Installing R Packages

Next, you can go ahead and install R packages, if you know which ones your application will need.

First, open R just like you would on your own computer:

R

Next, install a package (e.g. "maps"):

install.packages("maps", dependencies = TRUE)

It will ask you the following questions:

Would you like to use a personal library instead?  (y/n)
Would you like to create a personal library 
~/R/library 
to install packages into?  (y/n)

You can answer "y" to both of these. Your applications will be able to use packages installed here.

Then you can follow the standard procedure for installing R packages.

Once you've quit R (for example with q()), you can exit ssh:

exit

Copying Files

On your own computer, navigate to the directory your Shiny application's folder is in, e.g.:

cd Shiny

Once you're there, you can copy the application folder to the server:

scp -r myapp/ username@spark.rstudio.com:ShinyApps/

Or if you don't want to copy all the files, you can copy them one at a time:

scp myapp/ui.R username@spark.rstudio.com:ShinyApps/myapp/ui.R

Try It Out

Now you should be able to access your application at http://spark.rstudio.com/username/myapp/

You can now send it to all your friends or leave a link in the comments!

Thursday, September 26, 2013

Perform a Function on Each File in R

Sometimes you might have several data files and want to use R to perform the same function across all of them. Or maybe you have multiple files and want to systematically combine them into one file without having to open each file and manually copy the data out.

Fortunately, it's not complicated to use R to systematically iterate across files.

Finding or Choosing the Names of Data Files

There are multiple ways to find or choose the names of the files you want to analyze.

You can explicitly state the file names or you can get R to find any files with a particular extension.

Explicitly Stating File Names

fileNames <- c("sample1.csv", "sample2.csv")

Finding Files with a Specific Extension

In this case, we use Sys.glob from the base package to find all files matching the wildcard pattern "*.csv".

fileNames <- Sys.glob("*.csv")

Iterating Across All Files

We'll start with a loop and then we can add whatever functions we want to the inside of the loop:

for (fileName in fileNames) {

  # read data:
  sample <- read.csv(fileName,
    header = TRUE,
    sep = ",")

  # add more stuff here

}

For example, we could add one to every "Widget" value in each file and overwrite the old data with the new data:

for (fileName in fileNames) {

  # read old data:
  sample <- read.csv(fileName,
    header = TRUE,
    sep = ",")

  # add one to every widget value in every file:
  sample$Widgets <- sample$Widgets + 1
  
  # overwrite old data with new data:
  write.table(sample, 
    fileName,
    append = FALSE,
    quote = FALSE,
    sep = ",",
    row.names = FALSE,
    col.names = TRUE)

}

Or we could do the same thing, but create a new copy of each file:

extension <- "csv"

fileNames <- Sys.glob(paste("*.", extension, sep = ""))

fileNumbers <- seq(fileNames)

for (fileNumber in fileNumbers) {

  newFileName <-  paste("new-", 
    sub(paste("\\.", extension, sep = ""), "", fileNames[fileNumber]), 
    ".", extension, sep = "")

  # read old data:
  sample <- read.csv(fileNames[fileNumber],
    header = TRUE,
    sep = ",")

  # add one to every widget value in every file:
  sample$Widgets <- sample$Widgets + 1
  
  # write new data to new files:
  write.table(sample, 
    newFileName,
    append = FALSE,
    quote = FALSE,
    sep = ",",
    row.names = FALSE,
    col.names = TRUE)

}

In the above example, we used the paste and sub functions from the base package to automatically create new file names based on the old file names.
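
Here is the file name construction on its own, as a small sketch that touches no files. The file names are hypothetical, and the "$" anchor is an addition not present in the loop above; it limits the match to the end of the name.

```r
extension <- "csv"
fileNames <- c("sample1.csv", "sample2.csv")  # hypothetical names

# strip the extension; "$" anchors the match to the end of the name
baseNames <- sub(paste0("\\.", extension, "$"), "", fileNames)

# rebuild each name with a "new-" prefix
newFileNames <- paste0("new-", baseNames, ".", extension)
newFileNames
```

Both sub and paste0 are vectorized, so all the new names can be built in one step before the loop starts.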

Or we could instead use each dataset to create an entirely new dataset, where each row is based on data from one file:

fileNames <- Sys.glob("*.csv")

for (fileName in fileNames) {

  # read original data:
  sample <- read.csv(fileName,
    header = TRUE,
    sep = ",")

  # create new data based on contents of original file:
  allWidgets <- data.frame(
    File = fileName,
    Widgets = sum(sample$Widgets))
  
  # write new data to separate file:
  write.table(allWidgets, 
    "Output/sample-allSamples.csv",
    append = TRUE,
    sep = ",",
    row.names = FALSE,
    col.names = FALSE)

}

In the above example, data.frame is used to create a new data row based on each data file. Then the append option of write.table is set to TRUE so that row can be added to the other rows created from other data files.
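
An alternative to appending inside the loop is to collect one row per file and combine them at the end. The sketch below assumes two small made-up CSV files written to a temporary directory, so it does not touch any real data; the function name summarizeFile is hypothetical.

```r
# work in a temporary directory with made-up sample files
workDir <- file.path(tempdir(), "each-file-sketch")
dir.create(workDir, showWarnings = FALSE)
write.csv(data.frame(Widgets = c(1, 2, 3)),
  file.path(workDir, "sample1.csv"), row.names = FALSE)
write.csv(data.frame(Widgets = c(10, 20)),
  file.path(workDir, "sample2.csv"), row.names = FALSE)

fileNames <- Sys.glob(file.path(workDir, "*.csv"))

# hypothetical helper: one summary row per file
summarizeFile <- function(fileName) {
  sample <- read.csv(fileName, header = TRUE)
  data.frame(File = basename(fileName),
    Widgets = sum(sample$Widgets))
}

# apply the helper to every file and stack the rows
allWidgets <- do.call(rbind, lapply(fileNames, summarizeFile))
allWidgets
```

Building the whole data frame first means the output can be written once, with a header, and re-running the script does not add duplicate rows the way append = TRUE would.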

Those are just a few examples of how you can use R to perform the same function(s) on a large number of files without having to manually run each one. I'm sure you can think of more uses.

All the files are available on GitHub. You can see how eachFile.R, eachfile-newNames.R, and eachFile-append.R each do something different to the sample datasets.

Thursday, September 19, 2013

Truncate by Delimiter in R

Sometimes, you only need to analyze part of the data stored as a vector. In this example, there is a list of patents. Each patent has been assigned to one or more patent classes. Let's say that we want to analyze the dataset based on only the first patent class listed for each patent.

patents <- data.frame(
  patent = 1:30,
  class = c("405", "33/209", "549/514", "110", "540", "43", 
  "315/327", "540", "536/514", "523/522", "315", 
  "138/248/285", "24", "365", "73/116/137", "73/200", 
  "252/508", "96/261", "327/318", "426/424/512", 
  "75/423", "430", "416", "536/423/530", "381/181", "4", 
  "340/187", "423/75", "360/392/G9B", "524/106/423"))

We can use regular expressions to truncate each element of the vector just before the first "/".

grep, grepl, sub, gsub, regexpr, gregexpr, and regexec are all functions in the base package that allow you to use regular expressions within each element of a character vector. sub and gsub allow you to replace within each element of the vector. sub replaces the first match within each element, while gsub replaces all matches within each element. In this case, we want to remove everything from the first "/" on, and we want to replace it with nothing. Here's how we can use sub to do that:

patents$primaryClass <- sub("/.*", "", patents$class)

> table(patents$primaryClass)

110 138  24 252 315 327  33 340 360 365 381   4 405 416 423 426  43 430 523 524 
  1   1   1   1   2   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
536 540 549  73  75  96 
  2   2   1   2   1   1
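
To make the difference between sub and gsub concrete, here is a short sketch on three of the class strings from above. The strsplit version at the end is an alternative I'm adding for comparison, not part of the original approach.

```r
classes <- c("405", "33/209", "138/248/285")

# drop everything from the first "/" onward
primary <- sub("/.*", "", classes)

# sub changes only the first match; gsub changes all of them
firstOnly <- sub("/", "-", classes)
allMatches <- gsub("/", "-", classes)

# alternative: split on "/" and keep the first piece of each element
viaSplit <- vapply(strsplit(classes, "/", fixed = TRUE),
  function(parts) parts[1], character(1))

primary
firstOnly
allMatches
```

Elements without a "/" pass through unchanged in every version.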

--
This post is one part of my series on Text to Columns.

Citations and Further Reading

Regular Expressions Cheat Sheet

Thursday, September 12, 2013

Only Load Data If Not Already Open in R

I often find it beneficial to check to see whether or not a dataset is already loaded into R at the beginning of a file. This is particularly helpful when I'm dealing with a large file that I don't want to load repeatedly, and when I might be using the same dataset with multiple R scripts or re-running the same script while making changes to the code.

To check to see if an object with that name is already loaded, we can use the exists function from the base package. We can then wrap our read.csv command with an if statement to cause the file to only load if an object with that name is not already loaded.

if(!exists("largeData")) {
  largeData <- read.csv("huge-file.csv",
    header = TRUE)
}
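
The sketch below shows the effect of the check by running the same block twice and counting how often the file is actually read. The temporary file and the names largeDataSketch and loadCount are hypothetical, used only for this demonstration.

```r
# made-up stand-in for a large file
csvPath <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3), csvPath, row.names = FALSE)

loadCount <- 0

# simulate running the same script twice in one session
for (run in 1:2) {
  if (!exists("largeDataSketch")) {
    largeDataSketch <- read.csv(csvPath, header = TRUE)
    loadCount <- loadCount + 1
  }
}

loadCount  # the file was read only on the first pass
```

Keep in mind that exists only checks for an object with that name. It does not confirm that the object holds the data you expect, so remove the object with rm when the file on disk changes.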

You will probably also find it useful to use the "colClasses" option of read.csv or read.table to help the file load faster and make sure your data are in the right format. For example:

if(!exists("largeData")) {
  largeData <- read.csv("huge-file.csv",
    header = TRUE,
    colClasses = c("factor", "integer", "character", "integer", 
      "integer", "character"))
}

--
This post is one part of my series on dealing with large datasets.

Thursday, September 5, 2013

Using colClasses to Load Data More Quickly in R

Specifying a colClasses argument to read.table or read.csv can save time on importing data, while also saving steps to specify classes for each variable later.

For example, loading an 893 MB file took 441 seconds when not using colClasses, but only 268 seconds when using colClasses. The system.time function in base can help you check your own times.

Without specifying colClasses:

   user  system elapsed 
441.224   8.200 454.155

When specifying colClasses:

   user  system elapsed 
268.036   6.096 284.099

The classes you can specify are: factor, character, integer, numeric, logical, complex, and Date. Dates that are in the form %Y-%m-%d or %Y/%m/%d will import correctly. This tip allows you to import dates properly for dates in other formats.

system.time(largeData <- read.csv("huge-file.csv",
  header = TRUE,
  colClasses = c("character", "character", "complex", 
    "factor", "factor", "character", "integer", 
    "integer", "numeric", "character", "character",
    "Date", "integer", "logical")))

If there aren't any classes that you want to change from their defaults, you can read in the first few rows, determine the classes from that, and then import the rest of the file:

sampleData <- read.csv("huge-file.csv", header = TRUE, nrows = 5)
classes <- sapply(sampleData, class)
largeData <- read.csv("huge-file.csv", header = TRUE, colClasses = classes)
str(largeData)

If you aren't concerned about the time it takes to read the data file, but instead just want the classes to be correct on import, you have the option of only specifying certain classes:

smallData <- read.csv("small-file.csv", 
 header = TRUE,
 colClasses=c("variableName"="character"))

> class(smallData$variableName)
[1] "character"
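
One case where getting the class right on import matters is an identifier with leading zeros. This is a small sketch with a made-up file; the column names zip and count are hypothetical.

```r
# made-up file with ZIP codes that start with a zero
csvPath <- tempfile(fileext = ".csv")
writeLines(c("zip,count", "02139,5", "10001,7"), csvPath)

# default import: zip is guessed to be an integer, dropping the zero
defaultRead <- read.csv(csvPath, header = TRUE)
class(defaultRead$zip)

# name only the column that needs a specific class
typedRead <- read.csv(csvPath,
  header = TRUE,
  colClasses = c(zip = "character"))
typedRead$zip
class(typedRead$count)  # unnamed columns are still guessed
```

Converting with as.character after a default import would not bring the leading zero back, which is why the class has to be set at read time.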

--
This post is one part of my series on dealing with large datasets.

Thursday, August 29, 2013

Plot Weekly or Monthly Totals in R

When plotting time series data, you might want to bin the values so that each data point corresponds to the sum for a given month or week. This post will show an easy way to use cut and ggplot2's stat_summary to plot month totals in R without needing to reorganize the data into a second data frame.

Let's start with a simple sample data set with a series of dates and quantities:

library(ggplot2)
library(scales)

# load data:
log <- data.frame(Date = c("2013/05/25","2013/05/28","2013/05/31","2013/06/01","2013/06/02","2013/06/05","2013/06/07"), 
  Quantity = c(9,1,15,4,5,17,18))

log

str(log)

> log
        Date Quantity
1 2013/05/25        9
2 2013/05/28        1
3 2013/05/31       15
4 2013/06/01        4
5 2013/06/02        5
6 2013/06/05       17
7 2013/06/07       18

> str(log)
'data.frame': 7 obs. of  2 variables:
 $ Date    : Factor w/ 7 levels "2013/05/25","2013/05/28",..: 1 2 3 4 5 6 7
 $ Quantity: num  9 1 15 4 5 17 18

Next, if the date data is not already in a date format, we'll need to convert it to date format:

# convert date variable from factor to date format:
log$Date <- as.Date(log$Date,
  "%Y/%m/%d") # format the dates are currently stored in
str(log)

> str(log)
'data.frame': 7 obs. of  2 variables:
 $ Date    : Date, format: "2013-05-25" "2013-05-28" ...
 $ Quantity: num  9 1 15 4 5 17 18

Next we need to create variables stating the week and month of each observation. For week, cut has an option that allows you to break weeks as you'd like, beginning weeks on either Sunday or Monday.

# create variables of the week and month of each observation:
log$Month <- as.Date(cut(log$Date,
  breaks = "month"))
log$Week <- as.Date(cut(log$Date,
  breaks = "week",
  start.on.monday = FALSE)) # changes weekly break point to Sunday
log

> log
        Date Quantity      Month       Week
1 2013-05-25        9 2013-05-01 2013-05-19
2 2013-05-28        1 2013-05-01 2013-05-26
3 2013-05-31       15 2013-05-01 2013-05-26
4 2013-06-01        4 2013-06-01 2013-05-26
5 2013-06-02        5 2013-06-01 2013-06-02
6 2013-06-05       17 2013-06-01 2013-06-02
7 2013-06-07       18 2013-06-01 2013-06-02
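
If you want to check the totals that will be plotted, base R's aggregate gives the same sums as a table. This is a sketch for verification only, using a copy of the sample data under the hypothetical name entries so that the base log function is not masked.

```r
# copy of the sample data from above
entries <- data.frame(
  Date = as.Date(c("2013/05/25", "2013/05/28", "2013/05/31",
    "2013/06/01", "2013/06/02", "2013/06/05", "2013/06/07"),
    "%Y/%m/%d"),
  Quantity = c(9, 1, 15, 4, 5, 17, 18))

# first day of the month and of the (Sunday-based) week
entries$Month <- as.Date(cut(entries$Date, breaks = "month"))
entries$Week <- as.Date(cut(entries$Date, breaks = "week",
  start.on.monday = FALSE))

# one row per month or week, with the summed quantity
monthTotals <- aggregate(Quantity ~ Month, data = entries, FUN = sum)
weekTotals <- aggregate(Quantity ~ Week, data = entries, FUN = sum)
monthTotals
weekTotals
```

These are the bar heights to expect in the plots: 25 and 44 by month, and 9, 20, and 40 by week.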

Finally, we can create either a line or bar plot of the data by month and by week, using stat_summary to sum up the values associated with each week or month:

# graph by month:
ggplot(data = log,
  aes(Month, Quantity)) +
  stat_summary(fun.y = sum, # adds up all observations for the month
    geom = "bar") + # or "line"
  scale_x_date(
    labels = date_format("%Y-%m"),
    breaks = "1 month") # custom x-axis labels

Time series plot, binned by month

# graph by week:
ggplot(data = log,
  aes(Week, Quantity)) +
  stat_summary(fun.y = sum, # adds up all observations for the week
    geom = "bar") + # or "line"
  scale_x_date(
    labels = date_format("%Y-%m-%d"),
    breaks = "1 week") # custom x-axis labels

Time series plot, totaled by week

The full code is available in a gist.

References

http://stackoverflow.com/questions/3496536/barplot-totals-by-month-with-ggplot

Thursday, August 22, 2013

Date Formats in R

Importing Dates

Dates can be imported from character, numeric, POSIXlt, and POSIXct formats using the as.Date function from the base package.

If your data were exported from Excel, they will possibly be in numeric format. Otherwise, they will most likely be stored in character format.

Importing Dates from Character Format

If your dates are stored as characters, you simply need to provide as.Date with your vector of dates and the format they are currently stored in. The possible date segment formats are listed in a table below.

For example,

"05/27/84" is in the format %m/%d/%y, while "May 27 1984" is in the format %B %d %Y.

To import those dates, you would simply provide your dates and their format (if not specified, it tries %Y-%m-%d and then %Y/%m/%d):

dates <- c("05/27/84", "07/07/05")
betterDates <- as.Date(dates,
  format = "%m/%d/%y")

> betterDates
[1] "1984-05-27" "2005-07-07"

Or:

dates <- c("May 27 1984", "July 7 2005")
betterDates <- as.Date(dates,
  format = "%B %d %Y")

> betterDates
[1] "1984-05-27" "2005-07-07"

This outputs the dates in the ISO 8601 international standard format %Y-%m-%d. If you would like to use dates in a different format, read "Changing Date Formats" below.

Importing Dates from Numeric Format

If you are importing data from Excel, you may have dates that are in a numeric format. We can still use as.Date to import these; we simply need to know the origin date that Excel starts counting from, and provide that to as.Date.

For Excel on Windows, the origin date is December 30, 1899 for dates after 1900. (Excel's designer thought 1900 was a leap year, but it was not.) For Excel on Mac, the origin date is January 1, 1904.

# from Windows Excel:
  dates <- c(30829, 38540)
  betterDates <- as.Date(dates,
    origin = "1899-12-30")

>   betterDates
[1] "1984-05-27" "2005-07-07"

# from Mac Excel:
  dates <- c(29367, 37078)
  betterDates <- as.Date(dates,
    origin = "1904-01-01")

>   betterDates
[1] "1984-05-27" "2005-07-07"

This outputs the dates in the ISO 8601 international standard format %Y-%m-%d. If you would like to use dates in a different format, read the next step:

Changing Date Formats

If you would like to use dates in a format other than the standard %Y-%m-%d, you can do that using the format function from the base package.

For example,

format(betterDates,
  "%a %b %d")

[1] "Sun May 27" "Thu Jul 07"

Correct Centuries

If you are importing data with only two digits for the years, you will find that it assumes that years 69 to 99 are 1969-1999, while years 00 to 68 are 2000-2068 (subject to change in future versions of R).

Often, this is not what you intend to have happen. This page gives a good explanation of several ways to fix this depending on your preference of centuries. One solution it provides is to assume all dates R is placing in the future are really from the previous century. That solution is as follows:

dates <- c("05/27/84", "07/07/05", "08/17/20")
betterDates <- as.Date(dates, "%m/%d/%y")

> betterDates
[1] "1984-05-27" "2005-07-07" "2020-08-17"

correctCentury <- as.Date(ifelse(betterDates > Sys.Date(), 
  format(betterDates, "19%y-%m-%d"), 
  format(betterDates)))

> correctCentury
[1] "1984-05-27" "2005-07-07" "1920-08-17"
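
Because the comparison uses Sys.Date(), the result depends on the day the code is run. The sketch below wraps the same logic in a function with an explicit cutoff so the outcome is reproducible; the function name fixCentury and the cutoff argument are hypothetical, and the cutoff date is chosen only for illustration.

```r
# hypothetical helper: dates after `cutoff` are moved back a century
fixCentury <- function(dates, cutoff = Sys.Date()) {
  parsed <- as.Date(dates, format = "%m/%d/%y")
  as.Date(ifelse(parsed > cutoff,
    format(parsed, "19%y-%m-%d"),  # rewrite the century as 19xx
    format(parsed)))               # keep the date as parsed
}

dates <- c("05/27/84", "07/07/05", "08/17/20")

# with a 2013 cutoff, 2020 is in the future and becomes 1920
fixed <- fixCentury(dates, cutoff = as.Date("2013-08-22"))
fixed
```

Leaving cutoff at its default reproduces the original behavior, where anything later than today is treated as belonging to the 1900s.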

Purpose of Proper Formatting

Having your dates in the proper format lets R know that they are dates, so it knows which calculations it should and should not perform on them. For one example, see my post on plotting weekly or monthly totals. Here are a few more examples:

>   mean(betterDates)
[1] "1994-12-16"

>   max(betterDates)
[1] "2005-07-07"

>   min(betterDates)
[1] "1984-05-27"
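
A few more calculations that only work once the values are real dates, as a short sketch using the same two dates:

```r
betterDates <- as.Date(c("1984-05-27", "2005-07-07"))

# number of days between the two dates
daysBetween <- as.numeric(diff(betterDates))

# adding a number shifts a date by that many days
nextWeek <- betterDates[1] + 7

# a sequence of dates one month apart
monthly <- seq(betterDates[1], by = "month", length.out = 3)

daysBetween
nextWeek
monthly
```

None of these would work on the original character strings, which is the practical reason to convert with as.Date first.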

The code is available in a gist.

Date Formats

Conversion specification	Description	Example
%a	Abbreviated weekday	Sun, Thu
%A	Full weekday	Sunday, Thursday
%b or %h	Abbreviated month	May, Jul
%B	Full month	May, July
%d	Day of the month 01-31	27, 07
%j	Day of the year 001-366	148, 188
%m	Month 01-12	05, 07
%U	Week 00-53 with Sunday as first day of the week	22, 27
%w	Weekday 0-6 Sunday is 0	0, 4
%W	Week 00-53 with Monday as first day of the week	21, 27
%x	Date, locale-specific
%y	Year without century 00-99; on input, 00 to 68 are prefixed by 20 and 69 to 99 by 19	84, 05
%Y	Year with century	1984, 2005
%C	Century	19, 20
%D	Date formatted %m/%d/%y	05/27/84, 07/07/05
%u	Weekday 1-7 Monday is 1	7, 4

%n	Newline on output or Arbitrary whitespace on input
%t	Tab on output or Arbitrary whitespace on input

References

help(as.Date)
help(strptime)
http://stackoverflow.com/questions/9508747/r-adding-century-to-year

Thursday, March 7, 2013

geom_point Legend with Custom Colors in ggplot

Previously, I showed how to make line segments using ggplot.

Working from that previous example, there are only a few things we need to change to add custom colors to our plot and legend in ggplot.

First, we'll add the colors of our choice. I'll do this using RColorBrewer, but you can choose whatever method you'd like.

library(RColorBrewer)
colors = brewer.pal(8, "Dark2")

The next section will be exactly the same as the previous example, except for removing the scale_color_discrete line to make way for the scale_color_manual we'll be adding later.

library(ggplot2)

data <- as.data.frame(USPersonalExpenditure) # data from package datasets
data$Category <- as.character(rownames(USPersonalExpenditure)) # this makes things simpler later

ggplot(data,
  aes(x = Expenditure,
    y = Category)) +
labs(x = "Expenditure",
  y = "Category") +
geom_segment(aes(x = data$"1940",
    y = Category,
    xend = data$"1960",
    yend = Category),
  size = 1) +
geom_point(aes(x = data$"1940",
    color = "1940"), # these can be any string, they just need to be unique identifiers
  size = 4,
  shape = 15) +
geom_point(aes(x = data$"1960",
    color = "1960"),
  size = 4,
  shape = 15) +
theme(legend.position = "bottom") +

And finally, we'll add a scale_color_manual line to our plot. We need to define the name, labels, and colors of the plot.

scale_color_manual(name = "Year", # or name = element_blank()
  labels = c(1940, 1960),
  values = colors)

And here's our final plot, complete with whatever custom colors we've chosen in both the plot and legend:

geom_point in ggplot with custom colors in the graph and legend

I've updated the gist from the previous post to also include a file that has custom colors.

Thursday, February 28, 2013

Shapefiles in R

Let's learn how to use shapefiles in R. This will allow us to map data for complicated areas or jurisdictions like ZIP codes or school districts. For the United States, many shapefiles are available from the Census Bureau. Our example will map U.S. national parks.

First, download the U.S. Parks and Protected Lands shape files from Natural Earth. We'll be using the ne_10m_parks_and_protected_lands_area.shp file.

Next, start working in R. First, we'll load maptools and then the shapefile:

# load up area shape file:
library(maptools)
area <- readShapePoly("ne_10m_parks_and_protected_lands_area.shp")

# # or file.choose:
# area <- readShapePoly(file.choose())

Next we can set the colors we want to use. And then we can set up our basemap.

library(RColorBrewer)
colors <- brewer.pal(9, "BuGn")

library(ggmap)
mapImage <- get_map(location = c(lon = -118, lat = 37.5),
  color = "color",
  source = "osm",
  # maptype = "terrain",
  zoom = 6)

Next, we can use the fortify function from the ggplot2 package. This converts the crazy shape file with all its nested attributes into a data frame that ggmap will know what to do with.

area.points <- fortify(area)

Finally, we can map our shape files!

ggmap(mapImage) +
  geom_polygon(aes(x = long,
      y = lat,
      group = group),
    data = area.points,
    color = colors[9],
    fill = colors[6],
    alpha = 0.5) +
labs(x = "Longitude",
  y = "Latitude")

National Parks and Protected Lands in California and Nevada

Same figure, with a Stamen terrain basemap with ColorBrewer palette "RdPu"

The full code is available as a gist.