Thursday, August 29, 2013

Plot Weekly or Monthly Totals in R

When plotting time series data, you might want to bin the values so that each data point corresponds to the sum for a given month or week. This post will show an easy way to use cut and ggplot2's stat_summary to plot month totals in R without needing to reorganize the data into a second data frame.

Let's start with a simple sample data set with a series of dates and quantities:

library(ggplot2)
library(scales)

# load data:
log <- data.frame(Date = c("2013/05/25","2013/05/28","2013/05/31","2013/06/01","2013/06/02","2013/06/05","2013/06/07"), 
  Quantity = c(9,1,15,4,5,17,18))
log
str(log)


> log
        Date Quantity
1 2013/05/25        9
2 2013/05/28        1
3 2013/05/31       15
4 2013/06/01        4
5 2013/06/02        5
6 2013/06/05       17
7 2013/06/07       18

> str(log)
'data.frame': 7 obs. of  2 variables:
 $ Date    : Factor w/ 7 levels "2013/05/25","2013/05/28",..: 1 2 3 4 5 6 7
 $ Quantity: num  9 1 15 4 5 17 18


Next, if the date data is not already in a date format, we'll need to convert it to date format:

# convert date variable from factor to date format:
log$Date <- as.Date(log$Date,
  "%Y/%m/%d") # tabulate all the options here
str(log)

> str(log)
'data.frame': 7 obs. of  2 variables:
 $ Date    : Date, format: "2013-05-25" "2013-05-28" ...
 $ Quantity: num  9 1 15 4 5 17 18

Next we need to create variables stating the week and month of each observation. For week, cut has an option that allows you to break weeks as you'd like, beginning weeks on either Sunday or Monday.

# create variables of the week and month of each observation:
log$Month <- as.Date(cut(log$Date,
  breaks = "month"))
log$Week <- as.Date(cut(log$Date,
  breaks = "week",
  start.on.monday = FALSE)) # changes weekly break point to Sunday
log

> log
        Date Quantity      Month       Week
1 2013-05-25        9 2013-05-01 2013-05-19
2 2013-05-28        1 2013-05-01 2013-05-26
3 2013-05-31       15 2013-05-01 2013-05-26
4 2013-06-01        4 2013-06-01 2013-05-26
5 2013-06-02        5 2013-06-01 2013-06-02
6 2013-06-05       17 2013-06-01 2013-06-02
7 2013-06-07       18 2013-06-01 2013-06-02

Finally, we can create either a line or bar plot of the data by month and by week, using stat_summary to sum up the values associated with each week or month:

# graph by month:
ggplot(data = log,
  aes(Month, Quantity)) +
  stat_summary(fun.y = sum, # adds up all observations for the month
    geom = "bar") + # or "line"
  scale_x_date(
    labels = date_format("%Y-%m"),
    breaks = "1 month") # custom x-axis labels
Time series plot, binned by month


# graph by week:
ggplot(data = log,
  aes(Week, Quantity)) +
  stat_summary(fun.y = sum, # adds up all observations for the week
    geom = "bar") + # or "line"
  scale_x_date(
    labels = date_format("%Y-%m-%d"),
    breaks = "1 week") # custom x-axis labels
Time series plot, totaled by week


The full code is available in a gist.

References

Thursday, August 22, 2013

Date Formats in R

Importing Dates

Dates can be imported from character, numeric, POSIXlt, and POSIXct formats using the as.Date function from the base package.

If your data were exported from Excel, they will possibly be in numeric format. Otherwise, they will most likely be stored in character format.

Importing Dates from Character Format

If your dates are stored as characters, you simply need to provide as.Date with your vector of dates and the format they are currently stored in. The possible date segment formats are listed in a table below. 

For example,
"05/27/84" is in the format %m/%d/%y, while "May 27 1984" is in the format %B %d %Y.

To import those dates, you would simply provide your dates and their format (if not specified, it tries %Y-%m-%d and then %Y/%m/%d):


dates <- c("05/27/84", "07/07/05")
betterDates <- as.Date(dates,
  format = "%m/%d/%y")
> betterDates
[1] "1984-05-27" "2005-07-07"

Or:

dates <- c("May 27 1984", "July 7 2005")
betterDates <- as.Date(dates,
  format = "%B %d %Y")

> betterDates
[1] "1984-05-27" "2005-07-07"

This outputs the dates in the ISO 8601 international standard format %Y-%m-%d. If you would like to use dates in a different format, read "Changing Date Formats" below.

Importing Dates from Numeric Format

If you are importing data from Excel, you may have dates that are in a numeric format. We can still use as.Date to import these, we simply need to know the origin date that Excel starts counting from, and provide that to as.Date.

For Excel on Windows, the origin date is December 30, 1899 for dates after 1900. (Excel's designer thought 1900 was a leap year, but it was not.) For Excel on Mac, the origin date is January 1, 1904.

# from Windows Excel:
  dates <- c(30829, 38540)
  betterDates <- as.Date(dates,
    origin = "1899-12-30")

>   betterDates
[1] "1984-05-27" "2005-07-07"

# from Mac Excel:
  dates <- c(29367, 37078)
  betterDates <- as.Date(dates,
    origin = "1904-01-01")

>   betterDates
[1] "1984-05-27" "2005-07-07"

This outputs the dates in the ISO 8601 international standard format %Y-%m-%d. If you would like to use dates in a different format, read the next step:

Changing Date Formats

If you would like to use dates in a format other than the standard %Y-%m-%d, you can do that using the format function from the base package.

For example,

format(betterDates,
  "%a %b %d")

[1] "Sun May 27" "Thu Jul 07"

Correct Centuries

If you are importing data with only two digits for the years, you will find that it assumes that years 69 to 99 are 1969-1999, while years 00 to 68 are 2000--2068 (subject to change in future versions of R).

Often, this is not what you intend to have happen. This page gives a good explanation of several ways to fix this depending on your preference of centuries. One solution it provides is to assume all dates R is placing in the future are really from the previous century. That solution is as follows:

dates <- c("05/27/84", "07/07/05", "08/17/20")
betterDates <- as.Date(dates, "%m/%d/%y")

> betterDates
[1] "1984-05-27" "2005-07-07" "2020-08-17"

correctCentury <- as.Date(ifelse(betterDates > Sys.Date(), 
  format(betterDates, "19%y-%m-%d"), 
  format(betterDates)))

> correctCentury
[1] "1984-05-27" "2005-07-07" "1920-08-17"

Purpose of Proper Formatting

Having your dates in the proper format allows R to know that they are dates, and as such knows what calculations it should and should not perform on them. For one example, see my post on plotting weekly or monthly totals. Here are a few more examples:

>   mean(betterDates)
[1] "1994-12-16"

>   max(betterDates)
[1] "2005-07-07"

>   min(betterDates)
[1] "1984-05-27"



The code is available in a gist.

Date Formats

Conversion specificationDescriptionExample
%aAbbreviated weekdaySun, Thu
%AFull weekdaySunday, Thursday
%b or %hAbbreviated monthMay, Jul
%BFull monthMay, July
%dDay of the month
01-31
27, 07
%jDay of the year
001-366
148, 188
%mMonth
01-12
05, 07
%UWeek
01-53
with Sunday as first day of the week
22, 27
%wWeekday
0-6
Sunday is 0
0, 4
%WWeek
00-53
with Monday as first day of the week
21, 27
%xDate, locale-specific
%yYear without century
00-99
84, 05
%YYear with century
on input:
00 to 68 prefixed by 20
69 to 99 prefixed by 19
1984, 2005
%CCentury19, 20
%DDate formatted %m/%d/%y05/27/84, 07/07/05
%uWeekday
1-7
Monday is 1
7, 4
%nNewline on output or
Arbitrary whitespace on input
%tTab on output or
Arbitrary whitespace on input

References