Thursday, January 31, 2013

Using Line Segments to Compare Values in R

Sometimes you want to create a graph that will allow the viewer to see in one glance:
  • The original value of a variable
  • The new value of the variable
  • The change between old and new
One method I like to use to do this is using geom_segment and geom_point in the ggplot2 package.

First, let's load ggplot2 and our data:

library(ggplot2)

data <- as.data.frame(USPersonalExpenditure) # data from package datasets
data$Category <- as.character(rownames(USPersonalExpenditure)) # this makes things simpler later

Next, we'll set up our plot and axes:

ggplot(data,
 aes(y = Category)) +
labs(x = "Expenditure",
 y = "Category") +

For geom_segment, we need to provide four variables. (Sometimes two of the four will be the same, like in this case.) x and y provide the start points, and xend and yend provide the endpoints.

In this case, we want to show the change between 1940 and 1960 for each category. Therefore our variables are the following:
  • x: "1940"
  • y: Category
  • xend: "1960"
  • yend: Category
geom_segment(aes(x = data$"1940",
  y = Category,
  xend = data$"1960",
  yend = Category),
 size = 1) +

Next, we want to plot points for the 1940 and 1960 values. We could do the same for the 1945, 1950, and 1955 values, if we wanted to.

geom_point(aes(x = data$"1940",
 color = "1940"),
 size = 4, shape = 15) +
geom_point(aes(x = data$"1960",
 color = "1960"),
 size = 4, shape = 15) +

Finally, we'll finish up by touching up the legend for the plot:

scale_color_discrete(name = "Year") +
theme(legend.position = "bottom")

geom_segment, then geom_point

The order of geom_segment and the geom_points matters. The first geom line in the code will get plotted first. Therefore, if you want the points displayed over the segments, put the segments first in the code. Likewise, if you want the segments displayed over the points, put the points first in the code.

For example, we could change the middle section of the code to:

geom_point(aes(x = data$"1940",
 color = "1940"),
 size = 4, shape = 15) +
geom_point(aes(x = data$"1960",
 color = "1960"),
 size = 4, shape = 15) +

geom_segment(aes(x = data$"1940",
  y = Category,
  xend = data$"1960",
  yend = Category),
 size = 1) +

And the output would look like:
geom_point then geom_segment


Similarly, if you have points that will be overlapping, make sure you think about which of the point lines you want R to plot first.

The code is available in a gist.

Thursday, January 24, 2013

Storing a Function in a Separate File in R

If you're going to be using a function across several different R files, you might want to store the function in its own file.

If you want to name the function in its own file

This is probably the best option in general, if only because you may want to put more than one function in a single file.

Next, let's make our function in the file fun.R:

mult <- function(x, y) {
    x*y
}

If you get the warning message "In readLines(file) : incomplete final line found on 'fun.R'", just insert a line break at the end of the fun.R file.

If you want to name the function in the file running it

First let's make the same function (but this time unnamed) in the file times.R:

function(x, y) {
    x*y
}

Calling the functions

And finally we'll make a file file.R to call our functions:

times <- dget("times.R")
times(-4:4, 2)

source("fun.R")
mult(-4:4, 2)

Note: if you are used to using source to run your R code, note that in this case we are using the source command within a file.


All files are available as a gist.

Thursday, January 17, 2013

Calculating a Gini Coefficients for a Number of Locales at Once in R

The Gini coefficient is a measure of the inequality of a distribution, most commonly used to compare inequality in income or wealth among countries.

Let's first generate some random data to analyze. You can download my random data or use the code below to generate your own. Of course, if you generate your own, your graphs and results will be different from those shown below.

city <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")
income <- sample(1:100000,
 100,
 replace = TRUE)
cities <- data.frame(city, income)

Next, let's graph our data:

library(ggplot2)
ggplot(cities,
 aes(income)) +
 stat_density(geom = "path",
  position = "identity") +
 facet_wrap(~ city, ncol = 2)

Histogram of each city's incomes
Your results will differ if using random data

The Gini coefficient is easy enough to calculate in R for a single locale using the gini function from the reldist package.

library(reldist)
gini(cities[which(cities$city == "A"), ]$income)

But we don't want to replicate this code over and over to calculate the Gini coefficient for a large number of locales. We also want the coefficients to be in a data frame for easy use in R or for export for use in another program.

There are many ways to automate a function to run over many subsets of a data frame. The most straightforward in our particular case is aggregate:

ginicities <- aggregate(income ~ city,
 data = cities,
 FUN = "gini")
names(ginisec) <- c("city", "gini")



> ginisec
   city      gini
1     A 0.2856827
2     B 0.3639070
3     C 0.3288934
4     D 0.1863783
5     E 0.3565739
6     F 0.2587475
7     G 0.3022642
8     H 0.3795288
9     I 0.3311034
10    J 0.2496933


And finally, let's go ahead and export our data using write.csv:


write.csv(ginicities, "gini.csv",
 row.names = FALSE)


While you're at it, you might want to try using other functions on your dataset, such as mean, median, and length.

The full code is available in a gist.

Citations and Further Reading

Thursday, January 10, 2013

Stacked Bar Charts in R

Reshape Wide to Long

Let's use the Loblolly dataset from the datasets package. These data track the growth of some loblolly pine trees.

> Loblolly[1:10,]
   height age Seed
1    4.51   3  301
15  10.89   5  301
29  28.72  10  301
43  41.74  15  301
57  52.70  20  301
71  60.92  25  301
2    4.55   3  303
16  10.92   5  303
30  29.07  10  303
44  42.83  15  303

First, we need to convert the data to wide form, so each age (i.e. 3, 5, 10, 15, 20, 25) will have its own variable.

wide <- reshape(Loblolly,
 v.names = "height",
 timevar = "age",
 idvar = "Seed",
 direction = "wide")

> wide[1:5,]
  Seed height.3 height.5 height.10 height.15 height.20 height.25
1  301     4.51    10.89     28.72     41.74     52.70     60.92
2  303     4.55    10.92     29.07     42.83     53.88     63.39
3  305     4.79    11.37     30.21     44.40     55.82     64.10
4  307     3.91     9.48     25.66     39.07     50.78     59.07
5  309     4.81    11.20     28.66     41.66     53.31     63.05

Create Variables

Then we want to create new columns showing how much each tree has grown between data points. For example, instead of knowing a tree's height at age 10, we want to know how much it's grown between age 5 and age 10, so that can be a bar in our graph.

wide$h0.3 <- wide$height.3
wide$h3.5 <- wide$height.5 - wide$height.3
wide$h5.10 <- wide$height.10 - wide$height.5
wide$h10.15 <- wide$height.15 - wide$height.10
wide$h15.20 <- wide$height.20 - wide$height.15
wide$h20.25 <- wide$height.25 - wide$height.20

Plot Stacked Bar Chart

Finally, we want to plot all the new data points:

library(RColorBrewer)
sequential <- brewer.pal(6, "BuGn")
barplot(t(wide[,8:13]),
 names.arg = wide$Seed, # x-axis labels
 cex.names = 0.7, # makes x-axis labels small enough to show all
 col = sequential, # colors
 xlab = "Seed Source",
 ylab = "Height, Feet",
 xlim = c(0,20), # these two lines allow space for the legend
 width = 1) # these two lines allow space for the legend
legend("bottomright", 
 legend = c("20-25", "15-20", "10-15", "5-10", "3-5", "0-3"), #in order from top to bottom
 fill = sequential[6:1], # 6:1 reorders so legend order matches graph
 title = "Years")

Stacked bar chart
If you decide you'd rather have clustered bars instead of stacked bars, you can just add the option beside = TRUE to the barplot.

The full code is available in a gist.

Citations and Further Reading