Thursday, September 27, 2012

Histogram + Density Plot Combo in R

Plotting a histogram using hist from the graphics package is pretty straightforward, but what if you want to view the density plot on top of the histogram? This combination of graphics can help us compare the distributions of groups.

Let's use some of the data included with R in the package datasets. It will help to have two things to compare, so we'll use the beaver data sets, beaver1 and beaver2: the body temperatures of two beavers, taken at 10 minute intervals.

First we want to plot the histogram of one beaver:

hist(beaver1$temp, # histogram
 col="peachpuff", # column color
 border="black",
 prob = TRUE, # show densities instead of frequencies
 xlab = "temp",
 main = "Beaver #1")

Next, we want to add in the density line, using lines:


hist(beaver1$temp, # histogram
 col="peachpuff", # column color
 border="black",
 prob = TRUE, # show densities instead of frequencies
 xlab = "temp",
 main = "Beaver #1")
lines(density(beaver1$temp), # density plot
 lwd = 2, # thickness of line
 col = "chocolate3")


Now let's show the plots for both beavers on the same image. We'll make a histogram and density plot for Beaver #2, wrap the graphs in a layout and png, and change the x-axis to be the same, using xlim.


Here's the final code, also available on gist:

png("beaverhist.png")
layout(matrix(c(1:2), 2, 1,
 byrow = TRUE))
hist(beaver1$temp, # histogram
 col = "peachpuff", # column color
 border = "black",
 prob = TRUE, # show densities instead of frequencies
 xlim = c(36,38.5),
 xlab = "temp",
 main = "Beaver #1")
lines(density(beaver1$temp), # density plot
 lwd = 2, # thickness of line
 col = "chocolate3")
hist(beaver2$temp, # histogram
 col = "peachpuff", # column color
 border = "black",
 prob = TRUE, # show densities instead of frequencies
 xlim = c(36,38.5),
 xlab = "temp",
 main = "Beaver #2")
lines(density(beaver2$temp), # density plot
 lwd = 2, # thickness of line
 col = "chocolate3")
dev.off()

Thursday, September 20, 2012

Descriptive Statistics of Groups in R

The sleep data set—provided by the datasets package—shows the effects of two different drugs on ten patients. Extra is the increase in hours of sleep; group is the drug given, 1 or 2; and ID is the patient ID, 1 to 10.

I'll be using this data set to show how to perform descriptive statistics of groups within a data set, when the data set is long (as opposed to wide).

First, we'll need to load up the psych package. The datasets package containing our data is probably already loaded.

library(psych)

The describe.by function in the psych package is what does the magic for us here. It will group our data by a variable we give it, and output descriptive statistics for each of the groups.

> describe.by(sleep, sleep$group)
group: 1
       var  n mean   sd median trimmed  mad  min  max range skew kurtosis   se
extra    1 10 0.75 1.79   0.35    0.68 1.56 -1.6  3.7   5.3 0.42    -1.30 0.57
group*   2 10 1.00 0.00   1.00    1.00 0.00  1.0  1.0   0.0  NaN      NaN 0.00
ID*      3 10 5.50 3.03   5.50    5.50 3.71  1.0 10.0   9.0 0.00    -1.56 0.96
------------------------------------------------------------ 
group: 2
       var  n mean   sd median trimmed  mad  min  max range skew kurtosis   se
extra    1 10 2.33 2.00   1.75    2.24 2.45 -0.1  5.5   5.6 0.28    -1.66 0.63
group*   2 10 2.00 0.00   2.00    2.00 0.00  2.0  2.0   0.0  NaN      NaN 0.00
ID*      3 10 5.50 3.03   5.50    5.50 3.71  1.0 10.0   9.0 0.00    -1.56 0.96

Of course, there are other ways to find the descriptive statistics of groups, and since you'll probably be doing further analysis on the groups, and you may be splitting the whole data into subsets by groups, it may be easiest to just use describe on each subset. But that's a topic for another post. And this is an easy way to quickly look at many groups, and a quick look is particularly essential for descriptive statistics.

Thursday, September 13, 2012

Word Clouds in R

Thanks to the wordcloud package, it's super easy to make a word cloud or tag cloud in R.

In this case, the words have been counted already. If you are starting with plain text, you can use the text mining package tm to obtain the counts. Other bloggers have provided good examples of this. I'll just be covering the simple case where we already have the frequencies.

Let's look at some commonly used words during the National Conventions this year. The New York Times produced a cool infographic that we'll use as our data source. The data in csv format (and the R code too) are available in a gist.

First we need to load up the packages and our data:

library(wordcloud)
library(RColorBrewer)

conventions <- read.table("conventions.csv",
 header = TRUE,
 sep = ",")

And then we can get to using the wordcloud library to produce our clouds in R:

png("dnc.png")
wordcloud(conventions$wordper25k, # words
 conventions$democrats, # frequencies
 scale = c(4,1), # size of largest and smallest words
 colors = brewer.pal(9,"Blues"), # number of colors, palette
 rot.per = 0) # proportion of words to rotate 90 degrees
dev.off()

png("rnc.png")
wordcloud(conventions$wordper25k,
 conventions$republicans,
 scale = c(4,1),
 colors = brewer.pal(9,"Reds"),
 rot.per = 0)
dev.off()

DNC word cloud

RNC word cloud

The default word cloud has some words rotated 90 degrees, but I prefer to use rot.per = 0 to make them all horizontal for readability.

You can easily change to just one color if you prefer that since the size already denotes the frequency of the word, by changing color to "red3", for example:

RNC single color

png("rncalt.png")
wordcloud(conventions$wordper25k,
 conventions$republicans,
 scale = c(4,1),
 colors = "red3",
 rot.per = 0)
dev.off()

DNC single color
And there you have it, a simple way to generate a word count from frequency data using R.

Thursday, September 6, 2012

Text to Columns in Stata

If you've ever used Excel's text to columns feature, you know how valuable it can be. If you haven't ever used text to columns, it allows you to take one column of data and separate it into multiple columns using delimiters that you provide. One time this is helpful is when converting data from other formats.

If you're learning Stata, you may wonder how you can gain this useful functionality. There are a few different ways, but for now we'll be discussing split.

For the following example, I have imported some patent data where the four most relevant primary patent classes for each observation are listed in a single column. These are delimited by a "/" as can be seen below.

Data before transformation

I would like each of these classes to be included in its own column. To do this, I give Stata the following command:

split class, parse(/) generate(class)

Stata command and feedback

In this command, the first class is the name of the variable I want to transform, / is the delimiter, and generate(class) lets Stata to know that I would like the names of the new variables to each be class followed by an integer. In the example, the most /'s there were in class was two, so three class[n] variables are created.

Data after transformation

I can then drop class if I want to remove the original class variable.

I could have also used the option destring if I wanted to treat the patent classes as numbers.

Split documentation excerpt

More use cases are shown in the split documentation. One example they provide allows you to use multiple delimiters. In this instance showing how to separate the names of court cases even if some are delimited by "v." and some by "vs."

For more complex situations, you can also use regular expressions.