Thursday, September 13, 2012

Word Clouds in R

Thanks to the wordcloud package, it's super easy to make a word cloud or tag cloud in R.

In this example, the words have already been counted. If you are starting with plain text, you can use the text mining package tm to obtain the counts. Other bloggers have provided good examples of this. I'll just be covering the simple case where we already have the frequencies.
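If you only need rough counts and don't want to set up tm, a minimal base-R sketch along these lines may be enough. It is not the tm workflow mentioned above; the sample sentence and the names text, tokens, counts and word_freq are made up for illustration, and the tokenizer is deliberately naive (lowercase, split on anything that isn't a letter or apostrophe).

```r
# Hypothetical sample text; replace with your own document
text <- "Jobs and families. Jobs, taxes and families matter: jobs first."

# Lowercase, then split on any run of characters that are not letters or apostrophes
tokens <- unlist(strsplit(tolower(text), "[^a-z']+"))
tokens <- tokens[tokens != ""]   # drop any empty strings left by the split

# Tabulate the tokens and sort by descending frequency
counts <- sort(table(tokens), decreasing = TRUE)

# One column of words, one column of counts: the shape wordcloud() expects
word_freq <- data.frame(word = names(counts),
                        freq = as.vector(counts),
                        stringsAsFactors = FALSE)
print(word_freq)
```

The word and freq columns can then be passed to wordcloud() the same way the convention data is used in this post.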

Let's look at some words commonly used at the National Conventions this year. The New York Times produced a cool infographic that we'll use as our data source. The data in CSV format (and the R code too) are available in a gist.

First we need to load up the packages and our data:

library(wordcloud)
library(RColorBrewer)

conventions <- read.table("conventions.csv",
                          header = TRUE,
                          sep = ",",
                          stringsAsFactors = FALSE) # keep the words as character strings
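If you don't have the gist handy, here is a small sketch of the layout the rest of the code assumes: one column of words named wordper25k and one numeric column per party. The column names come from the code in this post; the rows are made-up placeholders, not the New York Times figures, and sample_conventions is just an illustrative name.

```r
# Made-up rows in the layout the post assumes; these are NOT the NYT figures
csv_text <- "wordper25k,democrats,republicans
alpha,10,4
beta,3,12
gamma,7,7"

# read.csv() accepts inline text, which is handy for quick experiments
sample_conventions <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Confirm that the columns used later in the post are present
required <- c("wordper25k", "democrats", "republicans")
missing_cols <- setdiff(required, names(sample_conventions))
if (length(missing_cols) > 0) {
  stop("Missing columns: ", paste(missing_cols, collapse = ", "))
}

str(sample_conventions)   # one character column, two numeric columns
```

Running the same column check on your own data frame before plotting can save a confusing error later.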

And then we can use the wordcloud() function to produce our clouds:

png("dnc.png")
wordcloud(conventions$wordper25k,          # words
          conventions$democrats,           # frequencies
          scale = c(4, 1),                 # size of largest and smallest words
          colors = brewer.pal(9, "Blues"), # number of colors, palette name
          rot.per = 0)                     # proportion of words to rotate 90 degrees
dev.off()

png("rnc.png")
wordcloud(conventions$wordper25k,
          conventions$republicans,
          scale = c(4, 1),
          colors = brewer.pal(9, "Reds"),
          rot.per = 0)
dev.off()

DNC word cloud

RNC word cloud

The default word cloud has some words rotated 90 degrees, but I prefer to use rot.per = 0 to make them all horizontal for readability.
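To get a feel for what scale = c(4,1) and a multi-color palette do, here is a rough base-R sketch of how frequencies might be turned into sizes and colors. It assumes wordcloud rescales linearly between the two scale values and picks later palette entries for more frequent words; that is my reading of the package's behavior, not something verified in this post. The words, frequencies, and three-color palette are made up.

```r
# Made-up frequencies, for illustration only
freqs <- c(alpha = 20, beta = 10, gamma = 5)
scale <- c(4, 1)   # largest and smallest size, as in the wordcloud() calls

# Assumed mapping: normalise by the largest frequency, then rescale linearly
normed <- freqs / max(freqs)
sizes <- (scale[1] - scale[2]) * normed + scale[2]

# Assumed color lookup: a higher relative frequency picks a later palette entry
palette <- c("lightblue", "steelblue", "navy")   # stand-in for brewer.pal()
color_index <- ceiling(length(palette) * normed)

print(data.frame(word = names(freqs),
                 size = sizes,
                 color = palette[color_index]))
```

Under these assumptions the most frequent word gets the largest size and the darkest shade, so color and size carry the same information.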

Since the size already denotes the frequency of each word, you can easily switch to a single color if you prefer, by setting the colors argument to "red3", for example:

RNC single color

png("rncalt.png")
wordcloud(conventions$wordper25k,
          conventions$republicans,
          scale = c(4, 1),
          colors = "red3",
          rot.per = 0)
dev.off()

DNC single color

And there you have it: a simple way to generate a word cloud from frequency data using R.
