Thursday, December 20, 2012

Mapping GPS Tracks in Google Earth

Even though I find the R solution easier, I'm sure not everyone agrees with me, so here's the Google Earth way to import a bunch of GPS tracks and map them:

Export to TCX

From inside Garmin Training Center or other GPS device software, export a huge TCX file with all your tracks.

Mapping in Google Earth

Opening your TCX file

Select File -> Open. Change the "Files of type" drop-down to "Gps", then find your file and open it. Make sure the "Create KML LineStrings" box is checked and the "Create KML Tracks" box is unchecked. This will save us some work later.

Import settings

Next, zoom in, zoom out, and recenter as desired. If you want North to be up, you can click the "N" in the compass in the upper right. 

In the layers (bottom left), I personally like to check Roads and uncheck everything else.

In the places (just above layers), right click "GPS device" and select "Properties". Click over to the "Style, Color" tab, and then click "Share Style".

Share style

I like to use the following settings:

Lines:
  • Color: red
  • Width: 3.0
  • Opacity: 100%
Label:
  • Opacity: 0%
Icon:
  • Opacity: 0%
Example style and color settings

Make sure you've selected all the dates you want on the date slider. Then you can select "Print" to export to a PDF, or you can take a screenshot.

Export image
Date slider and final map
And that's all there is to it. Large maps may take a long time to load and edit.


--
This post is one part of my series on Mapping GPS Tracks.

Thursday, December 13, 2012

Mapping GPS Tracks in R

This is an explanation of how I used R to combine all my GPS cycling tracks from my Garmin Forerunner 305.

Converting to CSV

You can convert pretty much any GPS data to .csv by using GPSBabel. For importing directly from my Garmin, I used the command:

gpsbabel -t -i garmin -f usb: -o unicsv -F out.csv

[Note: you'll probably need to work as root to access your device directly]

For importing from a .tcx file, you can use:

gpsbabel -t -i gtrnctr -f test2.tcx -o unicsv -F old.csv

Mapping in R

After converting to .csv, we'll have a file with several columns, such as latitude, longitude, date, and time. We can now easily import this into R.

gps <- read.csv("out.csv", 
 header = TRUE)

Next we want to load up ggmap and get our base map. To control how zoomed in the map is, we can set zoom and size. We can also choose the maptype, with options of terrain, satellite, roadmap, or hybrid (satellite + roadmap).

library(ggmap)
mapImageData <- get_googlemap(center = c(lon = median(gps$Longitude), lat = median(gps$Latitude)),
 zoom = 11,
# size = c(500, 500),
 maptype = c("terrain"))

I chose to set the center of the map to the median of my latitudes and the median of my longitudes. I've done some biking when traveling, so median made more sense for me than mean.
Finally we want to map our GPS data. There are several pch options to try.

ggmap(mapImageData,
 extent = "device") + # takes out axes, etc.
 geom_point(aes(x = Longitude,
  y = Latitude),
 data = gps,
 colour = "red",
 size = 1,
 pch = 20)

All my metro Atlanta bike rides
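The pch argument controls the point symbol. If you want to experiment with it, here's a quick base R sketch (separate from the map code) that previews the standard symbols 0 through 25:

plot(0:25, rep(1, 26), # one point per symbol
 pch = 0:25,
 axes = FALSE, xlab = "", ylab = "",
 ylim = c(0.5, 1.5))
text(0:25, rep(1.2, 26), labels = 0:25, cex = 0.7) # label each symbol with its pch number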
Previously, I've used Google Earth to create these maps, but I actually found it easier and far less time- and resource-intensive to do it in R. The only tricky part was converting the data to .csv, and there are other ways to do that if GPSBabel isn't working for you. You might also be interested in trying Google Earth for mapping your tracks instead of R.

Here's the gist with the code.


--
This post is one part of my series on Mapping GPS Tracks.


Thursday, December 6, 2012

WorkFlowy (or How I Organize Everything)

Being organized is crucial to doing a PhD.

One of the main ways I get and stay organized is by using WorkFlowy. WorkFlowy is a simple (perhaps startlingly simple) webapp that can function as to-do list, planning software, and outliner.

I've tried tons of to-do apps for desktop and mobile, but I've found I always revert to post-its and a todo.txt. WorkFlowy is the first service I've found that has changed that. It's pretty similar to a text file, but easier to navigate.



How I use WorkFlowy

To-do list

For my to-do list, I have one list for each week and then one list inside that for each day. It's easy to move stuff from day to day. When I finish items, I check them off.


When I'm finished with a week, I move it to a completed list at the bottom of my WorkFlowy. Since I mark off items as I finish them instead of deleting them, this gives me a quick way to look through and see what I've accomplished in the past week. This is helpful for self-assessment and for preparing for meetings with committee members.


You can also get WorkFlowy to send you a daily email with summaries of your changes.

WorkFlowy settings

Organizing papers

I tend toward being an excessive outliner. I outline my papers or sections of papers down to the paragraph or even sentence level before I write them. Some people prefer the edit-and-edit-again route, but I've found that outlining works better for me, personally.

In the preliminary stages of a paper, I use WorkFlowy for my outlines.

I have one list for each paper I have planned. Inside each list, I include things like a summary of the paper, research questions, hypotheses, possible data sources, variables, literature citations, possible cases to analyze, and a to-do list of things I need to remember to do.

Other features

Sharing

WorkFlowy lists can also be shared and used for collaboration. Free accounts can share through links that are accessible to anyone using the link. This link feature can either allow other people to edit the list or only view it. Paid accounts can also share privately to email addresses, with login required to view or edit.

Backup

Paid accounts can have automatic backup to Dropbox. Any user can export manually as text at any time.


Full disclosure: I don't work for WorkFlowy. I'm not receiving anything monetary for this post; nor am I receiving anything at all other than some extra WorkFlowy items if you sign up using my referral link. And I've referred enough people at this point that I don't even really need those. I just like to use the link to see how many people decide to try the service based on my opinions. (Also, you get more items if you use the link.) I wrote this post for the same reason I write all my blog posts: to document what I've been up to and hopefully help someone along the way.

Thursday, November 29, 2012

Sorting Within Lattice Graphics in R

Default

By default, lattice orders the observations by the values on the categorical axis, starting at the bottom left; for a character variable like these car names, that means alphabetical order.

For example,

library(lattice)
colors = c("#1B9E77", "#D95F02", "#7570B3")
dotplot(rownames(mtcars) ~ mpg,
 data = mtcars,
 col = colors[1],
 pch = 1)

produces:
Default lattice dotplot
(Note: The rownames(mtcars) bit is just because of how this data set is stored. For your data, you might just type the variable name (model, for example) instead.)

Graphing one variable, sorting by another

If we want to show the same data, but we want to sort by another variable (or the same variable, for that matter), we can just add reorder(yvar, sortvar). For example, to sort by the number of cylinders, we could:

dotplot(reorder(rownames(mtcars), cyl) ~ mpg,
 data = mtcars,
 col = colors[1],
 pch = 1)
Sorted by number of cylinders

Graphing two variables

To better show how this works, let's graph cyl alongside mpg, so we can see how it is sorting:

dotplot(reorder(rownames(mtcars), cyl) ~ mpg + cyl,
 data = mtcars,
 col = colors,
 pch = c(1, 0))
Graph of mpg and cyl, sorted by cyl

Reverse order

We can also sort in reverse order, by adding a "-" before the variable name:

dotplot(reorder(rownames(mtcars), -cyl) ~ mpg + cyl,
 data = mtcars,
 col = colors,
 pch = c(1, 0))
Graph of mpg and cyl, sorted by cyl, reversed

Adding a legend

We can also add a legend:

dotplot(reorder(rownames(mtcars), -cyl) ~ mpg + cyl,
 data = mtcars,
 xlab = "",
 col = colors,
 pch = c(1, 0),
 key = list(points = list(col = colors[1:2], pch = c(1, 0)),
  text = list(c("Miles per gallon", "Number of cylinders")),
  space = "bottom"))
With legend

Other lattice types

The same technique will work with other lattice graphs, such as barchart, bwplot, and stripplot.
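For example, here's a minimal barchart version using the same reorder() call, data, and colors:

barchart(reorder(rownames(mtcars), cyl) ~ mpg,
 data = mtcars,
 col = colors[1])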

Full code available as a gist.

Thursday, November 15, 2012

Subfigures in LaTeX

How can you create subfigures or subfloats in LaTeX, so that more than one image or table is part of a single figure while each still has its own text caption? It's often easy to combine multiple figures within a statistical package or image editor, but it's generally best not to if you want to include the subcaptions as text, for improved searchability.

Fortunately the subcaption package in LaTeX lets us do this easily. (The subfigure and subfig packages have been deprecated, so it's best to use subcaption instead.)

Using the word clouds from my R word cloud tutorial, let's look at an example:


In order to make the figure containing the subfigures, we'll need to decide a few things first: how wide each subfigure should be, how much space to leave between them, what subcaption each should get, and what the caption for the figure as a whole should be. Then it's just a matter of letting LaTeX know our preferences.

We also need to let LaTeX know to use the caption and subcaption packages. The full code for the example image above is included below, and also in a gist.


\documentclass{article}

\usepackage{graphicx}
\usepackage{caption}
\usepackage{subcaption}

\begin{document}

\begin{figure}
 \centering
 \begin{subfigure}{0.4\textwidth} % width of left subfigure
  \includegraphics[width=\textwidth]{rncalt.png}
  \caption{RNC} % subcaption
 \end{subfigure}
 \vspace{1em} % here you can insert horizontal or vertical space
 \begin{subfigure}{0.4\textwidth} % width of right subfigure
  \includegraphics[width=\textwidth]{dncalt.png}
  \caption{DNC} % subcaption
 \end{subfigure}
 \caption{Wordcloud of national conventions} % caption for whole figure
\end{figure}

\end{document}

For more information, this wikibooks article is useful.


--
This post is one part of my series on LaTeX.

Thursday, November 8, 2012

White Space in LaTeX


Extra spaces and line breaks in your source file are ignored. But there are several ways to force LaTeX to introduce white space to your documents.

The simplest commands for inserting a specific amount of white space into your document are \hspace and \vspace.

To produce vertical space, use \vspace followed by the length of the space in curly braces. This length can be given in any units recognized by LaTeX, e.g. \vspace{2 in}. The space between the number and the unit is optional, so \vspace{2in} also works.

Similarly, you can use \hspace to insert horizontal space, e.g., \hspace{2 in}.

If \hspace and \vspace are not working as you would like (often at the beginning or end of a line or page), you can instead use \hspace* and \vspace*, which will force space to appear.

If you want to put in as much blank space as possible (while still maintaining page margins, etc.), you can use \hfill and \vfill.

Multiples of \vfill or \hfill on a particular page or in an environment will fill an even amount of the space. (E.g., if there are 6 inches to be filled and 3 \vfill's, each will take 2 inches.)
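Putting a few of these together, here's a minimal sketch (the text is just filler):

\documentclass{article}
\begin{document}
\noindent First word\hspace{2in}second word, two inches later.

\vspace{1in} % one inch of vertical space before the next paragraph
This paragraph starts an inch lower than it otherwise would.

\noindent Left edge\hfill right edge.
\end{document}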

Horizontal

As well as \hspace{} and \hfill, there are some horizontal-specific commands for adding white space.

\quad creates a space whose size is relative to the current font size and font face. \quad is equal to \hspace{1em}. There are other commands that make spaces of sizes relative to \quad:

Horizontal space commands in LaTeX
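A few of these in a minimal sketch you can compile:

\documentclass{article}
\begin{document}
A\quad B (one quad, 1em)

A\qquad B (one qquad, 2em)

A\enspace B (one enspace, 0.5em)

A\hspace{1em} B (equivalent to a quad)
\end{document}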

Vertical

Probably the most common white space command is \\, which forces a line break. (A new paragraph, by contrast, is started by leaving a blank line in the source.)

That's not the only way to tell LaTeX to break a line. Here are a number of ways to do that in different situations:

command: action
\\: force a line break
\linebreak[number]: break the line, with an optional number (0 to 4) to request rather than require the break
\newline: break the line (only in paragraph mode)
Line breaks in LaTeX

\newline or \linebreak can be useful because they work inside a tabular cell (when a p column type is used), where \\ would end the table row instead of breaking the line.
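Here's a minimal sketch of that tabular case:

\documentclass{article}
\begin{document}
\begin{tabular}{|p{4cm}|l|}
\hline
A first line\newline and a second line in the same cell & a short cell \\ % \\ ends the row
\hline
\end{tabular}
\end{document}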

As well as using \vspace{} and \vfill, there are some specific commands for vertical white space:

  • \smallskip
  • \medskip
  • \bigskip

The sizes of \smallskip, \medskip, and \bigskip are determined by the documentclass.

Example of skip sizes

There are also several kinds of ways to break a page, which sometimes creates white space:

command: action
\pagebreak: starts a new page, stretching the white space on the page before the break so it fills the full page
\newpage: starts a new page, leaving the rest of the page before the break blank
\clearpage: like \newpage, but also forces any pending floats (figures and tables) to be printed first
\cleardoublepage: like \clearpage, but makes the next page of content odd-numbered
Page breaks in LaTeX

Here's an example of the difference between \pagebreak and \newpage.
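A minimal document to try both:

\documentclass{article}
\begin{document}
Some text on page one.

\pagebreak % fills out the rest of page one with white space before the break

Text that starts page two.

\newpage % simply leaves the rest of page two blank

Text that starts page three.
\end{document}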



--
This post is one part of my series on LaTeX.

Thursday, November 1, 2012

Units in LaTeX

Discrete units

These are your basic units, like inches, centimeters, and points. Conversion of units from here and here.

These tables show the relative sizes of each unit:

Relative sizes of units in LaTeX, inches

Relative sizes of units in LaTeX, cm
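One easy way to see the equivalences for yourself is to draw rules of fixed length; in this minimal sketch, all four rules should come out the same size:

\documentclass{article}
\begin{document}
1 inch: \rule{1in}{2pt}

2.54 centimeters: \rule{2.54cm}{2pt}

72.27 points: \rule{72.27pt}{2pt}

72 big points: \rule{72bp}{2pt}
\end{document}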

Units defined relative to font sizes

There are also sizes that are relative to the current font face and font size. The size ex is usually the same as the height of a lowercase "x", and the size em is usually (though less reliably) equal to the width of an "M".

Since these are relative to your font, don't be surprised if your results look different from mine. Just like their absolute sizes, the size of ex relative to em is not consistent across fonts.

Examples of font-relative units

Units defined relative to document

There are also units whose definitions are relative to your document. These are determined by your documentclass, and can also be explicitly changed.

A list of these and how to change them is available here. Some of the sizes are illustrated here.


command: size
\paperwidth: width of the page
\paperheight: height of the page
\textwidth: width of the text
\textheight: height of the text
\linewidth: width of a line, usually equal to \textwidth, but varies with the environment
\columnwidth: width of a column, usually the same as \linewidth
\columnsep: distance between columns
\tabcolsep: separation between columns in a tabular environment
\parindent: paragraph indentation
\parskip: extra vertical space between paragraphs
\baselineskip: vertical distance between lines in a paragraph
\baselinestretch: multiplier for \baselineskip
\unitlength: units of length in a picture environment
\topmargin: size of the top margin
\evensidemargin: margin of even pages
\oddsidemargin: margin of odd pages
Document size commands in LaTeX
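For example, here's a minimal sketch that uses \textwidth for sizing and changes \parindent and \parskip:

\documentclass{article}
\setlength{\parindent}{0pt} % no paragraph indentation
\setlength{\parskip}{1em}   % extra space between paragraphs instead
\begin{document}
A rule the full width of the text: \\
\rule{\textwidth}{1pt}

A rule half that width: \\
\rule{0.5\textwidth}{1pt}
\end{document}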

Next, we'll learn how to use these units to add white space.


--
This post is one part of my series on LaTeX.

Thursday, October 25, 2012

Palettes in R

In its simplest form, a palette in R is simply a vector of colors. This vector can include hex triplets or R color names.

The default palette can be seen through palette():

> palette("default") # you'll only need this line if you've previously changed the palette from the default
> palette()
[1] "black"   "red"     "green3"  "blue"    "cyan"    "magenta" "yellow"
[8] "gray"

Defining your own palettes

If you want to make your own palette, you can just create your own vector of colors. It's fine for your vector to include a mixture of hex triplets and R color names. You can use the palette function above, but generally it's best to just store each palette as a standard vector. For one thing, you can use more than one palette that way. Here's how you can define your own palette:

colors <- c("#A7A7A7",
 "dodgerblue",
 "firebrick",
 "forestgreen",
 "gold")

Now let's try using our palette. For now let's just color each bar of a histogram. This is a silly example, but I think it's the easiest way to show how to get R to utilize your palette. In the following example, there are six bars, but only five colors. You can see that R will cycle through your palette to fill all the shapes.

hist(discoveries,
 col = colors)



A more sensible use of color is to use a different color for each of a number of summary statistics:

colors <- c("#A7A7A7",
 "dodgerblue",
 "firebrick",
 "forestgreen",
 "gold")
hist(discoveries,
 col = colors[1])
abline(v = mean(discoveries),
 col = colors[2],
 lwd = 5)
abline(v = median(discoveries),
 col = colors[3],
 lwd = 5)
abline(v = min(discoveries),
 col = colors[4],
 lwd = 5)
abline(v = max(discoveries),
 col = colors[5],
 lwd = 5)
legend(x = "topright", # location of legend within plot area
 col = colors[2:5],
c("Mean", "Median", "Minimum", "Maximum"), lwd = 5)


Predefined palettes: default R palettes

The package grDevices (you probably already have this loaded) contains a number of palettes.

?rainbow
rainbowcols <- rainbow(6)
hist(discoveries,
 col = rainbowcols)


For rainbow, you can adjust the saturation and value. For example:

rainbowcols <- rainbow(6,
 s = 0.5)
hist(discoveries,
 col = rainbowcols)



heatcols <- heat.colors(6)
hist(discoveries,
 col = heatcols)


As well as rainbow and heat.colors, there are also terrain.colors, topo.colors, and cm.colors.
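These work the same way; for example:

terraincols <- terrain.colors(6)
hist(discoveries,
 col = terraincols)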

Predefined palettes: RColorBrewer


library(RColorBrewer)
display.brewer.all()

The above lines will show us all the RColorBrewer palettes (output shown below). The palettes in the top section are sequential, those in the middle section are qualitative, and those in the lower section are diverging. Here is some information about how to choose a good palette.
RColorBrewer palettes

RColorBrewer works a little differently from how we've defined palettes so far: we'll use brewer.pal to create a palette.

library(RColorBrewer)
darkcols <- brewer.pal(8, "Dark2")
hist(discoveries,
 col = darkcols)


Even though we have to tell brewer.pal how many colors we want, we don't necessarily need to use all of those colors later. We can still pick a single color from the vector as we did before. When we pass the whole palette to col, the number of colors in the palette matters more, but even then we can choose a subset of it:

darkcols <- brewer.pal(8, "Dark2")
hist(discoveries,
 col = darkcols[1:2])


Here's the code from this post.

Now that we're familiar with making our own palettes and using the built-in palettes in grDevices and RColorBrewer, I'm planning a future post about a more practical (but also more complicated) example of using palettes: making maps.

--
This post is one part of my series on palettes.

Thursday, October 11, 2012

Random Name Generator in R

Just for the heck of it, let's recreate my Reality TV Show Name Generator in R. This isn't really the sort of thing you'd normally do in R, but we can try out a bunch of different functions this way: random integers/sampling, concatenation, sorting, and determining the length of an object.

First, let's create a dictionary for R to randomly pull names from. In this case, we want a first word and a last word, that we'll later combine to create a name similar to so many current TV shows. (Note: In some cases, I've made the first "word" or last "word" multiple words for some variety.) Feel free to make your own lists.

first <- c("Fear", "Frontier", "Nanny", "Job", "Yard", "Airport", "Half Pint", "Commando", "Fast Food", "Basketball", "Bachelorette", "Diva", "Baggage", "College", "Octane", "Clean", "Sister", "Army", "Drama", "Backyard", "Pirate", "Shark", "Project", "Model", "Survival", "Justice", "Mom", "New York", "Jersey", "Ax", "Warrior", "Ancient", "Pawn", "Throttle", "The Great American", "Knight", "American", "Outback", "Celebrity", "Air", "Restaurant", "Bachelor", "Family", "Royal", "Surf", "Ulitmate", "Date", "Operation", "Fish Tank", "Logging", "Hollywood", "Amateur", "Craft", "Mystery", "Intervention", "Dog", "Human", "Rock", "Ice Road", "Shipping", "Modern", "Crocodile", "Farm", "Amish", "Single", "Tool", "Boot Camp", "Pioneer", "Kid", "Action", "Bounty", "Paradise", "Mega", "Love", "Style", "Teen", "Pop", "Wedding", "An American", "Treasure", "Myth", "Empire", "Motorway", "Room", "Casino", "Comedy", "Undercover", "Millionaire", "Chopper", "Space", "Cajun", "Hot Rod", "The", "Colonial", "Dance", "Flying", "Sorority", "Mountain", "Auction", "Extreme", "Whale", "Storage", "Cake", "Turf", "UFO", "The Real", "Wild", "Duck", "Queer", "Voice", "Fame", "Music", "Rock Star", "BBQ", "Spouse", "Wife", "Road", "Star", "Renovation", "Comic", "Chef", "Band", "House", "Sweet")

second <- c("Hunters", "Hoarders", "Contest", "Party", "Stars", "Truckers", "Camp", "Dance Crew", "Casting Call", "Inventor", "Search", "Pitmasters", "Blitz", "Marvels", "Wedding", "Crew", "Men", "Project", "Intervention", "Celebrities", "Treasure", "Master", "Days", "Wishes", "Sweets", "Haul", "Hour", "Mania", "Warrior", "Wrangler", "Restoration", "Factor", "Hot Rod", "of Love", "Inventors", "Kitchen", "Casino", "Queens", "Academy", "Superhero", "Battles", "Behavior", "Rules", "Justice", "Date", "Discoveries", "Club", "Brother", "Showdown", "Disasters", "Attack", "Contender", "People", "Raiders", "Story", "Patrol", "House", "Gypsies", "Challenge", "School", "Aliens", "Towers", "Brawlers", "Garage", "Whisperer", "Supermodel", "Boss", "Secrets", "Apprentice", "Icon", "House Party", "Pickers", "Crashers", "Nation", "Files", "Office", "Wars", "Rescue", "VIP", "Fighter", "Job", "Experiment", "Girls", "Quest", "Eats", "Moms", "Idol", "Consignment", "Life", "Dynasty", "Diners", "Chef", "Makeover", "Ninja", "Show", "Ladies", "Dancing", "Greenlight", "Mates", "Wives", "Jail", "Model", "Ship", "Family", "Videos", "Repo", "Rivals", "Room", "Dad", "Star", "Exes", "Island", "Next Door", "Missions", "Kings", "Loser", "Shore", "Assistant", "Comedians", "Rooms", "Boys")

Whew, that was long. Instead, let's look at a random sample of those.

We can make random lists using sample. Sample wants you to provide it with a vector, the sample size you want, and whether or not it should reuse items from the vector. In this case, we don't want any duplicates, so we add replace = FALSE.

first <- sample(first,
 10,
 replace = FALSE)

second <- sample(second,
 12,
 replace = FALSE)

This gives us something more like

> first
 [1] "Teen"       "Fame"       "Basketball" "Sister"     "Bachelor" 
 [6] "Half Pint"  "Myth"       "Paradise"   "Frontier"   "Fast Food"
> second
 [1] "Comedians"    "Boss"         "Experiment"   "Wives"        "Wedding"    
 [6] "Intervention" "Days"         "Raiders"      "Attack"       "Sweets"     
[11] "Jail"         "Whisperer"

It's random, so your list will probably be different.

If you want to sort them alphabetically, use

first <- sort(first)

If we had wanted to randomly resort the whole "first" list, we could have entered the command

first <- sample(first, length(first), replace = FALSE)

Using length makes it easy to resort the whole list. You don't have to look up how many items there are in first.

Let's use length again to choose a random number that we'll use to grab a word from the first list, and then once more to grab a word from the second list:

rand1 <- sample(1:length(first), 1)
rand2 <- sample(1:length(second), 1)

And finally, we want to pull an item from each list and concatenate them. We could concatenate to a file, but here we'll just let it print to the console.

cat(first[rand1], second[rand2], "\n")

"\n" produces a line break so we can do source(reality.R) again and have another name pop up on the next line.

If you've made a dictionary of your own and want to check for duplicates, you can do the following:

anyDuplicated(first) # if not 0, there are that many duplicates
duplicated(first) # this will show which items are duplicates

Feel free to improve on my dictionary or generator. I'd love to see it.

Code available as a gist.

Thursday, October 4, 2012

Adding Measures of Central Tendency to Histograms in R

Building on the basic histogram with a density plot, we can add measures of central tendency (in this case, mean and median) and a legend.

Like last time, we'll use the beaver data from the datasets package.

hist(beaver1$temp, # histogram
 col = "peachpuff", # column color
 border = "black", 
 prob = TRUE, # show densities instead of frequencies
 xlim = c(36,38.5),
 ylim = c(0,3),
 xlab = "Temperature",
 main = "Beaver #1")
lines(density(beaver1$temp), # density plot
 lwd = 2, # thickness of line
 col = "chocolate3")

Next we'll add a line for the mean:

abline(v = mean(beaver1$temp),
 col = "royalblue",
 lwd = 2)

And a line for the median:

abline(v = median(beaver1$temp),
 col = "red",
 lwd = 2)

And then we can also add a legend, so it will be easy to tell which line is which.

legend(x = "topright", # location of legend within plot area
 c("Density plot", "Mean", "Median"),
 col = c("chocolate3", "royalblue", "red"),
 lwd = c(2, 2, 2))

All of this together gives us the following graphic:


In this example, the mean and median are very close, as we can see by using mean() and median().

> mean(beaver1$temp)
[1] 36.86219
> median(beaver1$temp)
[1] 36.87

As in the previous post, we can graph beaver1 and beaver2 together by adding a layout call and changing the limits of x and y. The full code for this is available in a gist.

Here's the output from that code:


Thursday, September 27, 2012

Histogram + Density Plot Combo in R

Plotting a histogram using hist from the graphics package is pretty straightforward, but what if you want to view the density plot on top of the histogram? This combination of graphics can help us compare the distributions of groups.

Let's use some of the data included with R in the package datasets. It will help to have two things to compare, so we'll use the beaver data sets, beaver1 and beaver2: the body temperatures of two beavers, taken at 10 minute intervals.

First we want to plot the histogram of one beaver:

hist(beaver1$temp, # histogram
 col="peachpuff", # column color
 border="black",
 prob = TRUE, # show densities instead of frequencies
 xlab = "temp",
 main = "Beaver #1")

Next, we want to add in the density line, using lines:


hist(beaver1$temp, # histogram
 col="peachpuff", # column color
 border="black",
 prob = TRUE, # show densities instead of frequencies
 xlab = "temp",
 main = "Beaver #1")
lines(density(beaver1$temp), # density plot
 lwd = 2, # thickness of line
 col = "chocolate3")


Now let's show the plots for both beavers on the same image. We'll make a histogram and density plot for Beaver #2, wrap the graphs in a layout and png, and change the x-axis to be the same, using xlim.


Here's the final code, also available on gist:

png("beaverhist.png")
layout(matrix(c(1:2), 2, 1,
 byrow = TRUE))
hist(beaver1$temp, # histogram
 col = "peachpuff", # column color
 border = "black",
 prob = TRUE, # show densities instead of frequencies
 xlim = c(36,38.5),
 xlab = "temp",
 main = "Beaver #1")
lines(density(beaver1$temp), # density plot
 lwd = 2, # thickness of line
 col = "chocolate3")
hist(beaver2$temp, # histogram
 col = "peachpuff", # column color
 border = "black",
 prob = TRUE, # show densities instead of frequencies
 xlim = c(36,38.5),
 xlab = "temp",
 main = "Beaver #2")
lines(density(beaver2$temp), # density plot
 lwd = 2, # thickness of line
 col = "chocolate3")
dev.off()

Thursday, September 20, 2012

Descriptive Statistics of Groups in R

The sleep data set—provided by the datasets package—shows the effects of two different drugs on ten patients. Extra is the increase in hours of sleep; group is the drug given, 1 or 2; and ID is the patient ID, 1 to 10.

I'll be using this data set to show how to perform descriptive statistics of groups within a data set, when the data set is long (as opposed to wide).

First, we'll need to load up the psych package. The datasets package containing our data is probably already loaded.

library(psych)

The describe.by function in the psych package is what does the magic for us here. It will group our data by a variable we give it, and output descriptive statistics for each of the groups.

> describe.by(sleep, sleep$group)
group: 1
       var  n mean   sd median trimmed  mad  min  max range skew kurtosis   se
extra    1 10 0.75 1.79   0.35    0.68 1.56 -1.6  3.7   5.3 0.42    -1.30 0.57
group*   2 10 1.00 0.00   1.00    1.00 0.00  1.0  1.0   0.0  NaN      NaN 0.00
ID*      3 10 5.50 3.03   5.50    5.50 3.71  1.0 10.0   9.0 0.00    -1.56 0.96
------------------------------------------------------------ 
group: 2
       var  n mean   sd median trimmed  mad  min  max range skew kurtosis   se
extra    1 10 2.33 2.00   1.75    2.24 2.45 -0.1  5.5   5.6 0.28    -1.66 0.63
group*   2 10 2.00 0.00   2.00    2.00 0.00  2.0  2.0   0.0  NaN      NaN 0.00
ID*      3 10 5.50 3.03   5.50    5.50 3.71  1.0 10.0   9.0 0.00    -1.56 0.96

Of course, there are other ways to get descriptive statistics for groups. Since you'll probably be doing further analysis on the groups, and may already be splitting the data into subsets by group, it may be easiest to just run describe on each subset; but that's a topic for another post. describe.by is still an easy way to take a quick look at many groups at once, and a quick look is exactly what descriptive statistics are for.
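If you do want the subset route, here's a minimal sketch using the same sleep data:

group1 <- subset(sleep, group == "1") # just the rows for drug 1
describe(group1$extra) # descriptive statistics for extra in that group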

Thursday, September 13, 2012

Word Clouds in R

Thanks to the wordcloud package, it's super easy to make a word cloud or tag cloud in R.

In this case, the words have been counted already. If you are starting with plain text, you can use the text mining package tm to obtain the counts. Other bloggers have provided good examples of this. I'll just be covering the simple case where we already have the frequencies.

Let's look at some commonly used words during the National Conventions this year. The New York Times produced a cool infographic that we'll use as our data source. The data in csv format (and the R code too) are available in a gist.

First we need to load up the packages and our data:

library(wordcloud)
library(RColorBrewer)

conventions <- read.table("conventions.csv",
 header = TRUE,
 sep = ",")

And then we can get to using the wordcloud library to produce our clouds in R:

png("dnc.png")
wordcloud(conventions$wordper25k, # words
 conventions$democrats, # frequencies
 scale = c(4,1), # size of largest and smallest words
 colors = brewer.pal(9,"Blues"), # number of colors, palette
 rot.per = 0) # proportion of words to rotate 90 degrees
dev.off()

png("rnc.png")
wordcloud(conventions$wordper25k,
 conventions$republicans,
 scale = c(4,1),
 colors = brewer.pal(9,"Reds"),
 rot.per = 0)
dev.off()

DNC word cloud

RNC word cloud

The default word cloud has some words rotated 90 degrees, but I prefer to use rot.per = 0 to make them all horizontal for readability.

Since the size already denotes the frequency of each word, you can easily switch to a single color if you prefer, by changing colors to "red3", for example:

RNC single color

png("rncalt.png")
wordcloud(conventions$wordper25k,
 conventions$republicans,
 scale = c(4,1),
 colors = "red3",
 rot.per = 0)
dev.off()

DNC single color
And there you have it, a simple way to generate a word cloud from frequency data using R.

Thursday, September 6, 2012

Text to Columns in Stata

If you've ever used Excel's text to columns feature, you know how valuable it can be. If you haven't ever used text to columns, it allows you to take one column of data and separate it into multiple columns using delimiters that you provide. One time this is helpful is when converting data from other formats.

If you're learning Stata, you may wonder how you can gain this useful functionality. There are a few different ways, but for now we'll be discussing split.

For the following example, I have imported some patent data where the four most relevant primary patent classes for each observation are listed in a single column. These are delimited by a "/" as can be seen below.

Data before transformation

I would like each of these classes to be included in its own column. To do this, I give Stata the following command:

split class, parse(/) generate(class)

Stata command and feedback

In this command, the first class is the name of the variable I want to split, / is the delimiter, and generate(class) lets Stata know that each new variable should be named class followed by an integer. In this example, the largest number of /'s in any value of class was two, so three new variables (class1 to class3) are created.

Data after transformation

I can then drop class if I want to remove the original class variable.

I could have also used the option destring if I wanted to treat the patent classes as numbers.
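That version of the command would look something like:

split class, parse(/) generate(class) destring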

Split documentation excerpt

More use cases are shown in the split documentation. One example they provide uses multiple delimiters, showing how to separate the names of court cases even when some are delimited by "v." and some by "vs.". A sketch of that is below.
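Assuming the case names are in a variable called case (and using party as an example stem for the new variables), that would look roughly like:

split case, parse(" v. " " vs. ") generate(party)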

For more complex situations, you can also use regular expressions.