Showing posts with label regular expressions. Show all posts
Showing posts with label regular expressions. Show all posts

Thursday, October 17, 2013

Line Breaks Between Words in Axis Labels in ggplot in R

Sometimes when plotting factor variables in R, the graphics can look pretty messy thanks to long factor levels. If the level attributes have multiple words, there is an easy fix to this that often makes the axis labels look much cleaner.

Without Line Breaks

Here's the messy looking example:
No line breaks in axis labels
And here's the code for the messy looking example:


library(OIdata)
data(birds)
library(ggplot2)

ggplot(birds,
  aes(x = effect,
    y = speed)) +
geom_boxplot()

With Line Breaks

We can use regular expressions to add line breaks to the factor levels by substituting any spaces with line breaks:

library(OIdata)
data(birds)
library(ggplot2)

levels(birds$effect) <- gsub(" ", "\n", levels(birds$effect))
ggplot(birds,
  aes(x = effect,
    y = speed)) +
geom_boxplot()

Line breaks in axis labels
Just one line made the plot look much better, and it will carry over to other plots you make as well. For example, you could create a table with the same variable.

Horizontal Boxes

Here we can see the difference in a box plot with horizontal boxes. It's up to you to decide which style looks better:

No line breaks in axis labels


Line breaks in axis labels

library(OIdata)
data(birds)
library(ggplot2)

levels(birds$effect) <- gsub(" ", "\n", levels(birds$effect))
ggplot(birds,
  aes(x = effect,
    y = speed)) +
geom_boxplot() + 
coord_flip()

Just a note: if you're not using ggplot, the multi-line axis labels might overflow into the graph.

The code is available in a gist.

Citations and Further Reading

Thursday, September 19, 2013

Truncate by Delimiter in R

Sometimes, you only need to analyze part of the data stored as a vector. In this example, there is a list of patents. Each patent has been assigned to one or more patent classes. Let's say that we want to analyze the dataset based on only the first patent class listed for each patent.

patents <- data.frame(
  patent = 1:30,
  class = c("405", "33/209", "549/514", "110", "540", "43", 
  "315/327", "540", "536/514", "523/522", "315", 
  "138/248/285", "24", "365", "73/116/137", "73/200", 
  "252/508", "96/261", "327/318", "426/424/512", 
  "75/423", "430", "416", "536/423/530", "381/181", "4", 
  "340/187", "423/75", "360/392/G9B", "524/106/423"))

We can use regular expressions to truncate each element of the vector just before the first "/".

grep, grepl, sub, gsub, regexpr, gregexpr, and regexec are all functions in the base package that allow you to use regular expressions within each element of a character vector. sub and gsub allow you to replace within each element of the vector. sub replaces the first match within each element, while gsub replaces all matches within each element. In this case, we want to remove everything from the first "/" on, and we want to replace it with nothing. Here's how we can use sub to do that:

patents$primaryClass <- sub("/.*", "", patents$class)

> table(patents$primaryClass)

110 138  24 252 315 327  33 340 360 365 381   4 405 416 423 426  43 430 523 524 
  1   1   1   1   2   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
536 540 549  73  75  96 
  2   2   1   2   1   1 


--
This post is one part of my series on Text to Columns.

Citations and Further Reading