
dplyr in 100 lines: Part 1 - Intro

If you’ve been around R for a little while, you’ve almost certainly heard of Hadley Wickham. His amazing packages (including ggplot2, reshape2, plyr and many more) have shaped the R ecosystem into what it is today. Without them, it’s unlikely I would have continued learning R.

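# A minimal pipe: `x %>% f(a)` is rewritten to `f(x, a)` and then evaluated.
`%>%` <- function(x, y) {
  cl <- match.call()
  lhs <- cl[[2]]; rhs <- cl[[3]]
  if (is.symbol(rhs)) {
    # rhs is a bare function name, e.g. x %>% head
    y(x)
  } else {
    rhsl <- as.list(rhs)
    # swap in the lhs as the first arg in rhs; rhsl[-1] also copes with f()
    rhsl <- c(rhsl[1], list(lhs), rhsl[-1])
    # evaluate in the caller's frame so the caller's local variables are visible
    eval(as.call(rhsl), parent.frame())
  }
}

# Capture the arguments passed via ... as a list of unevaluated expressions.
dots <- function(...) {
  eval(substitute(alist(...)))
}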
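
To make the trick in those two helpers a little less magical, here is a rough sketch of the same call-splicing idea done by hand in base R. The inputs (`head(3)`, `mtcars` and the two filter-style expressions) are examples I picked for illustration, and `dots` is simply restated so the snippet runs on its own.

```r
# Splice a left-hand side into a captured call, as the pipe does internally.
rhs <- quote(head(3))      # the unevaluated right-hand side of a pipe
lhs <- quote(mtcars)       # the unevaluated left-hand side

parts <- as.list(rhs)      # a call is a list: function name, then arguments
spliced <- as.call(c(parts[1], list(lhs), parts[-1]))
print(spliced)             # shows the rebuilt call before it is evaluated
result <- eval(spliced)    # now actually run it

# Same definition as in the post, repeated so this snippet is self-contained.
dots <- function(...) {
  eval(substitute(alist(...)))
}

# The expressions come back unevaluated, so mpg and cyl need not exist yet.
exprs <- dots(mpg > 20, cyl == 4)

# Evaluate one captured expression with a data frame supplying the variables.
keep <- eval(exprs[[1]], mtcars)
```

Capturing expressions first and evaluating them later against a data frame is what lets a verb accept bare column names as arguments.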


One of Hadley’s packages is called dplyr, and it’s been an absolute game-changer. It was released circa 2014, and I was giddy with excitement the first time I saw a demonstration. It lets you express almost any data manipulation task as a chain of five simple verbs: mutate, filter, arrange, group_by and summarise.
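
For anyone who hasn’t seen the verbs in action, here is a small sketch of what a chain looks like. It assumes the dplyr package is installed, and the dataset (`mtcars`), the threshold and the column name `mean_hp` are just choices I made for the example.

```r
library(dplyr)

# Average horsepower per cylinder count, for cars doing better than 20 mpg
result <- mtcars %>%
  filter(mpg > 20) %>%                 # keep only the rows we care about
  group_by(cyl) %>%                    # one group per cylinder count
  summarise(mean_hp = mean(hp)) %>%    # collapse each group to a single row
  arrange(desc(mean_hp))               # sort the summary, highest first

print(result)
```

Each verb takes a data frame as its first argument and returns a data frame, which is why they chain together so cleanly.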

A lot of effort went into making it as fast as possible while keeping the syntax simple. In most cases, it approaches the speed of the excellent data.table package, but is much easier to read and write. And to boot, it lets us transparently use out-of-memory (i.e. in-database) data as if it were a local data frame.

Fast forward a few months and Hadley’s new R book, Advanced R, was being updated on his website as he wrote it. It wasn’t until a print copy of his book was available that I put serious effort into going through the content. And I’m glad to say it’s as thorough and well written as we’ve come to expect.

I’ve wanted to peer under the hood of dplyr for a while to figure out how it works, but to an intermediate R programmer like myself, the codebase was a bit too complex.

Instead of reading dplyr’s codebase to figure out how dplyr works, I decided to use what I was learning in Advanced R to write my own version, using only base R functions.

Turns out all that’s needed is about 70 lines of code.

In Part 2, we’ll cover