Replicating dplyr’s filter()
The most common way to subset a data frame in R is to use the [ operator.
If you wanted to see only cars with good gas mileage from the mtcars data set, you could do something like this.
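The code for this step appears to be missing from the page. A plausible reconstruction, assuming the author indexed mtcars with a bare column name, is sketched here; the try() wrapper is my addition so that the rest of a script keeps running after the error.

```r
# Assumed reconstruction: subset rows using a bare column name.
# `mpg` is not an object in the calling environment, so R cannot find it.
try(mtcars[mpg > 30, ])
```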
## Error in `[.data.frame`(mtcars, mpg > 30, ): object 'mpg' not found
Oops! We forgot to tell R that mpg is a column in mtcars, and not a variable in our environment.
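A minimal sketch of the corrected call, assuming the fix is simply to prefix the column with mtcars$ (the variable name good is mine):

```r
# Refer to the column through the data frame so R knows where to look.
good <- mtcars[mtcars$mpg > 30, ]
good
```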
## mpg cyl disp hp drat wt qsec vs am gear carb
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Fortunately, using dplyr’s filter() function, we don’t have to remember to specify the mtcars$ part.
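The equivalent dplyr call probably looked something like the sketch here. It assumes the dplyr package is installed, and it uses the dplyr:: prefix so the call cannot be confused with stats::filter().

```r
# filter() looks up `mpg` inside the data frame for us.
efficient <- dplyr::filter(mtcars, mpg > 30)
efficient
```

Whether row names are kept or replaced by row numbers in the printed result depends on the dplyr version.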
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## 2 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## 3 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## 4 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## [1] mpg cyl disp hp drat wt qsec vs am gear carb
## <0 rows> (or 0-length row.names)
Another cool thing about filter(): it can take as many filtering expressions as we want to give it! For example:
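A sketch of such a call, assuming dplyr is installed and that the two conditions are the mpg and hp filters used later in this post:

```r
# Each extra argument is another condition; a row must satisfy all of them.
fast_and_frugal <- dplyr::filter(mtcars, mpg > 30, hp > 100)
fast_and_frugal
```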
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
Making our own filter(): Attempt #1
We’ll use R’s built-in subsetting operator, writing a function that accepts a data frame and a filtering expression (i.e. the mpg > 30 part).
Let’s try our first function.
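The function definition is not shown on the page. A likely shape for it is sketched here, assuming the argument names data and expr that appear later in the post; the try() wrapper is my addition.

```r
# Attempt #1 (assumed reconstruction): pass the expression straight to `[`.
filter2 <- function(data, expr) {
  data[expr, ]
}

# This fails: `expr` is evaluated in the caller's environment,
# where no object called `mpg` exists.
try(filter2(mtcars, mpg > 30))
```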
## Error in `[.data.frame`(data, expr, ): object 'mpg' not found
No luck! Que pasa? We get the error we had previously.
Let’s investigate by changing our function to print the value of the expression argument.
And run it again.
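A sketch of the modified function and the call, under the same assumptions about argument names as before (try() is again my addition):

```r
# Print the argument before using it, to see what the function receives.
filter2 <- function(data, expr) {
  print(expr)   # touching `expr` forces R to evaluate mpg > 30 right here
  data[expr, ]
}

try(filter2(mtcars, mpg > 30))
```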
## Error in print(expr): object 'mpg' not found
No love either. What’s going on?
Looks like R tries (and fails) to evaluate the expression mpg > 30 as soon as we try to do anything with it…
Making our own filter(): Attempt #2
What we need is substitute(), a special primitive function that does not evaluate its argument and instead returns it as an unevaluated expression (here, a call object) that can be evaluated later.
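A minimal sketch of this step, assuming we only want to capture and print the expression for now (the local name e is mine):

```r
# Capture the expression the user typed instead of evaluating it.
filter2 <- function(data, expr) {
  e <- substitute(expr)  # a call object: mpg > 30
  print(e)
  invisible(e)
}

captured <- filter2(mtcars, mpg > 30)
class(captured)
```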
And now it works. A great first step!
## mpg > 30
Because substitute() returns an unevaluated call, we’ll need to use it in conjunction with the eval() function.
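One way to combine the two is sketched here. It assumes we evaluate the captured call with the data frame as the first place to look up names, falling back to the caller's environment; the names e and rows are mine.

```r
filter2 <- function(data, expr) {
  e <- substitute(expr)                  # unevaluated call, e.g. mpg > 30
  rows <- eval(e, data, parent.frame())  # columns of `data` are found first
  data[rows, ]
}

result <- filter2(mtcars, mpg > 30)
result
```

Passing parent.frame() as the enclosure means a filter such as mpg > cutoff would still work when cutoff is a variable defined by the caller.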
And….
## mpg cyl disp hp drat wt qsec vs am gear carb
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Congrats! Our first working version of filter2().
Making our own filter(): Attempt #3 (FINAL)
As we saw earlier, dplyr::filter() is able to take one or more filtering expressions.
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
But our current version of filter2() can only handle a single filtering expression. We get an error when we try to supply more than one filter condition.
## Error in filter2(mtcars, mpg > 30, hp > 100): unused argument (hp > 100)
How can we change filter2() to support more arguments?
It turns out that R has a built-in capability to handle this.
We just need to add an ... (“dots”) argument to our function to allow us to specify an arbitrary number of filtering expressions.
Now what do we do inside the function to capture the list of arguments that are passed in by the user?
It turns out R has a built-in function called alist() to help us deal with .... Inside filter2() we’ll still need to use eval() and substitute() as before.
filter2 <- function(data, ...) {
  # create a list of the filtering expressions to use
  expr <- eval(substitute(alist(...)))
  # apply each filter expression to the data
  # (still to be written)
}
We can then access each filter expression supplied by the user by subsetting the list of arguments, like expr[[1]] or expr[[2]]. What we now want to do is repeatedly filter our data for each item in the expr list.
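To see what the captured list looks like, here is a small sketch using a hypothetical helper, capture_filters(), which is not part of the original post and does nothing except return the list:

```r
# Hypothetical helper: return the unevaluated filter expressions as a list.
capture_filters <- function(data, ...) {
  eval(substitute(alist(...)))
}

exprs <- capture_filters(mtcars, mpg > 30, hp > 100)
length(exprs)  # one element per filter expression
exprs[[1]]     # the first expression, still unevaluated
exprs[[2]]     # the second expression
```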
In most situations in R, if we want to do something to every element in a list, we want to use the lapply() function. But in this case, lapply() doesn’t work because we want to pass the filtered results from each expression to the next.
Here, a for loop is most appropriate.
That leads us to our final function.
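The final definition is not shown on the page, so what follows is one possible version reconstructed from the steps described so far. The names exprs, caller and rows, and the use of drop = FALSE, are my choices.

```r
filter2 <- function(data, ...) {
  # capture the filter expressions without evaluating them
  exprs <- eval(substitute(alist(...)))
  caller <- parent.frame()

  # apply each filter in turn, feeding the result into the next one
  for (e in exprs) {
    rows <- eval(e, data, caller)
    data <- data[rows, , drop = FALSE]
  }
  data
}

final <- filter2(mtcars, mpg > 30, hp > 100)
final
```

Because each expression is evaluated against the already-filtered data, summary functions in later expressions only see the rows that survived the earlier ones.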
Testing our final working filter2() function
## mpg cyl disp hp drat wt qsec vs am gear carb
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
And pleasingly, we don’t need anything extra to cater for more complex filter expressions like the ones below, which keep only the top 20% most fuel-efficient cars and, of those, only the ones whose displacement per cylinder exceeds 30 cubic inches.
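A call matching that description might look like the sketch here. The exact expressions are my assumption (quantile(mpg, 0.8) for the 80% cutoff, disp / cyl for displacement per cylinder), and filter2() is repeated so that the block runs on its own.

```r
# Same reconstructed filter2() as in the previous section.
filter2 <- function(data, ...) {
  exprs <- eval(substitute(alist(...)))
  caller <- parent.frame()
  for (e in exprs) {
    rows <- eval(e, data, caller)
    data <- data[rows, , drop = FALSE]
  }
  data
}

# Function calls and arithmetic on columns work inside the expressions.
complex_result <- filter2(mtcars, mpg > quantile(mpg, 0.8), disp / cyl > 30)
complex_result
```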
## mpg cyl disp hp drat wt qsec vs am gear carb
## Merc 240D 24.4 4 146.7 62 3.69 3.19 20.0 1 0 4 2
## Porsche 914-2 26.0 4 120.3 91 4.43 2.14 16.7 0 1 5 2
In the next section, we’ll build mutate2().