My favorite R data.table feature is the “`by`” grouping notation when combined with the `:=` notation.

Let’s take a look at this powerful notation.

First, let’s build an example `data.frame`.

```r
d <- wrapr::build_frame(
   "group", "value" |
   "a"    , 1L      |
   "a"    , 2L      |
   "b"    , 3L      |
   "b"    , 4L      )
knitr::kable(d)
```

| group | value |
|---|---|
| a | 1 |
| a | 2 |
| b | 3 |
| b | 4 |
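For readers without `wrapr` installed, the same frame can be built directly in base R (a sketch, equivalent to the `build_frame()` call above):

```r
# same example data, built with base R's data.frame()
d <- data.frame(group = c("a", "a", "b", "b"),
                value = c(1L, 2L, 3L, 4L),
                stringsAsFactors = FALSE)
print(d)
```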

The data is a column of values plus a grouping column telling us which rows are related.

With the `data.table` “`:=`, `by`” notation we can add the per-group totals into each row of the data as follows (the extra `[]` at the end is just the command to also print the results in addition to adding the column in place).

```r
library("data.table")
dt <- data.table::as.data.table(d)
dt[, group_sum := sum(value), by = "group"][]
#    group value group_sum
# 1:     a     1         3
# 2:     a     2         3
# 3:     b     3         7
# 4:     b     4         7
```

The “`by`” signals we are doing a per-group calculation, and the “`:=`” signals to land the results in the original `data.table`. This sort of window function is incredibly useful in computing things such as what fraction of a group’s mass is in each row. For example:

```r
# build a fresh copy, as the last command altered dt in place
dt <- data.table::as.data.table(d)
dt[, fraction := value/sum(value), by = "group"][]
#    group value  fraction
# 1:     a     1 0.3333333
# 2:     a     2 0.6666667
# 3:     b     3 0.4285714
# 4:     b     4 0.5714286
```

In base R (or in a more purely relational data system) the obvious way to solve this requires two steps: computing the per-group summaries and then joining them back into the original table rows. This can be done as follows.

```r
sums <- tapply(d$value, d$group, sum)
d$fraction <- d$value/sums[d$group]
print(d)
#   group value  fraction
# 1     a     1 0.3333333
# 2     a     2 0.6666667
# 3     b     3 0.4285714
# 4     b     4 0.5714286
```

We called the transform a “window function”, as that is the name that SQL uses for the concept. The SQL code to perform this calculation would look like the following.

```sql
SELECT
  group,
  value,
  value/sum(value) OVER ( PARTITION BY group ) AS fraction
FROM
  d
```

And the popular package `dplyr` uses the following notation for the same problem.

```r
d %>%
  group_by(group) %>%
  mutate(fraction = value/sum(value)) %>%
  ungroup()
```

And, as always, let’s end with some timings. For a 1,000,000 row table with 10 additional irrelevant columns, and group ids picked uniformly from 100,000 symbols, the various solutions take the following times to complete the task.

```
##             solution  milliseconds
##       datatable_soln           384
##   base_R_lookup_soln          1476
##           dplyr_soln          3988
```

All packages are the current CRAN releases as of 2019-06-29. Timings are sensitive to the number of rows, columns, and groups. Note the `data.table` time includes the time to convert to the `data.table` class. Details on the timings can be found here.
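A minimal sketch of how such a comparison could be set up is below. This is an assumption, not the author’s exact harness: the data-generation details and function names (`datatable_soln`, `base_R_lookup_soln`, `dplyr_soln`) are reconstructed from the prose above, and base `system.time()` stands in for whatever timing tool was actually used.

```r
# sketch of a timing setup (an assumption; not the author's exact code)
n_rows   <- 1000000   # 1,000,000 rows
n_groups <- 100000    # group ids picked uniformly from 100,000 symbols
set.seed(2019)
d <- data.frame(
  group = sample(sprintf("g%06d", seq_len(n_groups)), n_rows, replace = TRUE),
  value = seq_len(n_rows),
  stringsAsFactors = FALSE)
for (i in 1:10) {     # ten additional irrelevant columns
  d[[paste0("irrelevant_", i)]] <- runif(n_rows)
}

datatable_soln <- function(d) {
  # conversion to the data.table class is included in the timing, as noted above
  dt <- data.table::as.data.table(d)
  dt[, fraction := value/sum(value), by = "group"]
  dt
}

base_R_lookup_soln <- function(d) {
  sums <- tapply(d$value, d$group, sum)
  d$fraction <- d$value/sums[d$group]
  d
}

dplyr_soln <- function(d) {
  dplyr::ungroup(dplyr::mutate(dplyr::group_by(d, group),
                               fraction = value/sum(value)))
}

if (requireNamespace("data.table", quietly = TRUE)) {
  print(system.time(datatable_soln(d)))
}
print(system.time(base_R_lookup_soln(d)))
if (requireNamespace("dplyr", quietly = TRUE)) {
  print(system.time(dplyr_soln(d)))
}
```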

### jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

I would have thought to use ave() for a base R solution:
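Something like the following sketch (the commenter’s exact code is an assumption; `ave()` with `FUN = sum` computes the per-group total and broadcasts it back to each row, so the merge step is implicit):

```r
# ave() returns the per-group summary aligned to the original rows,
# so no explicit lookup/merge is needed (sketch; exact code is an assumption)
d <- data.frame(group = c("a", "a", "b", "b"),
                value = c(1L, 2L, 3L, 4L),
                stringsAsFactors = FALSE)
d$fraction <- d$value / ave(d$value, d$group, FUN = sum)
print(d)
```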

Seems to be slightly slower than the base R lookup solution.

Thanks for the note!

`ave()` is a good “all in one” solution for merging per-group summaries in, and handles multiple grouping columns (or composite keys). But I wanted to show how one gets to a solution from basic ideas (compute summary, merge summary back in). I did not avoid it for speed reasons, and `print(ave)` is fairly interesting (though a bit hard to work out all the details, such as the assignment back into `slice()`).

Actually, I just re-ran the timings with `ave()`. `ave()` is competitive with the merge solution.