Optimising dplyr

By Tom Jemmett

suppressPackageStartupMessages({ library(tidyverse) library(data.table)
})
set.seed(2214)

Recently in a project I was working on I encountered an issue where one code chunk in my RMarkdown report started to take a significant amount of time to run, and it was using almost 100% of my RAM. The data I was working with was an extract of various tables from HES: each row was relating to some activity that was undertaken.

The particular bit of code that was causing me problems was trying to create a simple chart that showed what percentage of people had particular types of activity. As each person may have had more than one of any of the activity types I had to take distinct sets of rows first, then calculate the percentage of individuals having that activity type.

However, there was a slight complication in our data: as this was part of a wider report we would group some of the activity (the Critical Care Bed Day’s) together at one part in the report, but also be able to break these records down into Elective/Emergency.

The table below shows what some of this activity data looked like (we just show the type’s of activity and the subgroups below).

activity_data <- tribble( ~type, ~sub_group, ~alt_sub_group, ~n, "Urgent Service Event", "Emergency Admission", NA, 2.00, "Urgent Service Event", "A&E Attendance", NA, 3.00, "Urgent Service Event", "111 Call", NA, 2.00, "Planned Contact", "Outpatient Attendance", NA, 4.00, "Planned Contact", "Mental Health Contact", NA, 0.12, "Planned Contact", "IAPT Contact", NA, 0.10, "Planned Admission", "Day Case Admission", NA, 0.25, "Planned Admission", "Regular Attendance Admission", NA, 0.07, "Planned Admission", "Elective Admission", NA, 0.10, "Bed", "Critical Care Bed Day", "Elective Admission", 0.08, "Bed", "Critical Care Bed Day", "Emergency Admission", 0.10
) # don't show the n column here, this is just used to generate rows later
activity_data %>% select(-n)
type sub_group alt_sub_group
Urgent Service Event Emergency Admission NA
Urgent Service Event A&E Attendance NA
Urgent Service Event 111 Call NA
Planned Contact Outpatient Attendance NA
Planned Contact Mental Health Contact NA
Planned Contact IAPT Contact NA
Planned Admission Day Case Admission NA
Planned Admission Regular Attendance Admission NA
Planned Admission Elective Admission NA
Bed Critical Care Bed Day Elective Admission
Bed Critical Care Bed Day Emergency Admission

To demonstrate the issue with the initial summary code I produced we can generate some synthetic data. The real dataset included far more columns (e.g. the date of the activity, the organisation where the activity happened at), but for this demonstration we just generate a patient id.

activity_synthetic <- function(people) { activity_data %>% mutate(pid = map(n, function(.x) { rpois(people, .x) %>% imap(~rep(.y, .x)) %>% flatten_dbl() })) %>% unnest(pid) %>% select(-n)
} activity_100k <- activity_synthetic(100000)
activity_10k <- filter(activity_100k, pid <= 10000)

Here is a sample of this data:

sample_n(activity_10k, 10)
type sub_group alt_sub_group pid
Planned Contact Outpatient Attendance NA 6253
Urgent Service Event A&E Attendance NA 1773
Planned Contact Outpatient Attendance NA 3300
Urgent Service Event 111 Call NA 9349
Bed Critical Care Bed Day Emergency Admission 5617
Planned Contact Outpatient Attendance NA 6971
Urgent Service Event Emergency Admission NA 1552
Urgent Service Event 111 Call NA 9607
Urgent Service Event A&E Attendance NA 8370
Urgent Service Event Emergency Admission NA 4944

Now, for rendering in this particular part of the report we wanted to slightly modify the labels of activity used. We wanted to show the alternative subgroup labels along with the subgroup labels for the Critical Care Bed Day records.

Ideally we would have done this at the data loading stage. However, because these fields were being used at other parts in the report we couldn’t easily change the data at loading.

This is the code that I originally came up with, which I’m wrapping in a function for now so we can benchmark our different approaches later. First we update the type and subgroup columns, then we perform the summarisation steps (get the distinct rows for each individual, then count how many individuals there were for that activity).

dplyr_naive <- function(activity) { activity %>% mutate(across(type, ~ifelse(sub_group == "Critical Care Bed Day", sub_group, .x)), across(sub_group, ~ifelse(type == "Critical Care Bed Day", paste0("Critical Care (", word(alt_sub_group, 1), ")"), .x))) %>% group_by(type, sub_group) %>% distinct(pid) %>% count() %>% ungroup() %>% arrange(type, sub_group)
}

We can test how long this takes to run with our 100,000 patient dataset.

system.time( dplyr_naive(activity_100k) )[["elapsed"]]
## [1] 17.11

This may not seem like a huge amount of time, but I was re-rendering this report upwards of 10 times an hour to ensure the pipeline was still working. Furthermore, we were rendering 12 versions of this report at a time (we had a report for each of the 11 STP’s in the Midlands, and an overall Midlands report), and this particular bit of code was chewing up all of my available memory, meaning we couldn’t run in parallel.

What is the problem with the approach above? When we update the type and sub_group columns in the mutate step R has to update all of the rows in the dataframe (in this case, 1,181,633 rows). But, R will not overwrite the original table in memory. Instead it will copy the entire data frame. This is what slows everything down.

So the question is, can we reduce the amount of rows we have to update? Yes! We can simply perform the summary steps before updating the labels! This is the approach I settled on: it’s largely the same as above, I just have to include the alt_sub_group column in the group by step, and remove this column from the results at the end.

dplyr_improved <- function(activity) { activity %>% group_by(type, sub_group, alt_sub_group) %>% distinct(pid) %>% count() %>% ungroup() %>% mutate(across(type, ~ifelse(sub_group == "Critical Care Bed Day", sub_group, .x)), across(sub_group, ~ifelse(type == "Critical Care Bed Day", paste0("Critical Care (", word(alt_sub_group, 1), ")"), .x))) %>% select(-alt_sub_group) %>% arrange(type, sub_group)
}

How does this perform?

system.time( dplyr_improved(activity_100k) )[["elapsed"]]
## [1] 0.12

Significantly better!

data.table

Many will tell you when start working with larger datasets in R you have to resort to using the {data.table} package. I’m not the most proficient {data.table} user in the world, so if you can think of a better way of solving this problem please let me know!

First, here is basically the first approach translated to {data.table}

data.table_naive <- function(activity) { adt <- as.data.table(activity) adt[, type := ifelse(sub_group == "Critical Care Bed Day", "Critical Care Bed Day", type)] adt[, sub_group := ifelse(type == "Critical Care Bed Day", paste0("Critical Care (", word(alt_sub_group, 1), ")"), sub_group)] unique(adt)[order(type, sub_group), .(n = .N), c("type", "sub_group")]
}

and here is the second approach

data.table_improved <- function(activity) { adt <- as.data.table(activity) %>% unique() %>% .[, .(n = .N), c("type", "sub_group", "alt_sub_group")] adt[sub_group == "Critical Care Bed Day", type := "Critical Care Bed Day"] adt[type == "Critical Care Bed Day", subgroup := paste0("Critical Care (", word(alt_sub_group, 1), ")")] adt[, alt_sub_group := NULL] adt[order(type, sub_group), ]
}

benchmarking

Now, we can benchmark the different functions to see how they perform.

bench::mark( dplyr_naive(activity_10k), data.table_naive(activity_10k), dplyr_improved(activity_10k), data.table_improved(activity_10k), iterations = 10, check = FALSE
) %>% select(expression, min, median, mem_alloc, n_gc)
## # A tibble: 4 x 4
## expression min median mem_alloc
## <bch:expr> <bch:tm> <bch:tm> <bch:byt>
## 1 dplyr_naive(activity_10k) 1.13s 1.25s 40.97MB
## 2 data.table_naive(activity_10k) 1.49s 1.76s 44.18MB
## 3 dplyr_improved(activity_10k) 23.82ms 27.99ms 7.26MB
## 4 data.table_improved(activity_10k) 22.88ms 23.77ms 7.8MB

The table above shows how the improved approach is significantly better than the naive approach. We can see that the median time to run the function is significantly better (note that the naive functions are in seconds, but the improved functions are in milliseconds), but also the amount of memory that is being allocated is much, much lower. The n_gc value tells us how many times the “garbage collector” ran. We want this figure to be low.

We can also see how the function works with the 100,000 row dataset.

bench::mark( dplyr_naive(activity_100k), data.table_naive(activity_100k), dplyr_improved(activity_100k), data.table_improved(activity_100k), iterations = 1, check = FALSE
) %>% select(expression, min, median, mem_alloc, n_gc)
## # A tibble: 4 x 4
## expression min median mem_alloc
## <bch:expr> <bch:tm> <bch:tm> <bch:byt>
## 1 dplyr_naive(activity_100k) 17.5s 17.5s 417.2MB
## 2 data.table_naive(activity_100k) 20.9s 20.9s 431.6MB
## 3 dplyr_improved(activity_100k) 299.5ms 299.5ms 68.5MB
## 4 data.table_improved(activity_100k) 180.7ms 180.7ms 69.6MB

As we can see from both of these results both approaches in {dplyr} and {data.table} are broadly the same, with a slight performance edge to using {data.table} over {dplyr}. The big difference is to more sensibly arrange the order of operations so that you summarise first, then perform the costly data manipulation steps on a smaller data set.

The post Optimising dplyr appeared first on NHS-R Community.

Favorite
Categories R Tags ExcerptFavorite

Leave a Comment