User:MPopov (WMF)/Notes/Caching in R

These notes describe some best practices for adding caching to your workflow when working on a project.

Directory referencing with `here`

The here package builds file paths relative to the project root, so the same code works no matter which sub-directory of the project it runs from. Suppose you had the following directory structure:

/home/bearloga/
         |- some_project/
                |- README.md
                |- data/
                     |- demo.csv
                |- figures/
                |- notebooks/
                     |- data.ipynb
                     |- analysis.ipynb
                |- queries/
                |- scripts/

Let's say you're working in analysis.ipynb. Just put the following in the very first cell:

here::i_am("notebooks/analysis.ipynb")

library(here)

and you would see:

   here() starts at /home/bearloga/some_project

Then you can use here() to write commands like:

demo <- read.csv(here("data", "demo.csv"))

Simple function

The cached_execution() function depends on the readr, fs, and here packages, so make sure those are installed.

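cached_execution <- function(.execute, .cache_name, ...) {
    # Create <project root>/cache on first use
    if (!fs::dir_exists(here::here("cache"))) fs::dir_create(here::here("cache"))
    cache_filename <- here::here("cache", fs::path_ext_set(.cache_name, "rds"))
    if (fs::file_exists(cache_filename)) {
        # Cache hit: skip execution and load the stored result
        result <- readr::read_rds(cache_filename)
    } else {
        # Cache miss: run .execute with the forwarded arguments, then store the result
        result <- .execute(...)
        readr::write_rds(result, cache_filename, compress = "gz")
    }
    return(result)
}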

The parameters are as follows:

.execute: A function/closure to execute – e.g. query_hive or function() { "Hello world!" }
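.cache_name: The name of the cache, without a file extension (avoid dots in the name, because fs::path_ext_set() replaces anything after the last dot with rds)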
...: Parameters to forward to the function/closure provided in .execute

Note: .execute and .cache_name are prefixed with . as a best practice, so that they cannot collide with named arguments that are passed through ... to the function being executed.
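
The collision that the note describes can be seen in a small sketch. The wrappers run_plain() and run_dotted() and the greet() function are hypothetical, defined only for this illustration:

```r
# Hypothetical wrappers: identical except for the dot prefix on their own parameters
run_plain  <- function(execute, name, ...) execute(...)
run_dotted <- function(.execute, .name, ...) .execute(...)

# Hypothetical function to be executed; it happens to have a `name` argument
greet <- function(name) paste("Hello,", name)

# `name = "world"` is matched to run_plain()'s own `name` parameter, so the
# unnamed "cache_key" falls through ... and is what greet() receives:
res_plain <- run_plain(greet, "cache_key", name = "world")

# With dotted parameters, `name = "world"` matches nothing in the wrapper
# and is forwarded to greet() as intended:
res_dotted <- run_dotted(greet, "cache_key", name = "world")

print(res_plain)
print(res_dotted)
```

With the undotted wrapper the call does not fail; it returns a result built from the wrong argument, which is harder to notice than an error.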

Future work: this could be made smarter by storing a record of .execute alongside the cached result. Before returning a cached result, the function would compare the current .execute with the stored one and invalidate the cache if they differ.
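
One way that idea might look, as a minimal sketch rather than a finished implementation. It assumes the deparsed source of .execute is a good enough fingerprint, uses base R's saveRDS() and readRDS() instead of readr, and takes a hypothetical .cache_dir argument in place of here("cache") so that it runs outside a project. The name cached_execution_checked() is made up for this example:

```r
# Hypothetical variant of cached_execution() that re-runs .execute
# whenever its source code differs from what was cached. Base R only.
cached_execution_checked <- function(.execute, .cache_name, ...,
                                     .cache_dir = file.path(tempdir(), "cache")) {
    dir.create(.cache_dir, showWarnings = FALSE, recursive = TRUE)
    cache_filename <- file.path(.cache_dir, paste0(.cache_name, ".rds"))

    # Fingerprint (assumption): the function's source as rendered by deparse()
    fingerprint <- paste(deparse(.execute), collapse = "\n")

    if (file.exists(cache_filename)) {
        cached <- readRDS(cache_filename)
        # Reuse the stored result only if it came from the same function source
        if (identical(cached$fingerprint, fingerprint)) {
            return(cached$result)
        }
    }

    # Cache miss or stale cache: execute and store result with its fingerprint
    result <- .execute(...)
    saveRDS(list(fingerprint = fingerprint, result = result), cache_filename)
    result
}

# Usage: the second call is served from the cache
square <- function(x) x^2
cached_execution_checked(square, "square_of_4", x = 4)
cached_execution_checked(square, "square_of_4", x = 4)
```

This fingerprint ignores the arguments passed through ... and any variables the closure captures, so a change to either would still return the stale result.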

Usage examples

Caching a single query

The following would save the result (for fast retrieval in the future) in cache/wmf_product_tables.rds, creating the cache/ sub-directory within the project if it does not exist yet.

library(wmfdata) # remotes::install_github("wikimedia/wmfdata-r")

wmf_product_tables <- cached_execution(
    query_hive,
    "wmf_product_tables",
    query = "USE wmf_product; SHOW TABLES;"
)

Alternatively:

library(wmfdata) # remotes::install_github("wikimedia/wmfdata-r")

wmf_product_tables <- cached_execution(
    function() { query_hive("USE wmf_product; SHOW TABLES;") },
    "wmf_product_tables"
)

With R 4.1 or newer, the same call can use the \() shorthand for anonymous functions:

library(wmfdata) # remotes::install_github("wikimedia/wmfdata-r")

wmf_product_tables <- cached_execution(
    \() query_hive("USE wmf_product; SHOW TABLES;"),
    "wmf_product_tables"
)

Caching multiple queries

Suppose you want to retrieve the last 90 days of web request data one day at a time, caching each day's requests separately. We create a range of dates and use purrr::map_dfr() to run a query for each date in that range. Each date's result is stored in its own cache file, and the per-date results are row-bound into a single data frame.

library(wmfdata) # query_hive(), extract_ymd()
library(glue)    # string literals
library(purrr)   # map_dfr()
library(zeallot) # %<-% multi-assignment

last_90_days <- map_dfr(
  
    # 90 complete days ending yesterday (today's partial data should not be cached):
    .x = seq(Sys.Date() - 90, Sys.Date() - 1, "day"),

    .f = function(date, ...) {
        cached_execution(
            .cache_name = format(date, "webrequests_%Y-%m-%d"),
            date = date,
            ...
        )
    },

    # Parameters passed to .f():
    .execute = function(date, query) {

        c(year, month, day) %<-% extract_ymd(date)

        # Substitute ${year}, ${month}, ${day}:
        query <- glue(query, .open = "${")

        query_hive(query)

    },
    
    # This is passed to .f() which then forwards it to .execute() via the ...:
    query = "
      USE wmf;
      SELECT *
      FROM webrequest
      WHERE year = ${year} AND month = ${month} AND day = ${day}
        AND webrequest_source = 'text';
    "

)

Note: refer to purrr's documentation for more details about map_dfr(). As of purrr 1.0.0, map_dfr() is superseded in favor of map() followed by list_rbind().
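
The per-date pattern can be tried without access to Hive. The sketch below uses base R only; fake_query() is a hypothetical stand-in for query_hive(), the dates are arbitrary, and the cache lives in a temporary directory rather than the project's cache/ folder:

```r
# Hypothetical stand-in for query_hive(): returns one row per date
# and counts how often it is actually called
n_queries <- 0
fake_query <- function(date) {
    n_queries <<- n_queries + 1
    data.frame(date = date, requests = as.integer(format(date, "%d")))
}

# Start from an empty cache directory so the example is repeatable
cache_dir <- file.path(tempdir(), "cache_by_date")
unlink(cache_dir, recursive = TRUE)
dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)

# One cache file per date, named after the date
fetch_day <- function(date) {
    cache_file <- file.path(cache_dir, format(date, "webrequests_%Y-%m-%d.rds"))
    if (file.exists(cache_file)) return(readRDS(cache_file))
    result <- fake_query(date)
    saveRDS(result, cache_file)
    result
}

fetch_range <- function(dates) {
    # Index explicitly so each element keeps its Date class
    per_day <- lapply(seq_along(dates), function(i) fetch_day(dates[i]))
    do.call(rbind, per_day)
}

dates <- seq(as.Date("2021-01-01"), as.Date("2021-01-03"), by = "day")
first_run  <- fetch_range(dates)  # calls fake_query() once per date
second_run <- fetch_range(dates)  # served entirely from the cache files
```

Because each date has its own file, extending the range later only runs queries for the dates that are not cached yet.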