Databases, SQL and dplyr

Dr. Alexander Fisher

Duke University

Numbers every programmer should know

Task Timing (ns) Timing (μs)
L1 cache reference 0.5 0.0005
L2 cache reference 7 0.007
Main memory reference 100 0.1
Random seek SSD 150,000 150
Read 1 MB sequentially from memory 250,000 250
Read 1 MB sequentially from SSD 1,000,000 1,000
Disk seek 10,000,000 10,000
Read 1 MB sequentially from disk 20,000,000 20,000
Send packet CA->Netherlands->CA 150,000,000 150,000

Implications for big data

Let's imagine we have a 10 GB flat data file and we want to select certain rows based on a particular criterion. This requires a sequential read across the entire data set.

File Location Performance Time
in memory \(10~GB \times (250~\mu s / 1~MB)\) 2.5 seconds
on disk (SSD) \(10~GB \times (1~ms / 1~MB)\) 10 seconds
on disk (HD) \(10~GB \times (20~ms / 1~MB)\) 200 seconds


This is just for reading sequential data; if we make any modifications (writing) or the data is fragmented, things are much worse.

Blocks


Cost:

Disk << SSD <<< Memory

Speed:

Disk <<< SSD << Memory


It is usually possible to grow our disk storage to accommodate our data; memory is typically the limiting resource. So what can we do if we can't fit everything into memory?

Create blocks: group related data (i.e. rows) and read in multiple rows at a time. The optimal block size will depend on the task and the properties of the disk.
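As a sketch of this idea, readr's read_csv_chunked() reads a flat file in blocks of rows and applies a callback to each block, so only one block needs to fit in memory at a time. The file name and chunk size below are illustrative.

```r
library(readr)

# Callback applied to each block of rows as it is read;
# only the matching rows are kept and accumulated.
keep_high <- function(chunk, pos) {
  chunk[chunk$value > 100, ]
}

result <- read_csv_chunked(
  "data.csv",                                  # hypothetical large flat file
  callback = DataFrameCallback$new(keep_high),
  chunk_size = 10000                           # rows per block; tune to the task and disk
)
```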

Databases

R & databases - the DBI package

A low-level package for interfacing R with database management systems (DBMS); it provides a common interface for the following functionality:

  • connect/disconnect from DB
  • create and execute statements in the DB
  • extract results/output from statements
  • error/exception handling
  • information (meta-data) from database objects
  • transaction management (optional)
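Each bullet above corresponds to one or two DBI generics. A minimal sketch against a throwaway SQLite database (any DBI backend exposes the same interface):

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")   # connect

dbExecute(con, "CREATE TABLE t (x INTEGER)")      # create / execute statements
dbExecute(con, "INSERT INTO t VALUES (1), (2)")

dbGetQuery(con, "SELECT * FROM t")                # extract results / output

dbListTables(con)                                 # meta-data about DB objects
dbListFields(con, "t")

dbWithTransaction(con, {                          # transaction management
  dbExecute(con, "INSERT INTO t VALUES (3)")
})

dbDisconnect(con)                                 # disconnect
```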

DBI Backends

DBI is a specification, not an implementation, and there are a number of packages that implement the DBI specification for different database systems. Backends for R-DBI lists all available backends, but some notable ones include:

  • RPostgres

  • RMariaDB

  • RSQLite

  • odbc

  • bigrquery

  • duckdb

  • sparklyr

RSQLite

Provides the implementation necessary to use DBI to interface with an SQLite database.

library(RSQLite)

This package also loads the necessary DBI functions (via re-exporting).

Once loaded we can create a connection to our database,

con = dbConnect(RSQLite::SQLite(), ":memory:")
# :memory: creates a temporary, in-memory SQLite database
str(con)
Formal class 'SQLiteConnection' [package "RSQLite"] with 8 slots
  ..@ ptr                :<externalptr> 
  ..@ dbname             : chr ":memory:"
  ..@ loadable.extensions: logi TRUE
  ..@ flags              : int 70
  ..@ vfs                : chr ""
  ..@ ref                :<environment: 0x107abb778> 
  ..@ bigint             : chr "integer64"
  ..@ extended_types     : logi FALSE

Example Table

employees = tibble(
  name   = c("Alice","Bob","Carol","Dave","Eve","Frank"),
  email  = c("alice@company.com", "bob@company.com",
             "carol@company.com", "dave@company.com",
             "eve@company.com",   "frank@company.com"),
  salary = c(52000, 40000, 30000, 33000, 44000, 37000),
  dept   = c("Accounting", "Accounting","Sales",
             "Accounting","Sales","Sales"),
)
dbListTables(con)
character(0)
dbWriteTable(con, name = "employees", value = employees)
dbListTables(con)
[1] "employees"

Removing Tables

dbWriteTable(con, "employs", employees)
dbListTables(con)
[1] "employees" "employs"  

dbRemoveTable(con,"employs")
dbListTables(con)
[1] "employees"

Querying Tables

Database queries are transactional (see ACID) and are broken up into 3 steps: send the query (dbSendQuery()), fetch the results (dbFetch()), and clear the result set (dbClearResult()):

(res = dbSendQuery(con, "SELECT * FROM employees"))
<SQLiteResult>
  SQL  SELECT * FROM employees
  ROWS Fetched: 0 [incomplete]
       Changed: 0
dbFetch(res)
   name             email salary       dept
1 Alice alice@company.com  52000 Accounting
2   Bob   bob@company.com  40000 Accounting
3 Carol carol@company.com  30000      Sales
4  Dave  dave@company.com  33000 Accounting
5   Eve   eve@company.com  44000      Sales
6 Frank frank@company.com  37000      Sales
dbClearResult(res)

For convenience

There is also dbGetQuery(), which combines all three steps,

(res = dbGetQuery(con, "SELECT * FROM employees"))
   name             email salary       dept
1 Alice alice@company.com  52000 Accounting
2   Bob   bob@company.com  40000 Accounting
3 Carol carol@company.com  30000      Sales
4  Dave  dave@company.com  33000 Accounting
5   Eve   eve@company.com  44000      Sales
6 Frank frank@company.com  37000      Sales

Creating tables

dbCreateTable() will create a new table with a schema based on an existing data.frame / tibble, but it does not populate that table with data.

dbCreateTable(con, "iris", iris)
(res = dbGetQuery(con, "select * from iris"))
[1] Sepal.Length Sepal.Width  Petal.Length Petal.Width  Species     
<0 rows> (or 0-length row.names)

Adding to tables

Data can be added to an existing table via dbAppendTable().

dbAppendTable(con, name = "iris", value = iris)
Warning: Factors converted to character
[1] 150
dbGetQuery(con, "select * from iris") |> 
  as_tibble()
# A tibble: 150 × 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <chr>  
 1          5.1         3.5          1.4         0.2 setosa 
 2          4.9         3            1.4         0.2 setosa 
 3          4.7         3.2          1.3         0.2 setosa 
 4          4.6         3.1          1.5         0.2 setosa 
 5          5           3.6          1.4         0.2 setosa 
 6          5.4         3.9          1.7         0.4 setosa 
 7          4.6         3.4          1.4         0.3 setosa 
 8          5           3.4          1.5         0.2 setosa 
 9          4.4         2.9          1.4         0.2 setosa 
10          4.9         3.1          1.5         0.1 setosa 
# ℹ 140 more rows

Recap

We’ve connected to a database. We can view the tables within the database:

dbListTables(con)
[1] "employees" "iris"     

We can even save a reference to a specific table within the database:

emp <- tbl(con, "employees")

filter() using WHERE

We can run SQL directly:

result <- 
  dbGetQuery(con, "SELECT * FROM employees WHERE salary < 40000")

result
   name             email salary       dept
1 Carol carol@company.com  30000      Sales
2  Dave  dave@company.com  33000 Accounting
3 Frank frank@company.com  37000      Sales

group_by() and summarize() via GROUP BY

We can create groups for the purpose of summarizing using GROUP BY.

result <- 
  dbGetQuery(con, "SELECT dept, COUNT(*) AS n FROM employees GROUP BY dept")

result
        dept n
1 Accounting 3
2      Sales 3

The AS keyword in SQL renames a column or table in the result. From a dplyr standpoint, it plays the same role as naming a new column inside mutate() or summarize().
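For comparison, the same query can be written with dplyr verbs against the emp table created earlier; dbplyr generates the GROUP BY and the AS alias for us:

```r
# count() is shorthand for group_by(dept) |> summarize(n = n());
# it is translated to SELECT dept, COUNT(*) AS n ... GROUP BY dept
emp |>
  count(dept)
```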

head() using LIMIT

dbGetQuery(con, "SELECT * FROM employees LIMIT 3")
   name             email salary       dept
1 Alice alice@company.com  52000 Accounting
2   Bob   bob@company.com  40000 Accounting
3 Carol carol@company.com  30000      Sales

Exercise

Using SQL syntax,

  1. compute the total payroll cost for this company.

  2. compute the average salary within each department.

total_salary <- dbGetQuery(con, 
"SELECT SUM(salary) AS total_salary
    FROM employees")
total_salary
  total_salary
1       236000
avg_salary_by_dept <- dbGetQuery(con, 
"SELECT dept,
           AVG(salary) AS avg_salary,
           SUM(salary) AS total_salary
    FROM employees
    GROUP BY dept")
avg_salary_by_dept
        dept avg_salary total_salary
1 Accounting   41666.67       125000
2      Sales   37000.00       111000
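For comparison, the same two summaries can be written with dplyr against the emp table from earlier; dbplyr translates sum() and mean() to SUM() and AVG(). (na.rm = TRUE matches SQL's behavior of ignoring NULLs.)

```r
# total payroll cost
emp |>
  summarize(total_salary = sum(salary, na.rm = TRUE))

# average and total salary within each department
emp |>
  summarize(
    avg_salary   = mean(salary, na.rm = TRUE),
    total_salary = sum(salary, na.rm = TRUE),
    .by = dept
  )
```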

Closing the connection

con
<SQLiteConnection>
  Path: :memory:
  Extensions: TRUE
dbDisconnect(con)

con
<SQLiteConnection>
  DISCONNECTED

dplyr & databases

Creating a database

db = DBI::dbConnect(RSQLite::SQLite(), "flights.sqlite")
( flight_tbl = dplyr::copy_to(
    db, nycflights13::flights, name = "flights", temporary = FALSE) )
# Source:   table<`flights`> [?? x 19]
# Database: sqlite 3.50.4 [flights.sqlite]
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dbl>

What have we created?

All of this data now lives in the database on the filesystem, not in memory:

pryr::object_size(db)
2.46 kB
pryr::object_size(flight_tbl)
6.50 kB
pryr::object_size(nycflights13::flights)
40.65 MB

File size

fs::dir_info(glob = "*.sqlite") |>
  select(path, type, size)
# A tibble: 1 × 3
  path           type         size
  <fs::path>     <fct> <fs::bytes>
1 flights.sqlite file        21.1M

What is flight_tbl?

class(nycflights13::flights)
[1] "tbl_df"     "tbl"        "data.frame"
class(flight_tbl)
[1] "tbl_SQLiteConnection" "tbl_dbi"              "tbl_sql"             
[4] "tbl_lazy"             "tbl"                 
str(flight_tbl)
List of 2
 $ src       :List of 2
  ..$ con  :Formal class 'SQLiteConnection' [package "RSQLite"] with 8 slots
  .. .. ..@ ptr                :<externalptr> 
  .. .. ..@ dbname             : chr "flights.sqlite"
  .. .. ..@ loadable.extensions: logi TRUE
  .. .. ..@ flags              : int 70
  .. .. ..@ vfs                : chr ""
  .. .. ..@ ref                :<environment: 0x1105c72e0> 
  .. .. ..@ bigint             : chr "integer64"
  .. .. ..@ extended_types     : logi FALSE
  ..$ disco: NULL
  ..- attr(*, "class")= chr [1:4] "src_SQLiteConnection" "src_dbi" "src_sql" "src"
 $ lazy_query:List of 5
  ..$ x         : 'dbplyr_table_path' chr "`flights`"
  ..$ vars      : chr [1:19] "year" "month" "day" "dep_time" ...
  ..$ group_vars: chr(0) 
  ..$ order_vars: NULL
  ..$ frame     : NULL
  ..- attr(*, "class")= chr [1:3] "lazy_base_remote_query" "lazy_base_query" "lazy_query"
 - attr(*, "class")= chr [1:5] "tbl_SQLiteConnection" "tbl_dbi" "tbl_sql" "tbl_lazy" ...

Accessing existing tables

dplyr::tbl(db, "flights")
# Source:   table<`flights`> [?? x 19]
# Database: sqlite 3.50.4 [flights.sqlite]
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dbl>

Using dplyr with sqlite

(oct_21 = flight_tbl |>
   filter(month == 10, day == 21) |>
   select(origin, dest, tailnum)
)
# Source:   SQL [?? x 3]
# Database: sqlite 3.50.4 [flights.sqlite]
   origin dest  tailnum
   <chr>  <chr> <chr>  
 1 EWR    CLT   N152UW 
 2 EWR    IAH   N535UA 
 3 JFK    MIA   N5BSAA 
 4 JFK    SJU   N531JB 
 5 JFK    BQN   N827JB 
 6 LGA    IAH   N15710 
 7 JFK    IAD   N825AS 
 8 EWR    TPA   N802UA 
 9 LGA    ATL   N996DL 
10 JFK    FLL   N627JB 
# ℹ more rows
dplyr::collect(oct_21)
# A tibble: 991 × 3
   origin dest  tailnum
   <chr>  <chr> <chr>  
 1 EWR    CLT   N152UW 
 2 EWR    IAH   N535UA 
 3 JFK    MIA   N5BSAA 
 4 JFK    SJU   N531JB 
 5 JFK    BQN   N827JB 
 6 LGA    IAH   N15710 
 7 JFK    IAD   N825AS 
 8 EWR    TPA   N802UA 
 9 LGA    ATL   N996DL 
10 JFK    FLL   N627JB 
# ℹ 981 more rows

Laziness

dplyr / dbplyr uses lazy evaluation as much as possible, particularly when working with non-local backends.

  • When building a query, we usually don't want the entire table; often we want just enough to check that our query works / makes sense.

  • Since we would prefer to run one complex query over many simple queries, laziness allows for verbs to be strung together.

  • Therefore, by default dplyr

    • won’t connect and query the database until absolutely necessary (e.g. to show output),

    • and, unless explicitly told to, will only query a handful of rows to give a sense of what the result will look like.

  • We can force evaluation via compute(), collect(), or collapse().
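To see the laziness in action: building a pipeline does not touch the database; only printing, collect(), and friends trigger a query.

```r
q <- flight_tbl |>
  filter(dep_delay > 60) |>
  select(origin, dest, dep_delay)  # nothing sent to the database yet

q |> show_query()                  # inspect the generated SQL, still no data
q |> collect()                     # now the full query runs, returning a tibble
```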

dplyr -> SQL - show_query()

class(oct_21)
[1] "tbl_SQLiteConnection" "tbl_dbi"              "tbl_sql"             
[4] "tbl_lazy"             "tbl"                 
show_query(oct_21)
<SQL>
SELECT `origin`, `dest`, `tailnum`
FROM `flights`
WHERE (`month` = 10.0) AND (`day` = 21.0)

More complex queries

oct_21 |> 
  summarize(
    n=n(), .by = c(origin, dest)
  )
# Source:   SQL [?? x 3]
# Database: sqlite 3.50.4 [flights.sqlite]
   origin dest      n
   <chr>  <chr> <int>
 1 EWR    ATL      15
 2 EWR    AUS       3
 3 EWR    AVL       1
 4 EWR    BNA       7
 5 EWR    BOS      17
 6 EWR    BTV       3
 7 EWR    BUF       2
 8 EWR    BWI       1
 9 EWR    CHS       4
10 EWR    CLE       4
# ℹ more rows
oct_21 |> 
  summarize(
    n=n(), .by = c(origin, dest)
  ) |> 
  show_query()
<SQL>
SELECT `origin`, `dest`, COUNT(*) AS `n`
FROM (
  SELECT `origin`, `dest`, `tailnum`
  FROM `flights`
  WHERE (`month` = 10.0) AND (`day` = 21.0)
) AS `q01`
GROUP BY `origin`, `dest`

oct_21 |> 
  count(origin, dest) |> 
  show_query()
<SQL>
SELECT `origin`, `dest`, COUNT(*) AS `n`
FROM (
  SELECT `origin`, `dest`, `tailnum`
  FROM `flights`
  WHERE (`month` = 10.0) AND (`day` = 21.0)
) AS `q01`
GROUP BY `origin`, `dest`

SQL Translation

In general, dplyr / dbplyr knows how to translate basic math, logical, and summary functions from R to SQL. dbplyr has a function, translate_sql(), that lets you experiment with how R functions are translated to SQL.

con = dbplyr::simulate_dbi()
dbplyr::translate_sql(x == 1 & (y < 2 | z > 3), con=con)
<SQL> `x` = 1.0 AND (`y` < 2.0 OR `z` > 3.0)
dbplyr::translate_sql(x ^ 2 < 10, con=con)
<SQL> (POWER(`x`, 2.0)) < 10.0
dbplyr::translate_sql(x %% 2 == 10, con=con)
<SQL> (`x` % 2.0) = 10.0
dbplyr::translate_sql(mean(x), con=con)
Warning: Missing values are always removed in SQL aggregation functions.
Use `na.rm = TRUE` to silence this warning
This warning is displayed once every 8 hours.
<SQL> AVG(`x`) OVER ()
dbplyr::translate_sql(mean(x, na.rm=TRUE), con=con)
<SQL> AVG(`x`) OVER ()

Complications?

oct_21 |> mutate(tailnum_n_prefix = grepl("^N", tailnum))
Error in `collect()`:
! Failed to collect lazy table.
Caused by error:
! no such function: grepl
oct_21 |> mutate(tailnum_n_prefix = grepl("^N", tailnum)) |> show_query()
<SQL>
SELECT `origin`, `dest`, `tailnum`, grepl('^N', `tailnum`) AS `tailnum_n_prefix`
FROM `flights`
WHERE (`month` = 10.0) AND (`day` = 21.0)
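One workaround is to stick to functions dbplyr knows how to translate; substr() is translated to SQLite's SUBSTR(), so the same prefix check can be written as below. (When no translation exists, raw SQL can also be injected with dplyr::sql().)

```r
oct_21 |>
  mutate(tailnum_n_prefix = substr(tailnum, 1, 1) == "N") |>
  show_query()
```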

Closing thoughts

The ability of dplyr to translate R expressions to SQL is an incredibly powerful tool, making your data-processing workflows portable across a wide variety of data backends.

Some tools and ecosystems that are worth learning about: