| Task | Timing (ns) | Timing (μs) |
|---|---|---|
| L1 cache reference | 0.5 | 0.0005 |
| L2 cache reference | 7 | 0.007 |
| Main memory reference | 100 | 0.1 |
| Random seek SSD | 150,000 | 150 |
| Read 1 MB sequentially from memory | 250,000 | 250 |
| Read 1 MB sequentially from SSD | 1,000,000 | 1,000 |
| Disk seek | 10,000,000 | 10,000 |
| Read 1 MB sequentially from disk | 20,000,000 | 20,000 |
| Send packet CA->Netherlands->CA | 150,000,000 | 150,000 |
Let's imagine we have a 10 GB flat data file and that we want to select certain rows based on a particular criterion. This requires a sequential read across the entire data set.
| File Location | Performance | Time |
|---|---|---|
| in memory | \(10~GB \times (250~\mu s / 1~MB)\) | 2.5 seconds |
| on disk (SSD) | \(10~GB \times (1~ms / 1~MB)\) | 10 seconds |
| on disk (HD) | \(10~GB \times (20~ms / 1~MB)\) | 200 seconds |
This is just for reading sequential data; if we make any modifications (writing) or the data is fragmented, things are much worse.
Price (per GB): Disk << SSD <<< Memory

Speed: Disk <<< SSD << Memory
It is usually possible to grow our disk storage to accommodate our data. However, memory is usually the limiting resource, so what can we do if we can't fit everything into memory?
Create blocks - group related data (e.g. rows) and read in multiple rows at a time. The optimal block size will depend on the task and the properties of the disk.
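A minimal sketch of block-wise processing in base R (the file name, block size, and criterion are all made up for illustration):

```r
con_file = file("big.csv", open = "r")  # hypothetical 10 GB flat file
header = readLines(con_file, n = 1)

selected = list()
# read 10,000 rows at a time rather than the whole file
while (length(block <- readLines(con_file, n = 10000)) > 0) {
  # keep only the rows matching our criterion
  selected[[length(selected) + 1]] = block[grepl("Sales", block)]
}
close(con_file)

result = unlist(selected)
```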
Even with blocks, any kind of querying / subsetting of rows involves a linear search, which requires \(\mathcal{O}(N)\) reads.
We can do better if we are careful about how we structure our data, specifically sorting some (or all) of the columns.
Sorting is expensive, \(\mathcal{O}(N \log N)\), but it only needs to be done once.
After sorting, we can use a binary search for any subsetting tasks, \(\mathcal{O}(\log N)\).
In a database these "sorted" columns are referred to as indexes.

Indexes require additional storage, but they are usually small enough to be kept in memory even if the data blocks need to stay on disk.
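A quick illustration of the payoff in R (hypothetical data; `findInterval()` performs a binary search on a sorted vector):

```r
x = sort(runif(1e7))  # pay the O(N log N) sort cost once
target = x[1234567]

system.time(which(x == target))       # linear search: O(N)
system.time(findInterval(target, x))  # binary search: O(log N)
```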
DBI is a low-level package for interfacing R with database management systems (DBMS). It provides a common interface for tasks like connecting and disconnecting, executing statements, retrieving results, and managing transactions.
DBI is a specification, not an implementation, and there are a number of packages that implement the DBI specification for different database systems. Backends for R-DBI lists all available backends, but some notable ones include:
- RPostgres
- RMariaDB
- RSQLite
- odbc
- bigrquery
- duckdb
- sparklyr
The RSQLite package provides the implementation necessary to use DBI to interface with an SQLite database. This package also loads the necessary DBI functions (via re-exporting).
Once loaded we can create a connection to our database,

```r
library(RSQLite)

con = dbConnect(RSQLite::SQLite(), ":memory:")
# :memory: creates a temporary, in-memory SQLite database
```
```r
str(con)
```

```
Formal class 'SQLiteConnection' [package "RSQLite"] with 8 slots
  ..@ ptr                :<externalptr>
  ..@ dbname             : chr ":memory:"
  ..@ loadable.extensions: logi TRUE
  ..@ flags              : int 70
  ..@ vfs                : chr ""
  ..@ ref                :<environment: 0x107abb778>
  ..@ bigint             : chr "integer64"
  ..@ extended_types     : logi FALSE
```
```r
library(tibble)

employees = tibble(
  name   = c("Alice", "Bob", "Carol", "Dave", "Eve", "Frank"),
  email  = c("alice@company.com", "bob@company.com",
             "carol@company.com", "dave@company.com",
             "eve@company.com", "frank@company.com"),
  salary = c(52000, 40000, 30000, 33000, 44000, 37000),
  dept   = c("Accounting", "Accounting", "Sales",
             "Accounting", "Sales", "Sales")
)
```
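The tibble still needs to be copied into the database; presumably this was done with something like:

```r
dbWriteTable(con, "employees", employees)  # write the tibble as a table named "employees"
```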
Database queries are transactional (see ACID) and are broken up into 3 steps:

1. dbSendQuery() - submit the query to the database
2. dbFetch() - retrieve the results
3. dbClearResult() - close the result set
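For example, sending the query returns a result object (a sketch matching the printed result below):

```r
res = dbSendQuery(con, "SELECT * FROM employees")  # step 1: send the query
res
```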
```
<SQLiteResult>
  SQL  SELECT * FROM employees
  ROWS Fetched: 0 [incomplete]
       Changed: 0
```
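The remaining steps fetch the rows and release the result:

```r
dbFetch(res)        # step 2: retrieve the rows (all of them by default)
dbClearResult(res)  # step 3: free the result set
```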
There is also dbGetQuery(), which combines all three steps.
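For example (the query is assumed to match the output that follows):

```r
dbGetQuery(con, "SELECT * FROM employees")
```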
```
   name             email salary       dept
1 Alice alice@company.com  52000 Accounting
2   Bob   bob@company.com  40000 Accounting
3 Carol carol@company.com  30000      Sales
4  Dave  dave@company.com  33000 Accounting
5   Eve   eve@company.com  44000      Sales
6 Frank frank@company.com  37000      Sales
```
dbCreateTable() will create a new table with a schema based on an existing data.frame / tibble, but it does not populate that table with data.
Data can be added to an existing table via dbAppendTable().
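A sketch using the iris data (the read-back step via dplyr is an assumption, chosen to match the output below):

```r
dbCreateTable(con, "iris", iris)  # create an empty table with iris' schema
dbAppendTable(con, "iris", iris)  # insert the 150 rows

dplyr::tbl(con, "iris") |> dplyr::collect()  # read the data back as a tibble
```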
```
# A tibble: 150 × 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <chr>
 1          5.1         3.5          1.4         0.2 setosa
 2          4.9         3            1.4         0.2 setosa
 3          4.7         3.2          1.3         0.2 setosa
 4          4.6         3.1          1.5         0.2 setosa
 5          5           3.6          1.4         0.2 setosa
 6          5.4         3.9          1.7         0.4 setosa
 7          4.6         3.4          1.4         0.3 setosa
 8          5           3.4          1.5         0.2 setosa
 9          4.4         2.9          1.4         0.2 setosa
10          4.9         3.1          1.5         0.1 setosa
# ℹ 140 more rows
```
We’ve connected to a database. We can view the tables within the database:
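DBI provides dbListTables() for this (using the in-memory connection from above):

```r
dbListTables(con)  # list the names of all tables in the database
```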
We can even save the connection to a specific table within the database:
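With dplyr this is done via tbl(); a sketch using the table created above (the object name is made up):

```r
employees_tbl = dplyr::tbl(con, "employees")  # a lazy reference to the table
```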
We can run SQL directly:
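For example, either through DBI or by wrapping a query with sql() (the query itself is hypothetical):

```r
# Through DBI, returning a data.frame,
dbGetQuery(con, "SELECT name, salary FROM employees WHERE salary > 40000")

# or as a lazy dplyr table,
dplyr::tbl(con, dplyr::sql("SELECT name, salary FROM employees WHERE salary > 40000"))
```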
We can create groups for the purpose of summarizing using GROUP BY.
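For example, counting employees per department (a query reconstructed to match the output below):

```r
dbGetQuery(con, "SELECT dept, COUNT(*) AS n FROM employees GROUP BY dept")
```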
```
        dept n
1 Accounting 3
2      Sales 3
```
The “AS” keyword in SQL is essentially a way to rename a column or table in the result. From a dplyr standpoint, it’s very similar to how we use mutate() or summarize().
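A small illustration (a hypothetical query; the derived column is named with AS, much like the left-hand side of a mutate()):

```r
dbGetQuery(con, "SELECT name, salary / 12.0 AS monthly_salary FROM employees")
```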
Using SQL syntax, compute

1. the total costs in payroll for this company
2. the average salary within each department
```r
db = DBI::dbConnect(RSQLite::SQLite(), "flights.sqlite")

( flight_tbl = dplyr::copy_to(
    db, nycflights13::flights, name = "flights", temporary = FALSE
) )
```

```
# Source:   table<`flights`> [?? x 19]
# Database: sqlite 3.50.4 [flights.sqlite]
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dbl>
```
All of this data now lives in the database on the filesystem, not in memory. This shows in the classes of the two objects: nycflights13::flights is a local tibble,

```
[1] "tbl_df"     "tbl"        "data.frame"
```

while flight_tbl is a lazy, remote table:

```
[1] "tbl_SQLiteConnection" "tbl_dbi"              "tbl_sql"
[4] "tbl_lazy"             "tbl"
```
The structure of flight_tbl confirms it is just a connection plus an unevaluated query:

```
List of 2
 $ src       :List of 2
  ..$ con  :Formal class 'SQLiteConnection' [package "RSQLite"] with 8 slots
  .. .. ..@ ptr                :<externalptr>
  .. .. ..@ dbname             : chr "flights.sqlite"
  .. .. ..@ loadable.extensions: logi TRUE
  .. .. ..@ flags              : int 70
  .. .. ..@ vfs                : chr ""
  .. .. ..@ ref                :<environment: 0x1105c72e0>
  .. .. ..@ bigint             : chr "integer64"
  .. .. ..@ extended_types     : logi FALSE
  ..$ disco: NULL
  ..- attr(*, "class")= chr [1:4] "src_SQLiteConnection" "src_dbi" "src_sql" "src"
 $ lazy_query:List of 5
  ..$ x         : 'dbplyr_table_path' chr "`flights`"
  ..$ vars      : chr [1:19] "year" "month" "day" "dep_time" ...
  ..$ group_vars: chr(0)
  ..$ order_vars: NULL
  ..$ frame     : NULL
  ..- attr(*, "class")= chr [1:3] "lazy_base_remote_query" "lazy_base_query" "lazy_query"
 - attr(*, "class")= chr [1:5] "tbl_SQLiteConnection" "tbl_dbi" "tbl_sql" "tbl_lazy" ...
```
Printing flight_tbl queries the database for just enough rows to show a preview:

```
# Source:   table<`flights`> [?? x 19]
# Database: sqlite 3.50.4 [flights.sqlite]
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dbl>
```
dplyr verbs applied to flight_tbl return another lazy table, e.g. one selecting origin, dest, and tailnum:

```
# Source:   SQL [?? x 3]
# Database: sqlite 3.50.4 [flights.sqlite]
   origin dest  tailnum
   <chr>  <chr> <chr>
 1 EWR    CLT   N152UW
 2 EWR    IAH   N535UA
 3 JFK    MIA   N5BSAA
 4 JFK    SJU   N531JB
 5 JFK    BQN   N827JB
 6 LGA    IAH   N15710
 7 JFK    IAD   N825AS
 8 EWR    TPA   N802UA
 9 LGA    ATL   N996DL
10 JFK    FLL   N627JB
# ℹ more rows
```
dplyr / dbplyr uses lazy evaluation as much as possible, particularly when working with non-local backends.

When building a query we don't want the entire table; often we want just enough to check that our query is working / makes sense.

Since we would prefer to run one complex query over many simple queries, laziness allows verbs to be strung together.

Therefore, by default dplyr

- won't connect to and query the database until absolutely necessary (e.g. to show output), and
- unless explicitly told to, will only query a handful of rows to give a sense of what the result will look like.

We can force evaluation via compute(), collect(), or collapse().
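A sketch of the difference (the query itself is made up):

```r
# Builds a lazy query; nothing is sent to the database yet
delayed = flight_tbl |>
  dplyr::filter(dep_delay > 60) |>
  dplyr::select(origin, dest, dep_delay)

dplyr::collect(delayed)  # runs the query, returning a local tibble
dplyr::compute(delayed)  # runs the query, storing the result in a temporary table
```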
We can see the SQL that dplyr generates for any lazy query with show_query().

In general, dplyr / dbplyr knows how to translate basic math, logical, and summary functions from R to SQL. dbplyr has a function, translate_sql(), that lets you experiment with how R functions are translated to SQL.
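For example (using dbplyr's simulated SQLite connection, so no live database is needed):

```r
dbplyr::translate_sql(x + 1, con = dbplyr::simulate_sqlite())
dbplyr::translate_sql(mean(x, na.rm = TRUE), con = dbplyr::simulate_sqlite())
```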
Not every R function has an SQL translation; unknown functions are passed through to the database as-is, which can fail at query time. SQLite, for example, has no grepl function:

```
Error in `collect()`:
! Failed to collect lazy table.
Caused by error:
! no such function: grepl
```
The ability of dplyr to translate R expressions to SQL is an incredibly powerful tool, making your data processing workflows portable across a wide variety of data backends.
Some tools and ecosystems that are worth learning about: