# Big IPUMS data

#### 2019-06-04

Browsing data on the IPUMS website can be a little like grocery shopping when you’re hungry — you show up to grab a couple things, but everything looks so good, and you end up with an overflowing cart1. When you do this with IPUMS data, unfortunately, sometimes your extract may get so large that it doesn’t fit in your computer’s memory.

If you’ve got an extract that’s too big, both the IPUMS website and the ipumsr package have tools to help. There are four basic strategies:

1. Get more memory.
2. Reduce the size of your dataset.
3. Use “chunked”/“yield” reading to work one piece at a time.
4. Use a database.

The IPUMS website has features to help with option 2, and the ipumsr package can help you with options 3 and 4 (option 1 depends on your wallet).

The examples in this vignette will rely on the ipumsr, dplyr and biglm packages, and the example CPS extract used in the ipums-cps vignette. If you want to follow along, you should follow the instructions in that vignette to make an extract.

library(ipumsr)
library(dplyr)

# To run the full vignette you'll also need the following packages:
installed_biglm <- requireNamespace("biglm")
installed_db_pkgs <- requireNamespace("DBI") &
requireNamespace("RSQLite") &
requireNamespace("dbplyr")

# Change these filepaths to the filepaths of your downloaded extract
cps_ddi_file <- "cps_00001.xml"
cps_data_file <- "cps_00001.dat"

# Option 1: Trade money for convenience

If you’ve got a dataset that’s too big for your RAM, you could always get more. You could accomplish this by upgrading your current computer, getting a new one, or paying a cloud service like Amazon or Microsoft Azure (or one of the many other similar services). Here are guides for using R on Amazon and Microsoft Azure.

# Option 2: Do you really need all of that?

The IPUMS website has many features that will let you reduce the size of your extract. The easiest thing to do is to review your sample and variable selections to see if you can drop some.

If you do need every sample and variable, but your analysis is on a specific subset of the data, the IPUMS extract engine has a feature called “Select Cases”, where you can subset on an included variable (for example you could subset on AGE so that your extract only includes those older than 65, or subset on EDUCATION to look at only college graduates). In most IPUMS microdata projects, the select cases feature is on the “Create Extract” page, as the last step before you submit the extract. If you’ve already submitted the extract, you can click the “revise” link on the “Download or Revise Extracts” page to access the “Select Cases” feature.

Or, if you would be happy with a random subsample of the data, the IPUMS extract engine has an option to “Customize Sample Size” that will take a random sample. This feature is also available on the “Create Extract” page, as the last step before you submit the extract. Again, if you’ve already submitted your extract, you can access this feature by clicking the “revise” link on the “Download or Revise Extracts” page.

# Option 3: work one “chunk”/“yield” at a time

ipumsr has two closely related concepts for reading parts of a function at a time. The “chunk” API inherits directly from the readr framework for reading chunks of data. These functions allow you to specify a function that will be called on each chunk and then how you’d like to combine them at the end. The “yield” API is unique to ipumsr (and fixed-width data), but allows a little more flexibility because control is returned to you between each piece of data.

## Option 3A: Work one chunk at a time

ipumsr has “chunked” versions of the microdata reading functions (read_ipums_micro_chunked() and read_ipums_micro_list_chunked()). These chunked versions of the functions allow you to specify a function that will be applied to each chunk, and then also control how the results from these chunks are combined. This functionality is based on the chunked functionality introduced by readr and so is quite flexible. Below, we’ll outline solutions to three common use-cases for IPUMS data: tabulation, regression and selecting cases.

### Chunked tabulation example

Let’s say you want to find the percent of people in the workforce by their self-reported health. Since this extract is small enough to fit in memory, we could just do the following:

read_ipums_micro(
cps_ddi_file, data_file = cps_data_file, verbose = FALSE
) %>%
mutate(
HEALTH = as_factor(HEALTH),
AT_WORK = EMPSTAT %>%
lbl_relabel(
lbl(1, "Yes") ~ .lbl == "At work",
lbl(0, "No") ~ .lbl != "At work"
) %>%
as_factor()
) %>%
group_by(HEALTH, AT_WORK) %>%
summarize(n = n())
#> # A tibble: 10 x 3
#> # Groups:   HEALTH [5]
#>    HEALTH    AT_WORK     n
#>    <fct>     <fct>   <int>
#>  1 Excellent No      40582
#>  2 Excellent Yes     28071
#>  3 Very good No      32367
#>  4 Very good Yes     32947
#>  5 Good      No      26726
#>  6 Good      Yes     22483
#>  7 Fair      No      11089
#>  8 Fair      Yes      4520
#>  9 Poor      No       5418
#> 10 Poor      Yes       780

But let’s pretend like we can only store 1,000 rows at a time. In this case, we need to use a chunked function, tabulate for each chunk, and then calculate the counts across all of the chunks.

First we’ll make the callback function, which will take two arguments: x (the data from a chunk) and pos (the position of the chunk, expressed as the line in the input file at which the chunk starts). We’ll only use x, but the callback function must always take both these arguments.

cb_function <- function(x, pos) {
x %>% mutate(
HEALTH = as_factor(HEALTH),
AT_WORK = EMPSTAT %>%
lbl_relabel(
lbl(1, "Yes") ~ .lbl == "At work",
lbl(0, "No") ~ .lbl != "At work"
) %>%
as_factor()
) %>%
group_by(HEALTH, AT_WORK) %>%
summarize(n = n())
}

Next we need to create a callback object. The choice of a callback object depends mainly on how we want to combine the results from applying our callback function to each chunk. The tree main types of callbacks that apply IPUMS variable metadta are:

• IpumsDataFrameCallback - Combine the results from each chunk together by row binding them together.
• IpumsListCallback - Use when you don’t want to (or can’t) immediately combine the resuts, so it returns a list with one item per chunk containing the results.
• IpumsSideEffectCallback - Use when you are calling the function primarily for side effects, and do not need the results at all (for example if you are saving each chunk to disk).

(The fourth, which is used for running lm regression is discussed later in this document).

In this case, we want to row-bind the data.frames returned by cb_function(), so we use IpumsDataFrameCallback.

Callback objects are R6 objects, but you don’t need to be familiar with R6 to use them2. For now, all we really need to know is that to create a callback, we use $new() syntax. cb <- IpumsDataFrameCallback$new(cb_function)

Next we read in the data with the read_ipums_micro_chunked() function, specifying the callback and that we want the chunk_size to be 1000.

chunked_tabulations <- read_ipums_micro_chunked(
cps_ddi_file, data_file = cps_data_file, verbose = FALSE,
callback = cb, chunk_size = 1000
)

Now we have a data.frame with the counts by health and work status within each chunk. To get the full table, we just need to sum by health and work status one more time.

chunked_tabulations %>%
group_by(HEALTH, AT_WORK) %>%
summarize(n = sum(n))
#> # A tibble: 10 x 3
#> # Groups:   HEALTH [5]
#>    HEALTH    AT_WORK     n
#>    <fct>     <fct>   <int>
#>  1 Excellent No      40582
#>  2 Excellent Yes     28071
#>  3 Very good No      32367
#>  4 Very good Yes     32947
#>  5 Good      No      26726
#>  6 Good      Yes     22483
#>  7 Fair      No      11089
#>  8 Fair      Yes      4520
#>  9 Poor      No       5418
#> 10 Poor      Yes       780

### Chunked regression example

With the biglm package, it is possible to use R to perform a regression on data that is too large to store in memory all at once. The ipumsr package provides a callback designed to make this simple: IpumsBiglmCallback.

Again we’ll use the CPS example, which is small enough that we can keep it in memory. Here’s an example of a regression looking at how hours worked, self-reported health and age are related among those who are currently working. This is meant as a simple example, and ignores many of the complexities in this relationship, so please use caution when interpreting.

# Read in data
cps_ddi_file, data_file = cps_data_file, verbose = FALSE
)

# Prepare data for model
# (age has been capped at 99, which we assume is high enough to not
#  cause any problems so we leave it.)
data <- data %>%
mutate(
HEALTH = as_factor(HEALTH),
AHRSWORKT = lbl_na_if(AHRSWORKT, ~.lbl == "NIU (Not in universe)"),
AT_WORK = EMPSTAT %>%
lbl_relabel(
lbl(1, "Yes") ~ .lbl == "At work",
lbl(0, "No") ~ .lbl != "At work"
) %>%
as_factor()
) %>%
filter(AT_WORK == "Yes")

# Run regression
model <- lm(AHRSWORKT ~ AGE + I(AGE^2) + HEALTH, data)
summary(model)
#>
#> Call:
#> lm(formula = AHRSWORKT ~ AGE + I(AGE^2) + HEALTH, data = data)
#>
#> Residuals:
#> <Labelled double>: Hours worked last week
#>     Min      1Q  Median      3Q     Max
#> -41.230  -4.949  -0.080   5.945  75.697
#>
#> Coefficients:
#>                  Estimate Std. Error t value Pr(>|t|)
#> (Intercept)      5.626953   0.367851  15.297  < 2e-16 ***
#> AGE              1.568287   0.017790  88.156  < 2e-16 ***
#> I(AGE^2)        -0.016798   0.000204 -82.338  < 2e-16 ***
#> HEALTHVery good -0.280826   0.104433  -2.689  0.00717 **
#> HEALTHGood      -1.275358   0.115861 -11.008  < 2e-16 ***
#> HEALTHFair      -3.614487   0.207121 -17.451  < 2e-16 ***
#> HEALTHPoor      -5.732656   0.465751 -12.308  < 2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 12.8 on 88794 degrees of freedom
#> Multiple R-squared:  0.08886,    Adjusted R-squared:  0.08879
#> F-statistic:  1443 on 6 and 88794 DF,  p-value: < 2.2e-16

To do the same regression, but with only 1000 rows loaded at a time, we work in a similar manner.

First we make the IpumsBiglmCallback callback object that specifies both the model and a function to prepare the data.

biglm_cb <- IpumsBiglmCallback$new( model = AHRSWORKT ~ AGE + I(AGE^2) + HEALTH, prep = function(x, pos) { x %>% mutate( HEALTH = as_factor(HEALTH), AHRSWORKT = lbl_na_if(AHRSWORKT, ~.lbl == "NIU (Not in universe)"), AT_WORK = EMPSTAT %>% lbl_relabel( lbl(1, "Yes") ~ .lbl == "At work", lbl(0, "No") ~ .lbl != "At work" ) %>% as_factor() ) %>% filter(AT_WORK == "Yes") } ) And then we read the data using read_ipums_micro_chunked(), passing the callback that we just made. chunked_model <- read_ipums_micro_chunked( cps_ddi_file, data_file = cps_data_file, verbose = FALSE, callback = biglm_cb, chunk_size = 1000 ) summary(chunked_model) #> Large data regression model: biglm(AHRSWORKT ~ AGE + I(AGE^2) + HEALTH, data, ...) #> Sample size = 88801 #> Coef (95% CI) SE p #> (Intercept) 5.6270 4.8913 6.3627 0.3679 0.0000 #> AGE 1.5683 1.5327 1.6039 0.0178 0.0000 #> I(AGE^2) -0.0168 -0.0172 -0.0164 0.0002 0.0000 #> HEALTHVery good -0.2808 -0.4897 -0.0720 0.1044 0.0072 #> HEALTHGood -1.2754 -1.5071 -1.0436 0.1159 0.0000 #> HEALTHFair -3.6145 -4.0287 -3.2002 0.2071 0.0000 #> HEALTHPoor -5.7327 -6.6642 -4.8012 0.4658 0.0000 ### Chunked “select cases” example Sometimes you may want to select a subset of the data before reading it in. The IPUMS website has this functionality built in, which can be a faster way to do this (this “select cases” functionality is described in the second section above). Also, Unix commands like awk and sed will generally be much faster than these R based solutions. However, it is possible to use the chunked functions to create a subset, which can be convenient if you want to subset on some complex logic that would be hard to code into the IPUMS extract system or Unix tools. # Subset only those in "Poor" health chunked_subset <- read_ipums_micro_chunked( cps_ddi_file, data_file = cps_data_file, verbose = FALSE, callback = IpumsDataFrameCallback$new(function(x, pos) {
filter(x, HEALTH == 5)
}),
chunk_size = 1000
)

## Option 3B: Work one ‘yield’ at atime

ipumsr now has a second form of chunked reading called ‘yields’ which accomplish a similar goal as chunks, but are more flexible. They grant you more freedom in what R code you run between reading chunks, including the ability to have multiple files open at once. Additionally, yields are compatible with the bigglm function which allows you to run glm models on data larger than memory (as oposed to plain old lm’s that you can run with chunks). The downside to this greater control is that yields have an API that is unique to IPUMS data and the way they work is unusual for R code.

### Chunked tabulation example

To see how the yield functions work as compared to the chunked functions, here’s the same chunked tabulation example that was shown earlier in this document, but using yields rather than chunks. We want to calculate the percent of people in the workforce by their self-reported health status.

First we create the yield object with the function read_ipums_micro_yield() – the arguments are similar to read_ipums_micro().

data <- read_ipums_micro_yield(
cps_ddi_file,
data_file = cps_data_file,
verbose = FALSE
)

This function returns an R6 object (like the callbacks for the chunked data) which contains methods for reading the data. The most important method is the yield(n) function which will return n rows of the data.frame. In this example, we’ll also use is_done(), which tells us whether we’ve finished reading the data yet.

yield_results <- tibble(
HEALTH = factor(levels = c("Excellent", "Very good", "Good", "Fair", "Poor")),
AT_WORK = factor(levels = c("No", "Yes")),
n = integer(0)
)
while (!data$is_done()) { new <- data$yield(n = 1000) %>%
mutate(
HEALTH = as_factor(HEALTH),
AT_WORK = EMPSTAT %>%
lbl_relabel(
lbl(1, "Yes") ~ .lbl == "At work",
lbl(0, "No") ~ .lbl != "At work"
) %>%
as_factor()
) %>%
group_by(HEALTH, AT_WORK) %>%
summarize(n = n())

yield_results <- bind_rows(yield_results, new) %>%
group_by(HEALTH, AT_WORK) %>%
summarize(n = sum(n))
}

yield_results
#> # A tibble: 10 x 3
#> # Groups:   HEALTH [5]
#>    HEALTH    AT_WORK     n
#>    <fct>     <fct>   <int>
#>  1 Excellent No      40582
#>  2 Excellent Yes     28071
#>  3 Very good No      32367
#>  4 Very good Yes     32947
#>  5 Good      No      26726
#>  6 Good      Yes     22483
#>  7 Fair      No      11089
#>  8 Fair      Yes      4520
#>  9 Poor      No       5418
#> 10 Poor      Yes       780

### GLM yield regression example

One of the major benefits of the yield-style reading over chunks is that it is compatible with the GLM functions from biglm, allowing more complicated models than just linear models. Logistic models are one such model that requires GLM methods. The same caution as the chunked example applies - this is meant as a simple example model, please interpret with caution.

To run a logistic model we start by loading the data in the same way as the frequency example above:

data <- read_ipums_micro_yield(
cps_ddi_file,
data_file = cps_data_file,
verbose = FALSE
)

Next we make a function that takes a single argument reset. When it is FALSE, it returns the next chunk of data (and formats it for our model). When reset is TRUE, it resets the data to the beginning. To create this function, we use the the reset() method from the yield object.

get_model_data <- function(reset) {
if (reset) {
data$reset() } else { yield <- data$yield(n = 1000) # Set n pretty low for example
if (is.null(yield)) return(yield)
out <- yield %>%
mutate(
HEALTH = as_factor(HEALTH),
WORK30PLUS = lbl_na_if(AHRSWORKT, ~.lbl == "NIU (Not in universe)") %>%
{. >= 30},
AT_WORK = EMPSTAT %>%
lbl_relabel(
lbl(1, "Yes") ~ .lbl == "At work",
lbl(0, "No") ~ .lbl != "At work"
) %>%
as_factor()
) %>%
filter(AT_WORK == "Yes")
return(out)
}
}

Finally we feed this function and a model specification to the bigglm() function:

library(biglm)
results <- bigglm(
WORK30PLUS ~ AGE + I(AGE^2) + HEALTH,
family = binomial(link = "logit"),
data = get_model_data
)

summary(results)
#> Large data regression model: bigglm(WORK30PLUS ~ AGE + I(AGE^2) + HEALTH, family = binomial(link = "logit"),
#>     data = get_model_data)
#> Sample size =  88801
#>                    Coef    (95%     CI)     SE      p
#> (Intercept)     -3.9623 -4.0956 -3.8290 0.0667 0.0000
#> AGE              0.2677  0.2609  0.2744 0.0034 0.0000
#> I(AGE^2)        -0.0029 -0.0030 -0.0028 0.0000 0.0000
#> HEALTHVery good  0.0491  0.0043  0.0939 0.0224 0.0282
#> HEALTHGood      -0.1248 -0.1737 -0.0760 0.0244 0.0000
#> HEALTHFair      -0.6458 -0.7250 -0.5666 0.0396 0.0000
#> HEALTHPoor      -0.9561 -1.1209 -0.7913 0.0824 0.0000

# Option 4: Use a database

Databases are another option for data that cannot fit in memory as an R data.frame. If you have access to a database on a remote machine, then you can easily pull in parts of the data for your analysis. Even if you’ll need to store the database on your machine, it may have more efficient storage of data so your data fits in your memory, or it may use your hard drive.

R’s tools for integrating with databases are improving quickly. The DBI package has been updated, dplyr (through dbplyr) provides a frontend that allows you to write the same code for data in a database as you would for a local data.frame, and packages like sparklyr, sparkR, bigrquery and others provide access to the latest and greatest.

There are many different kinds of databases, each with their own benefits, weaknesses and tradeoffs. As such, it’s hard to give concrete advice without knowing your specific use-case. However, once you’ve chosen a database, in general, there will be two steps: Importing the data into the database and then connecting it to R.

As an example, we’ll use the RSQLite package to load the data into an in-memory database. RSQLite is great because it is easy to set up, but it is probably not efficient enough to help you if you need to use a database because your data doesn’t fit in memory.

## Importing data into a database

When using rectangular extracts, your best bet to import IPUMS data into your database is probably going to be a csv file. Most databases support csv importing, and these implementations will generally be well supported since this is a common file format.

However, if you need a hierarchical extract, or your database software doesn’t support the csv format, then you can use the chunking functions to load the data into a database without storing the full data in R.

# Connect to database
library(DBI)
library(RSQLite)
con <- dbConnect(SQLite(), path = ":memory:")

# Add data to tables in chunks
ddi,
data_file = cps_data_file,
if (pos == 1) {
dbWriteTable(con, "cps", x)
} else {
dbWriteTable(con, "cps", x, row.names = FALSE, append = TRUE)
}
}),
chunk_size = 1000,
verbose = FALSE
)
#> NULL

## Connecting to a database with dbplyr

The dbplyr vignette “dbplyr” (which you can access with vignette("dbplyr", package = "dbplyr")) is a good place to get started learning about how to connect to a database. Here I’ll just briefly show some examples.

example <- tbl(con, "cps")

example %>%
filter('AGE' > 25)
#> # Source:   lazy query [?? x 14]
#> # Database: sqlite 3.22.0 []
#>     YEAR SERIAL HWTSUPP   CPSID ASECFLAG FOODSTMP MONTH PERNUM  CPSIDP
#>    <dbl>  <dbl>   <dbl>   <dbl>    <int>    <int> <int>  <dbl>   <dbl>
#>  1  2011      2    545. 2.01e13        1        1     3      1 2.01e13
#>  2  2011      3    475. 2.01e13        1        1     3      1 2.01e13
#>  3  2011      3    475. 2.01e13        1        1     3      2 2.01e13
#>  4  2011      4    548. 2.01e13        1        1     3      1 2.01e13
#>  5  2011      4    548. 2.01e13        1        1     3      2 2.01e13
#>  6  2011      5    475. 2.01e13        1        1     3      1 2.01e13
#>  7  2011      6    298. 2.01e13        1        1     3      1 2.01e13
#>  8  2011      6    298. 2.01e13        1        1     3      2 2.01e13
#>  9  2011      6    298. 2.01e13        1        1     3      3 2.01e13
#> 10  2011      6    298. 2.01e13        1        1     3      4 2.01e13
#> # ... with more rows, and 5 more variables: WTSUPP <dbl>, AGE <int>,
#> #   EMPSTAT <int>, AHRSWORKT <dbl>, HEALTH <int>

Though dbplyr shows us a nice preview of the first rows of the result of our query, the data still lives in the database. When using a regular database, in general you’d use the function dplyr::collect() to load in the full results of the query to your R session. However, the database has no concept of IPUMS attributes like value and variable labels, so if you want them, you can use ipums_collect() like so:

example %>%
filter('AGE' > 25) %>%
ipums_collect(ddi)
#> # A tibble: 204,983 x 14
#>     YEAR SERIAL HWTSUPP   CPSID ASECFLAG FOODSTMP   MONTH PERNUM  CPSIDP
#>    <dbl>  <dbl>   <dbl>   <dbl> <int+lb> <int+lb> <int+l>  <dbl>   <dbl>
#>  1  2011      2    545. 2.01e13 1 [ASEC]   1 [No] 3 [Mar~      1 2.01e13
#>  2  2011      3    475. 2.01e13 1 [ASEC]   1 [No] 3 [Mar~      1 2.01e13
#>  3  2011      3    475. 2.01e13 1 [ASEC]   1 [No] 3 [Mar~      2 2.01e13
#>  4  2011      4    548. 2.01e13 1 [ASEC]   1 [No] 3 [Mar~      1 2.01e13
#>  5  2011      4    548. 2.01e13 1 [ASEC]   1 [No] 3 [Mar~      2 2.01e13
#>  6  2011      5    475. 2.01e13 1 [ASEC]   1 [No] 3 [Mar~      1 2.01e13
#>  7  2011      6    298. 2.01e13 1 [ASEC]   1 [No] 3 [Mar~      1 2.01e13
#>  8  2011      6    298. 2.01e13 1 [ASEC]   1 [No] 3 [Mar~      2 2.01e13
#>  9  2011      6    298. 2.01e13 1 [ASEC]   1 [No] 3 [Mar~      3 2.01e13
#> 10  2011      6    298. 2.01e13 1 [ASEC]   1 [No] 3 [Mar~      4 2.01e13
#> # ... with 204,973 more rows, and 5 more variables: WTSUPP <dbl>,
#> #   AGE <int+lbl>, EMPSTAT <int+lbl>, AHRSWORKT <dbl+lbl>,
#> #   HEALTH <int+lbl>

# Learning more

Big data is a problem for lots of R users, not just IPUMS users, so there are a lot of resources to help you out! These are just a few that I found useful while writing this document:

• Best practice to handle out-of-memory data - RStudio Community Thread link
• Big Data in R - Part of Stephen Mooney’s EPIC: Epidemiologic Analysis Using R, June 2015 class link
• Statistical Analysis with Open-Source R and RStudio on Amazon EMR - Markus Schmidberger on the AWS Big Data Blog link
• Hosting RStudio Server on Azure - Colin Gillespie’s blog post on using Rstudio on Azure link
• Improving DBI: A Retrospect - Kirill Müller’s report on the R Consortium grant to improve database support in R link

1. Bonus joke: Why is the IPUMS website better than any grocery store? Answer: More free samples.

2. If you’re interested in learning more about R6, the upcoming revision to Hadley Wickham’s Advanced R book includes a chapter on R6 available for free here