Basic unit-level model

The basic area-level model (Battese, Harter, and Fuller 1988; Rao and Molina 2015) is given by \[ y_j = \beta' x_j + v_{i[j]} + \epsilon_j\,,\\ v_i \stackrel{\mathrm{iid}}{\sim} {\cal N}(0, \sigma_v^2) \qquad \epsilon_j \stackrel{\mathrm{iid}}{\sim} {\cal N}(0, \sigma^2) \] where \(j\) runs from 1 to \(n\), the number of unit-level observations, \(\beta\) is a vector of regression coefficients for given covariates \(x_j\), and \(v_i\) are random area intercepts.

We use the api dataset included in packages survey.

library(survey)
data(api)
apipop$cname <- as.factor(apipop$cname)
apisrs$cname <- factor(apisrs$cname, levels=levels(apipop$cname))

The apipop data frame contains the complete population whereas apisrs is a simple random sample from it. The variable cname is the county name, and we will be interested in estimation at the county level. Not all counties in the population are sampled. In order to be able to make predictions for out-of-sample areas we make sure that the levels of the sample’s cname variable match those of its population counterpart.

The basic unit-level model with county random area effects is fit as follows

library(mcmcsae)
mod <- api00 ~ 
  reg(~ ell + meals + stype + hsg + col.grad + grad.sch, name="beta") +
  gen(factor = ~ iid(cname), name="v")
sampler <- create_sampler(mod, data=apisrs)
sim <- MCMCsim(sampler, store.all=TRUE, verbose=FALSE)
(summary(sim))
## llh_ :
##      Mean        SD   t-value      MCSE     q0.05      q0.5     q0.95     n_eff     R_hat 
## -1089.645     4.435  -245.691     0.127 -1097.336 -1089.401 -1082.719  1229.010     1.000 
## 
## sigma_ :
##     Mean       SD  t-value     MCSE    q0.05     q0.5    q0.95    n_eff    R_hat 
## 5.62e+01 3.03e+00 1.85e+01 6.75e-02 5.14e+01 5.60e+01 6.14e+01 2.02e+03 1.00e+00 
## 
## beta :
##                 Mean     SD t-value    MCSE   q0.05     q0.5    q0.95 n_eff R_hat
## (Intercept)  802.149 25.125   31.93 0.55319  760.38  802.391 844.1195  2063 0.999
## ell           -1.955  0.301   -6.49 0.00683   -2.44   -1.953  -1.4649  1942 1.000
## meals         -1.959  0.279   -7.03 0.00593   -2.42   -1.958  -1.5001  2206 1.000
## stypeH      -106.266 13.403   -7.93 0.26613 -127.99 -106.382 -83.2523  2537 1.002
## stypeM       -59.989 11.417   -5.25 0.23423  -78.75  -60.083 -41.3575  2376 1.000
## hsg           -0.619  0.411   -1.51 0.00812   -1.29   -0.620   0.0513  2559 1.001
## col.grad       0.584  0.490    1.19 0.01133   -0.22    0.572   1.3960  1870 1.000
## grad.sch       2.123  0.474    4.48 0.00898    1.33    2.130   2.8878  2790 1.000
## 
## v_sigma :
##    Mean      SD t-value    MCSE   q0.05    q0.5   q0.95   n_eff   R_hat 
##  22.971   6.403   3.587   0.208  13.314  22.504  33.958 945.927   1.002 
## 
## v :
##                  Mean   SD   t-value  MCSE q0.05     q0.5  q0.95 n_eff R_hat
## Alameda      -25.3376 15.5 -1.637917 0.397 -51.4 -24.9348 -0.833  1517     1
## Amador         0.7307 24.5  0.029783 0.448 -38.4   0.5594 40.278  3000     1
## Butte          0.0213 24.1  0.000884 0.475 -39.9   0.0545 39.696  2566     1
## Calaveras      7.3363 22.1  0.332020 0.424 -27.5   6.6664 44.519  2721     1
## Colusa        -0.1862 24.2 -0.007687 0.442 -40.3  -0.3438 38.628  3000     1
## Contra Costa  -9.9189 15.3 -0.648433 0.344 -36.1  -8.9934 14.541  1976     1
## Del Norte      0.9561 23.8  0.040149 0.440 -37.1   0.4333 41.074  2933     1
## El Dorado     -0.3595 24.2 -0.014866 0.446 -40.5   0.3054 36.991  2937     1
## Fresno         0.1117 15.3  0.007296 0.351 -25.3  -0.1095 25.895  1900     1
## Glenn         -0.2126 23.4 -0.009070 0.433 -38.4   0.0831 38.587  2935     1
## ... 47 elements suppressed ...

We want to estimate the area population means \[ \theta_i = \frac{1}{N_i}\sum_{j \in U_i} y_j\,, \] where \(U_i\) is the set of units in area \(i\) of size \(N_i\). The MCMC output in variable sim can be used to obtain draws from the posterior distribution for \(\theta_i\). The \(r\)th draw can be expressed as \[ \theta_{i;r} = \frac{1}{N_i} \left(n_i \bar{y}_i + \beta_r'(t_{x;i} - n_i \bar{x}_i) + (N_i - n_i)v_{i;r} + \sum_{j \in U_i\setminus s_i} \epsilon_{j;r} \right)\,, \] where \(\bar{y}_i\) is the sample mean of \(y\) in area \(i\) and \(t_{x;i}\) is a vector of population totals for area \(i\).

N <- table(apipop$cname)
samplesums <- tapply(apisrs$api00, apisrs$cname, sum)
samplesums[is.na(samplesums)] <- 0  # substitute 0 for out-of-sample areas
m <- match(apisrs$cds, apipop$cds)  # population units in the sample
res <- predict(sim, newdata=apipop, labels=names(N),
         fun=function(x, p) (samplesums + tapply(x[-m], apipop$cname[-m], sum ))/N,
         show.progress=FALSE)
(summ <- summary(res))
##                 Mean    SD t-value  MCSE q0.05 q0.5 q0.95 n_eff R_hat
## Alameda          679 14.88    45.6 0.332   654  679   702  2013 1.000
## Amador           739 31.08    23.8 0.608   689  738   792  2616 1.000
## Butte            680 26.26    25.9 0.568   638  679   722  2141 1.001
## Calaveras        743 27.92    26.6 0.594   697  742   789  2208 1.000
## Colusa           552 31.31    17.6 0.617   501  553   602  2575 1.002
## Contra Costa     727 14.79    49.1 0.273   702  727   750  2933 1.000
## Del Norte        680 32.12    21.2 0.607   628  679   734  2803 0.999
## El Dorado        753 27.00    27.9 0.532   709  754   796  2572 1.001
## Fresno           595 15.13    39.3 0.276   569  595   619  3000 1.002
## Glenn            625 31.31    20.0 0.588   573  625   676  2832 1.000
## Humboldt         701 27.35    25.6 0.543   656  701   745  2536 0.999
## Imperial         558 24.50    22.8 0.506   519  557   600  2346 1.001
## Inyo             706 33.27    21.2 0.708   652  706   761  2212 1.002
## Kern             596 15.05    39.6 0.320   571  596   620  2211 1.001
## Kings            595 22.33    26.6 0.452   556  595   630  2445 1.001
## Lake             662 24.66    26.8 0.452   622  662   702  2982 1.000
## Lassen           706 26.45    26.7 0.487   663  706   750  2946 1.000
## Los Angeles      636  8.17    77.8 0.152   623  636   649  2903 0.999
## Madera           615 19.90    30.9 0.386   582  615   647  2657 1.000
## Marin            817 20.16    40.5 0.368   785  816   850  3000 1.000
## Mariposa         724 36.12    20.1 0.714   666  724   782  2560 1.001
## Mendocino        663 27.39    24.2 0.545   618  663   708  2522 1.001
## Merced           578 23.13    25.0 0.479   541  578   617  2332 1.000
## Modoc            672 29.78    22.6 0.659   624  671   722  2042 1.000
## Mono             705 41.44    17.0 0.783   641  704   773  2800 1.000
## Monterey         628 18.35    34.2 0.340   599  628   658  2912 1.001
## Napa             701 22.05    31.8 0.426   665  701   737  2682 1.001
## Nevada           799 29.10    27.5 0.563   751  799   847  2673 1.000
## Orange           693 15.26    45.4 0.315   668  692   718  2346 1.000
## Placer           772 23.89    32.3 0.489   734  772   813  2382 1.001
## Plumas           691 31.85    21.7 0.629   639  692   742  2562 1.000
## Riverside        636 13.92    45.7 0.270   613  636   658  2652 1.000
## Sacramento       668 14.93    44.8 0.286   644  669   693  2721 1.001
## San Benito       708 29.43    24.1 0.574   661  708   757  2633 1.000
## San Bernardino   647 12.78    50.6 0.233   626  646   668  3000 1.000
## San Diego        704 14.18    49.6 0.320   681  704   728  1967 0.999
## San Francisco    638 19.87    32.1 0.402   607  638   671  2441 1.000
## San Joaquin      630 17.53    35.9 0.368   601  631   658  2266 1.000
## San Luis Obispo  752 23.14    32.5 0.435   715  752   791  2824 0.999
## San Mateo        734 20.99    35.0 0.405   700  734   769  2681 1.000
## Santa Barbara    679 19.35    35.1 0.390   647  679   711  2461 1.000
## Santa Clara      733 15.96    45.9 0.292   705  734   759  2985 1.000
## Santa Cruz       679 19.84    34.3 0.379   646  680   712  2736 1.000
## Shasta           697 21.75    32.0 0.466   663  696   733  2184 1.001
## Sierra           707 42.43    16.7 0.775   637  707   777  3000 1.001
## Siskiyou         698 24.98    27.9 0.456   657  698   738  3000 1.000
## Solano           711 19.97    35.6 0.391   678  711   743  2612 1.000
## Sonoma           725 22.54    32.2 0.443   687  726   761  2583 1.000
## Stanislaus       662 19.57    33.8 0.357   629  661   694  3000 1.003
## Sutter           653 24.32    26.8 0.458   613  653   693  2815 1.000
## Tehama           659 28.99    22.7 0.539   611  659   706  2893 1.000
## Trinity          649 38.10    17.0 0.696   587  651   711  3000 1.001
## Tulare           583 21.90    26.6 0.440   546  584   618  2476 1.000
## Tuolumne         720 30.42    23.7 0.582   671  719   771  2731 1.000
## Ventura          710 18.03    39.4 0.342   680  709   740  2783 1.000
## Yolo             686 24.84    27.6 0.472   645  686   726  2773 1.000
## Yuba             627 28.28    22.2 0.606   581  627   673  2176 1.001
theta <- c(tapply(apipop$api00, apipop$cname, mean))  # true population quantities
plot_coef(summ, list(est=theta), n.se=2, est.names=c("mcmcsae", "true"), maxrows=30)

Binomial Unit-Level Model

A model with binomial likelihood can also be fit. We now model the target variable sch.wide, a binary variable indicating whether a school-wide growth target has been met. We use the same mean model structure as above for the linear model, but now using a logistic link function, \[ y_j \stackrel{\mathrm{iid}}{\sim} {\cal Be}(p_j)\,,\\ \mathrm{logit}(p_j) = \beta' x_j + v_{i[j]}\,,\\ v_i \stackrel{\mathrm{iid}}{\sim} {\cal N}(0, \sigma_v^2) \]

apisrs$target.met <- as.numeric(apisrs$sch.wide == "Yes")
sampler <- create_sampler(update(mod, target.met ~ .), family="binomial", data=apisrs)
sim <- MCMCsim(sampler, store.all=TRUE, verbose=FALSE)
summary(sim)
## llh_ :
##      Mean        SD   t-value      MCSE     q0.05      q0.5     q0.95     n_eff     R_hat 
##  -83.2891    2.4918  -33.4257    0.0607  -87.5277  -83.1962  -79.4361 1685.4346    1.0008 
## 
## beta :
##                  Mean     SD t-value     MCSE   q0.05      q0.5    q0.95 n_eff R_hat
## (Intercept)  4.563445 1.4934  3.0558 0.058914  2.2238  4.480713  7.14145   643  1.00
## ell         -0.035823 0.0150 -2.3838 0.000420 -0.0612 -0.035212 -0.01215  1280  1.00
## meals        0.000808 0.0141  0.0572 0.000454 -0.0223  0.000659  0.02462   971  1.00
## stypeH      -2.821852 0.6210 -4.5441 0.019304 -3.8534 -2.800611 -1.85445  1035  1.00
## stypeM      -1.651438 0.5873 -2.8118 0.016695 -2.5977 -1.644618 -0.71640  1238  1.00
## hsg         -0.025548 0.0223 -1.1470 0.000657 -0.0626 -0.025645  0.01055  1148  1.00
## col.grad    -0.037802 0.0269 -1.4045 0.000868 -0.0816 -0.037805  0.00687   961  1.01
## grad.sch     0.014016 0.0316  0.4434 0.001113 -0.0347  0.012320  0.06888   807  1.00
## 
## v_sigma :
##     Mean       SD  t-value     MCSE    q0.05     q0.5    q0.95    n_eff    R_hat 
##   0.3754   0.2857   1.3141   0.0123   0.0297   0.3198   0.9167 535.1238   1.0121 
## 
## v :
##                   Mean    SD  t-value    MCSE  q0.05     q0.5 q0.95 n_eff R_hat
## Alameda      -0.190148 0.433 -0.43866 0.01363 -1.000 -0.07204 0.338  1012 1.003
## Amador        0.004611 0.489  0.00943 0.00915 -0.771  0.00267 0.762  2856 1.000
## Butte        -0.000808 0.478 -0.00169 0.00872 -0.759 -0.00202 0.785  3000 0.999
## Calaveras     0.035399 0.465  0.07609 0.00914 -0.675  0.00547 0.845  2588 0.999
## Colusa       -0.010062 0.462 -0.02176 0.00846 -0.797 -0.00139 0.725  2991 1.000
## Contra Costa  0.005358 0.404  0.01327 0.00940 -0.666  0.00159 0.674  1846 1.002
## Del Norte    -0.007185 0.479 -0.01501 0.00887 -0.792 -0.00109 0.734  2910 1.000
## El Dorado     0.004596 0.451  0.01019 0.00838 -0.729  0.00206 0.721  2896 1.000
## Fresno       -0.042824 0.383 -0.11190 0.00892 -0.694 -0.01627 0.544  1841 1.002
## Glenn        -0.014269 0.476 -0.02995 0.00871 -0.804 -0.00148 0.730  2994 1.000
## ... 47 elements suppressed ...

To predict the population fractions of schools that meet the growth target by county,

samplesums <- tapply(apisrs$target.met, apisrs$cname, sum)
samplesums[is.na(samplesums)] <- 0  # substitute 0 for out-of-sample areas
res <- predict(sim, newdata=apipop, labels=names(N),
         fun=function(x, p) (samplesums + tapply(x[-m], apipop$cname[-m], sum ))/N,
         show.progress=FALSE)
(summ <- summary(res))
##                  Mean     SD t-value     MCSE q0.05  q0.5 q0.95 n_eff R_hat
## Alameda         0.784 0.0615   12.74 0.001650 0.670 0.792 0.864  1390 1.002
## Amador          0.837 0.1171    7.14 0.002182 0.600 0.800 1.000  2880 1.000
## Butte           0.835 0.0755   11.06 0.001548 0.708 0.833 0.938  2377 1.000
## Calaveras       0.875 0.0953    9.18 0.001884 0.700 0.900 1.000  2558 1.000
## Colusa          0.645 0.1577    4.09 0.002951 0.333 0.667 0.889  2856 1.001
## Contra Costa    0.817 0.0570   14.32 0.001349 0.721 0.821 0.899  1786 1.001
## Del Norte       0.881 0.1079    8.16 0.002191 0.750 0.875 1.000  2424 1.001
## El Dorado       0.850 0.0758   11.23 0.001521 0.725 0.850 0.950  2482 1.002
## Fresno          0.781 0.0618   12.65 0.001463 0.667 0.790 0.871  1783 1.001
## Glenn           0.790 0.1415    5.58 0.002717 0.556 0.778 1.000  2713 1.002
## Humboldt        0.850 0.0746   11.39 0.001533 0.725 0.850 0.950  2369 1.000
## Imperial        0.699 0.1059    6.60 0.002309 0.500 0.700 0.850  2103 1.002
## Inyo            0.777 0.1584    4.91 0.003087 0.571 0.857 1.000  2635 1.000
## Kern            0.804 0.0523   15.36 0.001232 0.706 0.811 0.878  1805 1.000
## Kings           0.850 0.0791   10.74 0.001654 0.720 0.840 0.960  2289 1.001
## Lake            0.825 0.0957    8.62 0.001837 0.636 0.818 0.955  2717 1.002
## Lassen          0.834 0.1100    7.58 0.002184 0.636 0.818 1.000  2534 1.000
## Los Angeles     0.799 0.0431   18.56 0.001040 0.728 0.799 0.869  1714 1.000
## Madera          0.820 0.0753   10.89 0.001539 0.677 0.839 0.935  2391 1.001
## Marin           0.837 0.0707   11.84 0.001705 0.700 0.840 0.940  1721 1.002
## Mariposa        0.860 0.1517    5.67 0.003097 0.600 0.800 1.000  2400 1.002
## Mendocino       0.815 0.0905    9.01 0.001700 0.640 0.840 0.960  2834 0.999
## Merced          0.787 0.0826    9.53 0.001727 0.635 0.794 0.905  2289 1.001
## Modoc           0.832 0.1578    5.27 0.002998 0.600 0.800 1.000  2769 0.999
## Mono            0.748 0.2435    3.07 0.004572 0.333 0.667 1.000  2837 1.000
## Monterey        0.803 0.0682   11.78 0.001688 0.687 0.807 0.916  1633 1.001
## Napa            0.845 0.0790   10.70 0.001538 0.704 0.852 0.963  2635 1.000
## Nevada          0.866 0.0961    9.01 0.001865 0.714 0.857 1.000  2655 0.999
## Orange          0.774 0.0611   12.68 0.001364 0.672 0.775 0.873  2004 1.000
## Placer          0.813 0.0761   10.68 0.001723 0.682 0.818 0.909  1950 0.999
## Plumas          0.738 0.1558    4.74 0.002921 0.444 0.778 1.000  2844 0.999
## Riverside       0.824 0.0527   15.64 0.001133 0.729 0.831 0.895  2163 1.000
## Sacramento      0.847 0.0493   17.20 0.001286 0.764 0.847 0.924  1467 1.001
## San Benito      0.816 0.1262    6.47 0.002513 0.545 0.818 1.000  2522 1.000
## San Bernardino  0.862 0.0422   20.43 0.001159 0.793 0.862 0.934  1326 1.005
## San Diego       0.848 0.0427   19.85 0.000988 0.775 0.850 0.913  1870 1.002
## San Francisco   0.761 0.0740   10.28 0.001524 0.630 0.770 0.870  2359 0.999
## San Joaquin     0.827 0.0576   14.36 0.001311 0.726 0.831 0.911  1932 1.001
## San Luis Obispo 0.854 0.0731   11.69 0.001469 0.725 0.850 0.950  2474 0.999
## San Mateo       0.796 0.0746   10.67 0.001915 0.664 0.804 0.888  1519 1.001
## Santa Barbara   0.828 0.0669   12.39 0.001645 0.716 0.840 0.926  1652 1.001
## Santa Clara     0.782 0.0729   10.73 0.002138 0.645 0.796 0.871  1162 1.000
## Santa Cruz      0.787 0.0753   10.45 0.001667 0.653 0.796 0.898  2044 1.000
## Shasta          0.816 0.0772   10.58 0.001726 0.675 0.825 0.925  1998 1.000
## Sierra          0.723 0.2276    3.18 0.004201 0.333 0.667 1.000  2935 1.000
## Siskiyou        0.842 0.1004    8.39 0.001970 0.667 0.867 1.000  2595 0.999
## Solano          0.866 0.0634   13.67 0.001374 0.750 0.867 0.967  2125 1.002
## Sonoma          0.846 0.0620   13.64 0.001423 0.741 0.852 0.935  1900 1.000
## Stanislaus      0.812 0.0738   11.00 0.001822 0.681 0.824 0.901  1642 1.004
## Sutter          0.832 0.0904    9.20 0.001722 0.650 0.850 0.950  2758 1.000
## Tehama          0.842 0.1004    8.38 0.002095 0.647 0.882 1.000  2297 1.001
## Trinity         0.745 0.2051    3.63 0.004003 0.500 0.750 1.000  2624 1.000
## Tulare          0.831 0.0667   12.45 0.001492 0.709 0.836 0.927  2002 1.000
## Tuolumne        0.878 0.0968    9.07 0.002004 0.746 0.917 1.000  2333 0.999
## Ventura         0.840 0.0493   17.03 0.001061 0.758 0.845 0.913  2159 1.000
## Yolo            0.853 0.0736   11.59 0.001598 0.714 0.857 0.971  2125 1.001
## Yuba            0.793 0.1014    7.82 0.002037 0.632 0.789 0.947  2477 1.000
theta <- c(tapply(apipop$sch.wide == "Yes", apipop$cname, mean))  # true population quantities
plot_coef(summ, list(est=theta), n.se=2, est.names=c("mcmcsae", "true"), maxrows=30)

References

Battese, G. E., R. M. Harter, and W. A. Fuller. 1988. “An Error-Components Model for Prediction of County Crop Areas Using Survey and Satellite Data.” Journal of the American Statistical Association 83 (401): 28–36.

Rao, J. N. K., and I. Molina. 2015. Small Area Estimation. John Wiley & Sons.