Basic unit-level models

The basic area-level model (Battese, Harter, and Fuller 1988; Rao and Molina 2015) is given by $y_j = \beta' x_j + v_{i[j]} + \epsilon_j\,,\\ v_i \stackrel{\mathrm{iid}}{\sim} {\cal N}(0, \sigma_v^2) \qquad \epsilon_j \stackrel{\mathrm{iid}}{\sim} {\cal N}(0, \sigma^2)$ where $$j$$ runs from 1 to $$n$$, the number of unit-level observations, $$\beta$$ is a vector of regression coefficients for given covariates $$x_j$$, and $$v_i$$ are random area intercepts.

We use the api dataset included in packages survey.

library(survey)
data(api)
apipop$cname <- as.factor(apipop$cname)
apisrs$cname <- factor(apisrs$cname, levels=levels(apipop$cname)) The apipop data frame contains the complete population whereas apisrs is a simple random sample from it. The variable cname is the county name, and we will be interested in estimation at the county level. Not all counties in the population are sampled. In order to be able to make predictions for out-of-sample areas we make sure that the levels of the sample’s cname variable match those of its population counterpart. The basic unit-level model with county random area effects is fit as follows library(mcmcsae) mod <- api00 ~ reg(~ ell + meals + stype + hsg + col.grad + grad.sch, name="beta") + gen(factor = ~ iid(cname), name="v") sampler <- create_sampler(mod, data=apisrs) sim <- MCMCsim(sampler, store.all=TRUE, verbose=FALSE) (summary(sim)) ## llh_ : ## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat ## -1089.645 4.435 -245.691 0.127 -1097.336 -1089.401 -1082.719 1229.010 1.000 ## ## sigma_ : ## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat ## 5.62e+01 3.03e+00 1.85e+01 6.75e-02 5.14e+01 5.60e+01 6.14e+01 2.02e+03 1.00e+00 ## ## beta : ## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat ## (Intercept) 802.149 25.125 31.93 0.55319 760.38 802.391 844.1195 2063 0.999 ## ell -1.955 0.301 -6.49 0.00683 -2.44 -1.953 -1.4649 1942 1.000 ## meals -1.959 0.279 -7.03 0.00593 -2.42 -1.958 -1.5001 2206 1.000 ## stypeH -106.266 13.403 -7.93 0.26613 -127.99 -106.382 -83.2523 2537 1.002 ## stypeM -59.989 11.417 -5.25 0.23423 -78.75 -60.083 -41.3575 2376 1.000 ## hsg -0.619 0.411 -1.51 0.00812 -1.29 -0.620 0.0513 2559 1.001 ## col.grad 0.584 0.490 1.19 0.01133 -0.22 0.572 1.3960 1870 1.000 ## grad.sch 2.123 0.474 4.48 0.00898 1.33 2.130 2.8878 2790 1.000 ## ## v_sigma : ## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat ## 22.971 6.403 3.587 0.208 13.314 22.504 33.958 945.927 1.002 ## ## v : ## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat ## Alameda -25.3376 15.5 -1.637917 0.397 -51.4 -24.9348 -0.833 1517 1 ## Amador 0.7307 24.5 0.029783 0.448 -38.4 0.5594 40.278 3000 1 ## Butte 0.0213 24.1 0.000884 0.475 -39.9 0.0545 39.696 2566 1 ## Calaveras 7.3363 22.1 0.332020 0.424 -27.5 6.6664 44.519 2721 1 ## Colusa -0.1862 24.2 -0.007687 0.442 -40.3 -0.3438 38.628 3000 1 ## Contra Costa -9.9189 15.3 -0.648433 0.344 -36.1 -8.9934 14.541 1976 1 ## Del Norte 0.9561 23.8 0.040149 0.440 -37.1 0.4333 41.074 2933 1 ## El Dorado -0.3595 24.2 -0.014866 0.446 -40.5 0.3054 36.991 2937 1 ## Fresno 0.1117 15.3 0.007296 0.351 -25.3 -0.1095 25.895 1900 1 ## Glenn -0.2126 23.4 -0.009070 0.433 -38.4 0.0831 38.587 2935 1 ## ... 47 elements suppressed ... We want to estimate the area population means $\theta_i = \frac{1}{N_i}\sum_{j \in U_i} y_j\,,$ where $$U_i$$ is the set of units in area $$i$$ of size $$N_i$$. The MCMC output in variable sim can be used to obtain draws from the posterior distribution for $$\theta_i$$. The $$r$$th draw can be expressed as $\theta_{i;r} = \frac{1}{N_i} \left(n_i \bar{y}_i + \beta_r'(t_{x;i} - n_i \bar{x}_i) + (N_i - n_i)v_{i;r} + \sum_{j \in U_i\setminus s_i} \epsilon_{j;r} \right)\,,$ where $$\bar{y}_i$$ is the sample mean of $$y$$ in area $$i$$ and $$t_{x;i}$$ is a vector of population totals for area $$i$$. N <- table(apipop$cname)
samplesums <- tapply(apisrs$api00, apisrs$cname, sum)
samplesums[is.na(samplesums)] <- 0  # substitute 0 for out-of-sample areas
m <- match(apisrs$cds, apipop$cds)  # population units in the sample
res <- predict(sim, newdata=apipop, labels=names(N),
fun=function(x, p) (samplesums + tapply(x[-m], apipop$cname[-m], sum ))/N, show.progress=FALSE) (summ <- summary(res)) ## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat ## Alameda 679 14.88 45.6 0.332 654 679 702 2013 1.000 ## Amador 739 31.08 23.8 0.608 689 738 792 2616 1.000 ## Butte 680 26.26 25.9 0.568 638 679 722 2141 1.001 ## Calaveras 743 27.92 26.6 0.594 697 742 789 2208 1.000 ## Colusa 552 31.31 17.6 0.617 501 553 602 2575 1.002 ## Contra Costa 727 14.79 49.1 0.273 702 727 750 2933 1.000 ## Del Norte 680 32.12 21.2 0.607 628 679 734 2803 0.999 ## El Dorado 753 27.00 27.9 0.532 709 754 796 2572 1.001 ## Fresno 595 15.13 39.3 0.276 569 595 619 3000 1.002 ## Glenn 625 31.31 20.0 0.588 573 625 676 2832 1.000 ## Humboldt 701 27.35 25.6 0.543 656 701 745 2536 0.999 ## Imperial 558 24.50 22.8 0.506 519 557 600 2346 1.001 ## Inyo 706 33.27 21.2 0.708 652 706 761 2212 1.002 ## Kern 596 15.05 39.6 0.320 571 596 620 2211 1.001 ## Kings 595 22.33 26.6 0.452 556 595 630 2445 1.001 ## Lake 662 24.66 26.8 0.452 622 662 702 2982 1.000 ## Lassen 706 26.45 26.7 0.487 663 706 750 2946 1.000 ## Los Angeles 636 8.17 77.8 0.152 623 636 649 2903 0.999 ## Madera 615 19.90 30.9 0.386 582 615 647 2657 1.000 ## Marin 817 20.16 40.5 0.368 785 816 850 3000 1.000 ## Mariposa 724 36.12 20.1 0.714 666 724 782 2560 1.001 ## Mendocino 663 27.39 24.2 0.545 618 663 708 2522 1.001 ## Merced 578 23.13 25.0 0.479 541 578 617 2332 1.000 ## Modoc 672 29.78 22.6 0.659 624 671 722 2042 1.000 ## Mono 705 41.44 17.0 0.783 641 704 773 2800 1.000 ## Monterey 628 18.35 34.2 0.340 599 628 658 2912 1.001 ## Napa 701 22.05 31.8 0.426 665 701 737 2682 1.001 ## Nevada 799 29.10 27.5 0.563 751 799 847 2673 1.000 ## Orange 693 15.26 45.4 0.315 668 692 718 2346 1.000 ## Placer 772 23.89 32.3 0.489 734 772 813 2382 1.001 ## Plumas 691 31.85 21.7 0.629 639 692 742 2562 1.000 ## Riverside 636 13.92 45.7 0.270 613 636 658 2652 1.000 ## Sacramento 668 14.93 44.8 0.286 644 669 693 2721 1.001 ## San Benito 708 29.43 24.1 0.574 661 708 757 2633 1.000 ## San Bernardino 647 12.78 50.6 0.233 626 646 668 3000 1.000 ## San Diego 704 14.18 49.6 0.320 681 704 728 1967 0.999 ## San Francisco 638 19.87 32.1 0.402 607 638 671 2441 1.000 ## San Joaquin 630 17.53 35.9 0.368 601 631 658 2266 1.000 ## San Luis Obispo 752 23.14 32.5 0.435 715 752 791 2824 0.999 ## San Mateo 734 20.99 35.0 0.405 700 734 769 2681 1.000 ## Santa Barbara 679 19.35 35.1 0.390 647 679 711 2461 1.000 ## Santa Clara 733 15.96 45.9 0.292 705 734 759 2985 1.000 ## Santa Cruz 679 19.84 34.3 0.379 646 680 712 2736 1.000 ## Shasta 697 21.75 32.0 0.466 663 696 733 2184 1.001 ## Sierra 707 42.43 16.7 0.775 637 707 777 3000 1.001 ## Siskiyou 698 24.98 27.9 0.456 657 698 738 3000 1.000 ## Solano 711 19.97 35.6 0.391 678 711 743 2612 1.000 ## Sonoma 725 22.54 32.2 0.443 687 726 761 2583 1.000 ## Stanislaus 662 19.57 33.8 0.357 629 661 694 3000 1.003 ## Sutter 653 24.32 26.8 0.458 613 653 693 2815 1.000 ## Tehama 659 28.99 22.7 0.539 611 659 706 2893 1.000 ## Trinity 649 38.10 17.0 0.696 587 651 711 3000 1.001 ## Tulare 583 21.90 26.6 0.440 546 584 618 2476 1.000 ## Tuolumne 720 30.42 23.7 0.582 671 719 771 2731 1.000 ## Ventura 710 18.03 39.4 0.342 680 709 740 2783 1.000 ## Yolo 686 24.84 27.6 0.472 645 686 726 2773 1.000 ## Yuba 627 28.28 22.2 0.606 581 627 673 2176 1.001 theta <- c(tapply(apipop$api00, apipop$cname, mean)) # true population quantities plot_coef(summ, list(est=theta), n.se=2, est.names=c("mcmcsae", "true"), maxrows=30) Binomial Unit-Level Model A model with binomial likelihood can also be fit. We now model the target variable sch.wide, a binary variable indicating whether a school-wide growth target has been met. We use the same mean model structure as above for the linear model, but now using a logistic link function, $y_j \stackrel{\mathrm{iid}}{\sim} {\cal Be}(p_j)\,,\\ \mathrm{logit}(p_j) = \beta' x_j + v_{i[j]}\,,\\ v_i \stackrel{\mathrm{iid}}{\sim} {\cal N}(0, \sigma_v^2)$ apisrs$target.met <- as.numeric(apisrs$sch.wide == "Yes") sampler <- create_sampler(update(mod, target.met ~ .), family="binomial", data=apisrs) sim <- MCMCsim(sampler, store.all=TRUE, verbose=FALSE) summary(sim) ## llh_ : ## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat ## -83.2891 2.4918 -33.4257 0.0607 -87.5277 -83.1962 -79.4361 1685.4346 1.0008 ## ## beta : ## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat ## (Intercept) 4.563445 1.4934 3.0558 0.058914 2.2238 4.480713 7.14145 643 1.00 ## ell -0.035823 0.0150 -2.3838 0.000420 -0.0612 -0.035212 -0.01215 1280 1.00 ## meals 0.000808 0.0141 0.0572 0.000454 -0.0223 0.000659 0.02462 971 1.00 ## stypeH -2.821852 0.6210 -4.5441 0.019304 -3.8534 -2.800611 -1.85445 1035 1.00 ## stypeM -1.651438 0.5873 -2.8118 0.016695 -2.5977 -1.644618 -0.71640 1238 1.00 ## hsg -0.025548 0.0223 -1.1470 0.000657 -0.0626 -0.025645 0.01055 1148 1.00 ## col.grad -0.037802 0.0269 -1.4045 0.000868 -0.0816 -0.037805 0.00687 961 1.01 ## grad.sch 0.014016 0.0316 0.4434 0.001113 -0.0347 0.012320 0.06888 807 1.00 ## ## v_sigma : ## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat ## 0.3754 0.2857 1.3141 0.0123 0.0297 0.3198 0.9167 535.1238 1.0121 ## ## v : ## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat ## Alameda -0.190148 0.433 -0.43866 0.01363 -1.000 -0.07204 0.338 1012 1.003 ## Amador 0.004611 0.489 0.00943 0.00915 -0.771 0.00267 0.762 2856 1.000 ## Butte -0.000808 0.478 -0.00169 0.00872 -0.759 -0.00202 0.785 3000 0.999 ## Calaveras 0.035399 0.465 0.07609 0.00914 -0.675 0.00547 0.845 2588 0.999 ## Colusa -0.010062 0.462 -0.02176 0.00846 -0.797 -0.00139 0.725 2991 1.000 ## Contra Costa 0.005358 0.404 0.01327 0.00940 -0.666 0.00159 0.674 1846 1.002 ## Del Norte -0.007185 0.479 -0.01501 0.00887 -0.792 -0.00109 0.734 2910 1.000 ## El Dorado 0.004596 0.451 0.01019 0.00838 -0.729 0.00206 0.721 2896 1.000 ## Fresno -0.042824 0.383 -0.11190 0.00892 -0.694 -0.01627 0.544 1841 1.002 ## Glenn -0.014269 0.476 -0.02995 0.00871 -0.804 -0.00148 0.730 2994 1.000 ## ... 47 elements suppressed ... To predict the population fractions of schools that meet the growth target by county, samplesums <- tapply(apisrs$target.met, apisrs$cname, sum) samplesums[is.na(samplesums)] <- 0 # substitute 0 for out-of-sample areas res <- predict(sim, newdata=apipop, labels=names(N), fun=function(x, p) (samplesums + tapply(x[-m], apipop$cname[-m], sum ))/N,
show.progress=FALSE)
(summ <- summary(res))
##                  Mean     SD t-value     MCSE q0.05  q0.5 q0.95 n_eff R_hat
## Alameda         0.784 0.0615   12.74 0.001650 0.670 0.792 0.864  1390 1.002
## Amador          0.837 0.1171    7.14 0.002182 0.600 0.800 1.000  2880 1.000
## Butte           0.835 0.0755   11.06 0.001548 0.708 0.833 0.938  2377 1.000
## Calaveras       0.875 0.0953    9.18 0.001884 0.700 0.900 1.000  2558 1.000
## Colusa          0.645 0.1577    4.09 0.002951 0.333 0.667 0.889  2856 1.001
## Contra Costa    0.817 0.0570   14.32 0.001349 0.721 0.821 0.899  1786 1.001
## Del Norte       0.881 0.1079    8.16 0.002191 0.750 0.875 1.000  2424 1.001
## El Dorado       0.850 0.0758   11.23 0.001521 0.725 0.850 0.950  2482 1.002
## Fresno          0.781 0.0618   12.65 0.001463 0.667 0.790 0.871  1783 1.001
## Glenn           0.790 0.1415    5.58 0.002717 0.556 0.778 1.000  2713 1.002
## Humboldt        0.850 0.0746   11.39 0.001533 0.725 0.850 0.950  2369 1.000
## Imperial        0.699 0.1059    6.60 0.002309 0.500 0.700 0.850  2103 1.002
## Inyo            0.777 0.1584    4.91 0.003087 0.571 0.857 1.000  2635 1.000
## Kern            0.804 0.0523   15.36 0.001232 0.706 0.811 0.878  1805 1.000
## Kings           0.850 0.0791   10.74 0.001654 0.720 0.840 0.960  2289 1.001
## Lake            0.825 0.0957    8.62 0.001837 0.636 0.818 0.955  2717 1.002
## Lassen          0.834 0.1100    7.58 0.002184 0.636 0.818 1.000  2534 1.000
## Los Angeles     0.799 0.0431   18.56 0.001040 0.728 0.799 0.869  1714 1.000
## Madera          0.820 0.0753   10.89 0.001539 0.677 0.839 0.935  2391 1.001
## Marin           0.837 0.0707   11.84 0.001705 0.700 0.840 0.940  1721 1.002
## Mariposa        0.860 0.1517    5.67 0.003097 0.600 0.800 1.000  2400 1.002
## Mendocino       0.815 0.0905    9.01 0.001700 0.640 0.840 0.960  2834 0.999
## Merced          0.787 0.0826    9.53 0.001727 0.635 0.794 0.905  2289 1.001
## Modoc           0.832 0.1578    5.27 0.002998 0.600 0.800 1.000  2769 0.999
## Mono            0.748 0.2435    3.07 0.004572 0.333 0.667 1.000  2837 1.000
## Monterey        0.803 0.0682   11.78 0.001688 0.687 0.807 0.916  1633 1.001
## Napa            0.845 0.0790   10.70 0.001538 0.704 0.852 0.963  2635 1.000
## Nevada          0.866 0.0961    9.01 0.001865 0.714 0.857 1.000  2655 0.999
## Orange          0.774 0.0611   12.68 0.001364 0.672 0.775 0.873  2004 1.000
## Placer          0.813 0.0761   10.68 0.001723 0.682 0.818 0.909  1950 0.999
## Plumas          0.738 0.1558    4.74 0.002921 0.444 0.778 1.000  2844 0.999
## Riverside       0.824 0.0527   15.64 0.001133 0.729 0.831 0.895  2163 1.000
## Sacramento      0.847 0.0493   17.20 0.001286 0.764 0.847 0.924  1467 1.001
## San Benito      0.816 0.1262    6.47 0.002513 0.545 0.818 1.000  2522 1.000
## San Bernardino  0.862 0.0422   20.43 0.001159 0.793 0.862 0.934  1326 1.005
## San Diego       0.848 0.0427   19.85 0.000988 0.775 0.850 0.913  1870 1.002
## San Francisco   0.761 0.0740   10.28 0.001524 0.630 0.770 0.870  2359 0.999
## San Joaquin     0.827 0.0576   14.36 0.001311 0.726 0.831 0.911  1932 1.001
## San Luis Obispo 0.854 0.0731   11.69 0.001469 0.725 0.850 0.950  2474 0.999
## San Mateo       0.796 0.0746   10.67 0.001915 0.664 0.804 0.888  1519 1.001
## Santa Barbara   0.828 0.0669   12.39 0.001645 0.716 0.840 0.926  1652 1.001
## Santa Clara     0.782 0.0729   10.73 0.002138 0.645 0.796 0.871  1162 1.000
## Santa Cruz      0.787 0.0753   10.45 0.001667 0.653 0.796 0.898  2044 1.000
## Shasta          0.816 0.0772   10.58 0.001726 0.675 0.825 0.925  1998 1.000
## Sierra          0.723 0.2276    3.18 0.004201 0.333 0.667 1.000  2935 1.000
## Siskiyou        0.842 0.1004    8.39 0.001970 0.667 0.867 1.000  2595 0.999
## Solano          0.866 0.0634   13.67 0.001374 0.750 0.867 0.967  2125 1.002
## Sonoma          0.846 0.0620   13.64 0.001423 0.741 0.852 0.935  1900 1.000
## Stanislaus      0.812 0.0738   11.00 0.001822 0.681 0.824 0.901  1642 1.004
## Sutter          0.832 0.0904    9.20 0.001722 0.650 0.850 0.950  2758 1.000
## Tehama          0.842 0.1004    8.38 0.002095 0.647 0.882 1.000  2297 1.001
## Trinity         0.745 0.2051    3.63 0.004003 0.500 0.750 1.000  2624 1.000
## Tulare          0.831 0.0667   12.45 0.001492 0.709 0.836 0.927  2002 1.000
## Tuolumne        0.878 0.0968    9.07 0.002004 0.746 0.917 1.000  2333 0.999
## Ventura         0.840 0.0493   17.03 0.001061 0.758 0.845 0.913  2159 1.000
## Yolo            0.853 0.0736   11.59 0.001598 0.714 0.857 0.971  2125 1.001
## Yuba            0.793 0.1014    7.82 0.002037 0.632 0.789 0.947  2477 1.000
theta <- c(tapply(apipop$sch.wide == "Yes", apipop$cname, mean))  # true population quantities
plot_coef(summ, list(est=theta), n.se=2, est.names=c("mcmcsae", "true"), maxrows=30)

References

Battese, G. E., R. M. Harter, and W. A. Fuller. 1988. “An Error-Components Model for Prediction of County Crop Areas Using Survey and Satellite Data.” Journal of the American Statistical Association 83 (401): 28–36.
Rao, J. N. K., and I. Molina. 2015. Small Area Estimation. John Wiley & Sons.