Small Area Estimation R Package

Dear ACS DUG,

I have posted an early Windows installer for a package that makes tract-level synthetic data from PUMS data. It can also merge in the Supplemental Poverty Measure PUMS data files, should you want to use variables from that dataset. See dorerfoundation.org/software for more information and a link to the GitHub webpage.

Feedback is welcome. The package is an early development version.

Dave Dorer

A Linux (developed on Ubuntu) installer is also available.  You need gcc to compile one C routine.

  • I just posted an updated version 0.7 Windows installer on Wed 20 Dec 2023 at about 11:30 am.

    The package works with 2022 ACS 5-year marginal tables and 2022 1-year PUMS data. Watch out: the 2022 1-year PUMS data uses the 2022 PUMA FIPS codes, which are different from the 2021 FIPS codes. The 2022 5-year PUMS data is due out 24 Jan 2024.

    The current Supplemental Poverty Measure PUMS data is the 2021 1-year file, which merges with the 2021 1-year "regular" PUMS data.

    Dave

  • One more thing,

    Nonprofit 501(c)(3) and government entities have access to free support to get the package working (including Zoom sessions).

    Dave

  • Dear DUG,

    I've posted a version 0.8 Windows installer/zip file. It includes functions to compute replicate weights and MoEs for tabulations of synthetic data.

    Dave
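
For readers unfamiliar with replicate-weight MoEs, here is a minimal Python sketch of the formula the Census Bureau documents for ACS PUMS tabulations: tabulate once with the full-sample weight and once with each replicate weight, then take 4/R times the sum of squared deviations as the variance (R = 80 for PUMS). The function name and the numbers are made up for illustration, and the package's own method for synthetic data may well differ.

```python
import math

def replicate_moe(full_estimate, replicate_estimates, z=1.645):
    """90% margin of error from replicate estimates (illustrative helper)."""
    n_reps = len(replicate_estimates)
    # Successive-difference replication: variance = (4 / R) * sum of squared deviations.
    variance = (4.0 / n_reps) * sum(
        (rep - full_estimate) ** 2 for rep in replicate_estimates
    )
    return z * math.sqrt(variance)

# Hypothetical tabulation: one full-sample count and 80 made-up replicate counts,
# each deviating from the full-sample count by exactly 10.
full = 1200.0
reps = [full + (10.0 if i % 2 == 0 else -10.0) for i in range(80)]
moe = replicate_moe(full, reps)
print(round(moe, 1))
```

With these made-up inputs the variance is 400, so the standard error is 20 and the 90% MoE is 1.645 times that.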

  • Dear All,

    I continue to add functions, most of which are diagnostic and test the model fit. If you downloaded the package earlier, you might download it again and run install.packages("PAT",repos=NULL). For example, there is a new function (not documented yet) that takes previously created synthetic data for census tracts, aggregates it "up" to the county subdivision ("CSD" or cosub) level, and then checks the CSD synthetic data marginal tables against the Detail ACS tables for the CSD. The following statistic is computed:

     delta = synthetic data tabulation - Detail table "Est";

    then compute 1.645*delta / Detail table MoE.

    This corresponds to something like a standardized residual for the synthetic data table fit and can be used to locate table cells with a bad fit.

    In general you wouldn't expect the aggregated CSD synthetic data marginals to agree exactly with the CSD Detail tables, as the synthetic data only fits the "lower order" interactions corresponding to the marginal tables used in the synthetic data PUMS model fit. The "higher" level interactions for the PUMS/PUMA will not be exactly the same as the higher level interactions within the CSD. For example, the Age x Sex x Race PUMA interaction will not be the same as the Age x Sex x Race interaction within the CSD.

    Other than one user, I haven't received feedback. It would be useful if people who have tried the package would send comments.

    In keeping with the mission of the Dorer Community Service Foundation, I am available to provide free consulting to nonprofit 501(c)(3) and government entities. See dorerfoundation.org. Send email to the "info" address listed on the "Contact Us" webpage.

    Dave
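
The fit statistic described in the post above (delta divided by the Detail table MoE, scaled by 1.645) can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the package's code: the function name, the cell values, and the cutoff of 2 used to flag a cell are all assumptions.

```python
def fit_residuals(synthetic, detail_est, detail_moe, z=1.645):
    """Standardized residual per table cell: z * (synthetic - Est) / MoE."""
    residuals = []
    for syn, est, moe in zip(synthetic, detail_est, detail_moe):
        delta = syn - est
        # A zero MoE gives no scale for the residual, so report NaN for that cell.
        residuals.append(z * delta / moe if moe > 0 else float("nan"))
    return residuals

# Made-up CSD table cells: aggregated synthetic counts vs. Detail table Est and MoE.
synthetic = [520.0, 300.0, 95.0]
detail_est = [500.0, 330.0, 90.0]
detail_moe = [40.0, 20.0, 25.0]

resid = fit_residuals(synthetic, detail_est, detail_moe)
# Hypothetical rule of thumb: flag cells whose residual exceeds 2 in absolute value.
flagged = [i for i, r in enumerate(resid) if abs(r) > 2.0]
print(resid, flagged)
```

Because a 90% MoE is 1.645 standard errors, multiplying delta / MoE by 1.645 puts the residual on roughly a z-score scale, which is why a cutoff near 2 is a natural (though assumed) choice.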

  • I'm continuing to work on replicate weights and standard errors. If you have tried the package or have any questions send an email to info@dorerfoundation.org.  Support is free for 501(c)(3) nonprofits and government entities.

    Dave

  • Here is the latest update on the package for synthesizing tract level data from PUMA PUMS data and tract level detail tables.

    After much research and programming I am able to create replicate weights for the synthetic tract level data using a technique from:

    Fuller, Wayne A. 1998. “Replication Variance Estimation for Two-Phase Samples.” Statistica Sinica 8 (4): 1153–64.

    where you "perturb" the replicate weights from the PUMS data set using the Margin of Error for the Detail tables marginal totals.

    I'm using the "grake" function from the R "survey" package, which is called inside calibrate_to_estimate in the "svrep" package. I've been able to test the method on PUMS data with the Age x Sex marginal from B01001 together with the Employed marginal from B23025.

    This takes about 30 seconds for a single census tract. The PUMS table is Age x Sex x Race x Poverty x Employed.  This is a small set of variables and marginals.  I have run models with about a dozen PUMS variables along with about a dozen marginal tables using IPF without replicate weights. This takes about 3 days for all the tracts in a state.

    In any case I have to "dust off" my FORTRAN 77 experience and code all the matrix calculations and iteration loops so that the larger problems will be feasible in a reasonable amount of "clock" time.

    I'll release an updated package when I am further along.

    Best,

    Dave Dorer
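
For anyone new to IPF, mentioned in the post above, here is a toy two-way sketch in Python. It rakes a made-up PUMA-level seed table to made-up tract marginals and shows that the marginals are matched while the interaction (the odds ratio) is inherited from the seed, which is the "lower order" versus "higher level" point made earlier in the thread. All names and numbers are hypothetical; the package's actual models involve many more variables and marginal tables.

```python
def ipf(seed, row_targets, col_targets, iters=50):
    """Iterative proportional fitting of a two-way table to row and column totals."""
    table = [row[:] for row in seed]  # copy so the seed is left unchanged
    for _ in range(iters):
        # Scale each row to its target total.
        for i, row in enumerate(table):
            total = sum(row)
            if total > 0:
                table[i] = [v * row_targets[i] / total for v in row]
        # Scale each column to its target total.
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in table)
            if total > 0:
                for row in table:
                    row[j] *= target / total
    return table

def odds_ratio(t):
    """Interaction measure for a 2 x 2 table."""
    return (t[0][0] * t[1][1]) / (t[0][1] * t[1][0])

seed = [[40.0, 10.0], [20.0, 30.0]]  # made-up PUMS cross-tab (e.g. Sex x Employed)
row_targets = [60.0, 40.0]           # made-up tract marginal for the row variable
col_targets = [55.0, 45.0]           # made-up tract marginal for the column variable

fitted = ipf(seed, row_targets, col_targets)
print(fitted, odds_ratio(seed), odds_ratio(fitted))
```

Row and column scaling cannot change the odds ratio, so the fitted tract table carries the PUMA's interaction regardless of what the true tract interaction is.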

  • Where is the link? I don't see it. 

  • Go to dorerfoundation.org and click the software tab (across the top), which takes you to dorerfoundation.org/software. Near the bottom, "Link to Github website" will take you to the GitHub webpage. Download PAT_0.9.zip, then from within R use install.packages(<path to PAT_0.9.zip>, repos=NULL).

    You need to create some folders for storing data. Also from within R, use PAT.root() to set the working directory where you created the folders.

    Let me know if you have any difficulties or questions. You might post some information on your profile page so I know a little about your background. Send email to the address on the "Contact" page of the website. I have received feedback from only one person, so I can't tell if there are any usability issues.

    There are vignettes on the GitHub webpage, so you can download the PDFs and read them without installing the package.

    Dave