De-identifying ZCTA-based data

I'm trying to de-identify a data set that contains zip codes by merging adjacent ZCTAs so that each combined area has a population of 20,000 or more.

If you have any ideas, let me know. Perhaps someone has already done this and has a list.

  • I've worked on this de-identification project a little more. I've condensed/combined US zip codes (ZCTAs) into 6184 groups; there are approximately 32923 individual ZCTAs. Each group has a population of 20,000 or more. I have R code to do this using the 2020 ZCTAs from the 2022 TIGER/Line shapefile: https://www2.census.gov/geo/tiger/TIGER2022/ZCTA520/tl_2022_us_zcta520.zip This allows you to take a file that has "protected/individually identifiable" information (a numeric count vector) and zip codes (ZCTAs) and sum the count vector over each ZCTA group. The resulting 6184 values are de-identified once you also recode combined counts of 1-5 as 5.
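
    The summing/recoding step is simple once the groups exist. Here is a minimal sketch, assuming a hypothetical lookup table zcta_groups (columns zcta and group_id) produced by the merging step, and a data frame counts with columns zcta and n:

      # Sketch: sum a count vector over ZCTA groups, then recode small counts.
      # zcta_groups (zcta, group_id) is a hypothetical lookup table from the
      # merging step; counts (zcta, n) holds the identifiable counts.
      library(dplyr)

      deidentify_counts <- function(counts, zcta_groups) {
        counts |>
          inner_join(zcta_groups, by = "zcta") |>
          group_by(group_id) |>
          summarise(n = sum(n), .groups = "drop") |>
          mutate(n = ifelse(n >= 1 & n <= 5, 5, n))  # recode counts of 1-5 as 5
      }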

    If you read the HIPAA guidance for de-identifying Protected Health Information (PHI), https://www.hhs.gov/sites/default/files/ocr/privacy/hipaa/understanding/coveredentities/De-identification/hhs_deid_guidance.pdf, you can see the details. It recommends using the first 3 digits of the zip code as a geocode when de-identifying a data set with zip codes. By using the grouped/combined zip code areas above, you get areas with populations of 20,000+ as per the HIPAA guidance, but with substantially smaller geographic areas than the "first 3 digits of the zip code" areas. I'll clean up my R program and upload it.
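
    For comparison, the Safe Harbor geocode is just a truncation. A sketch (restricted_zip3 is a hypothetical list of the 3-digit areas with 20,000 or fewer people, which the guidance says must be recoded to 000):

      # HIPAA Safe Harbor 3-digit ZIP geocode (sketch).
      zip3 <- substr(zips, 1, 3)
      zip3[zip3 %in% restricted_zip3] <- "000"  # small 3-digit areas -> "000"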

    If you run a service program for an organization and have an address list, you can use this method to take the zip codes from the address file and make a (heat) map with binned counts of participants in each of the ZCTA group geographies. The resulting map will be de-identified and can be distributed to the public. You can also use the exact addresses with the Census geocoder, https://geocoding.geo.census.gov/geocoder/geographies/addressbatch?form, to get the census tract. Census tracts can be grouped in a similar way, and a different minimum size, like 10,000, could be chosen.
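
    The mapping step might look something like the following sketch, assuming groups_sf is a hypothetical sf data frame of the dissolved group polygons (group_id, geometry) and deid holds the de-identified counts per group:

      # Sketch: choropleth of binned, de-identified counts by ZCTA group.
      library(sf)
      library(dplyr)
      library(ggplot2)

      map_df <- left_join(groups_sf, deid, by = "group_id")
      map_df$bin <- cut(map_df$n, breaks = c(0, 5, 25, 100, 500, Inf))

      ggplot(map_df) +
        geom_sf(aes(fill = bin), color = NA) +
        scale_fill_viridis_d(na.value = "grey90", name = "Participants") +
        theme_void()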

    Dave

  • Interesting, David.

    There are some existing tools that combine adjacent geographic areas to meet various objective functions (minimum population, minimum margin of error, etc.).  Some try to combine areas that are homogeneous by some attribute (demographics, income, etc.).  One such Python-based tool is described in a PLOS paper from a few years ago:
    "Reducing Uncertainty in the American Community Survey through Data-Driven Regionalization" (Spielman & Folch, PLOS ONE, 2015)

  • Thanks, Dave, very helpful. I'm using R, and I can combine polygons that touch a central/target polygon based either on the populations of the touching polygons or on the smallest distance between the centroid of the central polygon and the centroids of the touching polygons (see the sketch below). Since I'm working in R, it would be nice if there were an R program/package. Right now I'm using loops in R, which is slow; a C program would be nice to speed things up. I'm familiar with calling C and Fortran from R. I think I called Python from R at some point, but it requires a big effort to get things set up.
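
    The distance-based neighbor selection looks roughly like this in sf (a sketch; zctas is assumed to be an sf object with columns zcta, pop, and geometry):

      library(sf)

      # For target polygon i, return the index of the touching polygon whose
      # centroid is closest to the target's centroid (NA if nothing touches).
      nearest_touching <- function(zctas, i) {
        nbrs <- st_touches(zctas)[[i]]
        if (length(nbrs) == 0) return(NA_integer_)
        cents <- st_centroid(st_geometry(zctas))
        d <- as.numeric(st_distance(cents[i], cents[nbrs]))
        nbrs[which.min(d)]
      }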

    Dave

    PS: When I get this working, I'll make a dataset with rows giving the ZCTAs for each "condensed/combined" ZCTA group.

    Maybe someone will be able to use it to de-identify a dataset with zip codes/ZCTAs.

  • Thank you for the article link, Dave S. It is quite interesting, not only in succinctly describing fairly well-known aspects of improving data quality, but also in presenting considerations that are less intuitive.

  • Dear Stas,

    I noticed the R package but I haven't had a chance to look into it yet; thanks for letting me know about it. I just got my R program working and it seems to be serviceable. I ran it on all 540 Massachusetts zip codes, aggregating the ZCTAs so that the groups have a population of 10,000 or more. It took maybe 3 or 4 minutes. I merge the polygons that touch a selected polygon; if there are multiple touching polygons, I take the one with the smallest distance between centroids (a sketch of the loop is below). I haven't tried a selection criterion based on the populations of the touching polygons. Merging on distance gives group populations ranging from 10,000 to about 219,000 for Massachusetts; merging on population might lower that upper limit. I'll have to try it.
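
    The merge loop is roughly the following sketch. Recomputing st_touches() on every pass is the slow part, and truly isolated polygons simply stop the loop here, which is a simplification:

      library(sf)

      # Greedy aggregation sketch: repeatedly merge the smallest
      # under-threshold polygon into its nearest-centroid touching
      # neighbour. zctas: hypothetical sf object (zcta, pop, geometry).
      aggregate_zctas <- function(zctas, min_pop = 10000) {
        repeat {
          small <- which(zctas$pop < min_pop)
          if (length(small) == 0) break
          i <- small[which.min(zctas$pop[small])]
          nbrs <- st_touches(zctas)[[i]]       # recomputed each pass: slow
          if (length(nbrs) == 0) break         # isolated polygon: give up
          cents <- st_centroid(st_geometry(zctas))
          j <- nbrs[which.min(as.numeric(st_distance(cents[i], cents[nbrs])))]
          zctas$geometry[j] <- st_union(zctas$geometry[i], zctas$geometry[j])
          zctas$pop[j] <- zctas$pop[i] + zctas$pop[j]
          zctas <- zctas[-i, ]
        }
        zctas
      }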

    Thanks again,

    Dave

  • David - 

    It sounds like you've got your code running at this point.  But here is an R package that might be useful to consult:
    gatpkg, the Geographic Aggregation Tool (GAT) for R: https://github.com/ajstamm/gatpkg

    -- Dave

  • Dear Dave,

    I am doing a literature review of techniques for merging adjacent geographic areas for reporting statistics, and I came across a recent paper of yours, "Developing Geographic Areas for Cancer Reporting Using Automated Zone Design," where you use SEER data. I've used the SEER data files, which (as you know) are based on counties, quite a bit over the years. It looks like the NCI is trying to improve on the current policy of using counties. As the problem I am working on is essentially identical to this, any guidance is welcome. I haven't read your paper yet, but I plan to do so soon.

    Best

    Dave

  • Hi Dave -

    Glad you found the paper. You are right - we are working with the NCI SEER Program to create a set of cancer reporting zones that avoids the problems of reporting by county.  The objectives of these zones are to have at least 50,000 people, to be demographically and socioeconomically homogeneous, and to be relatively compact geographically.  We use a program called AZTool to identify sets of zones that meet these objectives.  It has a very robust set of features but a rather clunky user interface.  We use ACS data for the homogeneity factors.  
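
    To make the objectives concrete, here is a toy score in R (not AZTool itself; tracts, pop, and med_income are hypothetical inputs), using a coefficient of variation for homogeneity and the Polsby-Popper ratio for compactness:

      library(sf)

      # Toy illustration: score a candidate zone on the three objectives.
      # tracts: hypothetical sf object with columns pop and med_income for
      # the zone's member areas, in a projected CRS.
      zone_score <- function(tracts, min_pop = 50000) {
        geom <- st_union(st_geometry(tracts))
        list(
          pop_ok      = sum(tracts$pop) >= min_pop,
          # coefficient of variation: lower = more homogeneous
          cv_income   = sd(tracts$med_income) / mean(tracts$med_income),
          # Polsby-Popper compactness: 4*pi*area / perimeter^2, in (0, 1]
          compactness = as.numeric(4 * pi * st_area(geom) /
                                     st_length(st_boundary(geom))^2)
        )
      }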

    I would be happy to chat more about our work and how it intersects with your ZCTA project.  Feel free to contact me via DM or my Westat email.  

    -- Dave S

  • Dear Dave,

    I couldn't (easily) find your Westat email, and I don't know if you want to communicate that way; I don't like to post email addresses on public forums. You can send me an email at info@dorerfoundation.org (which is listed on the dorerfoundation.org website) with your Westat email, and we can communicate that way if you like. I'm sure you have many insights that I don't. I have a mathematics PhD, so I prefer to read papers with original research.

    Best

    Dave

    I was able to download the R GAT package and get it running on my Ubuntu Linux machine; it required some tweaks. It is written in R (no compiled code) and runs in RStudio on Windows. The source is at https://ajstamm.github.io/gatpkg/docs/dev/index.html It seems to do a very nice job: it can aggregate polygons based on a population criterion as well as an additional "aggregation" variable/criterion. The current version is 1.61.0, but a new 2.0 is being worked on. The current version uses the R "sp" geo package, which is being phased out in favor of the "sf" package, which does similar things.

    It runs via a GUI that uses the choose.files() R function, which only exists on Windows. I have a simple R function that serves as a workaround on Ubuntu Linux (see the sketch below). There is one additional modification, an edit to the package NAMESPACE file: you have to delete a line that references choose.files. The workaround code is one line and requires the tcltk2 R add-on package.

    The program input consists of a shapefile with the polygons, the aggregation variables, and a unique polygon identifier (the ZCTA in my case). The input parameters are entered via the GUI. It produces PDF output and a nice log file that serves as documentation for your computer run. Almost as good as sliced bread, wheat toast, or a burger with fries!
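
    Expanded slightly from the one-liner, the workaround is along these lines (a sketch; note that tk_choose.files() itself ships with the standard tcltk package):

      # Stand-in for the Windows-only choose.files() on Linux (sketch).
      library(tcltk2)  # loads tcltk as a dependency
      choose.files <- function(default = "", caption = "Select files",
                               multi = TRUE, ...) {
        tcltk::tk_choose.files(default = default, caption = caption,
                               multi = multi)
      }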

    Dave