De-identify ZCTA-based data

I'm trying to de-identify a data set that contains ZIP codes by merging adjacent ZCTAs so that each combined area has a population of 20,000 or more.

If you have any ideas, let me know. Perhaps someone has already done this and has a list.

Parents
  • Interesting, David.

    There are some existing tools that combine adjacent geographic areas to meet various objective functions (minimum population, minimum margin of error, etc.).  Some try to combine areas that are homogeneous by some attribute (demographics, income, etc.).  One such Python-based tool is described in a PLOS paper from a few years ago:
    Reducing Uncertainty in the American Community Survey through Data-Driven Regionalization | PLOS ONE

  • Dear Dave,

    I am doing a literature review of techniques for merging adjacent geographic areas for reporting statistics, and I came across a recent paper of yours, "Developing Geographic Areas for Cancer Reporting Using Automated Zone Design," in which you use SEER data. I've used the SEER data files, which (as you know) are based on counties, quite a bit over the years. It looks like the NCI is trying to improve on the current policy of using counties. As the problem I am working on is essentially identical to this, any guidance is welcome. I haven't read your paper yet, but I plan to do so soon.

    Best

    Dave

Children
  • Hi Dave -

    Glad you found the paper. You are right - we are working with the NCI SEER Program to create a set of cancer reporting zones that avoids the problems of reporting by county.  The objectives of these zones are to have at least 50,000 people, to be demographically and socioeconomically homogeneous, and to be relatively compact geographically.  We use a program called AZTool to identify sets of zones that meet these objectives.  It has a very robust set of features but a rather clunky user interface.  We use ACS data for the homogeneity factors.  

    I would be happy to chat more about our work and how it intersects with your ZCTA project.  Feel free to contact me via DM or my Westat email.  

    -- Dave S

  • Dear Dave,

    I couldn't (easily) find your Westat email, and I don't know if you want to communicate that way. I don't like to post emails on public forums. You can send me an email at info@dorerfoundation.org (which is listed on the dorerfoundation.org website) with your Westat email, and we can communicate that way if you like. I'm sure that you have many insights that I don't. I have a mathematics PhD, so I prefer to read papers with original research.

    Best

    Dave

  • I was able to download the R GAT package and get it running on my Ubuntu Linux machine, although it required some tweaks. It is written in R (no compiled code) and runs in RStudio on Windows. The source is linked at https://ajstamm.github.io/gatpkg/docs/dev/index.html and it seems to do a very nice job. It can aggregate polygons based on a population criterion as well as an additional "aggregation" variable. The current version is 1.61.0, but a new 2.0 is being worked on. The current version uses the R "sp" geo package, which is being phased out in favor of the "sf" package, which does similar things. It runs via a GUI that uses the R function choose.files(), which exists only on Windows. I have a simple R function that serves as a workaround on Ubuntu Linux; it is one line and requires the tcltk2 R add-on package. One additional modification is needed: you have to edit the package NAMESPACE file and delete the line that references choose.files. The program input is a shapefile containing the polygons, the aggregation variables, and a unique polygon identifier (the ZCTA in my case). The input parameters are entered via the GUI. It produces PDF output and a nice log file that documents your run. Almost as good as sliced bread, wheat toast, or a burger with fries!

    Dave
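
For anyone following along, the basic idea of aggregating adjacent areas up to a population minimum can be sketched in a few lines of Python. This is only a minimal greedy illustration and is not the algorithm used by GAT, AZTool, or the PLOS ONE tool. The function name, the toy area IDs, and the populations are all hypothetical. It assumes the adjacency list is symmetric and mentions only areas that appear in the population table; for real ZCTAs the adjacency would have to be derived from the shapefile. It ignores homogeneity and compactness entirely.

```python
def merge_until_threshold(population, adjacency, min_pop):
    """Greedily merge adjacent areas until each group has >= min_pop people.

    population: dict mapping area id -> population count
    adjacency:  dict mapping area id -> set of neighboring area ids
                (assumed symmetric, and restricted to ids in `population`)
    Returns a dict mapping each area id -> the id that labels its group.
    """
    # Every area starts as its own group, labeled by its own id.
    members = {a: {a} for a in population}
    pop = dict(population)
    nbrs = {a: set(adjacency.get(a, ())) for a in population}

    while True:
        # Groups still under the threshold that have a neighbor to merge with.
        small = [g for g in pop if pop[g] < min_pop and nbrs[g]]
        if not small:
            break
        # Take the smallest group first (id breaks ties) and merge it into
        # its least-populated neighbor.
        g = min(small, key=lambda x: (pop[x], x))
        h = min(nbrs[g], key=lambda x: (pop[x], x))
        members[h] |= members.pop(g)
        pop[h] += pop.pop(g)
        nbrs[h] |= nbrs.pop(g)
        nbrs[h] -= {g, h}
        # Neighbors of the merged group must now point at h instead of g.
        for n in nbrs[h]:
            nbrs[n].discard(g)
            nbrs[n].add(h)

    return {a: g for g, ms in members.items() for a in ms}


# Toy example: four hypothetical areas in a row, A - B - C - D.
population = {"A": 8000, "B": 5000, "C": 15000, "D": 30000}
adjacency = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
groups = merge_until_threshold(population, adjacency, 20000)
print(groups)
```

A group that has no neighbors and is still under the minimum (an island, say) is left as-is here and would need manual handling. The greedy choice of the least-populated neighbor is just one simple rule; the tools mentioned in this thread use richer criteria.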