I'm trying to de-identify a data set that has zip codes by merging adjacent ZCTAs so that each combined area has a population of 20,000 or more.
If you have any ideas, let me know. Perhaps someone has already done this and has a list.
Interesting, David.
There are some existing tools that combine adjacent geographic areas to meet various objective functions (minimum population, minimum margin of error, etc.). Some try to combine areas that are homogeneous by some attribute (demographics, income, etc.). One such Python-based tool is described in a PLOS paper from a few years ago: Reducing Uncertainty in the American Community Survey through Data-Driven Regionalization | PLOS ONE
Thanks Dave, very helpful. I'm using R, and I can combine polygons that touch a central/target polygon based either on the population of each touching polygon or on the smallest distance between the centroid of the central polygon and the centroids of the touching polygons. Since I'm working in R, it would be nice if there were an R program/package for this. Right now I'm using loops in R, which is slow; a C program would be nice to speed things up, and I'm familiar with calling C and Fortran from R. I think I called Python from R at some point, but it takes a big effort to get things set up.
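Roughly sketched, a single merge step looks like the following (using sf, though sp would work similarly; GEOID and POP are just placeholder column names, and this isn't my exact code):

```r
library(sf)

# Sketch of one merge step: union the target polygon with its
# nearest-centroid touching neighbor, summing populations and keeping
# a record of which ZCTAs went into the group.
merge_nearest_neighbor <- function(zcta, target) {
  # Row indices of polygons sharing a boundary with the target.
  # Recomputing adjacency on every call is the slow part.
  touching <- st_touches(zcta)[[target]]
  if (length(touching) == 0) return(zcta)  # isolated polygon

  # Of the touching polygons, take the one whose centroid is closest
  cents   <- st_centroid(st_geometry(zcta))
  d       <- st_distance(cents[target], cents[touching])
  nearest <- touching[which.min(as.numeric(d))]

  # Union the geometries, add the populations, track the merged IDs
  st_geometry(zcta)[target] <- st_union(st_geometry(zcta)[target],
                                        st_geometry(zcta)[nearest])
  zcta$POP[target]   <- zcta$POP[target] + zcta$POP[nearest]
  zcta$GEOID[target] <- paste(zcta$GEOID[target], zcta$GEOID[nearest],
                              sep = ";")
  zcta[-nearest, ]
}
```

The concatenated GEOID column is also how I'd build the dataset of ZCTAs per combined group mentioned below.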
Dave
PS: When I get this working, I'll make a dataset with rows giving the ZCTAs in each "condensed/combined" ZCTA group.
Maybe someone will be able to use it to de-identify a dataset with zip codes/ZCTAs.
Thank you for the article link, Dave S. It is quite interesting, not only in succinctly describing fairly well-known aspects of improving data quality, but also in presenting considerations that are less intuitive.
Does `rgeoda` have the tools that you need? geodacenter.github.io/.../lab8.html
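In particular, the max-p regionalization there looks close to what you describe: it builds contiguous regions subject to a minimum on a bound variable such as population. Very roughly (from memory, so check the docs; the path and the POP column are placeholders):

```r
library(sf)
library(rgeoda)

zcta <- st_read("ma_zcta.shp")   # placeholder path
w    <- queen_weights(zcta)      # queen contiguity weights

# max-p: maximize the number of regions such that each region's total
# POP is at least min_bound; clustering here just on POP itself
res <- maxp_greedy(w, zcta["POP"],
                   bound_variable = zcta["POP"],
                   min_bound = 20000)
zcta$group <- res$Clusters
```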
Dear Stas,
I noticed the R package but haven't had a chance to look into it yet; thanks for letting me know about it. I just got my R program working and it seems serviceable. I ran it on all the Massachusetts ZCTAs (there are 540 of them), aggregating them so that each group has a population of 10,000 or more; it took maybe 3 or 4 minutes. I merge the polygons that touch a selected polygon, and if there are multiple touching polygons I take the one with the smallest distance between centroids. I haven't tried a selection criterion based on the populations of the touching polygons. Merging on distance gives group populations ranging from 10,000 to about 219,000 for Massachusetts; merging on population might lower that upper limit. I'll have to try it.
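Schematically, the outer loop is just: while any group is below the threshold, merge one into its nearest touching neighbor. Building on the earlier sketch (again only a sketch; taking the smallest below-threshold group first is one reasonable ordering):

```r
# Driver loop: keep merging until every group meets the population
# floor, working on the smallest below-threshold group each pass.
aggregate_zctas <- function(zcta, min_pop = 10000) {
  repeat {
    below <- which(zcta$POP < min_pop)
    if (length(below) == 0) break
    target <- below[which.min(zcta$POP[below])]
    n_before <- nrow(zcta)
    zcta <- merge_nearest_neighbor(zcta, target)
    # If nothing merged, the target has no touching neighbors; stop
    # rather than loop forever (islands need special handling).
    if (nrow(zcta) == n_before) break
  }
  zcta
}
```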
Thanks again,
David -
It sounds like you've got your code running at this point. But here is an R package that might be useful to consult: GitHub - ajstamm/gatpkg: Geographic Aggregation Tool (GAT) for R
-- Dave
Dear Dave,
I am doing a literature review of techniques for merging adjacent geographic areas for reporting statistics, and I came across a recent paper of yours, "Developing Geographic Areas for Cancer Reporting Using Automated Zone Design," where you use SEER data. I've used the SEER data files, which (as you know) are based on counties, quite a bit over the years. It looks like the NCI is trying to improve on the current policy of using counties. As the problem I am working on is essentially identical to this, any guidance is welcome. I haven't read your paper yet, but I plan to do so soon.
Best
Hi Dave -
Glad you found the paper. You are right - we are working with the NCI SEER Program to create a set of cancer reporting zones that avoids the problems of reporting by county. The objectives of these zones are to have at least 50,000 people, to be demographically and socioeconomically homogeneous, and to be relatively compact geographically. We use a program called AZTool to identify sets of zones that meet these objectives. It has a very robust set of features but a rather clunky user interface. We use ACS data for the homogeneity factors.
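As an aside that may help with your ZCTA zones: a standard screen for geographic compactness is the Polsby-Popper score, 4πA/P², which is 1 for a circle and approaches 0 for sprawling shapes. A quick sf sketch (not necessarily the measure AZTool uses internally):

```r
library(sf)

# Polsby-Popper compactness: 4*pi*area / perimeter^2, in (0, 1]
polsby_popper <- function(geom) {
  a <- as.numeric(st_area(geom))
  p <- as.numeric(st_length(st_cast(geom, "MULTILINESTRING")))
  4 * pi * a / p^2
}
```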
I would be happy to chat more about our work and how it intersects with your ZCTA project. Feel free to contact me via DM or my Westat email.
-- Dave S
I couldn't (easily) find your Westat email, and I don't know if you want to communicate that way; I don't like to post emails on public forums. You can send me an email at info@dorerfoundation.org (which is listed on the dorerfoundation.org website) with your Westat email, and we can communicate that way if you like. I'm sure that you have many insights that I don't. I have a mathematics PhD, so I prefer to read papers with original research.
I was able to download the R GAT package and get it running on my Ubuntu Linux machine; it required some tweaks. It is written entirely in R (no compiled code) and runs in RStudio on Windows. The source is at https://ajstamm.github.io/gatpkg/docs/dev/index.html It seems to do a very nice job: it can aggregate polygons based on a population criterion as well as an additional "aggregation" variable/criterion. The current version is 1.61.0, but a new 2.0 is being worked on. The current version uses the R "sp" geo package, which is being phased out in favor of the "sf" package that does similar things.

It runs via a GUI that uses the choose.files() R function, which only exists on Windows. I have a simple R function that serves as a workaround on Ubuntu Linux; the workaround code is one line and requires the tcltk2 add-on package. There is one additional modification, an edit to the package NAMESPACE file: you have to delete the line that references choose.files.

The program input consists of a shapefile with the polygons, the aggregation variables, and a unique polygon identifier (the ZCTA in my case). The input parameters are entered via the GUI. It produces PDF output and a nice log file that serves as documentation for your run. Almost as good as sliced bread, wheat toast, or a burger with fries!
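For anyone else trying this on Linux, the workaround is essentially a stand-in for the Windows-only choose.files() that forwards to the cross-platform Tk file dialog. Something along these lines (tk_choose.files() lives in the base tcltk package, which tcltk2 builds on; note the filters argument formats differ between the two functions):

```r
# Linux stand-in for the Windows-only utils::choose.files(),
# forwarding to the cross-platform Tk file dialog
choose.files <- function(default = "", caption = "Select files",
                         multi = TRUE, filters = NULL, index = 1) {
  tcltk::tk_choose.files(default = default, caption = caption,
                         multi = multi, filters = filters)
}
```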