A Threat to the ACS from the Bureau Itself

1. Yesterday at the ACS Data Users Conference, the Census Bureau described its plans to replace the American Community Survey (ACS) microdata with “fully synthetic” data over the next three years.
/2. Details of the methodology have not been disclosed, but the idea is to develop models describing the interrelationships of all the variables in the ACS, and then construct a simulated population consistent with those models. 
/3. Such modeled data capture relationships between variables only if they have been intentionally included in the model. Accordingly, synthetic data are poorly suited to studying unanticipated relationships, which impedes new discovery. 
/4. The large size of the ACS means that it is possible to study small population subgroups, but the synthetic data cannot capture all the ways in which interrelationships among variables can vary across subgroups. 
/5. For example, the synthetic data would certainly incorporate a general relationship between income and education but could not assess that relationship separately for every possible subgroup. 
/6. The relationship of income and education might be different for American Indians in South Dakota or Asian Indians in Queens. 
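The subgroup point can be illustrated with a toy simulation. This is only a sketch under loud assumptions: the Bureau's methodology has not been disclosed, so the pooled linear synthesizer below, the group labels, the slopes, and the sample sizes are all invented for illustration and are not ACS values or the Bureau's actual model.

```python
import random


def ols_fit(xs, ys):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def make_real_population(rng, n_large=20000, n_small=500):
    """Hypothetical 'real' data: income depends on education, but the
    slope differs between a large group and a small subgroup."""
    rows = []
    for group, n, slope in (("large", n_large, 5.0), ("small", n_small, 1.0)):
        for _ in range(n):
            education = rng.randint(8, 20)
            income = 10.0 + slope * education + rng.gauss(0, 5.0)
            rows.append((group, education, income))
    return rows


def synthesize(rng, real_rows):
    """Assumed synthesizer: keeps group and education, but regenerates
    income from ONE pooled income-on-education model. The group-specific
    relationship was never put in the model."""
    educations = [e for _, e, _ in real_rows]
    incomes = [y for _, _, y in real_rows]
    a, b = ols_fit(educations, incomes)
    squared_error = sum((y - a - b * e) ** 2 for _, e, y in real_rows)
    residual_sd = (squared_error / (len(real_rows) - 2)) ** 0.5
    return [(g, e, a + b * e + rng.gauss(0, residual_sd)) for g, e, _ in real_rows]


def subgroup_slope(rows, group):
    """Income-on-education slope estimated within one group only."""
    xs = [e for g, e, _ in rows if g == group]
    ys = [y for g, _, y in rows if g == group]
    return ols_fit(xs, ys)[1]


rng = random.Random(0)
real_rows = make_real_population(rng)
synthetic_rows = synthesize(rng, real_rows)

real_small_slope = subgroup_slope(real_rows, "small")
synthetic_small_slope = subgroup_slope(synthetic_rows, "small")
print("small-group slope, real data:     ", round(real_small_slope, 2))
print("small-group slope, synthetic data:", round(synthetic_small_slope, 2))
```

Under these assumptions the small group's slope in the synthetic file lands near the pooled slope, which is dominated by the large group, rather than near the small group's own slope. A richer synthesizer could add a group interaction, but only for the subgroups someone thought to model in advance.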
/7. The power of ACS microdata in large measure derives from their hierarchical structure: individuals are nested in households, and the interrelationships of household members are known. 
/8. This allows analysis of millions of potential associations across household members. For example, investigators can measure ethnic intermarriage, or the impact of a partner’s education on women’s fertility. 
/9. The synthetic data apparently incorporate only individual-level interrelationships among variables, so analysis across household members will be impossible. 
/10. These limitations are important because the ACS microdata are the most intensively used source available for demographic and economic research. 
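A similar toy sketch shows what is lost if synthesis really does work only at the individual level. Everything here is an assumption made for illustration: the couples, the education values, and the "draw each person independently" synthesizer are hypothetical and do not describe the Bureau's undisclosed method.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def make_real_couples(rng, n=5000):
    """Hypothetical households: partners' years of education share a
    common component, so they are correlated within the household."""
    couples = []
    for _ in range(n):
        shared = rng.gauss(13.0, 2.5)
        couples.append((shared + rng.gauss(0, 1.5), shared + rng.gauss(0, 1.5)))
    return couples


def synthesize_individually(rng, couples):
    """Assumed individual-level synthesizer: each synthetic person is drawn
    independently from the pooled distribution of education, so the
    person-level distribution is preserved but the household link is not."""
    pool = [education for couple in couples for education in couple]
    return [(rng.choice(pool), rng.choice(pool)) for _ in couples]


rng = random.Random(1)
real_couples = make_real_couples(rng)
synthetic_couples = synthesize_individually(rng, real_couples)

real_r = pearson([a for a, _ in real_couples], [b for _, b in real_couples])
synthetic_r = pearson([a for a, _ in synthetic_couples], [b for _, b in synthetic_couples])
print("partner education correlation, real:     ", round(real_r, 2))
print("partner education correlation, synthetic:", round(synthetic_r, 2))
```

In this sketch the individual education distribution survives synthesis, but the correlation between partners collapses toward zero, which is the kind of cross-household-member association the thread says would no longer be analyzable.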
/11. Hundreds of thousands of academic researchers, planners, and policy makers rely on the ACS, and according to Google Scholar they generate about 12,000 publications per year. 
/12. Common topics of analysis include poverty, inequality, immigration, internal migration, ethnicity, residential segregation, disability, transportation, fertility, nuptiality, occupational structure, education, and family change. 
/13. If public use data become unusable or inaccessible because of overzealous disclosure control, there will be far-reaching consequences. The quantity and quality of research about U.S. policies, the economy, and social structure will decline precipitously. 
/14. The Census Bureau appears to recognize that synthetic data will be inappropriate for most research purposes. 
/15. The Census Bureau proposes a system whereby investigators would develop analyses using synthetic data, and then submit them to the Census Bureau for “validation” using real data. 
/16. One problem is that investigators need access to the real data for exploratory analysis to discover the relevant variables to incorporate in their analyses. 
/17. Another problem is logistical: The Census Bureau is not equipped or funded to carry out the tens or hundreds of thousands of validation analyses that would be needed to replace current usage. 
/18. And the results of the validation runs would then have to go through disclosure review, and the Census Bureau also lacks the capacity to do that work at scale. 
/19. The reason the Census Bureau wants to get rid of one of the world’s most intensively used scientific resources is concern about respondent confidentiality. 
/20. The Census Bureau implicitly acknowledges that there is not a single documented case of reidentification of a respondent in the ACS or decennial census microdata. 
/21. Over 100 countries around the world disseminate similar microdata through @ipums, and again there is not a single documented case in which a respondent’s identity has been revealed. 
/22. In the presentation yesterday, the Bureau maintained that
/23. Not only are the risks of disclosure unmeasurably infinitesimal, if by some miracle someone’s ACS data were exposed the resulting harms would be minimal. 
/24. The ACS has no information that could be used to aid identity theft, and most of the information it does include could far more easily be obtained from other sources. 
/25. If we weigh the profound cost of eliminating the ACS microdata against fanciful benefits for respondent confidentiality, the Census Bureau has no case. 
/26. Such a massive shift in the Nation’s statistical infrastructure would be “arbitrary and capricious, an abuse of discretion” and therefore in violation of the Administrative Procedure Act. 
/27. Although some Census Bureau staff members treat the synthetic ACS as if it were a done deal, there is still time to avert this disastrous course. 
/28. Acting Census Director @jarmin_ron or Census Director nominee @_Rob_Santos may decide to back away from the precipice. 
/29. It may be necessary, however, for the research community to pursue political or legal strategies to retain open access to the crown jewels of demographic data infrastructure. 
/30. To stay informed as things develop, watch this space. We will also post updates as we learn more at ipums.org/changes-to-cen…. /fin.
  • I was surprised to see this topic come up in the City Observatory newsletter this week, so I am copying their comment below.  (City Observatory is a newsletter that looks at urban planning issues through an economic lens.):

    New Knowledge

    Synthetic microdata:  A threat to knowledge.  Each week at City Observatory, we usually profile an interesting or provocative research study.  This week, we're spending a minute to highlight a potential threat to a key source of data that helps us better understand our world, and especially the nation's cities: the public use microsample of the American Community Survey (ACS).  The ACS is the nation's largest and most valuable source of data on population, housing, social and economic characteristics.  While the Census Bureau produces many tabulations of these data, it's impossible to slice and dice data in a way that bears on every question.  So the Census Bureau makes available what is called a "public use microsample," which allows researchers to craft their own customized tabulations of these data to answer specific questions.  At City Observatory, for example, we've used these data to estimate the income, race and ethnicity of peak-hour, drive-alone commuters traveling from suburban Washington State to jobs in Oregon--a question that would be essentially impossible to answer from either published Census tabulations or other publicly available data.

    Microdata are valuable because they link answers to different ACS questions--linking a person's age, gender or race to their income, occupation or housing type, and so on.  But because the microdata are individual survey responses, some are concerned that there's a potential violation of privacy:  that someone could use answers to a series of questions to deduce the identity of an individual survey respondent.  While that may technically be a possibility, there's no evidence it occurs in practice.  Still, the Census Bureau is hypersensitive about privacy concerns, and has proposed replacing actual microdata with "synthetic" microdata, in order to make it even more difficult to identify an individual.  Essentially, synthetic data would replace actual patterns of responses with statistically modeled responses.  The trouble is, this modeled, synthetic data actually subtracts information, and makes it impossible for researchers to know whether the answers to any particular question are a product of actual variation, or just a quirk of the Census Bureau's model.  As University of Minnesota data expert Steven Ruggles puts it, "synthetic data will be useless for research."

    The privacy threat from ACS microdata is a phantom menace.  Ruggles and a colleague at the University of Minnesota have just published a paper showing that attempting to use Census microdata to create individually identifiable records via database reconstruction would produce vastly more random (i.e., false) matches than real ones.  This undercuts the idea that microdata is an actual threat to privacy.

    But a proposal to replace PUMS data with synthetic data is a real threat to our ability to better understand our world.  It is like requiring piano players to wear mittens when playing Beethoven sonatas:  the piano will still produce sound, but the result will be noise, not music.

    Mike Schneider, "Census Bureau's use of 'synthetic data' worries researchers: Some researchers are up in arms about a U.S. Census Bureau proposal to add privacy protections by manipulating numbers in the data most widely used for economic and demographic research," ABC News, May 27, 2021.

    Steven Ruggles and David Van Riper, "The Role of Chance in the Census Bureau Database Reconstruction Experiment," University of Minnesota, May 2021, Working Paper No. 2021-01. DOI: https://doi.org/10.18128/MPC2021-01
