Quasi-experimental design using PUMS data

I want to do a quasi-experimental design (QED) analysis on ACS data for causal inference -- specifically between broadband and income. Before proceeding, I want to understand the nuances of using such a method on ACS data. I could not find any existing work that has done QED matching with the PUMS data. Does anyone know of any work using it, or criticizing such methods?



  • Since I've spent many years as a consulting statistician, I'll ask you the first question I ask whenever I work on a project with someone new: "What is the scientific question that you would like to answer?" Here is an example. Using the language of epidemiology, what is your outcome? Having broadband? What is your treatment? High vs. low income? In this case, what you want to show is that households with a higher income are more likely to have broadband, "all other things being equal." If there are other factors that might affect the likelihood of having broadband, you want to factor them into the analysis. For example, occupation might also affect whether or not you have broadband.

    One QED method is propensity score matching: https://en.wikipedia.org/wiki/Propensity_score_matching. Conceptually, a much simpler method is to get the adjusted (for occupation, in this example) effect of income on having broadband. An analysis like this only requires multivariate logistic regression or (as is much more common with census data) loglinear models. Loglinear models are related to "raking," a common method used with census data. You can also go with the propensity-score-based method, which is harder to understand. By the way, the propensity score is usually constructed with logistic regression, so the two approaches are related. Sorry for typos, syntax, etc.; the font is so small on my computer I can barely read this.
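
    To make "adjusted for occupation" concrete, here is a minimal sketch. It uses a Mantel-Haenszel pooled odds ratio across occupation strata, which is a simpler stand-in for a fitted logistic regression, not the regression itself. The occupation labels and all counts are made up for illustration; they are not ACS figures.

```python
# Hypothetical 2x2 counts per occupation stratum (made-up numbers, not ACS data).
# Each tuple: (high income & broadband, high income & no broadband,
#              low income & broadband,  low income & no broadband)
strata = {
    "office": (90, 10, 60, 20),
    "manual": (30, 20, 40, 60),
}


def crude_odds_ratio(tables):
    """Odds ratio after collapsing all strata into one 2x2 table (no adjustment)."""
    a = sum(t[0] for t in tables)
    b = sum(t[1] for t in tables)
    c = sum(t[2] for t in tables)
    d = sum(t[3] for t in tables)
    return (a * d) / (b * c)


def mantel_haenszel_odds_ratio(tables):
    """Pooled odds ratio that holds the stratifying variable fixed."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return num / den


tables = list(strata.values())
crude = crude_odds_ratio(tables)
adjusted = mantel_haenszel_odds_ratio(tables)
print(f"crude OR = {crude:.2f}, occupation-adjusted OR = {adjusted:.2f}")
```

    In this toy data the adjusted odds ratio is smaller than the crude one, because occupation is related to both income and broadband. A logistic regression with occupation as a covariate addresses the same question and extends to more covariates.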


  • Hi Tarun,

    I am not an expert -- but I have looked into this before. It is not going to be an easy task to draw causal inference using cross-sectional data like the ACS PUMS. However, it is not impossible -- I think you just have to come up with a really good identification strategy. I think David Dorer posted some really great questions. What is your treatment? What is your outcome? I would also like to add: what is your unit of analysis? Is it county level, household, or individual? It would be harder to use propensity score matching at the household/individual level because you don't know the outcome for the next period -- the ACS draws a new sample each year, so the same households and individuals are not followed across years. You also have to think about what to do with the survey weights -- are you going to match/analyze with or without the weights? If your unit of analysis is at the county level or state level, then you have to control for migration and other policies at the state and county level. You have to control for all of these factors to draw causal inference. I hope this helps! Sorry for any typos.

    - T

  • Thanks T. I am planning to use the PUMS individual data. The treatment is income and the outcome is broadband. I did not understand your concern about cross-sectional samples. Can you please elaborate?

    Re survey weights: since I am using the individual data, do I still need to consider survey weights? I was thinking of not using them. Thanks again!

  • Thanks for the feedback, and apologies for not being clear. The question I am considering is "does income impact broadband adoption?" The confounding factors that I am considering are age and education, as these two may impact the perceived utility of the internet.

    One reason I am not using multivariate logistic regression is that the variables (e.g., income and education) can be highly correlated. Therefore, I am thinking of using exact matching (instead of PSM) to control for age and education. One issue that also came up with QED is the direction of causality. Does matching establish the causal direction -- that is, does income impact broadband or vice versa? Or is this an assumption that is baked into the QED?
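
    For reference, a minimal sketch of exact matching on age and education. The records, the age bands, the education codes, and the binary `high_income` flag are all hypothetical, and survey weights are ignored here. The result is a within-cell association; matching by itself does not establish which way causality runs.

```python
from collections import defaultdict

# Hypothetical person records (made-up values, not real PUMS rows):
# (age_band, education, high_income, broadband)
people = [
    ("25-44", "BA", 1, 1), ("25-44", "BA", 1, 1),
    ("25-44", "BA", 0, 1), ("25-44", "BA", 0, 0),
    ("45-64", "HS", 1, 1),
    ("45-64", "HS", 0, 1), ("45-64", "HS", 0, 0),
    ("45-64", "HS", 0, 0), ("45-64", "HS", 0, 0),
    ("65+", "HS", 0, 0),  # no high-income counterpart, so this cell is dropped
]


def exact_match_effect(rows):
    """Average within-cell difference in broadband rate (high minus low income).

    Cells are defined by exact (age_band, education) values. Cells missing
    either income group are discarded. Each remaining cell is weighted by its
    number of high-income members.
    """
    cells = defaultdict(lambda: {0: [], 1: []})
    for age_band, educ, high_income, broadband in rows:
        cells[(age_band, educ)][high_income].append(broadband)

    matched_treated = 0
    weighted_diff = 0.0
    for groups in cells.values():
        treated, control = groups[1], groups[0]
        if not treated or not control:
            continue  # no exact match available in this cell
        diff = sum(treated) / len(treated) - sum(control) / len(control)
        weighted_diff += len(treated) * diff
        matched_treated += len(treated)

    if matched_treated == 0:
        raise ValueError("no cell contains both income groups")
    return weighted_diff / matched_treated, matched_treated


effect, n_matched = exact_match_effect(people)
print(f"matched difference in broadband rate: {effect:.3f} "
      f"(based on {n_matched} high-income people)")
```

    Counting how many cells get dropped is worth doing on real data, since exact matching on many variables can discard a large share of the sample.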

  • Hi Tarun,

    If you're just looking at a single ACS year (the One-Year ACS PUMS) to draw causal inferences, then it is not easy. It is hard to establish the direction of causality in a single year (like you mentioned, the direction of causality is not well established here) -- think of it as a snapshot in time. Seeing people with high income and broadband in a single ACS year (as an example) is not a strong enough case to say that high income impacts broadband. However, let's say you have person A and person B with the same level of income in period t1, both with no broadband. If in the next period (t2) you see an increase in income for person A, and now he/she has broadband, then that's a good case for causality. Otherwise, it is hard to establish the direction of causality if you only use a single year of data. Hence my initial concern in the last post: cross-sectional data means you don't have the same people across sample years to do this type of analysis. That is my understanding -- and again, I am no expert in this field -- I just came across the same problem. I think it is okay to not use the survey weights. However, make sure you have enough observations for the exact matching if you're doing it within a small geographical area. Hope this helps!


  • Using multiple time points to track the "over time" relationship between broadband access and income is a much harder problem. I would set that aside for now. Do a "cross-sectional" analysis for a single time point, for example the 2016-2020 ACS vintage. I would start with logistic regression. A multivariate regression takes into account any correlation between the "input," "predictor," or "x" variables. If you want to do your PUMS analysis correctly, you need a logistic regression package that handles weights, including replicate weights. That way you will be able to compute the errors in your estimates correctly. You will get "error bars."
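
    A minimal sketch of how replicate weights turn into "error bars," shown for a weighted share rather than a regression coefficient. The ACS PUMS person file carries a full-sample weight (PWGTP) and 80 replicate weights (PWGTP1 to PWGTP80), and the standard error is the square root of 4/80 times the sum of squared differences between each replicate estimate and the full-sample estimate. The toy data below are made up and use only 4 replicates so the arithmetic is easy to follow; the function names are hypothetical.

```python
import math


def weighted_share(outcomes, weights):
    """Weighted proportion of records with outcome == 1."""
    return sum(w * y for y, w in zip(outcomes, weights)) / sum(weights)


def replicate_se(outcomes, full_weights, replicate_weights):
    """Estimate and replicate-weight standard error.

    variance = (4 / R) * sum((theta_r - theta)^2), where R is the number of
    replicate weight columns (R = 80 for ACS PUMS).
    """
    theta = weighted_share(outcomes, full_weights)
    reps = [weighted_share(outcomes, rw) for rw in replicate_weights]
    variance = 4.0 / len(reps) * sum((t - theta) ** 2 for t in reps)
    return theta, math.sqrt(variance)


# Made-up toy data: 4 people, broadband yes/no, one full weight each,
# and 4 replicate weight columns (real ACS data would have 80).
broadband = [1, 0, 1, 1]
full = [10, 20, 30, 40]
replicates = [
    [20, 20, 30, 30],
    [10, 30, 30, 30],
    [10, 10, 40, 40],
    [10, 20, 30, 40],
]

share, se = replicate_se(broadband, full, replicates)
print(f"weighted broadband share = {share:.3f}, SE = {se:.3f}")
```

    The same recipe applies to a regression coefficient: refit the model once per replicate weight column and plug the coefficients into the same formula. Survey software does this for you.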

    R is a free, open-source statistical analysis system. There is a popular desktop interface for it called RStudio, and the free version should have everything that you need. The add-on package that you need is the "survey" package. With enough googling you should be able to find code (including replicate weights) to solve your problem. You can probably find an ACS example. When using R it is helpful to have some programming experience; any computer language will do. RStudio gives you an editor, a console, and menus for routine tasks such as installing packages, but you still write the analysis code yourself. Other packages can also handle weights and replicate weights: SAS, Stata, and possibly SPSS. If you have access to them, then great! If you don't, you need to get out your wallet.


    The regular Windows GUI version of R has some pull-down menus, but I've heard good things about RStudio.

    Regular R for Windows: https://cran.r-project.org/bin/windows/base/ I don't think that the survey package is part of the base installation. Use install.packages("survey") to install the package.

    Best of luck !

  • As an FYI -- if you read my profile -- I use regular R on my Ubuntu Linux machine -- it's the best !

  • Dear Tarun,

    Just wanted to check back with you to hear if you were able to run an analysis to get what you need. One additional comment on correlated x/predictor variables in a regression: you can check for this by running a correlation matrix on the x variables. If there are two highly correlated x variables, the coefficients in the regression will be unstable. Just drop one of the variables and rerun the regression. For two binary variables, you can look at the odds ratio in the table generated by a cross tabulation. Odds ratios near 0 or infinity (that is, far from 1) show that the x variables are highly "correlated/associated."
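
    A minimal sketch of both checks for a pair of binary x variables: the Pearson correlation and the odds ratio from their 2x2 cross tabulation. The two indicator variables and their values are made up for illustration, and the sketch ignores survey weights.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists.

    Assumes neither list is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def odds_ratio_2x2(x, y):
    """Odds ratio from the cross tabulation of two 0/1 variables."""
    n11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    n10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    n01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    n00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    if n10 * n01 == 0:
        # An empty off-diagonal cell: infinite odds ratio, or undefined (0/0).
        return math.inf if n11 * n00 > 0 else float("nan")
    return (n11 * n00) / (n10 * n01)


# Made-up indicators, e.g. "has a college degree" and "high income".
college = [1, 1, 1, 1, 0, 0, 0, 0]
high_income = [1, 1, 1, 0, 0, 0, 0, 1]

r = pearson(college, high_income)
odds_ratio = odds_ratio_2x2(college, high_income)
print(f"correlation = {r:.2f}, odds ratio = {odds_ratio:.1f}")
```

    An odds ratio of 1 means no association, so it is the distance from 1 in either direction that signals trouble.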