Quasi-experimental design using PUMS data

I want to do a QED analysis on ACS data for causal inference -- specifically, between broadband and income. Before proceeding, I want to understand the nuances of using such a method on the ACS data. I could not find any existing work that has done QED matching with the PUMS data. Does anyone know of any work using it, or any criticism of such methods?

Thanks,

Tarun

Parents
  • Since I've spent many years as a consulting statistician, I'll ask you the first question I ask when I start a project with someone new: "What is the scientific question that you would like to answer?" Here is an example. Using the language of epidemiology, what is your outcome? Having broadband? What is your treatment? High vs. low income? In this case, what you want to show is that households with a higher income are more likely to have broadband, "all other things being equal." If there are other factors that might affect the likelihood of having broadband, you want to factor them into the analysis. For example, occupation might also affect whether or not you have broadband.

    One QED method is propensity score matching: https://en.wikipedia.org/wiki/Propensity_score_matching. Conceptually, a much simpler method is to get the adjusted (for occupation, in this example) effect of income on having broadband. An analysis like this only requires multivariate logistic regression or (as is much more common with census data) loglinear models (a sketch is included after this reply). Loglinear models are related to "raking," a common method used with census data. You can also go with the propensity-score-based method, which is harder to understand. By the way, the propensity score is usually constructed with logistic regression, so the two approaches are related. Sorry for typos, syntax, etc.; the font is so small on my computer I can barely read this.

    Dave
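
    Below is a minimal sketch of the simpler adjusted-regression approach described above, in Python with pandas and statsmodels. The file name, the HISPEED/HINCP recodes, and the OCCP_GROUP column are illustrative assumptions (check the PUMS data dictionary for your ACS year), and survey weights are ignored here to keep it short:

    # Minimal sketch: logistic regression of broadband on income,
    # adjusted for occupation, on a household-level PUMS extract.
    # Column names are assumptions -- verify against the PUMS data dictionary.
    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("pums_households.csv")

    # Recode the outcome to 0/1 (HISPEED == 1: has cable/fiber/DSL broadband)
    df["broadband"] = (df["HISPEED"] == 1).astype(int)

    # Drop records missing income or the (user-constructed) occupation grouping
    df = df.dropna(subset=["HINCP", "OCCP_GROUP"])

    # The HINCP coefficient is the log-odds effect of income on having
    # broadband, adjusted for occupation group
    model = smf.logit("broadband ~ HINCP + C(OCCP_GROUP)", data=df).fit()
    print(model.summary())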

Children
  • Thanks for the feedback, and apologies for not being clear. The question I am considering is "does income impact broadband adoption?" and the confounding factors I am considering are age and education, as these two may affect the perceived utility of the internet.

    One reason I am not using multivariate logistic regression is that the variables (e.g., income and education) can be highly correlated. Therefore, I am thinking of using exact matching (instead of PSM) to control for age and education; a sketch of this is included below. One issue that also came up with QED is the direction of causality. Does matching establish the causal direction -- i.e., does income impact broadband, or vice versa? Or is this an assumption that is baked into the QED?
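
    A minimal sketch of the exact-matching idea, assuming high vs. low income is treated as the "treatment" and using hypothetical recoded columns (age_grp, edu_grp, high_income, broadband); note that the income-to-broadband direction is an assumption of this design, not something the matching itself tests:

    # Minimal sketch: exact matching on age group and education group,
    # then a within-cell comparison of broadband adoption by income group.
    # All column names are hypothetical recodes, not raw PUMS variables.
    import pandas as pd

    df = pd.read_csv("pums_recodes.csv")

    # Keep only cells (age group x education group) that contain both
    # high- and low-income households, so every unit has an exact match
    cells = df.groupby(["age_grp", "edu_grp"])["high_income"].nunique()
    keep = cells[cells == 2].index
    matched = df.set_index(["age_grp", "edu_grp"]).loc[keep].reset_index()

    # Within-cell difference in broadband adoption between income groups,
    # then a weighted average across cells (weighted by high-income count)
    def cell_effect(g):
        diff = (g.loc[g["high_income"] == 1, "broadband"].mean()
                - g.loc[g["high_income"] == 0, "broadband"].mean())
        return pd.Series({"diff": diff, "n_treated": (g["high_income"] == 1).sum()})

    effects = matched.groupby(["age_grp", "edu_grp"]).apply(cell_effect)
    att = (effects["diff"] * effects["n_treated"]).sum() / effects["n_treated"].sum()
    print(f"Matched difference in broadband adoption: {att:.3f}")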

  • Dear Tarun,

    Just wanted to check back with you to hear whether you were able to run an analysis and get what you need. One additional comment on correlated x/predictor variables in a regression: you can check for this by running a correlation matrix on the x variables. If two x-variables are highly correlated, the coefficients in the regression will be unstable; just drop one of the variables and rerun the regression. For two binary variables, you can look at the odds ratio in the table generated by a cross tabulation. Odds ratios near 0 or infinity show that the x variables are highly "correlated/associated." A short sketch of both checks follows this reply.

    Dave
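
    A minimal sketch of both checks mentioned above, using hypothetical 0/1 recodes rather than actual PUMS variable names:

    # Minimal sketch: collinearity checks on the predictors.
    # Column names (income, education, age, high_income, college_degree)
    # are illustrative recodes, not raw PUMS variables.
    import pandas as pd

    df = pd.read_csv("pums_recodes.csv")

    # 1) Correlation matrix of the predictors: a pair with |r| close to 1
    #    signals unstable regression coefficients, so drop one of the pair
    print(df[["income", "education", "age"]].corr())

    # 2) For two binary predictors, the odds ratio from a 2x2 cross tabulation:
    #    values near 0 or infinity indicate a strong association
    tab = pd.crosstab(df["high_income"], df["college_degree"])
    odds_ratio = (tab.iloc[0, 0] * tab.iloc[1, 1]) / (tab.iloc[0, 1] * tab.iloc[1, 0])
    print(f"Odds ratio: {odds_ratio:.2f}")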