Calculating household incomes when household and person files are merged

Hi all,

I'm brand new to the ACS Data Users Group and very excited to be a part of it! What an incredible wealth of resources. 

I actually have a challenge that I've been running into while using PUMS. I've been using FactFinder for years, but have only started using PUMS in the last six months, so please forgive my lack of knowledge. 

We (my research center, Boston Indicators at the Boston Foundation) are trying to find the average (either mean or median) income of same-sex couples in Massachusetts disaggregated by race and ethnicity. This is not possible through FactFinder or other ACS sources (like NHGIS) that pre-can tables. So, we thought we'd use PUMS to get at this problem. The best we think we can do is finding the race/ethnicity of the primary householder of same-sex couples, which is not ideal, but still better than nothing. 
One challenge, as I'm sure you've run across in your own work, was that the desired data was recorded in two different datasets. Data on same-sex couples and household income is in the household record while data on race and ethnicity is in the person record. We solved for this by doing a simple merge of the data and then subsetting to include only people who are primary householders (or, in PUMS' lingo, the "reference person"). So, theoretically, we should have a dataset of primary householders that includes: the primary householder's race/ethnicity, same-sex couples, and household income. 
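
A minimal sketch of this merge-and-subset step, using toy records in Python rather than the R script mentioned in this post. It joins on SERIALNO, the key shared by the household and person files. Treating RELP code 0 as the reference person is an assumption about files of this vintage, so check the data dictionary for your year. The function name and the sample values are hypothetical.

```python
# Toy records; real PUMS files carry many more columns.
households = [
    {"SERIALNO": "001", "HINCP": 120000, "WGTP": 50, "SSMC": 0},
    {"SERIALNO": "002", "HINCP": 90000, "WGTP": 80, "SSMC": 1},
]
persons = [
    {"SERIALNO": "001", "SPORDER": 1, "RELP": 0, "RAC1P": 1},
    {"SERIALNO": "001", "SPORDER": 2, "RELP": 1, "RAC1P": 2},
    {"SERIALNO": "002", "SPORDER": 1, "RELP": 0, "RAC1P": 6},
    {"SERIALNO": "002", "SPORDER": 2, "RELP": 1, "RAC1P": 1},
]

REFERENCE_PERSON = 0  # assumed RELP code for the reference person


def merge_reference_persons(households, persons):
    """Attach household fields to each reference-person record."""
    hh_by_serial = {h["SERIALNO"]: h for h in households}
    merged = []
    for p in persons:
        if p["RELP"] != REFERENCE_PERSON:
            continue  # keep one person per household
        hh = hh_by_serial.get(p["SERIALNO"])
        if hh is None:
            continue  # person record with no matching household record
        merged.append({**hh, **p})
    return merged


merged = merge_reference_persons(households, persons)
for row in merged:
    print(row["SERIALNO"], row["RAC1P"], row["SSMC"], row["HINCP"], row["WGTP"])
```

After the merge, each SERIALNO should appear exactly once; duplicates would mean some households are counted more than once in any weighted total.
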
The second challenge is around weighting. Because we merged the two datasets, I am a little unclear on how to appropriately use the PUMS weights. I've chosen to use the household weights (WGTP) because we are looking for household income. After adjusting the household income using ADJINC, I went ahead and calculated the mean income for each subset population (e.g. same-sex couples with a white primary householder) using the following process: I multiplied each household income by its weight (let's call this "whi") and summed all the weighted household incomes (let's call this "s.whi"). Then, I separately summed all the weights (let's call this "sw") to get what is essentially an estimate of the total number of households in the subset. Then, I divided s.whi by sw to get average income (avg income = s.whi/sw). 
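
The calculation described above can be sketched as follows, in Python with toy numbers. ADJINC is stored with six implied decimal places, so it is divided by 1,000,000 before use. The ADJINC value and the incomes below are illustrative placeholders, not real PUMS values, and the function names are hypothetical.

```python
def adjusted_income(hincp, adjinc):
    """Apply the ADJINC factor (six implied decimals) to household income."""
    return hincp * adjinc / 1_000_000


def weighted_mean(values, weights):
    """Return sum(weight * value) / sum(weight), i.e. s.whi / sw."""
    sw = sum(weights)
    if sw == 0:
        raise ValueError("weights sum to zero")
    s_whi = sum(v * w for v, w in zip(values, weights))
    return s_whi / sw


# Illustrative rows only; 1010000 is a made-up adjustment factor.
rows = [
    {"HINCP": 100000, "ADJINC": 1010000, "WGTP": 10},
    {"HINCP": 50000, "ADJINC": 1010000, "WGTP": 30},
]

incomes = [adjusted_income(r["HINCP"], r["ADJINC"]) for r in rows]
weights = [r["WGTP"] for r in rows]
mean_income = weighted_mean(incomes, weights)
print(mean_income)  # 63125.0
```

The unweighted mean of the same two adjusted incomes is 75,750, so the weights pull the estimate toward the household that represents more households.
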
When I finished the script, there was a glaring problem: the results were obviously incorrect. For example, the mean income for same-sex couples was about $157,000 and the mean income for different-sex couples was about $139,000. Based on research done by the Williams Institute, we would expect the mean income for same-sex couples in Massachusetts to be higher than different-sex couples, but neither of those mean incomes should be anywhere close to $139,000 or $157,000. 
I am happy to share the R script if anyone is interested! 
Any thoughts on why the figures are so out of whack? Am I weighting incorrectly? Is there some quirk of PUMS data I'm missing? 
Any insight you could give would be much appreciated!
  • After reading your post, I ran a tabulation for MA on SSMC through Gateway To MAST (2014 data, Android/Kindle app). I suspect that you are calculating correctly, but that the self-reported incomes in the PUMS are high. I too would be interested in any insights the group has on this.
  • Hi Anise,

    I'm happy to take a look. What you propose sounds roughly correct (keeping the householder record, using the household weight, using a weighted average). The most likely problem I can think of is that you're dropping cases with 0 income. Another possible problem is not adjusting for the number of years of data, if by chance you combined more than one year.

    As a new PUMS user, I highly recommend downloading the verification files and attempting to recreate the weighted estimates therein. Even better if you can reproduce the replicate weight standard errors. (See PUMS Estimates for User Verification:

    From there, I'd then try to reproduce estimates of mean household income for all households in Massachusetts and compare those against what you see in American FactFinder. This is a good way to benchmark your approach, though your PUMS estimate will differ slightly from the pre-tabulated results because you're using a smaller (public use) sample and, I think, incomes are rounded to protect confidentiality. (See Subject Definitions:

    I'll see what I get when I get a moment, but you can send your code to as well.
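
For the replicate-weight standard errors mentioned in this reply, the PUMS documentation gives SE = sqrt(4/80 × Σ (replicate estimate − full estimate)²), where the estimate is recomputed with each of the 80 household replicate weights (WGTP1 through WGTP80). A rough Python sketch with fabricated toy weights follows; the function names and the "income" key are hypothetical, and real replicate weights come from the file itself.

```python
import math


def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def replicate_se(full_estimate, replicate_estimates):
    """Standard error from 80 replicate estimates (ACS PUMS formula)."""
    if len(replicate_estimates) != 80:
        raise ValueError("expected 80 replicate estimates")
    ss = sum((r - full_estimate) ** 2 for r in replicate_estimates)
    return math.sqrt(4 * ss / 80)


def mean_with_se(rows, value_key="income"):
    """Weighted mean using WGTP, with SE from WGTP1..WGTP80."""
    values = [r[value_key] for r in rows]
    full = weighted_mean(values, [r["WGTP"] for r in rows])
    reps = [
        weighted_mean(values, [r[f"WGTP{i}"] for r in rows])
        for i in range(1, 81)
    ]
    return full, replicate_se(full, reps)


# Toy households with made-up replicate weights (deterministic jitter).
rows = []
for k, (income, w) in enumerate([(40000, 20), (90000, 35), (150000, 15)]):
    row = {"income": income, "WGTP": w}
    for i in range(1, 81):
        row[f"WGTP{i}"] = w + ((i + k) % 3 - 1) * 5
    rows.append(row)

mean, se = mean_with_se(rows)
print(round(mean), round(se))
```

Comparing the output of a function like this against the published verification estimates is a direct way to confirm the weighting logic before moving on to custom subsets.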

  • In reply to Vincent Palacios:

    I absolutely agree with Vincent in validating your processing with the verification files, weighted estimates, and replicate weights. However, Gateway To MAST is spot on with those, and the incomes still come out high. I would be very interested if you find that your high numbers are a result of a significant error in your processing.
  • In reply to Vincent Palacios:

    Related... You may want to check to see how your code is handling negative income levels. Reported household income can be negative or positive.
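
A small Python sketch of how zero and negative incomes change a weighted mean, using made-up numbers. Here None stands in for a blank HINCP (group quarters or vacant units, which are not households); zero and negative values are legitimate household incomes and belong in the mean. A truthiness filter such as `if income` silently drops the zeros along with the blanks.

```python
def weighted_mean(pairs):
    """pairs: iterable of (income, weight) tuples."""
    pairs = list(pairs)
    total_weight = sum(w for _, w in pairs)
    return sum(x * w for x, w in pairs) / total_weight


# (income, household weight); None marks a record with no household income.
records = [(-5000, 10), (0, 20), (60000, 40), (120000, 30), (None, 15)]

# Correct: drop only missing incomes, keep zero and negative values.
households = [(x, w) for x, w in records if x is not None]
mean_all = weighted_mean(households)

# Common slip: a truthiness test also removes the zero-income household.
mean_truthy = weighted_mean((x, w) for x, w in records if x)

# Another slip: keeping only positive incomes.
mean_positive = weighted_mean((x, w) for x, w in households if x > 0)

print(mean_all)     # 59500.0
print(mean_truthy)  # 74375.0
print(round(mean_positive, 2))
```

Both slips push the mean upward, so they are worth ruling out when an estimate looks too high.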
  • In reply to Beth Jarosz:

    From what I can tell, these numbers are not that unreasonable. For 2016 1-year ACS data, table S1901 shows mean incomes by household type. (See:

    From table S1901:
    There are 2,579,398 households with a mean household income of $101,911.
    There are 1,633,661 family households with a mean HHI of $122,310.
    There are 1,194,726 married-couple family households with a mean HHI of $143,966.
    There are 945,737 non-family households with a mean HHI of $62,500.

    With weighted 2016 1-year ACS PUMS for MA I get:
    There are 2,579,453 households with a mean household income of $101,593.
    There are 1,634,746 family households with a mean HHI of $124,355.
    There are 1,195,011 married-couple family households with a mean HHI of $144,204.
    There are 944,707 non-family households with a mean HHI of $62,207.

    -And for comparison, with PUMS-
    There are 17,396 same-sex couple households with a mean HHI of $181,920.

    Stata code:
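    ...load data...
    * Adjust household income with ADJINC (stored with six implied decimals)
    gen hincp_adj = hincp * adjinc/1000000
    * Indicator variables for each household type
    gen hh = 1
    gen fam_hh = inrange(hht, 1, 3)
    gen mcfam_hh = hht == 1
    gen nonfam_hh = inrange(hht, 4, 7)
    gen ssmc_hh = inlist(ssmc, 1, 2)
    * Household-weighted counts and mean adjusted income for each indicator
    tab1 hh fam_hh mcfam_hh nonfam_hh ssmc_hh [fw=wgtp], sum(hincp_adj)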
  • In reply to Vincent Palacios:

    Speaking qualitatively, as someone familiar with the situation on the ground in Massachusetts, I do not find the results here surprising. We have the fifth highest mean household income in the country and the fourth highest family median income. Two adults in a household who both have professional or union jobs could certainly earn salaries commensurate with these mean values. This is reflected to some extent in our ever-soaring real estate values. The issue in Massachusetts lies more with the distribution of incomes than with a lack of income overall.
  • I ran a tabulation that breaks households into state (so you can just look at MA, or compare MA to other states), SSMC, and whether or not the reference person is white. I've tried including it in this post, but it says that .csv files are invalid. So I've put it on google drive here:

    It's 2014 data, so expect some differences if you are using newer data. Line 135 shows that in MA, with a white reference person, the average household income for a same-sex couple is about $164,000. With a non-white reference person (line 132), the average household income is about $151,000. However, the sample size for line 132 is tiny: there are only 10 (UW_hdrs) actual households in the sample in that category. For non-same-sex couples with a white householder, my numbers are quite a bit lower than yours ($76,000 for family income and $100,000 for household income, where you had $139,000 for household income). The tabulation has two levels (sections). The upper is household-level information. If you want to know a bit about the people who live in those households, scroll down to the person-level section (line 438 for MA).

    If you have any questions (most people aren't used to working with 2 level MAST tabulations) feel free to ask.
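
Because some cells here rest on very few sampled households, it can help to report the unweighted count next to each weighted mean. The Python sketch below does that for toy records grouped by couple type and whether the reference person is white. The field names (same_sex, white_ref, income) and all values are hypothetical; this is not how the MAST tabulation itself is produced.

```python
from collections import defaultdict


def summarize(rows, group_keys):
    """Unweighted n, weighted household count, and weighted mean by group."""
    acc = defaultdict(lambda: {"n": 0, "sw": 0, "swx": 0.0})
    for r in rows:
        key = tuple(r[k] for k in group_keys)
        cell = acc[key]
        cell["n"] += 1                           # sampled households
        cell["sw"] += r["WGTP"]                  # estimated households
        cell["swx"] += r["WGTP"] * r["income"]   # weighted income total
    return {
        key: {
            "unweighted_n": c["n"],
            "weighted_households": c["sw"],
            "mean_income": c["swx"] / c["sw"],
        }
        for key, c in acc.items()
        if c["sw"] > 0
    }


# Made-up records for illustration only.
rows = [
    {"same_sex": True, "white_ref": True, "income": 160000, "WGTP": 20},
    {"same_sex": True, "white_ref": True, "income": 100000, "WGTP": 10},
    {"same_sex": True, "white_ref": False, "income": 150000, "WGTP": 25},
    {"same_sex": False, "white_ref": True, "income": 90000, "WGTP": 40},
]

table = summarize(rows, ["same_sex", "white_ref"])
for key, stats in sorted(table.items()):
    print(key, stats)
```

A cell with an unweighted count of 10, like the one noted above, will have a wide margin of error no matter how plausible its mean looks.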
  • Curious to see your R script.
  • In reply to Vincent Palacios:

    Thanks for these recommendations Vincent! I will definitely download the verification files and attempt to recreate the weighted estimates. And I'll try to reproduce the replicate weight standard errors as well.

    I'll send on the R script to that email address now--thanks for offering to take a look when you get a moment!
  • In reply to Mihir Iyer:

    Happy to send it along! Where is best to send it? Or is it standard practice to post in groups? Not sure of the protocol...
  • In reply to John Grumbine:

    John, this is great. Thanks for taking the time to pull that sheet together. It's interesting to see how MA compares to other states in your tabulation--very useful overview.
  • In reply to Anise Vance:

    You are very welcome.
  • In reply to Anise Vance:

    When I'm working with open data like the ACS PUMS, I create a GitHub repo and use that to share, since it provides a reproducible example. This site also has a messaging feature, but I have been unsuccessful in using it. Dropbox etc. work too.