Top-coded values

Has anyone ever attempted to "fill in" the top-coded values for the 5-year ACS PUMS data? For example, in 2009, all of the house values (variable: VALP) greater than 4 million dollars in Hawaii get cut off, so it's impossible to tell whether the observation you're looking at is a 50 million dollar home or a 4.01 million dollar home, since both look the same. The top-coded cutoff values are different for every year-by-state combination, and I haven't had much luck trying to break through the top-coded value. I've mostly been running linear regressions on the data and am just trying to get a reasonable estimate for each observation that is top-coded. Any insight would be greatly appreciated.

[Updated on 2/24/2015 3:08 PM]
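
One way to attack this, rather than regressing on the censored values directly, is to model the upper tail of the distribution explicitly and replace each top-coded observation with an estimate of the tail's conditional mean above the cutoff. Below is a minimal Python/pandas sketch of a Pareto-tail imputation along those lines; it is not anyone's established method for the ACS PUMS, the file name is a placeholder, and the $4 million Hawaii cutoff is just the figure from the question (ST 15 is Hawaii's FIPS code).

```python
import numpy as np
import pandas as pd

def pareto_tail_mean(values, topcode, lower_frac=0.5):
    """Estimate the mean house value above `topcode`, assuming a Pareto tail.

    Fits the Pareto shape parameter alpha (Hill MLE) to the uncensored
    observations between lower_frac * topcode and the topcode, then returns
    the conditional mean above the topcode: alpha * topcode / (alpha - 1).
    """
    x_min = lower_frac * topcode
    tail = values[(values >= x_min) & (values < topcode)]
    if len(tail) == 0:
        return topcode                      # nothing to fit; leave the cutoff
    alpha = len(tail) / np.log(tail / x_min).sum()
    if alpha <= 1:
        return topcode                      # infinite-mean tail; fall back
    return alpha * topcode / (alpha - 1)

# Hypothetical usage on a 5-year housing file (placeholder file name):
# pums = pd.read_csv("ss13husa.csv", usecols=["ST", "VALP"])
# topcode = 4_000_000                       # HI 2009 cutoff from the question
# hawaii = pums.loc[(pums.ST == 15) & pums.VALP.notna(), "VALP"]
# fill = pareto_tail_mean(hawaii, topcode)
# pums.loc[(pums.ST == 15) & (pums.VALP >= topcode), "VALP"] = fill
```

A Tobit model, mentioned in the reply below, is the regression-based alternative; the tail-imputation sketch above is simply one way to get a single plug-in value per state and year.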
  • Hi ultimate,

    What you're doing is intriguing, so I looked into it a bit and have uploaded three tabulations to the PUMS group that might be worthwhile for you. I'm not a statistics guy, so I can't comment much on the Tobit censored/truncated data model, though I did read about it. Twenty years ago I created an analytical tool for a telephone company that I'm now using on census data (better described at www.OneGuyOnTheInternet.com); the tabulations that I uploaded come from that tool.

    Although the VALP data item (house value) is described as $0-$9,999,999 in the documentation, there are only about 2,300 distinct values in the PUMS, because it is rounded and top-coded. Because there are only about 2,300 values, we can use VALP as a dimension, to see how many households have each possible value. I did that with the 2009-2013 PUMS file (my 2010-2014 file isn't quite ready yet, though I expect similar results). The uploaded file is called valp.csv.

    There are, in that tabulation, about 56 million households that fall into the GQ/Vacant category. However, there are only about 16.7 million vacant housing units in the US+PR (GQ 'households' count as zero), as we can see in the second tabulation, gqvac.csv. The first line of that tab is group quarters (records with a zero Wgtp), the second line is vacant units (Np = 0), and the third line is occupied households. UW_hdrs is the unweighted header count, and LoVal and HiVal provide the margin of error for Wgtp (the weighted household count). (A rough pandas sketch of this classification, and of the valp tabulation, follows this reply.)

    I wondered if it was a 2009-2013 file issue, so I looked at the VALP field in the 2010-2014 file and found that 41% of the records have a 'null' value (the Census Bureau calls them blanks). So although I can't run a tab on that file yet, there will again be a very large number of households in the vacant/GQ category for VALP.

    The third upload (valpst.csv) dimensionalizes on state and VALP. What's interesting here is that there are evidently 5 different topcodes for each state; I would expect that there is a different topcode for each state for each of the 5 years in the file. (A short sketch for surfacing these candidate values per state also follows below.)

    John Grumbine
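
For anyone who wants to reproduce the kind of tabulation John describes without his tool, here is a rough pandas sketch of both the gqvac-style classification and a valp.csv-style weighted count. The standard PUMS housing-file column names (ST, VALP, WGTP, NP) and a placeholder file name are assumed; this is not his implementation, and it does not compute the LoVal/HiVal margins of error.

```python
import numpy as np
import pandas as pd

# Assumed: a 5-year PUMS housing file with the standard ST, VALP, WGTP, NP columns.
hus = pd.read_csv("ss13husa.csv", usecols=["ST", "VALP", "WGTP", "NP"])

# Classify records the way gqvac.csv is described: group quarters carry a zero
# household weight (WGTP == 0), vacant units have no persons (NP == 0), and
# everything else is an occupied household.
hus["category"] = np.select(
    [hus.WGTP == 0, hus.NP == 0],
    ["group quarters", "vacant"],
    default="occupied",
)

# Unweighted record counts and weighted household counts per category.
print(hus.groupby("category")["WGTP"].agg(uw_records="size", weighted="sum"))

# A valp.csv-style tabulation: weighted household count at each distinct
# (rounded, top-coded) VALP value; blank VALP records drop out of the groupby.
print(hus.groupby("VALP")["WGTP"].sum())
```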
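
And for the per-state topcodes: since each sample year contributes its own topcode value, the largest few distinct VALP values within a state are the natural candidates. The snippet below is only a heuristic for surfacing them; the actual cutoffs should be checked against the Census Bureau's published topcode lists.

```python
import pandas as pd

# Same assumed file and columns as above.
hus = pd.read_csv("ss13husa.csv", usecols=["ST", "VALP"])

# The five largest distinct VALP values in each state; with one topcode per
# sample year, these should surface the candidate cutoff values.
top_values = (
    hus.dropna(subset=["VALP"])
       .groupby("ST")["VALP"]
       .apply(lambda s: sorted(s.unique())[-5:])
)
print(top_values)
```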