Top-coded values

Has anyone ever attempted to "fill in" the top-coded values for the 5-year ACS PUMS data? For example, in 2009, all of the house values (variable: valp) greater than 4 million dollars in Hawaii get cut off, so it's impossible to tell whether the observation you're looking at is a 50 million dollar home or a 4.01 million dollar home - both look the same. The top-code cutoff values are different for every year-by-state combination, and I haven't had much luck trying to get past them. I've mostly been running linear regressions on the data and am just trying to get a reasonable estimate for each observation that is top-coded. Any insight would be greatly appreciated.

[Updated on 2/24/2015 3:08 PM]
  • You need a secondary data source to distribute the top-coded values. Your results will vary according to how large (and representative) the secondary dataset is. I used to use proprietary rental-listing databases to further break out rental units in the top category by bedroom count and square footage, with some degree of success. You might see if you can get some historical MLS databases (or something similar) and try to build distributions based on price per square foot or other parameters.
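
    Just to illustrate the idea (a sketch only; 'pums', 'mls', 'sale_price', and 'topcode' are placeholders rather than actual fields), you could draw replacement values for the top-coded records from the part of the secondary price distribution at or above the top-code, e.g. in R:

        # Hypothetical sketch: 'pums' and 'mls' are placeholder data frames, and
        # 'topcode' is the state/year-specific top-code you have already looked up.
        set.seed(1)

        impute_topcoded <- function(valp, topcode, secondary_prices) {
          # prices in the secondary source at or above the top-code
          tail_prices <- secondary_prices[!is.na(secondary_prices) &
                                            secondary_prices >= topcode]
          is_top <- !is.na(valp) & valp >= topcode
          # draw one replacement per top-coded record from that upper tail
          valp[is_top] <- tail_prices[sample.int(length(tail_prices),
                                                 sum(is_top), replace = TRUE)]
          valp
        }

        pums$valp_imputed <- impute_topcoded(pums$VALP, topcode, mls$sale_price)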
  • Referring to a second data source would be my advice as well.

    In some states you can access the local tax assessor records for approximate home valuation. (Assessed value is never the same as market value, but in some states those two values are closer than in others.)

    But the technique also depends on your ultimate goal for the data. If you're trying to estimate some sort of regression formula linking household characteristics to the value of the home, I'd be a bit skeptical of using modeled inputs (i.e., filled-in top-coded values) in conjunction with survey data.
  • I also think property assessor data would be very good...if you can get your hands on it. The problem is that it is much harder to get access to those databases. Each database is maintained by the county that levies the taxes, so there is no one-stop shopping if you are looking at a larger area. If you contact the property assessor for each county you are examining, you might get lucky; I have never been so fortunate. Manually looking up properties on county websites and hard-coding the data is not a viable solution. On the other hand, you can usually get access to historical MLS data if you talk to the right people: a Realtor you know or, perhaps, the Board of Realtors in that area, though they will almost certainly charge you for the data if you are not a member. Good luck.
  • Have you tried models for censored data? One of the outputs you can get from them is the expected value of the dependent variable conditional on the independent variables and on the fact that it is censored.
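
    As a rough sketch only (the data frame 'dat', the predictors, and 'topcode' are placeholders, not PUMS fields), a right-censored Gaussian fit with survreg() from the survival package, plus the conditional expectation I mentioned, might look like this in R:

        library(survival)

        # Placeholder data frame 'dat' with house value 'valp', the state/year
        # top-code 'topcode', and illustrative predictors 'rooms', 'year_built'.
        # A record is treated as right-censored when valp hits the top-code.
        dat$observed <- dat$valp < dat$topcode

        # Tobit-style fit: Gaussian latent value, right-censored at the top-code
        fit <- survreg(Surv(pmin(valp, topcode), observed) ~ rooms + year_built,
                       data = dat, dist = "gaussian")

        # Expected value of a censored record, given that it is censored:
        #   E[y | y >= c, x] = xb + sigma * dnorm(z) / (1 - pnorm(z)),  z = (c - xb) / sigma
        xb     <- predict(fit, newdata = dat)   # linear predictor
        sigma  <- fit$scale
        z      <- (dat$topcode - xb) / sigma
        e_cond <- xb + sigma * dnorm(z) / (1 - pnorm(z))

        dat$valp_filled <- ifelse(dat$observed, dat$valp, e_cond)

    House values are heavily skewed, so in practice you might prefer to fit log(valp) and transform back, but that is a modeling choice rather than something the package decides for you.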
  • I wasn't able to fully implement what I wanted to do last year with the 2013 PUMS, but I am hoping to do it with the 2014 PUMS. What models for censored data would you recommend? I have heard a lot about the Tobit model for censored regressions but am unsure which package would best serve my needs in R.
  • Hi ultimate,

    What you're doing is intriguing, so I looked into it a bit and have uploaded three tabulations to the PUMS group that might be worthwhile for you. I'm not a statistics guy, so I can't comment much on the Tobit censored/truncated data model, though I did read about it. Twenty years ago I created an analytical tool for a telephone company that I now use on census data (described further at www.OneGuyOnTheInternet.com); the tabulations I uploaded come from that tool.

    Although the VALP data item (house value) is described as $0-$9,999,999 in the documentation, there are only about 2,300 distinct values in the PUMS, because it is rounded and topcoded. With so few values, we can use VALP as a dimension to see how many households have each possible value. I did that with the 2009-2013 PUMS file (my 2010-2014 file isn't quite ready yet, though I expect similar results). The uploaded file is called valp.csv.
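
    If it helps to see the mechanics, a tabulation along these lines can be reproduced in R; 'housing' here is just a placeholder for a data frame read from the PUMS housing-record file, while VALP and WGTP are the actual PUMS fields:

        # Distinct reported house values (rounding + topcoding keeps this small)
        length(unique(na.omit(housing$VALP)))

        # Weighted household count at each distinct VALP value; records with a
        # blank VALP are dropped by the formula interface
        valp_tab <- aggregate(WGTP ~ VALP, data = housing, FUN = sum)
        head(valp_tab[order(-valp_tab$VALP), ], 10)   # largest values first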

    There are, in that tabulation, about 56 million households that fall into the GQ/Vacant category. However, there are only about 16.7 million vacant housing units in the US+PR (GQ 'households' count as zero), as we can see in the second tabulation, gqvac.csv. The first line of that tab is the group-quarters records (those with a zero Wgtp), the second line is the vacant units (Np = 0), and the third line is the occupied households. UW_hdrs is the unweighted header count; LoVal and HiVal provide the margin of error for Wgtp (the weighted household count).
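
    For reference, that three-way split can be reproduced directly from the weights and person counts (again, 'housing' is a placeholder data frame; WGTP is the household weight and NP the person count):

        # WGTP == 0           -> group-quarters record
        # WGTP  > 0 & NP == 0 -> vacant unit
        # otherwise           -> occupied household
        housing$hh_type <- ifelse(housing$WGTP == 0, "GQ",
                           ifelse(housing$NP == 0, "vacant", "occupied"))

        table(housing$hh_type)                           # unweighted record counts
        aggregate(WGTP ~ hh_type, data = housing, sum)   # weighted household counts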

    I wondered if it was a 2009-2013 file issue, so I took the VALP field from the 2010-2014 file and found that 41% of the records have a 'null' value (the Census Bureau calls them blanks). So although I can't run a tab on that file yet, there will again be a very large number of households in the vacant/GQ category for VALP.

    The third upload (valpst.csv) dimensionalizes on state and VALP. What's interesting here is that there are evidently five different topcodes for each state. I would expect that there is a different topcode for each state for each of the five years in the file.
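
    If you only want to see those per-state topcodes, something like the following (ST is the PUMS state code; 'housing' is the same placeholder data frame) lists the largest distinct VALP values in each state:

        # Largest distinct house values per state; per the pattern above, in a
        # 5-year file these should mostly be the five year-specific topcodes
        tapply(housing$VALP, housing$ST, function(v)
          sort(unique(v[!is.na(v)]), decreasing = TRUE)[1:5])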

    John Grumbine
  • John,

    Thanks for your response. The set of households I've been trying to model is those with tenure (TEN) equal to either '1' or '2', meaning that the property is owned free and clear or is mortgaged. This is the owner-occupied set; it excludes renters, group quarters, and vacant units.

    You are correct - depending on the year, the top-coding values for each state are different. For example, in 2010, Alaska is top-coded at $800k but in 2014, Alaska is top-coded at $825k. Each record that is top-coded reports the average value of all the top-coded records within that state.
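
    For anyone reproducing this, the restriction is just a filter on the tenure field, and the state topcode then shows up as the largest VALP in the subset; a minimal sketch (with 'housing' as a placeholder data frame of PUMS housing records):

        # Owner-occupied units only: TEN equal to 1 or 2
        owners <- subset(housing, TEN %in% c(1, 2))

        # Largest reported value per state; for top-coded records this is the
        # per-state average of the actual values above the cutoff, as noted above
        tapply(owners$VALP, owners$ST, max, na.rm = TRUE)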
  • Ah ... that clears it up. The rented dwellings are lumped in with the vacant/GQ group for VALP. I added a "Tenure" dimension to my tabulation and proved it to myself.

    Thanks!