ACS PUMS data variable and (factor) value labels

Hello all, I'm an experienced, professional user and have been using and analyzing ACS micro-data for over a decade but suddenly I have a basic question. 

I always used IPUMS ACS micro-data, which come with a file to input the variable labels, factor/value labels for categorical variables, and format all variables. For various reasons, I have to use the Census PUMS for an upcoming project. I need the full PUMS data (all rows, all columns) for at least 2005 - 2019 and the 2000-2004 data if possible. I can analyze the data in Stata, R, or Python. But I can't find Census PUMS microdata with variable and value labels (factor labels in R lingo).

Does anyone either have a link for where to download these data or a script (in Stata, R, or Python) for labeling and formatting the microdata? 

For a more elaborate description of the problem, see below: 

The Census PUMS FTP site only gives data in .csv or SAS format. The .csv data have no variable labels and no value labels. The SAS data have variable labels but no value labels. I can download the SAS data, import them into Stata or use StatTransfer, and get variable labels, so I know that the variable "occp" is labeled "Occupation recode for 2018 and later based on 2018 OCC codes," but all the distinct values for occupation are saved as strings with no meaning, just "2545" and "4920" etc. To apply labels to these values, I would have to navigate to the Census ACS documentation page, download the "ACS 1-year PUMS Code Lists.xlsx" file, clean and then import each of the 15 or so worksheets into Stata, perform a one-to-many merge on, say, the occupation codes so that the meaning of each code is its own variable/column in the data, and then use the label utilities to attach those meanings as value labels in the original data. Then I could see immediately that, for instance, the occp code "2545" means "Teaching assistants." But I'd have to do this for many variables, for each year of the ACS 1-year PUMS data. This process is doable but incredibly time-consuming, and I'm sure that someone, somewhere has done this already.
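
To make the merge-and-label step concrete, here is a minimal Python sketch using only the standard library. Everything in it is an assumption for illustration: the long-format code list (variable, code, label), the tiny inline PUMS extract, the `_label` column suffix, and the helper names `load_code_lists` and `attach_labels` are all hypothetical, and the real code-list worksheets would need cleaning before they look like this. Only the "2545" label comes from the example above; "4920" is deliberately left out of the list to show how an unmatched code is handled.

```python
import csv
import io

# Hypothetical stand-in for one cleaned worksheet of the code-list workbook,
# exported to CSV in long format: variable name, code, label.
CODE_LIST_CSV = """variable,code,label
OCCP,2545,Teaching assistants
"""

# Hypothetical stand-in for a PUMS person file; real files have many more columns.
PUMS_CSV = """SERIALNO,OCCP
0001,2545
0002,4920
"""


def load_code_lists(handle):
    """Return {variable: {code: label}} from a long-format code list."""
    lookups = {}
    for row in csv.DictReader(handle):
        variable = row["variable"].strip().upper()
        # Codes stay as strings so leading zeros are not lost.
        lookups.setdefault(variable, {})[row["code"].strip()] = row["label"].strip()
    return lookups


def attach_labels(rows, lookups):
    """Add a <VAR>_label column for every variable that has a code list."""
    labeled = []
    for row in rows:
        out = dict(row)
        for variable, mapping in lookups.items():
            if variable in row:
                # Codes missing from the code list get an empty label.
                out[variable + "_label"] = mapping.get(row[variable].strip(), "")
        labeled.append(out)
    return labeled


lookups = load_code_lists(io.StringIO(CODE_LIST_CSV))
people = attach_labels(csv.DictReader(io.StringIO(PUMS_CSV)), lookups)
for person in people:
    print(person["SERIALNO"], person["OCCP"], repr(person["OCCP_label"]))
```

With a long-format code list, one pass covers every variable that appears in the list, so the per-year work would mostly be cleaning that year's worksheets into this shape.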

The other option seems to be to use the MDAT utility but this tool forces users to click a check box for EVERY variable to be included in the data sample. Given >500 distinct variables and 19 years of data, this process also seems unnecessarily time-consuming. 

The "tidycensus" package also doesn't seem to help. While it lets me download the microdata, it forces me to browse a separate data frame called "pums_variables" to look up variable and value labels rather than just attaching these labels to the actual microdata. It also doesn't have 2000-2004 data available. 

Can anyone help direct me either to a clean-and-ready -- i.e. fully labeled, all rows, all columns -- version of the 1-year ACS for 2000/5 - 2019 or have scripts for labeling the variables and values? 

Thank you very much in advance!!!

  • I've been where you are.  I've found no workaround if you absolutely must use the entire dataset. I've settled for putting in the time to use the .csv, convert it to xlsx, and run LOOKUP functions against the PUMS codes in the Census lists, creating a new column that is the "labeled" field in the PUMS dataset.  I do this knowing I'll likely be using it again down the road and that it will not be a one-off, so I justify that front-end work.  I do this only for the state in which I live and work, and usually only for the person records and not the housing records.

    I'm hoping someone chimes in with a shortcut solution.

  • Why can't you just go to https://usa.ipums.org/usa/, log in (register if you need to), browse and select data, and get any of the data sets that you need?

  • Hi Stas, thanks for the reply. I have an IPUMS account and have used it many times. The reason I need the Census PUMS is because I'm preparing to access the confidential ACS samples in an RDC. The confidential data will have the variable names, structure, etc. of the PUMS, not IPUMS. I'm preparing scripts to ready the data, do some exploratory/initial analysis and light model building/selection. If I have scripts ready, I can also hit the ground running within the RDC and hopefully save some time cleaning and prepping the confidential data. 

  • The "tidycensus" package also doesn't seem to help. While it lets me download the microdata, it forces me to browse a separate data frame called "pums_variables" to look up variable and value labels rather than just attaching these labels to the actual microdata. It also doesn't have 2000-2004 data available. 

    If you use the argument `recode = TRUE` in tidycensus's `get_pums()`, it'll attach value labels to your data as you desire.  See the example here: walker-data.com/.../introduction-to-census-microdata.html

  • I use PUMS data all the time.  I download it using the API with R. So far I have had to use the PDF PUMS codebooks and type the categories by hand.  You might be able to use an OCR program on the codebook.  Or if you have "real" Adobe Acrobat you should be able to export it as text.   I often cut and paste from the PDF, but most of the time the formatting gets messed up.  Sorry,

    Dave

  • Thank you! Can't believe I missed a simple option call. Going to try it.

  • Thank you everyone for your responses and advice. After digging into tidycensus and other options, I've decided to continue using IPUMS but accessing the source variables instead of the harmonized ones. It's still a lot of work to select, and there will still be cleaning costs, but it seems like the least costly option at this point. If I get around to it in the future, perhaps I'll convert the code files that IPUMS provides to format & label data to the version of the ACS which the Census provides. If that happens, I'll make sure to post about it here.