Hello all, I'm an experienced, professional user and have been using and analyzing ACS microdata for over a decade, but suddenly I have a basic question.
I always used IPUMS ACS micro-data, which come with a file to input the variable labels, factor/value labels for categorical variables, and format all variables. For various reasons, I have to use the Census PUMS for an upcoming project. I need the full PUMS data (all rows, all columns) for at least 2005 - 2019 and the 2000-2004 data if possible. I can analyze the data in Stata, R, or Python. But I can't find Census PUMS microdata with variable and value labels (factor labels in R lingo).
Does anyone either have a link for where to download these data or a script (in Stata, R, or Python) for labeling and formatting the microdata?
For a more elaborate description of the problem, see below:
The Census PUMS FTP site only gives data in .csv or SAS format. The .csv data have no variable labels and no value labels. The SAS data have variable labels but no value labels. I can download the SAS data, import it into Stata or use StatTransfer, and get variable labels, so that I know that the variable "occp" is labeled "Occupation recode for 2018 and later based on 2018 OCC codes", but all the distinct values for occupation are saved as strings with no meaning, just "2545" and "4920" etc. To apply labels to these values, I would have to navigate to the Census ACS documentation page, download the "ACS 1-year PUMS Code Lists.xlsx" file, clean and then import each of the 15 or so worksheets into Stata, perform a one-to-many merge on occupational codes so that I have the meaning of each, say, occupational code as its own variable/column in the data, then use the label utilities to attach those meanings as value labels in the original data. Thus I could see immediately that, for instance, the occp code "2545" means "Teaching assistants." But I'd have to do this for many variables, for each year of the ACS 1-year PUMS data. This process is doable but incredibly time-consuming, and I'm sure that someone, somewhere has done this already.
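For concreteness, here is a minimal sketch of the merge-and-label step for a single variable, in Python with pandas. It is an illustration under assumptions, not a tested pipeline: the helper `label_values`, the codebook column names `code` and `label`, and the toy rows are all hypothetical, and the label shown for "4920" is a placeholder rather than the real Census text.

```python
import pandas as pd


def label_values(series: pd.Series, codebook: pd.DataFrame) -> pd.Series:
    """Map raw codes to labels; codes missing from the codebook keep their raw code."""
    # Hypothetical codebook layout: one row per code, columns "code" and "label".
    mapping = dict(zip(codebook["code"], codebook["label"]))
    labeled = series.map(mapping).fillna(series)
    # A categorical is the pandas analogue of an R factor / Stata value label.
    return labeled.astype("category")


# Toy stand-in for a PUMS extract; codes are kept as strings so leading zeros survive.
pums = pd.DataFrame({"OCCP": ["2545", "4920", "2545", "0010"]})

# Toy stand-in for one cleaned worksheet of the code list.
occp_codebook = pd.DataFrame({
    "code": ["2545", "4920"],
    "label": ["Teaching assistants", "placeholder label for 4920"],
})

pums["OCCP_label"] = label_values(pums["OCCP"], occp_codebook)
print(pums)
```

Keeping the raw code for unmatched values (here "0010") makes gaps in the code list easy to spot instead of silently turning them into missing values.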
The other option seems to be to use the MDAT utility but this tool forces users to click a check box for EVERY variable to be included in the data sample. Given >500 distinct variables and 19 years of data, this process also seems unnecessarily time-consuming.
The "tidycensus" package also doesn't seem to help. While it lets me download the microdata, it forces me to browse a separate data frame called "pums_variables" to look up variable and value labels rather than just attaching these labels to the actual microdata. It also doesn't have 2000-2004 data available.
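If a long-format lookup table of that kind (one row per variable/value pair) can be exported to a file, attaching the labels to the microdata could look roughly like the sketch below, again in Python with pandas. The column names `var_code`, `value`, and `val_label`, the two helper functions, and the toy rows are assumptions for illustration; a real lookup table would need to be checked and renamed to match.

```python
import pandas as pd


def build_label_maps(dictionary: pd.DataFrame) -> dict:
    """Turn a long-format dictionary into {variable: {code: label}}."""
    maps = {}
    for var, grp in dictionary.groupby("var_code"):
        maps[var] = dict(zip(grp["value"], grp["val_label"]))
    return maps


def apply_labels(df: pd.DataFrame, maps: dict) -> pd.DataFrame:
    """Add a <VAR>_label column for every variable that has a label map."""
    out = df.copy()
    for var, mapping in maps.items():
        if var in out.columns:
            # Codes absent from the map become NaN in the label column.
            out[var + "_label"] = out[var].map(mapping).astype("category")
    return out


# Toy long-format dictionary (hypothetical layout, codes stored as strings).
dictionary = pd.DataFrame({
    "var_code": ["SEX", "SEX", "OCCP"],
    "value": ["1", "2", "2545"],
    "val_label": ["Male", "Female", "Teaching assistants"],
})

# Toy microdata; AGEP has no entry in the dictionary and is left untouched.
pums = pd.DataFrame({
    "SEX": ["1", "2", "2"],
    "OCCP": ["2545", "2545", "2545"],
    "AGEP": [34, 51, 27],
})

labeled = apply_labels(pums, build_label_maps(dictionary))
print(labeled)
```

Because the maps are built per variable, the same two functions could be rerun against each year's lookup table, which matters when code lists change between years.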
Can anyone help direct me either to a clean-and-ready -- i.e. fully labeled, all rows, all columns -- version of the 1-year ACS for 2000/5 - 2019 or to scripts for labeling the variables and values?
Thank you very much in advance!!!
Why can't you just go to https://usa.ipums.org/usa/, log in (register if you need to), browse and select data, and get any of the data sets that you need?
Hi Stas, thanks for the reply. I have an IPUMS account and have used it many times. The reason I need the Census PUMS is that I'm preparing to access the confidential ACS samples in an RDC. The confidential data will have the variable names, structure, etc. of the PUMS, not IPUMS. I'm preparing scripts to ready the data, do some exploratory/initial analysis and light model building/selection. If I have scripts ready, I can also hit the ground running within the RDC and hopefully save some time cleaning and prepping the confidential data.