Beginner seeks advice on PUMS returns: Are R and tidycensus the best route?

I am a new user. My task is nothing new: I've seen the statistics I'm after, broken out by quintiles or quartiles, in a few reports, but figures from the latest ACS would be best. I searched the topics on this forum and didn't see a specific answer for my quintile task.

Before I even start my first use of R, tidycensus, and the like, I concluded that I should ask the community first, as advised by Census staff. As your time is valuable, I will highly appreciate any advice you might offer to get me on the correct path on this project. For background, the project is research on changing U.S. and other national housing inventories (so housing is not so expensive for average earners). I am a former appointed government official for a regional government (Cook County, Illinois, home to Chicago) and worked on political campaigns, with a few mentions on Wikipedia, for references to my previous work.

Other than significantly altering the housing inventory over the next few decades, I am not trying to do much, at least as far as a census query goes, I hope. For now, I only need a few ACS PUMS variables, VALP for example, broken out by numeric ranges such as the income figures in Table B19080. It would be great to have the low-to-high range and the median for each U.S. quintile. If needed, I will add up states, regions, etc. to get U.S. totals, or use a U.S. total for metro regions and explain it as such, depending on what PUMS data can return.

I planned to use R and tidycensus to run queries for VALP, for example five queries, one for each quintile's numeric range from Table B19080.

Would you recommend R?

And tidycensus?

Both were recommended by Corey Sparks of the Univ. of Texas at San Antonio, who touted them in a census.gov webinar in April 2022.

How much time would a paid package like Stata save me? I haven't yet asked for a SAS price quote. Stata's cost is $840 for their beginner service. Would that Stata package do the job? Or would I still have to expend a fair amount of effort in Stata to get the data returns?

I would just as soon spend some extra time on R, if you advise that it won't take me forever. I hope that knowing more about R, tidycensus, and other such options from GitHub and the like would be beneficial in the long run. It would also save me the 800 bucks, so I could support my own time and chip in a few bucks to support others, like Kyle Walker with his ongoing tidycensus work.

As a bumbling beginner, I welcome any advice. Thank you.

Troy Deckert

  •  there is no good answer to that.

    If you are happy with the tables that you can download from data.census.gov, maybe you could end up manipulating them happily in Excel. This is typically an awful tool for anything data-related, but everybody has access to it, and everybody knows at least some of it.

    Any of Stata, R or Python (specifically the data frame package pandas) would be a more professional way to handle that. Using the proper scripted code in any of these would help you with reproducibility of your analyses. Some of the tasks that would take you painful VLOOKUP programming in Excel are easily solved by the native merge (Stata lingo) / join (pandas, R and SQL lingo) functionality in these -- and with most Census data manipulation tasks, there will be a lot of that kind of stuff happening.
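
    To make the merge / join point concrete, here is a minimal sketch in plain Python (no extra packages, so it runs anywhere). The state FIPS codes are real; the house values and the `left_join` helper are made up for illustration. In pandas the same operation would be a call to `DataFrame.merge`.

```python
# Lookup table: state FIPS code -> state name (real FIPS codes).
state_names = {"17": "Illinois", "48": "Texas"}

# Toy rows standing in for downloaded Census data; the values are invented.
rows = [
    {"state": "17", "median_value": 250000},
    {"state": "48", "median_value": 230000},
    {"state": "99", "median_value": 100000},  # deliberately has no match
]

def left_join(rows, lookup, key, new_field):
    """Attach lookup[row[key]] to every row, keeping unmatched rows (a left join)."""
    return [{**row, new_field: lookup.get(row[key])} for row in rows]

joined = left_join(rows, state_names, key="state", new_field="state_name")
for row in joined:
    print(row)
```

    This is the job a VLOOKUP column does in Excel, except that the unmatched row is kept and flagged with `None` instead of silently producing an error cell.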

    If you end up needing to work with microdata because the Census Bureau does not tabulate the stuff that you need across the variables you want, you'd have to download the Public Use Microdata Sample (PUMS) files -- and again, to deal with these properly, you really have to use one of the statistical packages. If at the moment you don't know any of these, then it basically does not matter which one you start learning. I would argue that better educational materials are available for R; but other people can reasonably argue that you can get to the actual data analysis faster in Stata (and with R and Python, you end up paying with your own time, at least while you are on a learning curve; my personal mix is 70% R, 20% Stata and 10% Python). Don't bother with SAS.

    None of the packages would have a "Get me the Census data and produce the table that I want" button to click; with any of these packages, you would have to combine the Lego blocks they make available to you to get the results you need. In any of R, Stata or Python, producing a table of house values by income quintiles would be about 15-20 lines of code (load the microdata / request the tables from the API, define the geographies, define the quintiles, run the analysis / do some joins, possibly format the output table).
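
    As a rough sketch of those steps, here is what the "define the quintiles, run the analysis" part could look like in plain Python. HINCP (household income), VALP (property value) and WGTP (household weight) are real PUMS variable names, but every number below is invented, and `weighted_quantile` is a hypothetical helper, not a function from any Census package. With real data you would load the PUMS file first and drop records where VALP is blank.

```python
from bisect import bisect_left

# Toy household records: (HINCP income, VALP home value, WGTP household weight).
# All numbers are invented for illustration.
households = [
    (18000, 90000, 120), (32000, 140000, 80), (45000, 160000, 100),
    (58000, 210000, 100), (67000, 230000, 110), (82000, 300000, 90),
    (95000, 320000, 130), (120000, 450000, 70), (150000, 520000, 140),
    (240000, 900000, 60),
]

def weighted_quantile(pairs, k, n):
    """Smallest value at which the cumulative weight reaches k/n of the total."""
    ordered = sorted(pairs)
    total = sum(w for _, w in ordered)
    running = 0
    for value, w in ordered:
        running += w
        if running * n >= k * total:  # integer comparison avoids rounding issues
            return value
    return ordered[-1][0]

# Step 1: weighted income cutoffs, i.e. the upper limits of quintiles 1 to 4.
income_pairs = [(inc, w) for inc, _, w in households]
cutoffs = [weighted_quantile(income_pairs, k, 5) for k in range(1, 5)]

# Step 2: assign each household to a quintile by its income.
groups = {q: [] for q in range(1, 6)}
for inc, val, w in households:
    groups[bisect_left(cutoffs, inc) + 1].append((val, w))

# Step 3: low, weighted median and high home value within each quintile.
table = {}
for q, pairs in groups.items():
    values = [v for v, _ in pairs]
    table[q] = (min(values), weighted_quantile(pairs, 1, 2), max(values))
    print(q, *table[q])
```

    The weights matter: each PUMS record stands for many households, so cutoffs and medians computed without WGTP would describe the sample rather than the population. Instead of computing the cutoffs from the microdata, you could also plug in the published upper limits from Table B19080.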

    With ChatGPT, you can get fairly close to workable code for generic data manipulation in a focused chat, especially with R and Python as the patterns for these are well wired into ChatGPT "brain" from crawling the massive GitHub repositories -- and Stata isn't nearly as strongly present there so ChatGPT fails a significant portion of the time to write good Stata code. I doubt ChatGPT would have good knowledge of tidycensus, though; it's just too obscure.

    FWIW you seem to use words like "number", "query", "parameter", "value", "return" differently than most statisticians would. Maybe that's the political campaigning dialect of English ;).

  • Hi Troy!  Thanks for considering tidycensus.  One advantage of going the R / tidycensus route is that Matt and I have built tooling to help with some of the more challenging aspects of working with PUMS data, such as getting replicate weights / calculating standard errors.  
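
    For readers wondering what the replicate-weight calculation involves, here is a minimal Python sketch of the standard error formula the Census Bureau documents for ACS PUMS: recompute the estimate with each of the 80 replicate weights and take sqrt(4/80 × the sum of squared differences from the full-sample estimate). This is not tidycensus code; the function name and the numbers are made up for illustration.

```python
import math

def sdr_standard_error(full_estimate, replicate_estimates):
    """ACS-style standard error: sqrt(4/80 * sum of squared deviations)."""
    if len(replicate_estimates) != 80:
        raise ValueError("ACS PUMS ships 80 replicate weights per record")
    squared = sum((r - full_estimate) ** 2 for r in replicate_estimates)
    return math.sqrt(4 / 80 * squared)

# Invented example: the estimate from the main weight, plus 80 estimates
# recomputed with the replicate weights (half 1,000 above, half 1,000 below).
full = 250000
replicates = [full + 1000] * 40 + [full - 1000] * 40
se = sdr_standard_error(full, replicates)
print(round(se, 2))
```

    In practice each replicate estimate would be the same statistic (a median home value, say) recomputed with WGTP1 through WGTP80 in place of WGTP.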

    I don't believe there is a time savings to using Stata unless you already know how to use Stata.  Since the introduction of the tidyverse framework, I find the learning curve for Stata and R to be comparable.  

    For getting started with PUMS data in R as an absolute beginner, I'd recommend the following sequence:

    You might also look at IPUMS USA which has a really smooth interface for data downloads and comprehensive documentation.  You'd still need Stata or R to analyze the data, but IPUMS has a number of tutorials to help with that too.  Good luck getting started!

  • it's funny, I've tried a few times to get ChatGPT to write tidycensus code.  It's almost always wrong, but in a way that looks right (for example, it makes up functions and Census variables that don't exist) which would be super confusing to someone unfamiliar with the package.  

  • Hi Troy - this response may be too late to be helpful to you, but I always suggest folks try swirl to start learning to use R (https://swirlstats.com/students.html). It walks you through using R and there are many courses available now (http://swirlstats.com/scn/title.html). 'Getting and Cleaning Data' is really good! 

  • Thank you for that information. I look forward to using this in the future. It's great to know that more is available. Thank you very much for responding. I know more about publishing and policy (www.publicpolicypress.com) than I do about IPUMS, so if you have questions in those areas, I'll be happy to respond. But I'm trying.