Hi Group,
I'm working with the 5 year ACS PUMS data for the state of Indiana to calculate different variables which requires merging the population and household data. I'm working on these using python and SQL and was trying to merge the datasets using a left join (population data, household data) thus keeps all the records from population data and its matching ones from the household data using the serial no. Later If I calculate the variable the number I'm getting for households for different conditions are not matching with the numbers that I obtain from MDAT. Have anyone done similar joins using SQL or Python?
Thanks!
Can you post the code that precedes the joins and then the code for the joins itself? It would also help if the PUMS variables are explicitly listed. Thank you
Reading the household and population data in python
pdata = pd.read_csv('psam_p18.csv')hdata = pd.read_csv('psam_h18.csv')
Changing the serial number to string to avoid datatype mismatch when merging using serial no.
pdata = pdata.astype({'SERIALNO':'string'})hdata = hdata.astype({'SERIALNO':'string'})
Merging the datasets using left join to keep all the records from population data and matching records from housing data
in_data = pd.merge(pdata, hdata, how = 'left', on = 'SERIALNO')
Now, calculating the number Total Households Age 55+ Living Alone
a = in_data[(in_data.AGEP >= 55) & (in_data.HHT.isin([4, 6]))]
a.WGTP.sum()
The number I get is "184320" where as the number given by MDAT is "443428"
You could look at the IPUMS-USA data at the Minnesota Population Center, University of Minnesota. From their FAQs "IPUMS is not a collection of compiled statistics; it is composed of microdata. Each record is a person, with all characteristics numerically coded. In most samples persons are organized into households, making it possible to study the characteristics of people in the context of their families or other co-residents."