Merging the Population and Household PUMS data

Hi Group,

I'm working with the 5 year ACS PUMS data for the state of Indiana to calculate different variables which requires merging the population and household data. I'm working on these using python and SQL and was trying to merge the datasets using a left join (population data, household data) thus keeps all the records from population data and its matching ones from the household data using the serial no. Later If I calculate the variable the number I'm getting for households for different conditions are not matching with the numbers that I obtain from MDAT. Have anyone done similar joins using SQL or Python?

Thanks!

Parents Reply Children
  • Reading the household and population data in python

    pdata = pd.read_csv('psam_p18.csv')
    hdata = pd.read_csv('psam_h18.csv')

    Changing the serial number to string to avoid datatype mismatch when merging using serial no.

    pdata = pdata.astype({'SERIALNO':'string'})
    hdata = hdata.astype({'SERIALNO':'string'})

    Merging the datasets using left join to keep all the records from population data and matching records from housing data

    in_data = pd.merge(pdata, hdata, how = 'left', on = 'SERIALNO') 

    Now, calculating the number Total Households Age 55+ Living Alone

    a = in_data[(in_data.AGEP >= 55) & (in_data.HHT.isin([4, 6]))]

    a.WGTP.sum()

    The number I get is "184320" where as the number given by MDAT is "443428"