Question concerning merging person and household files in R

Hello, I am interested in joining person and household data using the SERIALNO variable from the 2019 5-year ACS. Here is my code for this action using RStudio Version 1.4.1106:

# Load Person and Household data for 2019 ACS

pums2019pop <- read.csv("C:\\psam_p48.csv")
pums2019hsg <- read.csv("C:\\psam_h48.csv")

# Join person and household data using SERIALNO
pums2019 <- inner_join(pums2017pop, pums2017hsg, by = c("SERIALNO"))

Once I have completed this action, I check the output for pums2019 and immediately notice that some variables such as PUMA, DIVISION, and ADJINC, have ".x" appended to the end of their names so that they now appear as: "PUMA.x", "DIVISION.x", and "ADJINC.x". 

While I can work with the file without issue, I just have to spell out the variable name with the appended ".x", I am wondering if there is anyway to keep this from happening in the first place? 

Any help would be appreciated. 

Thank you. 

Parents
  • Hi - this is happening because the housing and person records have some variables in common, and `dplyr::inner_join()` will not overwrite values in one dataset with values in another dataset.

    Two simple ways to handle this are:

    1. You can include all the common variables in the join-by vector, even though they're not needed to match the cases:
      1. `pums2019 <- inner_join(pums2017pop, pums2017hsg, by = c("SERIALNO", "PUMA", "DIVISION", [others here]))`
    2. Or you can drop the common variables from one file before the join:
      1. `pums2019 <- inner_join(pums2017pop, select(pums2017hsg, -PUMA, -DIVISION, [others here]), by = "SERIALNO")`

    Good luck!

Reply
  • Hi - this is happening because the housing and person records have some variables in common, and `dplyr::inner_join()` will not overwrite values in one dataset with values in another dataset.

    Two simple ways to handle this are:

    1. You can include all the common variables in the join-by vector, even though they're not needed to match the cases:
      1. `pums2019 <- inner_join(pums2017pop, pums2017hsg, by = c("SERIALNO", "PUMA", "DIVISION", [others here]))`
    2. Or you can drop the common variables from one file before the join:
      1. `pums2019 <- inner_join(pums2017pop, select(pums2017hsg, -PUMA, -DIVISION, [others here]), by = "SERIALNO")`

    Good luck!

Children
No Data