How to translate STATA code to R (creating flags and new variables)

Hello,I have the following STATA code that I am needing to translate over to R (this code uses the 2019 5-year ACS PUMS file):

/*Create a flag to identify workers in the household*/
gen worker=(wkl=="1")
replace worker=. if wkl==""

/*Generate a variable representing number of workers in the household*/
by serialno, sort: egen wihh=sum(worker)

Where I am coming up blank is in the following places:

1. How do I tell R to create a flag? I presume this is different than creating a new variable? and

2. Generating a variable using SERIALNO and sorting by a new variable (WIHH) using a sum function referring to the number of worker flag previously created. 

For example, I can create a new variable using 2019PUMS$worker <- recode(2019PUMS$WKL, recodes = "1=1, else = NA", as.factor = T), but I am only creating a new variable here and I presume this is not the same as creating a flag. 

Any and all help on this would be appreciated.  

Thank you. 

Parents
  • Sounds like you might be pretty new to R -- if so, I highly recommend this free, online book: https://r4ds.had.co.nz/.

    For your specific questions, I think the easiest would be to create a new variable that is TRUE if the person is a worker and FALSE if the person is not a worker. Then for each household (SERIALNO), you can add up the rows that are TRUE. Conveniently, TRUE is 1 and FALSE is 0, so if you add up 2 TRUEs and 1 FALSE you will get 2. Here is some code that downloads PUMS data and creates those two new variables.


    library(tidyverse)
    library(tidycensus)

    vt_pums <- get_pums(
      state = "VT",
      variables = c("WKL")
      )
    #> Getting data from the 2015-2019 5-year ACS Public Use Microdata Sample
    vt_pums %>%
      mutate(worker = WKL == 1) %>%
      group_by(SERIALNO) %>%
      mutate(whih = sum(worker))
    #> # A tibble: 32,056 x 8
    #> # Groups:   SERIALNO [15,467]
    #>    SERIALNO      SPORDER  WGTP PWGTP ST    WKL   worker  whih
    #>    <chr>           <dbl> <dbl> <dbl> <chr> <chr> <lgl>  <int>
    #>  1 2015000002940       1     9     8 50    1     TRUE       2
    #>  2 2015000002940       2     9     6 50    1     TRUE       2
    #>  3 2015000004368       1    16    16 50    3     FALSE      1
    #>  4 2015000004368       2    16    16 50    1     TRUE       1
    #>  5 2015000004368       3    16    14 50    b     FALSE      1
    #>  6 2015000004368       4    16    13 50    b     FALSE      1
    #>  7 2015000005987       1     4     4 50    1     TRUE       1
    #>  8 2015000006748       1     9     9 50    1     TRUE       2
    #>  9 2015000006748       2     9    10 50    1     TRUE       2
    #> 10 2015000007235       1    25    25 50    3     FALSE      0
    #> # ... with 32,046 more rows

  • small typoe -- that's wihh (workers in HH), not whih. Other than that, what Matt posted does the job.

    There are multiple ways to get the thing done in R (and FWIW there were multiple ways to do that in Stata). And there are multiple cracks in both the original code and in 's code that tiny things where fidelity of the code and the data may cause minor issues.

    1. The concept of a "flag" is that it is a 0/1 variable indicating a certain logical condition. Stata does not have the explicit logical type, but it internally converts true statements (2==2) to the value of 1, and false statements (0==1) to 0. Conversely, 0 values are understood as false and nonzero values, as true. So if you want to create a variable that identifies a subset, the above code does exactly that. As Matt said, in R, you would want to use the explicitly defined logical format for that.

    2. Matt's code does not reproduce creating missing values for the situation when WKL is missing, but it is likely that this is being picked up in the WKL==1 logical condition.

    3. In fact, WKL appears to be a string variable in Stata, but Matt treats it as a numeric variable, so that PUMS data set may have been read differently in Stata and R.

    4. I would have to assume that get_pums() picks up all the technical variables by default, including SERIALNO, the household ID. This variable was not specified in the list to read variables=c("WKL"), but appears nevertheless in the data set, along with the weights and such.

    5. This is a style issue -- I would have ungroup() in the end of the pipe, getting your grouping variables in the summary data sets sometimes becomes annoying.

    6. In R, you could have skipped the whole step of creating a separate worker variable and just go with
    vt_pums %>% group_by(SERIALNO) %>% mutate(wihh = sum(WKL==1))... and likewise in Stata, you could simply
    egen wihh = total( WKL=="1"), by(SERIALNO). Actually, the code that is there is quite weird and not quite Stata-style; the person who wrote it was confusing the sum() function of creating the running sum, and the total() sub-function of egen command that creates the groupwise totals. So that would raise a brow with me, frankly.

    It looks like prefers the base R syntax; the group operations in the base R must be weird, as I have to admit I never bothered going back to the base R for data management. Also there are at least two recode functions, car::recode() and dplyr::recode(), it looks like the former was used.

  • Thanks for all the details ! My code was admittedly a bit of a quick gloss Slight smile

  • well yeah it does the job but then you find that 500 people are using it as the golden standard written in stone. Production code would check the variable types and ranges and missing values and testthat everything ;).

    Now that I looked at it again: WKL is indeed character out of get_pums(): it is labeled as such in the tibble output, and there are offending non-numeric values like "b". But its type is getting coerced with comparison to 1, which is really below the belt, and would be a source of errors that are very difficult to debug. Stata would have refused to compare an integer to a character... SAS would probably have swallowed it.

  • Thank you both , this was easy to understand. If I may ask an additional question. When running the following code(s):

    pumadata %>% mutate(worker = WKL == 1) %>%
    group_by(SERIALNO) %>%
    mutate(wihh = sum(worker))

    OR

    pumadata %>% group_by(SERIALNO) %>%
    mutate(wihh = sum(WKL == 1))

    I get the following output:

    # A tibble: 10,154 x 25
    # Groups: SERIALNO [4,746]
    SERIALNO AGEP PWGTP RELP SCH SCHG ST PUMA WAGP WKL wihh
    <dbl> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
    1 2.01e12 37 17 0 1 NA 48 3602 26000 1 NA
    2 2.01e12 18 11 2 2 14 48 3602 1500 1 NA
    3 2.01e12 14 28 2 2 11 48 3602 NA NA NA
    4 2.01e12 5 19 2 1 NA 48 3602 NA NA NA
    5 2.01e12 45 1 0 1 NA 48 3602 54000 1 1
    6 2.01e12 54 2 1 1 NA 48 3602 0 3 1
    7 2.01e12 47 18 0 1 NA 48 3602 0 3 0
    8 2.01e12 39 4 16 1 NA 48 3602 0 3 0
    9 2.01e12 69 14 0 1 NA 48 3602 0 1 1
    10 2.01e12 68 13 1 1 NA 48 3602 0 3 1

    So far, so good. However, when I look at the dataframe (pumadata then CTRL+ENTER), I see that the wihh variable is missing. When I've recoded variables in the past, using "recode" from base R, it will add the new variable as a column into the dataframe. Am I doing something wrong?

  • We can't tell anything from your output. At the very least, it needs to show the SERIALNO with all the digits so that we could see who is in the same HH. The sum() function would probably benefit from na.rm=TRUE.

Reply Children
  • Thank you for your reply. I went ahead and added the na.rm=TRUE to the code and added -> dataframe to the end and the wihh variable is now appended to the datarame:

    pumadata %>% group_by(SERIALNO) %>%
    mutate(wihh = sum(WKL == 1, na.rm=TRUE)) -> pumadata

    I exported it to csv and here are the results:

    SERIALNO AGEP PWGTP RELP SCH SCHG ST WKL hhsize wihh
    1 2013000000018 37 17 0 1 NA 48 1 4 2
    2 2013000000018 18 11 2 2 14 48 1 4 2
    3 2013000000018 14 28 2 2 11 48 NA 4 2
    4 2013000000018 5 19 2 1 NA 48 NA 4 2