Hello all,I'm trying to put together some state-level estimates of household eligibility for the American Connectivity Program. The eligibility requirements (sheet 2)are essentially the same as for Lifeline, but the income to poverty ration is 200%, rather than 135. I'm running this analysis for all states, but am just focusing on Alabama in this example. The estimate of eligible households produced (AL.pums) is much lower than expected compared to 2021 eligibility estimates in this dataset from USAC (sheet 1) (USAC manages Lifeline). I realize this is a really specific issue, but since this is my first time working with the PUMS data I have a feeling I may have made an error along the way. Thanks in advance for any help with this!
library(tidyverse)library(tidycensus)census_api_key(Sys.getenv("CENSUS_API_KEY"))pums_vars <- c("HINS4", "FS", "PAP", "SSIP", "POVPIP")all_pums <- get_pums(variables = pums_vars, state = "AL", year = 2020, survey = "acs5")AL.pums <- all_pums %>% #filter records that meet eligibility requirements subset(HINS4 == 1 | FS == 1 | PAP == 1:30000 | SSIP == 1:30000 | POVPIP == 0:200) %>% subset(SPORDER == 1) %>% #retain just one record per householdsummarize(hh_eligible = sum(WGTP)) #calculate total number of households eligible
Dear Christine,
MAKING PROGRESS
I was able to reproduce the AL row in the Lifeline spreadsheet. The value is 626087 I used the 2021 1-year PUMS. I added SPORDER==1 to the code posted earlier This code here computes with only SPORDER==1 person records and with all person records If you can tell me now to paste in the code like you did so it looks OK then I'll post it like you did. If you send an email to info@dorerfoundation.org I'll email the code as an attachment.
I don't use the %>% format in my code but I believe if you add | SPORDER == 1 to the first subset line and eliminate the second subset line you will get the right thing. I'm a long time FORTRAN programmer so I still use loops with an integer index. The other issue is that you used 2020 5 year data and the spreadsheet uses 2021 1-year PUMS data. You might also ask why USAC Lifeline people didn't "|" across all members of the household. The application seems to say that if someone in the household is eligible then the entire household is eligible. So if anyone receives Medicaid then the household is eligible. The ACS has the concept of a single Householder (person 1 on the form). If there is a husband and wife for example and the wife fills out the form as "person 1" and the husband is on Medicaid then the ACS based calculation with SPORDER==1 is wrong and will count the household as NOT eligible. This makes the spreadsheet computed participation rate too high. They should be using the denominator of 774694 (for AL 2021 - 1 year data) which looks at all persons in the household.
Another thing is that the csv and API downloaded data have a different rule for missing values. The csv files have a ",," blank field for missing. The API cannot handle a "blank" data field so the API puts in a value. The rule that a Census person posted on the forum is lowest possible value -1 for missing with the caveat of "usually," This detail does not appear in the PUMS pdf codebook. Your code with POVPIP==0:200 should take care of this in the API case but I'm not sure. Also remember to use a 135 limit for POVPIP to match the Lifeline spreadsheet.
Reference - Lifeline Federal Regulations giving qualification for Lifeline: (uses 135% of poverty GUIDELINE)
https://www.law.cornell.edu/cfr/text/47/54.409
"The consumer, one or more of the consumer's dependents, or the consumer's household must receive benefits from one of the following federal assistance programs: Medicaid; Supplemental Nutrition Assistance Program; Supplemental Security Income; Federal Public Housing Assistance; or Veterans and Survivors Pension Benefit."
ACP (Affordable Connectivity Program (uses 200% of Poverty GUIDELINE (Not Poverty Threshold as does the ACS)
https://www.law.cornell.edu/cfr/text/47/54.1805
Both Lifeline and ACP use "Family"
Household and income are defined here:
https://www.law.cornell.edu/cfr/text/47/54.400#f
## USAC lifeline_pums_analysis_021823.R ## R function to compute number of eligible Lifeline Households from ACS PUMS data## www.usac.org/.../LI_Application_NVstates.pdf## By David J Dorer, Ph.D.## copyright 2023 by Dorer Community Service Foundation Inc of Massachusetts## info@dorerfoundation.org## v1.0 djd 16 Feb 2023 12:40 # v1.1 djd 18 Feb 2023 13:38 added group quarters eligible persons calculation# v1.2 djd 19 Feb 2023 18:43 added calculation using only person records with SPORDER==1### z1<-usac(vintage=2021,period=1)# z2<-usac(vintage=2020,period=1);# z3<-usac(vintage=2020,period=5);# z4<-usac(vintage=2021,period=5);# z5<-usac(vintage=2021,period=1,state="MA",fips="25");
usac<-function(vintage=2021,state="AL",poverty.level=135,fips="01",period=1) {## ARGUMENTS:## vintage: PUMS vintage# period: PUMS period# state: state abbreviation# fips: state FIPS Code# poverty.level: poverty threshold to use in calculation## VALUE:# # list with## households: number of eligible households in state
# householders: number of eligible householders (SPORDER==1) records only# vintage: argument# period: argument# url.person: Census FTP site url for PUMS Person records# url.house: Census FTP site url for PUMS House records# state: argument# fips argument# data: Derived data.frame used to make calculation.#
state<-tolower(state);
error<-0; note<-paste0(vintage," ",period,"- PUMS"); msg<-paste("State=",state,"fips=",fips,"PUMS vintage=",vintage,"period=",period);
note<-paste("vintage: ",vintage,"period: ",period," year PUMS"); urlh<-paste0(""/",period,"-Year/csv_h",state,".zip");">www2.census.gov/.../csv_h",state,".zip"); urlp<-paste0(""/",period,"-Year/csv_p",state,".zip");">www2.census.gov/.../csv_p",state,".zip");
if((vintage==2020)&(period==1)) { # because of covid 2020 1 year PUMS was released with experimental weights urlh<-paste0("".zip");">www2.census.gov/.../csv_h",state,".zip"); urlp<-paste0("".zip");">www2.census.gov/.../csv_p",state,".zip"); }; # if
# note this code downloads zip file to your "working directory" use getwd() to see the directory/folder vh<-try(download.file(urlh,"pumsh.zip")); # housing records from FTP site vp<-try(download.file(urlp,"pumsp.zip")); # person records from FTP site
if(class(vh)[1]=="try-error") { msg<-c(msg,paste0("ERROR downloading housing file url=",urlh,"vintage=",vintage,"period=",period)); error<-error+1; };
if(class(vp)[1]=="try-error") { msg<-c(msg,paste0("ERROR downloading person file url=",urlp)); error<-error+1; }; if(error>0) {cat("usac: ERRORS\n");return(msg); };
hfile<-paste0("psam_h",fips,".csv"); pfile<-paste0("psam_p",fips,".csv");
house<-try(read.csv(unz("pumsh.zip",hfile))); # load housing records directly from zip file person<-try(read.csv(unz("pumsp.zip",pfile))); # load person records directly from zip file
if(class(house)[1]=="try-error") { msg<-c(msg,paste0("ERROR reading zip archive: pumsh.zip file=",hfile)) error<-error+1; };
if(class(person)[1]=="try-error") { msg<-c(msg,paste0("ERROR reading zip archive: pumsp.zip file=",pfile)) error<-error+1; };
if(error>0) {cat("usac: ERRORS\n");return(msg); };
# trim down house and person to the variables of interest hvar0<-c("SERIALNO","TYPEHUGQ","WGTP","FS"); hvar<-c("TYPEHUGQ","WGTP","FS"); pvar<-c("SERIALNO","SPORDER","HINS4","PAP","SSIP","POVPIP","PWGTP"); house<-house[,hvar0]; cat("\nsummary house data.frame=\n");print(summary(house)); person<-person[,pvar];
# merge person and housing records m<-match(as.character(person$SERIALNO),as.character(house$SERIALNO));
# data.frame like the on returned by tidycensus function get_pums() ??# note tidycensus get_pums() appears to use the API. # The API cannot handle blank fields, missing is lowest value - 1 (typically)
dat<-data.frame(person,house[m,c("TYPEHUGQ","WGTP","FS")]); cat("\nWGTP==0 by TYPEHUGQ\n");print(table(wgtp0=addNA(house$WGTP==0),typehq=addNA(as.factor(house$TYPEHUGQ))));
cat("\nusac: summary dat merged person/house data.frame:\n");print(summary(dat)); cat("\nusac: summary dat merged person/house data.frame with house weight (WGTP)==0:\n");print(summary(dat[dat$WGTP==0,]));
# create logical variables in the merged "dat" data.frame pov<-dat$POVPIP;pov[pov<0]<-NA; pov[pov<=poverty.level]<-1; # <= poverty.level argument poverty level clause pov3<-rep(NA,length(pov)); pov3[pov<0]<-3;pov3[is.na(pov)]<-3; flag<-(pov>=0)|(pov<=poverty.level);flag[is.na(flag)]<-FALSE;pov3[flag]<-1; flag<-pov>poverty.level;flag[is.na(flag)]<-FALSE;pov3[flag]<-2; pov3f<-factor(pov3,levels=c(1,2,3),c("below","above","undefined")); pov[pov>poverty.level]<-0; sporder<-dat$SPORDER;
ssip<-dat$SSIP; ssip[is.na(ssip)]<-0; # Receive Supplemental Social Security Income SSI >= 1 otherwise 0
pap<-dat$PAP; pap[is.na(pap)]<-0; # Public Assistance Income missing value set to zero
# HINS4 Health Insurance Medicade etc ==1 TRUE other wise no missing hins<-dat$HINS4; hins[is.na(hins)]<-0; # result 1==receives Medicaid etc. 0==missing 2==no # FS Receives Food Stamps/SNAP == 1 (yes) == 2 (no) no missing values cat("summary of PAP>0\n"); print(summary(dat$PAP[dat$PAP>0])); cat("summary of pap>0\n"); print(summary(pap[pap>0])); cat("summary PAP==0/pap==0\n");print(table(PAP0=addNA(dat$PAP==0),pap0=addNA(pap==0))); cat("summary of SSIP>0\n"); print(summary(dat$SSIP[dat$SSIP>0])); cat("summary of ssip>0\n"); print(summary(ssip[ssip>0])); cat("summary SSIP==0/ssip==0\n");print(table(SSIP00=addNA(dat$SSIP==0),ssip0=addNA(ssip==0)));
# some defensive coding fs<-dat$FS==1; fs[is.na(fs)]<-0; fs[fs==2]<-0; # recode 1=receives SNAP yes, 0 missing or no cat("summary of FS/fs\n"); print(table(addNA(as.factor(fs)),addNA(as.factor(dat$FS))));
life<-(fs==1)|(pov3f=="below")|(pap>0)|(ssip>0)|(hins==1); # logic at household person record level
# no missing values for all derived variables including "life" dfs<-data.frame(dat,pov3f,pap,hins,fs,ssip,lifeline=life); # data.frame to make sure everything aligns and check logic
cat("\nusac: Summary of derived per person data frame (dat with derived variables)\n");print(summary(dfs));
cat("\nusac: Summary of derived per person (for households) data frame with POVPIP missing.:\n"); print(summary(dfs[is.na(dfs$POVPIP),])); # check logic on pov variable POVPIP missing pov set to 0 (no)
serialH<-sort(unique(as.character(dat$SERIALNO))); # unique household SERIALNO in dat data.frame
mhp<-match(house$SERIALNO,serialH); # look for records in house data.frame that don't match records in dat data.frame multi.person<-house[is.na(mhp),]; cat("\nusac: PUMS Housing Records that match multiple SERIALNO id's from Person Records\n"); cat(" should be all Household records)\n");# all records have TYPEHUGQ set to 1 i.e. household records. print(summary(multi.person));
# now apply | (or) logic across values of life for all persons in a household## application form that lifeline applicants fill out# www.usac.org/.../LI_Application_NVstates.pdf# This application seems to indicate that the Household is elgible in anyone individual personin the househould# is eligible not just the ACS SPORDER==1 (Head of Household -- ACS Questionnaire Person 1)# so the calculation should OR across all person records in the Household# N<-length(serialH); # loop over serialH (unique dat$SERIALNO values) lifeline1<-lifeline<-rep(0,N); pweight<-rep(NA,N);
for(i in seq(1,N)) { if(!(i%%2000)) cat("i=",i,"\n"); housei<-serialH[i]==dat$SERIALNO; lifeline[i]<-any(life[housei]); lifeline1[i]<-any(life[housei&(sporder==1)]); pweight[i]<-sum(person$PWGTP[serialH[i]==person$SERIALNO]); # sum person weights }; # for i weight<-dat$WGTP[match(serialH,dat$SERIALNO)]; weight2<-house$WGTP[match(serialH,as.character(house$SERIALNO))];
dataH<-data.frame(SERIALNO=serialH,weight,pweight,lifeline,lifeline1,house[match(serialH,house$SERIALNO),]); cat("\nusac: compare weights: weight-weight2\n");print(summary(weight-weight2)); cat("\nsummary dataH: \n");print(summary(dataH)); cat("\nsummary 0 weights \n");print(summary(dataH[dataH$weight==0,]));
# number of qualified households cat("\nsummary dataH: for weight==0\n");print(summary(dataH[dataH$weight==0,])); qhouse<-sum(dataH$weight[dataH$lifeline==1],na.rm=TRUE); # any person in Household qualifed make Household qualified qhouse1<-sum(dataH$weight[dataH$lifeline1==1],na.rm=TRUE); # qualified Household based on person SPORDER==1 qperson<-sum(dataH$pweight[dataH$TYPEHUGQ!=1]); # qualified Group Quarters persons (sum person weights) qhouse2<-sum(dataH$weight[(dataH$lifeline==1)&(dataH$TYPEHUGQ==1)],na.rm=TRUE);
cat("\nLifeline/TYPEHUGQ=\n");print(table(H.type=addNA(as.factor(dataH$TYPEHUGQ)),lifeline=addNA(as.factor(dataH$lifeline)))); cat("summary(dataH: for Group Quarters=\n");print(summary(dataH[dataH$TYPEHUGQ!=1,])); cat("\nsummary qualified households\n");print(summary(dataH[dataH$lifeline==1,]));
cat("\nusac: Results: ",date(),note,"\n"); cat("usac: Number of households/group.quarters qualified for Lifeline Households=",qhouse,"\n"); cat("usac: Number of households qualified for Lifeline Households=",qhouse2,"\n"); cat("usac: Number of households qualified for Lifeline based on Householder=",qhouse1,"\n"); cat("usac: Number of Group.Quarters persons qualified for Lifeline Households=",qperson,"\n");
rtn<-list(households=qhouse2,householders=qhouse1,personsGQ=qperson,includeGQ=qhouse ,poverty.level=poverty.level,state=toupper(state),fips=fips,note=note,error=error ,url.house=urlh,url.person=urlp,vintage=vintage,period=period,data.person=dfs,data.house=dataH); invisible(rtn);}; # usac
## END ##
Hi David,
I really appreciate all your effort on this! After trying out several versions of the same datasets (via tidycensus API, downloaded from FTP site, and downloaded from iPUMS), I finally got this figured out. I also realized that the SPORDER == 1 would limit the count to householder information only. I also noticed in your code the TYPEHUGQ field. Once I included that field AND properly dealt with NA values that arose when I joined the person & household records...VOILA! I did keep the POVPIP range at 0:200, because that is the range specified for the Affordable Connectivity Plan (very to lifeline but focused on Internet access, and eligibility-wise POVPIP is the only difference that I'm aware of).
I keep the code for this on github here if you're interested: https://github.com/ILSR-GIS-DATA/Affordable-Connectivity-Program-Analysis.
For future reference, if you want to add in code (see dropdown options in pic below). Thanks again for brainstorming with me!
I just fixed your tidyverse code Here is the corrected code:
lifeline.eligible<-function(state="AL",vintage=2021,period=1) { library(tidyverse) library(tidycensus)
census_api_key(Sys.getenv("CENSUS_API_KEY")) pums.vars <- c("HINS4", "FS", "PAP", "SSIP", "POVPIP")
## note all.pums data.frame has no missing (NA) values.# -1 is used as a missing value for PAP SSIP and POVPIP# all.pums <- get_pums(variables = pums.vars, state = state, year = vintage, survey = paste0("acs",period));
## the following logic drops records with missing (-1) values# selects SPORDER==1 records# note this logic is incorrect but matches that used for the Lifeline eligiblity spreadsheet
AL.pums <- all.pums %>% #filter records that meet eligibility requirements with SPORDER==1 filter((SPORDER==1)&(HINS4 == 1 | FS == 1 | between(PAP,1,30000) | between(SSIP,1,30000)| between(POVPIP,0,135))) %>% summarize(hh.eligible = sum(WGTP)) #calculate total number of households eligible AL.pums;};
ADDED 3:05 PM EST
Working on this has gotten me interested in the ACP and Lifeline. I have a friend in DC who worked with the OMB and now works with the CBO. Apparently he has worked on the program and is familiar with it -- presumably on the Congressional budget side. In any case working with 501(c)(3) organizations is part of the mission/purpose of the Dorer Community Service Foundation. So if you need anymore help we can create a "Project" description and formalize a relationship between DCSF and the ILSR for pro bono consulting services. Email info@dorerfoundation.org if you want to proceed along this line.
The main mistake with you original code was using "subset" in instead of "filter" and using the sequence operator 1:30000. You need a logical AND with >=. (between) I've learned a lot about tidyverse and tidycensus. I tend to like to use the R "base" package as much as possible. You might also connect with the people at USAC and point out that their spreadsheet over estimates the participation rate by quite a bit.
Note get_pums uses the API to download the data so what are blank (missing) fields in the FTP files have a numeric value, in this case -1, as a placeholder for NA
Best,
Dave
UPDATE email is info@dorerfoundation.org Left off the .org in post earlier