I am working on refining and publicizing the Yost Index, which is a composite index incorporating information from 7 ACS tables (poverty, education, home value, income, employment, mortgage, and rent). It has been around since the early 2000s and does a good job of reducing the complexity of SES information in health studies where SES is an important confounder but not the primary focus of inquiry. It has been calculated for the nation and by state, at the block group and census tract level, and for a variety of years.
I have recently been asked to compute the margin of error for the index. That was not the request exactly - it was more in the form of a challenge: "aren't the MOEs for your variables so massive that what you are doing is not workable?" I disagree - for many of these variables the MOEs are typically around 10-20% of the estimate. One way to respond is to compute the MOEs and let the users decide if they are massive or not. However, I am not sure of the best way to do this across 7 tables. It seems that the errors would not be independent - for a given census tract, if income is underestimated, then poverty would be likely to be overestimated.
The ACS Handbook provides a set of formulas you can use to calculate MOEs. (See Chapter 8: https://www.census.gov/content/dam/Census/library/publications/2020/acs/acs_general_handbook_2020_ch08.pdf)
The only challenge is that, as you described above, the formulas do not take into account whether the errors in the components covary, so they will likely misstate the error when the errors are correlated (over or under, depending on the sign of the covariance). To the best of my knowledge, no one has solved that "problem" yet (though if someone has, hopefully they'll post here!)
(To answer that question you could use PUMS microdata to produce the index and its error estimates for larger areas, compare that with the MOE obtained by aggregating tract data with the formulas, and analyze the magnitude of the over- or underestimation of the MOE.)
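For reference, here is a minimal sketch of the kind of approximation formulas in that chapter (sum, ratio, and proportion), written from the commonly published versions. The function names are my own, and every formula here treats the component errors as independent, which is exactly the limitation discussed above.

```python
import math

def moe_sum(moes):
    """MOE of a sum (or difference) of estimates, assuming independent errors."""
    return math.sqrt(sum(m ** 2 for m in moes))

def moe_ratio(num, den, moe_num, moe_den):
    """MOE of num/den when the numerator is NOT a subset of the denominator."""
    r = num / den
    return math.sqrt(moe_num ** 2 + (r ** 2) * (moe_den ** 2)) / den

def moe_proportion(num, den, moe_num, moe_den):
    """MOE of num/den when the numerator IS a subset of the denominator."""
    p = num / den
    under_root = moe_num ** 2 - (p ** 2) * (moe_den ** 2)
    if under_root < 0:
        # The proportion formula breaks down here; fall back to the ratio formula.
        return moe_ratio(num, den, moe_num, moe_den)
    return math.sqrt(under_root) / den
```

Chaining these (sums, then ratios, then sums again) compounds the independence assumption at every step.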
It wouldn’t work for all summary levels, but the Census Bureau publishes select tables as variance replicate tables (https://www.census.gov/programs-surveys/acs/data/variance-tables.html). In each table you’ll find 80 replicate columns, each holding the estimate computed with one set of replicate weights. From these tables, you could calculate 80 versions of the index (assuming all the components are available) and use the successive difference replication formula to estimate the standard error.
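A rough sketch of that calculation, assuming you have already pulled the full-sample estimate and the 80 replicate estimates for each component into arrays. The function names and the `index_fn` argument are hypothetical placeholders for however the index is actually computed.

```python
import numpy as np

def sdr_standard_error(estimate, replicates):
    """Successive difference replication SE: sqrt(4/80 * sum((rep - est)^2))."""
    reps = np.asarray(replicates, dtype=float)
    if reps.shape[-1] != 80:
        raise ValueError("expected 80 replicate estimates")
    return np.sqrt(4.0 / 80.0 * np.sum((reps - estimate) ** 2, axis=-1))

def index_se(component_estimates, component_replicates, index_fn):
    """Compute the index once per replicate, then apply the SDR formula.

    component_estimates: length-k vector of full-sample estimates.
    component_replicates: array of shape (k, 80), one column per replicate.
    index_fn: hypothetical function mapping a length-k vector to one index value.
    """
    reps = np.asarray(component_replicates, dtype=float)
    est = index_fn(np.asarray(component_estimates, dtype=float))
    rep_index = np.array([index_fn(reps[:, r]) for r in range(reps.shape[1])])
    return est, sdr_standard_error(est, rep_index)
```

Multiplying the standard error by 1.645 gives a 90 percent MOE, the level ACS publishes. Because every component is recomputed within the same replicate, any covariance between components carries through to the index automatically.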
Thanks, I'll start with this (assuming all components are available)
Thanks for your suggestion - it seems like that would make for a good paper on its own.
Unfortunately, only 3 of the 7 Yost index components are available.
That's a shame. Out of curiosity, which measures were not available?
I ask because while it may not be possible to generate the "true" Yost index, a close approximation may be feasible:
For instance, if median income is not directly available, you could substitute mean income or calculate the median from the distribution table (ACS medians are linear interpolations anyways, not true medians).
And the SE on the approximation may be closer to the true SE than other methods.
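To illustrate the second option, here is a minimal sketch of a linearly interpolated median from a binned distribution. It is a simplification under my own assumptions: the bin edges are supplied by the user (including an assumed upper edge for any open-ended top bin), and the Census Bureau's actual procedure may differ in its details.

```python
def interpolated_median(bin_edges, counts):
    """Median of binned data by linear interpolation within the median bin.

    bin_edges: ascending edges, one more than the number of bins.
    counts: number of units (e.g. households) in each bin.
    """
    if len(bin_edges) != len(counts) + 1:
        raise ValueError("need exactly one more edge than bins")
    total = sum(counts)
    if total <= 0:
        raise ValueError("no observations")
    half = total / 2.0
    cumulative = 0.0
    for lower, upper, n in zip(bin_edges[:-1], bin_edges[1:], counts):
        if n > 0 and cumulative + n >= half:
            # Assume units are spread evenly across the bin.
            return lower + (half - cumulative) / n * (upper - lower)
        cumulative += n
    raise RuntimeError("median bin not found")
```

Applied to each of the 80 replicate columns of a distribution table, this would give 80 approximate medians to feed into the replicate-based standard error.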
I have the education, occupation, and employment variables. I am missing median home value, median rent, and median income - but as you say, these could be developed. I have the wrong poverty value, but again it could be approximated.
However, since I can calculate the covariance structure for these 7 variables myself (they all range from moderate to strong), couldn't I just simulate replications myself that would have the correct mean and standard deviation and covariance structure? (In fact, I went ahead and did this, and the results look ok).
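A minimal sketch of what such a simulation could look like, not the code actually used in this thread. It assumes the sampling errors are multivariate normal and that the supplied correlation matrix describes the correlation of those errors (which is not necessarily the same as the correlation of the estimates across tracts). The names `simulate_index_moe` and `index_fn` are hypothetical.

```python
import numpy as np

def simulate_index_moe(means, ses, corr, index_fn, n_draws=2000, seed=0):
    """Simulate correlated replicates and return (SE, 90% MOE) of the index.

    means: component estimates for one area.
    ses: component standard errors (published MOE / 1.645).
    corr: assumed correlation matrix of the sampling errors.
    index_fn: hypothetical function mapping a component vector to an index value.
    """
    means = np.asarray(means, dtype=float)
    ses = np.asarray(ses, dtype=float)
    corr = np.asarray(corr, dtype=float)
    cov = corr * np.outer(ses, ses)  # covariance from SEs and correlations
    rng = np.random.default_rng(seed)
    draws = rng.multivariate_normal(means, cov, size=n_draws)
    index_draws = np.apply_along_axis(index_fn, 1, draws)
    se = index_draws.std(ddof=1)
    return se, 1.645 * se
```

With an identity correlation matrix this should roughly reproduce the independent-errors formulas, so comparing the two runs gives a sense of how much the covariance matters.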
That works too.
I know it's been four months, but I'm just getting back to this topic in the past week. I successfully used your PUMS suggestion, and the results suggest that the Yost index of SES is fairly stable for large areas, as I expected. Calculating MOEs for smaller areas using the census method is proving to be much more of a challenge. Some of the derived variables I am generating are built from dozens of ACS variables, some of which have few observations and high MOEs that seem to drive the end result, plus they are not independent (e.g., high counts for one higher education variable correlate with high counts for other higher education variables). I am ending up with uninformatively high MOEs for my derived variables (that is, MOEs that span beyond all possible values in both directions). I'm open to any suggestions for a next step!
How are you combining the several components to build the index? Is it equally weighted?
I ask because if each component is equally weighted (the simple mean of scaled component scores), then you could be putting too much weight on the less precise variables.
So, if your index = 0.25 * A + 0.25 * B + 0.25 * C + 0.25 * D but let's say D is less precise because it's based on fewer observations, you could instead weight the components differently, to put greater weight on the more precisely measured components: index = 0.3 * A + 0.3 * B + 0.3 * C + 0.1 * D.
There are various methods for choosing the weights, but one method is to take the inverse of the variance of each component and divide it by the sum of those inverse variances, so that the final weights sum to 1:
Preliminary weights: pwA = 1/Var(A), pwB = 1/Var(B), ...
Final weights: fwA = pwA / (pwA + pwB + pwC + pwD), and likewise for B, C, and D
This method will put more weight on index components with small variance and less weight on index components with large variances.
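As a small sketch of the inverse-variance weighting described above (the function name is my own):

```python
import numpy as np

def inverse_variance_weights(variances):
    """Weights proportional to 1/variance, normalized to sum to 1."""
    v = np.asarray(variances, dtype=float)
    if np.any(v <= 0):
        raise ValueError("variances must be positive")
    prelim = 1.0 / v              # preliminary weights: pw = 1 / Var
    return prelim / prelim.sum()  # final weights: fw = pw / sum(pw)
```

For example, if A, B, and C each have variance 1 and D has variance 3, the weights come out to approximately 0.3, 0.3, 0.3, and 0.1, matching the illustration above.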
Well, let's take the education element in the index, since that's the most complicated. You need "number attending college" as an intermediate variable, so this is the sum of those with some college, those with associate's degree, those with bachelor's degree, those with graduate degree, etc. For many tracts many of these individual components are 0 plus or minus 12.
Since I am working with counts, they are implicitly weighted.
By the time the end result is computed, there are three chains of sums like this (all with the associated MOE calculations), each of which is then converted to a ratio (more MOE calculations), then these are summed (more MOE calculations), leaving me with every tract having an average of something like 13 plus or minus 6 years of education, even though this variable is designed to have a fixed range of 9 to 16.
Hi Frank. Which MOE calculation are you using - the formulas or the variance replicate tables (with your additions)? I'm curious to know if you also get the high MOEs using variance replicates.
Adding zero estimates is a special case and needs to be handled differently: It may seem counterintuitive, but if group A has a count of 100 +/- 10 and group B is 0 +/- 12, then the combined A+B is still 100 +/- 10. (Maybe someone from Census could confirm this?).
Basically, if A>0 and B=0 then MOE(A+B) = MOE(A)
If A=0 and B=0 then MOE(A+B) = MOE(A) = MOE(B) (These should be the same)
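Here is a sketch of that rule exactly as stated in this post. It has not been confirmed here, and published Census guidance may treat zero estimates differently, so treat the function (whose name is my own) as an illustration of the idea only.

```python
import math

def moe_sum_skip_zeros(estimates, moes):
    """MOE of a sum under the rule described above (an unconfirmed assumption).

    Zero estimates contribute nothing when any nonzero estimate is present;
    if every estimate is zero, a single zero-estimate MOE is used.
    """
    nonzero = [m for e, m in zip(estimates, moes) if e != 0]
    zero = [m for e, m in zip(estimates, moes) if e == 0]
    if nonzero:
        return math.sqrt(sum(m ** 2 for m in nonzero))
    # All estimates are zero: use one zero MOE (the largest, if they differ).
    return max(zero) if zero else 0.0
```

Comparing this against the plain root-sum-of-squares result for a tract with many zero cells shows how much of the final MOE is coming from the zeros alone.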
That is interesting. Maybe it would be useful to go through a worked example of one of these to show what is happening. I will post this as a fresh thread when I do.
I am talking about the formulas. I am wary of the variance replicate tables because of the high covariance between all my variables. If a replicate is high for income, it should be low (or very likely to be low) for poverty, etc.