Hello,
I am using ACS PUMS data for 2019 and 2020 to develop some dashboards and am writing a manuscript about them. I found a few records for individuals aged 16-17 and for individuals with below high school level education that have very high incomes, and they show up as the highest mean incomes (wage income and total income). They are greater than $400,000 annually and cannot be correct. Should I remove these records before doing my analysis?
Thanks,
Anasua
I am working on a project using 2019 and 2020 one-year data to develop dashboards classified by occupation code and split into two groups, healthcare workers and non-healthcare workers. Three dashboards are being developed: (i) the first uses data on selected demographic characteristics, (ii) the second uses data on mean wage income and mean total income, and (iii) the third uses data on income and demographic characteristics for health workers. As part of this project I am also writing a manuscript to explain these dashboards, which will include some supporting tables. While working on the tables I came across a few of the highest mean incomes for 2019 and 2020 that are earned by people in the 16-17 age group, and a few by people who are older but have below high school level education. If we keep these records, they show up as the highest incomes, which cannot be correct. If we remove them, we end up removing every observation that matches whatever condition we use, for example, those with below high school education or aged 16-17 who earn more than $400,000. My question is how to address this issue. I am adding some of these observations here in a table below, but there are more.
May I ask the source of this PUMS data? These are row-level data that you've recoded, correct? You have not aggregated them, right?
I downloaded psum_pusa and psum_pusab from the website. And yes, these are recoded records, not aggregate data.
I'm still not 100% sure what your goal is here, but if you are using the PUMS as input data then I would leave it alone and not throw out these outliers, which is really what they are. I would not characterize them as "incorrect", as implausible as they seem.
On the other hand, if you feel you have to discard them, their weight would be so small in your aggregation that they probably wouldn't have a huge effect, since you're using the national files.
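If it helps, one way to check how much these records matter is to compute the weighted mean twice, once with every record and once with the flagged records dropped, and compare the two. Below is a minimal sketch of that comparison on made-up toy records, not real PUMS data. It assumes the usual PUMS column names (AGEP for age, SCHL for educational attainment, WAGP for wage income, PWGTP for the person weight) and assumes that SCHL codes of 15 and below mean less than a high school diploma; please check both against the data dictionary for your year. The function names and the $400,000 cutoff are only for illustration.

```python
# Toy records for illustration only; these are NOT real PUMS rows.
# Assumed columns: AGEP (age), SCHL (education code), WAGP (wage income),
# PWGTP (person weight). Verify against the PUMS data dictionary.
records = [
    {"AGEP": 45, "SCHL": 21, "WAGP": 60_000, "PWGTP": 100},
    {"AGEP": 17, "SCHL": 13, "WAGP": 450_000, "PWGTP": 10},
    {"AGEP": 52, "SCHL": 12, "WAGP": 500_000, "PWGTP": 10},
    {"AGEP": 30, "SCHL": 16, "WAGP": 40_000, "PWGTP": 80},
    {"AGEP": 16, "SCHL": 12, "WAGP": 3_000, "PWGTP": 50},
]


def weighted_mean(rows, value="WAGP", weight="PWGTP"):
    """Mean of `value` weighted by `weight`; NaN if the weights sum to zero."""
    total_weight = sum(r[weight] for r in rows)
    if total_weight == 0:
        return float("nan")
    return sum(r[value] * r[weight] for r in rows) / total_weight


def is_flagged(row, income_cutoff=400_000):
    """Hypothetical rule: aged 16-17 or below high school, with wages above the cutoff."""
    young = 16 <= row["AGEP"] <= 17
    below_high_school = row["SCHL"] <= 15  # assumption: codes 1-15 are below a diploma
    return (young or below_high_school) and row["WAGP"] > income_cutoff


kept = [r for r in records if not is_flagged(r)]
mean_all = weighted_mean(records)
mean_kept = weighted_mean(kept)

print(f"records flagged: {len(records) - len(kept)} of {len(records)}")
print(f"weighted mean wage, all records:  {mean_all:,.0f}")
print(f"weighted mean wage, flagged removed: {mean_kept:,.0f}")
```

With only five toy rows the two means differ a lot, so don't read anything into the numbers here. The point is to run the same comparison on the full national file, within each occupation group you report, and see whether the difference is large enough to matter for your tables.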