Proposed changes to the ACS Summary File format

The ACS Office at the Census Bureau is currently testing a new format for the ACS Summary File, which is a comma-delimited text file that contains all the Detailed Tables for the ACS.  

Information about the proposed updates to the ACS Summary File is described on the Census Bureau's website. 

We are starting this new Discussion Thread so that ACS data users can post any comments or questions about the proposed changes. ACS Summary File users are also encouraged to participate in the webinar scheduled for this afternoon on this topic.

  • I am conflicted about some of the proposed changes. We import the data into our SQL database and normally import only 4 areas (US, TX, NM, AR), so the proposed changes would require us to process an extraordinarily large number of records that we do not use. If there are over 500K geo entries (approx. 280K non-Tract/Block Group), we would process an estimated 300M records to retrieve about 60M records (*see calculation comment below). For those who use the entire set this is not an issue, but for those of us who use 4 areas or fewer it does have an impact. Having the state-level files is a great service that you provide, and I absolutely understand the painstaking process required to generate all the files, but it seems to me that that process should not fall on each of the data users who do not use the entire set.
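For anyone in the same situation, here is a rough sketch of how a national file could be streamed and cut down to a few areas without loading all of it. It is only an illustration under assumptions: the column name `GEO_ID`, the comma delimiter, the sample rows, and the list of state-nested summary levels are mine, not taken from the proposed format, and the state test only makes sense for summary levels whose code after "US" begins with the state FIPS code.

```python
import csv
import io

# Assumed FIPS codes for the areas mentioned above: TX=48, NM=35, AR=05.
STATE_FIPS = {"48", "35", "05"}
# Illustrative subset of summary levels that nest within a state
# (state, county, county subdivision, tract, block group, place).
STATE_NESTED_LEVELS = {"040", "050", "060", "140", "150", "160"}


def keep_geoid(geoid, states=STATE_FIPS):
    """Return True for the nation row or a state-nested row in `states`."""
    level = geoid[:3]
    _, _, code = geoid.partition("US")
    if level == "010":  # national record
        return True
    return level in STATE_NESTED_LEVELS and code[:2] in states


def filter_rows(lines, geoid_column="GEO_ID", delimiter=","):
    """Yield only the wanted rows, reading one row at a time."""
    reader = csv.DictReader(lines, delimiter=delimiter)
    for row in reader:
        if keep_geoid(row[geoid_column]):
            yield row


# Made-up sample rows, for illustration only.
sample = io.StringIO(
    "GEO_ID,B01001_E001\n"
    "0100000US,1\n"
    "0400000US48,2\n"
    "0400000US06,3\n"
    "0500000US35001,4\n"
    "31000US48620,5\n"
)
kept = [row["GEO_ID"] for row in filter_rows(sample)]
print(kept)
```

Every row still has to be read once, so this reduces what gets loaded into the database rather than the amount of data that has to be downloaded and scanned.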

    What I do like is the addition of column headers to the files and the single GEO file. As for the GEO file, it would be nice to have the Land/Water area and LSAD code columns added. I also noticed in the example files that most of the columns in the geo file are no longer zero-padded: for example, summary levels are shown as 10, 50, 150 and not 010, 050, 150. The same goes for all area identifiers; for example, counties are shown as 1, 3, 5 as opposed to 001, 003, 005.
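If the padding does get dropped, it can be restored on import. A minimal sketch, assuming the codes arrive as digit-only strings or integers and that the target widths are known (3 for summary level and county in the examples above); `pad_code` is a hypothetical helper name:

```python
def pad_code(value, width):
    """Left-pad a numeric code with zeros to a fixed width."""
    return str(value).strip().zfill(width)


# Summary levels and county codes from the examples above.
summary_levels = [pad_code(v, 3) for v in ("10", "50", "150")]
counties = [pad_code(v, 3) for v in (1, 3, 5)]
print(summary_levels)  # ['010', '050', '150']
print(counties)        # ['001', '003', '005']
```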

    If you decide to move forward with a single geo file, is there any reason why LOGRECNO could not run across all states rather than being reset for every state? That way LOGRECNO could be the unique identifier for joining geo files with data files, instead of the GEOID, which is a variable alphanumeric value of up to 19 characters. For us, joining tables on LOGRECNO is much more efficient.

    For those who use databases, the new file structure adds another layer of complexity because some of the data files now contain more than 1,100 columns. In SQL Server the limit for regular (non-sparse) columns is 1,024 per table, and I am not sure, but I believe Oracle has a limit of 1,000 columns per table. Just putting that out there for those who import the data into a database.
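One possible workaround is to split a wide file into several narrower tables that each repeat the key column. The sketch below only plans the column groups; the column names, the choice of `GEO_ID` as the key, and the 1,024 limit passed in are assumptions for illustration, and `split_columns` is a hypothetical helper.

```python
def split_columns(header, key_columns, max_columns=1024):
    """Group columns so each group (keys included) fits within max_columns."""
    keys = list(key_columns)
    data = [c for c in header if c not in keys]
    per_group = max_columns - len(keys)
    if per_group < 1:
        raise ValueError("max_columns must exceed the number of key columns")
    return [keys + data[i:i + per_group] for i in range(0, len(data), per_group)]


# Made-up header: one key column plus 1,100 data columns.
header = ["GEO_ID"] + [f"C{i}" for i in range(1100)]
groups = split_columns(header, ["GEO_ID"])
print([len(g) for g in groups])  # [1024, 78]
```

Each group can then be loaded into its own table and joined back on the key when needed.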

    As for the column names, as others have mentioned, I would prefer:
    B01001_e001, B01001_m001, B01001_e002, B01001_m002 … or
    eB01001_001, mB01001_001, eB01001_002, mB01001_002 …

    *Record Calculation Estimate: Since the number of records varies by file, I took the number of non-Tract/Block Group areas as the most common set of rows and multiplied it by the 1,100 tables being produced.
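Spelled out, the estimate is just the two approximate figures from the post multiplied together. The function name is hypothetical and the inputs are the rough counts quoted above, not exact values:

```python
def estimated_records(geo_rows, tables):
    """Rough record count: one row per geography in each table."""
    return geo_rows * tables


total = estimated_records(280_000, 1_100)
print(f"{total:,}")  # 308,000,000, i.e. roughly 300M
```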
