Calculating a median

Hi All,

Occasionally I need a median income or median age value for either an aggregated area or for a population that only has the data available in a distribution table (like in attached screenshot). I created a VBA macro that will run in Excel and calculate the median based on a table range. In the example provided, it is for median household income. I've ran it for known/given median income values and the calculated numbers come up pretty close to those provided in the ACS. The smaller the "bins" (ranges), the better the calculated value. 

 

I have documented each step with comments in the VBA code. I'm not a VBA expert by any means, so there may be better ways to perform some of the steps, but this is works. this is something I've been needing myself for awhile and used to do it with a bunch of formulas in an Excel worksheet. I thought some of you may find it helpful so thought I would share it. I also have a Python version that reads in the data from a CSV file, so if you're interested in that let me know and I can provide it. 

Here is the VBA code:

Sub GetMedian()
    'have user select the range for the input data
    'the first column must be the number for the beginning of the range
    'for the bin
    'the second column (can have more than one if doing mulitple areas)
    'has the population for that bin
    Dim UserRange As Range
   
    Prompt = "Select the input range." & vbCrLf & _
         vbCrLf & "The first column must be the beginning of the " & _
         vbCrLf & "range for the bin. The second column has the  " & _
         vbCrLf & "population for the bin. You can have more than " & _
         vbCrLf & "one column of populations." & _
         vbCrLf
    Title = "Select a range"

    'Display the Input Box
    
    Set UserRange = Application.InputBox( _
        Prompt:=Prompt, _
        Title:=Title, _
        Default:=Selection.Address, _
        Type:=8) 'Range selection
    
    'get current selected range
    Dim myString As String
    'myString = Selection.Address
    myString = UserRange.Address
    'go though the columns
          
    'how many columns are in the UserRange?
    NoOfCol = Range(myString).Columns.count
    
    'get the output range from the user
    Dim OutRange As Range
                
    Prompt = "Select a cell for the output" & vbCrLf & _
    vbCrLf & "Values for multiple medians will be " & _
    vbCrLf & "returned in a row beginning with the " & _
    vbCrLf & "selected cell " & _
    vbCrLf
    '   Display the Input Box
    On Error Resume Next
    Set OutRange = Application.InputBox( _
        Prompt:=Prompt, _
        Title:=Title, _
        Default:=ActiveCell.Address, _
        Type:=8) 'Range selection
                
    'now get to work
   
    'for loop starts with the first population column (2nd in range)
    'end is how many pop columns are in the user input range
    
    For col = 2 To NoOfCol
        
        'set range as an array
        Dim myArray() As Variant
        myArray = Range(myString).Value
            
        'get sum of pop and divide by 2 to get halfpoint
        'initialize a total to aggregate values
        popSum = 0
        For i = 1 To UBound(myArray)
            popSum = popSum + myArray(i, col)
        Next
        'MsgBox popSum
        MedianIndex = popSum / 2
        'MsgBox MedianIndex
        'initialize for running total
        runtotal = 0
        
        'step through each row
        For i = 1 To UBound(myArray)
            
            'add the cumulative total pop
            runtotal = runtotal + myArray(i, col)
            'check if the runtotal exceeds the medianIndex
            If runtotal > MedianIndex Then
                'get the cumulative pop for the bin just before we exceeded
                'so subtract the current array value from the current runTotal
                prevTotal = runtotal - myArray(i, col)
                'determine how much into the bin it is to reach the medianIndex
                howMany = MedianIndex - prevTotal
                'get the pop value in the previous bin
                Binpop = myArray(i, col)
                'determine the pct of how far into this bin the medianIndex falls
                pctInto = howMany / Binpop
                
                'CORRECTED BLOCK
                'determine the span of the bin BinSpan
                'and multiply it by the pct into the bin
                BinSpan = (myArray(i + 1, 1) - myArray(i, 1))
                MultiSpan = BinSpan * pctInto
                
                'calculate the median by adding the result to
                'the number for the beginning of hte range
                Median = myArray(i, 1) + MultiSpan
                
                'put the median in the output cell
                OutRange.Offset(0, col - 2).Value = Median
               
                Exit For
            End If
        Next

    Next
           
    End Sub

  • I'm not a VBA person, but this is a valuable service. At some point in the future, we would like to support aggregation in CensusReporter.org so a python version would be great to see.
  • I tested this out with a few different data sets and seem to be getting calculated medians that are pretty different than those provided in the ACS. For example, when I used the distribution table for household income for the United States (2013-17 ACS 5-Year Estimates), the macro gave me a calculated median of $53,987, while the median household income provided in the ACS is $57,652 +/- $138.

    I ran the macro using the numbers provided for income in Arizona, as per the screenshot above, and was able to replicate the pictured results, so I'm pretty sure that the macro is working as it is intended to. Do you know why my calculations might be so off?

    Thank you,
    Bailey Werner

  • hmmmm.. thanks Bailey.. I think there may be an error in the script! I'm investigating it now. I thought I had tested this several times - so I'll post an update when I get to the bottom of it.
  • thanks for testing this, Bailey! I found the error and have corrected it in my original post! (see 'CORRECTED BLOCK note)

    Anyway, it was an error in calculating the span of the group. The span changes from $5,000 to $10,000 between the $45,000 to $49,999 (a 5,000 span) and $50,000 to $59,9999 (a 1,000 span). I had it taken the span of the group/bin before it. It was correct in my Python code, but wrong here.

    anyhow, I tested it with the U.S. data and it comes up much closer to the ACS provided median. The calculated median for the U.s. is $57,974 - only 322 difference. Still not within the MOE given, but much closer!
  • Hi Joe,  

    Below is the Python code I came up. It uses a CSV file as the input and assumes that the first row of the CSV are the field names so it skips that row.  Since I cannot attach a file here, I'll put the sample CSV file in a code block below the Python code. 

    Python Code:

    #-------------------------------------------------------------------------------
    # Filname:      calcmedian.py
    # Description:   This script reads data in "bins" (groups or
    #       ranges) and calculates a median. Sample data is
    #       population by 5-year age groups from the US Census
    #       American Community Survey. 
    #
    # Sample file AgeBins.csv: contains age group data for
    #       males in Maricopa County, Arizona from the
    #       2017 American Community Survey 5-year estimates
    #       See table B01002 for median age comparison/check
    #       
    #       First column contains the starting
    #       number for the range (i.e. 0-4 age group is 0;
    #       5-9 is 5; 10-14 is 10, etc)
    #       Second column is the population for that group       
    #
    # Author:       Jami Dennis
    # Created:      1/10/2019
    #-------------------------------------------------------------------------------
    
    from __future__ import division
    import csv
    
    # Function to read in the CSV containing bin data
    
    def read(filename):
        # initialize the directory
        agebins = {}
        with open(filename, 'U') as csvfile:
            reader = csv.reader(csvfile)
            next (csvfile) #skip first row headers
            # read in each row into the dictionary
            # make sure to set them as integers
            for row in reader:
                agebins[int(row[0])] = int(row[1])
                
                # print(row) # this just prints your inputs - used for testing
    
        return agebins
    
    # have the user enter the filename
    agebins = read(raw_input('Enter the filename: '))
    
    # intialize a running total to aggregate the values
    total = 0
    
    # determin the half way point simply by divding the total
    # population by 2 
    median_index = (sum(agebins.values()))/2
    
    keyList = sorted(agebins.keys())
    
    # initialize a dictionary for the cumulative values for each bin
    aggList = {0:0}
    
    # step through the sorted dictionary until the cumulative value
    # exceeds the mid point (median_index)
    for i,value in enumerate(keyList):
    
        # calculate the cumulative value
        total += agebins[value]
        
        # add cumulative value to aggList
        aggList[i]= total
    
        # check if the cumulative value has exceeded median_index
        if total > median_index:
            
            # get the cumulative pop for the bin just before
            # we exceeded the median_index using aggList[i-1]
            # and subtract that from the median_index to find out
            # how much much into the bin it is to reach the
            # median_index
            howMany = median_index - aggList[i-1]
            
            # get the population in the bin (ageBin) for
            # the keyList[i] bin (the one before we exceeded
            # the median_index
            ageBin= agebins[keyList[i]]
            
            # determine the percent of how far into this bin
            # the median_index falls
            pctInto = howMany / ageBin
            
            # determine the span of the bin using
            # keyList[i+1]-keyList[i]
            # and multiply it by the percent into the bin
            multiSpan = (keyList[i+1]-keyList[i]) * pctInto
    
            # cacluate the median by adding the result to
            # the number for the beginning of the range
            median = keyList[i] + multiSpan
            print 'the median is: ', "{:.1f}".format(median)
    
            # we found the median so break out of the for loop
            break
    
    # mission accomplished

    This is the sample CSV file. It is bins of Age data.

    BinStart,Amount
    0,142040
    5,146847
    10,148778
    15,144749
    20,146136
    25,155721
    30,147226
    35,135861
    40,137356
    45,133662
    50,130913
    55,118030
    60,103880
    65,91697
    70,66802
    75,47965
    80,30810
    85,26991
    

    Again, I'm no expert, so if anyone has ideas for improving these scripts, please feel free to share!