If a SUMMARY data action is obtained across the levels of another variable/field, then the summary process includes a GROUP BY action as well. The GROUP BY action simply implies that a summary will be done across the levels of one or more fields.
Example 2.3C.1 For this example, we will again consider the Small Area Health Insurance Estimates (SAHIE) data that was used in Module 2.3A. A subset of records from the original data table is considered in this module.
Source: https://www.census.gov/data/datasets/time-series/demo/sahie/estimates-acs.html
Google Folder: Link to Data
List of Fields
Goal: Obtain the overall percent of uninsured people by race.
Race | % Uninsured |
1 (White) | 7.52% |
2 (Black) | 11.38% |
3 (Hispanic) | 19.12% |
Data Processing Steps
There are two basic methods of obtaining a summary measure using a spreadsheet – using a formula or using a Pivot Table. The Pivot Table approach is much easier than the formula based approach and thus will be discussed first.
Pivot Table Approach
GROUP BY with Pivot Tables |
Source: https://youtu.be/hmq-v3FURLI |
The following is the Pivot Table summary for the % Uninsured by race (1:White, 2:Black, and 3:Hispanic)
The Pivot Table above was constructed using the following setup. The Filters box is used to exclude the racecat = 0 from consideration in the summary.
Setup of Pivot Table | |
Rows | Columns |
![]() |
![]() |
Values | Filters |
![]() |
![]() |
Comments:
FORMULA Approach
The GROUP BY action for a summary can be done using a formula-based approach as well in a spreadsheet. There is a collection of SUMIF(), AVGIF(), COUNTIF(), etc. that can be used when a single condition for the summary is needed. In addition, there is a second collection of SUMIFS(), AVGIFS(), COUNTIFS(), etc. that can be used when there are one or more conditions for the summary. The SUMIFS() function is used here to obtain the percent uninsured across race.
Summary via Formulas Google Sheets |
Source: https://youtu.be/boRMpWIO1Fk |
The use of named ranges in suggested when using formulas. The following is a list of the named ranges for this example.
The following snip-it shows the formulas used to obtain the % Uninsured for each race category.
The desired output is shown here.
The following snip-it from the spreadsheet shows one how to add additional conditions to the =SUMIFS() function. Here, the desired summaries are to be obtained by race and across medicare expansion (i.e. has the state adopted medicare expansion?).
The summary table showing the % Uninsured across race and medicare expansion.
The =QUERY() function that was used for a SUMMARY in Module 2.3B can be adapted to include both a GROUP BY and/or PIVOT. Unfortunately, there is not an analogs feature in Excel - Power Query as it is not setup to easily compute summaries on a data table; however, once the data table is loaded into an Excel sheet, a Pivot Table can be used to obtain the desired summaries.
GROUP BY - QUERY |
Source: https://youtu.be/_hG_0a7Nx1k |
The =QUERY() Statement
The following query statement can be used to obtain the percent uninsured by race. The named ranged for the data table in Google Sheets is SAHIE. The desired summary table is displayed – the row/column headers on the summary table were edited to be more informative.
Note: The F != 0 is used to FILTER exclude racecat = 0 (all races) from being included in this summary table.
=QUERY( SAHIE, " SELECT F, SUM(K) / SUM(J) WHERE F != 0 GROUP BY F")
The =QUERY() function can easily be modified to obtain the percent uninsured across race and across medicare expansion. The desired summary table is displayed – again the row/column headers were edited to be more informative.
=QUERY( SAHIE, " SELECT D, F, SUM(K) / SUM(J) WHERE F != 0 GROUP BY D, F")
The =QUERY() function can be modified to include a PIVOT action. Here, Medicare_Expansion is being used to pivot; thus, the summaries will be displayed as columns in the output table (not as rows). The desired summary table is displayed here – the row/column headers were edited to be more informative.
=QUERY(SAHIE, “SELECT F, SUM(J) / SUM(I) WHERE F != 0 GROUP BY F PIVOT D”)
A Python script can be written to obtain the percent uninsured across race. The following video shows you how to obtain the desired summary table in Python.
Source: https://youtu.be/quECW1Aldgw
The Python code for this example was written using the pandas and dfply packages. The desired summaries can be obtained using the following Python code.
To obtain the percent uninsured across race and across medicare expansion, the group_by() function will include both fields and the spread() function can be used to pivot on medicare expansion so the percent uninsured appears in two separate columns.
You are able to obtain your own copy of this Python code. To obtain a copy of this file, click the Python Code link, and select Open in Colab.
Python Code: https://github.com/christophermalone/HLA311/blob/main/Module2_Part3C_Advanced.ipynb
Example 2.3C.2: This task will again use the Home Health Care Agencies data that was used for the task in Module 2.3B. The data includes many fields, but a limited set of fields will be considered here.
Source: https://data.cms.gov/provider-data/dataset/6jpm-sxkc
Google Folder: Link to Data
Answer the following questions for this task.
1. Obtain a summary table that includes the average patient care star rating across Ownership type. Include a screen-shot of your summary table.
TRUE | FALSE | |
2. The home health care agencies that are Proprietary (i.e. For Profit) had the highest average patient care star rating. |