Starting in 2021, hospitals were required to post their prices online. But unlike the newer Transparency in Coverage regulations that cover payers, CMS did not specify a standard file format for the provider price transparency machine-readable files (MRFs). Without a standard file format to follow, every hospital posted its prices in its own custom format.
Hospitals are required to post a machine-readable file that displays all “list prices” (a.k.a. chargemaster rates, charge description master, CDM rates), negotiated rates with commercial insurance (a.k.a. contracted rates), and cash rates (a.k.a. self-pay, non-insured rates, discounted cash price).
On the one hand, the lack of a strict standard offers flexibility in the choice of file format and structure, and it allows providers to post information beyond the regulatory minimums. On the other hand, that flexibility makes it difficult for healthcare purchasers to compare prices, and it creates some uncertainty as to whether hospitals are following the rule correctly. This is one of the reasons we created the Machine-Readable File Transparency Scorecard: to create a data quality feedback loop.
Lots of thinking went into developing our scorecard methodology (which you can read here) but we still often get questions as to how one might begin to parse, think about, and make sense of raw price transparency data. In the rest of this blog, we hope to pull back the curtain on how we parse hospital pricing data and illustrate some of the challenges of doing it at scale. This is called Type 2 Fun: it hurts when it’s happening but it’s satisfying in hindsight.
First, we need to acquire the pricing data. Thanks to the Hospital Price Transparency Rule, providers are required to post files publicly on their websites. The files are usually available under a section of the site that is titled something like “Price Transparency,” “Billing,” or ”Financial Services.” For a one-off file, you can typically use a search engine and search for keywords like “<hospital name> price transparency” and find the data set without too many clicks. Sometimes, the data sets may be trickier to obtain and sort out. For example, the hospital may redirect to a vendor URL or a separate point-and-click tool with a data download button. Why? Well, because they can, so they will.
For example, here is Northwestern Medicine’s price transparency page, which links to pricing data for several hospitals together on one page.
Here we can see there is a “standard charges” CSV file (that includes negotiated rates) and a chargemaster XLSX file for each hospital in the health system. Now that we have our file downloaded, we need to open it. What you do next heavily depends on the file type and subsequent data format.
Data Format and File Types
Hospital pricing files are typically published in CSV, XLSX/XLS, XML, or JSON file formats. They’re sometimes called “MRFs” or machine-readable files.
- CSV and XLSX files are conventional, tabular data consisting of rows and columns. You can open these files with any spreadsheet application, like Excel.
- XML and JSON formats can support tabular or hierarchical data by storing lists of elements with fixed or variable lengths. They are hierarchical in the sense that they can store nested data structures. You can open these files with a text editor, like Notepad++ or BBEdit, but the structure of the data in each file will vary between hospitals.
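Before doing anything hospital-specific, it helps to have one entry point that at least gets the bytes into a usable Python object. Here is a minimal sketch (the function name and the assumption that the file extension reflects the actual format are ours, not from any standard):

```python
import csv
import json
import xml.etree.ElementTree as ET
from pathlib import Path

def load_mrf(path):
    """Load a machine-readable file based on its extension.

    Returns parsed content; the caller still has to interpret the
    hospital-specific structure inside.
    """
    suffix = Path(path).suffix.lower()
    if suffix == ".csv":
        # utf-8-sig tolerates the BOM that some exports include
        with open(path, newline="", encoding="utf-8-sig") as f:
            return list(csv.reader(f))       # list of row lists
    if suffix == ".json":
        with open(path, encoding="utf-8") as f:
            return json.load(f)              # nested dicts/lists
    if suffix == ".xml":
        return ET.parse(path).getroot()      # element tree root
    raise ValueError(f"Unsupported file type: {suffix}")
```

In practice you would add XLSX support (e.g., via openpyxl or pandas) and guard against extensions that lie about the contents.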
Here are a few examples of hospital pricing files in Excel, JSON, and XML formats.
This is a relatively complicated Excel file at first glance, but it is dense with useful pricing data. The first tab, “Contract Mapping”, maps contract ID numbers to insurance carrier plan names. We need this crosswalk because the subsequent tabs use contract IDs in the column names rather than human-readable plan names. The original file was a large XML file (214 MB) that we can convert by opening in Excel and re-saving as an XLSX for easier viewing; this also results in a smaller file size (17.8 MB). In the Excel file, on separate worksheets, we can see facility outpatient, professional, inpatient MS-DRG, inpatient APR-DRG, and APC (Ambulatory Payment Classification) rates. The “Procedure” [code] column contains either standard medical codes (like HCPCS or CPT procedure codes) or internal hospital-specific CDM (Charge Description Master) codes. The chargemaster rates are listed in the “Price” and “IP Price” columns. Chargemaster rates are less commonly billed, except for out-of-network or percent-of-charge pricing. For self-pay/uninsured patients, many hospitals offer a reduced discounted cash price; unfortunately, that is not always the case. This health system does supply a discounted cash price.
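The “Contract Mapping” crosswalk step can be sketched in a few lines. The contract IDs and plan names below are hypothetical, and the column-renaming helper is our own illustration rather than anything from the hospital’s file:

```python
# Hypothetical crosswalk, as read from a "Contract Mapping" tab:
# contract ID -> human-readable carrier/plan name.
CONTRACT_MAP = {
    "C001": "Aetna PPO",
    "C002": "BCBS HMO",
}

def rename_rate_columns(row, contract_map):
    """Replace contract-ID column names with plan names in one record.

    Columns not found in the crosswalk (e.g., 'Procedure') pass through.
    """
    return {contract_map.get(col, col): val for col, val in row.items()}

row = {"Procedure": "99213", "C001": 125.0, "C002": 98.5}
renamed = rename_rate_columns(row, CONTRACT_MAP)
```

The same idea works at the dataframe level by passing the crosswalk to pandas’ `DataFrame.rename(columns=...)`.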
After adding indentation to this JSON file, we can see that each set of curly braces (a price “object”) is annotated with the relevant service codes, discounted cash rates, gross charges, and contracted payer amounts.
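Pulling those nested price objects into flat rows is a short exercise with the standard library. The JSON below is a tiny hypothetical example in the spirit of the file pictured above; real hospitals use different field names and nesting:

```python
import json

# Hypothetical shape of one price "object"; field names vary by hospital.
raw = """
[
  {
    "code": "99213",
    "gross_charge": 350.0,
    "discounted_cash": 180.0,
    "payers": [
      {"payer": "Aetna PPO", "rate": 125.0},
      {"payer": "BCBS HMO", "rate": 98.5}
    ]
  }
]
"""

def flatten(objects):
    """Unnest each price object into tall (code, payer, rate) rows."""
    rows = []
    for obj in objects:
        code = obj["code"]
        rows.append((code, "List Price", obj["gross_charge"]))
        rows.append((code, "Cash Rate", obj["discounted_cash"]))
        for p in obj["payers"]:
            rows.append((code, p["payer"], p["rate"]))
    return rows

rows = flatten(json.loads(raw))
```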
In this XML file, we have three sections: ‘HCPCS/CPT®’, ‘MS-DRG’, and ‘Revenue Codes’. We’ve only pictured the first element of the ‘HCPCS/CPT®’ section. Each rate is an “Item” element. Within each item, the gross charges and cash rates sit at the header level, and there is an array of contracted rates in the nested ‘Contract’ section. Luckily, in this case, the ‘MS-DRG’ and ‘Revenue Codes’ sections further down in the file follow a similar pattern.
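Walking that “Item” pattern with `xml.etree.ElementTree` might look like the sketch below. The element and attribute names are illustrative stand-ins for whatever the hospital actually used:

```python
import xml.etree.ElementTree as ET

# Hypothetical XML mimicking the "Item" pattern described above.
doc = """
<StandardCharges>
  <Item Code="99213" GrossCharge="350.00" CashRate="180.00">
    <Contract Payer="Aetna PPO" Rate="125.00"/>
    <Contract Payer="BCBS HMO" Rate="98.50"/>
  </Item>
</StandardCharges>
"""

def parse_items(xml_text):
    """Emit tall (code, payer, rate) rows from Item elements."""
    rows = []
    root = ET.fromstring(xml_text)
    for item in root.iter("Item"):
        code = item.get("Code")
        # header-level rates
        rows.append((code, "List Price", float(item.get("GrossCharge"))))
        rows.append((code, "Cash Rate", float(item.get("CashRate"))))
        # nested contracted rates
        for contract in item.findall("Contract"):
            rows.append((code, contract.get("Payer"),
                         float(contract.get("Rate"))))
    return rows

rows = parse_items(doc)
```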
Now, what do you do with all this data?
Defining the Target Table Layout (or Common Format Schema)
We have just seen a few pricing files. To make this data comparable, we should define a common format. Once we define the target layout, we can parse and transform the file to our common format. Woo!
Say we want to go from a wide, pivoted view like this:
To a tall, unpivoted view for our common format:
Imagine that you have the machine-readable rate files for 50 hospitals and want to compare their data. These files come in various file types (CSV, XLSX, JSON, XML), with varying structures, levels of granularity, and table layouts; field/column names may not match across files, or even within the same health system. For example, an Excel workbook might store pricing data across multiple worksheets, with one sheet for each payer; or perhaps chargemaster rates are on one tab and commercial rates are stored on one or more other tabs. Commercial payer rates might be stored as columns or rows (wide vs. tall). XML or JSON files can vary with nested data structures, and providers can create multiple sections within a single file. A relatively small or medium data set can become a big data problem once you want to analyze the data in aggregate. A single file might be too large for a single computer’s memory, so we might need to stream the data to avoid loading it all at once. If you happen to have WOPR, that computer from War Games, in your house, I recommend powering it up, ASAP.
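The wide-to-tall unpivot from the figures above can be sketched without any libraries at all. The column names here are hypothetical, and which columns count as identifiers vs. payer columns is a choice you make per file:

```python
# Hypothetical wide row: one column per payer, rates in the cells.
wide_rows = [
    {"code": "99213", "description": "Office visit",
     "Aetna PPO": 125.0, "BCBS HMO": 98.5},
]
ID_COLUMNS = {"code", "description"}  # everything else is a payer column

def unpivot(rows, id_columns):
    """Melt payer columns into one tall row per (service, payer) pair."""
    tall = []
    for row in rows:
        ids = {k: v for k, v in row.items() if k in id_columns}
        for col, val in row.items():
            if col not in id_columns:
                tall.append({**ids, "payer": col, "rate": val})
    return tall

tall = unpivot(wide_rows, ID_COLUMNS)
```

In pandas, this whole function collapses to `DataFrame.melt(id_vars=..., var_name="payer", value_name="rate")`.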
Now that we have a target table layout, we can transform the source data. Let’s walk through the pseudocode in a somewhat simplified example. Pseudocode is a plain-language description of a procedure, as opposed to a program, which must be syntactically correct.
Here is our source table, which again, came from a CSV file.
The pseudocode for our parser:
- Define our target column names:
  - `source_name` - the name of this source file
  - `code` - the service code
  - `code_type` - the type of service code for billing purposes, e.g., `HCPCS/CPT` for procedure codes, `MS-DRG` for inpatient encounters
  - `description` - the service description
    - we can report the original or use a cleaned-up reference data set, depending on what we want
  - `rev_code` - the revenue code, if supplied
    - arguably, this could also go in the `code` field with its own `code_type`
    - revenue codes are often reported together with other codes
  - `payer` - the commercial insurance carrier (a.k.a. payer) that has a contracted rate; this field can also be `List Price` (gross charges) or `Cash Rate` (a.k.a. self-pay or discounted cash price), so we don’t repeat the rate for each service observation
  - `rate` - a currency value formatted as a numeric float (e.g., 230.84)
- Fill blanks with the empty string ''
  - We need a common way to represent empty values.
- Identify “payer” vs. “non-payer” columns.
  - Non-payer columns: `De-identified Minimum Negotiated Charge`, `De-identified Maximum Negotiated Charge`
  - Payer columns: everything else; `Gross Charge`, `Discounted Cash Price`, and all of the `Negotiated Charge:*` columns
- For each row,
  - set `code` to the value of the source code column
    - if the substring 'MS-DRG' is in the value, set `code_type` to `MS-DRG`
    - else if the substring 'HCPCS' or 'CPT' is in the value, set `code_type` to `HCPCS/CPT`
    - else, set `code_type` to the empty string ''
  - clean up the code value so only the code is stored (exclude annotations)
    - e.g., transform `MS-DRG V39 (FY 2022) 192` to `192`
    - We also need to decide how to handle leading zeroes.
  - Snack break.
  - determine if the code is a valid, standard service code
    - For this example, we only want comparable service codes. Invalid codes or internal billing codes should not be in the `code` field; we’ll still store the rates and descriptions for those rows. For simplicity, we’ll exclude any code longer than 5 characters. For example, the code `30208800` will not get reported, as it is not a HCPCS/CPT code, MS-DRG code, or revenue code.
  - set `rev_code` to the value of the revenue code column, if present
  - set `description` to the value of the description column
  - for each payer column that we identified earlier,
    - set `payer` to the payer column name.
      - If the name is `Discounted Cash Price` or `Gross Charge`, map it to `Cash Rate` or `List Price`, respectively.
      - If `Negotiated Charge:` is in the name, replace it with an empty string. E.g., `Negotiated Charge: NMH AETNA _3004_` becomes `NMH AETNA _3004_`.
    - set `rate` to the rate (price) in the corresponding payer column
      - Clean up the rate format to a numeric float, e.g., the string "$230.84" becomes 230.84, because we expect to perform calculations on numeric values.
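The row-level steps above translate into a handful of small, testable helpers. This is a minimal sketch, with our own function names, assuming the code/payer/rate conventions described in the pseudocode:

```python
import re

def classify_code_type(raw_code):
    """Infer code_type from substrings, per the pseudocode above."""
    if "MS-DRG" in raw_code:
        return "MS-DRG"
    if "HCPCS" in raw_code or "CPT" in raw_code:
        return "HCPCS/CPT"
    return ""

def clean_code(raw_code):
    """Keep only the trailing code token, dropping annotations.

    e.g. 'MS-DRG V39 (FY 2022) 192' -> '192'
    """
    match = re.search(r"(\S+)\s*$", raw_code.strip())
    return match.group(1) if match else ""

def is_standard_code(code):
    """Crude comparability filter: exclude codes over 5 characters."""
    return 0 < len(code) <= 5

def normalize_payer(column_name):
    """Map source column names to our common payer labels."""
    if column_name == "Discounted Cash Price":
        return "Cash Rate"
    if column_name == "Gross Charge":
        return "List Price"
    return column_name.replace("Negotiated Charge: ", "")

def clean_rate(raw_rate):
    """Currency string -> float, so we can do math later."""
    return float(re.sub(r"[^0-9.]", "", raw_rate))
```

Each helper mirrors one bullet of the pseudocode, which keeps the inevitable per-hospital tweaks localized.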
There are also other steps we don’t even have space to get into here, like automated tests to validate the data quality and to flag mistakes, algorithms to identify outliers and other bad data, designing a way to track price changes over time, and so on.
The parsing procedure can be implemented in a variety of software styles using loops (“for each row, do X”) or table- and column-oriented programming like Python pandas dataframes or R data.frames.
We should end up with a table in the target common format that we defined earlier.
Congratulations! You’ve almost made it. How do you feel?
We have roughly 5,000-6,000 hospitals in the US, depending on how you count them. Do we have to write a parser from scratch for each unique file? Yes and no. There are unavoidable aspects of the process that are brute force in nature (e.g., acquiring data source locations and the data itself, and writing new parsers), but we can organize our parsers to avoid repeating ourselves. The diagram below is an example of how we can use object-oriented software design to reuse common components. The lower layer of parser modules – individual programs that each parse a unique file – would have their own unique `format_data()` procedures; however, we can reuse shared procedures like importing standard file formats or applying validations.
Object-Oriented Programming Implementation Example for Code Reuse
- Imagine 7 unique parsers and 4 unique file types (XLSX, CSV, JSON, XML).
- Most of the time, reading an Excel workbook or CSV is a very similar procedure, with some edge cases (e.g., accounting for multiple sheets, encoding issues, or formatting issues). The `ExcelParser.read_data()` procedure might be reused by both Parser_01 and Parser_02, so we don’t need to re-write that step every time.
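The class layout in the diagram can be sketched as below. Class and column names here are hypothetical (we use a CSV reader to keep the example self-contained, but an `ExcelParser` would slot in the same way):

```python
import csv
import io

class BaseParser:
    """Shared pipeline: read_data() is reusable; format_data() is unique."""
    def parse(self, source):
        return self.format_data(self.read_data(source))

    def read_data(self, source):
        raise NotImplementedError

    def format_data(self, rows):
        raise NotImplementedError

class CsvParser(BaseParser):
    # One reusable reader shared by every CSV-based hospital file.
    def read_data(self, source):
        return list(csv.DictReader(source))

class Parser01(CsvParser):
    # Hospital-specific mapping into the common format.
    def format_data(self, rows):
        return [{"code": r["Code"], "rate": float(r["Price"])} for r in rows]

class Parser02(CsvParser):
    # A different hospital, different column names, same reader.
    def format_data(self, rows):
        return [{"code": r["CPT"], "rate": float(r["Amount"])} for r in rows]

out = Parser01().parse(io.StringIO("Code,Price\n99213,125.0\n"))
```

Only `format_data()` has to be written per file; the readers, validations, and output steps live once in the shared layers.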
Once we have our common format table, we have a data set that we can load into a database, enrich, and query for analysis! But the work doesn’t end there. We need a system for keeping this data fresh, managing metadata, and handling unexpected file formats or changes in a hospital’s file format over time. It’s a challenge, but one that we health data nerds revel in. As you attempt to run through this exercise, remember that parsing data when there is no standard for how that data must be displayed is a never-ending game of iteration. Did you find a faster (better?) way of parsing these files? Send me an email! Our data team would love to see what you did.
We hope you’ve enjoyed this blog! Next up, payer data!