Since July 2022, health insurance companies have been required to publish how much they pay healthcare providers for nearly all health care services. In the past, these negotiated prices were kept secret, leading to wide differences in payments between providers. Patients and even doctors had very little idea how much was actually being paid for healthcare.

Posting prices of items and services seems like an easy problem to solve. After all, across other industries, we can instantly browse prices for nearly anything, whether we are looking for something big like a new house or something truly important, like fuzzy cat socks:

If other complex markets like air travel and real estate have public pricing, we should be able to do the same for healthcare. Unfortunately, nothing in the US healthcare system is simple.

While the Transparency in Coverage (TiC) rule has been effective in revealing payer-negotiated rates, the amount of data is overwhelming. Insurance companies are posting, in total, about a petabyte of price data per month. And at Turquoise Health, we are collecting, cleaning, and augmenting all of that data every month. We use the data to populate our patient-facing website, to share with our customers, and to power our own products.

If you are a software engineer, you might have just done a spit take at “a petabyte of data per month.” A petabyte is a truly astronomical amount of information — many hundreds of trillions of numbers. With only 66% of Americans covered by commercial health insurance, how is it that we need trillions of numbers to represent health care prices for only 220 million people?

Some Context On Hospital Pricing

Imagine that you are an online retailer selling cat socks. You would have a simple table of prices that looked like this:

That’s how most people think about prices — a menu of options provided at a fixed rate. So you’d think hospitals and insurance companies have a simple database table containing a list of services with prices like any other retailer. But in most cases, they have complex contracts that look like this:

Notice how every item has a price that requires external information to understand:

  • Per diem rates are paid for each day a patient is in the hospital. We need to know how long the patient will be in the hospital to know the total amount.
  • Rates for Cardiac Studies require knowing the price the hospital will bill in the future. The rate is essentially “44.8% of another unknown price,” which isn’t terribly helpful to a patient.
  • Radiology rates are based on an external price list that has to be looked up in an entirely different database. External rate lists are very common in health insurance but are not helpful unless you have access to the latest price list and can do the math yourself.
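The three rate types above behave like a tiny rules engine: no price exists until external information arrives. Here is a hedged sketch of that idea; the rate values, rule names, and fee-schedule entries are invented for illustration, not taken from any real contract.

```python
# A toy contract "rules engine" illustrating the three rate types above.
# All rates, codes, and fee-schedule values are invented for illustration.

EXTERNAL_FEE_SCHEDULE = {"73721": 240.00}  # hypothetical external radiology price list


def price(rule, **context):
    """Resolve a contract rule to a dollar amount, given external context."""
    if rule["type"] == "per_diem":
        # Needs the length of stay to produce a total.
        return rule["rate"] * context["days_in_hospital"]
    if rule["type"] == "percent_of_billed":
        # Needs the hospital's future billed charge ("44.8% of an unknown price").
        return rule["percent"] * context["billed_charge"]
    if rule["type"] == "fee_schedule":
        # Needs a lookup in an entirely separate price list.
        return EXTERNAL_FEE_SCHEDULE[context["billing_code"]]
    raise ValueError(f"unknown rule type: {rule['type']}")


# None of these prices exist until the external information shows up:
med_surg = price({"type": "per_diem", "rate": 2500.0}, days_in_hospital=3)
cardiac = price({"type": "percent_of_billed", "percent": 0.448}, billed_charge=10000.0)
mri = price({"type": "fee_schedule"}, billing_code="73721")
```

The point of the sketch: each branch depends on a `context` value the patient does not have, which is exactly why a formula alone is not a usable price.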

Making a Menu of Prices

Health insurance companies don't have a simple price list showing every price for every service. They have a rules engine that calculates prices based on formulas and look-up tables. While the healthcare system has lived with formula-based pricing for decades, the government was rightfully concerned that releasing rules-based prices would not be helpful to patients. Is “44.8% of an unknown price” better or worse than “162% of a benchmark rate that I also don’t know?”

As a result, the Transparency in Coverage rule requires insurance companies to do the math for patients and, in most cases, publish prices as dollar amounts. That’s helpful for patients, but it requires that a price be pre-calculated for every possible service. Pre-calculating all the prices also causes the data to mushroom from a few formulas that apply to a range of billing codes to thousands of individual prices.

For example, using the table above, a payer would display a dollar amount for each billing code in the range shown in the Cardiac Studies column.

This single row:

Expands to many rows:


Billing Code    Service
1               Cardiac Studies
2               Cardiac Studies
…               …
99              Cardiac Studies

This makes the data much larger than what insurance companies store internally, but it also makes the data much easier to use.
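The expansion itself is mechanical. A hedged sketch of what a payer effectively has to do, using the 44.8%-of-billed-charges rule from the contract above (the chargemaster values and code range here are invented for illustration):

```python
# Expanding one formula-based contract row into per-code dollar rows,
# as the Transparency in Coverage rule requires. The charge amounts
# and billing-code range are invented for illustration.

BILLED_CHARGES = {code: 100.0 * code for code in range(1, 100)}  # hypothetical chargemaster


def expand(rule_percent, code_range):
    """One rule covering a range of codes becomes one dollar row per code."""
    return [
        {
            "billing_code": code,
            "service": "Cardiac Studies",
            "negotiated_rate": round(rule_percent * BILLED_CHARGES[code], 2),
        }
        for code in code_range
    ]


rows = expand(0.448, range(1, 100))
# One contract row has become 99 pre-calculated price rows.
```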

Same Procedure, Different Scenarios

We could store a menu of prices for each service and still have a manageable amount of data. In practice, though, every service has many possible prices because each service has many different scenarios that could affect the final price. Here are some common situations that change the price:

  • A surgery was started but cancelled mid-operation for medical reasons.
  • A nurse practitioner performed the procedure in place of a doctor.
  • A doctor performed multiple procedures at once, which is cheaper than performing each procedure separately.
  • Instead of a doctor performing the MRI, a technician runs the MRI machine and a doctor reviews the results later.
  • A flu vaccination is given by a mobile flu clinic instead of in a hospital.

So every item in the price list gets a price for all of the possible situations, creating hundreds of price variations of each service.
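In schema terms, scenarios like these usually show up as billing-code modifiers and places of service, and every combination can carry its own price. A hedged sketch of how the variations multiply (the modifier list and counts are illustrative, not exhaustive):

```python
from itertools import product

# Illustrative price-affecting dimensions; real contracts have many more.
# "" = no modifier; 53 = discontinued procedure; SA = nurse practitioner;
# 51 = multiple procedures; TC/26 = technical/professional component.
MODIFIERS = ["", "53", "SA", "51", "TC", "26"]
PLACES_OF_SERVICE = ["hospital", "office", "mobile_unit"]
BILLING_CODES = range(1, 100)  # the 99 codes from the expanded table above

variations = [
    (code, modifier, place)
    for code, modifier, place in product(BILLING_CODES, MODIFIERS, PLACES_OF_SERVICE)
]
# 99 codes x 6 modifiers x 3 places of service = 1,782 priced rows instead of 99.
```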

Massive Data Redundancy

Even with separate prices for each variant of each service, the data is still possible to manage. Where things get hairy is how the data is represented. The Transparency in Coverage schema requires rates to be arranged in a hierarchical fashion, like this:
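As an illustration of the hierarchy, here is a simplified, hand-written sketch. The field names loosely follow the CMS in-network file schema, but the values are invented and the structure is heavily abbreviated; this is not a real payer file.

```python
# Simplified sketch of the hierarchical in-network rate structure.
# Field names loosely follow the CMS schema; all values are invented.

in_network = [
    {
        "billing_code": "93000",
        "negotiated_rates": [
            {
                "negotiated_prices": [{"negotiated_rate": 44.80, "billing_code_modifier": []}],
                # the hospital's full provider list, embedded here...
                "provider_groups": [{"npi": [1111111111, 2222222222, 3333333333]}],
            },
            {
                "negotiated_prices": [{"negotiated_rate": 22.40, "billing_code_modifier": ["53"]}],
                # ...and repeated again for every price variation
                "provider_groups": [{"npi": [1111111111, 2222222222, 3333333333]}],
            },
        ],
    },
]
```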

This means that each hospital's long list of doctors is repeated for every possible price variation of every medical service. The data balloons multiplicatively.

The schema does provide a way to avoid this duplication by referring to a separate list of doctors instead of repeating them. But taking advantage of that feature is currently optional, so not all insurance companies use it.

Clinically Implausible Rates

The data is further ballooned because it includes rates for many medical services that can never occur. For example, an insurance company may have a contract that says they pay $180,000 for a heart transplant. The payer follows the rules and publishes that price for the list of “all doctors in this hospital.” That means including a heart transplant rate for podiatrists, pediatricians, urologists, and neurologists. Those doctors would never perform a heart transplant, so the data is dead weight. When multiplied by every possible procedure performed by every type of care provider, this issue adds a significant amount of useless data.
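Filtering these out amounts to checking each rate against the rendering provider's specialty. A hedged sketch with an invented specialty-to-procedure map; a real implementation would derive this mapping from NPI taxonomy codes and historical claims data rather than hard-coding it:

```python
# Dropping clinically implausible rates: keep a rate only if a provider
# of that specialty plausibly performs the procedure. This map is
# invented for illustration.

PLAUSIBLE_PROCEDURES = {
    "cardiothoracic_surgery": {"heart_transplant", "bypass_graft"},
    "podiatry": {"bunionectomy"},
    "pediatrics": {"well_child_visit", "flu_vaccine"},
}


def is_plausible(specialty, procedure):
    return procedure in PLAUSIBLE_PROCEDURES.get(specialty, set())


rates = [
    {"specialty": "cardiothoracic_surgery", "procedure": "heart_transplant", "rate": 180_000},
    {"specialty": "podiatry", "procedure": "heart_transplant", "rate": 180_000},  # dead weight
    {"specialty": "pediatrics", "procedure": "heart_transplant", "rate": 180_000},  # dead weight
]
kept = [r for r in rates if is_plausible(r["specialty"], r["procedure"])]
# Only the cardiothoracic surgeon's heart transplant rate survives.
```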

Highly Redundant Insurance Plans

Lastly, the law requires that rates are published for every insurance plan. Insurance companies don't usually set rates at the plan level. Insurance companies agree to rates with networks of doctors and those rates are re-used among many insurance plans. They may offer 1,000 slightly customized insurance plans to 1,000 different employers that all share largely the same rates for the same doctors. The result is 1,000 giant files that repeat the same prices for the same doctors repeatedly. There is some guidance on avoiding this in the schema, but the solutions aren't always used effectively.
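Because many plans share one negotiated rate set, the natural fix is to store the rate set once and map plans to it. A minimal sketch of that idea, with invented plan names and rates:

```python
# Many plans, one shared rate set: store the rates once and map each
# plan to them. Plan names, codes, and rates are invented for illustration.

shared_rate_sets = {"rate_set_A": {"93000": 44.80, "73721": 240.00}}

# 1,000 slightly customized employer plans, all pointing at one rate set.
plan_to_rate_set = {f"Employer Plan {n}": "rate_set_A" for n in range(1, 1001)}


def rates_for_plan(plan_name):
    return shared_rate_sets[plan_to_rate_set[plan_name]]
```

The mapping table grows with the number of plans, but the bulky rate data is stored exactly once instead of a thousand times.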

Shrinking it Down

First, we must credit CMS. It developed its payer data schema in public on GitHub, accepts suggestions from interested parties, and is responsive to feedback. We expect to see some of the issues mentioned here addressed in the future. Thanks, CMS!

But in the meantime, Turquoise Health is processing around a petabyte of data per month, which our users want to use to make decisions about healthcare prices. Our goal is to make the data as small, clean, and easy to use as possible.

We’ve been working to pull the data out of an overly hierarchical structure to remove the inherent duplication. The idea is to look for duplicated information stored lower in the hierarchy and keep only one copy. Along the way, we can also remove some unneeded levels of hierarchy:

By storing only one copy of each piece of information, we can store the exact same information with an 85% reduction in size.
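One way to picture the de-duplication: replace each repeated provider list with a short key and store the list itself exactly once. A toy sketch of that normalization (the real pipeline is far more involved):

```python
import hashlib
import json

def dedupe_provider_groups(rate_rows):
    """Store each distinct provider list once, keyed by a content hash."""
    provider_lists = {}
    slim_rows = []
    for row in rate_rows:
        providers = sorted(row["npi"])  # normalize ordering before hashing
        key = hashlib.sha256(json.dumps(providers).encode()).hexdigest()[:12]
        provider_lists[key] = providers  # stored once, however many rows reference it
        slim_rows.append(
            {
                "billing_code": row["billing_code"],
                "negotiated_rate": row["negotiated_rate"],
                "provider_list_id": key,  # short key replaces the embedded list
            }
        )
    return slim_rows, provider_lists


rows = [
    {"billing_code": "93000", "negotiated_rate": 44.80, "npi": [1111111111, 2222222222]},
    {"billing_code": "93005", "negotiated_rate": 21.10, "npi": [2222222222, 1111111111]},
]
slim, lists = dedupe_provider_groups(rows)
# Two rate rows, but only one stored copy of the provider list.
```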

We are applying further optimizations to shrink the data even more. We've analyzed historical healthcare records to identify which types of doctors are likely to perform which types of procedures. By removing prices for clinically implausible services, we can cut the data at least in half again.

Smaller Data Leads to Insights (and Saves Money)

Performing this de-duplication processing is a tricky data engineering problem. The data is spread across millions of files from hundreds of insurance companies. Cross-referencing the data to find only unique information within a massive dataset is computationally intensive.

But the effort is worth it. Reducing this data down to a much smaller dataset has huge implications for our users and for us.

First, a much smaller dataset is dramatically faster to query. Queries that used to take minutes or hours now take seconds. Faster data exploration naturally leads to more exploration and insights.

Second, working with a much smaller dataset is quicker and more cost-effective when running complex analyses across the entire dataset. We are working on several exciting projects that require summarizing the entire price landscape. One of our first goals is to quantify the quality of each insurance company's published rates, which requires quickly analyzing all of the data together.

Third, the duplicated data we eliminated was making the data harder to understand. When users searched and exported data, they could get reams of repeated information that was challenging to pick through. Removing redundant data makes rates easier to understand.

Show Me Tangible Results

After filtering down the data, we are left with a few tens of terabytes of compressed information that represents the prices of the entire US commercial healthcare system. But tens of terabytes is still a challenge for traditional database systems.

To query this data quickly, we use the open-source Trino database engine. Trino is designed from the ground up for big data analytics. The raw rate data is stored in Amazon S3 in a compressed file format. Users write normal SQL and Trino scans the data in S3 to find what it needs. To the user, it looks just like any other database. But behind the scenes, Trino executes many queries in parallel using a cluster of servers to speed up results.

Imagine that we want to grab the Anthem insurance rates for insurance plans for hospitals in California. It seems like a simple request, but even something this basic would be slow and difficult with the original data.

  • Anthem operates in 14 states, including country-sized states like California and New York, and offers many plans in each market, making the pool of data we are checking quite large.
  • We look for hospitals using National Provider IDs (NPIs), but each hospital has several NPIs, and the original payer data schema repeats the list of providers for every price – and we have to check them all!

So we are actually looking for hundreds of needles in a haystack approximately the size of Mars. But with the deduplicated payer data and Trino, this query is no sweat. Previously, this kind of query could take an hour or more to run. Now it runs in 10-20 seconds.
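Against the deduplicated tables, the query itself is ordinary SQL. A hedged sketch of what it might look like; the table and column names here are invented for illustration and are not Turquoise Health's actual schema:

```python
# Illustrative SQL for the Anthem-in-California lookup. Table and column
# names are invented; Trino fans a scan like this out across its cluster.
QUERY = """
SELECT r.billing_code, r.negotiated_rate, p.npi
FROM negotiated_rates r
JOIN provider_lists p ON r.provider_list_id = p.provider_list_id
WHERE r.payer = 'Anthem'
  AND p.state = 'CA'
  AND p.provider_type = 'hospital'
"""

# With the Trino Python client, this would run roughly as:
# conn = trino.dbapi.connect(host=..., port=..., user=..., catalog=..., schema=...)
# cur = conn.cursor()
# cur.execute(QUERY)
# rows = cur.fetchall()
```

Because the provider lists are stored once and joined by key, the engine no longer re-scans the same repeated NPI lists for every price.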

Better Data Makes Everything Better

Raw performance improvements are great on their own, but these data improvements will also flow through to all of the tools on the Turquoise Health platform.

Many users explore pricing data through Rate Sense, our health rates search engine. Using Rate Sense, very complex questions like "how much does it actually cost to deliver a baby in California?" can be answered in seconds.

Just query for MSDRG 768 (the code for an inpatient child delivery in a hospital) and get specific prices or even a complete market summary in seconds:

These data updates will make all our tools even more delightful to use. We are excited about the new possibilities! Want to learn more? Get in touch.