Synthetic Data in Healthcare: the Great Data Unlock
By Blake Madden
A while back I wrote a deep dive on synthetic data in healthcare, which holds a ton of potential to unlock data access while solving the data privacy problem in healthcare.
I’d love to hear your thoughts and opinions about this space, and about what trends in healthcare data I need to be paying attention to.
Let’s dive in!
An Introduction to Synthetic Data
Imagine a computer ‘teaching’ self-driving car software how to drive.
By creating arenas like dense, urban environments with lots of turns, curbs (RIP my old Mazda CX-9) and cars, the software can be ‘taught’ without ever actually leaving a laboratory setting. At this point, over 10 billion simulated/synthetic miles have been driven!
Flight simulators are another great example, as they’re commonly used to train pilots prior to putting anything or anyone at risk in the real world. In fact, synthetic training of jet fighter pilots reached the point of sophistication where pilots describe real world environments as flying ‘exactly like the simulator.’
Synthetic data has been around for a while. In the early ’90s, Donald Rubin created the first framework for synthetic data by generating a dataset of anonymous U.S. census responses based on real Census data. In doing so, he successfully created a new synthetic dataset that matched high-level population statistics of the real census data.
Use of synthetic data is gaining steam. Accenture notes it as one of the top trends to watch in the life sciences + medtech space, while Gartner predicts that 60% of the data used to develop AI and analytics projects will be synthetically generated by 2024.
It’s a compelling complement to real patient data for a few different reasons.
It alleviates most privacy concerns. Because fictional patients are used and generated, privacy concerns are largely obsolete.
- Firms using synthetic data worry much less about HIPAA (or GDPR, for that matter) compliance. Is this what freedom feels like?
It’s flexible and expandable. Healthcare data engineers can design datasets for their specific use case and balance demographics to avoid algorithmic bias in the dataset. Data teams can expand the datasets to increase their size and overcome a lack of data (e.g., more miles for self-driving cars or more types of rare patients).
- Once they’ve received the desired dataset, teams can build out models for testing and iteration. Without synthetic datasets, data teams would have to rely on rigid real-world data and expend lots of time and resources to access it. Further, firms still wouldn’t be able to access enough data for specific patient populations with limited available data.
It’s accessible and economical. Since synthetic data is procedurally generated, the only big cost involved is training the model that generates it (a compute-intensive process), which means the cost to access synthetic data is orders of magnitude lower than the cost of actual data.
- Like I mentioned before, sourcing real data, in contrast, is hella expensive. For instance, getting data from a single patient in a clinical trial can cost upward of $20,000, while licensing real-world data can cost anywhere from $100,000 into the millions of dollars.
Synthetic data is generated entirely by machine learning and natural language processing algorithms (buzzwords, I know) but, again, it is modeled on real-world data. Although both are used to create synthetic data, firms are leaning toward natural language processing for this purpose over time.
The Current State of Data in Healthcare
Healthcare has always experienced a conflict between the benefits of data and the importance of patient privacy.
Accessing healthcare data for commercial and even academic use is notoriously difficult and expensive. For instance, accessing and getting approved to use health system data could take up to 24 months. Patient-level data can cost hundreds of thousands of dollars or more depending on scope. These roadblocks stifle progress.
We’re in the second inning related to the sophistication and application of data in healthcare. Privacy issues, lack of data standards, and data silos all have hamstrung progress and innovation, but this dynamic is rapidly evolving, especially given the emergence of generative AI products.
Despite these challenges, more data is being used than ever before, with platforms like Snowflake and AWS racing to provide tools and capture the potential of this information. The rise of cloud computing capabilities is unlocking the pent-up demand to enable more sophisticated data analytics and quicker product development.
Synthetic data has the potential to solve a lot of problems associated with data access in healthcare.
Synthetic data is data that is created by a computer program, based on real data or from algorithms. It is used to imitate real data and can be used instead of real datasets in different applications.
Although synthetic data cannot completely replace real data in every situation, privacy-preserving synthetic data is a valuable addition that enables researchers, engineers, and developers to work more effectively in various stages, including early feasibility and exploration, product development, scenario planning, and model training. This iteration allows for better fine-tuning of final products once they’re ready to be tested or implemented with less secure, more expensive real data.
Synthetic data is much more flexible than real patient data for product development and research purposes. It solves for the traditional complexities associated with the use of healthcare data while minimizing privacy concerns.
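To make "created by a computer program, based on real data" a little more concrete, here is a minimal sketch in Python. It is not how any particular vendor does it: the patient records are made up, the column names are hypothetical, and the approach (learn simple per-column statistics, then sample new rows from them) is about the simplest thing that could count as synthetic data generation.

```python
import random
import statistics
from collections import Counter

# Toy "real" records. These values are invented for illustration only.
real_patients = [
    {"age": 34, "sex": "F", "systolic_bp": 118},
    {"age": 61, "sex": "M", "systolic_bp": 142},
    {"age": 47, "sex": "F", "systolic_bp": 127},
    {"age": 72, "sex": "M", "systolic_bp": 150},
    {"age": 55, "sex": "F", "systolic_bp": 133},
    {"age": 29, "sex": "M", "systolic_bp": 115},
]


def fit_column_stats(records):
    """Learn simple per-column statistics from the real records."""
    stats = {}
    for column in records[0]:
        values = [r[column] for r in records]
        if isinstance(values[0], (int, float)):
            # Numeric column: remember its mean and standard deviation.
            stats[column] = ("numeric", statistics.mean(values), statistics.stdev(values))
        else:
            # Categorical column: remember how often each value appears.
            stats[column] = ("categorical", Counter(values))
    return stats


def sample_synthetic(stats, n, seed=0):
    """Draw n fictional patients from the learned statistics."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n):
        row = {}
        for column, spec in stats.items():
            if spec[0] == "numeric":
                _, mean, stdev = spec
                row[column] = round(rng.gauss(mean, stdev))
            else:
                counts = spec[1]
                row[column] = rng.choices(list(counts), weights=list(counts.values()))[0]
        synthetic.append(row)
    return synthetic


synthetic_patients = sample_synthetic(fit_column_stats(real_patients), n=1000)
print("real mean age:     ", statistics.mean(p["age"] for p in real_patients))
print("synthetic mean age:", statistics.mean(p["age"] for p in synthetic_patients))
```

No synthetic row corresponds to a real person, but the high-level statistics line up. The big simplification is that each column is sampled independently, so relationships between columns (older patients tending to have higher blood pressure, say) are lost. Preserving those relationships is exactly where the more sophisticated generative models earn their keep.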
Challenges with Real World Data in Healthcare
Healthcare is notoriously slow-moving and a lot of this lack of progress stems from data practices (and fax machines, of course). While policy and access are progressing, there are still a number of issues hampering innovation:
Privacy issues: Healthcare data breaches hit all-time highs in 2021, affecting 45 million people. With the recent Supreme Court abortion decision and expected fallout, patients and consumers are more wary than ever about privacy protections, even more so after Meta and hospitals were sued for collecting sensitive healthcare data and targeting advertising based on that protected data.
The hyper-sensitivity around patient data privacy is abundantly clear:
- Google has gotten into several patient data privacy scuffles over the years with both Project Nightingale and its work with UChicago. Although the firm worked with both Ascension and UChicago under a legal agreement, the public damage was already done.
- HIPAA violations are rampant among provider organizations as fines stack up, especially with recent Pixel issues and associated costs.
Compliance / HIPAA: Data requires stringent measures involving lots of red tape. To be HIPAA compliant, healthcare data specifically requires patient de-identification through one of the following methods under the Privacy Rule:
- Safe Harbor, a complete redaction of the 18 types of identifiers in protected health information (“PHI”) – AKA, a lot of the useful information like exact dates, detailed locations, etc; or
- Expert Determination (“ED”), which requires a partial redaction of data, then an expert determines (get it lol) whether the data is now appropriate to share. ED is problematic since no universal explicit standard exists for healthcare data sharing.
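For a feel of what Safe Harbor-style redaction does to a record, here is a rough Python sketch. Big caveats: the field names are hypothetical, it only handles a handful of the 18 identifier types, and the carve-outs it applies (keeping the year of a date, the first three ZIP digits, and lumping ages of 90 and up together) reflect my simplified reading of the rule, not a compliance-grade implementation.

```python
# Hypothetical field names; a real schema will differ. These stand in for a
# few of the direct identifiers that Safe Harbor requires removing.
DIRECT_IDENTIFIER_FIELDS = {
    "name", "phone", "email", "ssn", "medical_record_number", "street_address",
}


def redact_safe_harbor_style(record):
    """Return a copy of the record with identifiers dropped or coarsened."""
    cleaned = {}
    for field, value in record.items():
        if field in DIRECT_IDENTIFIER_FIELDS:
            continue  # drop direct identifiers entirely
        if field == "date_of_birth":
            # Keep only the year; assumes an ISO "YYYY-MM-DD" string.
            cleaned["birth_year"] = value[:4]
        elif field == "zip_code":
            # Keep only the first three digits. The actual rule also requires
            # the resulting area to be populous enough, which is not checked here.
            cleaned["zip3"] = value[:3]
        elif field == "age":
            # Ages of 90 and over are grouped into a single category.
            cleaned["age"] = "90+" if value >= 90 else value
        else:
            cleaned[field] = value
    return cleaned


patient = {
    "name": "Jane Example",
    "ssn": "000-00-0000",
    "date_of_birth": "1951-03-14",
    "zip_code": "78701",
    "age": 72,
    "diagnosis": "type 2 diabetes",
}
print(redact_safe_harbor_style(patient))
```

Even in this toy version you can see the tradeoff: the clinical content survives, but the precise dates and locations that make longitudinal or geographic analysis possible get coarsened away.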
Any use of healthcare data, whether for commercial or research purposes, has to be extremely secure, which limits an organization’s ability to test products or accelerate research projects and collaborations.
Data Complexity: Different data standards and formats exist. Databases are inconsistent and lack normalized structures. Valuable engineering time is wasted on the idiosyncrasies involved with traditional healthcare data. As the Tuva Project puts it, “Compared to other disciplines, doing healthcare data engineering and data science requires a tremendous amount of domain knowledge.”
Incumbent Status Quo Issues: Data infrastructures at provider organizations are closed and by default do not communicate with one another. Silos of data exist across organizations. Pricing is prohibitive for new entrants and favors incumbents in its current form.
Lack of Data Representation: Not only is data access broadly difficult, but healthcare also suffers from a lack of data representing diverse populations. Especially in the current health tech boom, many groups are underrepresented in the data used to train AI/ML models. For example, currently available datasets often do not have enough representation of rare disease patients to allow for effective predictive modeling, meaning the model’s impact in a real-world setting will be subpar and insufficient for the patients it is meant to help.
I’ll be discussing how synthetic data vendors tackle these healthcare-specific problems later on.
But first, let’s talk about where synthetic data is already being put to work.
Use Cases for Synthetic Data in Healthcare
In the same way that a fighter pilot trains in a simulated environment, healthcare organizations can harness synthetic data to validate and iterate clinical workflows or set baselines for drug development in clinical trials (or even discover what treatments are working better than others).
Some specific use cases:
Digital Health and Interoperability: A digital health company building interoperability infrastructure is leveraging synthetic data for building and testing its offerings first in a non-HIPAA environment. The use of synthetic data here reduces development costs and risks associated with working with real data to build products.
Life Sciences, Real-World Evidence and Clinical Trial Design: A global pharma company used synthetic data to access EU partner datasets in order to improve and accelerate real-world evidence (RWE) research and health economics outcomes research. In addition to RWE research, there are also several use cases for synthetic data in clinical trial design, particularly looking at how to design trial eligibility and intervention / control arms. Beyond clinical trial design, there’s an appetite to use synthetic patient-level data for commercial use cases, such as post-launch market surveillance for label expansion and diagnostic / risk stratification algorithms to identify under-treated patients. As we all know from those annoying cookie notifications in browsers, GDPR is notoriously strict. Consequently, synthetic data becomes even more valuable in EU nations.
Academic Medical Centers and Research and Education: A top academic university created a synthetic version of its EHR dataset to enable more secure research and mitigate privacy risks with less IRB oversight. The data is also being used in the academic setting to teach machine learning classes.
Public Health & Predictive Analytics: Synthetic data can be used to create scenarios to predict outbreaks, patient inflows, and other healthcare trends. This can be especially useful in planning and resource allocation.
Medical Imaging: In medical imaging, synthetic data can be used to augment datasets, especially when certain conditions or anomalies are rare. This helps in training better diagnostic models.
How the Synthetic Data Model Works
Synthetic data vendors act as an infrastructure layer for healthcare organizations who want to access the manufactured data.
Vendors can do a few different things depending on the license and scope of a specific project:
- Data Licensing. For clients who need data but don’t already have it (think: early stage health tech startup), a vendor may license its own data (which has been trained on and is representative of a certain level of patient records) on a monthly basis via its API, or via a direct data delivery.
- Custom Generation. When engaging with a client that has its own data or wants to work with partner data (like a health system or pharma company), a synthetic data vendor will generate synthetic datasets from those real patient records. This is typically an annual license that includes regular updates to the synthetic data, but can also take the form of a monthly license delivered via API.
- Augmentation. As a supplement to both Data Licensing and Custom Generation listed above, vendors can also offer highly customized datasets for specific use cases, a process called augmentation, and charge a flat rate for the service. Augmentation can include expanding population size (such as for rare cohorts) or balancing populations to address bias.
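Here is a minimal Python sketch of the balancing flavor of augmentation: topping up an underrepresented cohort until it matches the largest one. The cohort labels, field names, and the jitter trick are all assumptions made for illustration. A real vendor would generate new patients from a trained model, not copy and nudge existing rows.

```python
import random
from collections import Counter


def balance_by_group(records, group_key, jitter_key="age", seed=0):
    """Oversample smaller groups until every group matches the largest one.

    New rows are copies of existing rows with a small random nudge to one
    numeric field, and are flagged so they can be told apart later.
    """
    rng = random.Random(seed)
    groups = {}
    for record in records:
        groups.setdefault(record[group_key], []).append(record)

    target_size = max(len(rows) for rows in groups.values())
    balanced = []
    for rows in groups.values():
        balanced.extend(rows)
        for _ in range(target_size - len(rows)):
            new_row = dict(rng.choice(rows))
            new_row[jitter_key] = max(0, new_row[jitter_key] + rng.randint(-2, 2))
            new_row["synthetic"] = True
            balanced.append(new_row)
    return balanced


# Made-up cohort: eight common-condition patients, two rare-disease patients.
cohort = [{"cohort": "common", "age": 40 + i} for i in range(8)]
cohort += [{"cohort": "rare", "age": 33}, {"cohort": "rare", "age": 58}]

balanced_cohort = balance_by_group(cohort, group_key="cohort")
print(Counter(r["cohort"] for r in balanced_cohort))
```

The same idea covers "expanding population size": pick a target above the current count and generate until you hit it. The catch is that naive oversampling like this only recycles what the few rare patients already look like, which is why the quality of the underlying generator matters so much.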
Pricing and flexibility are key differentiators for a synthetic data player versus a more traditional data vendor who might charge care delivery organizations in the hundreds of thousands of dollars up to the millions of dollars for access. Even then, the vendor might limit your access or what you can do with their data.
Challenges & Threats to Synthetic Data Adoption
What are some of the roadblocks that synthetic data players may face?
Adoption & Education. Despite the potential in healthcare, the biggest threat to synthetic data’s growth is a lack of understanding of what synthetic data is and what it should and shouldn’t be used for. Prospective buyers will ask: “Is synthetic data even possible at the level of quality that I need? What are the possibilities with synthetic data? Is it worth it?” As we’re all painfully aware, healthcare as an industry is conservative and slow-moving, which may continue to give synthetic data players fits.
Disruption & Obsolescence: With the emergence of generative AI, it’s hard to know what will happen to the synthetic data space. My guess is that those vendors with access to better, cleaner datasets will use generative AI to augment their offerings. Still, you can’t ignore the fact that a generative AI product could render the entire synthetic data market obsolete in short order by allowing internal developers and engineers to create their own fake patient data quickly. For care delivery organizations testing new products or use cases, however, the emergence of generative AI for generating synthetic datasets is a net positive, since it should only increase speed to market. I imagine tools like ChatGPT will only get better at creating synthetic data:
- Generative AI models, especially Generative Adversarial Networks (GANs), are particularly adept at creating synthetic data. GANs consist of two neural networks – a generator and a discriminator – that work against each other. The generator tries to produce data, while the discriminator tries to distinguish between real and generated data. Over time, the generator becomes proficient at creating data that the discriminator can’t distinguish from real data. Generative AI serves as the engine behind the creation of high-quality synthetic data. When used correctly, it can address data scarcity, improve model training, ensure data privacy, and provide valuable insights in various domains, especially in sectors like healthcare. – ChatGPT
Data Fidelity and Integrity. Since datasets are generated and processed by machines, one of the biggest roadblocks for synthetic data will continue to be whether commercial users trust the fidelity of the data. Is the data reliable? When using synthetic data, vendors must continually confirm that the insights gleaned from synthetic and real data are the same. Still, even with faulty datasets, the ‘wrong’ data can still be used for 90% of tasks needed prior to validation.
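What might "confirming that the insights are the same" look like in practice? Here is a bare-bones Python sketch that compares column averages between a real and a synthetic dataset and flags any column that drifts past a tolerance. The data, column names, and the 5% threshold are all made up for illustration. Real fidelity checks go much deeper (distributions, correlations, downstream model performance), but the basic shape is the same.

```python
import statistics


def fidelity_report(real, synthetic, numeric_columns, tolerance=0.05):
    """Compare per-column means and flag columns that differ too much.

    tolerance is the maximum allowed relative gap between the two means.
    """
    report = {}
    for column in numeric_columns:
        real_mean = statistics.mean([r[column] for r in real])
        synthetic_mean = statistics.mean([r[column] for r in synthetic])
        # Guard against dividing by zero when the real mean is 0.
        relative_gap = abs(real_mean - synthetic_mean) / max(abs(real_mean), 1e-9)
        report[column] = {
            "real_mean": real_mean,
            "synthetic_mean": synthetic_mean,
            "relative_gap": relative_gap,
            "within_tolerance": relative_gap <= tolerance,
        }
    return report


# Invented example: age tracks well, A1C has drifted noticeably.
real_rows = [{"age": 40, "a1c": 6.0}, {"age": 50, "a1c": 7.0}, {"age": 60, "a1c": 8.0}]
synthetic_rows = [{"age": 41, "a1c": 8.0}, {"age": 49, "a1c": 9.0}, {"age": 61, "a1c": 10.0}]

for column, result in fidelity_report(real_rows, synthetic_rows, ["age", "a1c"]).items():
    status = "OK" if result["within_tolerance"] else "CHECK"
    print(f"{column}: gap {result['relative_gap']:.1%} -> {status}")
```

A check like this is cheap enough to run every time a synthetic dataset is regenerated, which is the kind of ongoing validation that helps build trust with commercial users.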
Ability to Synthesize Real-World Data. The biggest threat to healthcare innovation, like any large industry with behemoth incumbents and rampant regulatory roadblocks, is always the status quo. Synthetic data players will need access to real-world EHR and claims data to continue to generate synthetic data. Will existing incumbents – health systems, payors, academic centers – be willing to share their data in the near future with potential competitors (perhaps encroaching digital health players)? On top of that, will new, more stringent privacy regulations meddle with synthetic data? Or will the emergence of generative AI and healthcare specific large language models alleviate these problems? It’s a really interesting time for data in healthcare. That being said, healthcare is by default an industry shrouded in mystery and trade secrets, and I could see data-sharing following a similar path.
The Conclusion on Synthetic Data
I hope you guys found this dive into the synthetic data world as interesting as I did. Privacy and utility of healthcare data no longer have to be at odds, and synthetic data vendors (along with generative AI) are working to make that future a reality.
Synthetic data and increasing data accessibility are game-changers when it comes to the acceleration of innovation in healthcare, and we’re just getting started.
If there’s anything you have thoughts on, please feel free to give me a shout.
Thanks for reading! Til next time,