Syntegra and Synthetic Data: Solving Healthcare’s Data Problem
By Blake Madden
Welcome back to another edition of Hospitalogy, a newsletter about the latest trends across healthcare!
Today I’ll be diving into synthetic data’s vast potential in healthcare, including a startup leading the way in the space – Syntegra – and what sets the firm apart in how it’s helping to solve the long-standing data access problem.
Before we get started, I wanted to clarify that this series is NOT a sponsored post – rather, this essay came from organic discussions between Xander Kerman and me.
I wanted to learn more about the synthetic data space and he and his team were willing to share Syntegra’s model and thesis with me.
This post is a 100% organic representation of working with the Syntegra team while understanding the opportunity that synthetic data as a whole represents
As someone who is intellectually curious about everything in healthcare, I really appreciated the discussion and the Syntegra team’s transparency!
Let’s dive in.
Join 8,700+ smart, thoughtful healthcare folks and stay on top of the latest trends in healthcare. Subscribe to Hospitalogy today!
Sponsored by: Weaver
I scan hundreds of articles and essays on healthcare each week. There are very few firms writing consistent, high-quality thought leadership content that make me stop in my tracks and read every word when they hit publish.
But Weaver’s health care advisory and valuation group is one of those firms that I read consistently.
The Weaver health care team has built a group of perceptive individuals who deliver concise and actionable insights on the latest health care business topics. They’re former colleagues and mentors of mine and one of the main reasons why I’m before you writing a healthcare newsletter today!
Weaver’s latest piece dives into the assumptions and implications behind McKinsey’s perspective on the outlook for provider profitability in 2022 and beyond. It’s worth checking out.
Introduction to Synthetic Data and Syntegra.
I’ll be the first one to tell you that I’m not any sort of data expert. But when Xander Kerman at Syntegra reached out to me with what they’re doing to solve the data access problem, I jumped at the opportunity to dive into the SaaS-based business model.
But what is synthetic data? At a high level, synthetic data is realistic data, where the data points are not real people but instead represent real relationships with fictional patients. We’ll dive deeper on this definition shortly and why it holds important implications for healthcare.
In this essay, we’re going on a journey to understand the growing demand for accessible and actionable healthcare data. I’ll explain why synthetic data in particular holds a unique opportunity to unlock care innovation across healthcare. Finally, we’ll get into why synthetic data holds the potential to solve a solid portion of the pesky, legendary data privacy vs. utility tradeoff that plagues data accessibility in healthcare.
Also, again, as a quick aside, this is not a sponsored post for Syntegra. I just think this topic is super cool. Onward!
Key Takeaways and Core Thesis for Synthetic Data.
In healthcare, there has historically existed a fundamental conflict between the potential impact and utility of data and the need to preserve patient privacy.
Accessing healthcare data for commercial and even academic use is notoriously difficult and expensive. For instance, accessing and getting approved to use health system data could take up to 24 months. Patient-level data can cost hundreds of thousands of dollars or more depending on scope. These roadblocks stifle progress.
We’re in the second inning related to the sophistication and application of data in healthcare. Privacy issues, lack of data standards, and data silos all have hamstrung progress and innovation.
Despite these challenges, more data is being used than ever before, with platforms like Snowflake and AWS racing to provide tools and capture the potential of this information. The rise of cloud computing capabilities is unlocking the pent-up demand to enable more sophisticated data analytics and quicker product development.
While synthetic data won’t replace real data for all cases, privacy-preserving synthetic data is an excellent complement that allows researchers and builders to work more efficiently through early stage feasibility and exploration, product development, scenario planning, and model training prior to fine tuning the final product with less secure real data.
Synthetic data is much more flexible than real patient data for product development and research purposes. Synthetic data solves for the traditional complexities associated with the use of healthcare data while minimizing privacy concerns.
To meet the growing demand, Syntegra is on a journey in the synthetic data space to make patient-level data rapid, flexible, and accessible to digital health firms, life sciences companies, health systems and payors alike.
Syntegra functions as an infrastructure SaaS layer to meet data needs for clients of all sizes and maintains a sustainable economic moat, which will allow it to differentiate its product offerings and remain a leader in the synthetic data space.
While I’m excited about synthetic data’s future, challenges to synthetic data adoption still remain.
Background: Current Challenges with Healthcare Data.
Healthcare is notoriously slow-moving and a lot of this lack of progress stems from data practices (and fax machines, of course). While policy and access are progressing, there are still a number of issues hampering innovation:
Privacy issues: Healthcare data breaches hit all-time high’s in 2021, affecting 45 million people. With the recent Supreme Court abortion decision and expected fallout, patients and consumers are more wary than ever about privacy protections, even more so after Meta and hospitals were sued for collecting sensitive healthcare data and targeted advertising based on that protected data.
The hyper-sensitivity around patient data privacy is abundantly clear:
- Google has gotten into several patient data privacy scuffles over the years with both Project Nightingale and its work with UChicago. Although the firm worked with both Ascension and UChicago under a legal agreement, the public damage was already done.
- HIPAA violations are rampant among provider organizations as fines stack up.
Compliance / HIPAA: Data requires stringent measures involving lots of red tape. To be HIPAA compliant, healthcare data specifically requires patient de-identification through one of the following methods under the Privacy Rule:
- Safe Harbor, a complete redaction of the 18 data fields containing patient health information (“PHI”) – AKA, all of the useful information like age, dates, locations, etc; or
- Expert Determination (“ED”), which requires a partial redaction of data, then an expert determines (get it lol) whether the data is now appropriate to share. ED is problematic since no universal explicit standard exists for healthcare data sharing.
Any use of healthcare data, whether for commercial or research purposes, has to be extremely secure, which limits an organization’s ability to test products or accelerate research projects and collaborations.
Data Complexity: Different data standards and formats exist. Databases are inconsistent and lack normalized structures. Valuable engineering time is wasted on the idiosyncrasies involved with traditional healthcare data. As the Tuva Project puts it, “Compared to other disciplines, doing healthcare data engineering and data science requires a tremendous amount of domain knowledge.”
Incumbent Status Quo Issues: Data infrastructures at provider organizations are closed and by default do not communicate with one another. Silos of data exist across organizations. Pricing is prohibitive for new entrants and favors incumbents in its current form.
Lack of Data Representation: Not only is data access broadly difficult, but healthcare also suffers from a lack of data representing diverse populations. Especially in the current health tech boon, many groups are underrepresented in the data used to train AI/ML models. For example, currently available datasets often do not have enough representation of rare disease patients to allow for effective predictive modeling, meaning the model’s impact in a real-world setting will be subpar and insufficient for the patients it is meant to help.
I’ll be discussing how synthetic data addresses these healthcare-specific problems later on.
But first, let’s talk about what synthetic data even is.
What is Synthetic Data? Definitions & Examples.
Realistic but not real: Synthetic data looks and acts like real data. It reflects the statistical properties of an underlying real dataset or multiple datasets.
At the same time, synthetic data is completely fictional and does not contain any actual patient information.
It’s 100% developed by machine learning and natural language processing algorithms (buzzwords, I know) but again, entirely based on real-world data. Although both are used to create synthetic data, firms are over time leaning toward natural language processing for this purpose (Syntegra uses natural language processing).
For you movie fans out there, I’d compare synthetic data to living in a simulation like the Matrix or an Inception-like dream. AKA, it’s so realistic that you can’t tell the difference between the simulation/dream and actual reality.
In the same way, the data generated doesn’t actually exist in our world, but it still creates actionable insights that happen in the world.
Here’s another great synthetic data analogy: imagine a computer ‘teaching’ self driving car software how to drive.
By creating arenas like dense, urban environments with lots of turns, curbs (RIP my old Mazda CX-9) and cars, the software can be ‘taught’ without ever actually leaving a laboratory setting. At this point, over 10 billion simulated/synthetic miles have been driven!
Flight simulators are another great example, as they’re commonly used to train pilots prior to putting anything or anyone at risk in the real world. In fact, synthetic training of jet fighter pilots reached the point of sophistication where pilots describe real world environments as flying ‘exactly like the simulator.’
Synthetic data has been around for a while. In the early 90’s, Donald Rubin created the first framework for synthetic data by generating a dataset of anonymous U.S. census responses based on real Census data. In doing so, he successfully created a new synthetic dataset that matched high-level population statistics of the real census data.
Use of synthetic data is gaining steam. Accenture notes it as one of the top trends to watch in the life sciences + medtech space, while Gartner predicts that 60% of data use will be synthetic by 2024.
It’s a compelling complement to real patient data for a few different reasons.
It alleviates most privacy concerns. Because fictional patients are used and generated, privacy concerns are largely obsolete.
- Firms using synthetic data worry much less about HIPAA (or GDPR, for that matter) compliance. Is this what freedom feels like?
It’s flexible and expandable. Healthcare data engineers can design datasets for their specific use case and balance demographics to avoid algorithmic bias in the dataset. Data teams can expand the datasets to increase their size and overcome a lack of data (e.g., more miles for self-driving cars or more types of rare patients).
- Once they’ve received the desired dataset, teams can build out models for testing and iteration. Without synthetic datasets, data teams would have to rely on rigid real-world data and expend lots of time and resources to access it. Further, firms still wouldn’t be able to access enough data for specific patient populations with limited available data.
It’s accessible and economical. Since synthetic data is procedurally generated, the only big cost involved is training the dataset – a computer intensive process – which means that the cost to access synthetic data compared to actual data is orders of magnitude lower.
- Like I mentioned before, sourcing real data in contrast is hella expensive. For instance, getting data from a single patient in a clinical trial can cost upward of $20,000 while licensing real-world data can cost $100,000 into the millions of dollars.
Synthetic Data in Healthcare.
In the same way that a fighter pilot trains in a simulated environment, healthcare organizations can harness synthetic data to validate and iterate clinical workflows or set baselines for drug development in clinical trials (or even discover what treatments are working better than others).
The Syntegra team helped me outline a few specific use-cases:
Digital Health and Interoperability: A digital health company building interoperability infrastructure is leveraging Syntegra synthetic data for building and testing its offerings first in a non-HIPAA environment. The use of synthetic data here reduces development costs and risks associated with working with real data to build products.
Life Sciences, Real-World Evidence and Clinical Trial Design: A global pharma company engaged Syntegra to access EU partner datasets to improve and accelerate real-world evidence (RWE) research and health economics outcomes research. In addition to RWE research, there are also several use cases for synthetic data in clinical trial design, particularly looking at how to design trial eligibility and intervention / control arms. Beyond clinical trial design, there’s an appetite to use synthetic patient-level data for commercial use cases, such as post-launch market surveillance for label expansion and diagnostic / risk stratification algorithms to identify under-treated patients. As we all know from those annoying cookie notifications in browsers, GDPR is notoriously strict. Consequently, synthetic data becomes even more valuable in EU nations.
Academic Medical Centers and Research and Education: A top academic university created a synthetic version of its EHR dataset to enable more secure research and mitigate privacy risks with less IRB oversight. The data is also being used in the academic setting to teach machine learning classes.
As synthetic data adoption continues to grow and organizations understand its value in saving time and money, Syntegra is well-positioned to respond to and capture increasing demand as the first mover into the space. Based on internal estimates, the synthetic data market is estimated at $30 billion, just a 15% slice of the large and rapidly expanding healthcare data market.
But who is Syntegra?
Syntegra’s Origins in Synthetic Data.
Surprise surprise – Dr. Michael Lesh founded Syntegra after running into data privacy pain points while working at UCSF.
At the time, Dr. Lesh was an established clinician who had led the cardiac electrophysiology at UCSF in addition to founding and running several successful medical device companies. Prior to Syntegra, he found a position as UCSF’s Executive Director of Health Technology Innovation and facilitated a project in collaboration with Google and UCSF.
The project sought to develop machine learning software that predicted patient outcomes using existing medical records data. But because the team couldn’t clear data privacy and de-identifying issues with legal, the project fell through.
After initial frustration, Dr. Lesh understood that current data privacy standards created serious, debilitating limitations for organizations working on life-saving issues across healthcare.
Determined to solve this unmet need, Dr. Lesh founded his next company, Syntegra . He and co-founder Ofer Mendelevitch recognized how effectively attention-based transformer models could pick up on longitudinal relationships in natural language processing tasks, and realized that they could apply the same principle to healthcare data.
By learning the “language” of health and disease, the models could pick up on the meaningful cause & effect relationships that govern the world of healthcare.
Syntegra outlined an ambitious mission: “Democratize data access to accelerate innovation for improved patient care and outcomes.”
Syntegra is aiming to solve one of the biggest pain points in healthcare outlined above – access to high-fidelity data.
By unleashing synthetic data to the world, Syntegra changes the healthcare innovation game. Innovation in care delivery and drug development can increase at an exponential rate as companies iterate processes over months instead of years.
If the Syntegra mission is successful, it’s a game-changer in healthcare. An unsexy, data-driven one, but a game-changer nonetheless.
How the Syntegra Model Works.
Syntegra acts as an infrastructure layer for healthcare organizations who want to access synthetic data. I want to be clear here since this tripped me up just a hair – Syntegra is the data provider and is use-case agnostic – the firm leaves actual use of data up to its clients.
Syntegra offers the following services to healthcare organizations:
- Data Licensing. For clients who need data but don’t already have it (think: early stage health tech startup), Syntegra licenses its own data (which has been trained on and is representative of 6M+ patient records) on a monthly basis via its API, or via a direct data delivery.
- Custom Generation. When engaging with a client that has its own data or wants to work with partner data (like a health system or pharma company), Syntegra generates synthetic datasets from those patient records. This is typically an annual license that includes regular updates to the synthetic data, but can also take the form of a monthly license with the custom synthetic data Syntegra generates delivered via Syntegra’s API.
- Augmentation. As a supplement to both Data Licensing and Custom Generation Licensing listed above, Syntegra also offers highly customized datasets for specific use cases, a process called augmentation, and charges a flat rate for the service. Augmentation can include expanding population size (such as for rare cohorts) or balancing populations to address bias.
In addition to the above offerings, Syntegra is building and developing new products, like its latest launch of its Synthetic Data API, which is a pioneering offering not only in allowing quick and easy access to patient-level data, but also its ability for users to build patient cohorts to fit a variety of needs.
Syntegra’s current pricing strategy reflects flexible pricing as a SaaS player. The firm offers varying ranges of annual (and month-to-month) licenses for access to its API:
I’m mentioning pricing and flexibility here because it’s a key differentiator for a synthetic data player versus a more traditional data vendor who might charge you in the hundreds of thousands of dollars up to the millions of dollars for access. Even then, the vendor might limit your access or what you can do with their data.
Syntegra’s Key Differentiators and Moat.
Now, as the risk-averse valuation guy in a past life, I had to ask the team what they thought about its moat and potential competition. Syntegra’s secret sauce lies in a few key differentiators. The startup is one of the first movers into the commercial synthetic data industry, allowing it to build out robust, longitudinal data relationships (e.g., tracing patients across different datasets, including EHR data, claims data, patient registries, lab and diagnostic data, and provider / medical CPT codes).
In addition to the depth of its longitudinal data, Syntegra’s perceived moat lies in its machine learning model — the Syntegra Medical Mind, an adapted transformer-based language model that learns the “language” of patients’ clinical trajectories. As it continues to train datasets, Syntegra develops a distinct flywheel.
The model gets smarter over time, the fidelity and representativeness of the data improves, and clients will be able to access more granular datasets to fit their needs. Through partnerships with academic medical centers, pharma, payors, and research orgs, Syntegra’s ML model has already trained on millions of patient records and continues to grow.
Economically, Syntegra seems pretty straightforward as far as SaaS firms are concerned. Its revenue streams stem from its Data Licensing and Custom Generation segments mentioned above at varying price points, including augmentation and customizing considerations. On the expense side, its biggest costs relate to dataset training (server intensive) and synthetic data generation (cloud computing).
While Syntegra’s model still exists in the early stages of development, I’d speculate it to develop into a high gross margin business based on the setup of its unit economics and scalability across digital health players, health systems, payors, life sciences, and miscellaneous healthcare organizations.
That being said, although the use case for synthetic data and Syntegra is clear, there’s still a long road to travel in order to achieve a wider level of adoption.
Challenges & Threats to Synthetic Data Adoption.
Of course in any assessment of any company or market that I dive into, I want to bring a balanced discussion. Here are some of the roadblocks that Syntegra and synthetic data will likely face.
Adoption & Education. Despite the potential in healthcare, its biggest threat to growth comes from understanding what synthetic data is and what it shouldn’t be used for. “Is synthetic data even possible at the level of quality that I need? What are the possibilities with synthetic data? Is it worth it?” As we’re all painfully aware, healthcare as an industry is conservative and slow-moving, which may continue to give synthetic data players fits.
Fidelity and Integrity. Since datasets are generated and processed by machines, one of the biggest roadblocks for synthetic data will continue to be whether commercial users trust the fidelity of the data. Is the data reliable? When using synthetic data, Syntegra and other players must continually confirm that the insights gleaned from synthetic and real data are the same. When I pressed on this, the Syntegra team was aware of this challenge and is working hard to ensure high data fidelity. The firm instituted a robust data QA process as well as other comparative measures, but a big challenge remains in getting stakeholders comfortable with the data.
- I should add that even in a doomsday scenario where synthetic data turns out to be unreliable, datasets still have use cases in the sense that data teams can still leverage the formats and build out around 90% of what they need to do prior to validation with real-world data.
Ability to Synthesize Real-World Data. The biggest threat to healthcare innovation, like any large industry with behemoth incumbents and rampant regulatory roadblocks, is always the status quo. Syntegra and other synthetic data players will need access to real-world EHR and claims data to continue to generate synthetic data. Will existing incumbents – health systems, payors, academic centers – be willing to share their data in the near future with potential competitors (perhaps encroaching digital health players)? On top of that, will new, more stringent privacy regulations meddle with synthetic data? Healthcare is by default an industry shrouded in mystery and trade secrets, and I could see data-sharing following a similar path.
I hope you guys found this dive into the synthetic data world as interesting as I did. Privacy and utility of healthcare data no longer have to be at odds, and Syntegra (and others!) is working to make that future a reality.
Synthetic data and increasing data accessibility are game-changers when it comes to the acceleration of innovation in healthcare, and we’re just getting started.
If you’re interested in chatting more about the synthetic data space, or just want to join the healthcare conversation, give me a follow on Twitter! People disagree with me on there all the time 🙂
Thanks for reading! Til next time,
Synthetic Data Resources.
- Here are some links to Syntegra for folks that are interested in learning more about them: Syntegra databases, API, whitepapers, and miscellaneous resources
- Here are some links to the articles and ideas discussed around synthetic data more generally:
Join 8,700+ smart, thoughtful healthcare folks and stay on top of the latest trends in healthcare. Subscribe to Hospitalogy today!