
Federated AI: Unlocking the 97% of Healthcare Data That AI Has Never Seen

Every major AI breakthrough of the past decade has been powered by data — enormous quantities of text, images, code, and structured records assembled into training sets that dwarf anything previously available to machine learning. But in life sciences and healthcare, an estimated 97% of all data has never been touched by an AI model. It sits in hospital systems, pharma research databases, insurance records, and clinical trial archives — too sensitive to centralize, too valuable to ignore, and too fragmented to access with conventional AI infrastructure. Federated AI is the technology category that changes this, and it represents one of the most compelling infrastructure investment opportunities we have encountered at Neuron Factory.

[Figure: Federated AI network connecting distributed healthcare data nodes]

The central paradox of AI in life sciences is this: healthcare generates more potentially valuable AI training data than almost any other domain — genomic sequences, clinical outcomes, medical imaging, real-world evidence from patient populations numbering in the tens of millions — and yet life sciences AI models are among the most data-constrained in existence. The models used to discover drugs, design clinical trials, predict patient outcomes, and identify biomarkers are trained on tiny fractions of available data, because the vast majority of that data cannot legally, ethically, or commercially be moved to a centralized training environment.

This is not a temporary limitation waiting to be resolved by better data governance frameworks or more permissive regulation. It is a structural feature of the domain that reflects real and legitimate concerns: patient privacy, competitive sensitivity between pharmaceutical companies, regulatory requirements across jurisdictions, and the practical impossibility of obtaining consent from millions of historical patients whose data sits in systems built decades before AI was a consideration. The data is not going to become centralizable. The AI infrastructure has to adapt to the data.

At Neuron Factory, we invest in the infrastructure layer of AI. We look for companies building foundational technologies that enable AI capabilities that would otherwise be impossible — not applications sitting on top of existing AI infrastructure, but the infrastructure itself. Federated AI for life sciences fits this thesis precisely. It is foundational, it addresses a structural constraint rather than an incremental inefficiency, and it has a network-effect moat that becomes more powerful as the installed base grows.

The Decentralized AI Thesis

The dominant paradigm in AI infrastructure today is centralized: data is moved to the model, models are trained in large data centers with massive GPU clusters, and inference is served from centralized APIs. This paradigm has produced remarkable results, and we do not expect it to disappear. But we believe that a substantial portion of AI computing — particularly in regulated industries and applications involving sensitive personal data — will need to shift to a decentralized approach over the coming decade.

The forces driving this shift are structural and reinforcing:

  • Data fragmentation: The vast majority of the world's most valuable data is private, proprietary, and fragmented across thousands of organizations that have no incentive to share raw data with centralized aggregators. Public data — which has powered most foundation model training to date — is reaching exhaustion as a training resource. The next generation of AI capability improvements will come from accessing private data that cannot be centralized, or it will not come at all.
  • Regulatory trajectory: GDPR, HIPAA, the EU AI Act, and a growing wave of national data sovereignty requirements are making data centralization legally riskier and more expensive across jurisdictions. The regulatory trend is clearly toward stricter data localization requirements, not more permissive ones. Companies building AI infrastructure that operates within these constraints rather than against them are building with the regulatory wind at their backs.
  • Data transfer economics: Moving petabytes of medical imaging, genomic, and clinical data to centralized training infrastructure is expensive at current volumes, and network and storage costs are falling far more slowly than compute costs. As AI adoption expands beyond Big Tech to mid-market pharmaceutical companies, regional hospital systems, and specialty healthcare providers, the economics of raw data centralization become increasingly unviable.

The 97% Problem

The statistic that grounds our investment thesis in federated life sciences AI is striking: it is estimated that 97% of all healthcare data worldwide remains completely unused in AI model training. Not because it does not exist — healthcare is one of the most data-intensive industries in the world — but because it cannot be accessed with centralized AI infrastructure. Drug discovery models are trained on public data and small curated datasets when they could be trained on the longitudinal records of tens of millions of patients. Protein structure prediction models lack the proprietary assay data from pharmaceutical research archives that could substantially improve their accuracy. Clinical trial design is still largely manual when it could be AI-assisted using real-world evidence from millions of previous patients. Federated AI is the key that unlocks this data for AI use without requiring its centralization.

The Federated Architecture: Simple Principle, Complex Implementation

The core principle of federated AI is elegant: instead of moving data to the model, you move the model to the data. Computations are executed locally at the data source — in the hospital's data center, the pharma company's research environment, the insurance company's claims database — and only the outputs of those computations (model parameter updates, aggregated statistics, prediction outputs) are shared with the central coordinating infrastructure. The raw sensitive data never leaves its source.
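The move-the-model-to-the-data loop can be sketched in a few lines. The example below is a toy federated averaging (FedAvg) round over two simulated "hospital" datasets using plain logistic regression; the data, hyperparameters, and helper names are illustrative assumptions, not a production protocol:

```python
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=5):
    """One client's training pass: gradient descent on data that never
    leaves the client. Only the resulting weight delta is shared."""
    w = weights.copy()
    for _ in range(epochs):
        preds = 1.0 / (1.0 + np.exp(-X @ w))       # sigmoid predictions
        grad = X.T @ (preds - y) / len(y)          # logistic-loss gradient
        w -= lr * grad
    return w - weights                             # parameter update only

def federated_round(weights, clients):
    """Coordinator step: average the clients' weight deltas (FedAvg),
    weighted by local dataset size. Raw X, y stay at each site."""
    total = sum(len(y) for _, y in clients)
    delta = sum(len(y) * local_update(weights, X, y) for X, y in clients) / total
    return weights + delta

# Two hypothetical "hospitals", each holding a private dataset.
rng = np.random.default_rng(0)
w_true = np.array([1.5, -2.0])

def make_site(n):
    X = rng.normal(size=(n, 2))
    y = (X @ w_true + rng.normal(scale=0.1, size=n) > 0).astype(float)
    return X, y

clients = [make_site(200), make_site(300)]
w = np.zeros(2)
for _ in range(50):
    w = federated_round(w, clients)
```

Each client returns only a weight delta; the coordinator never sees the underlying records. Production deployments layer secure aggregation and differential privacy on top of this basic loop.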

This principle was initially developed and deployed internally by large technology companies — Google's federated learning for mobile keyboard prediction being the canonical early example — but has since evolved into a robust ecosystem of third-party and open-source frameworks that are beginning to enable broader market adoption. The technical implementation is more complex than the principle suggests:

The Secure Gateway Challenge

The most difficult engineering problem in federated AI deployment is not the federated training algorithm — those are reasonably well understood — but the secure gateway infrastructure that sits between local data and central models. This gateway must authenticate data access requests, verify that only approved computations are executed against local data, enforce differential privacy and other privacy-enhancing techniques, manage cryptographic key operations, audit all data access for regulatory compliance, and integrate seamlessly with the legacy data infrastructure of organizations that built their data systems decades before federated AI existed. Building a gateway that does all of this reliably, at enterprise scale, across the heterogeneous IT environments of global pharmaceutical companies and hospital networks, is a genuinely hard infrastructure engineering problem.
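Two of the gateway's duties — admitting only pre-approved computations and auditing every request — can be illustrated with a deliberately simplified sketch. All names here are hypothetical, and a real gateway would sandbox execution rather than call `exec` directly:

```python
import hashlib
import time

# Hashes of computation source text that data governance has signed off on.
APPROVED = {
    hashlib.sha256(
        b"def mean_age(rows): return sum(r['age'] for r in rows) / len(rows)"
    ).hexdigest(),
}
AUDIT_LOG = []

def run_approved(code: str, rows):
    """Execute a computation against local data only if its exact source
    text was pre-approved; record every attempt for compliance audits."""
    digest = hashlib.sha256(code.encode()).hexdigest()
    entry = {"ts": time.time(), "digest": digest, "approved": digest in APPROVED}
    AUDIT_LOG.append(entry)
    if not entry["approved"]:
        raise PermissionError("computation not on the approved list")
    namespace = {}
    exec(code, namespace)  # sketch only: real gateways sandbox execution
    fn = next(v for k, v in namespace.items() if not k.startswith("__"))
    return fn(rows)

rows = [{"age": 40}, {"age": 60}]
code = "def mean_age(rows): return sum(r['age'] for r in rows) / len(rows)"
result = run_approved(code, rows)  # only the aggregate leaves; rows do not
```

The sketch omits authentication, key management, and differential-privacy enforcement, each of which adds further layers to the same request path.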

Privacy-Enhancing Technology Integration

The most sophisticated federated AI deployments combine the federated learning paradigm with complementary privacy-enhancing technologies: homomorphic encryption (enabling computation on encrypted data), differential privacy (adding calibrated noise to outputs to prevent reverse-engineering of individual records), and synthetic data generation (creating realistic but non-identifying training data from sensitive originals). The companies building federated AI infrastructure that can orchestrate these techniques together — rather than requiring data owners to implement each separately — are providing a qualitatively higher level of privacy assurance that is increasingly necessary for regulatory compliance and enterprise customer confidence.
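Differential privacy, the most widely deployed of these techniques in federated training, can be sketched as clip-and-noise aggregation of client updates: clipping bounds any single client's influence, and Gaussian noise calibrated to the clipping bound prevents reverse-engineering of individual contributions. The parameter values below are illustrative; real deployments calibrate them to a formal privacy budget:

```python
import numpy as np

def dp_aggregate(updates, clip_norm=1.0, noise_multiplier=1.1, rng=None):
    """Differentially private aggregation of client model updates:
    1) clip each update to bound any single client's influence,
    2) sum the clipped updates,
    3) add Gaussian noise scaled to the clipping bound.
    A higher noise_multiplier strengthens privacy at the cost of accuracy."""
    rng = rng or np.random.default_rng()
    clipped = []
    for u in updates:
        norm = np.linalg.norm(u)
        clipped.append(u * min(1.0, clip_norm / max(norm, 1e-12)))
    total = np.sum(clipped, axis=0)
    noise = rng.normal(scale=noise_multiplier * clip_norm, size=total.shape)
    return (total + noise) / len(updates)

# Three hypothetical client updates; the outlier's pull is clipped and noised.
updates = [np.array([0.4, -0.2]), np.array([5.0, 5.0]), np.array([0.3, -0.1])]
private_mean = dp_aggregate(updates)
```

Orchestrating this alongside homomorphic encryption and synthetic data generation — rather than leaving each to the data owner — is precisely the integration work described above.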

The Neutral Infrastructure Advantage

One of the most important architectural decisions in federated AI platform design is neutrality: the infrastructure that coordinates computation across competing pharmaceutical companies or hospital systems cannot be owned or operated by any of the data contributors. A pharma company will not allow its clinical data to participate in a federated training run coordinated by a competitor's infrastructure. The companies building genuinely neutral, third-party federated AI infrastructure — positioned as trusted middleware between data owners with competing commercial interests — have a go-to-market advantage that vertically integrated approaches cannot replicate. Neutrality is not just a positioning choice; it is a structural prerequisite for cross-institutional collaboration at scale.

Life Sciences as the Beachhead: Why This Market First

Federated AI has potential applications across multiple industries where data fragmentation and privacy sensitivity create centralization barriers: financial services, defense and intelligence, manufacturing quality control, and telecommunications. At Neuron Factory, we focus our federated AI investments on the life sciences beachhead for specific reasons that go beyond market size.

First, the pain is acute and quantifiable. Drug discovery is measured in billion-dollar R&D programs with decade-long timelines and catastrophic failure rates. A federated AI platform that demonstrably improves the probability of clinical trial success or accelerates the identification of viable drug candidates by even small percentages delivers economic value so large that pricing and procurement are relatively straightforward conversations. The ROI case does not require elaborate modeling — it is visible and immediate.

Second, the early customer base is concentrated and reachable. The top 20 global pharmaceutical companies represent an enormous share of total pharma R&D spending, and they are sophisticated technology buyers with established processes for evaluating and deploying enterprise infrastructure. A federated AI company that wins production deployments at several top-20 pharma accounts has proof points that are credible to every other large pharma buyer and to the hospital systems and biotech companies that follow.

Third, the regulatory environment in life sciences is ahead of other industries in establishing clear frameworks for what privacy-preserving computation is legally sufficient. GDPR Article 89 provisions for research data, FDA real-world evidence guidance, and EMA data sharing requirements all create regulatory clarity that reduces compliance risk for federated deployments — a contrast with less-regulated industries where the legal status of federated computation remains ambiguous.

Applications Driving Near-Term Commercial Adoption

The federated life sciences use cases generating the most commercial momentum today are concentrated in three areas that represent different points on the regulatory and technical complexity spectrum:

  • Drug discovery and molecular AI: Training protein structure prediction and molecular property models on federated datasets from multiple pharma companies' proprietary assay archives. The performance improvement from accessing previously siloed training data is measurable and commercially compelling even at early stages of deployment.
  • Real-world evidence studies: Running federated analyses across hospital networks and insurance databases to generate evidence about drug efficacy and safety in real patient populations — evidence that cannot be generated from clinical trials alone and that regulatory agencies are increasingly requiring for post-approval commitments.
  • Clinical trial design optimization: Using federated access to historical trial data from multiple sponsors to improve patient selection, site selection, and protocol design for new trials — reducing costs and improving success probability.

The Network Effect Moat in Federated Infrastructure

The long-term competitive advantage of federated AI platforms is grounded in a network effect that operates differently from conventional platform network effects. In a consumer platform, network effects arise because users want to interact with other users. In federated AI, network effects arise because data contributors want access to models trained on data from other contributors — but only if joining the network never exposes their own raw data to those other contributors. The more contributors join the network, the more powerful the shared models become, and the more valuable network membership is to each individual contributor. This is a collaborative intelligence network effect, not a social network effect, and it creates a fundamentally different moat dynamic.

The practical implication is that the first federated AI platform to achieve critical mass in a specific data domain — drug discovery assays, electronic health records for a specific disease area, pharmacovigilance data from regulatory submissions — has a structural advantage over late entrants that is difficult to overcome. The models trained on the larger network are materially better than models trained on smaller networks, and the privacy-preserving nature of the architecture means that the training data advantage is not replicable by a competitor who simply raises more money. The moat is the data network, and the data network is built on trust relationships that take years to establish.

"The AI era that everyone has been building toward assumes that data can be collected, centralized, and fed to models at will. In life sciences, finance, and most regulated industries, that assumption is wrong — and has been wrong from the start. The infrastructure that enables AI to work on data that cannot be centralized is not a niche capability. It is a prerequisite for AI to deliver its full potential in the most consequential domains." — Kai Nakamura, Neuron Factory

Investment Criteria: What We Look for in Federated AI

Our evaluation of federated AI infrastructure investments focuses on six factors that we believe differentiate companies capable of achieving category leadership from those that will remain limited to specific use cases:

  1. Enterprise-grade production deployment: Federated AI is technically complex, and the distance between a working prototype and a production deployment at a global pharmaceutical company is substantial. We prioritize companies with documented production deployments at demanding enterprise customers over companies with compelling demos and promising pilots.
  2. Use-case and model agnosticism: The platforms with the largest addressable markets are those that can serve multiple AI use cases across multiple data types — not point solutions optimized for a single application. Architectural generality is a prerequisite for the network effects that create long-term competitive advantage.
  3. Privacy-enhancing technology depth: The regulatory requirements for federated AI deployments in life sciences are becoming more sophisticated, and the companies that offer integrated PET stacks — federated learning plus homomorphic encryption plus differential privacy plus synthetic data — will have compliance advantages over competitors offering federated learning alone.
  4. Neutral positioning: We are skeptical of federated AI companies that are subsidiaries or close affiliates of large data holders, because potential customers with competitive sensitivities will not trust their data to a platform with conflicting interests. Genuine neutrality — in ownership structure, governance, and go-to-market approach — is a prerequisite for cross-institutional adoption.
  5. Foundation model hosting roadmap: The most powerful expression of the federated AI value proposition is hosting large foundation models — protein structure predictors, clinical language models, genomic analysis models — on the federated infrastructure and allowing contributors to fine-tune them using their private data. Companies with a credible roadmap to this capability are building toward a platform position, not just a services business.
  6. Team depth at the research-engineering interface: Federated AI requires founders who can bridge academic machine learning research (where many of the core algorithmic advances originate) and enterprise software engineering (where production deployments are won and maintained). This combination is rare, and when we find it, it is a strong signal of the team's ability to simultaneously advance the technical frontier and deliver commercial value.

Beyond Life Sciences: The Longer Arc

Life sciences is our focus for federated AI investment today, but our long-term conviction is that federated AI infrastructure will become a critical layer in multiple industries where data fragmentation and privacy sensitivity create centralization barriers. Financial services — where model training on cross-institutional transaction data could dramatically improve fraud detection, credit risk assessment, and regulatory compliance — is the next domain we expect to see significant federated AI adoption. Manufacturing quality control, defense and intelligence applications, and telecommunications network optimization are each compelling medium-term opportunities.

The companies building genuinely general-purpose federated AI infrastructure in life sciences today are positioning themselves for this broader arc. Life sciences provides the demanding early customer base, the regulatory pressure that forces architectural rigor, and the proof points that validate the technology for adjacent industries. The platform companies that win in life sciences will be well-positioned to expand horizontally, taking their architectural advantages and trust relationships into new verticals with structural data fragmentation problems that conventional AI cannot address.

If you are building federated AI infrastructure, privacy-preserving computation, or the data collaboration layer for regulated industries, we want to hear from you. The window for establishing leadership in this category is open, the technical challenges are real and interesting, and the commercial opportunity — unlocking the vast majority of the world's most valuable data for AI use — is as significant as any we have seen in our investment history.


Kai Nakamura

Partner, Neuron Factory. PhD Computer Science (distributed systems), MIT. Former researcher in privacy-preserving machine learning. Leads Life Sciences AI, Federated Infrastructure, and Healthcare DeepTech investments at Neuron Factory Capital.
