
How NIH Is Translating 70 Years of Health Data to Speak the Same Language

A cross-agency effort to harmonize 12 petabytes of biomedical data is laying the groundwork for AI-driven health research at the National Institutes of Health.

Ashley Hackett

Ashley Hackett is a Senior Editor of FedTech magazine.

The National Institutes of Health sits on one of the largest collections of biomedical research data in the world. Decades of federally funded studies on health issues such as heart disease, lung conditions, sleep disorders and genomics have generated petabytes of information. But for most of that history, the data was collected by different institutes under varying standards, and the resulting formats haven’t been easy to integrate.

That dissonance is a problem when the goal is to train AI models that can accelerate medical research. Three NIH offices — the National Heart, Lung, and Blood Institute (NHLBI), the National Library of Medicine (NLM) and the Office of Data Science Strategy (ODSS) — are now working together to fix it. In a recent Bethesda AFCEA Health IT Summit panel on operationalizing data at scale in federal health, leaders from the three offices described the interoperability infrastructure they're building and why it matters for AI.

12 Petabytes and a Converter Box

At the center of the effort is BioData Catalyst, a cloud-based ecosystem developed by NHLBI in collaboration with NLM and ODSS.

"[NIH has] over 12 petabytes of data that's all multimodal — everything from genomics, clinical imaging, sleep, sensor data, all different modalities," said Sweta Ladwa, chief of the Scientific Solutions Delivery Branch, Information Technology and Applications Center (ITAC), at NHLBI. That data spans long-running studies such as the Trans-Omics for Precision Medicine (TOPMed) program, which tracks about 180,000 individuals.

But access alone doesn't make data AI-ready. The more difficult problem is interoperability — ensuring a cardiovascular variable from, say, a 1990s Framingham Heart Study cohort means the same thing as one from a recent pulmonary fibrosis study. NHLBI has built a LinkML pipeline to solve this — what Ladwa called a "converter box approach, where you can plug in the data from the source, put it in, and it will put it in that [format] for your analysis." The pipeline maps data across multiple standards, including LOINC (Logical Observation Identifiers Names and Codes), FHIR (Fast Healthcare Interoperability Resources) and HPO (Human Phenotype Ontology).
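As a rough illustration of the "converter box" idea, the Python sketch below maps differently named variables from two sources onto one shared concept and unit. It is a toy under stated assumptions, not NHLBI's LinkML pipeline: the study names, variable names, the `Mapping` class and the `harmonize` function are all hypothetical, and only the two LOINC codes (8480-6 for systolic blood pressure, 8867-4 for heart rate) are real identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    concept: str        # shared concept label (hypothetical naming)
    loinc: str          # LOINC code identifying the concept
    unit: str           # target unit after conversion
    scale: float = 1.0  # multiply the source value by this to reach the target unit


# Hypothetical (study, variable) pairs; real studies use their own data dictionaries.
MAPPINGS = {
    ("study_a", "SBP"): Mapping("systolic_blood_pressure", "8480-6", "mm[Hg]"),
    # Assumes this source recorded pressure in kilopascals (1 kPa = 7.50062 mmHg).
    ("study_b", "sys_bp_kpa"): Mapping("systolic_blood_pressure", "8480-6", "mm[Hg]", 7.50062),
    ("study_b", "pulse"): Mapping("heart_rate", "8867-4", "/min"),
}


def harmonize(study: str, record: dict[str, float]) -> list[dict]:
    """Convert one source record into rows keyed by shared concepts."""
    rows = []
    for variable, value in record.items():
        mapping = MAPPINGS.get((study, variable))
        if mapping is None:
            continue  # unmapped variables are skipped here; in practice they need review
        rows.append({
            "concept": mapping.concept,
            "loinc": mapping.loinc,
            "value": round(value * mapping.scale, 1),
            "unit": mapping.unit,
        })
    return rows


a = harmonize("study_a", {"SBP": 120})
b = harmonize("study_b", {"sys_bp_kpa": 16.0, "pulse": 64, "site_code": 3})
```

Once both records pass through `harmonize`, the systolic readings share a concept, a code and a unit, so they can be pooled for analysis regardless of how each source named or measured them.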

NHLBI pairs the automated mapping with clinical validation. "We have pulmonologists who we're working with to really clinically determine the concepts," Ladwa said, "because we want to make sure that this fancy hypertensive medication is the same as this other one. They're all in the same class, and they're all the same ontological concept." She added that the AI-assisted mapping uses "publicly available metadata" rather than patient data.


From Research Standards to the Electronic Medical Record

While NHLBI focuses on making existing research data interoperable, ODSS is working on a complementary problem: getting research-grade standards into the clinical systems where new data is generated daily.

Susan Gregurick, NIH’s associate director for data science and director of ODSS, described a push to map NIH research standards into the United States Core Data for Interoperability (USCDI), the interoperability standard that electronic medical record systems use for accreditation. That work began in oncology and is expanding to other disease areas, including a cardiovascular partnership with NHLBI.

The implication: When cardiovascular phenotypes appear in a patient encounter — even outside a formal study — EMR systems can capture them in a format researchers can use.

"The impact for that sort of cross-agency collaboration is really huge," Gregurick said. "I think that that's almost apart from AI, but it's going to be something that drives AI in the future."


Curating Data for Machines With Common Data Elements

Underpinning much of this work is the National Library of Medicine, the world's largest biomedical research library. Lisa Federer, acting director of NLM's Office of Strategic Initiatives, described the library as providing "the substrate for future work in AI" through assets such as Medical Subject Headings (MeSH) and Common Data Elements, which standardize how research data is described and collected across NIH.

But what's shifting, Federer said, is who consumes the data.

"We're not just thinking about humans. We're thinking about how machines are consuming data as well," Federer said. "You do have to consider not just a human consuming that, but how is agentic AI going to be consuming this information?"

How NLM curates data "for machine consumption is different from how we would curate it for human consumption," Federer said, a recognition that AI agents querying NIH resources don't read context clues the way a researcher does.

This distinction makes standardizing Common Data Elements all the more important, especially as NLM has noted an uptick in the number of bots and other AI agents crawling its digital repositories.

What's at Stake for NIH and the Future of Health Research

NIH invested nearly $400 million last year in AI-related research grants, according to Gregurick. But the less visible investment — the pipelines, ontologies and standards that make data usable across institutional boundaries — may matter more.

"When you're able to [connect] that data with the real-world data, with other data that exists out in the research space, or in the health data fabric across the nation, or even internationally, you just increase the power of that data to be able to do more," Ladwa said, "ultimately to help those affected by diseases and disorders."

Without interoperable standards, 70 years of health data stays locked in the formats it was born in. With the standards, it becomes the foundation for a new generation of AI-driven medical research.

Photograph by Bethesda AFCEA