Federal agencies are the custodians of massive amounts of data. The National Oceanic and Atmospheric Administration alone, for example, gathers about 20 terabytes of data per day.
There was a time when such large data sets could only be accessed or analyzed by powerful on-premises computers, but the arrival of cloud computing has shifted the dynamic.
Today, it’s entirely possible to work with terabytes of data without downloading a local copy or laying claim to a supercomputer. Agencies such as the National Institutes of Health, NASA and the Energy Department are exploring ways to use the cloud to share their wealth of data efficiently with researchers outside of government.
“Cloud provides a practical mechanism for distributing data more easily, quickly and to a wider audience,” says Adelaide O’Brien, research director of government digital transformation strategies at IDC.
NIH Turns to Cloud to Enhance Research Partnerships
In late 2017, NIH launched the Data Commons Pilot to test-drive commercial cloud services as a way to store, access and share biomedical data and associated tools. The program involved about 20 universities and independent research institutes, and three test case data sets: Genotype-Tissue Expression (GTEx), Trans-Omics for Precision Medicine (TOPMed) and the Model Organism Databases (MODs). “TOPMed is very large and very complex, in the petabyte range,” says Vivien Bonazzi, senior adviser for data science at NIH and the Data Commons project lead. “MODs has highly used assets, but is not very big. GTEx is somewhere in between, at several hundred terabytes.”
“They’re also related to each other scientifically, in the sense that researchers want to look across those data sets and get much more informed answers,” she explains.
The Data Commons Pilot is not the NIH’s only foray into the cloud. The Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) Initiative consists of public-private partnerships with commercial cloud service providers. STRIDES provides cost-effective storage, computing, tools and training for data-intensive biomedical research, project lead Nick Weber says.
NIH has made agreements with two commercial cloud service providers, including Google Cloud Services, to support the STRIDES Initiative; however, both the Data Commons Pilot and the STRIDES Initiative are being designed with an architecture that will function across cloud platforms and providers to ensure interoperability.
NIH Overcomes Both Tech and Cultural Cloud Hurdles
Bonazzi and Weber have faced both technological and cultural hurdles in standing up their projects, which fall under the NIH Common Fund’s New Models of Data Stewardship program.
On the tech side, finding a cloud services provider is only the start, they say. Data needs to be uploaded (and when it comes in petabytes, that takes time), it needs to be compartmentalized into buckets, and researchers need the right permissions to access it.
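To put the upload problem in rough numbers, the sketch below estimates how long a petabyte-scale transfer takes over a network link. It is a back-of-the-envelope illustration, not a description of NIH's actual setup: the link speeds, the decimal definition of a petabyte and the `transfer_days` helper are all assumptions chosen for the example, and real transfers rarely sustain a link's full rated speed.

```python
# Back-of-the-envelope estimate of bulk upload time.
# All figures here are illustrative assumptions, not NIH measurements.

PETABYTE = 10**15   # bytes, decimal definition (assumption)
GIGABIT = 10**9     # bits per second
SECONDS_PER_DAY = 86_400


def transfer_days(size_bytes: float, link_bits_per_second: float,
                  efficiency: float = 1.0) -> float:
    """Return the days needed to move size_bytes over a link.

    efficiency is the fraction of the rated link speed that is
    actually sustained (1.0 means the full rated speed).
    """
    if link_bits_per_second <= 0:
        raise ValueError("link speed must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")
    seconds = size_bytes * 8 / (link_bits_per_second * efficiency)
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    # Hypothetical link speeds, from a fast office line to a research backbone.
    for gbps in (1, 10, 100):
        days = transfer_days(PETABYTE, gbps * GIGABIT)
        print(f"1 PB over {gbps:>3} Gbps: {days:.1f} days")
```

Under these assumptions, a single petabyte needs more than a week even on a fully used 10-gigabit link, which helps explain why uploading comes well before any analysis can begin.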
Most biomedical researchers are not technologists, so a robust user interface is a must. Increased personal use of cloud resources can mask the scale of the challenge faced by the Data Commons Pilot, Bonazzi says: “When data is already in the cloud, there’s a misconception that you can use it right away.”
Making the test case data sets accessible and interoperable but still secure in a cloud environment is a daunting task. Bonazzi explains that this approach to data management is a cultural shift for NIH stakeholders, and requires time and buy-in to be successful.
There’s also the competitive nature of research to contend with. Because researchers need to battle for scarce funding, they are often territorial — but if each group builds its own service stacks on top of the data, that creates silos, Bonazzi warns.
“If we, in the long term, plan on having a collection of NIH data sets in the cloud, and if we have systems that operate individually on that data, and they can’t talk to each other — that’s a really big problem,” she says.
Helping outside researchers to work together was key to the success of the pilot, and Bonazzi is pleased with the progress made toward a collaborative approach. Even in its early stages, the NIH’s work on data stewardship is attracting interest.
“I’ve had quite a few federal agencies contact me over the last year asking for more information,” Bonazzi says.
NASA Gains Access to Scalable Computing Resources
Moving from pure storage to cloud-based analytics — those stacks of services NIH is building — has significant benefits, allowing researchers to both save time and work across data sets in a way that can be difficult or impossible using only on-premises computing resources.
NASA began seriously investigating the use of the cloud to host its vast earth sciences holdings in 2012, and is prototyping, testing and evaluating the use of cloud computing for distributing and processing earth observation data.
This image of Tropical Cyclone Riley off the Australian coast was created with data from NASA’s Worldview app, which lets scientists download satellite imagery layers collected in near real time to create their own maps and imagery. (The black slashes are data gaps.) Source: NASA
Two major subsystems, the Common Metadata Repository and Earthdata Search, have been migrated to the cloud and are operating there, say NASA’s Karen Petraska, program executive for computing services in the Office of the CIO, and Andrew Mitchell, manager of the agency’s Earth Science Data and Information Systems Project.
Petraska and Mitchell cite the scalability of the cloud as a key research-enabling factor, giving NASA scientists and affiliated researchers on-demand access to computing power as their needs change. Hosting data in the cloud also helps the organization move closer to the goal of making NASA’s earth science data accessible to the public, they say.
Cloud Enables DOE Labs to Enhance Collaboration
Pacific Northwest National Laboratory has the distinction of being the first federal entity to negotiate enterprise cloud computing contracts with commercial cloud providers such as Microsoft. Since those first contracts in 2011, PNNL has supported strong integration between cloud-based and on-premises resources, says CISO Jerry L. Cochran.
“Overall, we try to take the approach in IT that the cloud is just an enabling technology — we don’t put it in a box,” he explains.
Based in Richland, Wash., PNNL uses the cloud to drive cross-country collaboration, such as its atmospheric research project with Tennessee-based Oak Ridge National Laboratory.