How Agencies Can Seamlessly Manage Data
Agencies may be able to combine commercial cloud services with on-premises architecture to streamline internal data management.
Before the National Institutes of Health began disseminating data through cloud services, potential research collaborators in separate locations shared information via FTP or as an email attachment.
If a file was too large to email or transfer over the internet, some researchers shared the data on thumb drives or CDs, which resulted in substantial data duplication, says Nick Weber, program manager of NIH’s Science and Technology Research Infrastructure for Discovery, Experimentation and Sustainability (STRIDES) initiative.
Collaborators no longer have to shuffle items between research centers or submit NIH-affiliated data that relates to more than one discipline to several applicable repositories. Contributors can now send the information to a general repository, where other researchers can access it through commercial cloud services, along with the computational tools the providers offer.
“Cloud has really flipped data sharing on its head,” Weber says. “Major data sets are located in cloud environments to allow people to use the data and collaborate with others there.
“That was a major driver for the STRIDES initiative and our partnership with Amazon, Google and Microsoft — to be able to say, how can we make this even simpler for researchers? How can we bring some additional ways to use the technologies the cloud offers to accelerate their research?”
Prior to implementing the Open Data Dissemination Program, the National Oceanic and Atmospheric Administration used a number of segmented paths to distribute information, according to CTO Frank Indiviglio.
Certain climate-related metrics and computations, for instance, were published to the Climate Change Portal web interface by NOAA’s Physical Sciences Laboratory. Weather forecasting models were available in another location.
Now, through an online NOAA repository list and commercial cloud offerings from Microsoft, Google and Amazon, the general public, academic institutions and other entities can access data culled from satellites, ships and other sources. The data sets range from observational system readings to information that must be processed before it is released, such as model data generated on a supercomputer.
“On the modeling side, there’s more opportunity to share that data,” Indiviglio says. “We’re talking about pretty big data sets; it’s not a straightforward thing to just make them available. You have to package them up and get them out to folks.
“The value is, we can present it in a very consumable way, and people can get it without having to go through the hoops of ‘I have to have the right network connection,’ or ‘I have to have this type of infrastructure.’”