FEDTECH: Can you talk about the importance of data to the NCI and the role your team plays in data management?
CORTNER: Data are the primary output of NCI’s research programs and the key resource our investigators use to bring forth the discoveries, diagnostics and therapeutics produced by the institute to advance its mission. From this perspective, scientific data and the people who spend their time extracting knowledge from it can be viewed as NCI’s greatest assets.
Optimizing the ability of our investigators to analyze the scientific data they generate is essential to the institute’s success and a major way NCI’s IT groups support the mission. We do it by providing a highly secure environment that maximizes data access and interoperability with cutting-edge computational resources.
FEDTECH: How do you approach data storage, and how has this strategy evolved over time?
CORTNER: In 2018, we determined that our investigators lacked the tools they needed to easily manage their data and instead were spending too much time managing storage and data transfers. The net result was that a large portion of NCI’s high-value scientific data (HVSD) was stored without appropriate backup and was disassociated from the experimental detail and sample-level metadata required to understand and analyze it. Besides being frustrating and inefficient for investigators, this situation placed our HVSD at risk.
Initially, we were interested in on-prem [Amazon] S3 storage because of the low cost and the possibility that a solution like Cloudian could enable programmatic access. We subsequently determined that it is more efficient for us to work directly with Amazon Web Services to create tiered storage solutions that incorporate cloud, rather than adding another layer of fees and getting locked into services provided by the on-prem S3 storage provider. This is especially true given how rapidly AWS services evolve. On-prem S3 does retain its cost advantage over other on-prem storage hardware, but there are performance trade-offs that limit the use cases on-prem S3 can realistically support.
NCI has sought to leverage the utility of on-prem S3 while minimizing the performance impacts by using it primarily as an archive for master copies of our HVSD. Operational copies of that data are made available on more performant storage located alongside the computational resources used for data processing and analysis.
FEDTECH: You’re clearly concerned about keeping data safe. What part do these master copies play in your overall data backup strategy?
CORTNER: In the broadest sense, safeguarding data requires both data protection and disaster recovery. Data protection strategies aim to mitigate data loss from unintended changes, such as those introduced by user error or file corruption, by periodically replicating, or backing up, the data. Disaster recovery refers to the ability to fail over to a replica of user data within the framework of their working file structure in the event of a hardware failure in the primary data center.
We use a multipronged storage approach to safeguard our data. Read-only master copies of instrument-generated HVSD — genomic data are one example — are co-stored with basic curation-level metadata in a noncommercial virtualized storage system that can also retain the sample-level experimental data that are essential to support both primary studies and secondary reuse.
We call this virtualized storage system the data management environment. It consists of an iRODS database and graphical user interface that is compatible with multiple storage types and enables data to be programmatically moved between storage tiers. Operational copies of HVSD stored in the DME can be downloaded to local storage at computational facilities while the data are undergoing active analysis and then be deleted when no longer in use. DME master copies effectively serve as the backup for operational copies. Master copies of processed data and data analysis outputs can be sent back to the DME and associated with the original HVSD from which they were derived.
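The master-copy and operational-copy workflow described above can be pictured with a minimal Python sketch. It is an illustration only and makes several assumptions: a local directory stands in for the archive tier, and the `Archive` class with its `register`, `checkout` and `release` methods are hypothetical names, not the DME or iRODS interface.

```python
import hashlib
import shutil
from dataclasses import dataclass, field
from pathlib import Path
from typing import Optional


def sha256(path: Path) -> str:
    """Return the SHA-256 checksum of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


@dataclass
class MasterRecord:
    path: Path                          # where the read-only master copy lives
    checksum: str                       # used to verify operational copies
    metadata: dict                      # curation-level metadata kept with the data
    derived_from: Optional[str] = None  # ID of the dataset this was derived from


@dataclass
class Archive:
    """Hypothetical archive tier; an illustration, not the DME or iRODS API."""
    root: Path
    records: dict = field(default_factory=dict)

    def register(self, data_id, source, metadata, derived_from=None):
        """Store a read-only master copy together with its metadata."""
        if derived_from is not None and derived_from not in self.records:
            raise KeyError(f"unknown parent dataset: {derived_from}")
        dest = self.root / data_id
        shutil.copyfile(source, dest)
        dest.chmod(0o444)  # master copies are read-only
        record = MasterRecord(dest, sha256(dest), dict(metadata), derived_from)
        self.records[data_id] = record
        return record

    def checkout(self, data_id, scratch):
        """Copy a master to local scratch storage as a writable operational copy."""
        record = self.records[data_id]
        working = Path(scratch) / data_id
        shutil.copyfile(record.path, working)
        if sha256(working) != record.checksum:
            raise IOError(f"checksum mismatch for {data_id}")
        return working

    def release(self, working):
        """Delete an operational copy once analysis is finished."""
        Path(working).unlink()
```

In this sketch, an analysis would call `checkout` to get a working file, `register` any processed output with `derived_from` set to the original dataset's ID, and then `release` the working file. The master copy is never modified, so it continues to serve as the backup for whatever operational copies exist.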
FEDTECH: You mentioned disaster recovery. What would you do if you suddenly experienced a hardware failure in your on-prem data center?
CORTNER: Our on-prem S3 is geographically redundant, with data automatically backed up using S3 protocols between two sites. We basically keep another copy of our data in a different data center.
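The two-site arrangement can be illustrated with a short Python sketch. It is a simplification under stated assumptions: two local directories stand in for the two data centers, and `replicate` and `read_with_failover` are hypothetical helper names. A real S3-compatible system handles replication internally rather than through code like this.

```python
import hashlib
import shutil
from pathlib import Path


def digest(path: Path) -> str:
    """Return the SHA-256 checksum of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def replicate(primary: Path, secondary: Path) -> list:
    """Copy objects that are missing or different at the secondary site.

    Returns the names of the objects that were copied.
    """
    secondary.mkdir(parents=True, exist_ok=True)
    copied = []
    for obj in sorted(primary.iterdir()):
        if not obj.is_file():
            continue
        target = secondary / obj.name
        if not target.exists() or digest(target) != digest(obj):
            shutil.copyfile(obj, target)
            copied.append(obj.name)
    return copied


def read_with_failover(name: str, sites: list) -> bytes:
    """Read an object from the first site that still holds it."""
    for site in sites:
        candidate = Path(site) / name
        if candidate.is_file():
            return candidate.read_bytes()
    raise FileNotFoundError(name)
```

Running `replicate` a second time with no changes copies nothing, and `read_with_failover` still returns the data if the first site in the list is unavailable, which is the property the second copy is meant to provide.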
The drawback with this method is that it’s hardware-intensive, and the reality is that it’s very rare to have a disaster that’s going to wipe out an entire data center. With that in mind, we’re exploring using Amazon’s new service for disaster recovery. It keeps replicas of your virtual machines, and you can spin them up really fast in the event that you need them.
We’ve known for three or four years that cloud for disaster recovery was going to have the potential to be incredibly cost-effective, but up until recently we weren’t really there. Now, I think with some of the new features in AWS, we’re actually at the point where they can start delivering.