When the National Cancer Institute stores its genome data, naming a file after the type of cancer it contains isn’t enough. The study of cancer is so precise these days that scientists need other identifying characteristics to better target a possible cure, says Jeff Shilling, the agency’s acting CIO.
“It’s got to go past, ‘Where did you get it from?’ and ‘What is it?’” he says.
NCI adheres to the notion that data is only as valuable as its metadata. As TechTarget notes, metadata “summarizes basic information about data, which can make finding and working with particular instances of data easier.” The more granular the metadata, the more information agencies can capture about their data, and the easier that data becomes to catalogue and analyze.
In the past, a file was associated with little more than its name, the date it was created and the date it was last edited. None of that information helps identify the relative value of the data inside.
This is why analysts say modern metadata management will become critical as agencies look to glean more insight and value from their data. Artificial intelligence and machine learning are at the core of this trend.
Using metadata, agencies can set archive and storage policies more easily and create more consistency, so data that was once unusable can be accessed, analyzed and shared.
“Metadata that’s captured can then be used to identify files and to establish policy around them,” explains Steven Hill, senior analyst for applied infrastructure and storage technologies at 451 Research, an IT research and advisory firm. “And the cool thing is that it’s virtually unlimited in terms of scalability.”
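The policy-based management Hill describes can be pictured with a short sketch. This is purely illustrative, not any vendor's actual schema: the tag names (`classification`, `last_accessed_days`) and the hot/warm/archive tiers are assumptions chosen to show how metadata tags, rather than filenames, can drive storage decisions.

```python
# Hypothetical sketch: letting object metadata, not the filename,
# decide where data lives. Tag names and tier rules are illustrative.

def pick_storage_tier(metadata: dict) -> str:
    """Return a storage tier based on an object's metadata tags."""
    if metadata.get("classification") == "active-study":
        return "hot"       # data in active use stays on fast storage
    if metadata.get("last_accessed_days", 0) > 365:
        return "archive"   # cold data moves to cheaper storage
    return "warm"

objects = [
    {"name": "genome_001.bam", "classification": "active-study"},
    {"name": "genome_777.bam", "last_accessed_days": 900},
]

for obj in objects:
    print(obj["name"], "->", pick_storage_tier(obj))
```

Because the rule reads only tags, the same policy scales to any number of objects, which is the "virtually unlimited" quality Hill points to.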
The more information an agency has about its data, Hill tells FedTech, the more flexibility it has in handling and automating it.
“This is really about the re-emergence of object storage as the ideal framework for policy-based management because of its metadata capabilities, as well as its massive scalability,” he says.
The Role Metadata Plays in Data Lakes
Metadata is a key element that makes data lakes so valuable. Data lakes are repositories with flat architectures that can hold data from a wide variety of data formats, including unstructured data, allowing users to transform and visualize the data into new structures when needed.
Cameron Chehreh, CTO and vice president of pre-sales engineering at Dell EMC Federal, has told FedTech that data lakes enable agencies to take the data that drives their information and insights and put it into “a consolidated and scalable agile repository.”
Chehreh notes that another key benefit of data lakes is that they can ingest any type of data. They then create a mechanism for agencies to add metadata around the data so that it can be tagged and easily searched by any user who has secure and proper access to the data lake. “This allows people the opportunity to drive those deeper insights,” he says.
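The tag-and-search mechanism Chehreh describes can be sketched in a few lines. The object keys and tag fields (`modality`, `organ`) below are invented for illustration; the point is that one query spans objects of any format because the search runs over metadata, not file contents.

```python
# Hypothetical data lake: objects of mixed formats, each carrying
# metadata tags. Field names are illustrative assumptions.
lake = [
    {"key": "scans/ct_0001.dcm",   "tags": {"modality": "CT",   "organ": "lung"}},
    {"key": "scans/mr_0002.dcm",   "tags": {"modality": "MRI",  "organ": "brain"}},
    {"key": "notes/visit_0003.txt", "tags": {"modality": "text", "organ": "lung"}},
]

def search(lake, **wanted):
    """Return keys of objects whose tags match every requested key/value."""
    return [o["key"] for o in lake
            if all(o["tags"].get(k) == v for k, v in wanted.items())]

# One tag query finds both lung-related objects, an image and a text note.
print(search(lake, organ="lung"))
```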
Agencies also need to strongly consider security when putting data into data lakes, Chehreh says. However, agencies can control access to the data in the data lake through the same security functions and authentication methods they used before, he says. “You control the access to the data through the same security functions you would use today, and then also have it correlate to the metatags and the metadata that is created around your core data sources, so that you can still protect the sovereignty of the core information you would protect in today’s world,” he says.
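Chehreh's point about correlating existing security controls with metatags can be sketched as follows. The roles, clearance sets and the `sensitivity` tag are assumptions for illustration: authentication decides who the user is, and the metadata on each object decides what that user may see.

```python
# Hypothetical sketch: existing role-based clearances correlated with a
# "sensitivity" metadata tag on each object. Names are illustrative.
CLEARANCE = {
    "analyst":      {"public", "internal"},
    "investigator": {"public", "internal", "restricted"},
}

def visible_objects(role: str, lake: list) -> list:
    """Return keys of objects whose sensitivity tag the role may access."""
    allowed = CLEARANCE.get(role, set())
    return [o["key"] for o in lake if o["tags"].get("sensitivity") in allowed]

lake = [
    {"key": "summary.csv", "tags": {"sensitivity": "public"}},
    {"key": "genomes.bam", "tags": {"sensitivity": "restricted"}},
]

# An analyst sees only the public summary; the restricted file stays hidden.
print(visible_objects("analyst", lake))
```

Because the check reads the same metatags used for search and policy, access control rides along with the data rather than being reimplemented per repository.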