It is unclear exactly how much data the federal government generates, but it’s likely in the single-digit petabytes. (A petabyte is equivalent to a million gigabytes.)
In 2013, Informatica estimated that U.S. federal agencies alone “currently store an average of 1.61 petabytes of data, a figure projected to rise to 2.63 petabytes by 2015.” Writing in Nextgov in 2018, Dan Tucker, the vice president at Booz Allen Hamilton: Digital Solutions, and George Young, the vice president of U.S. public sector at Elastic, noted that “the petabytes-on-petabytes of data that agencies generate, collect and retain, is typically scattered across IT silos.”
And writing at DigitalGov, William Brantley, the training administrator for the U.S. Patent and Trademark Office’s Global Intellectual Property Academy, notes that the government “is probably the one of the biggest (if not the biggest) producers of data. Every day, thousands of federal workers collect, create, analyze, and distribute massive amounts of data from weather forecasts to economic indicators to health statistics.”
Increasingly, much of that data, however much of it there is, is being put not into data warehouses but into data lakes, which represent a different kind of data repository for agencies. Data lakes are repositories with flat architectures that can hold data in a wide variety of formats, including unstructured data, allowing users to transform the data into new structures and visualize it when needed.
What Is a Data Lake?
Data lakes are an acknowledgment that the data agencies need to use and track to execute their missions is more than just transactional data that can be stored in databases and in structured file folders in data warehouses, says Dominic Delmolino, CTO at Accenture Federal Services. That data comes in varied, often unstructured formats, including documents, photos, videos and audio files.
“We need to acknowledge reality and find a good place to put all of them, so we can get a good inventory of our data assets, we can tag that data with where it came from, we understand the lineage, the pedigree, who may have looked at, what they may have done with it,” Delmolino says. “And it’s a lot easier if we have an idea of a central place to put all of that data.”
The imagery of the lake comes into play in that users can put fish (data) into the lake, and then get out fish that are important to them, Delmolino explains. It is a large, fluid, natural resource fed by lots of sources, and inside the lake are insights and resources that can be explored and exploited.
Data warehouses are structured. Data lakes, by their nature, are less formal in their organizational structure. “The idea in a data lake is that you can apply structure to that data for a particular purpose when you need to do so,” Delmolino says. In a data warehouse model, users need to apply structure to data before it is placed in the warehouse — the figurative shelf and pallet it needs to sit on. In data lakes, that specific decision is deferred until users need to access the data.
Data lakes store data in its original format from its originating system, according to Delmolino. But if a user wants to use the data and combine it with other systems’ data for a new purpose or for new insights, the user will then decide upon a new structure for the data. Data lakes give users the computing horsepower and space to create different versions of the data.
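The idea of deferring structure until read time can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the in-memory `lake` dictionary, the file paths and the reader functions are hypothetical and do not represent any vendor's product. The raw bytes stay exactly as the source systems produced them, and a structure is applied only when a user asks a specific question.

```python
import csv
import io
import json

# Hypothetical in-memory "lake": each object is kept as raw bytes, exactly as
# the source system produced it, keyed by an illustrative path.
lake = {
    "hr/employees.csv": b"id,name\n1,Ada\n2,Grace\n",
    "badge/events.json": b'[{"emp": 1, "door": "A"}, {"emp": 2, "door": "B"}]',
}


def read_employees(raw: bytes) -> dict[int, str]:
    """Apply a structure to the CSV bytes only when they are read."""
    rows = csv.DictReader(io.StringIO(raw.decode("utf-8")))
    return {int(row["id"]): row["name"] for row in rows}


def read_badge_events(raw: bytes) -> list[dict]:
    """Apply a structure to the JSON bytes only when they are read."""
    return json.loads(raw.decode("utf-8"))


def doors_by_name(lake: dict[str, bytes]) -> dict[str, str]:
    """Combine two sources into a new structure for one specific question."""
    names = read_employees(lake["hr/employees.csv"])
    events = read_badge_events(lake["badge/events.json"])
    return {names[event["emp"]]: event["door"] for event in events}


print(doors_by_name(lake))
```

The stored objects are never rewritten, so a different question later can apply a different structure to the same bytes.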
Cameron Chehreh, COO and CTO of Dell EMC Federal, says data lakes enable agencies to take the data that drives information and insights for them and put the data into “a consolidated and scalable agile repository.”
Agencies have vast missions, Chehreh notes, and for a long time built data silos that were specific to each mission component. “What we’re finding is, to drive new levels of insight and deeper knowledge about these mission segments, it requires more fusion of the data set itself,” he says. “So really, at its core, a data lake is about building a consolidated and scalable agile repository that promotes the ability to build to derive greater insights into your existing mission data sets.”
What Are the Benefits of Data Lakes?
There are numerous benefits of data lakes for agencies, experts say. A major benefit is that data lakes allow agency IT leaders and staff to put data from different systems next to each other so that they can correlate data activity across systems.
For example, two different systems may report different numbers of items in inventory. Agencies then spend a lot of time reconciling those systems. “One of my friends used to say, ‘Let’s get all the liars in a room together and come to the truth,’” Delmolino says. “A data lake is great for that, because when I put data from different systems in the same place, I can run those comparisons in place and determine, across systems, a real, unified version of agency information.”
IT leaders may discover that one system did not record a transaction because it was not germane to its business process, whereas another system did. Data lakes let agencies understand how they tracked data in the past and if they should continue to do so.
Data reconciliation and the ability to get a unified idea of the status and pedigree of data is much easier to achieve in a data lake, Delmolino says, and agencies can do so in a controlled and secure manner.
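As a rough illustration of the inventory comparison described above, the following Python sketch lines up counts from two systems once both sit in the same place. The system names, item names and counts are invented for the example, and a real reconciliation would involve far more than matching totals.

```python
def reconcile(system_a: dict[str, int], system_b: dict[str, int]) -> dict[str, tuple]:
    """Return items whose counts differ, including items missing from one system."""
    mismatches = {}
    for item in sorted(set(system_a) | set(system_b)):
        count_a = system_a.get(item)  # None means the system never recorded the item
        count_b = system_b.get(item)
        if count_a != count_b:
            mismatches[item] = (count_a, count_b)
    return mismatches


# Hypothetical counts reported by two separate agency systems.
warehouse_counts = {"laptop": 120, "monitor": 80, "dock": 40}
finance_counts = {"laptop": 118, "monitor": 80}

print(reconcile(warehouse_counts, finance_counts))
```

An item that appears in only one system shows up with `None` on the other side, which is the case where one system did not record a transaction at all.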
Chehreh notes that another key benefit of data lakes is that they can ingest any type of data. They then create a mechanism for agencies to add metadata around the data so that it can be tagged and easily searched by any user who has secure and proper access to the data lake. “This allows people the opportunity to drive those deeper insights,” he says.
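A simplified way to picture tagging and access-controlled search is a small catalog kept alongside the stored data. The sketch below is an assumption for illustration only: the `CatalogEntry` class, the tags, the roles and the paths are hypothetical, and production data lakes rely on dedicated catalog and identity services.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    path: str                                              # where the object lives
    tags: set[str] = field(default_factory=set)            # searchable metadata
    allowed_roles: set[str] = field(default_factory=set)   # who may see the entry


def search(catalog: list[CatalogEntry], tag: str, role: str) -> list[str]:
    """Return paths carrying the tag, limited to entries the role may see."""
    return [e.path for e in catalog if tag in e.tags and role in e.allowed_roles]


# Hypothetical catalog entries covering structured and unstructured data.
catalog = [
    CatalogEntry("filings/2018.parquet", {"tax", "2018"}, {"auditor", "analyst"}),
    CatalogEntry("payments/2018.csv", {"tax", "payments"}, {"auditor"}),
    CatalogEntry("media/hearing.mp4", {"video"}, {"analyst"}),
]

print(search(catalog, "tax", "analyst"))
print(search(catalog, "tax", "auditor"))
```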
For example, the IRS could use data lakes to tie together databases and get a “better line of sight” on waste, fraud and abuse of taxpayer dollars, Chehreh says. The intelligence community can use data lakes to combine data sources and more easily find a terrorist group or other adversary.
“The value is almost immeasurable for an agency, regardless of its mission,” he says. Data lakes help agencies “recapitalize” on their existing data sets and drive more insights.
Data has also become a product unto itself, Delmolino adds. For example, the Census Bureau, a component of the Commerce Department, produces data packages of statistical information that others can use. Data lakes let agencies gather data elements and then define a new structure for that data for new use cases — since the structure of the data is not pre-defined.
“The data lake is the place where I combine and transform the data for my new purpose, whether it’s making data open and available to the public, whether it’s making data available for a new system or new mission requirement in my agency, or for a data-sharing agreement with another agency,” Delmolino says. “The lake gives me that flexibility and the place to track the lineage. Here is the source data, here is the data I am releasing I am making available, and here is the linkage and traceability from my source system to this new use of the data.”
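The linkage from a released data set back to its source systems can be sketched as a small graph of derived data sets. This is a hypothetical model, not a description of any agency's tooling: the `Dataset` class, the `trace` function and the data set names are all invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    name: str
    sources: tuple["Dataset", ...] = ()  # data sets this one was derived from
    transform: str = "ingested as-is"    # free-text note on what was done


def trace(dataset: Dataset) -> list[str]:
    """Walk back from a data set to the names of its original sources."""
    if not dataset.sources:
        return [dataset.name]
    origins: list[str] = []
    for source in dataset.sources:
        for name in trace(source):
            if name not in origins:
                origins.append(name)
    return origins


# Hypothetical chain: two raw sources, one joined set, one public release.
survey_raw = Dataset("survey_raw")
admin_raw = Dataset("admin_raw")
joined = Dataset("joined", (survey_raw, admin_raw), "joined on region code")
release = Dataset("public_release", (joined,), "aggregated to state level")

print(trace(release))
```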
Data lakes give agencies more control over the data they want to share and visibility on where the data came from, according to Delmolino. By their very definition, data lakes have extremely large storage and processing capabilities, he adds, so agencies no longer have to build so-called “data marts,” smaller repositories of data attributes derived from raw data.
Additionally, data lakes can exist in the cloud and take advantage of the elastic storage and computing horsepower that cloud architectures provide. Further, agency IT leaders can define different levels of storage, and cloud service providers often offer varying tiers of storage. If users do not need to regularly access data, it can be placed into a “colder” part of the lake, in storage that is less expensive but also less responsive. If users need to quickly access data often, it can be placed into a “warmer” part of the lake, in storage that is more expensive but more responsive.
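The warm-versus-cold decision can be reduced to a simple rule for illustration. In this Python sketch the threshold of 10 reads in 30 days, the tier names and the object paths are assumptions chosen for the example; real cloud providers define their own tiers and pricing, and placement policies usually weigh more than read counts.

```python
def choose_tier(reads_last_30_days: int, warm_threshold: int = 10) -> str:
    """Pick a storage tier from recent read activity (threshold is illustrative)."""
    return "warm" if reads_last_30_days >= warm_threshold else "cold"


# Hypothetical objects and their read counts over the past 30 days.
recent_reads = {
    "claims/2019-current.parquet": 240,
    "claims/1987-archive.dat": 0,
    "reports/quarterly.csv": 10,
}

placement = {path: choose_tier(reads) for path, reads in recent_reads.items()}
print(placement)
```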
How Can Data Be Moved into a Data Lake?
As agencies retire legacy systems and migrate data to more modern architectures, they often must consider how to maintain records they are required by law to retain, some of them in legacy systems that are 30 to 40 years old.
They can place that data into data lakes, Delmolino says. “A data lake is a great place to retain that data format from a legacy system even as I retire that system,” he says. Migrating data to a data lake gives agencies an easier way to get access to and assert ownership of that historical data, as well as to transform it into new formats.
When moving data to a data lake, agencies need to create as accurate a picture as possible of their existing data inventories, Delmolino says. Most agencies will not have a complete picture of their data.
Then, agencies must decide which pieces of data they want to move to the data lake. Generally, data lakes can handle data in many different formats, including relational databases and unstructured data. “What technologies are you most comfortable managing, using and have the confidence that you can maintain security control and backup of the data?” Delmolino says. “There is some technical assessment of, ‘I need a durable location that I can manage and secure to store lots of different kinds of data formats.’”
Delmolino recommends that agencies look at cloud options for data lakes, since the cloud gives them more flexibility and working with a cloud provider can help them understand what kinds of security their data needs.
What Tools Can Agencies Use to Set Up Data Lakes?
Most data lakes are based on Hadoop, specifically the Hadoop Distributed File System (HDFS). There are several Big Data vendors that provide hosted instances of HDFS and then augment that with different kinds of data transformation technologies or integration with relational technologies, Delmolino notes.
These include IBM, Microsoft and Oracle, he adds. “In many cases, for our clients, it’s a question of how to work with your existing vendors to augment your physical landscape to support a data lake,” Delmolino says. “And then consider that that’s going to be working in a hybrid fashion with one of the major cloud vendors. And oftentimes they are the same vendor.”
Agencies also need to strongly consider security when putting data into data lakes, Chehreh says. Dell EMC builds data lakes on its Isilon storage, with an elastic cloud offering layered on top. Then, agencies can use whatever analytics tool they want, be it a Software as a Service tool or a SQL tool, and apply that analytic fabric to the new data lake.
Agencies can control access to the data in the data lake through the same security functions and authentication methods they used before, Chehreh says. “You control the access to the data through the same security functions you would use today, and then also have it correlate to the metatags and the metadata that is created around your core data sources, so that you can still protect the sovereignty of the core information you would in today’s world,” he says.