It is unclear exactly how much data the federal government generates, but it’s likely in the single-digit petabytes. (A petabyte is equivalent to a million gigabytes.)
In 2013, Informatica estimated that U.S. federal agencies alone “currently store an average of 1.61 petabytes of data, a figure projected to rise to 2.63 petabytes by 2015.” Writing in Nextgov in 2018, Dan Tucker, the vice president at Booz Allen Hamilton: Digital Solutions, and George Young, the vice president of U.S. public sector at Elastic, noted that “the petabytes-on-petabytes of data that agencies generate, collect and retain, is typically scattered across IT silos.”
And writing at DigitalGov, William Brantley, the training administrator for the U.S. Patent and Trademark Office’s Global Intellectual Property Academy, notes that the government “is probably the one of the biggest (if not the biggest) producers of data. Every day, thousands of federal workers collect, create, analyze, and distribute massive amounts of data from weather forecasts to economic indicators to health statistics.”
Increasingly, much of that data, however much of it there is, is being put not into data warehouses but into data lakes, a different kind of data repository for agencies. Data lakes are repositories with flat architectures that can hold data in a wide variety of formats, including unstructured data, and that let users transform the data into new structures and visualize it when needed.
What Is a Data Lake?
Data lakes are an acknowledgment that the data agencies need to use and track to execute their missions is more than just transactional data that can be stored in databases and in structured file folders in data warehouses, says Dominic Delmolino, CTO at Accenture Federal Services. That data comes in varied, unstructured formats and includes documents, photos, videos and audio files.
“We need to acknowledge reality and find a good place to put all of them, so we can get a good inventory of our data assets, we can tag that data with where it came from, we understand the lineage, the pedigree, who may have looked at, what they may have done with it,” Delmolino says. “And it’s a lot easier if we have an idea of a central place to put all of that data.”
The imagery of the lake comes into play in that users can put fish (data) into the lake, and then get out fish that are important to them, Delmolino explains. It is a large, fluid, natural resource fed by lots of sources, and inside the lake are insights and resources that can be explored and exploited.
Data warehouses are structured. Data lakes, by their nature, are less formal in their organizational structure. “The idea in a data lake is that you can apply structure to that data for a particular purpose when you need to do so,” Delmolino says. In a data warehouse model, users need to apply structure to data before it is placed in the warehouse — the figurative shelf and pallet it needs to sit on. In data lakes, that specific decision is deferred until users need to access the data.
Data lakes store data in its original format from its originating system, according to Delmolino. But if a user wants to use the data and combine it with other systems’ data for a new purpose or for new insights, the user will then decide upon a new structure for the data. Data lakes give users the computing horsepower and space to create different versions of the data.
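The idea of deferring structure until the data is needed can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular agency or product works: the file names, field names and the `read_as_inventory` helper are all hypothetical. The raw records stay in their original formats, and a common structure is applied only when a user reads them.

```python
import csv
import io
import json

# Hypothetical raw records, kept exactly as the originating systems produced them.
raw_lake = {
    "inventory_system_a.csv": "item_id,count\nA100,12\nB200,7\n",
    "inventory_system_b.json": '[{"sku": "A100", "qty": 12}, {"sku": "B200", "qty": 9}]',
}


def read_as_inventory(name, raw):
    """Apply a common (item_id, count) structure at read time only."""
    if name.endswith(".csv"):
        rows = csv.DictReader(io.StringIO(raw))
        return [{"item_id": r["item_id"], "count": int(r["count"])} for r in rows]
    if name.endswith(".json"):
        return [{"item_id": r["sku"], "count": int(r["qty"])} for r in json.loads(raw)]
    raise ValueError(f"no reader defined for {name}")


# The structured view is a new version of the data; the raw copies are untouched.
structured = {name: read_as_inventory(name, raw) for name, raw in raw_lake.items()}
```

A different question could apply a different structure to the same raw files, which is the flexibility a warehouse gives up by fixing the structure before loading.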
Cameron Chehreh, COO and CTO of Dell EMC Federal, says data lakes enable agencies to take the data that drives information and insights for them and put the data into “a consolidated and scalable agile repository.”
Agencies have vast missions, Chehreh notes, and for a long time built data silos that were specific to each mission component. “What we’re finding is, to drive new levels of insight and deeper knowledge about these mission segments, it requires more fusion of the data set itself,” he says. “So really, at its core, a data lake is about building a consolidated and scalable agile repository that promotes the ability to build to derive greater insights into your existing mission data sets.”
What Are the Benefits of Data Lakes?
There are numerous benefits of data lakes for agencies, experts say. A major benefit is that data lakes allow agency IT leaders and staff to put data from different systems next to each other so that they can correlate data activity across systems.
For example, two different systems may report different numbers of items in inventory. Agencies then spend a lot of time reconciling those systems. “One of my friends used to say, ‘Let’s get all the liars in a room together and come to the truth,’” Delmolino says. “A data lake is great for that, because when I put data from different systems in the same place, I can run those comparisons in place and determine, across systems, a real, unified version of agency information.”
IT leaders may discover that one system did not record a transaction because it was not germane to its business process, whereas another system did. Data lakes let agencies understand how they tracked data in the past and if they should continue to do so.
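As a rough illustration of that kind of side-by-side comparison, the Python sketch below reconciles two transaction logs once they sit in the same place. The transaction IDs, quantities and the `reconcile` function are hypothetical and exist only for this example; they are not drawn from any agency system.

```python
# Hypothetical transaction logs (transaction ID -> inventory change) from two systems.
system_a = {"T1": 5, "T2": -2, "T3": 4}
system_b = {"T1": 5, "T3": 4}


def reconcile(a, b):
    """Report transactions missing from either system and values that disagree."""
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    mismatched = sorted(k for k in set(a) & set(b) if a[k] != b[k])
    return {"only_in_a": only_a, "only_in_b": only_b, "mismatched": mismatched}


report = reconcile(system_a, system_b)

# The inventory totals differ because system B never recorded transaction T2.
totals = (sum(system_a.values()), sum(system_b.values()))
```

Here the two systems disagree on the total not because either recorded a wrong value, but because one of them skipped a transaction, which is the kind of finding the comparison is meant to surface.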
Data reconciliation and the ability to get a unified idea of the status and pedigree of data are much easier to achieve in a data lake, Delmolino says, and agencies can do so in a controlled and secure manner.
Chehreh notes that another key benefit of data lakes is that they can ingest any type of data. They then create a mechanism for agencies to add metadata around the data so that it can be tagged and easily searched by any user who has secure and proper access to the data lake. “This allows people the opportunity to drive those deeper insights,” he says.
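One way to picture tagging and access-controlled search is a small metadata catalog that sits alongside the stored files. The Python sketch below is an assumption-laden illustration, not a description of any vendor's product: the `CatalogEntry` fields, the paths, the tags and the role names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Hypothetical metadata record describing one item in the lake."""
    path: str
    source: str
    tags: set = field(default_factory=set)
    allowed_roles: set = field(default_factory=set)


catalog = [
    CatalogEntry("raw/claims_2018.csv", "claims_system", {"claims", "finance"}, {"auditor"}),
    CatalogEntry("raw/site_photos/", "field_app", {"imagery"}, {"analyst", "auditor"}),
    CatalogEntry("raw/payments.json", "payments_system", {"finance"}, {"auditor"}),
]


def search(entries, tag, role):
    """Return paths carrying the tag, limited to what the role may access."""
    return [e.path for e in entries if tag in e.tags and role in e.allowed_roles]


auditor_hits = search(catalog, "finance", "auditor")
analyst_hits = search(catalog, "finance", "analyst")
```

The same search returns different results for different roles, so the tags make data findable without widening who is allowed to see it.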
For example, the IRS could use data lakes to tie together databases and get a “better line of sight” on waste, fraud and abuse of taxpayer dollars, Chehreh says. The intelligence community can use data lakes to combine data sources and more easily find a terrorist group or other adversary.