Getting a Handle on Big Data
From his vantage point as the leader of a 17-agency effort to put “big data” to work at a national level, George Strawn already sees the impact of what many believe is the next big thing in government IT.
“Big data technologies are allowing us to expand our horizons,” says Strawn, director of the federal Networking and IT Research and Development (NITRD) Program. “They’re introducing new capabilities that generate new opportunities. And, agencies are beginning to share science in ways they were unable to before.”
To build on this trend and leverage big data more broadly, agencies across the federal government must determine which processes will benefit from big data and deploy the appropriate technologies to address their needs.
One agency putting big data to work is the National Oceanic and Atmospheric Administration. In addition to hundreds of petabytes of weather and climate information, NOAA harnesses unstructured data.
One example is deep-ocean high-definition video. On its exploratory missions, NOAA’s vessel Okeanos Explorer gathers and stores hundreds of minutes of video at a rate of approximately 1.3 gigabytes per minute. Simultaneously, a scientist narrates what’s being seen into an audio file as both the video and audio feeds stream live over the web.
All captured video is stored on the ship’s 168-terabyte storage area network. Two large shipboard servers generate video stills from the raw footage, and a desktop computer connected to the SAN is used to create highlight videos.
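For a sense of scale, a quick back-of-the-envelope calculation (sketched in Python below) shows how fast footage at that rate fills the shipboard SAN. The 1.3-gigabyte-per-minute rate and 168-terabyte capacity come from NOAA’s figures above; the assumption that the entire SAN is devoted to raw video is ours.

    # Rough capacity estimate for the Okeanos Explorer's shipboard SAN.
    # Figures from the article: ~1.3 GB of video per minute, 168 TB SAN.
    # Assumption (ours): the full SAN holds raw video, with no overhead.
    GB_PER_MINUTE = 1.3
    SAN_CAPACITY_TB = 168

    san_capacity_gb = SAN_CAPACITY_TB * 1000  # decimal terabytes
    minutes_of_video = san_capacity_gb / GB_PER_MINUTE
    hours_of_video = minutes_of_video / 60

    print(f"SAN holds ~{minutes_of_video:,.0f} minutes "
          f"(~{hours_of_video:,.0f} hours) of raw video")
    # -> about 129,000 minutes, or roughly 2,150 hours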
During the process, “all the metadata is being collected, the footage is being tagged and the feeds are being streamed in real time,” says David Michaud, director of NOAA’s High Performance Computing and Communications.
However, NOAA’s big data challenges aren’t limited to capacity and structure issues. “Data management, analytic tools and real-time data movement are additional immediate challenges,” Michaud says.
The General Services Administration (GSA) and its hosted search site USASearch provide big data services to federal, state and local agencies at no cost.
USASearch has moved from using traditional commercial database products to open-source solutions to spur innovation and reduce costs.
According to USASearch Program Manager Ammie Farraj Feijoo, the resulting improvements in features and functionality tripled the adoption of USASearch, to more than 1,000 government websites in just two years.
“We’re processing around half a billion data points each day,” she says. “One hundred percent of it is available for analytics within 15 minutes.”
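Taken at face value, those figures imply a substantial sustained ingest rate. The short Python sketch below works it out; the even-load assumption is ours, and real search traffic is bursty, so peaks run higher.

    # Average ingest rate implied by USASearch's figures:
    # ~500 million data points per day (article figure).
    # Assumption (ours): load is spread evenly across the day.
    data_points_per_day = 500_000_000
    seconds_per_day = 24 * 60 * 60

    avg_rate = data_points_per_day / seconds_per_day
    print(f"~{avg_rate:,.0f} data points per second on average")
    # -> ~5,787 per second, before accounting for traffic spikes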
As other organizations develop big data strategies, the pioneers at NITRD, GSA and NOAA offer a variety of best practice suggestions derived from lessons they’ve learned.
According to Adelaide O’Brien, research director for IDC Government Insights, such advice comes none too soon. “Government spending for big data solutions will begin to expand rapidly,” she says. “Due to a combination of open-source software and decreasing hardware prices, systems previously affordable only to the largest agencies are becoming available to all.”
Adopt an Experimental Mindset
“No single hardware or software manufacturer has all the answers at this time,” says Strawn. “We’re all embarking on a very fine journey that will take a decade or two.”
Nearly 8 zettabytes: the estimated size of the “digital universe” in 2015
SOURCE: “Business Strategy: Business Analytics and Big Data — Driving Government Business” (IDC Government Insights, June 2012)
Loren Siebert, senior architect for USASearch, concurs. “My advice is to break down problems into a few discrete use cases,” he says. “Then work on ferreting out the technologies that are designed for that use case and completing a proof of concept to demonstrate to yourself and others that the technology can address the use case. Next, put something simple into production. Lather, rinse and repeat.”
Evaluate Emerging Technologies
Agencies are exploring a number of enabling technologies. NOAA is using GridFTP for high-performance, secure, reliable data transfer. It’s also experimenting with the integrated Rule-Oriented Data System (iRODS), developed with federal sponsorship, to “provide an abstraction layer to manage data across multiple data centers and systems,” Michaud says.
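NOAA hasn’t published its transfer scripts, but GridFTP moves are commonly driven through the standard globus-url-copy client. The minimal sketch below (hostnames and paths are hypothetical) wraps a parallel-stream transfer in Python; the -p flag sets the number of parallel TCP streams, which is where GridFTP gets much of its speed on high-latency links.

    # Minimal sketch of a parallel GridFTP transfer via the standard
    # globus-url-copy client. Hostnames and paths are hypothetical.
    import subprocess

    src = "file:///data/okeanos/dive042.mov"
    dst = "gsiftp://archive.example.noaa.gov/video/dive042.mov"

    subprocess.run(
        ["globus-url-copy", "-p", "8", src, dst],  # 8 parallel streams
        check=True,  # raise an error if the transfer fails
    )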
Gather Data Lifecycle Requirements Rigorously
Without policies and clearly identified data lifecycle requirements, agencies can quickly pile up petabytes of storage, warns Michaud. On the other hand, discarding raw or computed data too early can force costly recomputation or even have considerable legal and scientific ramifications.
“Know your organization’s data lifecycle and aging policies for raw data and computed data,” he says. “Ask whether a model run will be more cost-effective to store intact or rerun later. And, for rerunning computations, make sure you consider whether the software will still be available to regenerate it.”
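Michaud’s store-or-rerun question lends itself to a simple cost comparison, sketched in Python below. Every number is a hypothetical placeholder rather than an agency figure, and the sketch omits his caveat that the original software stack must still exist before rerunning is even an option.

    # Back-of-the-envelope store-vs-rerun comparison. All numbers are
    # hypothetical placeholders, not agency figures.
    output_tb = 50                    # size of the model output
    storage_cost_per_tb_year = 25.0   # $/TB/year of archive storage
    retention_years = 5

    rerun_compute_hours = 20_000      # core-hours to regenerate the run
    cost_per_compute_hour = 0.05      # $/core-hour
    expected_reruns = 2               # how often we expect to need it

    store_cost = output_tb * storage_cost_per_tb_year * retention_years
    rerun_cost = rerun_compute_hours * cost_per_compute_hour * expected_reruns

    print(f"store: ${store_cost:,.0f}  vs  rerun: ${rerun_cost:,.0f}")
    # -> store: $6,250  vs  rerun: $2,000 (under these assumptions)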
Prepare for Each New Input
Whether it’s a scientific instrument or a roadside traffic sensor, big data is driving nontraditional data-gathering equipment into IT, say Strawn and others.
When deploying new input devices, Michaud emphasizes considering the whole value chain. “Gathering more data just for the sake of it doesn’t net you any societal benefit,” he says. “Ensure infrastructure is already in place before new data streams start transferring massive amounts of information.”
Monitor, Monitor, Monitor
The inherent complexity of distributed computing setups makes it imperative to monitor the health of every component of a big data environment. “The most common pitfall I’ve seen is getting a big data environment running and then not knowing what to do when it dies,” says Siebert. “Monitor both the hardware and the software.”
One way USASearch accomplishes monitoring is by folding new data into historical data. This not only enhances searchers’ experiences on the front end, but also ensures sustainability on the back end. “For example, if search latency goes up beyond a certain metric, then we know we might need more CPU, memory or disk storage,” Siebert says.
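USASearch hasn’t released its monitoring code, but the kind of latency check Siebert describes reduces to a threshold test. In the Python sketch below, the 500-millisecond threshold and the measure_search_latency_ms() probe are hypothetical stand-ins for a real monitoring pipeline.

    # Minimal sketch of a latency check like the one Siebert describes.
    # The threshold and probe function are hypothetical stand-ins.
    import random

    LATENCY_THRESHOLD_MS = 500  # hypothetical service-level target

    def measure_search_latency_ms() -> float:
        """Stand-in for a probe that times a live search query."""
        return random.uniform(100, 800)

    latency = measure_search_latency_ms()
    if latency > LATENCY_THRESHOLD_MS:
        # In production this might page an operator; per Siebert, it can
        # signal a need for more CPU, memory or disk.
        print(f"ALERT: search latency {latency:.0f} ms exceeds threshold")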
Understand the Trade-offs
Finally, regardless of the implementation, every application of data has its own set of risks and costs, notes Michaud. “If you don’t understand the trade-offs, you can sacrifice data security for performance, or vice versa. So, to be successful, remember there is no one-size-fits-all.”