Federal Agencies Dive into Big Data
Federal agencies are making big investments in Big Data in an effort to speed scientific discoveries and to address some of the nation’s most pressing issues, from improving education and healthcare to strengthening national defense and cybersecurity.
Society is drowning in huge volumes of heterogeneous data from research studies; large-scale simulations; scientific instruments such as telescopes; sensors on bridges and other critical infrastructure; and even email, videos and social media, says Farnam Jahanian, assistant director of the National Science Foundation’s (NSF) Computer and Information Science and Engineering (CISE) Directorate.
The goal is to analyze the data to find correlations and discover trends, patterns and other important information, he says. “The bottom line is that there are enormous opportunities to harness these large-scale, diverse data sets and extract knowledge from them, and to provide new approaches to drive discovery and decision-making and make accurate predictions.”
In the future, Big Data analytics could enable biomedical discoveries and help doctors predict the onset of diseases, allow educators to analyze student behavior and performance to improve learning, and lead scientists to more accurately predict natural disasters, he says.
“We are just barely scratching the surface of what is possible,” says Suzi Iacono, deputy assistant director for NSF’s CISE Directorate and co-chair of the Big Data Senior Steering Group, which coordinates activities on Big Data across the government. “If we can make giant strides, we know we can benefit our society tremendously.”
To kick-start the effort, the Obama administration in March 2012 launched the Big Data Research and Development Initiative, with six federal departments and agencies announcing more than $200 million in new projects. Many federal agencies are expected to expand their Big Data efforts in 2014.
The Big Data market is still emerging, but the federal government is an early leader as its investments begin to yield innovation, says Svetlana Sicular, a Gartner research director who covers data management strategies. “We’re still early in the cycle for adoption, but the federal government is pretty much ahead of many industries.”
$250 million
The amount the Defense Department is investing annually in Big Data research projects
SOURCE: Office of Science and Technology Policy, Executive Office of the President
NSF Plays a Key Role
Technology vendors are building the server, storage and data analytics software tools to help agencies tackle their Big Data problems. At the same time, agencies must invest in research and development to create new technology tools and techniques to meet Big Data challenges across many sectors, Iacono says.
The cornerstone of the White House’s Big Data announcement last March is a joint effort by the NSF and the National Institutes of Health (NIH) to fund research, Jahanian says. The two agencies awarded nearly $15 million to eight teams of university researchers last fall and will announce a second set of awards in early 2013.
The agencies want researchers in different disciplines to work together to develop new algorithms, statistical methods and technology tools to improve data collection and management, bolster data analytics and help create collaborative environments, Iacono says.
“We are trying to incentivize statisticians, mathematicians and computer scientists to work together and come up with new algorithms that can scale to hundreds of millions of data points,” she says.
Because the nation faces a shortage of workers with analytics expertise, the NSF is also funding training programs for college students.
The NSF is collaborating with NASA and the Department of Energy’s Office of Science on a series of competitions that give researchers, students and citizens opportunities to propose new ideas for Big Data projects in health, energy and earth science and to produce software code and technology tools. Winners will receive monetary prizes.
“We want everyone to have the chance to come up with new ideas and new solutions,” Iacono says.
Treating and Curing Cancer
Photo: Joshua Roberts
“In the future, clinicians will be able to query databases to provide targeted and personalized treatments for their patients.”
—Jack R. Collins, director, Advanced Biomedical Computing Center
The goal of Big Data analytics is to combine structured and unstructured heterogeneous data so that it becomes more homogeneous and more useful, Iacono says.
For example, the Frederick National Laboratory for Cancer Research in Maryland, which is funded by the National Cancer Institute, has developed a prototype that proves it’s possible to analyze vast amounts of structured and unstructured biological data quickly, a capability that could speed advances in cancer research and treatment.
In one case, the infrastructure cross-referenced 20 million medical abstracts (unstructured data) with simulated gene expression data from 60 million patients and simulated microRNA gene expression data from 900 million patients (structured data), says Robert Stephens, director of the bioinformatics support group at Frederick National Lab’s Advanced Biomedical Computing Center.
In another test scenario, researchers ran a query that paired 30,000 genes with 150 disease terms against the 20 million medical abstracts. The query, covering 4.5 million term combinations, took only 40 seconds to run, Stephens says.
This means the research community can analyze large amounts of data to better understand how genes behave in certain situations and the effect certain drugs might have on them, perhaps discovering, Stephens says, that a specific drug doesn’t work well in patients with a particular gene mutation.
In the future, clinicians will be able to query databases to provide targeted and personalized treatments for their patients, says Jack R. Collins, director of the Advanced Biomedical Computing Center.
“This will make a big difference in diagnosis and treatment. Clinicians are faced with a huge amount of information — clinical sources, drug sources, testing and trial sources, and research sources. No one can keep up with all that,” Collins says. “When the clinician has 30 to 45 minutes before the next patient rolls in, this can provide an efficient, automated way of narrowing all the information down to a human-decipherable format.”
The infrastructure is built using an Oracle Big Data Appliance and an implementation of Apache Hadoop, which allows for large-scale data analysis, says Uma Mudunuri, manager of the computing center’s core infrastructure group.
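The article does not describe the lab’s actual code, but a term co-occurrence query like the one Stephens describes maps naturally onto Hadoop’s map-and-reduce model. The following is a minimal Hadoop Streaming-style sketch in Python; the gene and disease term lists, file names and command line are illustrative assumptions, not the lab’s pipeline.

```python
# Illustrative Hadoop Streaming-style sketch (not the lab's actual pipeline).
# Mapper: for each abstract, emit a count of 1 for every gene/disease pair
# that co-occurs in the text. Reducer: sum the counts per pair.
import sys
from collections import defaultdict

GENE_TERMS = {"BRCA1", "TP53", "KRAS"}      # hypothetical gene symbols
DISEASE_TERMS = {"melanoma", "leukemia"}    # hypothetical disease terms

def mapper(lines):
    """Emit ('GENE|disease', 1) for every pair that co-occurs in an abstract."""
    for abstract in lines:
        words = set(abstract.strip().split())
        genes = GENE_TERMS & words
        diseases = DISEASE_TERMS & {w.lower() for w in words}
        for gene in genes:
            for disease in diseases:
                print(f"{gene}|{disease}\t1")

def reducer(lines):
    """Sum the counts for each pair (Hadoop sorts by key between phases)."""
    totals = defaultdict(int)
    for line in lines:
        key, count = line.rstrip("\n").rsplit("\t", 1)
        totals[key] += int(count)
    for key, total in sorted(totals.items()):
        print(f"{key}\t{total}")

if __name__ == "__main__":
    # Local test: cat abstracts.txt | python cooccur.py map | sort | python cooccur.py reduce
    phase = sys.argv[1] if len(sys.argv) > 1 else "map"
    (mapper if phase == "map" else reducer)(sys.stdin)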
Overall, the proof of concept the lab created can be used for all types of biomedical research, Mudunuri says. “The major setback with many projects in the past was that performance was slow. We were not able to scale. Now, they can test hypotheses in a timely manner.”
Enabling Fast Research
Gartner’s Sicular says one trend in Big Data is the cross-pollination of industries — applying the experience and best practices of one industry to another to meet Big Data needs. For example, e-commerce websites provide good user interfaces to help shoppers quickly find what they need, she says. The government can now apply those strategies to its projects.
The National Archive of Computerized Data on Aging (NACDA), for example, houses the nation’s largest library of electronic data on aging, providing a central online repository for researchers to retrieve data for their projects.
The archive, funded by the National Institute on Aging (NIA), houses more than 100,000 files, including 50 years of mortality data, 1,600 studies from researchers and 40,000 publications. Researchers can search by topic, such as “depression,” or by year, and simultaneously examine multiple sets of raw data or previous reports.
“It works like Amazon. You find the product you are interested in, put it in a cart and download it. The only difference is you don’t pay,” says NACDA Director James McNally.
The NACDA staff focuses on creating metadata, which is a description of the data, to make the files easily searchable for its 22,000 registered users.
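NACDA’s actual schema is not described here, but the role that metadata plays in this kind of topic-and-year search can be shown with a small, purely hypothetical sketch; the record fields, study names and catalog below are invented for illustration.

```python
# Hypothetical sketch of metadata-driven search; not NACDA's actual schema.
from dataclasses import dataclass

@dataclass
class StudyRecord:
    study_id: str
    title: str
    year: int
    keywords: list

CATALOG = [
    StudyRecord("S-001", "Hypothetical Aging Cohort Study", 1994, ["depression", "mobility"]),
    StudyRecord("S-002", "Hypothetical Retirement Survey", 2006, ["income", "depression"]),
]

def search(topic, year=None):
    """Return studies whose metadata keywords mention the topic, optionally filtered by year."""
    topic = topic.lower()
    return [record for record in CATALOG
            if topic in (kw.lower() for kw in record.keywords)
            and (year is None or record.year == year)]

print(search("depression"))        # matches both records
print(search("depression", 2006))  # matches only the 2006 survey
```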
As technology has improved over the decades, NACDA, a 30-year-old organization, has evolved with it, moving from paper records to PDF documents to making the data accessible over the Internet in a variety of formats, McNally says.
“The pace of change, and the ability to dig into the data, is just phenomenal,” he says. “What makes the Big Data movement practical is that as technology evolves, we increasingly have better tools to manage this information.”
The data is housed within the University of Michigan’s Inter-university Consortium for Political and Social Research (ICPSR), a repository of social science data that also hosts other federally supported collections, such as the National Archive of Criminal Justice Data, the National Addiction & HIV Data Archive Program and the Substance Abuse and Mental Health Data Archive.
Photo: Joshua Roberts
The Frederick National Laboratory for Cancer Research’s Big Data project will speed analysis of massive data sets, say Uma Mudunuri, Robert Stephens and Jack R. Collins.
ICPSR stores data on an EMC Celerra NS-120 network-attached storage (NAS) array. Researchers download the data quickly over the high-speed research network of Internet2, an advanced networking consortium led by the research and education community, says Bryan Beecher, ICPSR’s director of computing and network services.
Copies of the data are replicated to an offsite cloud provider twice daily. This enables access to the data even when ICPSR suffers a power outage or when the IT department performs system maintenance, Beecher says.
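The article does not say which tools ICPSR uses for its offsite copies, so the following is only a generic sketch of a twice-daily mirror job; the paths, host and rsync-based approach are assumptions for illustration.

```python
# Generic sketch of a twice-daily offsite mirror; paths and host are hypothetical.
import subprocess

SOURCE = "/data/archive/"                                     # hypothetical local NAS path
DESTINATION = "backup@offsite.example.org:/replica/archive/"  # hypothetical offsite target

def replicate():
    """Mirror the archive to the offsite copy, removing files deleted locally."""
    subprocess.run(["rsync", "-az", "--delete", SOURCE, DESTINATION], check=True)

if __name__ == "__main__":
    # Scheduled twice daily, for example from cron: 0 2,14 * * * python replicate.py
    replicate()
```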
Big Savings
Elsewhere, the Treasury Department’s Bureau of Engraving and Printing (BEP) hopes to use Big Data to improve its manufacturing process and save money.
The agency just spent more than two years retiring its 25-year-old mainframe system and moving to Oracle’s enterprise resource planning software, E-Business Suite. In doing so, the agency, which prints $7 billion to $9 billion in new currency each year, has unified all of its business data and the raw manufacturing data from its equipment.
BEP has manufacturing plants in Washington, D.C., and Fort Worth, Texas, and all the data from the equipment in those plants previously resided in several dozen disparate repositories. With the data consolidated in one place, CIO Peter Johnson now wants to use analytics tools to find relationships in the data that can help the agency improve manufacturing quality and increase efficiency.
“We want to sift through all the manufacturing data so we can detect quality problems that we did not see before,” he says. “For example, maybe we can correlate that if we run a piece of equipment ‘x’ number of cycles and if it hits a certain number of cycles, then we start having quality problems two or three weeks later.”
BEP collects tens of thousands of data elements from its manufacturing equipment, from the temperature of a piece of machinery and cycle processing times to error code information and color consistency. The goal is to analyze the data in real time, spotting trends and heading off potential problems before they occur.
“If something happens in one part of the manufacturing environment, and we are able to correlate it with something else that happens in another part of manufacturing, we could prevent a problem from happening in the future,” Johnson says.
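BEP has not published its analysis code, but the kind of lagged correlation Johnson describes (does heavy use of a press this week predict defects two or three weeks later?) can be sketched in a few lines of pandas; the column names and the toy weekly figures below are hypothetical.

```python
# Hedged sketch of a lagged correlation check; the data and columns are hypothetical.
import pandas as pd

# Hypothetical weekly roll-up for one piece of equipment: cycles run that week
# and quality defects observed that week.
df = pd.DataFrame({
    "week":    pd.date_range("2013-01-06", periods=8, freq="W"),
    "cycles":  [900, 950, 1400, 1380, 960, 940, 1420, 930],
    "defects": [3, 2, 4, 3, 9, 8, 3, 10],
})

# Shift the cycle counts forward by two and three weeks and correlate with defects:
# a strong positive value would hint that heavy use precedes quality problems.
for lag in (2, 3):
    corr = df["cycles"].shift(lag).corr(df["defects"])
    print(f"cycles vs. defects {lag} weeks later: correlation {corr:.2f}")
```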
All the agency’s data — totaling millions of records — is housed in an Oracle 11g database. Agency officials currently use this data warehouse to run hundreds of reports for day-to-day operations. The next step, Johnson says, is to do “big-time analytics.” The agency is currently testing data analysis tools from several vendors and hopes to standardize on one in the spring.
Overall, the federal government’s investments will pay huge dividends now and in the future, says NSF’s Jahanian. “We are laying a foundation to improve our competitiveness for decades.”