Oak Ridge National Lab’s Piranha system gives scientists a deeper look at their data, says Thomas E. Potok, Group Leader of the lab’s Computational Data Analytics Group.

Apr 30 2013

Big Data Can Save Lives But Is There Room in the Budget?

Agencies are compiling and analyzing more and more data, helping to address a variety of touchy questions.

Each year, about 15,000 people die due to complications from aortic aneurysms. The condition often has no symptoms; the aneurysms simply rupture, causing internal bleeding and leading to death, often within seconds. But what if a sign could alert doctors to aortic aneurysms before they become fatal?

The Oak Ridge National Laboratory (ORNL) Computational Data Analytics Group set out to answer that question with help from Piranha, the lab’s data analytics system. The lab’s analysts searched for keywords within radiology reports of patients with aortic aneurysms and found a correlation with diverticulitis, a condition that affects the intestines. They ran the finding past doctors, who initially dismissed the theory but eventually came around, admitting that there might be something there.

“That was very rewarding,” CDA Group Leader Thomas E. Potok says of the doctors’ validation. The team currently has proposals out to further study the link between the two medical conditions.

ORNL’s aortic aneurysm study is just one of a growing number of potentially lifesaving projects being driven by Big Data. It’s been one year since the White House Office of Science and Technology Policy launched its Big Data Research and Development Initiative, aimed at developing new technologies to collect, share and analyze massive quantities of data across government. The White House furthered that goal in February when it called on all agencies to make federally funded research available publicly. Agencies have responded with a steady stream of projects that are making sense of the world’s massive — and growing — stores of information.

“The volume of the data is getting larger and larger, and I don’t think there’s any end in sight,” says Potok. People want to study as much information as possible before taking action. “But especially with huge volumes of data, it would be nearly impossible to pull information out and make sense of it. You start getting into analysis paralysis.”

The challenge, says Doug Laney, Gartner vice president for business analytics, is for agencies to create systems and processes to harness available data in a tough budget situation. He suggests looking at innovative ideas from others. In addition to ORNL, several other federal agencies, such as the U.S. Geological Survey, are taking innovative approaches to their big data initiatives.

“The imperative is there to improve the safety, services and opportunities for citizens,” says Laney.

What’s the Word?

Piranha got its start long before the White House’s Big Data initiative. Back in 2000, ORNL was working with the Defense Department’s Pacific Command to analyze news stories. They started by having the system look through 30 to 50 articles and bring back key points, then built to 10,000 articles and eventually set a target of as many as a million articles a day. Pacific Command wanted the system to cluster data based on keywords and create visual representations of the results.

Since then, ORNL has used Piranha to search resumes and match job candidates and hiring managers. It has analyzed millions of files on hard drives to help law enforcement agencies look for incriminating information. It has even studied the most effective users of ORNL’s high-performance computing program.

The system is driven by software that performs two key functions: It assigns weight to terms based on how frequently they are repeated in documents, then it clusters data based on those keywords to see groupings and patterns across large sets of documents.

Traditional term-weighting methods need to analyze an entire set of documents to generate weights, but Piranha can look at one document and generate a weight for each word. That means researchers can spread documents across multiple processors instead of running them on just one, which speeds up searches. Piranha can run on anything from a notebook computer to a supercomputer.
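The two steps described above can be illustrated with a short Python sketch. It is not Piranha's actual code. It assumes that per-document weighting works by comparing each term against a fixed reference table of term rarity rather than against the other documents being analyzed. The table values, the function names (`weigh_document`, `cosine`, `cluster`) and the similarity threshold are all hypothetical, chosen only for illustration.

```python
import math
from collections import Counter

# Hypothetical reference table: number of documents in a fixed background
# corpus that contain each term. These values are invented for illustration.
REFERENCE_DOC_COUNTS = {
    "the": 1000, "report": 300, "patient": 120,
    "aortic": 8, "aneurysm": 6, "diverticulitis": 5,
}
REFERENCE_SIZE = 1000  # assumed size of the background corpus


def weigh_document(text):
    """Weight each term in one document without reading any other document."""
    counts = Counter(text.lower().split())
    weights = {}
    for term, tf in counts.items():
        seen_in = REFERENCE_DOC_COUNTS.get(term, 0)
        # Terms that are rare in the reference table get larger weights.
        rarity = math.log((REFERENCE_SIZE + 1) / (seen_in + 1))
        weights[term] = math.log(1 + tf) * rarity
    return weights


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster(vectors, threshold=0.3):
    """Greedy one-pass clustering: a document joins the first cluster whose
    seed document is at least `threshold` similar, else starts a new one."""
    clusters = []  # each cluster is a list of indices; index 0 is the seed
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


docs = [
    "aortic aneurysm report",
    "aortic aneurysm diverticulitis",
    "the patient report",
]
vectors = [weigh_document(d) for d in docs]
groups = cluster(vectors)
```

Because `weigh_document` never looks at another document, each call is independent, so in principle the list comprehension could be replaced by a map over a pool of worker processes. That independence is the property the paragraph above credits for Piranha's speed.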

Come Together

$200 million

Total of commitments from six federal agencies for projects announced in accordance with the White House’s Big Data Research and Development Initiative, launched one year ago

SOURCE: White House Office of Science and Technology Policy

Scientists within the U.S. Geological Survey’s Core Science Systems mission are working to let users investigate beneath the Earth’s surface.

A project co-funded by the USGS John Wesley Powell Center for Analysis and Synthesis and the National Science Foundation’s EarthCube initiative is creating a four-dimensional digital crust of North America. The team’s goal is to assemble the layers of the Earth digitally so scientists can understand complex earth system processes over time. The project could bring new insights to topics as diverse as groundwater for cities, crops and wetlands; hazards such as earthquakes, volcanoes and landslides; and exploration for energy and mineral resources.

“Imagine being able to digitally zoom in to the Earth’s surface, go below and see what’s happening there in terms of rock formations, geochemical processes, aquifers and the movement of water,” says Sky Bristol, co–principal investigator for the project.

As with other Powell Center projects, Big Data is a driver, but the goal is to bring together not just data but also the people who can analyze that data. “In science, we have small pockets of data that characterize various phenomena,” explains Bristol, “but we need new creative ways to interact with more of those data together.”

Project members are working to develop new ways to connect to large data streams. The goal, says Bristol, is to move from downloading data to a single workstation to more of a streaming model in which researchers can tap into the stream, asking discrete questions of very big data.

Much of the data and systems needed for the project already exist. “A lot of our challenge will be bringing all of those disparate pieces together so they can interact and form a more complete picture,” Bristol says. As with all Powell Center projects, data and information products will be available publicly so others can use them in new and innovative ways.

“We really want to promote data and scientific synthesis writ large,” says Bristol, “because that’s where the good ideas come from.”

Tamara Reynolds