May 08 2010

Built-In Dedupe

Data deduplication adds value to backup software by relieving bloat.

At the United States Geological Survey’s Fort Collins Science Center in Colorado, scientists pore over data and images compiled from satellites, field workers and researchers around the world to understand the changing climate, the impact of wildfires and other natural disasters, the effect of invasive species on local ecosystems, and everything else about the country’s biological resources. They use and generate a mountain of data, all of it vital to national policy and scientific legacy.

That poses a challenge for Jeffrey Schafer, the lead IT specialist for the Fort Collins Science Center, who’s charged with protecting all of that information. “The amount of data is only increasing, and that means the backup windows get longer and longer,” he says.

When backup time started eating into the team’s ability to do its work, Schafer started examining backup solutions that included data deduplication to reduce the volume of data. “With deduplication,” Schafer says, “the first backup takes the usual time, but after that, backing up is super fast. Without deduplication — holy cow!”

Responsible for a small agency with a limited budget, Schafer plans to deploy backup software with deduplication capabilities built in, rather than a separate appliance or server designed strictly for deduplication. The software approach is less expensive to deploy and maintain. “I don’t want to have to spend a lot of time managing storage,” says Schafer. “Deduplication has to be easy, or you’ll spend whatever you save on the software hiring someone to manage it.”

The affordability and ease of use offered by backup software with integrated deduplication features have made the combination quite popular in recent years. Today, nearly all major backup software packages offer deduplication, including Acronis Backup & Recovery 10; BakBone NetVault: Backup and NetVault: SmartDisk; Barracuda Networks Backup Service; CA ARCserve; and EMC Avamar.

Eliminating Redundancy

Deduplication is a relatively new technology, but the principle is fairly simple. In every organization, there are pieces of data that are repeated dozens, hundreds, even thousands of times across all the files stored on a network. These could include whole files, such as a memo sent to everyone in the organization and saved to every hard drive on every computer, but much of the replication occurs within files — for instance, a signature block appended to every outgoing e-mail or a logo embedded in every PowerPoint.

6.6 months: Average time in which a data deduplication system pays for itself in reduced storage needs, improved IT productivity and shorter backup and restore times (Source: IDC, 2010)

Rather than save these scraps of data over and over again, deduplication scans every file for redundancy and replaces repeated data with a pointer to the original. “It’s like a bouncer at a club,” says Mike Fisch, senior contributing analyst at The Clipper Group. “To get in, you have to be original.”
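The idea of replacing repeated data with a pointer to the original can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any of the products named above work: it assumes fixed-size chunks, uses a SHA-256 digest as the "pointer," and the names `dedupe`, `restore` and `CHUNK_SIZE` are hypothetical.

```python
import hashlib

CHUNK_SIZE = 8  # bytes; deliberately tiny so the example is easy to follow


def dedupe(data, store):
    """Split data into fixed-size chunks and keep each unique chunk once.

    `store` maps a chunk's SHA-256 digest to the chunk itself.
    Returns the list of digests ("pointers") needed to rebuild the data.
    """
    pointers = []
    for start in range(0, len(data), CHUNK_SIZE):
        chunk = data[start:start + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).digest()
        # Only the first copy of a chunk is stored; repeats reuse it.
        store.setdefault(digest, chunk)
        pointers.append(digest)
    return pointers


def restore(pointers, store):
    """Rebuild the original bytes by following each pointer into the store."""
    return b"".join(store[p] for p in pointers)


# A "signature block" repeated 100 times, followed by one unique chunk.
store = {}
message = b"SIGBLOCK" * 100 + b"uniquetx"
pointers = dedupe(message, store)
```

Here the 808-byte message is described by 101 pointers, but only two chunks are actually stored. Real products generally use much larger chunks, and some use variable-size chunking, which this sketch does not attempt.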

Deduplication offers a number of benefits when integrated with a backup strategy. First, it reduces the size of individual backups by eliminating redundant data. It also reduces the storage capacity required for subsequent backups because today’s backup image likely shares much, if not most, of its data with yesterday’s. With deduplication, a backup store can hold many times more data over time than the physical space it takes up. “You can easily get to 20 times, or even 50 times [the amount of data],” says Fisch. “So you can back up a lot more data to disk.”
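To see where a figure such as "20 times" can come from, the arithmetic can be worked through with a small Python sketch. It assumes fixed 4,096-byte chunks and defines the ratio as logical bytes backed up divided by unique bytes stored; the function name `dedupe_ratio` and the sample data are hypothetical and not drawn from any vendor's measurements.

```python
import hashlib


def dedupe_ratio(backups, chunk_size=4096):
    """Return logical bytes backed up divided by unique bytes stored."""
    logical = 0
    unique = {}  # digest -> chunk length, counted once per distinct chunk
    for image in backups:
        logical += len(image)
        for start in range(0, len(image), chunk_size):
            chunk = image[start:start + chunk_size]
            unique.setdefault(hashlib.sha256(chunk).digest(), len(chunk))
    stored = sum(unique.values())
    return logical / stored if stored else 0.0


# A backup image made of 10 distinct 4,096-byte chunks.
base = b"".join(bytes([i]) * 4096 for i in range(10))

# Case 1: 20 nightly backups with no changes at all.
unchanged = [base] * 20

# Case 2: 20 nightly backups in which one chunk out of 10 changes each night.
changing = [base[:-4096] + bytes([100 + night]) * 4096 for night in range(20)]

ratio_unchanged = dedupe_ratio(unchanged)  # 20.0
ratio_changing = dedupe_ratio(changing)    # 200 / 29, about 6.9
```

In this toy model the ratio climbs with the number of retained backups and falls as more of each night's data changes, which is why reported ratios vary so widely from one environment to another.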

For the U.S. Department of Agriculture’s Agricultural Marketing Service, deduplication’s ability to drastically reduce the size of backups offers an important side benefit: Reducing the amount of data makes it possible to send backups from offices around the United States over the Internet to a single facility.

“All our system administrators are managing their own storage,” says Jaime Canales, senior datacenter and infrastructure architect at the agency. “So we thought, what’s the point of that, when it would be easier to centralize our storage to one location?”

Further, Canales plans to replicate that central data store between the agency’s Denver and Washington, D.C., offices — and again, deduplication can drastically reduce the amount of data involved. “We’re going to be deduping at the primary site before sending it to the secondary, so that will reduce the amount of data that will be going through the pipe.”
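Deduplicating before replication can be pictured as sending the secondary site only the chunks it does not already hold. The Python sketch below is a simplified illustration of that idea, not a description of the agency's actual setup: it assumes the primary already knows which chunk digests the secondary has, and the names `chunks_to_send` and `remote_digests` are hypothetical.

```python
import hashlib


def chunks_to_send(data, remote_digests, chunk_size=4096):
    """Return {digest: chunk} for chunks the secondary site is missing.

    `remote_digests` is the set of SHA-256 hex digests the secondary
    already stores (assumed to be known to the primary in this sketch).
    """
    to_send = {}
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        # Skip chunks the remote has, and chunks already queued.
        if digest not in remote_digests and digest not in to_send:
            to_send[digest] = chunk
    return to_send


chunk_a = b"A" * 4096
chunk_b = b"B" * 4096

# The secondary site already holds chunk A from an earlier replication.
remote = {hashlib.sha256(chunk_a).hexdigest()}

# Today's data is A, B, A: 12,288 bytes, but only chunk B must cross the wire.
pending = chunks_to_send(chunk_a + chunk_b + chunk_a, remote)
```

In this example only 4,096 of the 12,288 bytes would travel between sites; the rest is covered by digests the secondary can resolve locally.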

Deduplication is catching on for good reason: It saves time, money and hassle, allowing agencies to focus on serving the public instead of managing their disk libraries. “Before, we used to make decisions based on backup windows and backup size and the number of tapes we needed,” says USGS’ Schafer. “I want backup to be a side issue, so it’s important that this is really efficient and automated.”