A global data redundancy plan lets USGS ramp up services when unexpected events such as earthquakes drive up demand for the agency’s data, says CIO Kevin Gallagher.

May 05 2010

The Double-Duty Infrastructure

Redundancy doesn’t just offer disaster recovery and COOP benefits; it also lets agencies scale services when data demands run high.

The U.S. Geological Survey maintains data centers in the United States, but its technology reach extends to stream gauges, volcano sites and earthquake reporting stations around the world.

The agency collects and shares data using a combination of redundant servers at its data centers and commercial content delivery network (CDN) services.

Data coming into USGS — such as earthquake sensor information — travels multiple paths from many stations to ensure data delivery. And data going out — to the public NatWeb site, to the USGS emergency notification system and to other hosted sites such as local science centers — is served from three data centers across the country. Each center’s servers are configured identically for backup purposes, with data replicated between all servers.

The approach supports both disaster recovery and the agency’s continuity of operations plan, but it also allows USGS to continue to provide access to natural resources data — no matter the demand.

“In all phases of the information cycle, you need to build for and account for redundancy,” says Kevin Gallagher, CIO and associate director of the USGS Geospatial Information Office. “We can bring up individual systems and communication lines without systemwide downtime.” The current server infrastructure, says Gallagher, along with the use of a CDN, lets the USGS gather data and keep it available even when demand peaks, as it does during natural disasters.

Setting a High-Availability Plan

Using a CDN is a best practice for an organization that has a lot of content to push to the server, says Mike Gualtieri, senior analyst at Forrester Research. “The main value of the CDN is to overcome geographic latency,” he says. The CDN pushes the information closer to where users are, taking the burden off of central data centers. And, in a localized emergency, CDNs can often do intelligent routing from other data centers to ensure that the local data center doesn’t become a bottleneck, Gualtieri says.

USGS uses a round-robin strategy for its CDN, which sends user web page requests to the CDN first, then routes the request to the next available USGS server. “They’re not actually looking at the load on each server,” Gallagher says, “but naturally fanning out the requests across available servers.”
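The fan-out Gallagher describes can be illustrated with a minimal sketch. It assumes a fixed list of servers and a simple up/down flag per server; the class, method and server names are hypothetical and are not drawn from the USGS or CDN configuration. The point is only that rotation ignores load: each request goes to the next server in line that is marked available.

```python
class RoundRobinBalancer:
    """Hands out servers in a fixed rotation, ignoring per-server load.

    Illustrative only: names and behavior are assumptions, not the
    actual USGS or CDN implementation.
    """

    def __init__(self, servers):
        if not servers:
            raise ValueError("at least one server is required")
        self._servers = list(servers)
        self._down = set()   # servers currently marked unavailable
        self._index = 0      # position of the next server to try

    def mark_down(self, server):
        self._down.add(server)

    def mark_up(self, server):
        self._down.discard(server)

    def next_server(self):
        # Try each server at most once, starting where the last call stopped.
        for _ in range(len(self._servers)):
            server = self._servers[self._index]
            self._index = (self._index + 1) % len(self._servers)
            if server not in self._down:
                return server
        raise RuntimeError("no servers available")


# Hypothetical data center names, for illustration only.
balancer = RoundRobinBalancer(["dc-east", "dc-central", "dc-west"])
first_pass = [balancer.next_server() for _ in range(3)]

# Take one center offline; requests keep rotating across the rest.
balancer.mark_down("dc-central")
second_pass = [balancer.next_server() for _ in range(3)]
print(first_pass, second_pass)
```

Because no load measurement is involved, the scheme is cheap to run, but it relies on the servers being configured identically, as the article says the USGS servers are.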

The USGS website experienced a peak of 2,837 requests per second during the Haitian earthquake and 2,319 requests per second during the Chilean earthquake.

Content delivery networks handle what are called flash crowds: huge, unexpected spikes in web user requests, such as those that struck USGS on the days of both the Haitian and Chilean earthquakes. The CDN responds by “bursting,” replicating requested data through the global network and serving out more and more bandwidth to users until peak demand is satisfied. Then, the CDN eases off replicating as demand drops back to more typical levels.

USGS pays both DNS service fees and bursting charges, according to Gallagher, totaling about $75,000 a year. Gualtieri recommends that potential CDN users test latency before committing to a plan and make sure they’re paying only for the geographical coverage they need.

Virtually Everywhere

When planning for rapid scalability, “the biggest mistake people make is they buy too much hardware,” says Gualtieri. “They’re really worried about peak demand, so they overspend.” But buying big servers for a data center to handle one event doesn’t make sense, he says. “Peak demand is not about CPU. It has to do with bandwidth and geographic latency.”

For instance, the H1N1 flu outbreak taxed networks, too.

The Centers for Disease Control and Prevention’s need for quick scalability tied to flu traffic began in late April 2009, when H1N1, or swine flu, first made headlines. In a two-day period, CDC websites experienced a sevenfold increase in activity over the previous year, up to 14.1 million page views per day.

“H1N1 was unprecedented,” says Howard Smith, associate director for technology services at CDC. “We tripled web server capacity in a couple of days” by creating new virtual servers quickly.

CDC chooses among best-of-breed content delivery networks depending on the type of content being served, whether static pages or dynamic ones, such as those that include video. But what really saved CDC from widespread downtime or performance deterioration was its virtual server farm, implemented several years ago.

“You never know exactly what the system will run like until it’s fully loaded,” says Smith. “We didn’t expect rapid growth, but that’s where the strategy really paid off, bringing additional resources online rapidly.”

Agencies can set virtualization up in a number of ways, says Shawn McCarthy, director of research at IDC Government Insights. “At the high end is a virtual system that also does symmetric processing and dynamic reassignment.”


McCarthy recommends intelligent consolidation. “Consolidate servers if you have something that is capable of performing up to the level it’s needed,” he says, and consider including a buffer on each server to allow for spikes in traffic.
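One way to reason about such a buffer is a back-of-the-envelope sizing calculation. The sketch below is an assumption-laden illustration, not McCarthy’s method: the function name, the 25 percent buffer and the 500-requests-per-second server capacity are all hypothetical. The only figure taken from the article is the 2,837 requests per second USGS saw during the Haitian earthquake.

```python
import math


def servers_needed(peak_load, capacity_per_server, buffer_fraction=0.25):
    """Return how many servers cover peak_load if each server keeps
    buffer_fraction of its capacity in reserve for spikes.

    All parameters are illustrative assumptions.
    """
    if capacity_per_server <= 0:
        raise ValueError("capacity_per_server must be positive")
    if not 0 <= buffer_fraction < 1:
        raise ValueError("buffer_fraction must be in [0, 1)")
    # Capacity each server may use once its spike buffer is set aside.
    usable = capacity_per_server * (1 - buffer_fraction)
    return math.ceil(peak_load / usable)


# 2,837 requests/sec is the article's USGS peak; 500 requests/sec per
# server is a made-up capacity chosen only to show the arithmetic.
with_buffer = servers_needed(2837, 500, buffer_fraction=0.25)
without_buffer = servers_needed(2837, 500, buffer_fraction=0.0)
print(with_buffer, without_buffer)  # 8 6
```

Under these assumed numbers, the buffer costs two extra servers, which is the kind of trade-off consolidation planning has to weigh against Gualtieri’s warning about overbuying hardware.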

When going virtual, “build the farm with the proper hardware,” adds Smith, “considering fault tolerance and the types of loads you’re going to have.” CDC uses clustering capabilities on its virtual platforms, and dynamically relocates virtual servers between physical servers even while they are in use. CDC’s centralized IT management and centrally located servers helped to make virtualization a possibility for the agency, says Smith.

Along with maintaining its own site, CDC syndicates its information to other websites, such as those for state health departments, from its main data center in Atlanta. Smith says CDC, which is now about 50 percent virtualized, will continue to convert physical servers to virtual as they reach their end of life.

Be Ready, Test Often

Test systems thoroughly before a demand spike does it for you, cautions Smith.

The H1N1 surge led CDC to look at more formal testing programs beyond the load and stress testing the agency was already doing. “You don’t want to stress-test your production servers,” Smith says, “but if we don’t do it now, we know that a real-life event will do it for us.”

“You should always plan for your worst case,” says Des Wilson, CEO of BreakingPoint, a network testing device provider. “That doesn’t mean buy for your worst case, but you should know what your upper bound is.”

Wilson says that a common mistake organizations make when testing is to run a simulation that fails to closely approximate demand levels they will actually experience. “It’s not just running a simple web-based protocol through there,” says Wilson. “Run 100 different applications through there, just like you see it on the network.”

USGS’ Gallagher cautions that human error can be the greatest threat to an infrastructure and is one of the chief reasons for testing. Find out where and how the human factor plays out during a surge.

“I’m big on really proper and thorough testing of something before you’ve got to live with it,” Gallagher says. “When you do, you should positively have a rollback strategy, so if something goes wrong with that new release, you can very quickly roll it back.”

Remember, humans make the changes, and they are fallible, he says. And new software could add unexpected wrinkles. “Hardware is pretty reliable, and power is pretty reliable. Once you’ve got an operating system set up, making changes to that production operation is probably the No. 1 threat to your availability.”

Photo: Gary Landsman
