Feb 08 2011

Private Sky

DOE labs use a test bed to perfect cloud services for high-end research.

At Energy Department labs outside Chicago and San Francisco, researchers are sequencing genomes, analyzing microbial populations in soil samples, exploring carbon sequestration and modeling the spread of pandemics — among scores of other projects. Just a typical day at the labs, with one large exception: They’re conducting these projects using a cloud computing test bed called the Magellan Project.

Argonne National Laboratory and Lawrence Berkeley National Laboratory designed Magellan to test the feasibility of cloud computing for computational science — an experiment in experimentation, if you will, that could ultimately change how science is performed across the globe.

Funded by the American Recovery and Reinvestment Act, the two-year project is slated to run through September 2011, says Susan Coghlan, who’s heading up Magellan for Argonne in Illinois.

“We’re hoping to find out what scientific domains already work well in the cloud, which domains can work with modifications and what won’t work at all,” says Coghlan, associate division director for the Argonne Leadership Computing Facility. “We’re trying to answer the questions: Is the cloud computing paradigm a reasonable one for doing DOE high-performance computing? If not, why not? If so, what path should be taken to bring it up to the level where it needs to be?”

No Guts, No Glory

At Argonne, the Magellan team has amassed a private cloud consisting of more than 4,000 Intel Nehalem computing cores, with 40 terabytes of solid-state drive storage, 133 graphics servers and 15 dedicated memory servers, connected via a high-speed QDR InfiniBand communications link. Berkeley’s National Energy Research Scientific Computing Center (NERSC) in California has constructed a similar cloud environment.

Yet at these facilities, where high-performance computing typically involves multimillion-dollar supercomputers cranking at petaflops (quadrillions of floating point operations per second), this cluster is considered a sporty economy model, operating at only 151 teraflops (trillions of FLOPS).

And that’s a big part of its appeal. The Magellan clusters aren’t intended to compete with the supercomputers: They’re designed to find out what could happen when midrange computing is widely available to the lab-coated masses, Coghlan says.

Because cloud computing environments can be provisioned quickly and accessed from virtually anywhere, this infrastructure could expand access to raw computing power to more scientists when they need it, without massive upfront investments in expensive hardware or queuing up for limited time on the big iron, says Katherine Yelick, division director of NERSC.

“We have more people who want time on our high-performance computers than can actually get it, so for the most part those facilities are reserved for large, high-end scientific problems,” Yelick says. “A centralized resource like the cloud is best suited for researchers with very spiky workloads — say, access to 500 computer nodes for 24 hours, once a month. That can be very expensive to do on your own.”
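Yelick's example makes the economics concrete. As a rough sketch (the 30-day month and single monthly run are our assumptions, used only to illustrate the point), the utilization arithmetic looks like this:

```python
# Sketch of Yelick's "spiky workload" arithmetic: a job needing 500 nodes
# for 24 hours once a month uses only a sliver of a dedicated cluster.
nodes, hours_per_run, runs_per_month = 500, 24, 1
used = nodes * hours_per_run * runs_per_month   # node-hours actually consumed
available = nodes * 24 * 30                     # node-hours a dedicated cluster offers (30-day month assumed)
utilization = used / available
print(f"{utilization:.1%}")                     # roughly 3.3%: owned hardware would sit idle most of the time
```

On those assumptions, a cluster bought for that one monthly burst would sit idle more than 96 percent of the time, which is the case for renting the capacity instead.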

Plus, notes Coghlan, most scientists working out in the field don’t have the funds to procure large systems, “let alone the ability to pay for administration, support, power and cooling. If they can take advantage of the cloud when they need to run large problems, that opens up a world that might not be available to them if they had to plop down $2 million for a computer system.”

Be Selective

Besides the cost savings, cloud computing also provides a different kind of speed — the swiftness with which lab personnel can provision a server share for a user. For example, when the DOE’s Joint Genome Institute had a sudden need to double its computing resources last March, it turned to Magellan. Within three days, researchers had gained the services of a cluster of several hundred nodes identical to JGI’s local computing cluster.

Yet, as Magellan researchers have learned, not all scientific applications benefit from a cloud environment. Experiments that can be sliced into multiple parts and run independently, such as sequencing strands of DNA, work very well in the cloud, Coghlan says.

But computing jobs that need to run in parallel, communicate frequently or synchronize among different nodes are a poor choice, says Yelick. For one thing, the cloud nodes are likely to be virtualized, which means each physical machine may be running an unknown number of other processes and applications. Some may run more slowly than others, which means the entire test system has to wait for the slowest machine to catch up.

“Faster interconnects and synchronization between processors are what make supercomputers super,” explains Jared Wilkening, a software developer at Argonne who worked on the JGI cloud project. “Cloud computing is better for more CPU-intensive tasks.”
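The distinction can be sketched in a few lines of Python. The `gc_content` function and the fragment data below are purely illustrative stand-ins, not code from the Magellan or JGI projects; the point is that each fragment is processed with no communication between workers:

```python
# Illustrative sketch of an "embarrassingly parallel" workload, the kind
# Coghlan says works well in the cloud: each DNA fragment is analyzed
# independently, so no worker ever waits on another.
from multiprocessing import Pool

def gc_content(seq):
    """Fraction of G/C bases in one DNA fragment (illustrative work unit)."""
    return (seq.count("G") + seq.count("C")) / len(seq)

if __name__ == "__main__":
    fragments = ["ATGCGC", "TTAACG", "GGGGCC", "ATATAT"]  # stand-in data
    with Pool(4) as pool:
        # Each fragment could just as well run on a separate cloud node;
        # there is no synchronization step between them.
        results = pool.map(gc_content, fragments)
    print(results)
```

A tightly coupled simulation, by contrast, would need every worker to exchange data each timestep, so one slow virtualized node stalls them all.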

In addition, high-end scientific applications such as climate simulation require higher rates of data input and output than cloud configurations can muster.

“Data transfer and movement are key things to think about before you move to the cloud,” Wilkening says. “You don’t want to have your cloud nodes spending 90 percent of their time getting the data they need in and out. That wastes a lot of what you’re paying for.”
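A quick calculation shows why that overhead matters: at the 90 percent figure Wilkening cites, each unit of real computation effectively costs ten times the list price.

```python
# Sketch of Wilkening's point: if cloud nodes spend 90 percent of their
# time moving data in and out, only 10 percent of the node-hours you pay
# for go toward actual computation.
io_fraction = 0.9
useful_fraction = 0.1                    # 1 - io_fraction
cost_multiplier = 1 / useful_fraction    # effective price per useful CPU-hour
print(cost_multiplier)                   # each unit of real work costs 10x the list rate
```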

Making the Right Paradigm Shift

The cost benefits of cloud computing can be compelling, especially as commercial providers begin to roll out cloud offerings more suitable for the scientific community, says Ron Ritchey, a principal with strategy and technology consulting firm Booz Allen Hamilton. The lessons learned from Magellan could eventually lead to the cloud becoming as common as test tubes and telescopes in a scientist’s arsenal.

“There are lots of opportunities in academia for a cloud system to be applied that not only reduces cost but makes the setup of the experiment much more efficient and attractive,” Ritchey says. “If you have an experiment that would benefit from having a few hundred CPUs chew on the same data for an hour, and you put in a purchase order for them to perform that experiment, you’d probably get turned down. But doing the same thing in Amazon’s EC2 environment would cost maybe $25.”
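Ritchey's $25 figure checks out as a back-of-the-envelope estimate, assuming (our assumption, not stated in the article) 2011-era EC2 on-demand pricing of roughly $0.085 per small-instance-hour:

```python
# Rough check of Ritchey's estimate. The per-hour rate below is an assumed
# 2011-era EC2 on-demand small-instance price, not a figure from the article.
nodes = 300     # "a few hundred CPUs"
hours = 1       # "chew on the same data for an hour"
rate = 0.085    # assumed $/instance-hour
cost = nodes * hours * rate
print(f"${cost:.2f}")   # about $25, consistent with the quoted estimate
```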

Just as inexpensive Internet access has sped the flow of information, cloud computing could spread the reach of science.

“To me, what’s important about the cloud is that it lets scientists have access to computing resources when they really need them, instead of parceling them out by a certain number of hours of use per year,” Yelick says. “Hopefully the expertise we gain from Magellan will influence what commercial providers, NERSC and others offer in terms of cloud services for their users.”




Photo: Bob Stefko