Dec 31 2009

Beyond Red Sandstone

To serve up access, Smithsonian will expand its digitization and IT infrastructure.

The Smithsonian Institution is pushing ahead to make its vast collections available anywhere electronically — an effort that by 2010 will, according to its estimates, require at least 569 terabytes of storage.

The Smithsonian’s holdings range across 19 museums and galleries, the National Zoological Park and nine research centers. In addition, it has affiliations with hundreds of other educational and cultural organizations worldwide.

The Smithsonian has a plan to provide unified access to digital assets at individual locations and expand its enterprise storage environment. CIO Ann Speyer, among others at the institution, would like to see a major transformation in how people access the collections. Although the brick-and-mortar facilities will always play an essential role in providing information to researchers and visitors, Speyer wants the institution to provide access to people who cannot visit the facilities. “We want to extend our reach to where people live, study and work,” she says.

Big, Big, Big

Less than a year ago, the Smithsonian had about 46TB of usable enterprise storage dedicated to e-mail, office files and databases, along with 17TB for its collections-related digital assets.

Enterprise applications consumed roughly 75 percent of usable storage. This spring, the Smithsonian has been bringing online an additional 84TB of storage to relieve some pressure and make more space available for collections data.

The last storage analysis, done in November 2005, estimated that the Smithsonian would need that 569TB of storage within three years solely for collections-related digital assets.

But George Van Dyke, director of information technology operations at the Smithsonian, says, “If we actually succeed in digitizing everything we want to provide online, the requirement will be much higher.” He points out that the report looked only at digitizing the collections at the Smithsonian, not its research data. The Smithsonian Astrophysical Observatory, for instance, will require many hundreds of petabytes, he says.

The Smithsonian must ramp up migration from standalone storage silos at the individual museums and units to an enterprise environment. Van Dyke notes that because resources needed for in-house applications and general use will remain relatively constant over time, a higher percentage of storage capacity will be available for collection and research-related digital assets as the IT organization beefs up capacity.
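Van Dyke's point can be checked with back-of-the-envelope arithmetic using the figures quoted above. The sketch below is illustrative only: it assumes the enterprise footprint stays fixed at 46TB and that every terabyte beyond it could be given to collections, which the article does not state outright.

```python
# Figures quoted in the article, in terabytes.
ENTERPRISE_TB = 46           # e-mail, office files and databases
COLLECTIONS_TB = 17          # collections-related digital assets
ADDED_TB = 84                # capacity being brought online
TARGET_COLLECTIONS_TB = 569  # November 2005 estimate for collections alone


def enterprise_share(enterprise_tb: float, total_tb: float) -> float:
    """Fraction of total usable storage taken by enterprise applications."""
    return enterprise_tb / total_tb


before_total = ENTERPRISE_TB + COLLECTIONS_TB   # usable storage before the expansion
after_total = before_total + ADDED_TB           # usable storage after the expansion

share_before = enterprise_share(ENTERPRISE_TB, before_total)
share_after = enterprise_share(ENTERPRISE_TB, after_total)

# Assumption: enterprise needs stay constant, so the rest is free for collections.
collections_after = after_total - ENTERPRISE_TB
gap_to_target = TARGET_COLLECTIONS_TB - collections_after

print(f"Enterprise share before: {share_before:.0%}")
print(f"Enterprise share after:  {share_after:.0%}")
print(f"Still needed for collections: {gap_to_target}TB")
```

Under these assumptions the enterprise share falls from roughly three-quarters to under one-third, yet the collections estimate remains several hundred terabytes out of reach.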

Real Access

The Smithsonian’s storage area network provides real-time access to files stored on redundant arrays of independent disks, both RAID Level 5 and RAID Level 10 systems.

For near-real-time data, it houses files using content-addressed storage (CAS). The CAS approach, which assigns a digital fingerprint or logical address to each item, ensures that the institution doesn’t waste storage by duplicating content. To add low-cost capacity for its less-used items, the institution also plans to add offline storage, Speyer and Van Dyke say.
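The deduplication idea behind CAS can be shown in a few lines. This is a minimal in-memory sketch, not the Smithsonian's system: the article does not name the product or the fingerprint algorithm, so SHA-256 and the class name `ContentAddressedStore` are assumptions made for illustration.

```python
import hashlib


class ContentAddressedStore:
    """Toy content-addressed store: an item's address is a hash of its bytes."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        # Assumption: SHA-256 stands in for the "digital fingerprint".
        address = hashlib.sha256(data).hexdigest()
        # Identical content maps to the same address, so it is stored once.
        self._blobs.setdefault(address, data)
        return address

    def get(self, address: str) -> bytes:
        return self._blobs[address]

    def __len__(self) -> int:
        return len(self._blobs)


store = ContentAddressedStore()
first = store.put(b"scan of a specimen label")
second = store.put(b"scan of a specimen label")   # duplicate upload
third = store.put(b"a different scan")

print(first == second)  # same content, same address
print(len(store))       # number of distinct items actually stored
```

Because the address is derived from the content itself, a second copy of the same file adds no new storage, only another reference to the existing item.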

The institution has built out its network architecture to handle heavy traffic in support of efforts to digitize its collections. The 100-megabit-per-second Ethernet network serves 36 sites in five states, the District of Columbia and Panama. The wide area network includes a mix of T1 lines, T3 lines, Gigabit Ethernet fiber and 100Mbps Transport Layer Service. The internal backbones all run at 1 gigabit per second, with 100Mbps to the desktop. To provide redundancy, multiple links serve most sites.

To feed data from its back-end storage arrays, the institution’s storage area network currently uses 2-gigabit-per-second host bus adapters but will soon upgrade to 4Gbps.

The Smithsonian’s enterprise storage must support museum-specific collection systems, library standards-based research systems and its Digital Asset Management (DAM) system. Currently running as a pilot at four locations, DAM supports cataloguing, storing, and retrieving image, audio and video digital assets.

“Our goal is to expand this solution to the entire enterprise,” says Speyer. “The big challenge facing the institution is developing a data access layer and associated processes that will provide a single unified view into multiple underlying systems.”

The Smithsonian has developed initial blueprints for the access layer that would work in conjunction with an enterprise DAM and the individual museums’ Collection Information Systems. The amount of storage added is a prime factor in determining how many Smithsonian units and how many digital assets the system can handle. “When you’re dealing with video, sound and image files, which are generally very large, storage is a serious concern,” Speyer says.

To make efficient use of enterprise storage, Speyer’s staff has begun defining processes to determine whether an asset requires real-time or near-real-time access. The current thinking calls for maintaining infrequently accessed items in archival storage. The first person who accesses a file from archival storage will have a few-second delay for data retrieval. But then the digital asset will remain cached in the real-time storage array for a period of time. Subsequent viewers will be able to access it at Internet speed.
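The retrieval pattern described here, a slow first read from the archive followed by fast reads from a cache, is commonly called a read-through cache. The sketch below shows the idea under stated assumptions: the class name `TieredStore`, the one-hour retention period and the dictionary standing in for archival storage are all hypothetical, since the article says only that assets stay cached "for a period of time."

```python
import time
from typing import Callable, Mapping


class TieredStore:
    """Read-through cache in front of a slower archive (illustrative only)."""

    def __init__(
        self,
        archive: Mapping[str, bytes],
        ttl_seconds: float = 3600.0,           # assumed retention period
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._archive = archive
        self._cache: dict[str, tuple[bytes, float]] = {}  # name -> (data, expiry)
        self._ttl = ttl_seconds
        self._clock = clock
        self.archive_reads = 0                 # counts slow retrievals

    def get(self, name: str) -> bytes:
        now = self._clock()
        entry = self._cache.get(name)
        if entry is not None and entry[1] > now:
            return entry[0]                    # fast path: still cached
        data = self._archive[name]             # slow path: fetch from archive
        self.archive_reads += 1
        self._cache[name] = (data, now + self._ttl)
        return data


archive = {"moon-rock.tif": b"...image bytes..."}
store = TieredStore(archive)
store.get("moon-rock.tif")   # first viewer waits on the archive
store.get("moon-rock.tif")   # later viewers are served from the cache
print(store.archive_reads)
```

A burst of interest in one item therefore costs a single archive retrieval, after which requests are served from the faster tier until the entry expires.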

“If something in the news creates a lot of interest in one subject, for instance, the system should be able to accommodate the needs of users through the caching process,” says Speyer. While the Smithsonian project is ambitious and must compete with other national priorities for funds, Speyer believes the direction is right and necessary for the institution as a leader of education, science, history, art and culture.