Super-sizing a community data trove

by Lynda Lester
NCAR Scientific Computing Division

Tucked away at the remote end of the Computer Room of NCAR’s Mesa Laboratory are five giant data silos, the heart of the world-class Mass Storage System (MSS)—arguably the largest archive of atmospheric data on Earth. The scope of this virtual Fort Knox transcends mere kilobytes, megabytes, gigabytes, and even terabytes. The MSS now measures its data holdings in petabytes—1.2 and rising.

The MSS hit the petabyte mark (one thousand trillion bytes) in November 2002. “It took us nearly 16 years to reach the one-petabyte mark,” says Al Kellie, director of NCAR’s Scientific Computing Division (SCD), which designed the MSS in the mid-1980s and has maintained it ever since. “But right now we’re adding 40 to 50 terabytes of data per month. At this rate, it will take us less than two years to hit the second petabyte.”
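The arithmetic behind that projection checks out. A quick sketch, using the midpoint of the quoted growth rate (the 45-terabyte figure is an assumption for illustration, not from the article):

```python
# Back-of-envelope check of the "second petabyte in under two years" claim.
TB_PER_PB = 1000          # decimal terabytes per petabyte
growth_tb_per_month = 45  # midpoint of the quoted 40-50 TB/month

months_to_next_petabyte = TB_PER_PB / growth_tb_per_month
print(f"{months_to_next_petabyte:.1f} months")  # roughly 22 months, under two years
```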

The MSS is a storehouse of computational analyses and observational data used by scientists around the world for long-range and long-term atmospheric research. The bulk of the data is generated by global climate-simulation models, mesoscale weather models, and other Earth science models executed on supercomputers. The MSS also contains irreplaceable historic records and data from satellites and field experiments.

“NCAR has long been known as a supercomputing center,” Al says, “but what we now see is the emergence of NCAR as a ‘superdata’ center.” Holding the core of those data, the MSS provides access to enough information to fill the Library of Congress 100 times—the equivalent of 1.2 billion 500-page paperback novels.

“The MSS processes over a million requests for data per month,” says Gene Harano, manager of SCD’s High-Performance Systems Section. “We’ve got about 1,200 registered users, nearly 38 percent of them at universities—and how much of that data they pass along to colleagues, who knows.”

Data silos in the Mass Storage System tower over Gene Harano and John Merrill, both of NCAR’s Scientific Computing Division. (Photo by Lynda Lester.)

Robin Tokmakian, a researcher at the Naval Postgraduate School in Monterey, California, passes along plenty of data to colleagues. With accounts on several NCAR supercomputers, she is currently using National Centers for Environmental Prediction (NCEP) reanalysis data from the MSS to compute a 40-year, high-resolution simulation of the global ocean circulation. Participating in several formal and informal collaborations, she has supplied model output to more than 30 institutions.

“One of the benefits of the mass store is it’s fairly easy to use—much easier to use than storage capabilities at other facilities,” Robin says. “I’m able to give 3-D simulated ocean fields to many external researchers to analyze for their own particular interests.”

Where the data are harvested

NCAR’s Mass Storage System is a sophisticated ensemble of hardware and software, its various physical components spread out from one end of the 14,000-square-foot (1,300-square-meter) NCAR Computer Room to the other. Files sent to the MSS are stored in one of three locations: the disk farm, the robotic library, or the offline tape archives.

Down on the disk farm. Looking less like a farm than two rows of blue refrigerators, the disk farm comprises 60 IBM 3390 Model 3 disks. Together, they provide 180 gigabytes of storage with lightning-fast access to NCAR supercomputers.

Cyber-silos. The robotic library consists of five StorageTek Powderhorn Automated Cartridge Systems. These are the enormous data silos, each standing almost eight feet (2.4 meters) tall and containing up to 6,000 tape cartridges.
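Those figures put an upper bound on the robotic library's capacity. A rough check, assuming (optimistically) that every cell holds one of the 60-gigabyte cartridges the MSS uses; many cartridges are the smaller 20-gigabyte kind, so actual capacity is lower:

```python
# Upper bound on robotic-library capacity from the article's figures.
silos = 5
cartridges_per_silo = 6000
gb_per_cartridge = 60              # assumes the larger cartridge type throughout

total_gb = silos * cartridges_per_silo * gb_per_cartridge
print(f"{total_gb / 1e6:.1f} petabytes")  # 1.8 petabytes
```

That ceiling sits comfortably above the 1.2 petabytes currently held, which is consistent with the silos housing the bulk of the archive.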

In each silo, cartridges are housed in rows from floor to ceiling in two concentric circles—the first around the circumference of a central column, the second inside the outer wall. A long mechanical bar rotates, propeller-like, around the center axis of the silo. At each end of the bar, a robot hand and laser camera are affixed to a vertical post.

When a user sends a request for data, the bar spins around and the camera scans cartridge barcodes. The robot hand sweeps up or down on the post, swiveling in or out to position itself in front of the correct cell. Its robot “fingers” then grab the cartridge. The bar spins again, positioning the hand in front of a tape drive. The robot inserts the cartridge into the drive, which transmits the data to the supercomputers. This happens at blinding speed—up to 350 cartridge mounts per hour.

The human factor. The offline tape archives are open shelves, like bookshelves in a library, containing IBM and StorageTek cartridges that are manually mounted by SCD staff in stand-alone tape drives. These are used mainly for second copies of MSS files (if requested by users) and periodic backups of system files from SCD supercomputers, filesystems, and other servers.

The IBM cartridges are remnants from the mid-1980s, when the MSS consisted solely of library shelves. About a year ago, SCD began transferring data from these cartridges onto the StorageTek cartridges in a process called “data ooze.” SCD has transferred data from 110,000 IBM cartridges, with 30,000 yet to go; most of these data have been moved onto StorageTek cartridges in the silos. Between routine duties and the data ooze, SCD operators currently mount six to ten cartridges per hour. That rate will drop to less than five per hour once the ooze is complete.
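The figures above give a feel for how much mounting remains. A rough lower bound on the calendar time (the 8-mounts-per-hour midpoint is an assumption, and operators also handle routine mounts, so the real elapsed time is longer):

```python
# Rough time remaining in the "data ooze," using figures from the article.
remaining_cartridges = 30_000
mounts_per_hour = 8                      # midpoint of the quoted 6-10/hour

hours = remaining_cartridges / mounts_per_hour
print(f"{hours:,.0f} hours, about {hours / 24:.0f} days of continuous mounting")
```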

Command central. The “brain” of the MSS is the Mass Storage Control Processor, an IBM 9672-R24 computer, which keeps track of every file in the MSS—information such as file location, owner, project number, retention period, and number of times accessed. The MSCP migrates data from the disk farm to the silos and back again depending on file size and access frequency: smaller, frequently accessed files move to the disk farm, while larger, less frequently accessed files stay in the silos. Including these migrations and user requests, some 200 terabytes per month pass in, out, or through the MSS.
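The migration rule can be sketched as a simple placement function. This is a toy model only: the article gives just the qualitative policy, so the function name and both thresholds here are invented for illustration.

```python
# Toy sketch of the MSCP's qualitative migration policy described above.
# Thresholds are invented; the real policy is internal to the MSCP.
def place_file(size_bytes, accesses_per_month,
               small_limit=100e6, hot_limit=10):
    """Small, frequently accessed files go to the disk farm;
    large or rarely accessed files stay in the silos."""
    if size_bytes <= small_limit and accesses_per_month >= hot_limit:
        return "disk farm"
    return "silo"
```

For example, a 1-megabyte file read 50 times a month would land on the disk farm, while a 1-terabyte model-output file read once a year would stay in the silos.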

Users access data in the MSS via UNIX-like commands (for example, msread and mswrite); files are password-protected and organized in directories similar to UNIX filesystems. A new tool will soon allow users to access and manage their data via the Web. Debuting early this summer as a self-contained utility, the tool will later be incorporated into the SCD Portal, a customizable browser interface to SCD computing resources.

Absorbing future shock

Today’s large geoscience models demand dedicated terascale (and soon even petascale) computing, delivering trillions, and eventually thousands of trillions, of calculations per second. The output from these simulations places an ever-increasing load on the MSS, which by some estimates could require between one and two dozen data silos by 2005.

Gene, however, is optimistic. “Vendors are continuing to increase the capacity of a single tape cartridge. We currently have 20- and 60-gigabyte cartridges. By the end of 2003, new StorageTek tape drives will allow us to put 200 gigabytes of data on those same 60-gigabyte cartridges.”

John Merrill, head of SCD’s MSS Group, notes that a larger disk farm is in the works, capable of holding 20 to 40 terabytes of data. Also planned is a shared front-end fileserver that would function as an external disk cache. These innovations would greatly reduce the need for cartridges.

Even so, the archive may be growing faster than storage densities are improving; thus, SCD is lobbying for additional space to expand the Computer Room.

One final factor to consider in MSS development is network speed. Whereas transferring megabytes of data may take only a few seconds at today’s transfer rates, moving petabytes could take days—or even years! Fortunately, vendors are increasing file transfer speed as well as capacity.
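The scale of that problem is easy to work out. The article quotes no transfer rate, so the 10 megabytes per second used here is purely an illustrative assumption for a fast link of the era:

```python
# Rough transfer-time arithmetic at an assumed sustained rate.
rate = 10e6  # bytes per second (assumed, for illustration)

def transfer_time_s(size_bytes, rate_bps=rate):
    return size_bytes / rate_bps

print(f"{transfer_time_s(100e6):.0f} seconds for 100 megabytes")          # 10 seconds
print(f"{transfer_time_s(1e15) / 86400 / 365:.1f} years for a petabyte")  # ~3.2 years
```

At that rate, megabytes do move in seconds, while a full petabyte would indeed take years.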

Whatever the future has in store, as technology evolves and computational output multiplies, SCD staff remain committed to providing optimum, cost-effective mass storage for the atmospheric sciences community—as they have since 1978, when a rudimentary archival system held just 0.0001 petabyte of data.
