by Lynda Lester
NCAR Scientific Computing Division
Tucked away at the remote end of the Computer Room of NCAR's
Mesa Laboratory are five giant data silos, the heart of the world-class
Mass Storage System (MSS), arguably the largest archive of atmospheric
data on earth. The scope of this virtual Fort Knox transcends mere kilobytes,
megabytes, gigabytes, and even terabytes. The MSS now measures its data
holdings in petabytes: 1.2 and rising.
The MSS hit the petabyte mark (one thousand trillion bytes) in November
2002. "It took us nearly 16 years to reach the one-petabyte mark,"
says Al Kellie, director of NCAR's Scientific Computing Division
(SCD), which designed the MSS in the mid-80s and has maintained it ever
since. "But right now we're adding 40 to 50 terabytes of data
per month. At this rate, it will take us less than two years to hit
the second petabyte."
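The arithmetic behind that estimate can be checked with a short sketch. It assumes decimal units (1 petabyte = 1,000 terabytes), the 1.2-petabyte holdings quoted above as the starting point, and a constant monthly growth rate; the function name is illustrative only.

```python
import math


def months_to_reach(target_tb, current_tb, growth_tb_per_month):
    """Whole months needed to grow from current_tb to target_tb at a constant rate."""
    remaining_tb = target_tb - current_tb
    return math.ceil(remaining_tb / growth_tb_per_month)


if __name__ == "__main__":
    # Assumption: 1 PB = 1,000 TB, so 1.2 PB = 1,200 TB and 2 PB = 2,000 TB.
    for rate in (40, 50):
        months = months_to_reach(2000, 1200, rate)
        print(f"{rate} TB/month: about {months} months to reach 2 PB")
```

At either rate the result comes in under 24 months, which is consistent with the "less than two years" figure.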
The MSS is a storehouse of computational analyses and observational
data used by scientists around the world for long-range and long-term
atmospheric research. The bulk of the data is generated by global climate-simulation
models, mesoscale weather models, and other Earth science models executed
on supercomputers. The MSS also contains irreplaceable historic records
and data from satellites and field experiments.
"NCAR has long been known as a supercomputing center," Al
says, "but what we now see is the emergence of NCAR as a superdata
center." Holding the core of those data, the MSS provides access
to enough information to fill the Library of Congress 100 times, the
equivalent of 1.2 billion 500-page paperback novels.
"The MSS processes over a million requests for data per month,"
says Gene Harano, manager of SCD's High-Performance Systems Section.
"We've got about 1,200 registered users, nearly 38 percent
of them at universities, and how much of that data they pass along
to colleagues, who knows."
Data silos in the Mass Storage System tower over Gene Harano
and John Merrill, both of NCAR's Scientific Computing Division.
(Photo by Lynda Lester.)
Robin Tokmakian, a researcher at the Naval Postgraduate School in
Monterey, California, passes along plenty of data to colleagues. With
accounts on several NCAR supercomputers, she is currently using National
Centers for Environmental Prediction (NCEP) reanalysis data from the
MSS to compute a 40-year, high-resolution simulation of the global
ocean circulation. Participating in several formal and informal collaborations,
she has supplied model output to more than 30 institutions.
"One of the benefits of the mass store is it's fairly easy
to use, much easier to use than storage capabilities at other facilities,"
Robin says. "I'm able to give 3-D simulated ocean fields to
many external researchers to analyze for their own particular interests."
Where the data are harvested
NCAR's Mass Storage System is a sophisticated ensemble of hardware
and software, its various physical components spread out from one end
of the 14,000-square-foot (1,300-square-meter) NCAR Computer Room to
the other. Files sent to the MSS are stored in one of three locations:
the disk farm, the robotic library, or the offline tape archives.
Down on the disk farm. Looking less like
a farm than two rows of blue refrigerators, the disk farm consists
of 60 IBM 3390 Model 3 disks. Together, they provide 180 gigabytes of
storage with lightning-fast access to NCAR supercomputers.
Cyber-silos. The robotic library consists
of five StorageTek Powderhorn Automated Cartridge Systems. These are
the enormous data silos, each standing almost eight feet (2.4 meters)
tall and containing up to 6,000 tape cartridges.
In each silo, cartridges are housed in rows from floor to ceiling
in two concentric circles: the first around the circumference of
a central column, the second inside the outer wall. A long mechanical
bar rotates, propeller-like, around the center axis of the silo. At
each end of the bar, a robot hand and laser camera are affixed to a
vertical post.
When a user sends a request for data, the bar spins around and the
camera scans cartridge barcodes. The robot hand sweeps up or down on
the post, swiveling in or out to position itself in front of the correct
cell. Its robot fingers then grab the cartridge. The bar
spins again, positioning the hand in front of a tape drive. The robot
inserts the cartridge into the drive, which transmits the data to the
supercomputers. This happens at blinding speed: up to 350 cartridge
mounts per hour.
The human factor. The offline tape archives
are open shelves, like bookshelves in a library, containing IBM and
StorageTek cartridges that are manually mounted by SCD staff in stand-alone
tape drives. These are used mainly for second copies of MSS files (if
requested by users) and periodic backups of system files from SCD supercomputers,
filesystems, and other servers.
The IBM cartridges are remnants from the mid-1980s, when the MSS consisted
solely of library shelves. About a year ago, SCD began transferring
data from these cartridges onto the StorageTek cartridges in a process
called "data ooze." SCD has transferred data from 110,000
IBM cartridges, with 30,000 yet to go; most of these data have been
moved onto StorageTek cartridges in the silos. Between routine duties
and the data ooze, SCD operators currently mount six to ten cartridges
per hour. That rate will drop to less than five per hour once the ooze
is complete.
Command central. The brain
of the MSS is the Mass Storage Control Processor (MSCP), an IBM 9672-R24 computer,
which keeps track of every file in the MSS, including information such as
file location, owner, project number, retention period, and number of
times accessed. The MSCP migrates data from the disk farm to the silos
and back again depending on file size and access frequency: smaller,
frequently accessed files move to the disk farm, while larger, less
frequently accessed files stay in the silos. Including these migrations
and user requests, some 200 terabytes per month pass in, out, or through
the MSS.
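The migration rule described above can be sketched as a simple tiering policy. This is a minimal illustration, not the MSCP's actual logic: the thresholds, the record fields, and the function names are all hypothetical, and the article does not say how files that are small but rarely used (or large but often used) are handled, so the sketch assumes they stay in the silos.

```python
from dataclasses import dataclass

# Hypothetical thresholds; the article gives no actual values.
SMALL_FILE_BYTES = 100 * 10**6  # files at or below this size count as "small"
FREQUENT_ACCESSES = 5           # accesses per month that count as "frequent"


@dataclass
class FileRecord:
    """A subset of the per-file metadata the article says the MSCP tracks."""
    location: str            # "disk" or "silo"
    owner: str
    project_number: str
    size_bytes: int
    accesses_per_month: int


def choose_tier(record):
    """Return "disk" for small, frequently accessed files, otherwise "silo"."""
    is_small = record.size_bytes <= SMALL_FILE_BYTES
    is_frequent = record.accesses_per_month >= FREQUENT_ACCESSES
    return "disk" if is_small and is_frequent else "silo"


def migrate(records):
    """Update each record's location in place and return how many files moved."""
    moved = 0
    for record in records:
        target = choose_tier(record)
        if target != record.location:
            record.location = target
            moved += 1
    return moved
```

A real policy would also have to respect the disk farm's limited capacity; that constraint is left out here to keep the sketch short.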
Users access data in the MSS via UNIX-like commands (for example,
msread and mswrite); files are password-protected and organized in directories
similar to UNIX filesystems. A new tool will soon allow users to access
and manage their data via the Web. Debuting early this summer as a self-contained
utility, the tool will later be incorporated into the SCD Portal, a
customizable browser interface to SCD computing resources.
Absorbing future shock
Dedicated terascale (and soon even petascale) computing, delivering
trillions and soon thousands of trillions of calculations per second,
is needed for today's large-scope geoscience models. The output
from these simulations places an ever-increasing load on the MSS, which
by some estimates could require between one and two dozen data silos
by 2005.
Gene, however, is optimistic. "Vendors are continuing to increase
the capacity of a single tape cartridge. We currently have 20- and 60-gigabyte
cartridges. By the end of 2003, new StorageTek tape drives will allow
us to put 200 gigabytes of data on those same 60-gigabyte cartridges."
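To see what denser cartridges could mean for the robotic library, here is a rough upper-bound calculation. It assumes decimal units, all five silos filled to the 6,000-cartridge maximum quoted earlier, and every cartridge of the same capacity; the real library holds a mix of cartridge sizes, so actual capacity would be lower. The function name is illustrative.

```python
GB_PER_PB = 10**6  # assumption: decimal units, 1 PB = 1,000,000 GB


def library_capacity_pb(silos, cartridges_per_silo, gb_per_cartridge):
    """Upper-bound capacity in petabytes if every slot holds a full cartridge."""
    return silos * cartridges_per_silo * gb_per_cartridge / GB_PER_PB


if __name__ == "__main__":
    for gb in (60, 200):
        capacity = library_capacity_pb(5, 6000, gb)
        print(f"{gb} GB per cartridge: up to {capacity:.1f} PB in five silos")
```

Under these assumptions the same five silos go from a ceiling of about 1.8 petabytes to about 6 petabytes without adding floor space.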
John Merrill, head of SCD's MSS Group, notes that a larger disk
farm is in the works, capable of holding 20 to 40 terabytes of data.
Also planned is a shared front-end fileserver that would function as
an external disk cache. These innovations would greatly reduce the need
for cartridges.
Even so, the MSS archive size may be growing faster than storage capacity;
thus, SCD is lobbying for additional space to expand the Computer Room.
One final factor to consider in MSS development is network speed.
Whereas transferring megabytes of data may take only a few seconds at
today's transfer rates, moving petabytes could take days, or
even years! Fortunately, vendors are increasing file transfer speed
as well as capacity.
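A back-of-the-envelope sketch shows why transfer rate matters at this scale. The sustained rates below are hypothetical examples chosen for illustration, not measured figures from the MSS, and the calculation ignores protocol overhead and tape mount time.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(num_bytes, bytes_per_second):
    """Days needed to move num_bytes at a constant sustained rate."""
    return num_bytes / bytes_per_second / SECONDS_PER_DAY


if __name__ == "__main__":
    petabyte = 10**15  # one thousand trillion bytes
    # Hypothetical sustained rates, in bytes per second.
    for label, rate in (("1 MB/s", 10**6), ("100 MB/s", 10**8), ("1 GB/s", 10**9)):
        days = transfer_days(petabyte, rate)
        print(f"1 PB at {label}: {days:,.1f} days")
```

Each hundredfold increase in sustained rate cuts the transfer time by the same factor, which is the difference between years and days for a petabyte.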
Whatever the future has in store, as technology evolves and computational
output multiplies, SCD staff remain committed to providing optimum,
cost-effective mass storage for the atmospheric sciences community, as
they have since 1978, when a rudimentary archival system held just 0.0001
petabyte of data.