

Fall 1999

IBM computer begins NCAR's move to clustered shared-memory supercomputing



Al Kellie. (Photo by Carlye Calvin.)

An IBM RS/6000 SP supercomputer arrived at NCAR's Scientific Computing Division on 11 August. With an initial delivery of 160 nodes, each containing two processors, the computer offers a peak speed of 204 gigaflops, more than doubling SCD's peak capacity. The $6.2 million machine also includes 512 megabytes of memory per processor and 2.5 terabytes of disk space. Use of the new computer will be divided equally between NCAR's community users and those in the Climate Simulation Laboratory.

For NCAR, the arrival of the SP "begins an aggressive transition to clustered shared-memory processors, following the computer industry," says SCD director Al Kellie. "What we are doing is increasing capacity not only to meet our immediate needs but to set the stage for more efficient science."

The green bars on this chart show the life spans of major SCD computers from the 1960s onward. Arrows at right indicate machines that will be in operation at the start of fiscal year 2000 this October. On the right-hand side of the chart, the area shaded green shows sustained gigaflops (billions of floating-point operations per second) attained by all SCD machines from 1986 to the end of FY 99. The division aims to bring its collective computing power to 100 Gflops by the end of FY 00, 200 Gflops in FY 01, and 1 teraflop by FY 03.

The architecture of the IBM computer differs from that of vector computers such as the CRAY C90 now housed at NCAR, and also from that of a massively parallel machine like NCAR's SGI Origin 2000/128. In a vector computer, all processors have access to the same shared memory. The Origin is a distributed-shared-memory computer: memory is physically distributed among its many processors but can be programmed as if it were shared. The SP and other clustered distributed-memory machines consist of many nodes; within each node, memory is shared by that node's processors. Between nodes, information is exchanged by message passing, using MPI (the Message Passing Interface, an industry standard) or other means. A single node can be programmed with the same shared-memory techniques used on the CRAY C90 and the SGI Origin, but spreading a job across the multiple nodes of the IBM SP requires distributed-memory techniques.
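To make the distinction concrete, here is a minimal sketch in C of the hybrid style described above, assuming an MPI library and a compiler with OpenMP support. It is an illustration only, not code from any NCAR model: the OpenMP directive shares a loop among the processors within one node (shared memory), while the MPI call combines results across nodes (message passing).

/* Hypothetical hybrid example: shared memory within a node, MPI between nodes.
   Build with an MPI C compiler that supports OpenMP; exact flags vary by site. */
#include <mpi.h>
#include <stdio.h>

#define N 1000000

int main(int argc, char **argv)
{
    int rank, i;
    double local_sum = 0.0, global_sum = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Within a node: the node's processors share this loop as threads
       working on the same memory (the style familiar from the C90 and Origin). */
#pragma omp parallel for reduction(+:local_sum)
    for (i = 0; i < N; i++) {
        double x = (double)(rank * N + i) * 1.0e-6;
        local_sum += x * x;            /* stand-in for real per-gridpoint work */
    }

    /* Between nodes: partial results are combined by message passing. */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0,
               MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %e\n", global_sum);

    MPI_Finalize();
    return 0;
}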

The IBM SP, housed in SCD's computer room. (Photo by Carlye Calvin.)

Why change architecture?

This IBM computer is only the first step in a transition toward clustered computing that NCAR management expects will send the center's computing power skyrocketing. NCAR director Bob Serafin says, "I think we'll have, within a year, a very substantial increase in computing capacity at NCAR--a factor of ten above what we have now."

Kellie points out that all U.S. vendors are currently offering clusters of shared-memory systems to do the biggest computing jobs. According to an SCD memo to users, "This trend is independent of whether such systems are composed of vector processors or microprocessors, and it is driven by both market forces and limits on scalability for building large systems." The cost of computation on the new architecture is as much as a factor of ten cheaper than on vector supercomputers: about $100,000 to $150,000 per sustained gigaflop. "At that price we can't afford not to use these machines," Kellie says.

In choosing which computer to buy, Kellie adds, "We wanted to capitalize on the vendor that had the most mature support for coding, porting, optimization, and conversion. IBM has the richest support structure out there. They wanted NCAR as a client."

Several other research centers have acquired similar IBM systems in the past year, including NOAA's National Centers for Environmental Prediction and the U.S. Department of Energy's Oak Ridge National Laboratory. NCAR has also signed a memorandum of understanding with NSF's two Partnerships for Advanced Computational Infrastructure centers (the San Diego Supercomputer Center and the National Center for Supercomputing Applications). The collaboration includes evaluation of computers offered by various vendors, training on the new architecture, and outreach to users.

Rodney James and Steve Hammond. (Photo by Carlye Calvin.)

The bottom line is that clusters of shared-memory computers represent the highest level of supercomputing currently available, according to Rodney James, a software engineer in SCD's Computational Science Section. Sally Haerer, head of SCD's Technical Support and Development Section, is leading NCAR's training efforts. She says, "There are users who ask, 'Why don't you give us the same old thing but with more power?' If that existed we probably would, but it doesn't."

Supporting the porting

NCAR has been preparing for months to support those inside and outside the institution who must transfer their computer programs to the new machine. "We're converting 1,500 users to this system," says Haerer, "everything from high-visibility, high-impact codes [such as NCAR's climate system model] to single users writing a code to do their research at their institution. We'll do everything from hand-holding to broad lectures."

SCD has crafted a three-tiered training approach. First, users will get an overview of the system, architecture, computing environment, and programming strategies and paradigms. Then, Haerer says, "They're going to have to know what they need to do to prepare to port their codes [transfer them to the IBM]. Some codes are more standardized than others, some codes have more 'Crayisms' than others." Finally comes the actual porting. The support that users will require at that stage "is going to vary greatly depending on the code, its complexities, and how parallelized it already is. Parallelization has been happening in codes for a long time, and I think most everybody has taken some looks at their code to see how this can be facilitated."

According to Steven Hammond, head of SCD's Computational Science Section, about 90% of the jobs that currently run on SCD's CRAY J90s are single-processor jobs. For these users, conversion may be relatively painless. "If they're happy with their performance now," he says, "they'll be happy on the IBM." On the other hand, James says, "The people who are doing the grand-challenge problems will have to use a cluster of shared-memory machines. Otherwise, it may be sufficient to use one shared-memory node. You [will] have to decide if you want to do the work of converting."

Hammond and his group are collaborating one on one with scientists who need to convert big codes. Their goal is to train the scientists to the point that they can continue model development on the new architecture, not just to "fix" the old codes, Hammond says. "We cannot redo 100,000 lines of code."

Like the complexity of their codes, the users' technical know-how also varies greatly. "Some users just need the exact syntax for the IBM and they're ready to go," Haerer says. Others need an introduction to the basic philosophy of this kind of programming, training in message passing, and other groundwork. Haerer has already offered two introductory classes for users who used more than one general accounting unit (an allotment of supercomputer time) on NCAR computers last year. Since not all users will go through the transition at the same time, classes will be offered as long as they are needed.

All the support staff emphasize that their goal is to transition users to the new architecture in general, not only to the IBM. SCD plans to upgrade the IBM to four processors per node next spring, and some additional new machines are also in the long-term plan.

"Fraught with angst"

No one is expecting the changeover to be a snap. "Change is fraught with angst," says Kellie. "There's a lot of promise in these systems, but realizing that promise represents some work. The most significant issues are that the science community understand the need to make the transition, and [that the change] not retard the progress of the science."

Some scientists are actively looking forward to using the new architecture. For example, NCAR scientist Mark Rast (High Altitude Observatory) will now be able to run his solar magnetohydrodynamics code here; he formerly used the now-defunct Pittsburgh supercomputing center. "This is scalably parallel code, so you can run it across as many processors as you can get your hands on," he says. "Recently we discovered that sunspots are surrounded by bright rings. The problem I want to do next is to model sunspots to see if there's a convective origin for that radiation surrounding sunspots. We're pretty much set to go."

Haerer sums up, "We're trying to move all of our community forward in the best way we can, with everybody pitching in for this critical period. It's going to be a jolt, but once [we're] done, our user community will be extremely well poised to run on whatever our country comes up with in the way of supercomputers."

What about NCAR's other supercomputers?

Kellie notes that SCD cannot afford to keep maintaining the CRAY C90 indefinitely, but its decommissioning is probably still a few months off. Two J90 classics are also being examined for gradual decommissioning. "We're sampling the project experiments that are running on them and trying to time their disappearance so there's no precipitous gap in the science effort," says Kellie. This will leave NCAR with--besides the new IBM--the Origin 2000/128 and two CRAY J90SEs.

Haerer explains, "We're trying to hang on to the traditional systems for a long term so we can offer a long-stretch conversion period. So though we're pushing hard right now, we're not saying everybody has to make this transition in the next two months. People definitely need to carve out the time to do it, but they can fit it into their schedules."

SCD consulting staff gets up to speed

Sally Haerer. (Photo by Carlye Calvin.)

SCD's highly rated consultants have long prided themselves on knowing NCAR's Cray computers from the machine hardware up. With the addition of the new IBM, they're beginning that process all over again.

"It's not quite SCD as usual," says Sally Haerer. "We're still on a steep learning scale. The programming paradigm is so different. We may not be able to answer even the simple problems immediately, as we did before. We're going to lean a lot on the programmer-scientists. Right now Steve Hammond's group has the numerical analysis and practice on the new paradigm, but the consulting office is growing that expertise."

SCD has also enlisted IBM's assistance in training staff and helping with such internal issues as tuning the system to be most effective for the greatest number of people. Haerer explains, "IBM understands that the more help they can give us and the fewer bumps [we encounter], the happier our community will be with the final product."

Haerer emphasizes that users can help. "As I have the opportunity to talk to users, I'm trying to get them to join the team. We need to team together, with both of us working to share the answer."

Converting NCAR's big models

William Skamarock and John Michalakes. (Photo by Carlye Calvin.)

The coupled ocean-atmosphere climate system model (CSM) will be migrating to the IBM as quickly as possible. The Climate and Global Dynamics Division, which helped develop the CSM, "is committed to exploiting the new IBM system to the fullest possible extent," says James Hack, a senior scientist in CGD who works with the model. Hammond points out that the CSM is innately parallel, to some degree; the atmosphere and ocean components operate asynchronously but in parallel. The atmospheric component of the CSM, known as the community climate model or CCM, includes parallel message-passing capabilities and already has been run on the new platform.
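The general shape of that kind of coupled, message-passing job can be sketched as follows. This is a hypothetical illustration, not the CSM's actual coupling code: one MPI job is split into an "atmosphere" group and an "ocean" group that advance independently and exchange a coupling field between their lead processes.

/* Illustrative sketch only (not the CSM): two model components running as
   separate groups of MPI processes.  Run with at least two processes. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int world_rank, world_size, color;
    MPI_Comm component;           /* communicator for this component's processes */
    MPI_Status status;
    double flux = 0.0;            /* stand-in for a surface coupling field */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    /* First half of the processes act as the "atmosphere," the rest as the
       "ocean."  Each group gets its own communicator and steps forward in
       time independently of the other. */
    color = (world_rank < world_size / 2) ? 0 : 1;
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &component);

    /* ... each component would run its own time steps here ... */

    /* Periodically the components exchange coupling data by message passing. */
    if (world_rank == 0) {                       /* atmosphere lead process */
        flux = 1.0;
        MPI_Send(&flux, 1, MPI_DOUBLE, world_size / 2, 0, MPI_COMM_WORLD);
    } else if (world_rank == world_size / 2) {   /* ocean lead process */
        MPI_Recv(&flux, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
        printf("ocean lead received flux = %f\n", flux);
    }

    MPI_Comm_free(&component);
    MPI_Finalize();
    return 0;
}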

Because of the CSM's complexity and computational demands, most of its users run the model on NCAR computers (although the atmosphere, ocean, land, and sea-ice component models are run elsewhere on a wide variety of machines). The MM5 mesoscale model, on the other hand, is run on a dozen different platforms worldwide, from workstations to distributed machines to vector supercomputers. John Michalakes, a long-term visitor at NCAR from Argonne National Laboratory, says, "For us, this is not a difficult transition, since we've had to do it elsewhere for this code. The parallel MM5 that we're running right now was developed on an IBM SP at Argonne. It's currently running in production on an IBM SP with the multiprocessing nodes at the Air Force Weather Agency [at Offutt Air Force Base in Omaha, Nebraska]. At least as far as MM5 is concerned, I think we're pretty well positioned."

William Skamarock (NCAR Mesoscale and Microscale Meteorology Division) notes that the MM5's successor-to-be, the weather research and forecasting (WRF) model now under development, "is being designed from the ground up to be adaptable to this kind of architecture. We are designing it so it's flexible enough that it should run reasonably well on all conceivable platforms, both vector and distributed-memory, cache-based systems."

What if you hooked together a bunch of PCs . . .

Could somebody save a lot of money by creating a distributed architecture from individual, smaller machines--say, a group of PCs with an Ethernet connection? A number of scientific groups are already trying this, including SCD. And yes, it is very cheap.

The concept is known as Beowulf. "It's taking a roomful of PCs hooked together on Fast Ethernet or Myrinet [a commercial product]," says John Michalakes, a visitor from Argonne National Laboratory to NCAR's Mesoscale and Microscale Meteorology Division (MMM). "You run them like a distributed parallel machine. SCD has one. Penn State is putting one together. Hong Kong University is running the MM5 model on a Beowulf cluster. We've got one at Argonne. The appeal of those things is that they're scalable, and the cost-performance ratio is not through the roof but below the floor."

Richard Loft (SCD) works on the division's Beowulf cluster, which consists of eight two-processor, 300-megahertz Pentium II PCs, ordered piecemeal from a mail-order firm and assembled in-house. The machines use the Linux operating system and are connected by Ethernet. At other institutions, Loft says, "People have used these commodity-PC clusters to do things that don't use the network very much. We wanted to see how they perform on algorithms that we really use to solve atmospheric problems."

Loft chose the spectral form of the shallow-water equations, which simulate the motion of the atmosphere on a sphere, to create "a toy model of the dynamics of the real atmosphere that could be easily optimized for the PC cluster," as he describes it. "This is a difficult algorithm for systems without high-performance interconnecting networks to run, because the spectral method involves all-to-all communication of data around the system. This is not a big deal on vector supercomputers where you have lots of memory bandwidth; in a cluster of PCs connected by Ethernet you just don't have that." To surmount the Beowulf cluster's limitations, Loft "cache-blocked" the algorithm, adapting it to the machines' small memory caches, and minimized the need for communication through the use of a transposition algorithm.
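The communication step Loft describes corresponds to MPI's collective all-to-all operation, in which every process exchanges a block of data with every other process. The fragment below is a minimal sketch under assumed array sizes, not his actual code, but it shows why an Ethernet-connected cluster feels this step so keenly.

/* Hypothetical sketch of the data transposition behind a parallel spectral
   transform: each process sends one block to, and receives one block from,
   every other process, redistributing data (e.g., from latitudes to
   wavenumbers).  On a fast interconnect this is cheap; on Ethernet it is
   the bottleneck, which is why the number and size of transposes matter. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs, i;
    const int block = 1024;       /* assumed words exchanged per process pair */
    double *sendbuf, *recvbuf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    sendbuf = malloc((size_t)block * nprocs * sizeof(double));
    recvbuf = malloc((size_t)block * nprocs * sizeof(double));
    for (i = 0; i < block * nprocs; i++)
        sendbuf[i] = rank + 0.001 * i;    /* stand-in for spectral coefficients */

    /* The transposition itself: one collective call moves every block. */
    MPI_Alltoall(sendbuf, block, MPI_DOUBLE,
                 recvbuf, block, MPI_DOUBLE, MPI_COMM_WORLD);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}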

"To our surprise, this system, with the optimized spectral algorithm, was running at high resolutions at about $25 a megaflop with an Ethernet network," says Loft. This is more than a factor of ten cheaper than costs scientists commonly pay to use vector supercomputers.

With these results, it's not surprising that Loft's model has been picked up by another user. Lorenzo Polvani of Columbia University is now developing the model to run high-resolution simulations of turbulence on the planet Jupiter. And SCD plans to acquire a new Beowulf cluster with faster processors and a Myrinet connection that, Loft notes, "will be much more capable of handling climate-simulation resolutions."

If off-the-shelf clusters show so much promise, why do we need the IBM? "Doing these Beowulf Pentium things is a very scary proposition because Linux is evolving rapidly right now," says Loft. "There's not a lot of software." MMM's Bill Skamarock sums up, "The obstacles to Beowulf-type clusters are reliability--who maintains the thing?--and scalable administration. You can't have somebody running around with a floppy disk and putting it in 64 different drives to install new software. So there are some problems that the Linux community is going to have to address if this is going to take off. But you can't beat the price."



Edited by Carol Rasmussen, carolr@ucar.edu
Prepared for the Web by Jacque Marshall