Scientific Information Bulletin

Since only the listed symptoms are detected, unanticipated errors will not be detected and could cause system failure. The second problem is that there is no recovery from recovery failures. In other words, if a fault occurs during the process of recovering from a fault, the system could fail. This problem was recognized by Fujitsu and led to the definition of a critical route, i.e., any procedure whose failure would lead to a recovery failure. The design team took special care to inspect these critical routes, reduce their number, and simplify their design. It is safe to say that the measures described in the paper will serve to greatly increase the reliability of the SXO operating system. There is still a good chance, however, that unanticipated bugs will cause system failure or seriously impaired performance in the early stages of system life. [One attendee from Tandem pointed out to me that one of the important features of Fujitsu's system was the incorporation of an artificial fault generation mechanism built into the operating system. Each SXO component can have 10 or more interrupt points, and several thousand have been built in already. The final product can be shipped with the traps embedded but not in operation. In this form there is negligible overhead, but it is very easy to bring them into effect. Traps are written as C macros, and hence are easy to program. DKK]

TECHNICAL TOURS

At Fujitsu, we were shown the new Fujitsu parallel computer, called the AP1000. Fujitsu has produced a small number of machines of this type ranging from 64 processors to 512 processors. These machines are used inside Fujitsu and at various universities in Japan and other countries for parallel processing research. At this time, Fujitsu has no plans to try to market this computer. Highlights of the AP1000 architecture include three communication networks, nodes composed of SPARC processors with Weitek floating point co-processors, and a very fast interprocessor communication mechanism. The three networks consist of one for broadcasting to the nodes from either the host or another node, one to handle barrier synchronization, and a third for interprocessor point-to-point communication. The point-to-point communication network has a torus configuration and the routing algorithm combines wormhole routing with a structured buffer pool approach to avoid deadlock. The Fujitsu representative reported that they had achieved a processing rate in the range of 2-3 GFLOPS using their 512-processor machine on the LINPACK benchmark. This places the AP1000 below both the Intel Delta machine and the Thinking Machines Connection Machine. The lower performance as compared to the Delta machine probably results mainly from the performance difference between the SPARC processor with a Weitek floating point co-processor used by the AP1000 and the Intel i860 used by the Delta machine. I should mention that the AP1000 possesses no fault tolerance features. Fujitsu has also recently developed a fault tolerant computer called the SURE SYSTEM 2000, but we were not shown anything related to this machine during the tour.

The second stop on the tour was at a Hitachi research laboratory specializing in expert systems. Here, we were shown a videotape detailing the development and use of a Hitachi expert system for planning construction of large tunnels. We were also shown statistics concerning the numbers of expert systems in place at 261 Japanese companies. The data showed the percentages of these systems that were diagnostic type (about 40%), planning type (about 27%), design type (about 16%), control type (about 6%), and other (about 11%). Again, we were not shown anything related to fault tolerant computing.

PARALLEL PROCESSOR FOR MANY-BODY CALCULATIONS, GRAPE

Project GRAPE, a parallel computer for many-body calculations, which was developed at the University of Tokyo, is summarized. The current version operates at 10 GFLOPS.

by David K. Kahaner

INTRODUCTION

Classical many-body systems play a key role in astrophysics: star clusters, galaxies, and clusters of galaxies are all described as systems of particles interacting through gravity. At the other end of the mass spectrum, plasmas and systems of molecules are also treated as many-body systems that interact through Coulomb and van der Waals forces. In astrophysics, the applications are primarily to basic science, but at the molecular level applications are to drugs, smart material design, and many other practical problems.

The basic idea is very simple: a large collection of particles starts at an initial configuration and the particles interact with each other according to Newton's law. The acceleration of each particle is determined by the forces acting on it: those generated by the other particles plus any external forces. The detailed form of the force between two particles is determined by the system but depends nonlinearly on the distance between them. (The major assumption here is that the particles are point masses.) This leads to a system of N second order ordinary differential equations (if there are N particles), all of them looking very much alike.

In the simulation of these systems one thus needs to solve the equations, and their solution will describe the subsequent motion of the particles as time progresses. In principle this is very simple; the differential equations are solved numerically by taking finite, but small, time steps. At each time step the forces are computed and the particles moved an appropriate distance. If there are N particles, there are about N^2 force interactions that need to be computed, usually by pairwise summation. For large systems this portion dominates all other aspects of the computation, as the time for other parts is usually linearly dependent on N.
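The time-stepping loop described above can be sketched in a few lines of Python. This is only an illustration, not code from the GRAPE project: the function names, the units with G = 1, the softening length `eps`, and the choice of a leapfrog integrator are all assumptions made for the example. The double loop in `direct_forces` is the pairwise summation whose cost grows as about N^2, while the position and velocity updates cost only about N.

```python
import math

def direct_forces(pos, mass, G=1.0, eps=1e-3):
    """Accelerations from direct pairwise summation (about N^2 work).

    eps is a softening length (an assumption of this sketch) that keeps
    the force finite when two particles come very close together.
    """
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = [pos[j][k] - pos[i][k] for k in range(3)]
            r2 = d[0] ** 2 + d[1] ** 2 + d[2] ** 2 + eps ** 2
            inv_r3 = 1.0 / (r2 * math.sqrt(r2))
            for k in range(3):
                # Equal and opposite contributions for the pair (i, j).
                acc[i][k] += G * mass[j] * d[k] * inv_r3
                acc[j][k] -= G * mass[i] * d[k] * inv_r3
    return acc

def leapfrog_step(pos, vel, mass, dt):
    """Advance positions and velocities in place by one small time step."""
    acc = direct_forces(pos, mass)
    for i in range(len(pos)):
        for k in range(3):
            vel[i][k] += 0.5 * dt * acc[i][k]   # half kick
            pos[i][k] += dt * vel[i][k]         # drift
    acc = direct_forces(pos, mass)
    for i in range(len(pos)):
        for k in range(3):
            vel[i][k] += 0.5 * dt * acc[i][k]   # second half kick
```

Because each pair contributes equal and opposite terms, the total momentum of an isolated system is preserved up to rounding error, which is a convenient check on any implementation of this kind.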

N-body computations are one of the major absorbers of supercomputer cycles. For example, at the Australian National University the two scientific fields using the most supercomputer time during 1990 were astrophysics and molecular dynamics [see D.K. Kahaner, "Computing and Related Science in Australia," 16(4), 47-68 (1991)]. Most of this was for N-body calculations. One 2,000-body simulation on a Fujitsu VP-400 took 40 hours, although very clever algorithms can reduce this. Molecular dynamics simulations are extensively used to study three-dimensional structures of protein molecules and the phase transition of materials.

The basic calculation is inherently parallel. Many clever techniques have been developed to speed up these calculations, including extremely sophisticated data structures for linking information about the particles and the use of fast algorithms. For example, in some situations forces die off rapidly with distance, and so it is only necessary to compute the effect of nearby particles. Thus particles could be grouped into near and distant, leading to efficient algorithms. For short-range force problems of this type, linked-list algorithms can reduce the force calculation to something proportional to N. Tree algorithms can be used in those cases when the required accuracy need not be very high. These latter algorithms can be quite complicated and might be difficult to program on fine-grained supercomputers, although the amount of computational work can be reduced from about N^2 to about N*ln(N) for the force calculation. For example, a force calculation for the i-th particle could be written as follows (a cell is a collection of nearby particles):

    subroutine treeforce(i,cell)
      if (cell and particle i are well separated)
        force = force from the center of mass of cell
      else
        force = sum of forces from the children of cell
      endif
    end

Of course, this has to be implemented either in a language allowing recursion or in Fortran by an appropriate data list.
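In a language that allows recursion, the treeforce pseudocode above translates almost line for line. The following Python sketch is one possible reading of it, not the GRAPE treecode: the `Cell` class, the median-split tree construction, the opening parameter `theta` used as the "well separated" test, and units with G = 1 are all assumptions introduced for illustration. Particle positions are assumed to be distinct.

```python
import math

class Cell:
    """A collection of nearby particles with its total mass and center of mass."""
    def __init__(self, idx, pos, mass):
        self.mass = sum(mass[i] for i in idx)
        self.com = [sum(mass[i] * pos[i][k] for i in idx) / self.mass
                    for k in range(3)]
        spans = [max(pos[i][k] for i in idx) - min(pos[i][k] for i in idx)
                 for k in range(3)]
        self.size = max(spans)                        # longest edge of the bounding box
        self.particle = idx[0] if len(idx) == 1 else None
        self.children = []
        if len(idx) > 1:
            # Split the particles into two halves along the longest axis.
            axis = spans.index(self.size)
            order = sorted(idx, key=lambda i: pos[i][axis])
            half = len(order) // 2
            self.children = [Cell(order[:half], pos, mass),
                             Cell(order[half:], pos, mass)]

def point_force(target, source, m):
    """Acceleration at target due to a point mass m located at source."""
    d = [source[k] - target[k] for k in range(3)]
    r2 = sum(x * x for x in d)
    inv_r3 = 1.0 / (r2 * math.sqrt(r2))
    return [m * x * inv_r3 for x in d]

def treeforce(i, cell, pos, theta=0.5):
    """Recursive force on particle i from all particles in cell."""
    if cell.particle == i:
        return [0.0, 0.0, 0.0]                        # no self-interaction
    dist = math.dist(pos[i], cell.com)
    if cell.particle is not None or cell.size < theta * dist:
        # Well separated (or a single particle): use the center of mass.
        return point_force(pos[i], cell.com, cell.mass)
    total = [0.0, 0.0, 0.0]
    for child in cell.children:                       # otherwise open the cell
        f = treeforce(i, child, pos, theta)
        total = [total[k] + f[k] for k in range(3)]
    return total
```

A tree would be built once per time step with `root = Cell(list(range(len(pos))), pos, mass)` and then queried with `treeforce(i, root, pos)` for each particle. Setting `theta=0.0` never accepts a multi-particle cell, so the recursion reduces to direct summation; larger values trade accuracy for speed.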

For systems on a uniform spatial mesh, fast Fourier transform (FFT) methods can also be applied. FFT algorithms also reduce computation from N^2 to N*ln(N) and can be very effective, but they only work with a uniform mesh. There is still a limit of a few tens of thousands on the number of particles that can be easily dealt with. The number of atoms in a large protein molecule exceeds 10,000, and when immersed in water the system can easily exceed 100,000 particles. Direct pairwise summation of the forces can lead to excessive computation but is very general and applicable to virtually any physical structure.

The GRAPE project (GRAPE-1, GRAPE-1A, GRAPE-2, GRAPE-2A, GRAPE-3, and GRAPE-4) has been evolving only over the past 3 years. This is a university project that has had collaborative assistance and hardware support from Fuji-Xerox (just in case you thought that they only built copiers). GRAPE-3 is complete and Makino states that its effective speed is 10 GFLOPS. A series of papers, in English, has been submitted (mostly to physics and astronomy journals) and some papers have already appeared. A bibliography is included below. (Most of these papers have multiple authors, including at least Sugimoto or Makino; thus, authors' names are omitted.) GRAPE is known in the astronomy community. In fact, Makino has just returned from a substantial visit to the Cambridge Institute of Astronomy. But it is not well known otherwise, perhaps because, until recently, the major applications of the project have been to astronomical rather than molecular systems. Because so many of the project's papers are in English, and also since the principals are easily available via E-mail, the purpose of my report is only to briefly summarize the history and current status of the project.

The fundamental idea behind GRAPE is that direct summation of forces is simple and parallel. Excellent work in this area has been demonstrated by users of the Connection Machine, but the GRAPE view is that with the advent of VLSI it ought to be possible to build a special purpose chip of modest cost that is still flexible enough. This is not the only project with this goal. The Delft Molecular Dynamics Processor (DMDP) is specifically designed for these problems, too. The major difference is that Delft's machine does all the computations, while GRAPE systems only perform the force computation. The latter allows much simpler hardware and software. (The fact that the first design for GRAPE-1 began in April 1989, and there is already a machine running at 10 GFLOPS, is a pretty good endorsement of this approach.) The hardware only performs one function; the software is simple to interface with existing Fortran or C. GRAPE is attached to a Unix host by a VME bus (the earliest version used an IEEE-488 interface). The host sends the particles' positions to GRAPE, which calculates the force on any requested particle and sends this back to the host. The total user documentation is available as a few "man" pages. (I have read several examples of GRAPE programs; most users will be able to write their own applications with very little "spin-up" time.) Of course, this means that to some extent performance of a GRAPE system is limited by the host and the communications, while this is not the case for DMDP. For molecular forces, the force computation is less expensive (because only nearby particles need be considered), and thus GRAPE will be less appropriate than DMDP, but for long-range forces the opposite will be true. Nevertheless, when costs are considered, GRAPE might be a good choice for either.
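The division of labor between host and GRAPE can be illustrated with a small software stand-in. The sketch below is hypothetical throughout: `ForceEngine`, `set_particles`, `force_on`, and `host_step` are names invented for this example and are not the GRAPE library interface, and the first-order update in `host_step` is just the simplest possible integrator. What it is meant to show is the pattern described above: the host ships positions, asks for the force on each particle, and does everything else itself.

```python
import math

class ForceEngine:
    """Software stand-in for a force-only accelerator (hypothetical interface)."""
    def __init__(self, eps=0.01):
        self.eps2 = eps * eps          # softening, an assumption of this sketch
        self.pos, self.mass = [], []

    def set_particles(self, pos, mass):
        # Host -> engine: send all positions and masses.
        self.pos = [list(p) for p in pos]
        self.mass = list(mass)

    def force_on(self, i):
        # Engine -> host: acceleration on particle i from all the others.
        acc = [0.0, 0.0, 0.0]
        xi = self.pos[i]
        for j, xj in enumerate(self.pos):
            if j == i:
                continue
            d = [xj[k] - xi[k] for k in range(3)]
            r2 = sum(x * x for x in d) + self.eps2
            w = self.mass[j] / (r2 * math.sqrt(r2))
            for k in range(3):
                acc[k] += w * d[k]
        return acc

def host_step(engine, pos, vel, mass, dt):
    """Host side: everything except the force calculation."""
    engine.set_particles(pos, mass)
    acc = [engine.force_on(i) for i in range(len(pos))]
    for i in range(len(pos)):
        for k in range(3):
            vel[i][k] += dt * acc[i][k]
            pos[i][k] += dt * vel[i][k]
```

In a layout like this the host performs only about N operations per step plus the data transfer, which is why host speed and communication can become the limiting factors once the force calculation itself is fast.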

The sequence of GRAPE systems shows clearly the evolution of the scientists' thinking about applications, as well as the availability of better hardware. GRAPE-1, -1A, and -3 are "low accuracy." Subtraction between position coordinates is in fixed point using 16, 16, and 20 bits, respectively. Accumulation of the force is 48, 48, and 56 bits, respectively. In GRAPE-2 and -2A, subtraction between coordinates and force accumulation are 64-bit floating point (double precision), while the remainder of the calculation is single precision. The GRAPE-3 system uses a custom LSI chip; up to 48 chips (strictly speaking, 46 chips for the force calculation) can be installed, allowing force calculation on a similar number of particles, in parallel. At a 10-MHz clock, GRAPE-3 can perform about 300 MFLOPS for each chip, thus about 13.8 GFLOPS when fully configured (46 x 300 MFLOPS). The full system is in operation now and the measured speed is at present 9.9 GFLOPS. Newer machines have clever pipelining, calculate not only the force but also the potential, use interpolation tables for forces, etc. They also can implement some tree algorithms. A great deal of detail concerning the hardware of GRAPE systems is given in the papers listed below, so we omit this here. Figures 1 through 6 illustrate different GRAPE systems.

Current plans are to develop GRAPE-4, which is targeted at 1 TFLOPS. This will be done by using 1,024 1-GFLOPS chips in a three-level tree. The host is connected to four controller boards and each board is connected to 16 processor boards, each with 16 pipeline chips on it (4 x 16 x 16 = 1,024 chips). The processor board is similar to that of GRAPE-3, but the wordlength will be increased to allow higher accuracy calculations. The designers hope that the entire package will consume about 5 to 10 kW and thus will be air cooled. They estimate it will cost about $2M to build.

COMMENT

GRAPE systems were designed mostly by astronomers who have real problems to solve. In some ways their approach is ad hoc, but they have taken this to the stage where new science is being done daily. It is a very impressive job and it will be interesting to see how well the systems work on molecular dynamics problems.

BIBLIOGRAPHY

"A special-purpose computer for gravitational many-body problems," Nature Japan 345(6270), 33-35 (3 May 1990).

"GRAPE: Special purpose computer for simulation of many-body problems," International Symposium on Supercomputing, Fukuoka, Japan, 6-8 November 1991.

"Treecode with a special-purpose processor," to appear in Publications of the Astronomical Society of Japan (PASJ) (1991).

"Project GRAPE: Special purpose computers for many-body problems," to appear in High Performance Computing: Research and Practice in Japan, edited by R. Mendez (John Wiley and Sons, 1992).

"A special-purpose computer for gravitational many-body systems GRAPE-2," PASJ 43, 547-555 (1991).

"GRAPE-2A: A special purpose computer for simulation of many-body systems with arbitrary central force," to appear in HICSS-25, 25th Hawaii International Conference on System Sciences, Koloa, Hawaii, 7-11 January 1992.

"A modified Aarseth code for GRAPE and vector processors," in press, PASJ (1991).

"A special-purpose N-body machine GRAPE-1," Computer Physics Comm 60, 187-194 (1990).

"GRAPE-1A: Special-purpose computer for N-body simulation with tree code," to appear in PASJ (1991).

"GRAPE-3: Highly parallelized special-purpose computer for gravitational many-body simulations," to appear in HICSS-25, 25th Hawaii International Conference on System Sciences, Koloa, Hawaii, 7-11 January 1992.
