Observations on Computational Mathematics in Japan
    The International Symposium on Computational Mathematics is summarized, ...

The German National Research Center for Computer Science in Tokyo
    The Tokyo liaison office of the German National Research Center for Computer ...

Fourth Institute for Supercomputing Research Supercomputing Workshop
    Some observations on the trends and characteristics of parallel supercomputing ...

Japanese Database Activities
    David K. Kahaner
    The activities of the Japan Database Promotion Center are described and the ...

Hitachi's Various Research Laboratories
    David K. Kahaner
    Hitachi's research plan for the 1990s and computer-related research in the ...

Observations on Neural Network Research and Development in Japan
    Japan, with its very advanced VLSI technology capabilities, is in a very good ...

Electronics

X-Ray Lithography in Japan
    Kenneth L. Davis
    A study panel was formed by the Office of Naval Research and DARPA to evaluate ...

General

A Changing Paradigm for Industry-University Cooperation in Japan
    Toyohashi University of Technology and its sister university in Nagaoka, estab- ...

Life Science

Marine Biology and Biotechnology in Japan
    Aharon Gibor
    Japan's economy is a prime example of the dependence on marine biomass ...

The First International Conference on Brain Electromagnetic Topography
    This conference focused on two major areas: the recording and displaying of ...

Materials Science

Trends in Materials Processing in Japan
    T.W. Eagar
    This paper briefly describes the latest trends in materials processing in Japan as ...

Ocean Science

The Republic of Korea Navy Ocean and Underwater Medical Research and Training Center
    Neal A. Naito
    This article describes the Korean Ocean and Underwater Medical Research and ...

Seminar on Autonomous Underwater Vehicles
    Autonomous underwater vehicles (AUVs) are a new and rapidly developing tool ...

TECHNO-OCEAN '90
    TECHNO-OCEAN has emerged as the premier ocean technology conference for ...

Satellite Remote Sensing in Japanese Oceanography
    Because Japanese scientists are just starting to focus on global problems in earth ...

Cover: Drawing of the three-compartment saturation dive chamber complex at the Korean Navy Ocean and Underwater Medical Research and Training Center (OUMRTC). The complex consists of a main chamber that houses the living quarters for mission personnel, a subchamber that contains shower and toilet facilities, and a wet diving pot. Courtesy of Neal A. Naito. See his article on OUMRTC on page 115.

SIBRIEF Scientific Information Briefs

/40 is multiprocessor. Their nomenclature is mildly confusing, as the designation /x0 corresponds to the number of scalar rather than vector units, even though the latter determine peak performance. Fujitsu is deeply interested in multiprocessing; one indication has been their joint research with NEC and Hitachi, sponsored by the Ministry of International Trade and Industry (MITI) and called informally the HPP project, involving four VP2600s, each operating as a uniprocessor attached to a very large shared buffer memory. Fujitsu claims that such a large multiprocessor was developed mainly to demonstrate their success with room temperature HEMT devices (see below) as the communications drivers between the computers and memory. Nevertheless, using this, an NEC researcher was able to solve a very large system of 32K linear equations in less than 11 hours. Fujitsu is probably experimenting on a /40 multiprocessor for the VP2600, but has not released any public information about this.
Without a /40 for the VP2600, Fujitsu's VP2000 series peak performance (however unrelated to actual performance) will fall short of current competition from NEC as well as new machines from Cray, and perhaps others. In the meantime though, the VP2000 series comes in a variety of colors, including Elegance Red, Future White, and Florence Green.

Peak performance values of the /10 and /20 models in any line are the same, as this is determined entirely by vector processing. Peak performance can easily be computed once the machine cycle time and the maximum possible number of simultaneous floating point operations are known. For example, the VP2400/40 and VP2600 each have cycle times of 3.2 ns. Achieving the advertised 5.0 GFLOPS peak therefore implies 16 simultaneous floating point operations (16/3.2=5.0). For the VP2400/40 this requires 8 per vector unit, while the VP2600/20 requires all 16 from one vector unit. Each of Fujitsu's vector units is described as having two arithmetic pipes, but in reality they are more complicated. Each pipe is capable of simultaneously performing both an addition and a multiplication. In addition, the pipes effectively deliver twice (VP2400/40) or four times (VP2600/20) as much data. Thus each pipe on the VP2600/20 can produce four floating point additions and four floating point multiplications per cycle. This is similar to the "superword" concept on the ill-fated Cyber 205. Of course, if a calculation is dyadic, that is, does not involve both a multiplication and an addition, then the peak performance will be reduced by 50%.

By studying the performance of VP2000 machines on typical job streams, it has been observed that when the scalar unit is 100% in use, the vector unit is about 50% to 75% busy. Thus, the addition of a second scalar unit can significantly increase throughput, and was presumably Fujitsu's reason for adding it. However, for any single user problem it might not be possible to keep the vector unit constantly busy.
Thus, the most practical environment for such a setup would be a computing center or other multiuser job shop, where several user jobs can be run simultaneously. Kyoto University, a typical busy university computing center, will be getting a VP2600/10 soon. We asked why they were getting only one scalar processor. Although the university made a very strong case for two scalar processors, the Ministry of Education decided (on budgetary or other grounds) to support only the system with one scalar processor. However, it is an easy field upgrade to add the second scalar unit. The choice of a VP2600/10 rather than a VP2400/40 was a matter of policy: Kyoto has always tried to purchase the fastest machine available. It is also possible that they would like to upgrade eventually to a multiprocessor 2600 when this is available.

As is the case with most of today's vector supercomputers, data to and from the vector arithmetic units need to pass through vector registers. In the VP2600 these registers have a capacity of 128 KB (64 elements times 256 registers times 8 byte data) but can be concatenated in various ways, for example, as 2048 times 8 times 8 instead. Thus, the organization of the registers is very flexible. To get data between memory and the vector registers, Fujitsu only provides two load/store pipelines. This could be a bottleneck, although the NEC register flexibility may alleviate it to a certain extent.

Memory to register bandwidth has been criticized in the VP2000 series, but at least one new benchmark, given below, suggests that Fujitsu has been making efforts to deal with this. The computation of interest is that of multiplying large matrices A=B*C, each of which is 4096 by 4096, with real 64-bit floating point components. The source program is written in 100% standard Fortran but is organized to take advantage of the two-pipe structure of the VP2000 architecture in a very clear way. The essential segment of the source program consists of first zeroing the target array.

      DO 4000 J=1,4096
      DO 4000 I=1,2048
C     ...
 4000 CONTINUE

Then the actual multiplication is as follows.

      DO 5000 L=0,1
      DO 5000 J=1,4096
C     ...
      A(I,J)=A(I,J)+B(I,K)*C(K,J)+B(I,K+1)*C(K+1,J)
     &      +B(I,K+2)*C(K+2,J)+B(I,K+3)*C(K+3,J)
 5000 CONTINUE

In this case the matrices are large enough that there is significant memory to register to memory traffic. Nevertheless, Fujitsu's FORT77/VP compiler is able to vectorize this effectively and generate 4.8 GFLOPS, 96% of peak performance.

I visited this factory in March 1990 and reported on the SX-3 in a previous issue of the Scientific Information Bulletin ["NEC's new supercomputer, the SX-3," 15(3), 4-6 (1990)]. Then the only running system had one processor. Now, several one-processor machines are being tested prior to shipment and a two-processor system has been set up and is being debugged. Chief designer Watanabe stated that a one-processor system, depending upon peripheral options, would cost in the neighborhood of $10M. He claimed that the four-processor system will be up in a few months, and we have heard estimates that it will cost roughly $25M.

Peak performance of a uniprocessor system is 5.5 GFLOPS, based on a cycle time of 2.9 ns and 16 simultaneous operations (16/2.9=5.5). The vector unit in such a system consists of one, two, or four sets of vector pipelines. Each vector pipeline set consists of two add/shift and two multiply/logical functional pipelines. Each of the functional pipelines can be operated simultaneously; thus, the arithmetic processor in a uniprocessor system with four vector pipeline sets can execute up to 16 floating point operations per machine cycle. To get near peak performance, all 16 pipes must be kept busy. Data are fed to and exit from the arithmetic pipes to vector registers, with a maximum capacity of 144 KB.
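The peak figures quoted in this report all come from the same arithmetic: the number of floating point operations that can complete in one machine cycle, divided by the cycle time. The short Python sketch below is only an illustration of that calculation; the function name `peak_gflops` and its parameters are hypothetical, and the inputs are the cycle times and operation counts stated in the text.

```python
def peak_gflops(ops_per_cycle: int, cycle_time_ns: float) -> float:
    """Peak rate in GFLOPS.

    One operation per nanosecond is 10**9 operations per second,
    so operations per cycle divided by the cycle time in ns is
    already expressed in GFLOPS.
    """
    return ops_per_cycle / cycle_time_ns


# NEC SX-3, one processor: 16 simultaneous operations, 2.9 ns cycle.
sx3_one_processor = peak_gflops(16, 2.9)

# NEC SX-3, four processors: 4 * 16 = 64 simultaneous operations.
sx3_four_processors = peak_gflops(4 * 16, 2.9)

# Fujitsu VP2600: 16 simultaneous operations, 3.2 ns cycle.
vp2600 = peak_gflops(16, 3.2)

# A calculation using only additions or only multiplications
# can keep at most half of the functional units busy.
vp2600_add_only = vp2600 / 2

print(f"SX-3, 1 processor : {sx3_one_processor:.1f} GFLOPS")
print(f"SX-3, 4 processors: {sx3_four_processors:.1f} GFLOPS")
print(f"VP2600            : {vp2600:.1f} GFLOPS")
print(f"VP2600, add only  : {vp2600_add_only:.1f} GFLOPS")
```

These are upper bounds only: they assume that every functional pipeline delivers a result on every cycle, which in turn assumes that the operands are already in the vector registers.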
It is unlikely that an SX-3 system would be purchased without all four pipes in each processor. The four-processor system is thus capable of 22 GFLOPS peak, although this assumes that all the data can be kept in the vector registers. To the extent that data must be brought from main memory to the registers, performance may degrade. The bandwidth between memory and the registers depends on the memory hardware technology, and on how the data are arranged in the memory banks, but serious applications must keep data in registers to get good performance. Further, 22 GFLOPS requires 64 simultaneous operations, and this will mean that different operations have to occur simultaneously. Also, unless the user program can be divided up into simultaneous, independent tasks that use the same data in the vector registers, arrays will have to be quite long to absorb the startup penalty of being parcelled out to several processors. The most effective environment for such multiprocessors is a busy multiuser computer center, similar to that for other large multiprocessors. Most computer