THE INTERNATIONAL SYMPOSIUM ON SHARED MEMORY MULTIPROCESSING (ISSMM)

Introduction

The International Symposium on Shared Memory Multiprocessing (ISSMM) was held in Tokyo on 24 April 1991. ISSMM was sponsored by the Information Processing Society of Japan (IPSJ), with Professor Tadashi Masuda from the University of Tokyo serving as general chairman and Dr. Norihisa Suzuki serving as program chairman.

Dr. Norihisa Suzuki
Director, IBM Tokyo Research Laboratory (TRL)
5-19 Sanban-cho
Chiyoda-ku, Tokyo 102, Japan
Tel: 81-3-3288-8300
Email: nsuzuki@trlvm1.vnet.ibm.com

Both Japanese and U.S. industry and academia were represented on the program committee. The proceedings were published by the Information Processing Society of Japan, and a version of them is expected to be published by MIT Press, with Dr. Suzuki serving as the editor. The current plan is to have the second instance of the conference in northern California in late 1992.

The conference included presentations of 23 refereed papers. Of the papers, 5 were from Japan and the remaining 18 were from abroad. The attendance was approximately 130, consisting of about 100 Japanese and 30 foreigners.

There was also a panel on the feasibility of large-scale shared memory multiprocessors. Thacker (DEC SRC) started off by making four observations:

(1) Single instruction/multiple data (SIMD) machines work well for large, regular problems.

(2) Multiple instruction/multiple data (MIMD) machines work well for less regular problems.

(3) Shared memory machines work well when they don't share and don't synchronize.

(4) Programming large, shared MIMD machines is much harder than programming SIMD machines.

Koike (NEC) commented that the history of processor design indicates that shared memory machines consisting of thousands of processors are not feasible in the near term. In addition, he recommended that we focus carefully on specific application domains and on the kind of parallelism they require to get the most from existing (and near-future) parallel machines. (Note that this is consistent with the discussion of the Cenju system, described below, in which a special-purpose machine was built that, after the fact, was found to be more general-purpose than intended.)

Mr. Nobuhiko Koike
Manager, Computer System Research Laboratory
C&C Systems Research Laboratories
NEC Corp.
1-1, 4-chome, Miyazaki, Miyamae-ku
Kawasaki City, Kanagawa 213, Japan
Tel: (044) 856-2127
Fax: (044) 856-2231
Email: koike@csl.cl.nec.co.jp

Baskett (Silicon Graphics) argued that for many problems, computational complexity grows faster with problem size than read/write complexity does, and so large problem sizes need relatively less bandwidth than small problem sizes. He concluded that building large-scale machines is feasible if we intend to run large problem sizes on them.
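Baskett's specific applications are not listed in this summary; as a rough illustration (dense matrix multiplication is assumed here as a stand-in), the arithmetic work can grow as n^3 while the data read and written grow only as n^2, so the flops-per-byte demand rises with problem size:

```python
# Illustrative only: dense n x n matrix multiply as an assumed stand-in for
# the "large problem" case Baskett describes. Arithmetic grows as n^3 while
# the data that must be moved grows as n^2, so the flops-per-byte ratio
# rises with problem size and relative bandwidth demand falls.

def flops_per_byte(n: int, word_bytes: int = 8) -> float:
    flops = 2 * n ** 3                      # n^3 multiply-add pairs
    bytes_moved = 3 * n ** 2 * word_bytes   # read A and B, write C (once each)
    return flops / bytes_moved

for n in (100, 1_000, 10_000):
    print(f"n = {n:>6}: {flops_per_byte(n):8.1f} flops per byte moved")
```

On this crude accounting, the ratio at n = 10,000 is about 100 times the ratio at n = 100, which is the sense in which large problems need relatively less bandwidth.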

Gifford (Massachusetts Institute of Technology) said that such machines are feasible, but that machines with 1,000+ processors cannot have balanced memory access and, therefore, we need some kind of remote procedure call or message-passing protocol. For this reason, the term "multiprocessor" must be changed to "multicomputer." For programming ease, he stated that it was critical to name entities that are both local (in shared name spaces) and remote (in nonshared name spaces) uniformly; the performance consequences of this are acceptable, he suggested, if you don't in fact share very much in practice. The reason is that, in order to get machines on the order of 100 or more processors, we will have to manage nonuniform access times. Overall, Gifford saw the merging of two research streams: (1) the flow of uniform access time machines (like TOP-1) to nonuniform access time machines (like Dash and Alewife) and (2) the flow of multicomputers (like hypercubes) to high-performance multicomputers (like the J-Machine). The central issue in this merging is the programming model.
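A minimal sketch of the uniform-naming idea Gifford describes (the Node class, the address partitioning, and the remote-read stub below are hypothetical illustrations, not anything from the talk): every word is read through the same call, while the underlying access may be local or remote, and therefore nonuniform in cost.

```python
# Hypothetical sketch of uniform naming over nonuniform access times:
# every address is read through the same call, but addresses owned by
# another node are fetched via a (slower) message/RPC-style request.

from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: int
    words_per_node: int
    memory: dict = field(default_factory=dict)   # this node's share of the global address space
    peers: dict = field(default_factory=dict)    # node_id -> Node (stands in for the network)

    def owner(self, addr: int) -> int:
        return addr // self.words_per_node

    def read(self, addr: int):
        """Uniform name: callers never say 'local' or 'remote'."""
        home = self.owner(addr)
        if home == self.node_id:
            return self.memory.get(addr, 0)          # fast local access
        return self.peers[home]._remote_read(addr)   # slower, message-passing path

    def _remote_read(self, addr: int):
        return self.memory.get(addr, 0)

# Two-node example: node 0 reads a word that happens to live on node 1.
n0, n1 = Node(0, 1024), Node(1, 1024)
n0.peers[1], n1.peers[0] = n1, n0
n1.memory[1500] = 42
print(n0.read(1500))   # prints 42, via the "remote" path
```

The program text is the same either way; only the cost differs, which is acceptable, on Gifford's argument, as long as little is actually shared.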

Goodman (University of Wisconsin) concluded the panel by observing that shared memory machines are easy to program but hard to build, while nonshared memory machines are easy to build but hard to program. He believes that the trend is to merge the two by building nonshared memory machines with hardware primitives that support cache-coherent shared memory. He believes the challenge is to exploit the shared memory programming paradigms, perhaps extended in some ways, to nonshared memory algorithms and synchronization approaches. He felt that the communications problem is not bandwidth but latency, and that shared memory multiprocessors (SMMPs) must minimize misses (need HUGE caches; must avoid sharing data that are private; must use appropriate algorithms) and accommodate latency (need to pay special attention to processor architecture; must have efficient pre-fetch, efficient synchronization, and nonblocking reads and writes).

Wong (Japan Research and Development Corp.) said that we don't have a suitable large problem to run on a huge SMMP. Now we build special-purpose machines to solve special problems. Halstead (DEC) pointed out that we need cleaner communications protocols. An attendee from IBM Japan asked, "Why do we need large-scale multiprocessors anyway?" Goodman answered, "Because big machines make innovations for small machines."

Y. Muraoka (Waseda University) said to look towards machines with hundreds of thousands of processors, based on wafer-scale technology. He observed that we should not forget Amdahl's Law when considering the relationship between increasing problem size and the increasing number of processors. He finished by stating that parallelizing compilers aren't working well enough, and that the need for distributed memory and programming paradigms is clear. He also pointed out that if a hypercube were to have more than 10 processors, we would need at least 1,000 wires per processor. Speedups achieved with current software paradigms are far less than the relative hardware speedups.
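For reference (this is the standard statement, not a formula from Muraoka's talk), Amdahl's Law bounds the speedup S attainable on N processors when a fraction s of the work is inherently serial:

$$ S(N) = \frac{1}{\,s + (1 - s)/N\,} \;\le\; \frac{1}{s}. $$

Even with s = 0.01, no number of processors yields a speedup beyond 100, which is the caution behind weighing problem size against processor count.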

Prof. Yoichi Muraoka
Waseda University
3-4-1 Okubo
Shinjuku, Tokyo 169, Japan
Tel: (03) 3203-4141, x73-5187
Fax: (03) 3200-1681
Email: muraoka@jpnwas00.bitnet

Discussion of Some Papers

"Cenju: A Multiprocessor System with a Distributed Shared Memory Scheme for Modular Circuit Simulation," T. Nakata, N. Tanabe, N. Kajihara, S. Matsushita, H. Onozuka, Y. Asano, and N. Koike (NEC). Cenju is an experimental multiprocessor system with a distributed shared memory scheme developed mainly for circuit simulation. The system is composed of 64 processing elements (PEs), which are divided into 8 clusters. In each cluster, 8 PEs are connected by a cluster bus. The cluster buses are in turn connected by a multistage network to form the whole system. Each PE consists of MC68020 (20 MHz), MC68882 (20 MHz), 4 MB of RAM, and a floating point processor, WTL1167 (20 MHz). In this system, a distributed shared memory scheme in which each PE contains a part of the whole global memory is adopted. The simulation

algorithm used is hierarchical modular simulation in which the circuit to be simulated is divided into subcircuits connected by an interconnection network. For the 64-processor system, a speedup of 14.7 and 15.8 was attained for two DRAM circuits. Furthermore, by parallelizing the serial bottleneck, a speedup of 25.8 could be realized. In this article, the simulation algorithm and the architecture of the system are described, along with some preliminary evaluation of the memory scheme. The picture of the system showed three cabinets, each cabinet having four bays, the bottom bay of which is used for power. The interconnections were done with many ribbon-cables. [Kahaner comments that Cenju was reported on as a machine for transient analysis of circuits, for which it was originally built [I.S. Duff and D.K. Kahaner, "Two Japanese Approaches to Circuit Simulation," Scientific Information Bulletin 16(1), 21-26 (1991)]. Recently though, NEC researchers have been studying other applications and reported on a magnetohydrodynamic computation at the annual parallel processing meeting in May 1991.]
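As a back-of-the-envelope check (the authors' own performance analysis is not reproduced here), Amdahl's Law can be inverted to ask what serial fraction is consistent with the reported speedups on 64 PEs:

```python
# Rough Amdahl's-law reading of the reported Cenju speedups (illustrative
# only; this is not the performance model used in the paper).

def serial_fraction(speedup: float, p: int) -> float:
    """Invert S = 1 / (s + (1 - s)/p) to get the implied serial fraction s."""
    return (p / speedup - 1.0) / (p - 1.0)

p = 64
for observed in (14.7, 15.8):
    s = serial_fraction(observed, p)
    print(f"speedup {observed} on {p} PEs -> serial fraction ~{s:.3f}, "
          f"asymptotic limit ~{1.0 / s:.0f}")
```

On this crude model, the implied serial fraction is roughly 5%, capping the speedup near 19-21 regardless of PE count, which is why parallelizing the serial bottleneck is what opens the way to the reported 25.8.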

"MUSTARD: A Multiprocessor Unix for Embedded Real-Time Systems," S. Hiroya, T. Momoi, and K. Nihei (NEC). MUSTARD is a portable multiprocessor Unix for microprocessor embedded real-time systems. This Unix is a two-layered operating system consisting of a real-time kernel and a Unix kernel. It is operated on a tightly coupled multiprocessor without a dedicated kernel processor. In addition, to simplify the structure of the fault-tolerant system, MUSTARD supports the addition/separation of a processor during system operation. This paper presents the features, implementation, some performance measurements, hardware construction to evaluate MUSTARD, and user programming tools for MUSTARD. This machine is commercially available in

Japan now. It currently uses eight NEC V70 32-bit processors. It also makes use of "redundant" CPUs and has a 16-ms CPU switch time.

"Throughput and Fairness Analysis of Prioritized Multiprocessor Bus Arbitration Protocols," M. Ishigaki (Nomura Research), H. Takagi (IBM TRL), Y. Takahashi, and T. Hasegawa (Kyoto). Performance characteristics of bus arbitration protocols for multiprocessor computer systems are studied by queuing theoretic approach as an alternative to the previous method based on generalized Petri nets. Bus utilization of each processor is calculated numerically for a fixed priority, a cyclic priority, a batching priority, and a modified Futurebus protocol. Plotting utilizations against the rate of service requests reveals the fairness characteristics of these protocols. For instance, in the modified Futurebus protocol with statistically identical processors, the bus utilization is evenly divided to all processors at both light and heavy load conditions, while it is allotted unevenly in accordance with their priority order at medium load conditions.

"Design and Evaluation of SnoopCache-Based Multiprocessor, TOP-1," S. Shimizu, N. Oba, A. Moriwaki, and T. Nakada (IBM TRL). TOP-1 is a snoop-cache-based multiprocessor workstation that was developed to evaluate multiprocessor architecture design choices as well as to conduct research on operating systems, compilers, and applications for multiprocessor workstations. It is a 10-way multiprocessor using the Intel 80386 and Weitek 1167 and is currently running with a multiprocessor version of AIX, which was also developed at IBM's Tokyo Research Laboratory. The research interest was focused on the design of an effective snoop cache system and quantitative evaluation of its performance. One of the unique aspects of TOP-1's design is that the cache supports four different original snoop

protocols, which may coexist in the system. To evaluate the actual performance, a hardware statistics monitor, which gathers statistical data on the hardware, was implemented. This paper focuses mainly on a description of the TOP-1 memory system design with regard to the cache protocols and its evaluation by means of the hardware statistics monitor mentioned above. Besides its cache design, the TOP-1 memory system has three other unique architectural features: a high-speed bus-locking mechanism, two-way interleaved 64-bit buses supported by two snoop cache controllers per processor, and an effective arbitration mechanism to allow a prioritized quasi-round-robin service with distributed control. These features are also described in detail. [Kahaner comments that several researchers at IBM's TRL told him that there were no plans for commercialization, and the project is very much for experimentation and education.]
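TOP-1's four original protocols are not described in this summary; the sketch below is a generic write-invalidate snoop protocol with assumed Modified/Shared/Invalid states, meant only to illustrate the bus-snooping mechanism that such design choices concern.

```python
# Generic write-invalidate snoop-cache sketch (assumed MSI states). This is
# NOT one of TOP-1's four protocols; it only illustrates bus snooping.

M, S, I = "Modified", "Shared", "Invalid"

class SnoopCache:
    def __init__(self, name, bus):
        self.name, self.lines, self.bus = name, {}, bus
        bus.caches.append(self)

    def read(self, addr):
        if self.lines.get(addr, I) == I:             # miss: fetch over the bus
            self.bus.broadcast("BusRd", addr, self)
            self.lines[addr] = S
        return self.lines[addr]

    def write(self, addr):
        if self.lines.get(addr, I) != M:             # need exclusive ownership
            self.bus.broadcast("BusRdX", addr, self)  # invalidate other copies
            self.lines[addr] = M

    def snoop(self, op, addr):
        state = self.lines.get(addr, I)
        if state == I:
            return
        if op == "BusRdX":
            self.lines[addr] = I                      # another cache will write
        elif op == "BusRd" and state == M:
            self.lines[addr] = S                      # supply data, drop to Shared
            # (a real protocol would also write the dirty line back to memory)

class Bus:
    def __init__(self):
        self.caches = []
    def broadcast(self, op, addr, requester):
        for c in self.caches:
            if c is not requester:
                c.snoop(op, addr)

bus = Bus()
c0, c1 = SnoopCache("PE0", bus), SnoopCache("PE1", bus)
c0.write(0x100)                             # PE0 owns the line (Modified)
print(c1.read(0x100))                       # PE1's read forces PE0 down to Shared
print(c0.lines[0x100], c1.lines[0x100])     # Shared Shared
```

The design questions TOP-1 explores (which states to keep, when to invalidate versus update, how protocols coexist) all sit inside transitions of this kind.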

"The Kyushu University Reconfigurable Parallel Processor--Cache Architecture and Cache Coherence Schemes," S. Mori, K. Murakami, E. Iwata, A. Fukuda, and S. Tomita (Kyushu). The Kyushu University Reconfigurable Parallel Processor system is an MIMD-type multiprocessor that consists of 128 PEs interconnected by a full (128x128) crossbar network. The system employs reconfigurable memory architecture, a kind of local/remote memory architecture, and encompasses a shared memory TCMP (tightly coupled multiprocessor) and a message-passing LCMP (loosely coupled multiprocessor). When the system is configured cessor). When the system is configured as a shared memory TCMP, memory contentions will be obstacles to the performance. To relieve the effects, the system provides each PE with a private unified cache. Each PE may have the cached copy of shared data in its cache whether it accesses to local or remote memory and, therefore, the

multicache consistency, or inter-cache coherence, problem arises. The cache is a virtual-address direct-mapped cache to meet the requirements for the hit time and size. The virtual-address cache implementation causes the other consistency problem, the synonym problem, called the intra-cache coherence problem. This paper presents four cache coherence schemes for resolving these cache coherence problems: (1) cacheability marking scheme, (2) fast selective invalidation scheme, (3) distributed limited-directory scheme, and (4) dualdirectory cache scheme. Cache coherence protocols and their trade-offs among several aspects are also discussed. [Kahaner comments that the Kyushu work was described in the electronic report parallel.904 (6 Nov 1990). He commented that given the resources available at Kyushu, a project like this might be best thought of as a mechanism for experience building and training. Another paper on this was also presented at the annual parallel processing meeting in May 1991. One of the project's principal investigators, S. Tomita, has recently moved to Kyoto University.]--David Notkin, University of Washington; John Cowles, Convex Computer Corp.; and David K. Kahaner, ONRASIA

ADVANCES IN CERAMIC POWDER PROCESSING SCIENCE

Production of high quality powders of the right shape and size and processing of these powders to form defect-free ceramic bodies are important to the development of advanced ceramics for high-tech components with high reliability. These are, therefore, research areas of high priority in the field of ceramics. To review the progress in the area of powder synthesis and processing, a series of conferences called the International Conferences on Ceramic Powder Processing Science was initiated 8 years ago. At the fourth in the series, which was held in Nagoya on 13-15 March 1991, each of the integral stages of powder processing and their interrelationships were discussed. During the course of the conference it was decided that the scope of the conference should be enlarged from ceramic powder processing science to ceramic processing science.

Agglomerates and inclusions are the sources of defects in the final ceramic body. To minimize them it is necessary to understand long range repulsive interparticle potentials in slurries with a low particle fraction. Dr. Fred F. Lange of the University of California, Santa Barbara, in his opening speech, discussed the influence of interparticle potentials on the rheology of the slurry and particle packing. The experimental results described by him confirmed his conclusions that highly dispersed slurries, produced with highly repulsive, long range electrostatic potentials, can be converted into weakly flocced slurries by the addition of salt. These weakly flocced or coagulated slurries can, when consolidated by pressure filtration or centrifugation, give the highest particle packing without mass segregation. Lange explained that a coagulative system was more desirable than a flocced slurry because in a coagulative system particles attract each other but do not touch. He also showed experimental results that indicated that the consolidated bodies prepared from coagulated slurries relax strain easily, while those from flocced slurries do not and are prone to cracking.

Another way of achieving defect-free ceramic bodies is to control the nucleation and growth of particles from the precursor solutions. Prof. Gary Messing of Penn State University reviewed the principles involved in the tailoring of precursor systems to control nucleation and phase development.

He described results of his work on mullite and alumina. In the case of mullite, he obtained a very fine grained structure by seeding the precursor solution (TEOS plus aluminum nitrate sol) with seed crystals of gamma aluminum hydroxide. Hybrid seeds consisting of gamma aluminum hydroxide and silica gave even better results. The temperature at which these gels were dried and sintered was also found to be an important factor.

A novel process to prepare fine ceramic powders that are monodispersed, with high purity and homogeneous composition in each particle and good sinterability, was reported by Prof. S. Matsumoto of Sakai Chemical Industrial Co. A variety of compounds including TiO2, BaTiO3, MnZnFe2O4, and gamma-Fe2O3 have been prepared. The process essentially involves the interaction of precursor compounds in aqueous solution under hydrothermal conditions. For example, BaTiO3 (perovskite structure) can be prepared by the interaction of titanium hydroxide gel with Ba(OH)2. By controlling the pH, temperature, and pressure, the particle size can be controlled within narrow limits. Matsumoto described the preparation of MnFe2O4 and acicular gamma-Fe2O3 and how the addition of certain impurities modified the crystal morphology of these compounds. The acicular gamma-Fe2O3 was obtained by the interaction of ferric chloride and sodium hydroxide solution at 160 °C in the presence of an organophosphoric acid salt as an additive. These powders are used in tapes for recording and other devices.

Additives are very important as modifiers of particle size and morphology. Prof. A. Takahashi of Mie University discussed the molecular mechanisms of polymeric interactions, covering mixing with solvents, particle bridging, and steric stabilization, together with features of polymer adsorption and adsorbed polymer conformation.

Prof. Ilhan Aksay of the University of Washington, Seattle gave an excellent talk on the shape forming of ceramics from colloidal slurries. He classified processing of slurries into two categories: (1) shape forming while the fluid medium is partially drained, e.g., colloidal filtration, and (2) shape forming without any fluid drainage, e.g., injection molding or extrusion. He explained that, because of the different modes of particle-particle interaction in the two techniques, a slurry prepared for colloidal filtration is not suitable for injection molding. He also described how ultrafine inorganic particles are formed in biosystems. When the particle size is <0.1 micron, high density dispersions cannot be achieved because of clustering. To overcome this, it is necessary to use gel systems. In nature, surfactants with bilayers (and not polyelectrolytes) are involved. Under these conditions it is possible to form strings of nanosized particles that can be very densely packed.

Coating of particles with a phase that could become viscous at elevated temperatures, such as alumina particles coated with silica, can produce excellent consolidated dense bodies. Prof. M.D. Sacks of the University of Florida reported that sialon prepared from powders of silicon nitride and aluminum oxide coated with silica gave a much better product than that obtained from uncoated powders. Dr. H.K. Schmidt of the University of Saarland reviewed the synthesis of powders using sol-gel processes. Preparation of thin layer electroceramics such as titanates, zirconates, niobates, and cuprates using chemical precursors was described by Prof. D.A. Payne of the University of Illinois. He gave examples of the evolution of structure from clusters, cage structures, oligomers, network formation, gels, and amorphous and crystallized layers of these compounds. Nils Claussen of the Technical University of Hamburg showed that an isostatically pressed, attrition-milled aluminum oxide/aluminum powder mixture with siliceous additives, heated first between 800 and 1,100 °C, followed by heating at 1,100 to 1,250 °C, and then sintering at 1,250 to 1,500 °C, produces an excellent mullite ceramic. In the first heating step, aluminum is oxidized to aluminum oxide in the form of nanosized particles of alpha alumina. In the second heating step, silicon or silicon carbide is converted into cristobalite. In the third step, a reaction between alumina and cristobalite takes place, giving a dense mullite ceramic. The addition of zirconia and a humid air atmosphere enhanced the reaction velocity, according to Dr. Claussen.

Dr. H. Wada of the Government Industrial Research Institute, Shikoku described a method for the preparation of aluminum borate (9Al2O3·2B2O3) whiskers by the interaction of aluminum sulphate and boric acid in a flux of potassium sulphate. Optimum yield was obtained at 1,100 °C. A number of studies reported preparation of electronic ceramics by the spray pyrolysis technique.

Although no breakthroughs were reported at this conference, a number of invited papers described considerable advances in understanding the nucleation and growth of particles, the structure of clusters, and the influence of surfactants on these clusters and the final consolidation. Aksay's observations on the mimicking of nature in producing higher density, nanosized particle packing were quite interesting. Also, the technique of producing high density, high quality mullite and aluminum oxide by sequential heating of an isostatically pressed mixture of aluminum oxide/aluminum metal powders with and without siliceous additive, respectively, appears to be a new approach to making high quality ceramics and ceramic matrix composites.-Iqbal Ahmad, AROFE

NIPT FEASIBILITY STUDY AND WORKSHOPS


(Hirosawa works in the same office that has recently been involved in U.S./Japan chip discussions.)

The feasibility study will last for about 1 year. It follows a preliminary study that was reported on in a previous issue of the Scientific Information Bulletin [D.K. Kahaner, "New Information Processing Technologies Symposium (Sixth Generation Project)," 16(2), 45-52 (1991)]. If the feasibility study is positive, the 10-year program will begin in April 1992.

The NIPT program is now seen as consisting of eight definite projects. My own guesses as to the hardware/software components of each are in brackets. This begins to formalize the way research funds will be spent. Also, item (3) is now clearly specified as a neural computer. (This had not been decided by the NIPT Symposium in March.) While I don't know this for sure, I would assume that the technical leads on each part will not rotate, i.e., that they will stay with the program all the way through.

(1) Research on theoretical foundations of flexible information processing. [Theory]

(2) Dataflow ultra-parallel computer based on concurrent object-oriented model. [Hardware of a special kind, as well as low level (systems) software]

(3) Million neuron parallel processor. [Hardware]

(4) Adaptive massively parallel machine. [Hardware, software, but I don't really know what this means]

(5) Flexible information processing model based on modularized neural networks. [Neural network models (theory), maybe some hardware]

(6) Research on flexible understanding and flexible inference mechanism. [Theory, maybe with some software experiments]

(7) Optical neuro-computers--Theoretical modeling, device, and system technologies. [Hardware]

(8) Parallel digital optical computer architecture and algorithms. [Hardware for architecture, theory and software for algorithms]

Each project will be conducted by a consortium, consisting of companies and universities. The feasibility study will examine the following issues as well as others.

• Each project's feasibility (objectives, time-tables, consortium members, task sharing, budget, etc.).
