But in a fully Josephson computer, the CPC approach claims to be able to increase clock speeds to 10 GHz, with corresponding increases in processing speed. For example, for a Josephson CPC, matrix multiplication is predicted to execute at a peak of 20 GFLOPS on a processor equipped with one floating point adder and one floating point multiplier, provided that the two matrix operands can be fetched from memory in parallel. Fast Fourier Transform (FFT) performance depends on the number of arithmetic units and the number of instruction operands that can be fetched in parallel, but a peak of 50 GFLOPS is predicted if five operands can be fetched in parallel and if there are three floating point adders and two floating point multipliers.
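
The arithmetic behind these predictions is simple: peak rate equals the clock frequency times the number of arithmetic units that can be kept busy every cycle. The sketch below restates that calculation (the unit counts and the 10-GHz clock come from the text above; the formula itself is the standard peak-rate estimate, not a published CPC model):

    # Peak-rate arithmetic implied by the predictions above: each
    # arithmetic unit retires one result per clock, so
    # peak FLOPS = clock rate x number of units kept busy.
    CLOCK_HZ = 10e9  # the 10-GHz Josephson clock claimed for the CPC

    def peak_gflops(n_adders, n_multipliers, clock_hz=CLOCK_HZ):
        """Peak rate assuming every unit produces one result per cycle."""
        return (n_adders + n_multipliers) * clock_hz / 1e9

    print(peak_gflops(1, 1))  # matrix multiply: 1 adder + 1 multiplier -> 20.0
    print(peak_gflops(3, 2))  # FFT: 3 adders + 2 multipliers -> 50.0

The operand-fetch conditions in the text are what make these peaks reachable: the units can be kept busy every cycle only if the memory pipeline delivers enough operands in parallel.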

Several versions of CPC have been designed and at least one has been built, FLATS2 (Ref 11), using silicon integrated circuits (ICs) rather than Josephson junction technology. FLATS2 is a CPC with two virtual processors that share 10 pipeline stages. Machine cycle time is 65 ns, equal to the memory cycle time. The transfer rate of memory is 117 MB/s for instructions and data. FLATS2 consists of 26 logic boards, each of which contains between 200 and 400 IC chips, connected by a backplane board and by front flat cables, mounted on an air-cooled rack chassis (57 x 62 x 37 cm), which is then packed into a cubic box along with power supplies.

FLATS2 is running. In addition to the operating system (Ref 12), a Fortran language with parallel constructs based on Jordan's Force is available. The Architecture Group has run simulations of various matrix computations based on DGEFA and DGESL from LINPACK, conjugate gradient, FFTs, and the Livermore loops. The results are interesting, but it is still too early to tell whether this technique can really be applied without Josephson devices. Further, there are some scientists who feel that traditional methods will be equally efficient.

But what is important about this research is that it presents an almost orthogonal view of how to design very high performance computers. Almost without exception today, researchers feel that highly parallel systems are the future, that is, large numbers of processors, each with its own memory. The CPC approach instead uses a single processor with shared pipelined memory and multiple instruction streams. Of course, to be most practical, it may have to await Josephson technology. Nevertheless, as a research activity it has demonstrated several extremely innovative approaches and should be followed closely. Furthermore, there is a chance that new ECL devices could be built that have the ability to function as their own latches, an important characteristic of Josephson devices. Goto told me that he had recently devised such new devices and that Hitachi was sufficiently excited about their potential to involve several others on their research staff in a more thorough study of their costs and benefits. Finally, it was reported late last year that members of the Goto project had successfully fabricated a new chip, 2.5 mm square, carrying four QFP devices. When cooled in a liquid-helium environment (-269 °C), all of the devices ran at a clock frequency of 16 GHz, corresponding to a measured switching speed of 15 ps. The linewidth of the manufactured device is 5 microns; when 0.5-micron VLSI technology is applied, it is believed that the speed can be increased by about a factor of 10.

For additional information about CPC, contact

Dr. Yasuo Wada
Technical Manager
Quantum Magneto Flux Logic Project

Bassin Shinobazu 202
2-1-42 Ikenohata
Taito-ku, Tokyo 110, Japan

REFERENCES

1. A. Engel, "Opportunities for foreign researchers in Japan: ERATO," in Japanese Information in Science, Technology and Commerce, edited by Monch, Wattenberg, Brockdorff, Krempien, and Walravens (IOS Press, 1990), pp 553-58.

2. A. Gibor, "The ERATO program," Scientific Information Bulletin 15(3), 27-30 (1990).

3. W. Hioe and E. Goto, Quantum Flux Parametron (World Scientific, Singapore, 1991).

4. K.F. Loe and E. Goto, DC Flux Parametron: A New Approach to Josephson Junction Logic (World Scientific, Singapore, 1986).

5. K. Shimizu, E. Goto, and S. Ichikawa, "CPC (cyclic pipeline computer) - An architecture suited for Josephson and pipelined machines," IEEE Transactions on Computers 38(6), 825-32 (June 1989).

6. J. Rowell et al., JTECH panel report on the Japanese Exploratory Research for Advanced Technology (ERATO) program (Science Applications International Corporation, McLean, VA, 1988).

7. H. Hosoya, W. Hioe, J. Casas, R. Kamikawai, Y. Harada, Y. Wada, H. Nakane, R. Suda, and E. Goto, to be published in IEEE Transactions on Applied Superconductivity.

8. N.P. Jouppi, "The nonuniform distribution of instruction-level and machine parallelism and its effect on performance," IEEE Transactions on Computers 38(12) (December 1989).

9. J.L. Hennessy and D.A. Patterson, Computer Architecture: A Quantitative Approach (Morgan Kaufmann Publishers, 1990).

10. G. Pfister and V. Norton, "Hot spot contention and combining in multistage interconnection networks," ACM Transactions on Computer Systems 3(4) (October 1985).

11. S. Ichikawa, A study on the cyclic pipeline computer: FLATS2 (University of Tokyo, February 1987).

12. P. Spee, M. Sato, N. Fukazawa, and E. Goto, "The design and implementation of the CPX kernel," in Proceedings of the 7th RIKEN Symposium on Josephson Electronics, Wako-shi, Japan, 23 March 1990, pp 10-20.

Paul Spee received his M.Sc. degree from the Delft University of Technology in 1986. From 1984 to 1986 he was employed by the Delft University of Technology. Since 1986 he has been a researcher for the Research Development Corp. of Japan (JRDC). His areas of interest include parallel architectures, operating systems, and parallel and concurrent programming languages.

PARALLEL PROCESSING RESEARCH IN JAPAN

by Rishiyur Nikhil and David K. Kahaner

Parallel processing research (mostly associated with the dataflow model) and database research in Japan, based on visits to various laboratories and attendance at the IEEE Data Engineering Conference in Kobe (10-12 April 1991), are summarized.

INTRODUCTION

Prof. Rishiyur Nikhil
Massachusetts Institute of Technology (MIT)
Laboratory for Computer Science
545 Technology Square
Cambridge, MA 02139
Tel: (617)-253-0237
Fax: (617)-253-6652
Email: nikhil@lcs.mit.edu

Prof. Nikhil spent 2 weeks in Japan (April 1991) visiting five research laboratories and attending the IEEE Data Engineering Conference in Kobe. What follows are Nikhil's observations along with Kahaner's comments, when relevant.

PARALLEL PROCESSING

Electrotechnical Laboratory

The projects at the Electrotechnical Laboratory (ETL), the Institute for New Generation Computer Technology (ICOT), and Prof. H. Terada's laboratory at Osaka University are primarily focused on parallel processing architectures and languages and are the closest to Nikhil's own research work. Nikhil continues to be very highly impressed with the machine-building capabilities of his Japanese colleagues, but he thinks that, with the exception of the ICOT researchers, many of them are still very weak in software. Nikhil comments that it is quite breathtaking to see how quickly new research machines are designed and built, and by such small teams--and he wishes we could do as well in the United States. However, once built, their machines do not seem to be evaluated thoroughly; good languages, compilers, and runtime systems are not developed and (consequently, perhaps) very few applications are written. The new designs, therefore, do not benefit from any deep lessons learned from previous designs. Also, another consequence of the "hardware-centric" nature of the machine builders is that certain functions are built into hardware that one would expect to be done in software (such as resource allocation decisions and load monitoring in the ETL machines).

According to Nikhil, ETL's EM-4 and proposed EM-5 are the most exciting machines in Japan (and the world). The reason is this: as first elucidated in Reference 1, a large, general purpose parallel machine must be able to perform multithreading efficiently at a fine granularity, because this is the only way to deal effectively with the long internode latencies of large, parallel machines. Von Neumann processors are very bad at this, and dataflow architectures have always excelled at it. However, previous dataflow architectures (including MIT's TTDA and Monsoon and ETL's previous Sigma-1) were weak in single-thread performance and control over scheduling, two areas that are the forte of von Neumann processors. Recently, new architectures have been proposed to obtain the best of both worlds: the *T architecture at MIT and the EM-4 and EM-5 in Japan. Nikhil believes that these machines are the first truly viable parallel multiple instruction/multiple data (MIMD) machines.
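
The latency argument can be made concrete with a standard back-of-the-envelope utilization model (an illustration of the general argument, not a formula taken from Reference 1): if a thread runs for R cycles before issuing a remote request that takes L cycles to satisfy, one thread keeps the processor busy only R/(R+L) of the time, and roughly L/R additional ready threads are needed to hide the latency.

    # Why fine-grain multithreading hides internode latency: with run
    # length R cycles between remote requests of latency L cycles, one
    # thread keeps the processor busy only R / (R + L) of the time;
    # utilization approaches 1 with about L / R ready threads.
    def utilization(n_threads, run_length, latency):
        busy_cycles = n_threads * run_length
        return min(1.0, busy_cycles / (run_length + latency))

    R, L = 10, 200  # illustrative values: 10-cycle threads, 200-cycle latency
    for n in (1, 5, 10, 21):
        print(n, round(utilization(n, R, L), 2))
    # 1 thread -> 0.05 busy; about 21 threads -> fully utilized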

EM-4 (Refs 2 and 3) is a medium-sized machine (80 nodes) but does not have any floating point arithmetic. However, the chief problem is the lack of any good programming language or compiler. It is currently programmed in DFC ("dataflow C"), a very simple subset of C with single-assignment semantics. Perhaps this situation will change in the future; the ETL researchers said that they have just hired a compiler expert, but they still do not expect a good programming environment for some years. Nikhil has doubts about their choice of C as the programming language for the EM-4 and EM-5.

According to Kahaner, dataflow research at ETL has a long history, including the Sigma-1, the EM-4, and the proposed EM-5. The EM-4 was designed to have 1,024 processors. A prototype with 80 processors is running, and he was told that if the budget is maintained then the full system will be built. See his article in the Scientific Information Bulletin ["Electrotechnical Laboratory Dataflow Project," 15(4), 55-60 (1990)] and his electronically distributed report "parallel.904" of 6 November 1990.

Kahaner's interpretation of the ETL research direction is that their evolving designs are moving away from a pure dataflow model. At the same time, interest in numerical applications, which had been ambivalent, seems to have increased. Nikhil agrees that the ETL group is now more explicit about this but feels that they were always interested in general purpose computing, including scientific applications. In the atmosphere of the 1980s, when there was so much emphasis in Japan on knowledge processing, they may have emphasized symbolic aspects, but in technical discussions they usually compared their machines to vector and other supercomputers and never to "symbolic supercomputers" such as Lisp machines or ICOT's machines. In other words, they may have always considered machines from Cray, NEC, and Fujitsu and the Connection Machine to be their real competition. It is interesting to note that the Connection Machine was also initially portrayed as a supercomputer for artificial intelligence (AI); the reality today is that it is mostly used for scientific supercomputing.

Sigma-1 was pure dataflow, similar to MIT's Tagged Token Dataflow Architecture. The EM-4 is based on what the ETL group called a strongly connected arc model. Their description follows (Ref 4):

In a dataflow graph, arcs are categorized into two types: normal arcs and strongly connected arcs. A dataflow subgraph whose nodes are connected by strongly connected arcs is called a strongly connected block (SCB). There are two firing rules. One is that a node on a dataflow graph is firable when all the input arcs have their own tokens (a normal data-driven rule). The other is that after each SCB fires, all the processing elements (PE) which will execute a node in the block should execute nodes in the block exclusively.... In the EM-4, each SCB is executed in a single PE and tokens do not flow but are stored in a local register file. This property enables fast-register execution of a dataflow graph, realizes an advanced-control pipeline, and offers flexible resource management facilities.
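
A toy rendering of the two firing rules may help. In the sketch below (illustrative pseudocode only, not the EM-4's actual hardware), a normal node fires by token matching, while a strongly connected block, once its inputs arrive, runs all of its nodes on one processing element with intermediate values held in a register file rather than flowing as tokens:

    # Illustrative sketch of the two firing rules quoted above (not the
    # EM-4's actual microarchitecture).
    def fire_normal_node(op, tokens, needed):
        # Normal data-driven rule: firable only when every input arc
        # carries a token.
        if len(tokens) == needed:
            return op(*tokens)
        return None  # wait for more tokens

    def fire_scb(nodes, inputs):
        # SCB rule: the whole block runs exclusively on one PE; tokens
        # do not flow between nodes but live in local "registers".
        regs = dict(inputs)
        for name, op, srcs in nodes:  # nodes in dependence order
            regs[name] = op(*(regs[s] for s in srcs))
        return regs

    print(fire_normal_node(lambda x, y: x + y, [3, 4], needed=2))  # fires: 7
    print(fire_normal_node(lambda x, y: x + y, [3], needed=2))     # None: waits

    # An SCB computing (a + b) * a entirely in registers.
    scb = [("t", lambda a, b: a + b, ("a", "b")),
           ("out", lambda t, a: t * a, ("t", "a"))]
    print(fire_scb(scb, {"a": 3, "b": 4})["out"])  # 21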

The designers also wrote in 1989:

The dataflow concept can be applied not only to numerical computations involved in scientific and technological applications but also to symbolic manipulations involved in knowledge information processing. The application field of the EM-4 is now focused on the latter.

EM-4 was not originally designed to have floating point support, but Kahaner was told that this was also a budgetary issue.

The objectives of the EM-5 are as follows (Ref 4):

to develop a feasible parallel supercomputer including more than 16,384 processors for general use, e.g., for numerical computation, symbolic computation, and large scale simulations. The target performance is more than 1.3 TFLOPS, i.e., 1.3 x 10^12 FLOPS (double precision), and 655 GIPS. Unlike the EM-4, the EM-5 is not a dataflow machine in any sense. It exploits side-effects and it treats location-oriented computation [see note below]. In addition the EM-5 is a 64-bit machine while the EM-4 is a 32-bit machine.

The EM-5 will be based on a "layered activation model," a further generalization of the strongly connected arc model of the EM-4.

The machine will be highly pipelined, with a 25-ns clock and a 25-ns pipeline pitch. This is half the pitch of the EM-4, largely because of the use of RISC technology. Each of the up to 16,384 processors (called EMC-G) is 64-bit and RISC, with global addressing and no embedded network switch. Similarly, the floating point unit will not be within the processor chip but separate, like a coprocessor, because of limitations of pins and space on the chip. At the present time the designers have not decided on the topology of the interconnection network. Peak performance of the floating point unit will be 80 MFLOPS with a maximum transfer rate of 335 MB/s. The EMC-G will be built as a CMOS standard-cell chip with 391 pins and 100K gates, using 1.0-micron rules. The logical design of this processor will be completed in 1991, and the gate design of the EMC-G will be completed in 1992. A full 16,384-node system will be designed in 1993, and a prototype is planned to be operational by March 1994.
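
The per-node figures are consistent with the system target quoted earlier: 16,384 nodes times the 80-MFLOPS peak of each floating point unit gives about 1.3 TFLOPS, and the 655-GIPS figure works out to roughly 40 MIPS per node (a derived number, not one stated in the text). A quick check:

    # Sanity check of the EM-5 targets against the per-node figures.
    nodes = 16_384
    peak_mflops_per_node = 80                  # floating point unit peak
    print(nodes * peak_mflops_per_node / 1e6)  # 1.31072 TFLOPS ("more than 1.3")
    print(655e9 / nodes / 1e6)                 # ~40 MIPS per node (derived)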

With regard to languages, new work will emphasize DFC-II, as Nikhil explained. This will have sequential description and parallel execution and is not a pure functional language. DFC-II can break the single-assignment rule, and programs can contain global variables. The group is also planning to implement several other languages, such as Id and Fortran. Finally, some object-oriented model is also being considered.

In Japan at least, the ETL research group is considered to have some of the best (most creative, energetic, visionary, etc.) staff among all the nonuniversity research laboratories. Readers may be interested to know that Dr. Shuichi Sakai of ETL (the chief designer of the EM-4) is now visiting the dataflow group at MIT for 1 year, as of 1 April 1991. He will be assisting the group in the design of the new *T machine, which Nikhil mentioned above (Ref 5). *T is based on Nikhil's previous work on the P-RISC architecture (Ref 6) and is a synthesis of dataflow and von Neumann architectures (Nikhil says that one should think of it as a step beyond EM-4-like machines). The group plans to build this machine in collaboration with Motorola, in a 3-year project that will follow the current MIT-Motorola project to build the Monsoon dataflow machine.

Kahaner passed the remarks that the EM-5 will NOT be a dataflow machine on to Nikhil, who was also quite surprised. Nikhil comments that the EM-5 is not fundamentally different from the EM-4. In both machines, as well as in MIT's P-RISC and *T, the execution model is a HYBRID of the dataflow and von Neumann models. In MIT's terminology, a program is a dataflow graph in which each node is a "thread." ETL's equivalent of MIT's "thread" is the SCB, or strongly connected block.

Dataflow execution is used to trigger and schedule threads, just as in previous dataflow machines. In MIT's *T, this scheduling happens in the Start Coprocessor; in ETL's machines, it happens in the FMU (fetch and matching unit).

Within a thread, instructions are scheduled using a conventional program counter, as in von Neumann machines. In MIT's *T, this happens in the Data Coprocessor; in ETL's machines, it happens in the EXU (execution unit).

In both the EM-4 and EM-5 the processor is organized as an IBU (input buffer unit), followed by an FMU, followed by an EXU. The overall execution strategy is the same in both machines.
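
Restated schematically (an illustration of the division of labor described above, not the actual EM-4/EM-5 logic): packets are buffered (IBU), dataflow-style matching makes a thread runnable when all of its tokens have arrived (FMU, or the *T Start Coprocessor), and a runnable thread then executes under an ordinary program counter (EXU, or the *T Data Coprocessor).

    # Schematic of the hybrid model: dataflow scheduling BETWEEN
    # threads, program-counter sequencing WITHIN a thread.
    from collections import deque

    waiting = {}      # FMU state: tokens collected per thread
    ready = deque()   # threads whose full token set has arrived

    def fmu_match(thread_id, token, needed):
        # Dataflow rule at thread granularity: runnable only when all
        # input tokens are present.
        waiting.setdefault(thread_id, []).append(token)
        if len(waiting[thread_id]) == needed:
            ready.append((thread_id, waiting.pop(thread_id)))

    def exu_run(code, regs):
        # Von Neumann rule inside a thread: a plain program counter.
        pc = 0
        while pc < len(code):
            regs = code[pc](regs)
            pc += 1
        return regs

    fmu_match("t0", 3, needed=2)
    fmu_match("t0", 4, needed=2)
    tid, args = ready.popleft()
    print(exu_run([lambda r: [r[0] + r[1]]], list(args)))  # [7]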

The EM-5 and EM-4 differ in smaller details: the EM-5 has newer chip technology, a separate memory for packet buffers, a finer pitch pipeline, a direct instruction pointer in packets, a floating point unit, a 64-bit architecture, etc., but the fundamental organization is the same.

Nikhil also asked Sakai about the statements in Reference 4. Sakai claims that what he meant was "... the EM-5 is not a dataflow machine in SOME sense" and faults his poor command of English for the error. With respect to the second sentence, "It exploits side-effects and it treats location-oriented computation," Nikhil is not sure what the authors meant. He explains that dataflow architectures have never prohibited side-effects or enforced single-assignment semantics. It is only dataflow languages that take this position on side-effects. Dataflow architectures merely provided support for this, while not enforcing it. Dataflow architectures are equally appropriate for other languages, such as Fortran or C.

Institute for New Generation Computer Technology

After visiting ICOT, Nikhil remarks that he got a sense of complementary strengths relative to ETL. ICOT researchers seemed to be very sophisticated with respect to parallel languages, compilers, and runtime systems; the parallel machines, on the other hand, were not that exciting.

Nikhil does not think that anyone can claim any longer that the KL1 language used extensively at ICOT is a logic programming language (ICOT researchers themselves are quite frank about this). The main remaining vestige of logic programming (albeit a very important one) is the "logic variable," which is used for asynchronous communication. Logic variables in KL1 are very similar (perhaps identical) to "I-structure variables" in Id, the programming language developed at MIT over the last 6 years.
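
The shared idea behind both is a write-once synchronizing cell: a reader of an empty cell is deferred until a producer writes it, after which the value is immutable. A minimal sketch in that spirit (patterned after the commonly described I-structure behavior; not KL1's or Id's actual implementation):

    # A minimal write-once cell in the spirit of KL1 logic variables
    # and Id I-structure variables (illustrative; not either system's
    # actual code).
    import threading

    class WriteOnceCell:
        def __init__(self):
            self._filled = threading.Event()
            self._value = None

        def write(self, value):
            if self._filled.is_set():
                raise RuntimeError("single-assignment violated")
            self._value = value
            self._filled.set()       # wake any deferred readers

        def read(self):
            self._filled.wait()      # reads of an empty cell block
            return self._value

    # Producer and consumer synchronize asynchronously through the cell.
    cell = WriteOnceCell()
    threading.Thread(target=lambda: cell.write(42)).start()
    print(cell.read())               # 42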

Regardless of whether we label KL1 as a logic programming language or not, it is certainly a very interesting and expressive language and is perhaps the largest and most heavily used parallel symbolic processing language in existence anywhere. Because of the sheer volume of applications that people are writing in KL1 and running on ICOT's parallel machines (Nikhil saw five demos from a very impressive suite of demos), ICOT researchers are certainly as experienced and sophisticated as anyone in the world about parallel implementations of symbolic processing: compilation, resource allocation, scheduling, garbage collection, etc.

ICOT's machines are not as exciting as ETL's. The original PSIs (130 KLIPS) were heavily horizontally microcoded sequential machines, and one must wonder whether they will not go the way of Lisp machines, i.e., be made obsolete by improving compiler technology on modern RISC machines. The PSIs were not originally conceived of as nodes of a parallel machine. Thus, ICOT's two Multi-PSIs, which are networks of PSIs (two-dimensional grid topology), are just short term prototypes for experimentation. ICOT researchers want to put one of the two Multi-PSIs on the Internet for open access, but they are having trouble convincing the Ministry of International Trade and Industry (MITI) to allow this.

ICOT's real parallel targets are the PIM machines, the first of which (a PIM/p) had just been delivered to ICOT during Nikhil's visit (it was not yet up and running). ICOT's machines are built by various industrial partners, of course with heavy participation in the design by ICOT researchers. There are five different PIM architectures (different node architectures, different network
