For example, in a DRAM circuit with 7,000 transistors, broken into 191 subcircuits, the interconnection matrix is about 200 by 200, with about 20% of its entries nonzero. [Research based on the same mathematical foundations, but without the multiple hierarchy of subcircuits, is being done at Texas Instruments Corp. (Ref 5).]

Consideration of the form of the algorithm suggested that for many problems the subcircuit processors would compute independently and then all communicate to the interconnection processor, which would then compute and finally communicate back to the subcircuit processors. Thus communication will mostly be many-to-one or one-to-many. Nakata felt that a bus structure had better throughput than more general organizations, such as a hypercube. He also felt that shared memory would be more efficient than message passing systems. In 1987 the NEC group developed a prototype system composed of four processors. From that experience they have developed a 64-processor system, called Cenju, after the Buddhist goddess with 1,000 arms. (Nakata feels that the maximum number of processors that can effectively be used on circuit problems is about 100.) He decided to use a distributed-shared memory. This means that there is a unique global address for the local memory in each processor, and that each processor has a local memory bus so it can access its local memory without worrying about bus contention. Since interprocessor communication takes place only occasionally, and building a common bus with a data-transfer rate high enough for so many processors was difficult, he designed the data bus in a hierarchical (tree-like) form.
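
As a toy illustration of this phase structure (all names and arithmetic here are invented, and ordinary loops stand in for the parallel subcircuit processors), one step might be organized as follows:

program modular_step_sketch
  ! Toy sketch of the phases described above: subcircuit "processors"
  ! compute independently, results are gathered for the interconnection
  ! solve (many-to-one), and the answer is broadcast back (one-to-many).
  implicit none
  integer, parameter :: nsub = 8
  real :: vlocal(nsub), vinter
  integer :: k
  do k = 1, nsub               ! phase 1: independent subcircuit solves
     vlocal(k) = real(k)       !          (in parallel on Cenju)
  end do
  vinter = sum(vlocal) / nsub  ! phase 2: gather + serial interconnection solve
  vlocal = vlocal + vinter     ! phase 3: broadcast the correction back
  print *, vlocal
end program modular_step_sketch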

Cenju is thus composed of 64 processors. These are divided into eight "clusters." Processors in a cluster are connected by a bus, and the clusters are connected via a multistage network. Each processor is built around a 20-MHz Motorola MC68020 CPU with 4 MB of dual-ported local memory connected by a local memory bus. The cluster bus only needs to be accessed when communicating with other processors. The design allows up to 256 processors. Since floating point computation is a fundamental part of circuit simulation, Nakata decided to use fast Weitek WTL1167 chips instead of the Motorola MC68882 chips that are most commonly used with MC68000 systems. An MC68882 is on the board and is used for math functions such as sin, cos, exp, etc., but floating point addition, subtraction, multiplication, and division are done through calls to the Weitek chip. Other details of the hardware organization can be found in Reference 6.

Nakata is a computer architect, not a numerical analyst or a circuit designer. In developing Cenju, he collaborated heavily with Norio Tanabe, who has been doing circuit simulation for many years and is in NEC's VLSI CAD Engineering Division. Some of the aspects of the algorithm were done in collaboration with other scientists in NEC's Computer and Communication Systems Research Division and Scientific Information System Development Ltd. (a subsidiary). For complete addresses, see References 4 and 6.

Nakata has tested the running system on two circuits, one with about 1,700 transistors and another with about 7,000. Both are part of the control circuitry for DRAMs. Speedups were initially reported as 15 for the small circuit and 16 for the larger. The system designer enters the circuit. With respect to domain decomposition, Nakata and Tanabe feel that it is best to take advantage of the hierarchical design used by the designers. When that cannot be done, a domain processor should be available to generate the subcircuits automatically, although for these examples they used the designers' hierarchical design: 64 subcircuits for the first circuit and 191 for the second. When there are more subcircuits than processors, there is quite a bit of extra overhead, and Nakata feels that this accounted for some degradation in performance; the matrix problem to be solved for the interconnection matrix is larger, and the matrix has about 20% nonzero elements. Its solution is done serially with respect to the other phases of the calculation. Nakata feels that additional improvements would come from a more sophisticated domain decomposer that produced fewer subcircuits, since this would directly reduce the size of the interconnection matrix. He has also made some initial efforts to parallelize the solution of the matrix problem, and work in this direction has improved the speedup for the large problem to over 20. Nakata has not done as much work on solving the matrix problem as Fukui at Toshiba (below), because most of the original sparse matrix solution has been parallelized in the subcircuit processors; what remains is the sparse matrix solution for the interconnection points. However, he feels that by utilizing better domain decomposition he can substantially increase the speedup.

Nakata originally built Cenju as a special-purpose computer, but he is now studying the possibility of using it for other applications. Nakata notes that intercluster read access is very slow due to overhead in bus-error interrupt handling and timeout control. He dealt with this in circuit simulation by utilizing, whenever possible, algorithms in which the producer of data writes that data to its consumer.

TOSHIBA'S APPROACH

One of the early applications of direct methods for the solution of sparse linear systems was the solution of problems from circuit simulation. In particular, this was the main reason for the interest in sparse matrix research of scientists from the Mathematical Sciences Department at IBM Yorktown Heights. This group, which included Bob Brayton, Gary Hachtel, Fred Gustavson, Werner Liniger, and Ralph Willoughby, conducted much fundamental research into sparse matrix techniques in the late 1960s and early 1970s. In 1970, Gustavson et al. published a paper (Ref 7) describing the use of a code-generation approach, which they termed GNSO (GeNerate and SOlve), for the solution of sparse equations. The idea of the code-generation approach is to perform a factorization of the sparse matrix using Gaussian elimination and to record each arithmetic operation involved in a linear (loop-free) set of statements. These could be Fortran, assembler code, or a compressed coded format (the interpretative code approach). The initial elimination might be a regular factorization or a symbolic one, depending on sparsity alone and not on the numerical values. As mentioned above, in the circuit context (and in many other application areas), during the Newton iteration it is required to solve several systems with a coefficient matrix of the same structure; thus, this list of operations can be used for these subsequent factorizations. In the Toshiba experiments, sometimes about 900 iterations with a Jacobian matrix of the same structure were required, so, in common with most other experiments using this scheme, the initial cost of generating the loop-free code is not reported.
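
To make the code-generation idea concrete, the following small Fortran sketch (our illustration, not the GNSO code itself; the 3-by-3 pattern and all names are invented) walks the sparsity pattern of a tiny matrix through Gaussian elimination and emits one loop-free statement per arithmetic operation, recording fill-in as it goes:

program gnso_sketch
  ! Hedged sketch of the code-generation ("generate and solve") idea:
  ! run the elimination once over the sparsity pattern only, and emit
  ! one straight-line statement per operation.  The emitted code can
  ! then be reused for every Newton iteration whose Jacobian has the
  ! same structure.
  implicit none
  integer, parameter :: n = 3
  logical :: nz(n, n)                    ! sparsity pattern of the Jacobian
  integer :: i, j, k
  nz = reshape([ .true.,  .true., .true.,  &
                 .true.,  .true., .false., &
                 .true., .false., .true. ], [n, n])
  do k = 1, n - 1
     do i = k + 1, n
        if (.not. nz(i, k)) cycle
        write (*, '(6(a,i0),a)') 'A(', i, ',', k, ') = A(', i, ',', k, &
             ') / A(', k, ',', k, ')'
        do j = k + 1, n
           if (.not. nz(k, j)) cycle
           nz(i, j) = .true.             ! record fill-in
           write (*, '(8(a,i0),a)') 'A(', i, ',', j, ') = A(', i, ',', j, &
                ') - A(', i, ',', k, ')*A(', k, ',', j, ')'
        end do
     end do
  end do
end program gnso_sketch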

Duff (Ref 8) conducted some comparisons on the early code of Gustavson, for example, and concluded that their approach was of limited use for general problems. Duff and his coauthors also comment on this approach in Reference 9 (section 9.6), where the main problems are stated to be storage (with consequent speed problems if data have to be held on auxiliary devices) and stability. Additionally, loop-free code is not well optimized by most current compilers and may not be vectorized by them. Since about one line of code is generated for each entry of the factors, this approach is most attractive when the factors can be kept relatively sparse. Thus it would not be of interest in the solution of elliptic partial differential equations but is much more suited to the types of matrices occurring in power systems and circuit simulation problems. Additionally, in many circuit problems, and especially for memory modules, there is a natural stability in the devices, and this is reflected in the equations, so that diagonal pivoting is usually stable.

These features make the code-generation approach potentially interesting in this application. We now discuss the improvements that Fukui et al. have made to the basic method to give them a competitive code. The improvements are:

(1) generation of assembler code using vector operations

(2) packing of integer index information

(3) use of block elimination

We discuss each in turn.

Using the example from their paper (Ref 10), a fragment of generated code (in Fortran) may look like

A(1) = A(1) - B*C(2)
A(3) = A(3) - B*C(4)
A(5) = A(5) - B*C(6)

and, if the programmer groups these statements suitably, one can see that the indirectly addressed vector A (addressed above through the index vector 1 3 5 ...) is updated by a scalar multiple of the indirectly addressed vector C (index vector 2 4 6 ...). Thus, using a GATHER on A and C and a final SCATTER on A, the code can be vectorized on nearly all current vector computers. This gives significant gains in both the storage of the generated code and the speed of execution. On the examples in their paper (Ref 10), the savings are a factor of just under three in both storage and time. It should be noted that the difference between generating Fortran code and (scalar) assembler code is a factor of about 30 in time.
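
As a hedged illustration (the array values are invented), the grouped statements collapse to a single vector update; in Fortran 90 terms, vector subscripts play the role of the hardware GATHER and SCATTER:

program gather_scatter_sketch
  ! The three scalar statements above become one vector operation:
  ! gather A and C through index vectors, update, and scatter back to A.
  implicit none
  integer :: ia(3) = [1, 3, 5]               ! index vector for A
  integer :: ic(3) = [2, 4, 6]               ! index vector for C
  real :: a(6) = [1., 2., 3., 4., 5., 6.]
  real :: c(6) = [6., 5., 4., 3., 2., 1.]
  real :: b = 2.0
  a(ia) = a(ia) - b * c(ic)   ! A(1)=A(1)-B*C(2), A(3)=A(3)-B*C(4), ...
  print *, a
end program gather_scatter_sketch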

If two integers are packed into a 64-bit CRAY word, the constituent integers can be quickly recovered using a shift instruction. This device is used both to decrease the memory required by the address data and to speed up the code through a reduction in memory references. A 30% speedup in CPU time is reported by the authors through this device.
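
A minimal sketch of the packing device, written with portable Fortran bit intrinsics rather than CRAY instructions (the values and names are invented):

program index_pack_sketch
  ! Pack two indices into one 64-bit word and recover them with shifts,
  ! halving the storage and the memory references for the address lists.
  use iso_fortran_env, only: int64
  implicit none
  integer :: i = 12345, j = 67890, i2, j2
  integer(int64) :: w
  w = ior(ishft(int(i, int64), 32), int(j, int64))  ! i in high half, j in low
  i2 = int(ishft(w, -32))                           ! shift recovers i
  j2 = int(iand(w, int(z'FFFFFFFF', int64)))        ! mask recovers j
  print *, i2, j2
end program index_pack_sketch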

Fukui says that his block elimination is similar to the use of matrix-vector or Level 2 BLAS (Basic Linear Algebra Subprograms) operations in Gaussian elimination on full systems. However, it is not clear how he would do that with his current approach, and it does not appear to be illustrated in the paper. From the example there, it would appear that what is done is to hold the pivot row in vector registers and use it to update all the appropriate nonpivot rows without further memory references for the pivot row. It would also be possible to update several rows as a single vector operation, but it is not clear if this is done. The gains reported for block elimination are a factor of 1.5 in both storage and time over the vector assembler code.
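
A minimal dense Fortran sketch of that reading (our guess at the scheme, not necessarily Fukui's implementation) fetches the pivot row once and reuses it for every update:

program block_elim_sketch
  ! Hedged sketch: hold the pivot row in a temporary (standing in for
  ! vector registers) and reuse it to update all rows with a nonzero
  ! in the pivot column, so the pivot row is read from memory only once.
  implicit none
  integer, parameter :: n = 4
  real :: a(n, n), piv(n)
  integer :: i, k
  call random_number(a)
  a = a + 1.0                  ! keep the pivots comfortably nonzero
  do k = 1, n - 1
     piv = a(k, :)             ! one fetch of the pivot row
     do i = k + 1, n           ! each trip is a single vector update
        a(i, k:) = a(i, k:) - (a(i, k) / piv(k)) * piv(k:)
     end do
  end do
  print *, a(n, n)
end program block_elim_sketch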

Fukui et al. also discuss (Ref 10) other issues in their adaptation of SPICE-GT, which is based on SPICE 2G.6 from Berkeley. The other main area of interest is the assembly, or formation, of the circuit simulation matrices. This is like the assembly of a finite-element matrix and thus can easily be parallelized.

Vectorization is not so easy because of complicated conditionals. They vectorize by grouping together (using indirect addressing) elements satisfying the same condition and then vectorizing these subsets using vector GATHER/SCATTER. This is very similar to techniques used by Miura and others (for example, see Reference 11) in vectorizing Monte Carlo codes. (At NEC, Nakata claims that the matrix setup phase is difficult to parallelize in a naive manner using a great many processors because of high overhead associated with shared memory contention.)
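
A small Fortran sketch of the grouping idea (the device-type test and all names are invented): gather the indices satisfying each condition once, then apply one branch-free vector operation per group:

program group_vector_sketch
  ! Hedged sketch: split a conditional loop into groups of indices that
  ! satisfy the same condition, then process each group with a single
  ! gather/scatter vector operation instead of per-element branches.
  implicit none
  integer, parameter :: n = 8
  integer :: typ(n) = [1, 2, 1, 1, 2, 2, 1, 2]   ! invented device types
  real :: g(n) = 1.0
  integer, allocatable :: grp1(:), grp2(:)
  integer :: k
  grp1 = pack([(k, k = 1, n)], typ == 1)   ! indices for condition 1
  grp2 = pack([(k, k = 1, n)], typ == 2)   ! indices for condition 2
  g(grp1) = 2.0 * g(grp1)                  ! vectorized branch 1
  g(grp2) = g(grp2) + 0.5                  ! vectorized branch 2
  print *, g
end program group_vector_sketch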

The gains over a 1984 code (which generated Fortran) were a factor of 4 in matrix assembly, 100 in solution, and under 2 in other parts of the code. The overall gain of the 1989 code is a factor of around 10. A factor of 300 compared with the original SPICE is claimed, presumably because of other algorithmic (or machine!?) changes not reported in Reference 10.

SUMMARY

The NEC-Nakata-Tanabe approach is essentially one of understanding the physical problem and then developing a hardware-software solution. The Toshiba-Fukui approach focuses much more directly on the sparse matrix problem that needs to be solved.

Nakata's work really begins with a clean sheet. Many of the relatively special needs of circuit simulation, such as many-to-one and one-to-many communication, noncontentious local memory access, and fast floating point, are designed into the Cenju multiprocessor. More work still needs to be done on the serial portion of the algorithm, the sparse linear equation solver for the interconnection points, which is a bottleneck. Work is also going forward on automatic domain decomposition. Nevertheless, the current speedups of over 20 are impressive. The Cenju design does not seem so specialized that it cannot be useful for other problems. This looks like an exciting project with a substantial amount of room for further research.

Fukui's work is very interesting, not least because it illustrates the use of a technique thought impractical by many sparse matrix researchers. It does not seem, however, to be suitable for a wider range of applications because of stability and storage issues. Also, there is a great deal of machine dependence, although code-generation routines are usually in two parts, the first being machine independent. This seems to be the case here, so that potentially a separate code generator (and index packer) could be used for each machine. It is difficult to say from the data provided how good the generated code is. The improvements recorded are impressive but would need to be benchmarked against other current codes or on standard test problems to obtain an accurate picture. Fukui has indicated his willingness to collaborate in benchmarking experiments. It should be pointed out that compilers are getting much better at generating vector code from standard matrix factorization routines, so that their output may not now end up so different from the code generated by the Fukui approach!!

Circuit simulation programs such as SPICE have a reputation of being difficult to vectorize or parallelize effectively, primarily because the type of circuit as well as the type of usage specified by the user have a significant impact on the "route" through the program. This will certainly be true when problems with long versus short transient simulation times are compared. For example, Mark Johnson (Ref 12) reported to us some statistics associated with running a standard public domain version of SPICE on two specific test problems. He observed that a few lines in the matrix decomposition routine occupy 34% of total execution time for one problem but only about 3% for another, and also that in one problem 15 times as much CPU time was spent in index operations processing the sparse matrix as in the floating point portion of the Gaussian elimination step. Thus, the improvements reported by both Nakata and Fukui are dependent on problem-specific details and may not be representative of performance on other problems.

REFERENCES

1. L. Nagel, SPICE2, A computer program to simulate semiconductor circuits, Technical Report ERL-M520 (University of California, Berkeley, May 1975).

2. D. Kahaner, C. Moler, and S. Nash, Numerical Methods and Software (Prentice-Hall, 1989).

3. J. Deutsch and A. Newton, "A multiprocessor implementation of relaxation-based electrical circuit simulation," in Proc. 21st DA Conference, 350-357 (1984).

4. T. Nakata, N. Tanabe, H. Onozuka, T. Kurobe, and N. Koike, "A multiprocessor system for modular circuit simulation," in Proc. ICCAD-87, 364-367 (1987).

5. P. Cox, R. Butler, and B. Epler, "Circuit partitioning for parallel processing," in Proc. ICCAD-86, 186-189 (1986).

6. T. Nakata, N. Tanabe, N. Kajihara, S. Matsushita, H. Onozuka, Y. Asano, and N. Koike, "Cenju: A multiprocessor system for modular circuit simulation," to be published in Computer Systems in Engineering (1990).

7. F. Gustavson, W. Liniger, and R. Willoughby, "Symbolic generation of an optimal Crout algorithm for sparse systems of linear equations," JACM 17, 87-109 (1970).

8. I. Duff, "Practical comparisons of codes for the solution of sparse linear systems," in Sparse Matrix Proceedings 1978, I. Duff and G. Stewart, editors (SIAM Press, 1979), 107-134.

9. I. Duff, A. Erisman, and J. Reid, Direct Methods for Sparse Matrices (Oxford University Press, 1986).

10. Y. Fukui, H. Yoshida, and S. Higono, "Supercomputing of circuit simulation," in Proc. Supercomputing '89, ACM Publication Order Number 415892, 81-85 (1990).

11. K. Miura and R. Babb, II, "Tradeoffs in granularity and parallelization for a Monte Carlo shower simulation code," Parallel Computing 8, 91-100 (1988).

12. M. Johnson (private communication) May 1990.

Iain S. Duff is Group Leader of Numerical Analysis in the Central Computing Department at the Rutherford Appleton Laboratory. He is also Project Leader of the Parallel Algorithms Group at CERFACS in Toulouse and is a visiting Professor of Mathematics at the University of Strathclyde. Dr. Duff completed his D.Phil. on "Analysis of Sparse Systems" at Oxford in 1972. He then was a Harkness Fellow visiting the State University of New York at Stony Brook and Stanford University and thereafter spent 2 years as a lecturer in computer science at the University of Newcastle before joining the Numerical Analysis Group at Harwell in 1975. He moved to his present position in June 1990. Dr. Duff has had several extended visits to Argonne National Laboratory, the Australian National University, the University of Colorado at Boulder, Stanford University, and the University of Umea. His principal research interests lie in sparse matrices and vector and parallel computing. He is editor of the IMA Journal of Numerical Analysis and associate editor of several other journals in numerical analysis and advanced scientific computing. He is a fellow of the Institute of Mathematics and its Applications and a member of SIAM, SMAI, and SBMAC.

FLEXIBLE AUTOMATION

by David K. Kahaner

A summary of the 1990 Japan-U.S.A. Symposium on Flexible Automation, 9-13 July, in Kyoto, Japan, is given. Plant visits to IBM's 3090 manufacturing plant, Sumitomo Electric, Mitsubishi Heavy Industries, and Matsushita Electric are described.

INTRODUCTION

Several hundred scientists from almost 10 countries met in Kyoto to present research in the area of flexible automation. This term refers to techniques that enable automated manufacturing to change from producing one type of product to another, rapidly and at low cost. It also includes simulation of the manufacturing process from the design stage to finished product. Many other related topics, such as robotics, quality control, fuzzy logic, etc., are also subsumed under this rubric. This report summarizes the major papers that were presented at this meeting and assesses them in the context of current work in the United States. For details on the papers presented, see the Appendix.

Following the meeting, several tours were arranged to interesting manufacturing facilities in the Kyoto-Osaka-Kobe area. These are also summarized and evaluations of the activities are given.

COMMENTS AND ASSESSMENT OF IMPORTANT SPEECHES

Zadeh, who is the father of fuzzy logic, remarked that he expected to see many new devices with high "MIQ" (machine intelligence quotient) appearing in the near future. He stressed that level I applications of fuzzy logic were to replace skilled human operators with fuzzy logic controllers and that this was already happening, but that level II applications, involving the replacement of human experts with fuzzy logic systems, had still not appeared. There are certainly plenty of examples of products in the first category here in Japan. However, at this conference there were only a very few papers related to fuzzy logic (for example, on placing a peg in a hole), so Zadeh's presentation, however interesting, seemed oddly out of place.

Yamazaki Mazak Corp. is a $1B company of 3,800 people that manufactures flexible manufacturing systems for small cubic parts (120 of them have been sold overseas), as well as other automated machine tools. Nagae is its director of engineering. He spoke from the Japanese manufacturing perspective. His talk emphasized the issues confronting Japanese manufacturing companies today, viz., a labor shortage in midsized companies due to the low Japanese birth rate and the movement of university graduates away from manufacturing into finance-related industries. Interestingly, he mentioned a problem of lack of pride associated with manufacturing in Japan (and we thought that this was only in the United States!). He felt that the responsibility of the companies was to create an environment that encourages the use of computers to take up the slack of a lessened workforce. He credited the success of his company to four main factors: (1) large capital investments, (2) a continual succession of new products, (3) a president who has aggressively pursued development of world-class machine tools, and (4) heavy use of advanced semiconductor technology. His company does no mathematical modelling or simulation. He cited several difficulties with U.S. manufacturing companies:

(1) The U.S. desire to sell flexible manufacturing systems (FMS) to other companies, but not much in-house use of this technology by U.S. machine tool manufacturers. Associated with this, he mentioned that his company must reinvent 20 to 30 new machine tools each year and hence has a strong desire to use FMS, while U.S. machine tool manufacturers are still using many old machine tools. (FMS means a machine tool that can easily be reconfigured to produce a number of different tools.)

(2) The unwillingness of U.S. manufacturers to realize that, to understand global needs, research and development (R&D) must be done in many other countries.

(3) Too much U.S. research on software, especially associated with concurrent systems, and not enough on hardware. (In my opinion, Japan has exactly the opposite problem.)

He ended his talk with the following statements.
