Scientific Information Bulletin

PACIFIC RIM FAULT TOLERANT SYSTEMS INTERNATIONAL SYMPOSIUM

A summary of the 2-day Pacific Rim Fault Tolerant Systems Symposium,
held from 26-27 September 1991 in Kawasaki, Japan, is given.

by David K. Kahaner and D.M. Blough

INTRODUCTION

Computer systems fail. This is either a minor annoyance or a major disaster, depending on the circumstances. For most scientists it is the former, although it can also be a personal disaster if needed data are lost. But for large financial, communication, and military systems, computer failures can have national and even international implications. These days, transferring money means transferring bytes; the recent AT&T failure in the New York area also illustrates just how dependent we are on computing systems. As computing becomes more distributed, there is greater likelihood that some parts will fail, and thus there is a greater need to understand how and why in order to increase the reliability of computers.

In September 1991, 90 scientists from 13 countries met in Kawasaki to share ideas at the 1991 Pacific Rim Fault Tolerant Systems (PRFTS) Symposium. The distribution of attendees was as follows: Japan (59); China (14); United States (7); Australia (3); Korea (2); India, Taiwan, France, Sweden, Germany, Poland, and the United Kingdom (1 each).

Sixty papers were submitted from 14 countries and 40 were selected for presentation and inclusion in the Proceedings. The distribution of submitted papers is very interesting: China (22), Japan (17), United States (5), India (3), Australia (2), Taiwan (2), Korea (2), Hong Kong (1), Canada (1), European Communities (EC) [5 countries] (5). The Honorary Chair of the meeting was

Prof. Yoshiaki Koga
Dept of Electrical Engineering
National Defense Academy
1-10-20 Hashirimizu
Yokosuka, 239 Japan
Tel: (0468) 41-3810 x2280
Fax: (045) 772-7552

The General Chair was Prof. Sachio Naito of Nagaoka University of Technology, which is in Niigata, on the north (Sea of Japan) coast. The program committee has prepared a complete Proceedings of the conference (in English) published by

IEEE Computer Society Press
Customer Service Center
10662 Los Vaqueros Circle
P.O. Box 3014
Los Alamitos, CA 90720-1264

When ordering, specify IEEE Computer Society Press Number 2275 or Library of Congress Number 91-55311.

There is currently no known way to make a computer system entirely fail-safe. By "fault tolerant system" we mean one that employs a class of technologies that reduce the likelihood that the system will fail. The techniques fall into two general classes.

(1) Hardware--making the hardware more reliable and rugged, for example, by making it capable of taking abuse or extreme environmental conditions. This includes redundancy in components such as power supplies, disk drives, memories, CPUs, etc. (It goes without saying that there is continual improvement in basic system component design and manufacture.) Of course, redundant hardware is not new. Initially, computers were so unreliable that some early systems were designed with three units whose results were polled after each operation--the majority determining the output. These days generalizations of this idea are called "consensus recovery blocks." In research, there is work on how to allocate a small number of spare parts such as processors to make them optimally available and economically producible. This leads to some interesting problems in graph theory. Very early work was on error detecting and correcting codes for data transmission. There is still plenty of research going on here. Building in reliability is obviously not limited to special computers. Even many "conventional" systems include a UPS (uninterruptible power supply), which is usually a battery or generator back-up.

(2) Software--making the software more reliable. This includes designing/testing out software bugs and also allowing the software to survive even if the hardware fails. Some theory is being developed on how to design operating system software that is more reliable, but there is no fundamental theory. Most researchers on fault tolerant software are probably referring to system rather than application software.

There is general agreement that (2) is the more difficult. Because hardware failures are often more or less random, hardware redundancy is usually successful in improving reliability. But experience with software has shown that the technique of having several people write a program for the same task (N-version programming) often does not improve the overall reliability of the system. In fact, one expert has remarked that it was better to reduce the specificity of the programming task to avoid having distinct programmers make the same error. Using more than one language is another approach, but this leads to other difficulties related to cross language execution.
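The contrast between the two cases can be made concrete with a small sketch in Python. It is only an illustration under stated assumptions: the voter, the three "units," and the leap-year task are all hypothetical and are not taken from any system discussed at the symposium. The same majority voter masks a single independent hardware fault but is defeated when two independently written program versions share the same misreading of the specification.

```python
from collections import Counter


def majority_vote(results):
    """Return the value reported by a strict majority of units, or raise."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("no majority among redundant results")
    return value


def run_voted(units, arg):
    """Run every redundant unit on the same input and vote on the outputs."""
    return majority_vote([unit(arg) for unit in units])


# Hardware case (hypothetical): one of three identical units is faulty.
def healthy_unit(x):
    return x * x


def faulty_unit(x):
    return x * x + 1  # simulated fault in a single unit


# Software case (hypothetical): three versions of a leap-year test.
def leap_v1(year):
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def leap_v2(year):
    return year % 4 == 0  # programmer overlooked the century rule


def leap_v3(year):
    return year % 4 == 0  # a second programmer made the same mistake


# The single faulty unit is outvoted, so the correct square is returned.
hardware_result = run_voted([healthy_unit, healthy_unit, faulty_unit], 7)

# Two versions share one error, so the vote confirms the wrong answer.
software_result = run_voted([leap_v1, leap_v2, leap_v3], 1900)
```

Here hardware_result is the correct value 49, while software_result reports 1900 as a leap year even though leap_v1 alone answers correctly. Voting helps only to the extent that failures are independent, which is the assumption that tends to hold for hardware and to fail for software.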

Several commercial vendors, notably Stratus and Tandem from the United States, market systems that combine redundant hardware with software that is capable of surviving many different kinds of hardware failures. These companies have been very successful in marketing products here in Japan.

SYMPOSIUM DETAILS

One member of the PRFTS program committee, Prof. D.M. Blough (University of California at Irvine) [blough@sunset.eng.uci.edu], has provided a summary of several papers of interest to him, and this summary is included as an appendix to this article with my sincere appreciation for his efforts. My own comments are immediately below. I was particularly interested in the following.

(1) The large participation by scientists from China, most of whom gave papers on very specific systems they were developing.

(2) An excellent summary of fault tolerant (FT) systems in China, presented by Prof. Shiyi Xu, of Shanghai University of Science and Technology, including more than a dozen references. [Xu has spent time at the State University of New York (SUNY) at Binghamton, as well as Carnegie Mellon University (CMU), and his English is excellent.] He explained that the Chinese recognize the importance of FT systems; outside of Japan, China is probably the only Asian country to have developed FT systems on its own, including a triple redundant one for spaceflight use. Nevertheless, Xu commented that there is at the moment no plan to produce a practical nonstop system along the lines of Tandem or Stratus, and that most of the FT work is research only. However, some of these research projects are quite advanced, such as Wuhan Digital Engineering Institute's 980FT86 system, which is based on a multibus hardware model. This system operates in either a multiprocessor or fault tolerant mode and has a failure rate that is simulated to be three orders of magnitude less than for a conventional system. Also, the Chinese have developed some practical software to test programmable logic arrays and this has been offered to foreign companies such as Fujitsu. Xu's paper is also summarized by Blough. Finally, the Chinese have proposed holding the 1993 PRFTS in Shenzhen during 5-7 August 1993. For further information, correspond with

Prof. Yinghua Min
Center for Fault Tolerant Computing
CAD Laboratory
Institute of Computing Technology
Chinese Academy of Sciences
P.O. Box 2704-CAD
Beijing 100080, People's Republic of China
Tel: +86-1-256-5533 x536
Fax: +86-1-256-7724

In the United States, the program co-chair is Prof. Ravi K. Iyer, University of Illinois at Urbana-Champaign.

(3) The description of Fujitsu's new SURE SYSTEM 2000 fault tolerant computer, again summarized by Blough.

(4) Among the other technical papers, I was impressed with a description of C3, a connection network for reconfiguration at the chip design level (wafer scale integration) by T. McDonald (University of Newcastle, Australia) and another by K. Kawashima (Osaka University, Japan) on spare channel assignment in optical fiber networks. In particular, in the latter, the authors claim to have solved an integer LP (linear programming) problem in polynomial time.

(5) One paper was presented by a substitute speaker (K. Forward, Australia) who was not one of the authors. In terms of audience understanding of the paper's key ideas and critical appraisal of the work, this was probably the best presentation at the symposium. Hats off to Prof. Forward. Too bad this technique isn't used more often.

David K. Kahaner joined the staff of the Office of Naval Research Asian Office as a specialist in scientific computing in November 1989. He obtained his Ph.D. in applied mathematics from Stevens Institute of Technology in 1968. From 1978 until 1989 Dr. Kahaner was a group leader in the Center for Computing and Applied Mathematics at the National Institute of Standards and Technology, formerly the National Bureau of Standards. He was responsible for scientific software development on both large and small computers. From 1968 until 1979 he was in the Computing Division at Los Alamos National Laboratory. Dr. Kahaner is the author of two books and more than 50 research papers. He also edits a column on scientific applications of computers for the Society for Industrial and Applied Mathematics. His major research interests are in the development of algorithms and associated software. His programs for solution of differential equations, evaluation of integrals, random numbers, and others are used worldwide in many scientific computing laboratories. Dr. Kahaner's electronic mail address is: kahaner@xroads.cc.u-tokyo.ac.jp.

Douglas M. Blough received a B.S. degree in electrical engineering and M.S. and Ph.D. degrees in computer science from The Johns Hopkins University, Baltimore, MD, in 1984, 1986, and 1988, respectively. Since 1988, he has been with the Department of Electrical and Computer Engineering, University of California, Irvine, where he is currently Assistant Professor. In 1989, he received a joint appointment as Assistant Professor of Information and Computer Science at UC Irvine. His research interests include fault-tolerant computing, computer architecture, and parallel processing. Dr. Blough is a member of Eta Kappa Nu, Tau Beta Pi, the Institute of Electrical and Electronics Engineers, and the Association for Computing Machinery.

Appendix

BLOUGH'S COMMENTS ON PRFTS '91 AND TECHNICAL TOURS

CONFERENCE PRESENTATIONS

The opening presentation was an invited talk delivered by Prof. Shiyi Xu of Shanghai University of Science and Technology. The presentation was titled "Fault-Tolerant Systems in China." Dr. Xu gave an overview of the research being carried out in fault tolerant computing and testing in China. Recently, China has begun to emphasize fault tolerant computing. A national conference on fault tolerant computing was begun in 1985 and has been held every 2 years since its inception. A technical committee on fault tolerant computing was formed in 1991. As for the research, it is apparent that China is far behind the United States, Japan, and Europe. The bulk of the work appears to be in the area of testing, where some good results have been achieved. In the area of fault tolerant systems, there have been a fair number of systems designed and manufactured for various applications. All of these systems use standard fault tolerance techniques that have been known for many years. There were no truly experimental systems presented. The most advanced system is the SWI system that was built by an unnamed defense company for an unidentified application. This system uses self-checking and real-time monitoring techniques and has a multilevel recovery mechanism consisting of error-correcting codes, operation retry, dynamic reconfiguration, and system recovery. I was somewhat disappointed by the small amount of work being done in fault tolerant systems theory and software fault tolerance, areas in which research does not require tremendous resources.

Prof. Shyan-Ming Yuan of the Department of Computer and Information Science of the National Chiao Tung University of Taiwan delivered a presentation titled "An O(N log^2 N) Fault-Tolerant Decentralized Commit Protocol." This was a very nice piece of work concerning the problem of implementing an atomic transaction mechanism in a distributed database system that is subject to node failures. The author had previously proven that any decentralized commit protocol without failures requires at least on the order of N log N messages. In this presentation, an algorithm was given that can tolerate up to (log_2 N) - 2 faults using O(N log^2 N) messages. The algorithm utilizes a supercube communication structure to perform the protocol. It has the desirable property that a transaction will be aborted only if some nodes want to abort the transaction or if some node fails before making a decision to commit or abort.
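To give a feel for the magnitudes involved, the following Python sketch tabulates the quantities quoted above for a given number of nodes N. It is an illustration only: the function name is hypothetical, N is assumed to be a power of 2 so that log_2 N is an integer, and the constant factors hidden by the order notation are taken to be 1, so the message figures show growth rates rather than actual message counts.

```python
def commit_protocol_scale(n):
    """Order-of-magnitude figures for an n-node decentralized commit.

    Assumes n is a power of 2 and ignores the constants hidden in the
    order notation, so message figures are growth rates, not counts.
    """
    if n < 2 or n & (n - 1) != 0:
        raise ValueError("n must be a power of 2 and at least 2")
    log_n = n.bit_length() - 1  # exact log base 2 for a power of 2
    return {
        "nodes": n,
        "failure_free_lower_bound": n * log_n,    # order N log N
        "fault_tolerant_messages": n * log_n**2,  # order N log^2 N
        "faults_tolerated": max(0, log_n - 2),    # (log_2 N) - 2
    }


scale_1024 = commit_protocol_scale(1024)
```

For N = 1024 this gives growth figures of 10240 and 102400 and a tolerance of 8 faults. Under these assumptions, tolerating faults costs an extra factor of log_2 N in messages over the failure-free lower bound.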

A paper titled "Fault-Tolerant Attribute Evaluation in Distributed Software Environments" by Feng, Kikuno, and Torii of Osaka University, Japan, dealt with the tolerance of workstation outages in multistation software development environments. In such environments, software developers combine to design large software packages using multiple workstations. Modules developed on one workstation often need to interface with modules developed on another station, meaning that workstation outages can significantly decrease productivity. In the proposed approach, redundant information concerning the modules on other workstations is stored in a data structure called an interface graph. The paper shows how this redundant information can be used to allow one workstation to receive semantic information concerning software modules on another workstation even if that workstation is inaccessible.
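The general idea of answering from locally held redundant information can be sketched as follows in Python. This is a loose illustration and not the authors' attribute evaluation algorithm: the class, its methods, the exception, and the example module and station names are all hypothetical, and the "interface graph" is reduced here to a table of module interfaces keyed by module name.

```python
class StationUnreachable(Exception):
    """Raised by a fetch function when the owning workstation is down."""


class InterfaceGraph:
    """Local redundant copy of interface information for remote modules."""

    def __init__(self):
        self._nodes = {}  # module name -> (owning station, interface info)

    def record(self, module, station, interface):
        """Store what is currently known about a remote module."""
        self._nodes[module] = (station, dict(interface))

    def lookup(self, module, fetch_remote):
        """Prefer live information, but fall back to the redundant copy."""
        station, cached = self._nodes[module]
        try:
            fresh = fetch_remote(station, module)
        except StationUnreachable:
            return cached, "cached"
        self._nodes[module] = (station, dict(fresh))
        return fresh, "live"


def unreachable_fetch(station, module):
    """Stand-in for a network call to a workstation that is down."""
    raise StationUnreachable(station)


graph = InterfaceGraph()
graph.record("parser", "station-B", {"parse": "(text) -> tree"})
info, origin = graph.lookup("parser", unreachable_fetch)
```

In this sketch the developer on the local station still obtains the interface of the parser module, marked as coming from the redundant copy, although station-B cannot be reached.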

A very interesting presentation was given by Mr. Hiroshi Yoshida of Fujitsu Ltd. titled "Fault Tolerance Assurance Methodology of the SXO Operating System for Continuous Operation." In this presentation, we were exposed to the operating system for the SURE SYSTEM 2000 computer, which is Fujitsu's new entry into the fault tolerant computing marketplace. [The printed paper focused entirely on the operating system, SXO (SURE System 2000 Expandable Operating System), but the author also gave a very brief description of the hardware, which is built around two buses and is scalable up to six processor modules.] The presentation described the process through which Fujitsu designed, manufactured, and tested the SXO operating system in order to assure its software fault tolerance capability. The process consisted of exhaustive listing of error symptoms, construction of a recovery process chart that explicitly details the steps taken to detect and recover from each error symptom, identification and careful design of critical routes that could lead to system failure, and testing through artificial software fault injection. While this process was thoroughly and carefully implemented, there are several problems with the general approach that were brought up during the discussion following the presentation. The first problem deals with the reliance on an exhaustive listing of error symptoms. The number of error symptoms that could occur in such a system is virtually unlimited.
