
PACIFIC RIM FAULT TOLERANT SYSTEMS

INTERNATIONAL SYMPOSIUM

A summary of the 2-day Pacific Rim Fault Tolerant Systems Symposium, held 26-27 September 1991 in Kawasaki, Japan, is given.

by David K. Kahaner and D.M. Blough

INTRODUCTION

Computer systems fail. This is either a minor annoyance or a major disaster, depending on the circumstances. For most scientists it is the former, although it can also be a personal disaster if needed data are lost. But for large financial, communication, and military systems, computer failures can have national and even international implications. These days, transferring money means transferring bytes; the recent AT&T failure in the New York area also illustrates just how dependent we are on computing systems. As computing becomes more distributed, there is greater likelihood that some parts will fail, and thus there is a greater need to understand how and why in order to increase the reliability of computers.

In September 1991, 90 scientists from 13 countries met in Kawasaki to share ideas at the 1991 Pacific Rim Fault Tolerant Systems (PRFTS) Symposium. The distribution of attendees was as follows: Japan (59); China (14); United States (7); Australia (3); Korea (2); India, Taiwan, France, Sweden, Germany, Poland, and the United Kingdom (1 each).

Sixty papers were submitted from 14 countries and 40 were selected for presentation and inclusion in the Proceedings. The distribution of submitted papers is very interesting: China (22), Japan (17), United States (5), India (3), Australia (2), Taiwan (2), Korea (2), Hong Kong (1), Canada (1), European Communities (EC) (5 countries) (5). The Honorary Chair of the meeting was

Prof. Yoshiaki Koga
Dept of Electrical Engineering
National Defense Academy
1-10-20 Hashirimizu
Yokosuka, 239 Japan
Tel: (0468) 41-3810 X2280
Fax: (045) 772-7552

The General Chair was Prof. Sachio Naito of Nagaoka University of Technology, which is in Niigata, on the north (Sea of Japan) coast. The program committee has prepared a complete Proceedings of the conference (in English) published by

IEEE Computer Society Press
Customer Service Center
10662 Los Vaqueros Circle
P.O. Box 3014
Los Alamitos, CA 90720-1264

When ordering, specify IEEE Computer Society Press Number 2275 or Library of Congress Number 91-55311.

There is now no known way to make a computer system entirely fail-safe. By "fault tolerant system" we mean one that employs a class of technologies that reduce the likelihood a system will fail. The techniques fall into two general classes.

(1) Hardware--making the hardware more reliable and rugged, for example, by making it capable of taking abuse or extreme environmental conditions. This includes redundancy in components such as power supplies, disk drives, memories, CPUs, etc. (It goes without saying that there is continual improvement in basic system component design and manufacture.) Of course, redundant hardware is also not new. Initially, computers were so unreliable that some early systems were designed with three units whose results were polled after each operation--the majority determining the output. These days generalizations of this idea are called "consensus recovery blocks." In research, there is work on how to allocate a small number of spare parts such as processors to make them optimally available and economically producible. This leads to some interesting problems in graph theory. Very early work was on error detecting and correcting codes for data transmission. There is still plenty of research going on here. Building in reliability is obviously not limited to special computers. Even many "conventional" systems include UPS (uninterruptible power supply), which is usually a battery or generator back-up.

(2) Software--making the software more reliable. This includes designing/testing out software bugs and also allowing the software to survive even if the hardware fails. Some theory is being developed on how to design operating system software that is more reliable, but there is no fundamental theory. Most researchers on fault tolerant software are probably referring to system rather than application software.

There is general agreement that (2) is the more difficult. Because hardware failures are often more or less random, hardware redundancy is usually successful in improving reliability. But experience with software has shown that the technique of having several people write a program for the same task (N-version programming) often does not improve the overall reliability of the system. In fact, one expert has remarked that it was better to reduce the specificity of the programming task to avoid having distinct programmers make the same error. Using more than one language is another approach, but this leads to other difficulties related to cross language execution.

Several commercial computer systems, notably Stratus and Tandem from the United States, market systems that combine redundant hardware with software that is capable of surviving many different kinds of hardware failures. These companies have been very successful in marketing products here in Japan.

SYMPOSIUM DETAILS

One member of the PRFTS program committee, Prof. D.M. Blough (University of California at Irvine) (blough@sunset.eng.uci.edu) has provided a summary of several papers of interest to him, and this summary is included as an appendix to this article with my sincere appreciation for his efforts. My own comments are immediately below. I was particularly interested in the following.

(1) The large participation by scientists from China, most of whom gave papers on very specific systems they were developing.

(2) An excellent summary of fault tolerant (FT) systems in China, presented by Prof. Shiyi Xu, of Shanghai University of Science and Technology, including more than a dozen references. (Xu has spent time at the State University of New York (SUNY) at Binghamton, as well as Carnegie Mellon University (CMU), and his English is excellent.) He explained that the Chinese recognize the importance of FT systems, and outside of Japan China is probably the only Asian country to have developed FT systems on their own including a triple redundant one for spaceflight use. Nevertheless, Xu commented that there is at the moment no plan to produce a practical nonstop system along the lines of Tandem or Stratus, and that most of the FT work is research only. However, some of these research projects are quite advanced, such as Wuhan Digital Engineering Institute's 980FT86 system. This is based on a multibus hardware model. This system operates in either a multiprocessor or fault tolerant mode and has a failure rate that is simulated to be three orders of magnitude less than for a conventional system. Also, the Chinese have developed some practical software to test programmable logic arrays and this has been offered to foreign companies such as Fujitsu. Xu's paper is also summarized by Blough. Finally, the Chinese have proposed having the 1993 PRFTS in Shenzhen during 5-7 August 1993. For further information, correspond with

Prof. Yinghua Min
Center for Fault Tolerant Computing
CAD Laboratory
Institute of Computing Technology
Chinese Academy of Sciences
P.O. Box 2704-CAD
Beijing 100080, People's Republic of China
Tel: +86-1-256-5533x536
Fax: +86-1-256-7724

In the United States, the program co-chair is Prof. Ravi K Iyer, University of Illinois at Urbana-Champaign.

(3) The description of Fujitsu's new Sure system 2000 fault tolerant computer, again summarized by Blough.

(4) Among the other technical papers, I was impressed with a description of C3, a connection network for reconfiguration at the chip design level (wafer scale integration) by T. McDonald (University of Newcastle, Australia) and another by K. Kawashima (Osaka University, Japan) on spare channel assignment in optical fiber networks. In particular, in the latter, the authors claim to have solved an integer LP (linear programming) problem in polynomial time.

(5) One paper was presented by a substitute speaker (K. Forward, Australia) who was not one of the authors. In terms of audience understanding of the paper's key ideas and critical appraisal of the work, this was probably the best presentation at the symposium. Hats off to Prof. Forward. Too bad this technique isn't used more often.
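The three-unit majority polling described in the introduction, and the triple redundant design mentioned in item (2) above, can be illustrated with a small sketch. It is only an illustration under stated assumptions: the function names (`majority_vote`, `run_redundant`) and the toy "units" are hypothetical and are not taken from any system discussed at the symposium. The last case shows the N-version programming pitfall noted in the introduction, where two versions that share the same error outvote the correct one.

```python
from collections import Counter
from typing import Callable, Optional, Sequence


def majority_vote(results: Sequence[int], n_units: int) -> Optional[int]:
    """Return the value reported by a strict majority of all units, else None."""
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    # The majority is measured against the total number of units,
    # so units that crashed count against agreement.
    return value if 2 * count > n_units else None


def run_redundant(units: Sequence[Callable[[int], int]], x: int) -> Optional[int]:
    """Run every unit on the same input and poll the results."""
    results = []
    for unit in units:
        try:
            results.append(unit(x))
        except Exception:
            # A crashed unit contributes no vote.
            continue
    return majority_vote(results, len(units))


# Hypothetical units: one correct, one with a wrong result, one that crashes.
def square(x: int) -> int:
    return x * x


def faulty_square(x: int) -> int:
    return x * x + 1  # deliberately wrong result


def crashing_square(x: int) -> int:
    raise RuntimeError("unit failure")


# One faulty unit is outvoted by two good ones.
masked = run_redundant([square, square, faulty_square], 7)

# No strict majority: the voter reports failure instead of guessing.
undecided = run_redundant([square, faulty_square, crashing_square], 7)

# Common-mode error: two versions share the same bug and outvote the correct one.
common_mode = run_redundant([faulty_square, faulty_square, square], 7)
```

With independent failures the voter masks a single bad unit, but when two versions make the same mistake the majority is confidently wrong, which is why redundancy helps less for software than for hardware.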


David K. Kahaner joined the staff of the Office of Naval Research Asian Office as a specialist in scientific computing in November 1989. He obtained his Ph.D. in applied mathematics from Stevens Institute of Technology in 1968. From 1978 until 1989 Dr. Kahaner was a group leader in the Center for Computing and Applied Mathematics at the National Institute of Standards and Technology, formerly the National Bureau of Standards. He was responsible for scientific software development on both large and small computers. From 1968 until 1979 he was in the Computing Division at Los Alamos National Laboratory. Dr. Kahaner is the author of two books and more than 50 research papers. He also edits a column on scientific applications of computers for the Society for Industrial and Applied Mathematics. His major research interests are in the development of algorithms and associated software. His programs for solution of differential equations, evaluation of integrals, random numbers, and others are used worldwide in many scientific computing laboratories. Dr. Kahaner's electronic mail address is: kahaner@xroads.ocutokyo.ac.jp

Appendix

BLOUGH'S COMMENTS ON PRFTS '91 AND TECHNICAL TOURS

CONFERENCE PRESENTATIONS

The opening presentation was an invited talk delivered by Prof. Shiyi Xu of Shanghai University of Science and Technology. The presentation was titled "Fault-Tolerant Systems in China." Dr. Xu gave an overview of the research being carried out in fault tolerant computing and testing in China. Recently, China has begun to emphasize fault tolerant computing. A national conference on fault tolerant computing was begun in 1985 and has been held every 2 years since its inception. A technical committee on fault tolerant computing was formed in 1991. As for the research, it is apparent that China is far behind the United States, Japan, and Europe. The bulk of the work appears to be in the area of testing, where some good results have been achieved. In the area of fault tolerant systems, there have been a fair number of systems designed and manufactured for various applications. All of these systems use standard fault tolerance techniques that have been known for many years. There were no truly experimental systems presented. The most advanced system is the SWI system that was built by an unnamed defense company for an unidentified application. This system uses self-checking and real-time monitoring techniques and has a multilevel recovery mechanism consisting of error-correcting codes, operation retry, dynamic reconfiguration, and system recovery. I was somewhat disappointed by the small amount of work being done in fault tolerant systems theory and software fault tolerance, areas in which research does not require tremendous resources.

Prof. Shyan-Ming Yuan of the Department of Computer and Information Science of the National Chiao Tung University of Taiwan delivered a presentation titled "An O(N log^2 N) Fault-Tolerant Decentralized Commit Protocol." This was a very nice piece of work concerning the problem of implementing an atomic transaction mechanism in a distributed database system that is subject to node failures. The author had previously proven that any decentralized commit protocol without failures requires at least on the order of N log N messages. In this presentation, an algorithm was given that can tolerate up to (log_2 N) - 2 faults using O(N log^2 N) messages. The algorithm utilizes a supercube communication structure to perform the protocol. It has the desirable property that a transaction will be aborted only if some nodes want to abort the transaction or if some node fails before making a decision to commit or abort.

A paper titled "Fault-Tolerant Attribute Evaluation in Distributed Software Environments" by Feng, Kikuno, and Torii of Osaka University, Japan, dealt with the tolerance of workstation outages in multistation software development environments. In such environments, software developers combine to design large software packages using multiple workstations. Modules developed on one workstation often need to interface with modules developed on another station, meaning that workstation outages can significantly decrease productivity. In the proposed approach, redundant information concerning the modules on other workstations is stored in a data structure called an interface graph. The paper shows how this redundant information can be used to allow one workstation to receive semantic information concerning software modules on another workstation even if that workstation is inaccessible.

A very interesting presentation was given by Mr. Hiroshi Yoshida of Fujitsu Ltd. titled "Fault Tolerance Assurance Methodology of the SXO Operating System for Continuous Operation." In this presentation, we were exposed to the operating system for the SURE SYSTEM 2000 computer, which is Fujitsu's new entry into the fault tolerant computing marketplace. (The printed paper focused entirely on the operating system, SXO (SURE System 2000 Expandable Operating System), but the author also gave a very brief description of the hardware, which is built around two buses and is scalable up to six processor modules.) The presentation described the process through which Fujitsu designed, manufactured, and tested the SXO operating system in order to assure its software fault tolerance capability. The process consisted of exhaustive listing of error symptoms, construction of a recovery process chart that explicitly details the steps taken to detect and recover from each error symptom, identification and careful design of critical routes that could lead to system failure, and testing through artificial software fault injection. While this process was thoroughly and carefully implemented, there are several problems with the general approach that were brought up during the discussion following the presentation. The first problem deals with the reliance on an exhaustive listing of error symptoms. The number of error symptoms that could occur in such a system is virtually unlimited.
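To make the objection concrete, the following is a minimal sketch of a symptom-driven recovery chart exercised by artificial fault injection. Everything in it is assumed for illustration: the symptom names, the recovery steps, and the functions `recover` and `inject_faults` are hypothetical and do not describe Fujitsu's actual SXO tooling.

```python
from typing import Dict, Iterable, List

# Hypothetical recovery chart: each listed error symptom maps to the
# ordered steps taken to recover from it.
RECOVERY_CHART: Dict[str, List[str]] = {
    "memory_parity_error": ["retry_operation", "switch_to_spare_module"],
    "bus_timeout": ["retry_operation", "reroute_via_second_bus"],
    "process_hang": ["restart_process"],
}


def recover(symptom: str) -> List[str]:
    """Return the recovery steps for a listed symptom; fail if it is unlisted."""
    try:
        return RECOVERY_CHART[symptom]
    except KeyError:
        raise RuntimeError(f"no recovery route for symptom: {symptom}") from None


def inject_faults(symptoms: Iterable[str]) -> Dict[str, bool]:
    """Inject each symptom artificially and record whether recovery was possible."""
    outcome = {}
    for symptom in symptoms:
        try:
            recover(symptom)
            outcome[symptom] = True
        except RuntimeError:
            outcome[symptom] = False
    return outcome


# Every symptom that was anticipated in the chart is recovered...
listed = inject_faults(RECOVERY_CHART)

# ...but a symptom nobody thought to list has no recovery route at all.
unlisted = inject_faults(["unanticipated_symptom"])
```

Fault injection driven by the same list that defines the chart can only confirm coverage of anticipated symptoms; it says nothing about the symptoms that were never listed.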
