Prof. Kishor S. Trivedi Center for Advanced Computing & Communication, Dept. of Electrical & Computer Engineering, Duke University, Durham, North Carolina

Recently, the phenomenon of "software aging", in which the state of a software system degrades over time, has been reported. The primary causes of this degradation are the exhaustion of operating system resources, data corruption, and the accumulation of numerical errors. It may eventually lead to degraded performance, to a crash or hang failure, or to both. Software aging has been reported in widely used software such as Netscape and xrn. Aging in AT&T communications software is known to have resulted in packet loss. Numerous other examples exist, both in systems with high availability requirements and in safety-critical systems.

To counteract this phenomenon, a proactive approach to fault management, called "software rejuvenation", has been proposed. It essentially involves gracefully terminating an application or a system and restarting it in a clean internal state. This process removes the accumulated errors and frees up operating system resources. The preventive action can be performed at optimal times (e.g. when the load on the system is low) so that the overhead due to planned system downtime is minimal. A basic assumption here is that the overhead of the planned downtime and the clean-up operation is considerably less than the cost incurred due to unplanned system outages.

In the seminar, we shall discuss methods of evaluating the effectiveness of proactive fault management in operational software systems and of determining optimal times to perform rejuvenation. This is done by developing stochastic models that trade off the cost of unexpected failures due to software aging against the overhead of proactive fault management. In the second half of the seminar we shall deal with measurement-based models, which are constructed using workload and resource usage data collected from the Unix operating system over a period of time. The measurement-based models are the first steps towards predicting aging-related failures. They are intended to help develop strategies for software rejuvenation triggered by actual measurements.
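The trade-off between failure cost and rejuvenation overhead described above can be illustrated with a small sketch. It is not one of the models presented in the seminar; it assumes a textbook age-replacement policy in which the time to an aging-related failure follows a Weibull distribution, downtime durations are ignored, and the cost figures and the names `survival`, `cost_rate` and `best_interval` are hypothetical. The long-run cost per unit time for a rejuvenation interval T is taken as (expected cost per cycle) / (expected cycle length).

```python
import math


def survival(t, shape, scale):
    """Probability that no aging-related failure occurs before time t
    (Weibull assumption; shape > 1 means the failure rate grows with age)."""
    return math.exp(-((t / scale) ** shape))


def cost_rate(interval, shape, scale, cost_failure, cost_rejuvenation, steps=2000):
    """Long-run cost per unit time when rejuvenating at age `interval`
    unless a failure happens first (renewal-reward argument)."""
    # Expected cycle length = integral of the survival function over [0, interval],
    # approximated here with the trapezoidal rule.
    h = interval / steps
    area = 0.5 * (survival(0.0, shape, scale) + survival(interval, shape, scale))
    for i in range(1, steps):
        area += survival(i * h, shape, scale)
    expected_cycle_length = area * h

    # A cycle ends either in an unplanned failure or in a planned rejuvenation.
    p_failure = 1.0 - survival(interval, shape, scale)
    expected_cycle_cost = (cost_failure * p_failure
                           + cost_rejuvenation * (1.0 - p_failure))
    return expected_cycle_cost / expected_cycle_length


def best_interval(candidates, shape, scale, cost_failure, cost_rejuvenation):
    """Candidate interval with the lowest long-run cost rate (simple grid search)."""
    return min(candidates,
               key=lambda t: cost_rate(t, shape, scale,
                                       cost_failure, cost_rejuvenation))


# Illustrative numbers only: a failure is assumed to cost ten times a rejuvenation.
candidates = list(range(10, 501, 10))
chosen = best_interval(candidates, shape=2.0, scale=100.0,
                       cost_failure=10.0, cost_rejuvenation=1.0)
print("rejuvenation interval with lowest cost rate:", chosen)
```

Under these assumptions an interior optimum exists only when the failure rate increases with age (shape greater than 1). With a constant failure rate (shape equal to 1) the software does not age in this model, and rejuvenating less often always lowers the cost rate.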


Prof. Trivedi has been on the faculty of Duke University since 1975 and now holds the Hudson Chair in the Department of Electrical and Computer Engineering. He also holds a joint appointment in the Department of Computer Science. He is the Duke site director of an NSF Industry-University Cooperative Research Center for applied research in computing and communications. He was an Editor of the IEEE Transactions on Computers from 1983 to 1987. He is a co-designer of the HARP, SAVE, SHARPE, SPNP, and SREPT modeling packages, which have been widely circulated.

Prof. Trivedi is the author of the well-known text "Probability and Statistics with Reliability, Queuing and Computer Science Applications", published by Prentice-Hall (a second edition will soon be published by John Wiley). He has recently published two other books: "Performance and Reliability Analysis of Computer Systems" (Kluwer) and "Queuing Networks and Markov Chains" (Wiley). His research interests are in reliability and performance assessment of computer and communication systems.

He received the B.Tech. degree from the Indian Institute of Technology (Bombay), and the M.S. and Ph.D. degrees in computer science from the University of Illinois, Urbana-Champaign.
