1993, IEEE Transactions on Reliability
General purpose: Present modeling techniques. Special math needed for explanations: Probability, Markov models. Special math needed to use results: Same. Results useful to: Reliability analysts and multiprocessor designers.
Summary & Conclusions: Performability models of multiprocessor systems and their evaluation are presented. Two cases in which hierarchical modeling is applied are examined. 1. Models are developed to analyze the behavior of processor arrays of various sizes in the presence of permanent, transient, intermittent, and near-coincident faults. Models can be generated for typical reconfiguration schemes that consider the failures of several types of components (detailed modeling). These models incorporate a survivability factor derived from the physical distribution of faulty components. Capacity-based reward rates are then used to derive overall performability measures. 2. Queueing network models are solved to derive performance measures that are used as reward rates within an overall Markov failure-repair model of bus-based multiprocessor systems. Several configurations are compared in terms of their performability. In both cases, Markov models are generated using MGRE and solved using SHARPE. The analysis, particularly for case 2, is by no means exhaustive, as several parameters are involved in the overall model. However, the hierarchical models shown, combined with the use of diverse tools such as MGRE & SHARPE, facilitate the analysis of large systems in various environments. The models can be fine-tuned according to specific applications and performance measures. [Footnote: The singular & plural of an acronym are always spelled the same.]
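As a toy illustration of the capacity-based reward-rate idea in this abstract, the sketch below computes the expected accumulated reward (a basic performability measure) for a hypothetical processor array modeled as a pure-death Markov chain. The array size, failure rate, mission time, and reward values are all assumed for illustration; this is not the paper's MGRE/SHARPE model.

```python
import numpy as np

def expected_accumulated_reward(Q, reward, p0, T, steps=2000):
    """E[Y(T)] = integral_0^T p(t).r dt with p'(t) = p(t) Q,
    integrated by explicit Euler on a fine grid (sketch-quality accuracy)."""
    dt = T / steps
    p = p0.copy()
    total = 0.0
    for _ in range(steps):
        total += (p @ reward) * dt
        p = p + dt * (p @ Q)
    return total

# Hypothetical 3-processor array: state k = number of working processors.
lam = 0.01                                # assumed per-processor failure rate
n = 3
Q = np.zeros((n + 1, n + 1))              # pure-death chain over states 3,2,1,0
for k in range(n, 0, -1):
    i = n - k                             # row index for "k processors up"
    Q[i, i] = -k * lam
    Q[i, i + 1] = k * lam
reward = np.array([3.0, 2.0, 1.0, 0.0])   # capacity-based reward rates
p0 = np.array([1.0, 0.0, 0.0, 0.0])       # start fully operational

T = 100.0
est = expected_accumulated_reward(Q, reward, p0, T)
# Independent exponential failures admit the closed form n*(1 - exp(-lam*T))/lam,
# which serves as a cross-check on the numerical integration.
exact = n * (1 - np.exp(-lam * T)) / lam
```

The same transient solution with a different reward vector (e.g. 1 for "at least one processor up") would yield reliability instead of expected capacity, which is the flexibility the reward-rate formulation buys.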
Sigmetrics Performance Evaluation Review, 1988
Traditional evaluation techniques for multiprocessor systems use Markov chains and Markov reward models to compute measures such as mean time to failure, reliability, performance, and performability. In this paper, we discuss the extension of Markov models to include parametric sensitivity analysis. Using such analysis, we can guide system optimization, identify parts of a system model sensitive to error, and find system reliability and performability bottlenecks.
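A minimal sketch of the parametric sensitivity idea, assuming the simplest two-state failure-repair model (not a model taken from the paper): the derivative of steady-state availability with respect to the failure rate is computed in closed form and checked against a central finite difference.

```python
# Two-state repairable model (assumed): availability A(lam, mu) = mu/(lam + mu).
# The sensitivity dA/dlam shows how strongly the measure responds to the
# failure-rate parameter, flagging it as a potential reliability bottleneck.

def availability(lam, mu):
    return mu / (lam + mu)

lam, mu = 0.001, 0.1                      # assumed illustrative rates
analytic = -mu / (lam + mu) ** 2          # dA/dlam in closed form
h = 1e-6
numeric = (availability(lam + h, mu) - availability(lam - h, mu)) / (2 * h)
```

A large magnitude of `analytic` relative to other parameter sensitivities would identify the failure rate as the part of the model most sensitive to estimation error, which is the use the abstract describes.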
Performance Evaluation, 1986
It is argued that classical measures of computer system performance, for example mean response time, are inadequate in the context of fault-tolerant system design. Alternative, perception-based measures are proposed and theorems established describing their properties. Focus is directed upon the homogeneous M/M/m system in which total processor power is constrained by budget and processors are subject to failure and repair. A numerical technique for extracting both classical and perception-based measures from the associated two-dimensional Markov process is offered, along with bounds on time and space required for its execution. It is seen that the perception-based approach to system design can call for twice as many processors as the classical approach.
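For the classical side of such a comparison, the mean response time of an M/M/m system follows from the Erlang C formula. The parameter values below are assumed purely for illustration and do not reproduce the paper's budget-constrained failure-repair model.

```python
import math

def erlang_c(m, a):
    """Erlang C: probability an arriving job must queue in an M/M/m system
    with offered load a = lam/mu (stability requires a < m)."""
    head = sum(a ** k / math.factorial(k) for k in range(m))
    tail = (a ** m / math.factorial(m)) * (m / (m - a))
    return tail / (head + tail)

def mean_response_time(lam, mu, m):
    """Classical measure: mean queueing wait plus mean service time."""
    a = lam / mu
    wq = erlang_c(m, a) / (m * mu - lam)
    return wq + 1.0 / mu

# Assumed illustrative load: 10 processors at utilization 0.8.
r10 = mean_response_time(lam=8.0, mu=1.0, m=10)
```

Under a budget constraint, slowing each server while adding more of them changes both `mu` and `m` at once, which is exactly the trade-off where the classical and perception-based measures can disagree.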
IEEE Transactions on Computers, 1990
Lecture Notes in Computer Science, 1993
IEEE Transactions on Computers, 1993
In this paper, we consider the problem of evaluating the performability density and distribution of degradable computer systems. A generalized model of performability is considered, wherein the dynamics of configuration modes are modeled as a nonhomogeneous Markov process, and the performance rate in each configuration mode can be time-dependent.
IEEE Transactions on Computers, 1992
Recent advances in VLSI/WSI technology have led to the design of processor arrays with a large number of processing elements confined in small areas. The use of redundancy to increase fault-tolerance has the effect of reducing the ratio of area dedicated to processing elements over the area occupied by other resources in the array. The assumption of fault-free hardware support (switches, buses, interconnection links, etc.) leads at best to conservative reliability estimates. However, detailed modeling entails not only an explosive growth in the model state space but also a difficult model construction process. To address the latter problem, a systematic method to construct Markov models for the reliability evaluation of processor arrays is proposed. This method is based on the premise that the fault behavior of a processor array can be modeled by a Stochastic Petri Net (SPN). However, in order to obtain a more compact representation, a set of attributes is associated with each transition in the Petri net model. This representation is referred to as a Modified Stochastic Petri Net (MSPN) model. An MSPN allows the construction of the corresponding Markov model as the reachability graph is being generated. The Markov model generated can include the effect of failures of several different components of the array as well as the effect of a peculiar distribution of faults when the reconfiguration occurs. Specific reconfiguration schemes such as Successive Row Elimination (SRE), Alternate Row-Column Elimination (ARCE) and Direct Reconfiguration (DR) are analyzed.
Microelectronics Reliability, 1994
This paper investigates two mathematical models based on structural computer systems. Two operating environments are considered: DOS and UNIX. The Central Processing Unit (CPU) is the brain of the computer; it drives the monitor and dumb terminal (DT) according to the sequence of instructions given by the operator. A sensitive volume, due to micro-chips, exists in the computer. Electromagnetic interference with this sensitive volume changes the operating behaviour of the computer; these changes generate the partial and complete failure states. Several cost-related reliability measures of system effectiveness are studied using the regenerative point technique. … software systems and the use of computers to control vital and complicated functions. Several researchers [2,3,4,7] have studied models related to computer systems, but they have analysed them for reliability and availability only. The main aim of the present study is to introduce and analyse computer systems (DOS & UNIX) with respect to more reliability measures. In the DOS computer system there are two compartments, drive-C and drive-A. It is assumed that drive-C/drive-A may work with reduced efficiency due to a minor hardware problem. This state of the system is called the partially failed state; from it the system may regain its original state, or it may reach the totally failed state due to a major hardware problem.
CVR Journal of Science & Technology, 2014
There are many applications of computer systems in which system availability must be ensured. Evaluating availability is vital before a computer system is put into operation in such critical applications. For hardware faults, a high degree of reliability can be achieved by hardware redundancy. Fault tolerance is a valuable additional feature for microprocessor systems, and it can be achieved cost-effectively through dedicated customized hardware. A stochastic model of a microprocessor-based computer system has been developed and its lifetime availability estimated; this evaluation is applicable throughout the entire process of system design. The modeling framework used for the multiprocessor system is based on an extension of Petri nets called Stochastic Activity Networks (SANs). A major contribution of this paper is a comprehensive SAN-based model of the computer system, built with the Mobius simulation tool, which can be used extensively for the lifetime evaluation of systems with various architectures and hardware designs.
1994
Performability is a composite measure for performance and reliability, which may be interpreted as the cumulative performance over a finite mission time. The computation of its distribution allows the user to ensure that the system will achieve a given performance level. The system is assumed to be modeled as a Markov process with finite state space, and a reward rate (performance measure) is associated with each state. We propose, in this paper, a new algorithm to compute the performability distribution of fault-tolerant computer systems. The main advantage of this new algorithm is its low polynomial computational complexity. Moreover, it deals only with nonnegative numbers bounded by one. This important property allows us to determine truncation steps and so improve the execution time of the algorithm.
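The paper's polynomial algorithm is not reproduced here, but the performability distribution it targets can be estimated by brute-force Monte Carlo on a small example; the two-state model and its rates below are assumed for illustration only.

```python
import random

def accumulated_reward_sample(Q, r, state, T, rng):
    """One sample of Y(T) = integral_0^T r(X_t) dt for a CTMC with generator Q."""
    y, t = 0.0, 0.0
    while True:
        rate = -Q[state][state]
        dwell = rng.expovariate(rate) if rate > 0 else float("inf")
        if t + dwell >= T:
            return y + r[state] * (T - t)   # mission ends mid-sojourn
        y += r[state] * dwell
        t += dwell
        u = rng.random() * rate             # next state, by off-diagonal rates
        for j, q in enumerate(Q[state]):
            if j != state and q > 0:
                if u < q:
                    state = j
                    break
                u -= q

# Assumed two-state model: up (reward 1) <-> down (reward 0).
lam, mu = 0.5, 2.0
Q = [[-lam, lam], [mu, -mu]]
r = [1.0, 0.0]
rng = random.Random(42)
samples = [accumulated_reward_sample(Q, r, 0, 10.0, rng) for _ in range(5000)]
# Empirical performability distribution P(Y(10) <= y), evaluated at y = 8:
prob_le_8 = sum(s <= 8.0 for s in samples) / len(samples)
```

Simulation converges only as the square root of the sample count, which is precisely why a low-complexity numerical algorithm for the full distribution, as in the abstract, is valuable.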
SCIREA journal of information science and systems science, 2019
We consider a repairable parallel system that consists of m identical units. At any moment in time, a unit can be in one of two states: operational or failed. Suppose the number n of repair facilities is restricted (n≤m), so failed units can form a queue for recovery. Assuming that the distributions of the time to failure (X) and the repair time (Y) for each unit are known, our task is to determine the reliability indices of the system. One method for assessing system reliability is simulation. In this approach, the model replicates the operation of a real system, emulating its functioning over time. In many cases, simulation becomes the most effective and, often, the only practical method for determining the reliability of repairable systems. Using a GPSS World simulation model, we studied the dependence of the system's reliability indicators on the following parameters: the coefficient of variation of the time to failure and repair time of units, and the ratio ρ=E(Y)/E(X). By utilizing the expressions for average transition times between states of the Markov birth-death process, we derive formulas for the mean values of the system's time to failure and time between failures in the case of exponential distributions of the random variables X and Y. We validated the simulation models by comparing the results with
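For the exponential special case mentioned in the abstract, the mean time to system failure can be obtained without simulation by solving the birth-death equations directly; the sketch below assumes illustrative values of m, n, and the rates, not values from the paper.

```python
import numpy as np

# Exponential case (assumed values): m identical units with failure rate lam
# each, n repair facilities each working at rate mu.
m, n = 4, 2
lam, mu = 0.1, 1.0

# State k = number of failed units; state m (all units failed) is the system
# failure and is treated as absorbing, so we build the generator restricted
# to the transient states 0..m-1.
QT = np.zeros((m, m))
for k in range(m):
    fail_rate = (m - k) * lam        # one more unit fails
    repair_rate = min(k, n) * mu     # a repair completes (at most n in service)
    QT[k, k] = -(fail_rate + repair_rate)
    if k + 1 < m:
        QT[k, k + 1] = fail_rate
    if k - 1 >= 0:
        QT[k, k - 1] = repair_rate

# Mean time to absorption from each transient state solves (-QT) t = 1.
t = np.linalg.solve(-QT, np.ones(m))
mttf = t[0]                          # all units initially operational
```

For non-exponential X and Y the process is no longer Markovian and this linear-system shortcut fails, which is where the GPSS World simulation of the abstract takes over.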
Reliability, maintainability, and availability (RAM) are three system attributes that are of great interest to systems engineers, logisticians, and users. Collectively, they affect both the utility and the life-cycle costs of a product or system. The origins of contemporary reliability engineering can be traced to World War II. The discipline's first concerns were electronic and mechanical components. However, current trends point to a dramatic rise in the number of industrial, military, and consumer products with integrated computing functions. Because of the rapidly increasing integration of computers into products and systems used by consumers, industry, governments, and the military, reliability must consider both hardware and software. Maintainability models present some interesting challenges. The time to repair an item is the sum of the time required for evacuation, diagnosis, assembly of resources (parts, bays, tools, and mechanics), repair, inspection, and return. Administrative delay (such as holidays) can also affect repair times. Often these sub-processes have a minimum completion time that is not zero, so the distributions used to model maintainability have a threshold parameter, defined as the minimum probable time to repair. Estimation of maintainability can be further complicated by queuing effects, resulting in times to repair that are not independent. This dependency frequently makes analytical solution of problems involving maintainability intractable and promotes the use of simulation to support analysis.
Reliability measures of a computer system with independent constant failures of hardware and software components have been evaluated using a semi-Markov process and the regenerative point technique. A single server is provided immediately to repair the system at hardware failure, while the software is upgraded whenever it fails to perform the desired functions properly. Hardware repair and software upgradation are perfect. The random variables are statistically independent. The failure times of hardware and software components follow a negative exponential distribution, whereas the distributions for their repair and upgradation times are taken as arbitrary with different probability density functions. The graphical behaviour of some measures of system effectiveness has been observed for arbitrary values of various parameters and costs. The profit of the present model has also been compared with that of models developed under component-wise redundancy. Keywords: Computer System, Independent Fail...
Computers & Electrical Engineering, 1984
Current technology allows sufficient redundancy in fault-tolerant computer systems to insure that the failure probability due to exhaustion of spares is low. Consequently, the major cause of failure is the inability to correctly detect, isolate, and reconfigure when faults are present. Reliability estimation tools must be flexible enough to accurately model this critical fault-handling behavior and yet remain computationally tractable. This paper discusses reliability modeling techniques based on a behavioral decomposition that provides tractability by separating the reliability model along temporal lines into nearly disjoint fault-occurrence and fault-handling submodels. An Extended Stochastic Petri Net (ESPN) model provides the needed flexibility for representing the fault-handling behavior, while a nonhomogeneous Markov chain accounts for the possibly non-Poisson fault-occurrence behavior. Since the submodels are separate, the ESPN submodel, in which all time constants are of the same order of magnitude, can be simulated. The nonhomogeneous Markov chain is solved analytically, and the result is a hybrid model. The method of coverage factors, used to combine the submodels, is generalized to more accurately reflect the fault-handling effectiveness within the fault-occurrence model. However, due to approximations made in the aggregation of the two submodels and inaccurate estimation of component failure rates and other model parameters, errors can still arise in the subsequent reliability predictions. The accuracy of the model predictions is evaluated analytically, and error bounds on the system reliability are produced. These modeling techniques have been implemented in the HARP (Hybrid Automated Reliability Predictor) program.
Journal of Systems and Software, 1986
We present an effective technique for the combined performance and reliability analysis of multimode computer systems. A reward rate (or a performance level) is associated with each mode of operation. The switching between different modes is characterized by a continuous-time Markov chain. Different types of service-interruption interactions (as a result of mode switching) are considered. We consider the execution time of a given job on such a system and derive the distribution of its completion time. A useful dual relationship between the completion time of a given job and the accumulated reward up to a given time is noted. We demonstrate the use of our technique by means of a simple example.
IEEE Transactions on Computers, 2003
The aim of our work is to provide a modeling framework for evaluating performability measures of Multipurpose, Multiprocessor Systems (MMSs). The originality of our approach is in the explicit separation between the architectural and environmental concerns of a system. The overall dependability model, based on stochastic reward nets, is composed of 1) an architectural model describing the behavior of system hardware and software components, 2) a service-level model, and 3) a maintenance policy model. The two latter models are related to the system utilization environment. The results can be used for supporting the manufacturer design choices as well as the potential end-user configuration selection. We illustrate the approach on a particular family of MMSs under investigation by a system manufacturer for Internet and e-commerce applications. As the systems are scalable, we consider two architectures: a reference one composed of 16 processors and an extended one with 20 processors. Then, we use the obtained results to evaluate the performability of a clustered system composed of four reference systems. We evaluate comprehensive measures defined with respect to the end-user service requirements and specific measures in relation to the distributed shared memory paradigm.
This paper presents a common-cause failure analysis of redundant computer systems. Three mathematical models representing a two-unit redundant computer system are developed. Computer system reliability, steady-state availability, mean time to failure and variance of time to failure formulas are developed. In addition, a general formula for redundant computer system steady-state availability is derived when the failed computer system repair times are described by an Erlangian distribution. The computer system reliability, availability and mean time to failure plots are shown.
NOTATION (common to all three models): λ, constant failure rate of a single computer; λ_c, constant common-cause failure rate of the redundant computer system; s, Laplace transform variable; R(t), reliability of the redundant computer system at time t; MTTF, mean time to failure of the redundant computer system; σ², variance of time to failure of the redundant computer system; j, jth state of the computer system, j = 0,1,2,3; P_j(t), probability that the redundant computer system is in state j at time t, for j = 0,1,2,3. In addition, the following symbols are defined separately for Models II and III. Model II: μ, constant repair rate of a single computer. Model III: μ, constant repair rate of a single computer; pdf, probability density function; μ_c(x), g(x), the repair rate and pdf of repair times, respectively, when the failed redundant computer system is in state 3 and has an elapsed repair time of x; μ(x), h(x), the repair rate and pdf of repair times, respectively, when the failed redundant computer system is in state 2 and has an elapsed repair time of x; P_j(x,t), the probability density (with respect to repair time) that the failed redundant computer system is in state j at time t and has an elapsed repair time of x, for j = 2,3.
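The abstract does not reproduce the models' transition structure, so the sketch below assumes a simple no-repair shape for the two-unit system with common-cause failures and cross-checks the linear-solve MTTF against the corresponding closed-form expression. All rate values are assumed for illustration.

```python
import numpy as np

# Assumed no-repair model: state 0 = both units up, state 1 = one unit up,
# plus an absorbing failed state reached by common-cause failure (rate lam_c)
# or by exhaustion of units.
lam = 0.002     # assumed single-computer failure rate
lam_c = 0.0005  # assumed common-cause failure rate of the pair

# Generator restricted to the transient states {0, 1}; failure transitions
# leak to the absorbing state.
QT = np.array([[-(2 * lam + lam_c), 2 * lam],
               [0.0,                -(lam + lam_c)]])
mttf = np.linalg.solve(-QT, np.ones(2))[0]

# Same quantity built term by term: mean dwell in state 0, plus (if the first
# event was not a common-cause failure) the mean dwell in state 1.
closed = 1 / (2 * lam + lam_c) \
    + (2 * lam / (2 * lam + lam_c)) * (1 / (lam + lam_c))
```

With `lam_c = 0` the same expressions reduce to the textbook two-unit parallel MTTF, so the common-cause term directly quantifies how much the shared failure mode erodes the benefit of redundancy.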
Sadhana, 1987
The reliability of a system is the probability that the system will perform its intended mission under given conditions. This paper provides an overview of the approaches to reliability modelling and identifies their strengths and weaknesses. The models discussed include structure models, simple stochastic models and decomposable stochastic models. Ignoring time-dependence, structure models give reliability as a function of the topological structure of the system. Simple stochastic models make direct use of the properties of underlying stochastic processes, while decomposable models consider more complex systems and analyse them through subsystems. Petri nets and dataflow graphs facilitate the analysis of complex systems by providing a convenient framework for reliability analysis.
Advances in systems science and applications, 2018
This paper discusses a unified approach to the reliability, availability and performability analysis of complex engineering systems. The theoretical basis of this approach is continuous-time, discrete-state Markov processes with rewards. From a reliability modeling point of view, complex systems are systems with static and dynamic redundancy, imperfect fault coverage, various recovery strategies, multilevel operation and varying severity of failure states. We propose a unified method for calculating the reliability, availability and performability indices based on the definition of special forms of the reward matrix. This method proves effective in calculating both cumulative and instantaneous measures in steady-state and transient cases. We describe special analytical software that implements the suggested method. We demonstrate the flexibility of the proposed method and software by analyzing a multilevel process unit with protection and a demand-based warm standby system.
2006
Reliability failure mechanisms, such as time-dependent dielectric breakdown, electromigration, and thermal cycling, have become a key concern in processor design. The traditional approach to reliability qualification assumes that the processor will operate at maximum performance continuously under worst-case voltage and temperature conditions. However, the typical processor spends a very small fraction of its operational time at maximum voltage and temperature. In this paper, we show how this results in a reliability "slack" that can be leveraged to provide increased performance during periods of peak processor demand. We develop a novel, real-time reliability model based on workload-driven conditions. We then propose a new dynamic reliability management (DRM) scheme that results in a 20-35% performance improvement during periods of peak computational demand while ensuring the required reliability lifetime.
International Journal of Computer Applications
In this paper, an effort for the stochastic analysis of a computer system has been made considering the idea of hardware redundancy in cold standby. The hardware and software failures occur independently in the computer system with some probability. A single server is employed immediately to conduct hardware repair and software upgradation on need basis. The repair and up-gradation activities performed by the server are perfect. The time to hardware and software failures follows negative exponential distribution, whereas the distributions of hardware repair and software upgradation times are taken as arbitrary with different probability density functions. The expressions for various reliability measures are derived in steady state using semi-Markov process and regenerative point technique. The graphs are drawn for arbitrary values of the parameters to depict the behaviour of some important performance measures of the system model.