2021 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2021
This paper presents a novel sentiment analysis-based approach for error detection in large-scale systems, particularly high-performance computing (HPC) systems designed for exascale computing. It offers a machine learning framework to automatically create a sentiment lexicon from system log messages and uses this lexicon to accurately identify system errors and problematic nodes with an average F-score of 96%. The approach outperforms traditional machine learning methods, indicating the effectiveness of leveraging sentiment in failure log analysis.
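The abstract does not spell out the lexicon-construction pipeline; the sketch below only conveys the flavor of the idea, bootstrapping token polarities from a handful of labeled log lines. The labels, tokenizer, and scoring rule are all invented for illustration and are not the paper's method.

```python
# Hypothetical sketch: derive a "sentiment" lexicon for log tokens from a
# small set of labeled log lines, then score unseen messages with it.
from collections import Counter

labeled_logs = [
    ("machine check interrupt asserted on node 12", "error"),
    ("correctable memory error threshold exceeded", "error"),
    ("job 4417 started on 128 nodes", "normal"),
    ("link established, negotiated 100G full duplex", "normal"),
]

err_counts, ok_counts = Counter(), Counter()
for line, label in labeled_logs:
    target = err_counts if label == "error" else ok_counts
    target.update(line.lower().split())

def token_polarity(tok, smoothing=1.0):
    """Positive score => token leans toward error messages."""
    e = err_counts[tok] + smoothing
    n = ok_counts[tok] + smoothing
    return (e - n) / (e + n)

def message_score(line):
    toks = line.lower().split()
    return sum(token_polarity(t) for t in toks) / max(len(toks), 1)

print(message_score("uncorrectable memory error on node 7"))  # leans error
print(message_score("job 88 started on 64 nodes"))            # leans normal
```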
Empirical Software Engineering, 2015
Predicting system failures can be of great benefit to managers, who gain better command over system performance. The logs that systems generate are a valuable source of information for predicting system reliability, so there is an increasing demand for tools that mine logs and provide accurate predictions. However, interpreting the information in logs poses some challenges. This study discusses how to effectively mine sequences of logs and provide correct predictions. The approach integrates different machine learning techniques to control for data brittleness, ensure sound model selection and validation, and increase the robustness of classification results. We apply the proposed approach to log sequences of 25 different applications of a software system for car telemetry and performance. On this system, we discuss the ability of support vector machines with three well-known kernels (multilayer perceptron, radial basis function, and linear) to fit and predict defective log sequences. Our results show that a good analysis strategy provides stable, accurate predictions; such a strategy must at least require high fitting ability of the models used for prediction. We demonstrate that such models give excellent predictions both on individual applications (e.g., 1% false positive rate, 94% true positive rate, and 95% precision) and across system applications (on average, 9% false positive rate, 78% true positive rate, and 95% precision). We also show that these results hold for different degrees of sequence defectiveness. To put our results in context, we compare them with recent studies in system log analysis. We conclude with recommendations drawn from our study.
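As a hedged illustration of the kernel comparison described above (not the study's code; the features, labels, and data below are synthetic stand-ins), scikit-learn's SVC can be cross-validated with linear, RBF, and sigmoid kernels, the sigmoid kernel being the one commonly referred to as the MLP kernel:

```python
# Illustrative only: compare SVM kernels on fixed-length feature vectors
# that stand in for per-log-sequence features.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                  # stand-in sequence features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # stand-in defect labels

for kernel in ("linear", "rbf", "sigmoid"):
    clf = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    scores = cross_val_score(clf, X, y, cv=5, scoring="precision")
    print(f"{kernel:8s} precision: {scores.mean():.2f} +/- {scores.std():.2f}")
```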
IBM Journal of Research and Development, 2000
Modern computer systems generate an enormous number of logs. IBM Mining Effectively Large Output Data Yield (MELODY) is a unique and innovative solution for handling these logs and filtering out anomalies and failures. MELODY can detect system errors early and avoid subsequent crashes by identifying the root causes of such errors. By analyzing the logs leading up to a problem, MELODY can pinpoint when and where things went wrong and present them visually to the user, ensuring that corrections are made accurately and effectively. We present the MELODY solution and describe its architecture, algorithmic components, functions, and benefits. After being trained on a large portion of relevant data, MELODY provides alerts of abnormalities in newly arriving log files or in streams of logs. The solution is used regularly by IBM services groups that support IBM xSeries servers. MELODY was recently tested with ten large IBM customers who use zSeries machines and was found to be extremely useful for the information technology experts in those companies. They found that the solution's ability to reduce extensively large log data to manageable sets of highlighted messages saved them time and helped them make better use of the data.
Proceedings of the 2nd …, 2007
System logs, such as the Windows Event log or the Linux system log, are an important resource for computer system management. We present a method for ranking system log messages by their estimated value to users, and generating a log view that displays the most important messages. The ranking process uses a dataset of system logs from many computer systems to score messages. For better scoring, unsupervised clustering is used to identify sets of systems that behave similarly. We propose a new feature construction scheme that measures the difference in the ranking of messages by frequency, and show that it leads to better clustering results. The expected distribution of messages in a given system is estimated using the resulting clusters, and log messages are scored using this estimation. We show experimental results from tests on xSeries servers. A tool based on the described methods is being used to aid support personnel in the IBM xSeries support center.
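A minimal sketch of the scoring idea, under the assumption that a set of similarly behaving systems has already been identified by clustering (the smoothing constant and message vocabulary are invented for illustration):

```python
# Rough sketch, not the paper's code: estimate each message type's expected
# frequency from a cluster of similar systems, then rank a target system's
# messages by how surprising they are relative to that expectation.
from collections import Counter

cluster_logs = [  # message streams from systems judged to behave similarly
    ["svc_start", "net_up", "svc_start", "disk_ok"],
    ["svc_start", "net_up", "disk_ok", "disk_ok"],
]
target_log = ["svc_start", "net_up", "disk_fail", "disk_ok"]

expected = Counter()
for log in cluster_logs:
    expected.update(log)
total = sum(expected.values())

def surprise(msg, smoothing=0.5):
    # Rare-in-cluster messages score high and float to the top of the view.
    p = (expected[msg] + smoothing) / (total + smoothing * len(expected))
    return 1.0 / p

ranked = sorted(set(target_log), key=surprise, reverse=True)
print(ranked)  # 'disk_fail' (never seen in the cluster) ranks first
```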
IEEE Transactions on Cloud Computing, 2017
High performance computing systems comprising hundreds or thousands of computational nodes can generate a high volume of system log entries at a high data velocity. Analyzing these logs soon after they are generated is a significant challenge, due to the complexity of log messages, the speed at which they are produced, and the lack of a method to quickly map or categorize messages into meaningful sets. The impact of this problem is that it is not possible to comprehensively glean timely information from logs about the overall system or the health of individual nodes. In this paper, we address this problem through the development of a novel approach for system log analysis based on a Markov random field (MRF) that can quickly categorize system log messages into multiple categories based on representative training examples provided by a user. We present a theoretical model of our approach, followed by an extensive evaluation of the accuracy and performance of the implementation of our model. We found that our MRF-based approach can quickly categorize system log messages with a high degree of accuracy.
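One classical way to run MRF inference of this kind is iterated conditional modes (ICM). The toy sketch below is an assumption-laden illustration, not the paper's model: it uses a Jaccard-similarity unary term against user-provided examples and a pairwise term that rewards agreeing with similar neighboring messages.

```python
# Simplified MRF-style categorization via ICM sweeps; all data, weights,
# and the similarity measure are invented for illustration.
def jaccard(a, b):
    a, b = set(a.split()), set(b.split())
    return len(a & b) / max(len(a | b), 1)

examples = {  # representative training examples per category
    "memory": "correctable memory error detected",
    "network": "link down on port eth0",
}
messages = [
    "memory error corrected on dimm 3",
    "uncorrectable memory error",
    "port eth1 link down",
    "link flap detected on port eth0",
]

# Initial labels from the unary term alone.
labels = [max(examples, key=lambda c: jaccard(m, examples[c])) for m in messages]

# ICM sweeps: each message re-picks the category maximizing unary + pairwise.
for _ in range(3):
    for i, m in enumerate(messages):
        def score(cat):
            unary = jaccard(m, examples[cat])
            pairwise = sum(jaccard(m, messages[j])
                           for j in range(len(messages))
                           if j != i and labels[j] == cat)
            return unary + 0.5 * pairwise
        labels[i] = max(examples, key=score)

print(list(zip(messages, labels)))
```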
2009
Log preprocessing, a process applied to the raw log before applying a predictive method, is of paramount importance to failure prediction and diagnosis. While existing filtering methods have demonstrated good compression rates, they fail to preserve important failure patterns that are crucial for failure analysis. To address this problem, in this paper we present a log preprocessing method. It consists of three integrated steps: (1) event categorization, to uniformly classify system events and identify fatal events; (2) event filtering, to remove temporally and spatially redundant records while preserving failure patterns necessary for failure analysis; (3) causality-related filtering, to combine correlated events for filtering through Apriori association rule mining. We demonstrate the effectiveness of our preprocessing method using real failure logs collected from the Cray XT4 at ORNL and the Blue Gene/L system at SDSC. Experiments show that our method preserves more failure patterns for failure analysis, thereby improving failure prediction by up to 174%.
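A minimal sketch of the temporal/spatial filtering step (step 2), assuming simplified log records and an invented dedup window; the paper's actual thresholds and record format differ:

```python
# Drop repeats of the same event type within a fixed window, across nodes
# (spatial) and in time (temporal), while keeping distinct event types so
# failure patterns such as warning-then-failure survive filtering.
WINDOW = 60.0  # seconds; illustrative value

events = [  # (timestamp, node, event_type)
    (0.0,   "n1", "ECC_WARN"),
    (5.0,   "n1", "ECC_WARN"),   # temporal repeat on the same node -> drop
    (5.0,   "n2", "ECC_WARN"),   # spatial repeat within the window -> drop
    (120.0, "n1", "ECC_WARN"),   # outside the window -> keep
    (121.0, "n1", "NODE_FAIL"),  # different type -> keep (failure pattern)
]

last_seen = {}   # event_type -> timestamp of last kept occurrence (any node)
filtered = []
for ts, node, etype in sorted(events):
    prev = last_seen.get(etype)
    if prev is not None and ts - prev < WINDOW:
        continue                 # redundant in time (and across nodes)
    last_seen[etype] = ts
    filtered.append((ts, node, etype))

print(filtered)
```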
2009
Supercomputers are prone to frequent faults that adversely affect their performance, reliability and functionality. System logs collected on these systems are a valuable resource of information about their operational status and health. However, their massive size, complexity, and lack of standard format make it difficult to automatically extract information that can be used to improve system management. In this work we propose a novel method to succinctly represent the contents of supercomputing logs, by using textual clustering to automatically find the syntactic structures of log messages. This information is used to automatically classify messages into semantic groups via an online clustering algorithm. Further, we describe a methodology for using the temporal proximity between groups of log messages to identify correlated events in the system. We apply our proposed methods to two large, publicly available supercomputing logs and show that our technique features nearly perfect accuracy for online log-classification and extracts meaningful structural and temporal message patterns that can be used to improve the accuracy of other log analysis techniques.
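The clustering algorithm itself is not reproduced here; the heuristic sketch below only conveys how variable tokens can be masked to expose the syntactic structure (template) shared by messages. The regexes and sample messages are assumptions, not the paper's patterns.

```python
# Toy template extraction: mask numbers and node-id-looking tokens so that
# messages with the same fixed structure collapse into one group.
import re
from collections import defaultdict

raw = [
    "job 4417 started on node c3-0c1s2n1",
    "job 9001 started on node c1-0c0s7n3",
    "link error on port 12, retrying",
    "link error on port 7, retrying",
]

def template(msg):
    msg = re.sub(r"\bc\d+-\d+c\d+s\d+n\d+\b", "<NODE>", msg)  # Cray-style ids
    msg = re.sub(r"\d+", "<NUM>", msg)                        # plain numbers
    return msg

groups = defaultdict(list)
for msg in raw:
    groups[template(msg)].append(msg)

for tpl, members in groups.items():
    print(f"{len(members):2d}x  {tpl}")
```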
2002
The ability to track and analyze every possible fault condition, whether transient (soft) or permanent (hard), is one of the most critical requirements for large-scale cluster computer systems. All such events are generally termed "RAS events" (RAS for Reliability, Availability, and Serviceability).
Information and Software Technology, 2019
Context: A large amount of information about system behavior is stored in logs that record system changes. Such information can be exploited to discover anomalies of a system and the operations that cause them. Given their large size, manual inspection of logs is hard and infeasible within a desired timeframe (e.g., real time), especially for critical systems. Objective: This study proposes a semi-automated method for reconstructing sequences of tasks of a system, revealing system anomalies, and associating tasks and anomalies to code components. Method: The proposed approach uses unsupervised machine learning (Latent Dirichlet Allocation) to discover latent topics in messages of log events and introduces a novel technique based on pattern recognition to derive the semantics of such topics (topic labelling). The approach has been applied to the big data generated by the ALMA telescope system, consisting of more than 2,000 log events collected in about five hours of telescope operation. Results: With the application of our approach to such data, we were able to model the behavior of the telescope over 16 different observations. We found five different behavior models and three different types of errors. We use the models to interpret each error and discuss its cause. Conclusions: With this work, we have also been able to discuss some of the known challenges in log mining. The experience we gathered is summarized as lessons learned.
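For the topic-discovery step, a hedged sketch with scikit-learn's LatentDirichletAllocation; the event texts, topic count, and top-term printout are illustrative, and the paper's topic-labelling technique is not reproduced:

```python
# Discover latent topics in a handful of made-up telescope-style log events.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

log_events = [
    "antenna az drive started tracking source",
    "antenna el drive started tracking source",
    "correlator buffer overflow, dropping frames",
    "correlator buffer resync completed",
    "antenna drive stopped, stow position reached",
]

vec = CountVectorizer()
X = vec.fit_transform(log_events)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

terms = vec.get_feature_names_out()
for k, comp in enumerate(lda.components_):
    top = comp.argsort()[-4:][::-1]   # four highest-weight terms per topic
    print(f"topic {k}:", ", ".join(terms[i] for i in top))
```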
ArXiv, 2020
Log-based analysis and troubleshooting has remained a prevalent and commonly used approach for centralized and time-sharing systems. However, for parallel and distributed systems, where happen-before relations are not directly available between events, it becomes a challenge to depend fully on log-based analysis. This article provides solutions using log-based performance analysis of centralized systems, demonstrates the results and their effectiveness, and presents the challenges and proposes solutions for performance analysis in distributed and parallel systems.
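As background on the happen-before problem the article raises (this is the standard Lamport logical-clock technique, not necessarily the article's proposed solution), logical clocks recover a partial ordering that raw wall-clock log timestamps cannot provide:

```python
# Lamport clocks: every event increments a counter; a receive advances the
# receiver's clock past the sender's stamp, so send happens-before receive.
class Process:
    def __init__(self, name):
        self.name, self.clock = name, 0

    def local_event(self, what):
        self.clock += 1
        print(f"[{self.name} @ {self.clock}] {what}")

    def send(self, what):
        self.clock += 1
        print(f"[{self.name} @ {self.clock}] send: {what}")
        return self.clock, what

    def receive(self, stamped):
        ts, what = stamped
        self.clock = max(self.clock, ts) + 1
        print(f"[{self.name} @ {self.clock}] recv: {what}")

p, q = Process("P"), Process("Q")
p.local_event("open file")
msg = p.send("request block")
q.local_event("unrelated work")
q.receive(msg)  # Q's clock jumps past P's send stamp
```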
IEEE Access, 2022
System logs are the first source of information available to system designers to analyze and troubleshoot their cluster systems. For example, High-Performance Computing (HPC) systems generate a large volume of heterogeneous data from multiple subsystems, so the idea of using a single source of data to achieve a given goal, such as identification of failures, is losing its validity. System log-analysis tools help system designers gain insight into a large volume of system logs. They enable system designers to perform various analyses (e.g., diagnosing or predicting node failures). Current system log-analysis tools vary significantly in their function and design. We conduct a systematic review of the literature on system log-analysis tools and select 46 representative articles out of 3,758 initial articles. To the best of our knowledge, no prior work has studied the characteristics of log-correlation tools (LogCTs) with respect to four quality attributes: (a) spurious correlations, (b) correlation threshold settings, (c) outliers in the data, and (d) missing data. In this paper, we (a) propose a quality model to evaluate LogCTs and (b) use this quality model to evaluate and recommend current LogCTs. Through our review, we (a) identify papers on LogCTs, (b) build a quality model consisting of the four quality attributes and (c) discuss several open challenges for future research. Our study highlights the advantages and limitations of existing LogCTs and identifies research opportunities that could facilitate better failure handling in large cluster systems.
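As a toy illustration of one of the four quality attributes (correlation threshold settings; the data, threshold, and event types below are invented), a log-correlation tool might correlate per-window event counts and flag a pair only above a chosen threshold:

```python
# Correlate hourly counts of two event types; a threshold set too low
# invites exactly the spurious correlations the quality model warns about.
import numpy as np

rng = np.random.default_rng(1)
ecc_errors = rng.poisson(3, size=48)               # hourly counts, 2 days
node_fails = (ecc_errors > 4).astype(float) + rng.normal(0, 0.3, size=48)

r = np.corrcoef(ecc_errors, node_fails)[0, 1]
THRESHOLD = 0.7
print(f"r = {r:.2f} ->", "correlated" if abs(r) >= THRESHOLD else "not flagged")
```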