Problem Management Overview
HDI Capital Area Chapter September 16, 2009 Hugo Mendoza, Column Technologies
Problem Management Overview Agenda
  Overview of the ITIL framework
  Overview of Problem Management
    Definition of Problem Management
    Goals of Problem Management
    Business Value of Problem Management
    Problem Management Lifecycle
    Critical Success Factors of Problem Management
    Problem Management Roles and Responsibilities
    Problem Management Metrics
Challenges Facing IT
IT is constantly being asked to:
  Improve service quality
  Reduce the complexity of IT
  Reduce risk
  Lower the cost of operations
  Manage compliance
  Reduce the burden on an overworked IT workforce
  Manage the IT organization more like a business
ITIL can provide the framework for a strategy to make IT, and particularly Problem Management, more efficient.
Introduction to ITIL
ITIL is a framework for IT Service Management best practice produced by the OGC.
Adopting ITIL guidance offers a range of benefits that includes:
  Reduced costs
  Improved IT services through the use of proven best practice processes
  Improved customer satisfaction through a more professional approach to service delivery
  Standards and guidance
  Improved productivity
  Improved use of skills and experience
ITIL Statistics (Problem Management Focused)
From Pink Elephant:
  ITIL process improvements present senior IT management with an opportunity to improve efficiency and customer service quality, reduce IT workload and control costs by 20-40 per cent.
  Implementing key ITIL processes at Nationwide Insurance led to a 40% reduction of its systems outages. The company estimates a $4.3 million RoI over the next three years.
  An ITIL program at Capital One that began in [year missing] resulted in a 30% reduction in systems crashes and software-distribution errors, and a 92% reduction in 'business-critical' incidents in 2 years.
Agreed ITIL Disciplines
  Financial Management
  Demand Management
  Service Portfolio Management
  Service Catalogue Management
  Capacity Management
  Availability Management
  Service Level Management
  Information Security Management
  Vendor Management
  Continuity Management
  Planning and Support
  Change Management
  Incident Management
  Problem Management
  Service Validation & Testing Management
  Release & Deployment Management
  Service Evaluation Management
  Knowledge Management
  Event Management
  Request Fulfillment
  Access Management
  Service Asset & Configuration
ITIL (v3) Library Components
ITIL Core:
  Service Strategy
  Service Design
  Service Transition
  Service Operation
  Continual Service Improvement
ITIL Complementary Guidance
  Set of publications specific to:
    Industry sectors
    Organization types
    Operating models
    Technology architectures
Available via web
  [Link]
ITIL (v3) Service Operation
Service Operation
  Achieving effectiveness and efficiency in the delivery and support of services
  Ensuring value to customers
Topics
  Stability in service operations
  Managing availability
  Controlling demand
  Scheduling operations
  Fixing problems
Processes
  Event Management
  Incident Management
  Request Fulfillment
  Problem Management
  Access Management
Incident vs Problem
Incident Management restores normal service operation as quickly as possible and minimizes the adverse effect on business operations. ('Normal service operation' is defined here as service operation within Service Level Agreement (SLA) limits.)
Problem Management is the process that seeks to resolve the root cause of incidents, and thus to minimize the adverse impact on the business of incidents and problems caused by errors within the IT infrastructure, and to prevent recurrence of incidents related to these errors.
A 'problem' is an unknown underlying cause of one or more incidents, and a 'known error' is a problem that has been successfully diagnosed and for which either a workaround or a permanent resolution has been identified.
Service Operation Balance
Reactive vs Proactive
Reactive organizations
  Do not act unless triggered
  Reactive efforts tend to build until all work is reactive
Proactive organizations
  Always looking for ways to improve services
  Can be overly expensive
[Figure: balance scale running from Extremely Reactive to Extremely Proactive, with these captions:]
  An organization here is out of balance and is not able to effectively support the business strategy
  An organization here is quite balanced, but tends to fix services that are not broken, resulting in higher levels of change
Goal of Problem Management
Problem Management is both reactive and proactive in the identification and resolution of errors
Goal
  To minimize the adverse impact of Incidents and Problems caused by errors in the infrastructure and to proactively prevent the occurrence of Incidents, Problems and errors.
  Incident Management is concerned with restoring service
Objectives
  Resolve Problems quickly and effectively
  Ensure resources are prioritized to resolve Problems in the most appropriate order based on business need
  Proactively identify and resolve Problems and Known Errors to minimize or prevent Incidents from occurring
  Minimize the impact of Incidents that cannot be prevented
  Improve the productivity of support staff
  Provide relevant management information
Problem Management Definitions
Problem: The unknown root cause of one or more existing or potential Incidents.
Known Error: A fault in a CI identified by the successful diagnosis of a problem and for which a temporary workaround or permanent solution has been identified.
Workaround: A temporary remedy to eliminate or reduce interruption in service due to an Incident.
Urgency: A measure of business criticality of an Incident, Problem or Change where there is an effect upon business deadlines.
Impact: A measure of the effect that an Incident, Problem or Change might have on the business service being provided.
CI: A Configuration Item (CI) is any object being managed by the IT organization that is stored within the CMDB.
CMDB: A Configuration Management Database (CMDB) is a repository of all managed CIs and their associated relationships.
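The definitions above can be sketched as a small data model. This is only an illustration, assuming a simplified record layout: the class name `Problem`, its field names, and the `is_known_error` helper are hypothetical and do not come from ITIL or from any particular service desk tool. It shows the one rule the definitions state, namely that a Problem becomes a Known Error once it has been diagnosed and a workaround or permanent solution has been identified.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Problem:
    """Unknown root cause of one or more existing or potential Incidents."""
    problem_id: str
    affected_ci: str                           # name of the CI as stored in the CMDB
    incident_ids: List[str] = field(default_factory=list)
    root_cause: Optional[str] = None           # set once diagnosis succeeds
    workaround: Optional[str] = None           # temporary remedy
    permanent_solution: Optional[str] = None   # permanent fix, if identified

    def is_known_error(self) -> bool:
        """True when diagnosed AND a workaround or permanent solution exists."""
        diagnosed = self.root_cause is not None
        has_remedy = (self.workaround is not None
                      or self.permanent_solution is not None)
        return diagnosed and has_remedy
```

A record with a root cause but no remedy is still a Problem under this sketch; adding a workaround is what turns it into a Known Error.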
Scope of Problem Management
  Diagnosis of the root cause
    Identifies Known Errors
  Strong relationship with Knowledge Management
    Populates the Knowledge Management database
  Uses similar if not identical tools and categorization as Incident Management
  Key process area within the ITIL framework
KPIs for Problem Management
  Ratio of number of incidents versus number of problems, sometimes grouped by services and in some cases by CIs
  # of repeat problems (not incidents, problems)
  Balance of Problems solved with a KE - RFC or other
  Average problem closure duration
  % of unmodified/neglected problems
  % of problems with a root cause analysis
  Average cost to solve a problem
  % of problems with available workaround
KPIs continued
  Number of Incidents resolved by Problem resolution
  Costs incurred during Problem resolution
  Expected plans and timelines for open Problems and Errors
  Number of Incidents resolved using the Knowledge Base
Source: Column PM Scorecard and [Link]
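A few of the KPIs listed above reduce to simple arithmetic over problem records. The sketch below is a minimal illustration, assuming a hypothetical `ProblemRecord` layout and a hypothetical `problem_kpis` function; neither is taken from the Column PM Scorecard or from any tool. It computes the incident-to-problem ratio, the percentage of problems with a root cause analysis, the percentage with an available workaround, and the average closure duration in days (over closed problems only).

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class ProblemRecord:
    opened: datetime
    closed: Optional[datetime]       # None while the problem is still open
    linked_incidents: int            # incidents attributed to this problem
    has_root_cause_analysis: bool
    has_workaround: bool


def problem_kpis(records: List[ProblemRecord]) -> dict:
    """Compute a handful of Problem Management KPIs from problem records."""
    if not records:
        raise ValueError("at least one problem record is required")
    total = len(records)
    closed = [r for r in records if r.closed is not None]
    if closed:
        seconds = sum((r.closed - r.opened).total_seconds() for r in closed)
        avg_closure_days = seconds / len(closed) / 86400
    else:
        avg_closure_days = None      # nothing closed yet, so no average
    return {
        "incidents_per_problem": sum(r.linked_incidents for r in records) / total,
        "pct_with_root_cause_analysis":
            100.0 * sum(r.has_root_cause_analysis for r in records) / total,
        "pct_with_workaround":
            100.0 * sum(r.has_workaround for r in records) / total,
        "avg_closure_days": avg_closure_days,
    }
```

Grouping by service or by CI, as the first KPI suggests, would amount to calling the same function once per group of records.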
Business Value of Problem Management
Value to Business
Problem Management reduces the Known Errors in the environment, resulting in improved availability and fewer incidents.
Other benefits
  Higher availability of IT services
  Higher productivity of business and IT staff
  Reduced expenditure on workarounds or fixes that do not work
  Reduction in cost of effort in firefighting or resolving repeat incidents
  Better first-time fix rate of the Service Desk
  Improved organizational learning
Reactive vs. Proactive Problem Management
Reactive Problem Management
Reactive problem management seeks to cure the symptoms of problems. The reactive approach responds to reports of incidents that have already occurred.
  Problem Control Activities
  Error Control Activities
Proactive Problem Management
Proactive problem management seeks to inoculate IT systems against problems. The proactive approach identifies potential problems before they emerge.
  Trend Analysis
  Targeting Preventative Action
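Trend analysis can be as simple as counting incidents per CI over a reporting period and flagging the CIs that recur. The sketch below is one possible approach under stated assumptions: the function name `proactive_candidates` and the default threshold of three incidents are hypothetical choices for illustration, not values prescribed by ITIL.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def proactive_candidates(incident_cis: Iterable[str],
                         threshold: int = 3) -> List[Tuple[str, int]]:
    """Return (CI, incident count) pairs for CIs at or above the threshold.

    incident_cis holds one CI name per incident logged in the period.
    Results are sorted by descending count, then by CI name.
    """
    counts = Counter(incident_cis)
    flagged = [(ci, n) for ci, n in counts.items() if n >= threshold]
    return sorted(flagged, key=lambda item: (-item[1], item[0]))
```

Each flagged CI is a candidate for a proactively raised Problem record, so preventative action can be targeted where incidents cluster.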
Problem Management High Level Process
Inputs
  Incident details
  Workarounds
  Configuration details
  IT infrastructure details
  Known Errors from Releases
Problem Management
Outputs
  Known Errors
  Requests for Change (RFCs)
  Problem Records
  Management Information
Problem Management Activities
Problem control
  Problem identification and recording
  Problem classification
  Problem investigation and diagnosis
  RFC and possible resolution and closure
  Tracking and monitoring of problems
Error control
  Error identification and recording
  Error assessment
  Recording error resolution
  Error closure
  Monitoring resolution progress
Assistance with the handling of major Incidents
Proactive prevention of Problems
  Trend analysis
  Targeting support action
  Providing information to the organization
Obtaining management information from Problem data
Completing major Problem reviews
Problem Management Lifecycle
Problem Control
  Problem Identification and Recording
  Problem Classification
  Problem Investigation and Diagnosis
  RFC and possible Resolution and Closure
  Tracking and Monitoring of Problems (throughout)
  To Error Control: Known Error, Workaround, Solution
Error Control
  Error Identification and Recording
  Error Assessment
  Record Error Resolution
  RFC
  Successful Change Implementation
  Close Error and Associated Problems
  Tracking and Monitoring of Errors (throughout)
Note: Error Control does not require a Problem to begin tracking and resolution of Errors
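The Error Control stages named above can be read as an ordered sequence that a tool would step a record through. The sketch below is a simplified illustration that assumes a strictly linear flow; the `ErrorStage` enum and the `advance` function are hypothetical names, and a real process would also allow loops such as a rejected RFC or a failed change.

```python
from enum import Enum


class ErrorStage(Enum):
    """Error Control stages, in the order the lifecycle lists them."""
    IDENTIFIED_AND_RECORDED = "Error Identification and Recording"
    ASSESSED = "Error Assessment"
    RESOLUTION_RECORDED = "Record Error Resolution"
    RFC_RAISED = "RFC"
    CHANGE_IMPLEMENTED = "Successful Change Implementation"
    CLOSED = "Close Error and Associated Problems"


# Enum members iterate in definition order, which gives the stage sequence.
_ORDER = list(ErrorStage)


def advance(stage: ErrorStage) -> ErrorStage:
    """Return the next stage; a closed error cannot be advanced further."""
    index = _ORDER.index(stage)
    if index == len(_ORDER) - 1:
        raise ValueError("error record is already closed")
    return _ORDER[index + 1]
```

Tracking and monitoring of errors then amounts to recording which stage each error is in and how long it has been there.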
Problem Management Critical Success Factors
Effective automated registration of Incidents
  Should be linked with Incident records
Setting obtainable objectives and making use of skills of the Problem-solving team
Good cooperation between Incident Management and Problem Management
Setting aside time for true proactive Problem Management
  A little time goes a long way to reduce the number of Incidents
  Over time, the reactive part of Problem Management will be reduced and more time spent on proactive Problem Management
  Focus on key Problems that cause the greatest pain
Errors in released software should be incorporated into the Known Error database for live services
Well-defined Problem Management Roles
Problem Management Roles
Roles
Problem Manager
  Person or people responsible for Problem Management
  Responsible for:
    Liaison with all problem resolution groups
    Formal closure of all Problem Records
    Developing and maintaining relationships with suppliers and 3rd parties
    Major Problem Reviews
Problem Solving Groups
  Technical groups and/or suppliers
  Responsible for problem solving
Problem Management Process Owner
  Overall authority and responsibility for the process metrics, policies and procedures
Knowledge Manager
  Responsible for the quality and integrity of the Knowledge Database
Problem Management Metrics (KPIs)
Problem Management Key Pitfalls
  Poor link between Incident Management and Problem Management
  Lack of management commitment
  Insufficient time and resources to build and maintain the knowledge base
  Ineffective communication of Known Errors from the development environment to the live environment
  Organizational resistance to change
Questions and Answers