
University of California San Francisco
Information Technology Service

IT ENTERPRISE PROBLEM MANAGEMENT PROCESS

VERSION 1.0
January 28, 2014

This document contains confidential, proprietary information intended for internal use only and is not to be distributed outside the University of California, San Francisco (UCSF) without an appropriate non-disclosure agreement in force. Its contents may be changed at any time and create neither obligations on UCSF’s part nor rights in any third person.

UCSF Information Technology and Service
Problem Management Process

Table of Contents

1 DOCUMENT INFORMATION
  1.1 ABOUT THIS DOCUMENT
  1.2 WHO SHOULD USE THIS DOCUMENT?
  1.3 SUMMARY OF CHANGES
  1.4 REVIEW AND APPROVAL DISTRIBUTION LIST
2 INTRODUCTION
  2.1 MANAGEMENT SUMMARY
  2.2 GOAL OF PROBLEM MANAGEMENT
  2.3 PROBLEM MANAGEMENT MISSION STATEMENT
  2.4 BENEFITS
  2.5 PROCESS DEFINITION
  2.6 OBJECTIVES
  2.7 DEFINITIONS
  2.8 SCOPE OF PROBLEM MANAGEMENT
  2.9 INPUTS AND OUTPUTS
  2.10 METRICS
3 ROLES AND RESPONSIBILITIES
  3.1 PROBLEM MANAGEMENT PROCESS OWNER
  3.2 PROBLEM MANAGER
  3.3 SUPPORT GROUP STAFF
  3.4 FUNCTIONAL MANAGERS
  3.5 SERVICE DESK
  3.6 SERVICE OWNER
  3.7 PROBLEM OWNER
  3.8 PROBLEM ANALYST
  3.9 PROBLEM REPORTER
  3.10 PROBLEM MANAGEMENT REVIEW TEAM
  3.11 SOLUTION PROVIDER GROUP
  3.12 INTEGRATION WITH OTHER PROCESSES
4 PROBLEM CATEGORIZATION AND PRIORITIZATION
  4.1 CATEGORIZATION
  4.2 PRIORITY DETERMINATION
  4.3 WORKAROUNDS
  4.4 KNOWN ERROR RECORD
  4.5 MAJOR PROBLEM REVIEW
5 PROCESS FLOW
  5.1 HIGH LEVEL REACTIVE PROBLEM MANAGEMENT FLOW
  5.2 SWIM LANE FLOW DIAGRAM
  5.3 PROCESS ACTIVITIES
6 RACI MATRIX
  6.1 ROLE DESCRIPTION
  6.2 ROLE MATRIX
7 REPORTS AND MEETINGS
  7.1 CRITICAL SUCCESS FACTORS
  7.2 KEY PERFORMANCE INDICATORS
  7.3 REPORTS
8 PROBLEM POLICY

UCSF – Internal Use Only

1 Document Information

1.1 About this document

This document describes the Problem Management Process. The Process provides a consistent method to follow when working to resolve severe or recurring issues regarding services from the UCSF IT Enterprise.

1.2 Who should use this document?

This document should be used by:

• IT Enterprise personnel responsible for the restoration of services and for problem root cause analysis/remediation.
• IT Enterprise personnel involved in the operation and management of the Problem Process.

1.3 Summary of changes

This section records the history of significant changes to this document.

Where significant changes are made to this document, the version number will be incremented by 1.0.

Where changes are made for clarity and reading ease only and no change is made to the meaning or intention of this document, the version number will be increased by 0.1.

Version   Date        Author          Description of change
1.0       1/28/2014   Jeff Franklin   Initial version

1.4 Review and Approval Distribution List

Name
Network
Server
DNS
Application
IT Facilities
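As an illustration of the numbering rule in section 1.3, the following Python sketch applies it literally: a significant change adds 1.0 and a clarity-only change adds 0.1. The function name `next_version` and its parameters are hypothetical and do not come from this document. The sketch also assumes that a significant change does not reset the minor digit, since the text says only that the number is incremented by 1.0.

```python
from decimal import Decimal

# Hypothetical constants and helper; the names are illustrative only.
SIGNIFICANT_STEP = Decimal("1.0")  # change to meaning or intention
CLARITY_STEP = Decimal("0.1")      # change for clarity and reading ease only


def next_version(current: str, significant: bool) -> str:
    """Return the next document version under a literal reading of section 1.3."""
    step = SIGNIFICANT_STEP if significant else CLARITY_STEP
    # Decimal avoids binary floating-point artifacts such as 1.2000000000000002.
    return str(Decimal(current) + step)


print(next_version("1.0", significant=False))  # 1.1
print(next_version("1.1", significant=True))   # 2.1
```

Under this literal reading, a significant change to version 1.1 yields 2.1 rather than 2.0. If a reset of the minor digit is intended, the rule would need to state it.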

2 Introduction

2.1 Management Summary

This document provides both an overview and a detailed description of the UCSF IT Enterprise Problem Management process and covers the requirements of the various stakeholder groups.

The Problem Management process is designed to fulfil the overall goal of unified, standardized and repeatable handling of all Problems managed by UCSF IT Enterprise. Problem Management is the process responsible for managing the lifecycle of all problems. The Problem Management Process works in conjunction with other IT Enterprise processes related to ITIL and ITSM in order to provide quality IT services and increased value to UCSF.

2.2 Goal of Problem Management

The goals of Problem Management and Incident Management can be in direct conflict. Both processes aim to restore unavailable or affected service to the customer. The primary goal of the Incident Management function is to restore this service as quickly as possible, whereas the speed with which a resolution for the Problem is found is only of secondary importance to the Problem Management process. Investigation of the underlying cause of the Problem is the main concern of the Problem Management process.

Problem Management activities reduce the number of incidents by creating structural solutions for errors in the infrastructure, and they provide Incident Management with information to circumvent errors and minimize loss of service. The Problem Management process has both reactive and proactive aspects. The reactive aspect is concerned with solving problems in response to one or more incidents. Proactive Problem Management focuses on the prevention of incidents by identifying and solving problems before incidents occur.

The primary goals of Problem Management are to:

• Prevent problems and resulting incidents from happening.
• Eliminate recurring incidents.
• Minimize the impact of incidents that cannot be prevented.

2.3 Problem Management Mission Statement

To maximize IT service quality by performing root cause analysis to rectify what has gone wrong and prevent re-occurrences. This requires both reactive and proactive procedures to effect resolution and prevention, in a timely and economic fashion.

2.4 Benefits

Problem Management works together with Incident Management, Change Management, and Configuration Management to ensure that IT service availability and quality are increased. When incidents are resolved, information about the resolution is recorded. Over time, this information is used to reduce the resolution time and identify permanent solutions, reducing the number of recurring incidents. This results in less downtime and less disruption to UCSF’s critical systems.

2.4.1 Benefits Overview to the Service Delivery Organizations

• Better first-time fix at the Service Desk
• Departments can show added value to the organization
• Reduced workload for staff and Service Desk (incident volume reduction)
• Better alignment between departments
• Improved work environment for staff
• More empowered staff
• Improved prioritization of effort
• Better use of resources
• More control over services provided

2.4.2 Benefits Overview to the Customer Organizations

• Improved quality of services
• Higher service availability
• Improved user productivity

Additional benefits realized from adopting Problem Management are detailed in the following subsections.

2.4.3 Risk Reduction Benefit

• Problem Management reduces incidents, leading to more reliable and higher quality IT services for users.

2.4.4 Cost Reduction Benefit

• Reduction in the number of incidents leads to a more efficient use of staff time as well as decreased downtime experienced by end-users.

2.4.5 Service Quality Improvement Benefit

• Problem Management helps the UCSF IT Enterprise organization to meet customer expectations for services and achieve client satisfaction.
• By understanding existing problems, known errors and corrective actions, the Service Desk has an enhanced ability to address incidents at the first point of contact.
• Problem Management helps generate a cycle of increasing IT service quality.

2.4.6 Improved Utilization of IT Staff Benefit

• Service Desk resources handle calls more efficiently because they have access to a knowledge database of known errors and corrective actions.
• Consolidating problems, known errors and corrective action information facilitates organizational learning.
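Sections 2.4.5 and 2.4.6 describe the Service Desk consulting a knowledge database of known errors and corrective actions at first contact. The Python sketch below shows one way such a lookup might work. Everything in it is an assumption made for illustration: the `KnownError` record layout, the keyword-overlap matching, and the sample entries are hypothetical, and this document does not specify how the database is structured or searched.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class KnownError:
    # Hypothetical fields; the source does not define a record layout.
    error_id: str
    symptoms: frozenset[str]
    corrective_action: str


def find_known_errors(
    incident_symptoms: set[str], known_errors: list[KnownError]
) -> list[KnownError]:
    """Return known errors sharing at least one symptom keyword, best match first."""
    scored = [(len(ke.symptoms & incident_symptoms), ke) for ke in known_errors]
    matches = [pair for pair in scored if pair[0] > 0]
    # sorted() is stable, so entries with equal scores keep their original order.
    matches = sorted(matches, key=lambda pair: pair[0], reverse=True)
    return [ke for _, ke in matches]


# Invented sample data, for demonstration only.
EXAMPLE_DB = [
    KnownError("KE-1", frozenset({"vpn", "timeout"}), "Restart the VPN client"),
    KnownError("KE-2", frozenset({"email", "timeout"}), "Clear the mail cache"),
]

for hit in find_known_errors({"vpn", "timeout"}, EXAMPLE_DB):
    print(hit.error_id, "->", hit.corrective_action)
```

An incident with no overlapping symptoms returns an empty list, which in the terms of this document would be a candidate for a new Problem record rather than a match to a known error.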

2.4.7 Opportunity costs of NOT adopting a formal Problem Management Process

• Interruptions will result in unsatisfied clients and loss of confidence in the IT Enterprise organization.
• Inefficient use of support resources as senior resources spend their efforts on reacting to incidents rather than proactively managing the delivery and support of services.
• Reduced employee motivation as staff repeatedly address incidents with similar characteristics.

2.5 Process Definition

Problem Management includes the activities required to diagnose the root cause of incidents and to determine the resolution to those problems. It is also responsible for ensuring that the resolution is implemented through the appropriate control procedures.

The Problem Management process will be based on ITIL best practices to ensure the controlled handling, monitoring and effective closure of Problems within the UCSF IT Enterprise organization. This will be achieved by using a combination of activities that are designed in line with ITIL best practices.

Although the process is supported by a Problem Manager, other resources and departments are involved in the Problem Management Process.

2.6 Objectives

The primary objectives of Problem Management are to prevent problems and resulting incidents from happening, to eliminate recurring incidents and to minimize the impact of incidents that cannot be prevented. This leads to increased service availability and quality.

Problem Management is focused on implementing the appropriate corrective actions to address problems that negatively impact IT services. Problem Management seeks to implement cost effective, permanent solutions to eliminate the root cause of incidents, thereby preventing reoccurrence. Problem Management differs from Incident Management, which focuses on IT service restoration and often uses temporary workarounds to quickly restore services.

There are two approaches to Problem Management, proactive and reactive:

• Reactive Problem Management identifies problems based upon review of multiple events (incidents) that exhibit common symptoms or in response to a single incident with significant impact.
• Proactive Problem Management identifies problems by reviewing incident trends and non-incident data to predict that an incident is likely to (re-)occur.

The basic steps in Problem Management include:

• Detection of problems via analysis of incident data, problem data, operational data, release notes, Problem Management database and capacity or availability reports.
• Logging, classification and prioritization of confirmed problems into the Problem Management database.
• Efficient routing of classified and prioritized Problems for appropriate action.
• Determination of the root cause of the problems using industry standard techniques such as Kepner-Tregoe, Ishikawa Diagrams, Pain Value Analysis, Brainstorming, Technical Observation Post and Pareto Analysis.
• Logging and classification of known errors identified by either root cause analysis or information from other sources.
• Determination of alternative corrective actions to resolve the known errors.
• Implementation of the appropriate corrective action through Change Management.
• Provision of accurate and visible Problem status reporting.
• Assurance that Problem resolutions meet the SLA requirements for the customer organizations.

2.7 Definitions

2.7.1 Impact

Impact is determined by how many personnel or functions are affected. There are three grades of impact:

• 3 - Low – One or two personnel. Service is degraded but still operating within SLA specifications.
• 2 - Medium – Multiple personnel in one physical location. Service is degraded and still functional but not operating within SLA specifications. It appears the cause of the Problem falls across multiple service provider groups.
• 1 - High – All users of a specific service. Personnel from multiple agencies are affected. Public facing service is unavailable.

The impact of the incidents associated with a problem will be used in determining the priority for resolution.

2.7.2 Incident

An incident is an unplanned interruption to an IT Service or reduction in the Quality of an IT Service. Failure of any Item, software or hardware, used in the support of a system that has not yet affected service is also an Incident. For example, the failure of one component of a redundant high availability configuration is an incident even though it does not interrupt service.

An incident occurs when the operational status of a Production item changes from working to failing or about to fail, resulting in a condition in which the item is not functioning as it was designed or implemented. The resolution for an incident involves implementing a repair to restore the item to its original state.

A design flaw does not create an incident. If the product is working as designed, even though the design is not correct, the correction needs to take the form of a service request to modify the design. The service request may be expedited based upon the need, but it is still a modification, not a repair.

2.7.3 Knowledge Base

A database that contains information on how to fulfill requests and resolve incidents using previously proven methods / scripts.

2.7.4 Known Error

A Known Error is a problem that has an identified root cause and for which a workaround or (temporary) solution has been identified. This term also describes a fault in the infrastructure that can be attributed to one or more faulty CIs (Configuration Items) in the Infrastructure and causes, or may cause, one or more incidents for which a workaround and/or resolution is identified.

2.7.5 Proactive Problem Management

Proactive Problem Management is one of two important Problem Management processes. It is used to detect and prevent future problems/incidents. Proactive Problem Management includes the identification of trends or potential weaknesses. Proactive Problem Management is performed by the Service Operations group.

2.7.6 Problem

A Problem is an undesirable situation, indicating the unknown root cause of one or more existing or potential incidents. A problem is the underlying cause of an incident and can be identified in the following ways:

• It is identified as soon as an incident occurs that cannot be matched to existing or recorded problems for which a root cause is to be sought.
• It is identified as a result of multiple Incidents that exhibit common symptoms.
• It is identified from