Data Warehousing Fundamentals: A Comprehensive Guide for IT Professionals. Paulraj PonniahCopyright 2001 John Wiley & Sons, Inc.ISBNs: 0-471-41254-6 (Hardback); 0-471-22162-7 (Electronic)DATA WAREHOUSINGFUNDAMENTALS

DATA WAREHOUSINGFUNDAMENTALSA Comprehensive Guide forIT ProfessionalsPAULRAJ PONNIAHA Wiley-Interscience PublicationJOHN WILEY & SONS, INC.New York / Chichester / Weinheim / Brisbane / Singapore / Toronto

Designations used by companies to distinguish their products are often claimed as trademarks. In all instanceswhere John Wiley & Sons, Inc., is aware of a claim, the product names appear in initial capital or ALL CAPITALLETTERS. Readers, however, should contact the appropriate companies for more complete information regardingtrademarks and registration.Copyright 2001 by John Wiley & Sons, Inc. All rights reserved.No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronicor mechanical, including uploading, downloading, printing, decompiling, recording or otherwise, except as permitted underSections 107 or 108 of the 1976 United States Copyright Act, without the prior written permission of the Publisher. Requests tothe Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 605 Third Avenue,New York, NY 10158-0012, (212) 850-6011, fax (212) 850-6008, E-Mail: PERMREQ @ WILEY.COM.This publication is designed to provide accurate and authoritative information in regard to thesubject matter covered. It is sold with the understanding that the publisher is not engaged inrendering professional services. If professional advice or other expert assistance is required, theservices of a competent professional person should be sought.ISBN 0-471-22162-7This title is also available in print as ISBN 0-471-41254-6.For more information about Wiley products, visit our web site at

ToVimala, my loving wifeand toJoseph, David, and Shobi,my dear children

CONTENTSForewordxxiPrefacexxiiiPart 1 OVERVIEW AND CONCEPTS1111111111111111111The Compelling Need for Data WarehousingChapter Objectives 1Escalating Need for Strategic Information 2The Information Crisis 3Technology Trends 4Opportunities and Risks 5Failures of Past Decision-Support Systems 7History of Decision-Support Systems 8Inability to Provide Information 9Operational Versus Decision-Support Systems 9Making the Wheels of Business Turn 10Watching the Wheels of Business Turn 10Different Scope, Different Purposes 10Data Warehousing—The Only Viable Solution 12A New Type of System Environment 12Processing Requirements in the New Environment 12Business Intelligence at the Data Warehouse 12Data Warehouse Defined 13A Simple Concept for Information Delivery 141vii

viiiCONTENTS1An Environment, Not a Product 141A Blend of Many Technologies 141 Chapter Summary 151 Review Questions 161 Exercises 162 Data Warehouse: The Building Blocks11111111111111111111111119Chapter Objectives 19Defining Features 20Subject-Oriented Data 20Integrated Data 21Time-Variant Data 22Nonvolatile Data 23Data Granularity 23Data Warehouses and Data Marts 24How are They Different? 251Top-Down Versus Bottom-Up Approach 26A Practical Approach 27Overview of the Components 28Source Data Component 28Data Staging Component 31Data Storage Component 33Information Delivery Component 34Metadata Component 35Management and Control Component 35Metadata in the Data Warehouse 35Types of Metadata 36Special Significance 36Chapter Summary 36Review Questions 37Exercises 373 Trends in Data Warehousing1 Chapter Objectives 391 Continued Growth in Data Warehousing 401Data Warehousing is Becoming Mainstream 401Data Warehouse Expansion 411Vendor Solutions and Products 421 Significant Trends 431Multiple Data Types 441Data Visualization 461Parallel Processing 4839

CONTENTS11111111111111111111ixQuery Tools 49Browser Tools 50Data Fusion 50Multidimensional Analysis 51Agent Technology 51Syndicated Data 52Data Warehousing and ERP 52Data Warehousing and KM 53Data Warehousing and CRM 54Active Data Warehousing 56Emergence of Standards 56Metadata 57OLAP 57Web-Enabled Data Warehouse 58The Warehouse to the Web 59The Web to the Warehouse 59The Web-Enabled Configuration 60Chapter Summary 61Review Questions 61Exercises 62Part 2 PLANNING AND REQUIREMENTS4 Planning and Project Management1111111111111111111Chapter Objectives 63Planning Your Data Warehouse 64Key Issues 64Business Requirements, Not Technology 66Top Management Support 67Justifying Your Data Warehouse 67The Overall Plan 68The Data Warehouse Project 69How is it Different? 70Assessment of Readiness 71The Life-Cycle Approach 71The Development Phases 73The Project Team 74Organizing the Project Team 75Roles and Responsibilities 75Skills and Experience Levels 77User Participation 78Project Management Considerations 80Guiding Principles 8163

xCONTENTS1Warning Signs 821Success Factors 821Anatomy of a Successful Project 831Adopt a Practical Approach 841 Chapter Summary 861 Review Questions 861 Exercises 875 Defining the Business Requirements11111111111111111111111189Chapter Objectives 89Dimensional Analysis 90Usage of Information Unpredictable 90Dimensional Nature of Business Data 90Examples of Business Dimensions 92Information Packages—A New Concept 93Requirements Not Fully Determinate 93Business Dimensions 95Dimension Hierarchies/Categories 95Key Business Metrics or Facts 96Requirements Gathering Methods 97Interview Techniques 99Adapting the JAD Methodology 102Review of Existing Documentation 103Requirements Definition: Scope and Content 104Data Sources 105Data Transformation 105Data Storage 105Information Delivery 105Information Package Diagrams 106Requirements Definition Document Outline 106Chapter Summary 106Review Questions 107Exercises 1076 Requirements as the Driving Force for Data Warehousing1 Chapter Objectives 1091 Data Design 1101Structure for Business Dimensions 1121Structure for Key Measurements 1121Levels of Detail 1131 The Architectural Plan 1131Composition of the Components 114109

CONTENTS11111111111111xiSpecial Considerations 115Tools and Products 118Data Storage Specifications 119DBMS Selection 120Storage Sizing 120Information Delivery Strategy 121Queries and Reports 122Types of Analysis 123Information Distribution 1231Decision Support Applications 123Growth and Expansion 123Chapter Summary 124Review Questions 124Exercises 125Part 3 ARCHITECTURE AND INFRASTRUCTURE7 The Architectural Components11111111111111111111127Chapter Objectives 127Understanding Data Warehouse Architecture 127Architecture: Definitions 127Architecture in Three Major Areas 128Distinguishing Characteristics 129Different Objectives and Scope 130Data Content 130Complex Analysis and Quick Response 131Flexible and Dynamic 131Metadata-driven 132Architectural Framework 132Architecture Supporting Flow of Data 132The Management and Control Module 133Technical Architecture 134Data Acquisition 135Data Storage 138Information Delivery 140Chapter Summary 142Review Questions 142Exercises 1438 Infrastructure as the Foundation for Data Warehousing1 Chapter Objectives 1451 Infrastructure Supporting Architecture 145145

xii11111111111111111111111CONTENTSOperational Infrastructure 147Physical Infrastructure 147Hardware and Operating Systems 148Platform Options 150Server Hardware 158Database Software 164Parallel Processing Options 164Selection of the DBMS 166Collection of Tools 167Architecture First, Then Tools 168Data Modeling 169Data Extraction 169Data Transformation 169Data Loading 169Data Quality 169Queries and Reports 170Online Analytical Processing (OLAP) 170Alert Systems 170Middleware and Connectivity 170Data Warehouse Management 170Chapter Summary 170Review Questions 171Exercises 1719 The Significant Role of Metadata11111111111111111Chapter Objectives 173Why Metadata is Important 173A Critical Need in the Data Warehouse 175Why Metadata is Vital for End-Users 177Why Metadata is Essential for IT 179Automation of Warehousing Tasks 181Establishing the Context of Information 183Metadata Types by Functional Areas 183Data Acquisition 184Data Storage 186Information Delivery 186Business Metadata 187Content Overview 188Examples of Business Metadata 188Content Highlights 189Who Benefits? 190Technical Metadata 190173

CONTENTSxiii1 2 Content Overview 1901 2 Examples of Technical Metadata 1911 2 Content Highlights 1921 2 Who Benefits? 19212 How to Provide Metadata 1931 2 Metadata Requirements 1931 2 Sources of Metadata 1941 2 Challenges for Metadata Management 1961 2 Metadata Repository 1961 2 Metadata Integration and Standards 1981 2 Implementation Options 1991 2 Chapter Summary 2001 2 Review Questions 2011 2 Exercises 201Part 4 DATA DESIGN AND DATA PREPARATION10 Principles of Dimensional Modeling1111111111111111111111111Chapter Objectives 2031From Requirements to Data Design 2032 Design Decisions 2042 Dimensional Modeling Basics 2042 E-R Modeling Versus Dimensional Modeling 2092 Use of CASE Tools 2091The STAR Schema 2102 Review of a Simple STAR Schema 2102 Inside a Dimension Table 2122 Inside the Fact Table 2142 The Factless Fact Table 2162 Data Granularity 2171STAR Schema Keys 2182 Primary Keys 2182 Surrogate Keys 2192 Foreign Keys 2191Advantages of the STAR Schema 2202 Easy for Users to Understand 2202 Optimizes Navigation 2212 Most Suitable for Query Processing 2222 STARjoin and STARindex 2231Chapter Summary 2231Review Questions 2241Exercises 224203

xivCONTENTS11 Dimensional Modeling: Advanced Topics111111111111111111111111111111Chapter Objectives 2251Updates to the Dimension Tables 2262 Slowly Changing Dimensions 2262 Type 1 Changes: Correction of Errors 2272 Type 2 Changes: Preservation of History 2282 Type 3 Changes: Tentative Soft Revisions 2301Miscellaneous Dimensions 2312 Large Dimensions 2312 Rapidly Changing Dimensions 2332 Junk Dimensions 2351The Snowflake Schema 2352 Options to Normalize 2352 Advantages and Disadvantages 2382 When to Snowflake 2381Aggregate Fact Tables 2392 Fact Table Sizes 2412 Need for Aggregates 2422 Aggregating Fact Tables 2432 Aggregation Options 2471Families of STARS 2492 Snapshot and Transaction Tables 2502 Core and Custom Tables 2512 Supporting Enterprise Value Chain or Value Circle 2512 Conforming Dimensions 2532 Standardizing Facts 2542 Summary of Family of STARS 2541Chapter Summary 2551Review Questions 2551Exercises 25612 Data Extraction, Transformation, and Loading11111111112251Chapter Objectives 2571ETL Overview 2582 Most Important and Most Challenging 2592 Time-consuming and Arduous 2602 ETL Requirements and Steps 2602 Key Factors 2611Data Extraction 2622 Source Identification 2632 Data Extraction Techniques 2632 Evaluation of the Techniques 270257

CONTENTS1111111111111111111Data Transformation 2712 Data Transformation: Basic Tasks 2722 Major Transformation Types 2732 Data Integration and Consolidation 2752 Transformation for Dimension Attributes 2772 How to Implement Transformation 2771Data Loading 2792 Applying Data: Techniques and Processes 2802 Data Refresh Versus Update 2822 Procedure for Dimension Tables 2832 Fact Tables: History and Incremental Loads 2842 ETL Summary 2852 ETL Tool Options 2852 Reemphasizing ETL Metadata 2862 ETL Summary and Approach 2871Chapter Summary 2881Review Questions 2881Exercises 28913 Data Quality: A Key to Success1111111111111111111111xv1Chapter Objectives 2911Why is Data Quality Critical? 2922 What is Data Quality? 2922 Benefits of Improved Data Quality 2952 Types of Data Quality Problems 2961Data Quality Challenges 2992 Sources of Data Pollution 2992 Validation of Names and Addresses 3012 Costs of Poor Data Quality 3021Data Quality Tools 3032 Categories of Data Cleansing Tools 3032 Error Discovery Features 3032 Data Correction Features 3032 The DBMS for Quality Control 3041Data Quality Initiative 3042 Data Cleansing Decisions 3052 Who Should be Responsible? 3072 The Purification Process 3092 Practical Tips on Data Quality 3111Chapter Summary 3111Review Questions 3121Exercises 312291

xviCONTENTSPart 5 INFORMATION ACCESS AND DELIVERY14 Matching Information to the Classes of Users111111111111111111111111Chapter Objectives 3151Information from the Data Warehouse 3162 Data Warehouse Versus Operational Systems 3162 Information Potential 3182 User-Information Interface 3212 Industry Applications 3231Who Will Use the Information? 3232 Classes of Users 3232 What They Need 3262 How to Provide Information 3291Information Delivery 3292 Queries 3312 Reports 3322 Analysis 3332 Applications 3341Information Delivery Tools 3352 The Desktop Environment 3352 Methodology for Tool Selection 3352 Tool Selection Criteria 3382 Information Delivery Framework 3401Chapter Summary 3411Review Questions 3411Exercises 34115 OLAP in the Data Warehouse111111111111113151Chapter Objectives 3431Demand for Online Analytical Processing 3442 Need for Multidimensional Analysis 3442 Fast Access and Powerful Calculations 3452 Limitations of Other Analysis Methods 3472 OLAP is the Answer 3492 OLAP Definitions and Rules 3492 OLAP Characteristics 3521Major Features and Functions 3532 General Features 3532 Dimensional Analysis 3532 What are Hypercubes? 3572 Drill-Down and Roll-Up 3602 Slice-and-Dice or Rotation 362343

CONTENTS1111111111111112 Uses and Benefits 3631OLAP Models 3632 Overview of Variations 3642 The MOLAP Model 3652 The ROLAP Model 3662 ROLAP Versus MOLAP 3671OLAP Implementation Considerations 3682 Data Design and Preparation 3682 Administration and Performance 3702 OLAP Platforms 3722 OLAP Tools and Products 3732 Implementation Steps 3741Chapter Summary 3741Review Questions 3741Exercises 37516 Data Warehousing and the Web11111111111111111111111xvii1Chapter Objectives 3771Web-Enabled Data Warehouse 3782 Why the Web? 3782 Convergence of Technologies 3802 Adapting the Data Warehouse for the Web 3812 The Web as a Data Source 3821Web-Based Information Delivery 3832 Expanded Usage 3832 New Information Strategies 3