ETL Test Automation for DWH/BI, DataIntegration and Big Data Projects:Why It’s Imperative and How to Get StartedBy Wayne Yaddow, Data Quality Analyst

Tricentis ETL Test Automation 1Gartner recently stated that between 70 and 80 percent of business intelligence initiatives initially end upfailing, but many resume due to their importance to the organization. As businesses create (and need) moredata than ever before, the sheer number of BI failures threatens to grow exponentially. This could have a farreaching impact on the underlying digital transformation initiatives that these BI projects are designed toenable.Given that companies are releasing new applications faster than ever—some releasing updates on demandand multiple times per day—too many organizations are using manual ETL test processes and the wrong toolsto manage critical parts of releases for highly visible, often customer-facing applications. That translates intorisk: risk to customer loyalty, the brand, confidential data, and critical business decisions.This paper explores how applying DevOps-style test automation to DWH/BI and other data integration projectscan guarantee a high level of data quality—instilling the trust that is essential for the success of BI projects andthe digital transformation initiatives that are ultimately driving them.Taking a DevOps Approach to BI/DWH TestingDevOps, with its focus on tool automation across the entire development lifecycle, addresses an enormouschallenge for “big data” and DWH/BI developers. Many of today’s big data and DWH/BI projects are alreadyleveraging (or actively planning to adopt) Agile and DevOps processes—but not so much for testing. DWH/BIprojects, in general, are not currently using automated testing tools to the extent that is needed for projectsuccesses. Perhaps this is because they believe the required testing functions are not commercially available,or are too complex and expensive to develop in-house.When thinking about what needs to be tested to ensure data integrity, it’s important to consider that BI is morethan just data warehouses (DWH) and ETL (Extract Transform Load). Services between the ETL processes, aswell as the middleware and dashboard visualizations, also come under the purview of BI. Messages andnegotiating pacts between these layers is complex and requires much coordination and testing.DevOps helps facilitate this with constant deployments and testing. Implementing a DevOps testing approachto DWH/BI means automating the testing of different source and target data sets to keep data current. Thiscan be tremendously beneficial when handling many diverse data sources and volumes—for some projects,hundreds. Your team will be able to detect errors before they threaten BI applications in production. Moreover,you will have more time to fix issues before reaching 2018 Tricentis GmbH. All Rights Reservedv

Tricentis ETL Test Automation 2Why Test Automation?Continuous quality is a systematic approach to achieving the quality goals of development and the businessesit supports. In the 2018 Magic Quadrant for Software Test Automation, Gartner states: “Test automation toolsare essential elements of a DevOps toolchain and enablers for achieving the continuous quality approachrequired for successful DevOps.”However, as for any IT project, repeated (“regression”) testing is important for guaranteeing a high level of (data)quality. The more we test, the more bugs will be resolved before going live. This is especially crucial for businessintelligence projects. When the users can’t trust the data, it’s likely that the BI solution itself will not betrusted and fail.As mentioned earlier, ETL testing is primarily conducted manually, which makes it a very labor intensive anderror prone process. Automating ETL tests allows frequent smoke and regression testing without much userintervention and supports automated testing on older code after each new database build. Automation cannot only help execute tests; it can also assist with designing and managing them.The decision to implement automated tools for ETL testing depends on a budget that supports additionalspending to meet advanced testing requirements. It is important to remember that test tools built andmaintained in house are better than no test automation at all. In the end, test automation will save much time.Additionally, business users will appreciate the quality of BI deliverables and accept the data from the DataPlatform solution as the "single version of the truth.” 2018 Tricentis GmbH. All Rights Reservedv

Tricentis ETL Test Automation 3Figure 1: A sampling of tests and validations that should be considered for most DWH/BI projectsPlanning for the ETL Test Automation ProcessBelow are some of the most highly-recommended test automation planning steps for DWH/BI projects. As withall projects, the decisions made during the planning stages of a test automation project set the stage forsuccess—or for failure. For this reason, it’s suggested that you take the necessary time to set goals, analyzecurrent processes, and build the right implementation team prior to launching the test automation project.1. Analyze your current testing process—from unit testing, to component testing, to data quality testingand 2018 Tricentis GmbH. All Rights Reservedv

Tricentis ETL Test Automation 42. Define the stakeholders and IT team.3. Identify and prepare several test scenarios for test automation.4. Research and select two or three top commercial or open source ETL and data quality automation toolsfor an in-depth evaluation.5. Conduct proof of concept (PoC) exercises—preferably with the collaboration of tool vendors. Vendorshelp ensure you’re really understanding the potential of the tool. They can also help you gain the mostaccurate assessment in the shortest time possible.6. Implement the selected automation tools.7. Make time for training and the learning curve.8. Begin automating already-documented test cases.9. Plan to review your process and results, then adjust as necessary.Understanding Which ETL and DWH/BI Verificationsare Best for AutomationRegarding test scenarios for test automation, evaluate your scenarios and determine which are the bestcandidates for automation based on risk and value. Which types of defects would cause you to stop anintegration or deployment? Which types of tests exercise critical core functionality? Which tests cover areas ofthe application that have historically been known to fail? Which tests are providing information that is notalready covered by other tests in the pipeline?Common preferences for manual DWH testing Exploratory Testing: This type of testing requires the tester’s knowledge, experience, analytical/logicalskills, creativity, and intuition. Human skills are needed to execute the testing process in this scenario. Ad-hoc Testing: In this scenario, there is no specific approach. It is often an unplanned method of testingwhere the understanding and insight of the tester is the important factor.Common preferences for automated DWH testing Source to Target Data Reconciliation Testing (including transformation testing, regression testing,smoke testing): Here, automated testing is suitable because of frequent code changes and the ability torun the regression assessments in a timely manner. Repeated Execution: Since this testing requires the repeated execution of a task, it is best 2018 Tricentis GmbH. All Rights Reservedv

Tricentis ETL Test Automation 5 Load Testing: Load testing is another type of testing where automation is essential for efficiency. Performance Testing: Likewise, testing which involves the simulation of thousands of concurrent usersrequires automation. End-to-End Testing: Data testing can be time-consuming because of the variety of stages, technologiesand vast volume of data involved. Each phase of ETL testing requires different strategies and types oftesting—for example, one-to-one comparisons, validations of migrated data, validations oftransformation rules, reconciliations (e.g., sources to targets), data quality checks, and front end testingof BI reports.Table 1 below lists most types of test that are often considered for test automation and test automation toolimplementations (commercial, open-source, and in-house tools). Utilizing a list of test scenarios such as thiscan be a good start on your road to DHW/BI test automation.Table 1: Test scenarios and test cases frequently considered for automated testingTEST SCENARIOSTEST CASESMETADATA VALIDATIONValidating the source and target table structure as per the mapping andmetadata documents Data types are validated in the source and the target systems.The length of data types in the source and the target system should be thesame.Data field types and their format are verified to be the same in the sourceand the target system.Validating the column names in the target system.VALIDATE MAPPINGDOCUMENTSValidating mapping and metadata documents to ensure all the information hasbeen implemented. The mapping document should have a change log, maintaindata types, length, transformation rules, etc.VALIDATE CONSTRAINTSValidating all column and transformation constraints and ensuring that they areapplied on the expected tables.DATA CONSISTENCY &INTEGRITY CHECKSChecking the misuse of integrity constraints like foreign keys – no orphan FK’s.www.tricentis.comThe length and data type of an attribute may vary in different tables, althoughtheir definition remains the same at the semantic layer. 2018 Tricentis GmbH. All Rights Reservedv

Tricentis ETL Test Automation 6TEST SCENARIOSTEST CASESDATA COMPLETENESSVALIDATIONVerifying that all data is loaded to the target system from the source system. Record counts in the source and the target data. Boundary value analysis (tests of min/max, no truncations). Validating the unique values of primary keys.Verifying values of data in the target system.DATA CORRECTNESSVALIDATIONDATATRANSFORMATIONSAPPLIED ACCORDING TOBUSINESS RULES Misspelled or inaccurate numeric data in target table. Distinct values in columns (not unique data) is stored when you disableintegrity constraint at the time of import.Creating a matrix of scenarios for input values and expected results and thenvalidating with end users. Validating parent-child relationship in the data by creating scenarios. Using data profiling to verify the range of values in each field. Validating if the data types in the warehouse are same as mentioned in thedata model. Default values, data trimming, etc. Verify source table joins for aggregations, etc.Performing number check, date check, precision check, data check, null checks,etc. on both source and target data.DATA QUALITYVALIDATIONExample - Date format should be the same for all the values per the columndefinitionsDUPLICATE Validating duplicate values in target system columns and rows when datais loaded from multiple columns in sources Validating primary keys and other columns if there are any duplicate valuesas per the business requirement. Verifying that multiple columns specified as unique key can be groupedwithout resulting in duplicate records. 2018 Tricentis GmbH. All Rights Reservedv

Tricentis ETL Test Automation 7TEST SCENARIOSTEST CASESDATE VALIDATIONCHECKSValidating the date field for all defined actions performed in ETL processes From Dates not greater than To Dates. Min and max values within bounds ( 01/01/1970, 2099-12-31). Date and time values as specified. Date values contain no junk values or null values.CDC, SCD, FACT TABLEUPDATESVerifying that all changed data (CDC) is captured from sources and appliedaccording to changing dimensions (SCD) and fact table specifications inrequirements.COMPARE STAGING ANDDW TABLESVerifying that target DWH tables are precisely the same as staging wherespecified and DWH tables are correctly loaded where differences are specifiedbetween them.DROPPED RECORDSValidating that no records are dropped where they should not be between allsources and targets. Verify that records with error status ‘E’ are dropped andthat any records in the same or related tables that with foreign keys to thesedropped records are processed according to specifications.EXTRA RECORDS,ADDITIONAL COLUMNSIN TARGETVerifying that extraneous data that was not meant to be loaded was not actuallyloaded.UNIQUE KEYEXCLUSIVITYVerifying that all columns specified as unique key are unique among all records.RECORD COUNTSVerifying record counts as correct when compared with source records andwhen compared from one DB load to anotherEXPLORATORY TESTINGProviding distinct values from all columns to support input exploratory testing.MAINTAIN SOURCETABLE ID’S THROUGH TODWProviding functions that allow tracking / verification of surrogate or native or IDkeys from source to final target.ETL LOOKUPPROCESSINGVerifying that ETL “lookups” were processed correctlyAGGREGATED VALUESVerifying aggregation of values from sources to 2018 Tricentis GmbH. All Rights Reservedv

Tricentis ETL Test Automation 8Key Takeaways Many projects’ DWH/BI teams have found that it’s possible to succeed with automated testing. Automated testing will not replace all manual unit, component, or end-to-end testing. However, it willensure that the more costly manual work is focused on high-risk, high-value activities— and, in theprocess, complement the QA process. Creating automated DWH/BI tests is well worth the upfront effort, especially in the data warehousetesting phases. Automated tests can be run hundreds of times at modest cost with almost no physicaltime constraints. We know that testing takes time. We know that testing costs money. If planning and other upfrontautomation efforts reduce time and costs, that’s undeniably beneficial for your organiz