
Achieving Data Warehouse Nirvana

Date Published: 1 July 2010

Would one buy a house when the stability of the foundation is uncertain? Would one make a payment if the accuracy of the bill is in question? If the answer is no, then why would any organization settle for making business decisions based on inaccurate and inconsistent data warehouse information? A number of studies [1, 2, 3] show that much of the data warehouse information available to business users is not accurate, complete or timely. Despite significant investment in data warehouse technologies and efforts to ensure quality, the trustworthiness of data warehouse information at best remains questionable [4, 5]. Current approaches to restoring trust in data warehouse information are often heroic efforts by the individuals responsible for the data warehouse and include:

  • Manual or semiautomated balancing, tracking and reconciliation to prove accuracy
  • Ad hoc queries of data sources to support “audit needs”
  • Extensive research and remediation to identify, diagnose and correct issues

These approaches provide short-term respite but are not sustainable in the long run. Manual processes drive up labor costs, and reruns triggered by errors identified late in the process drive up processing costs, increasing ongoing operational expense. The cumbersome and costly processes for supporting audit needs also create organizational stress. Many data warehouse projects are ultimately abandoned because of the high cost of these quality efforts [6].

While standardized tools, such as those for extraction, transformation and loading (ETL) and data quality processes, solve part of the problem, there is an urgent need to adopt a systematic approach to establishing trust in data warehouse information. This article outlines a framework for ensuring the integrity of data warehouse information through end-to-end information controls.

Root Causes of Information Quality Issues

While information quality issues can be attributed to several factors, the following are the major causes of information errors within data warehouses:

  • Changes in the source systems—Changes in the source systems often require code changes in the ETL process. For example, the ETL process corresponding to the credit risk data warehouse in a particular financial institution has approximately 25 releases each quarter. Even with appropriate quality assurance processes, there is always room for error. The following list outlines the types of potential errors that can occur because of changes in the ETL processes:
    – Extraction logic excludes certain types of data that were not tested.
    – Transformation logic may aggregate two different types of data (e.g., car loan and boat loan) into a single category (e.g., car loan). In some cases, transformation logic may exclude certain types of data, resulting in incomplete records in the data warehouse.
    – Similar issues are also observed with the loading process.
  • Process failures—Current processes may fail due to system errors or transformation errors, resulting in incomplete data loading. System errors may include abends caused by an unavailable source system or extract, or by an incorrectly formatted source file. Transformation errors typically result from data arriving in unexpected formats.
  • Changes/updates in the reference data—Outdated, incomplete or incorrect reference data will lead to errors in the data warehouse information. For example, errors in the sales commission rate table may result in erroneous calculation of the commission amount.
  • Data quality issues with the source system—The source system’s data may be incomplete or inconsistent. For example, a customer record in the source system may have a missing zip code, or a sales-related source system may abbreviate product names in its database. Incompleteness and inconsistency in source system data lead directly to quality issues in the data warehouse (a minimal completeness check of this kind is sketched after this list).
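
To make the last point concrete, the sketch below shows a pre-load completeness check of the kind a data quality process might run against a source extract. The field names, the CSV layout and the file name are assumptions made only for illustration; they do not correspond to any specific source system described in this article.

import csv

# Required fields for a customer extract; illustrative only.
REQUIRED_FIELDS = ["customer_id", "zip_code", "product_name"]

def find_incomplete_records(path):
    """Return (row_number, missing_fields) pairs for records that fail the check."""
    issues = []
    with open(path, newline="") as f:
        # Row 1 is the header, so data rows start at line 2.
        for row_num, row in enumerate(csv.DictReader(f), start=2):
            missing = [field for field in REQUIRED_FIELDS
                       if not (row.get(field) or "").strip()]
            if missing:
                issues.append((row_num, missing))
    return issues

if __name__ == "__main__":
    for row_num, missing in find_incomplete_records("customer_extract.csv"):
        print(f"Row {row_num}: missing {', '.join(missing)}")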

Current Approach and Cost of Quality

The current focus in most data warehouse initiatives is to use ETL tools to standardize the data transfer process and to use data quality solutions to detect and correct incomplete and inconsistent data. While these efforts result in significant improvements, data warehouse teams still rely on a number of manual or semiautomated processes to balance and reconcile the data warehouse information with the source system information. Some of the techniques currently used by various organizations are:

  • Developing an independent script that compares the record count and amount information from the source system with the record count and amount information from the data warehouse. These scripts are often executed either on an ad hoc basis or scheduled to run after the data load is complete (a minimal sketch of such a comparison script follows this list).
  • Creating a control table and then populating the control table with the totals from the data warehouse and the source system as part of the ETL process. Checks are performed after the completion of the load process.
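
As an illustration of the first technique, the following sketch compares record counts and amount totals between a source extract table and the corresponding warehouse table. The table names, column names and use of SQLite through Python's standard library are assumptions made to keep the example self-contained; in practice the queries would run against the actual source and warehouse databases.

import sqlite3

def fetch_totals(conn, table, amount_col):
    """Return (record_count, total_amount) for one table."""
    cur = conn.execute(
        f"SELECT COUNT(*), COALESCE(SUM({amount_col}), 0) FROM {table}")
    return cur.fetchone()

def balance_check(source_db, warehouse_db):
    # Compare the source extract against the warehouse fact table.
    src = fetch_totals(sqlite3.connect(source_db), "loan_transactions", "amount")
    dwh = fetch_totals(sqlite3.connect(warehouse_db), "fact_loan", "loan_amount")
    if src == dwh:
        print(f"BALANCED: {src[0]} records, total {src[1]:.2f}")
    else:
        print(f"OUT OF BALANCE: source={src}, warehouse={dwh}")

if __name__ == "__main__":
    balance_check("source.db", "warehouse.db")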

While these methods are somewhat effective in detecting errors, they rely heavily on the ETL process, which is often the source of the error. More importantly, these approaches break down when transactions from source systems are either split into several transactions or combined into a single transaction; such scenarios require advanced logic for balancing the information between the source system and the data warehouse. Moreover, the inability to reconcile detail-level information using scripts or ETL processes prevents users from pinpointing the exact issue, resulting in significant manual research and resolution effort. Beyond the high operational costs of research and reruns, the current approach erodes the morale of the data warehouse team and the confidence of the business users.

The problem of data quality is exacerbated when the data warehouse is used for storing and reporting financial information. In this scenario, internal audit requests evidence that controls operated and documentation of how errors detected by those controls were resolved. Such requests are often met by querying a myriad of log files, e-mail chains and data warehouse tables. This increases the workload of the data warehouse team and widens the rift between the audit and data warehouse teams.

Current approaches are neither scalable nor sustainable. There is an urgent need to use automated information controls for verifying, balancing, reconciling and tracking data warehouse information. Ideally, information controls should be independent of the underlying application and should store an audit trail of the information transfer process and its validation results.

Three Pillars for Ensuring Data Warehouse Quality

Successful and cost-effective data warehouse quality initiatives in Fortune 500 organizations are founded on three critical pillars:

  • Data quality (DQ) tools—Identify, correct and standardize incomplete and inconsistent source system information prior to loading it into the data warehouse. These solutions have focused primarily on validating customer addresses and product names and often do not address quality issues with financial transactions.
  • ETL tools—Extract and transform source system information and load it into the data warehouse. Their primary focus has been standardizing and increasing the efficiency of the data transfer process.
  • Information control (IC) solutions—Verify, balance, reconcile and track information as the source system data move through the ETL process into the data warehouse. Their focus is to independently ensure the accuracy, consistency and completeness of the information at both the aggregate and the transaction level (a record-key-level reconciliation of this kind is sketched below).
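
The sketch below illustrates a transaction-level reconciliation of the kind an IC solution performs: source and warehouse records are matched by record key, and any missing, duplicated or amount-mismatched keys are reported. Summing amounts per key also covers the case in which one source transaction is split into several warehouse rows. The data structures and sample values are assumptions made to keep the example self-contained.

from collections import defaultdict

def reconcile(source_rows, warehouse_rows, tolerance=0.01):
    """Return a dict mapping record keys to a description of the discrepancy."""
    src, dwh = defaultdict(float), defaultdict(float)
    for key, amount in source_rows:      # sum per key, so split or combined
        src[key] += amount               # transactions still balance
    for key, amount in warehouse_rows:
        dwh[key] += amount

    issues = {}
    for key in set(src) | set(dwh):
        if key not in dwh:
            issues[key] = "missing from warehouse"
        elif key not in src:
            issues[key] = "not present in source"
        elif abs(src[key] - dwh[key]) > tolerance:
            issues[key] = (f"amount mismatch: source={src[key]:.2f} "
                           f"warehouse={dwh[key]:.2f}")
    return issues

# Example: one record short-loaded, one record loaded twice.
source = [("A1001", 250.00), ("A1002", 75.50), ("A1003", 10.00)]
warehouse = [("A1001", 250.00), ("A1002", 75.50), ("A1002", 75.50)]
for key, problem in sorted(reconcile(source, warehouse).items()):
    print(key, "->", problem)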

Information controls not only balance and reconcile the data before and after the load, but can also be extended beyond the data warehouse to ensure that its information stays aligned with other critical applications such as the general ledger (GL). For example, although the same journal systems may feed both the data warehouse and the GL, manual adjustments in the GL system may cause an out-of-sync condition that could be detected early if an automated information control is in place. In addition, automated information controls store an audit trail of the control actions taken and of the resolutions applied when exceptions occur. Figure 2 compares ETL and DQ tools with IC solutions along several dimensions.
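
One way to picture the audit trail aspect is sketched below: every control execution appends a record of what was checked, the outcome and any resolution notes, independently of the ETL jobs themselves. The SQLite schema, control name and sample values are illustrative assumptions, not a description of any particular IC product.

import sqlite3
from datetime import datetime, timezone

def init_audit_trail(conn):
    # One row per control execution; resolution is filled in when an
    # exception is investigated and closed.
    conn.execute("""CREATE TABLE IF NOT EXISTS control_audit (
        run_id       INTEGER PRIMARY KEY AUTOINCREMENT,
        control_name TEXT NOT NULL,
        run_at       TEXT NOT NULL,
        status       TEXT NOT NULL,
        detail       TEXT,
        resolution   TEXT)""")

def record_control_run(conn, control_name, status, detail, resolution=None):
    conn.execute(
        "INSERT INTO control_audit (control_name, run_at, status, detail, resolution) "
        "VALUES (?, ?, ?, ?, ?)",
        (control_name, datetime.now(timezone.utc).isoformat(), status, detail, resolution))
    conn.commit()

conn = sqlite3.connect("controls.db")
init_audit_trail(conn)
record_control_run(conn, "X6_GL_vs_DW", "FAIL",
                   "GL total 1,250,000.00 vs DW total 1,245,000.00",
                   resolution="Manual GL adjustment not yet loaded to DW")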

Information Controls Framework for Data Warehouse

The proposed framework recommends a minimum of six information controls to achieve the objectives of the data warehouse quality initiatives. The locations of the information controls are depicted in figure 3. The six controls are:

  1. Control X1, data warehouse to source system validation—Ensure that the data warehouse information can be balanced and reconciled with the source system. In addition to validating the number of records, controls should balance the total amount and the amounts at the record key level. The control should also verify that the data being loaded to the data warehouse are not duplicates and fall within the set thresholds (e.g., a source file on average contains 1,000 records and has a total amount of US $2.5 million, with a tolerance of +/- 10 percent). A notification should be sent if the tolerance is violated (a minimal sketch of such a threshold check follows this list).
  2. Control X2, verification between feeds that the data are accurate and complete—Ensure that related source feed information is consistent. For example, if one feed consists of credit card payment information and another feed consists of account credit information based on those payments, there needs to be a control to validate consistency between the two feeds (i.e., validate that the payment information can be reconciled with the credit information).
  3. Control X3, validation that the ETL process is accurate and complete—The control should monitor transactions and processes (e.g., source to ETL, data warehouse to data mart) and validate adherence to all process dependencies. Automated independent controls could also be used to automate ETL testing.
  4. Control X4, verification within the data warehouse that information is consistent—Many data warehouses do not enforce referential integrity, and changes in the data update process by downstream applications can result in data discrepancies. Independent controls should be used to ensure referential integrity is maintained by reconciling the relevant information.
  5. Control X5, assurance that the data balance with downstream applications or data marts—Ensure that the data warehouse information can be balanced and reconciled with downstream processes.
  6. Control X6, validation between parallel systems and the data warehouse—Data warehouse information can also reside in other systems; for example, loan information resides both in the GL and in the credit risk data warehouse. It is important to reconcile the information in the parallel system with the data warehouse information. In the absence of such a control, the loan information in financial reports generated from the GL system may become out of sync with the loan information used for estimating capital requirements under Basel II.
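
The sketch below illustrates the threshold and duplicate checks described for Control X1. The expected record count, the expected total of US $2.5 million and the +/- 10 percent tolerance come from the example above; everything else (function names, sample feed) is an assumption made for illustration.

EXPECTED_COUNT = 1_000            # average records per source file (from the example above)
EXPECTED_TOTAL = 2_500_000.00     # US $2.5 million
TOLERANCE = 0.10                  # +/- 10 percent

def within_tolerance(actual, expected, tolerance=TOLERANCE):
    return abs(actual - expected) <= expected * tolerance

def control_x1(records):
    """records: iterable of (record_key, amount) tuples from the incoming feed."""
    records = list(records)
    keys = [key for key, _ in records]
    total = sum(amount for _, amount in records)
    alerts = []
    if len(keys) != len(set(keys)):
        alerts.append("duplicate record keys detected")
    if not within_tolerance(len(keys), EXPECTED_COUNT):
        alerts.append(f"record count {len(keys)} outside tolerance of {EXPECTED_COUNT}")
    if not within_tolerance(total, EXPECTED_TOTAL):
        alerts.append(f"total amount {total:,.2f} outside tolerance of {EXPECTED_TOTAL:,.2f}")
    return alerts   # a non-empty list would trigger a notification in practice

# A deliberately undersized feed with a duplicated key.
feed = [("TXN-1", 2_400.00), ("TXN-2", 2_600.00), ("TXN-2", 2_600.00)]
for alert in control_x1(feed):
    print("ALERT:", alert)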

Conclusion

With the accelerating pace of change in source systems to support business needs, the increasing reliance on data warehouse information for critical business operations and decisions, and an expanding (and ever-changing) array of regulations and compliance requirements, the use of automated information controls is no longer optional; it is the only way to ensure information accuracy within the data warehouse and across the enterprise. Successful organizations expand the scope of information controls beyond the data warehouse by developing a companywide program for ensuring enterprise information quality. With an appropriate selection of tools and frameworks for information controls, organizations can achieve the elusive goal of higher-quality enterprise information assets.

Endnotes

1 English, Larry; Improving Data Warehouse and Business Information Quality, Wiley and Sons, USA, 2000
2 Eckerson, Wayne W.; Data Quality and the Bottom Line, TDWI research series, USA, 2001
3 Friedman, Ted; Data Quality “Firewall” Enhances Value of the Data Warehouse, Gartner Report, USA, 2004
4 Violino, Bob; “Do You Trust Your Information?,” The Information Agenda, 23 October 2008
5 Computer Sciences Corp., Technology Issues for Financial Executives, USA, 2007
6 Gupta, Sanjeev; “Why Do Data Warehouse Projects Fail?,” Information Management, 16 July 2009

Christopher Reed
is a solution consultant at Infogix Inc., leading solution consulting efforts. He works with Fortune 500 companies to assist in the creation of information control solutions throughout the enterprise. In addition to his work at Infogix, Reed was an architectural consultant at Unisys, where he consulted with customers on deploying mission-critical applications.

Yaping Wang, CISA, PMP
is a product consultant at Infogix, where she leads client service projects that provide assessment, advisory, implementation and other services in automated information control domains.

Angsuman Dutta
is the unit leader of the customer acquisition support team at Infogix. Since 2001, he has assisted numerous industry-leading enterprises in their implementation of automated information controls.