Perhaps it is time to start acknowledging the challenge of source data analysis?
“Garbage out, garbage in”
I know, I know: I have got that the wrong way round. Well, actually I did it on purpose, to illustrate a problem with BI/Data Warehousing projects and, incidentally, with many other Information Management initiatives. That problem is the challenge of source data analysis, particularly with SAP, Oracle and Salesforce packages.
I recently read a very interesting and clear blog post by Boris Evelson of Forrester on the factors to consider when deploying a business intelligence platform, and on how “Initial business intelligence (BI) deployment efforts are often difficult to predict and may dwarf the investment you made in BI platform software.” The link is here.
Obviously there is huge variability in both the cost and the resources required to deliver a successful outcome, and this in turn depends on a large number of parameters, all of which are described really well in the blog. My purpose in this piece is not to rewrite Mr Evelson’s excellent document but to focus on a specific area which he refers to as “Source applications integration”, and which I believe is often overlooked, or ignored until it becomes critical, at which point it can significantly affect the overall cost and success of a BI project.
The problem, of course, is that unless we can be absolutely confident we have found the right elements of the source system with which to populate the data warehouse, we cannot be confident that the business is basing its decisions on the right information.
Herein lies my “garbage out, garbage in” thought: no matter how clever your ETL tools are or how good your data quality tool is, if you are working with the wrong files or tables, or with an incomplete set of tables, then the data you extract from the source will be of very little use to the data warehouse or to the BI users, who might make ill-advised decisions based on incorrect data.
A secondary challenge is the quality of that data, but it is more important to identify the appropriate sources first. In most instances the source applications are, of course, relatively easy to understand: they are straightforward to reverse engineer, are well documented, or tools exist to help identify the tables and columns needed for the DW/BI implementation, so they present no real problem to project teams.
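For these simpler sources, the discovery step often amounts to little more than querying the database’s system catalogue, which is exactly what most modelling and reverse-engineering tools do under the covers. The sketch below is a minimal illustration of that approach, assuming an ODBC-accessible database that exposes the ANSI INFORMATION_SCHEMA views; the DSN, credentials and the ‘CUSTOMER’ search term are hypothetical.

```python
# Minimal sketch: locating candidate source columns via the system catalogue,
# the approach conventional reverse-engineering and modelling tools rely on.
# Assumes an ODBC-accessible database exposing INFORMATION_SCHEMA; the DSN,
# credentials and search term below are hypothetical.
import pyodbc

conn = pyodbc.connect("DSN=SourceApp;UID=reader;PWD=secret")  # hypothetical DSN
cursor = conn.cursor()

# Search the catalogue for columns whose names suggest customer data.
cursor.execute("""
    SELECT table_name, column_name, data_type
    FROM information_schema.columns
    WHERE UPPER(column_name) LIKE '%CUSTOMER%'
    ORDER BY table_name, column_name
""")
for table_name, column_name, data_type in cursor.fetchall():
    print(f"{table_name}.{column_name} ({data_type})")
```

Where the application uses sensible naming and is reasonably documented, a handful of queries like this is usually enough to get the project team to the right tables quickly.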
Others, however, including those from SAP, Salesforce and Oracle (E-Business Suite, JD Edwards, Siebel, PeopleSoft), which are the main systems of record for hundreds of thousands of organisations, are much more complex and require far more effort to find the tables needed.
Part of the problem lies with the sheer scale of these applications: they all have many thousands of tables (an SAP system, for example, will have over 90,000 before customisations are applied) and a myriad of inter-relationships, meaning that traditional reverse engineering by modelling tools is either impossible or totally impractical. Imagine a visualisation of a complete SAP system on a single screen. One of our customers called it a ‘death star visualisation’. It would also be impossible to navigate and search.
Additionally, there are sometimes idiosyncratic naming conventions, and most implementations have been quite heavily customised, meaning that even specialists can struggle to find what they need quickly and easily.
Finally, the information about tables (metadata) in these applications, and others like them, is stored in application-level data dictionary tables rather than in the database’s system catalogue. This further reduces the effectiveness of traditional tools designed to help you find the right tables, can significantly extend the time it takes to locate exactly what you are looking for, and raises the distinct possibility that a good deal of ‘informed guesswork’ will be involved.
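To illustrate the difference, the sketch below searches SAP’s own ABAP Data Dictionary tables (DD02L, the table directory, and DD02T, the table descriptions) rather than the database catalogue, so that tables can be found by their business description instead of their often cryptic technical names. It is only a sketch under assumptions: it presumes direct read access to the underlying database, the connection details and ‘customer’ search term are hypothetical, and the field names should be checked against your SAP release.

```python
# Minimal sketch: searching SAP's ABAP Data Dictionary instead of the database
# system catalogue. DD02L (table directory) and DD02T (table descriptions) are
# standard SAP dictionary tables; verify the field names against your release.
# The DSN, credentials and search term are hypothetical, and this assumes
# direct read access to the underlying database.
import pyodbc

conn = pyodbc.connect("DSN=SAPERP;UID=reader;PWD=secret")  # hypothetical DSN
cursor = conn.cursor()

# Find transparent tables whose English description mentions "customer".
cursor.execute("""
    SELECT l.TABNAME, t.DDTEXT
    FROM DD02L l
    JOIN DD02T t
      ON t.TABNAME = l.TABNAME
     AND t.DDLANGUAGE = 'E'
    WHERE l.TABCLASS = 'TRANSP'
      AND UPPER(t.DDTEXT) LIKE '%CUSTOMER%'
    ORDER BY l.TABNAME
""")
for tabname, ddtext in cursor.fetchall():
    print(f"{tabname}: {ddtext}")
```

Even this only narrows the search: with tens of thousands of candidate tables and heavy customisation, description searches still have to be combined with knowledge of the application’s module structure and of the customisations in place.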
So, coming back to ‘garbage out, garbage in’: without accurate knowledge of which tables hold the required information in the source systems, the mapping from source to target data warehouse via ETL or data integration tools is less likely to meet business requirements.
This can lead to overspend, re-work and lengthening delivery times, and at worst risks ill-informed decisions being based on wrong information. Perhaps it is time to start taking the challenge of source data analysis, or ‘metadata discovery’ as we sometimes call it, more seriously before embarking on data warehouse or business intelligence projects?