Data discovery – the step before extraction
What comes before data extraction? In any data intensive project be it a data warehouse, data migration, master data there is a data discovery phase which has to be completed before any movement of data can be performed.
This is what used to be called source data analysis and is in modern parlance more likely to be referred to as a part of ‘data wrangling’ or ‘data preparation’. Whatever the name of the task however, it refers to the process of physically locating the data you need, understanding that data and how it is structured in the context of the project – before you do anything to it. It is also a part of the project which is often forgotten or under appreciated, especially in the rush to adopt new end user tools.
Many types of source data submit to this part of the process relatively meekly. Their data model is easily found and understood; and the data structures are visible to the data professional. Examples might be data held in spreadsheets, small packages or in-house developed database applications. The metadata behind machine or sensor data is typically relatively easy to understand as is some external data from financial feeds or social media applications.
But what if the data you are trying to access and use is not in this format? If your organisation has implemented a packaged application, say from SAP or Oracle then data discovery then becomes a significant challenge.
These applications together with packages from Microsoft and increasingly Salesforce have large data models which seriously hampers ease of discovery; SAP ECC for example has over 90,000 tables.
To make matters worse these are often heavily customised and with opaque metadata which is difficult to find, navigate and exploit.
You might expect to find some useful and usable information about the data model in the RDBMS System Catalogue however this is not the case.
Again, an example from SAP can illustrate the problem. In SAP there is a table called T415S. It has 6 Fields/Columns called MANDT, STLST, KZBDP, KZPAF, KZKLF, KZAMS. In the RDBMS System Catalogue you will find very little information to help you to figure out what actual data the table holds and which other tables it is related to, all of which is critical to enabling accurate data discovery.
Couple this with a lack of tools from package and Information Management software vendors and it is not surprising that data discovery can place a significant drag on project delivery and raise costs as well as increase the risk of inaccurate data being introduced into a enterprise’s information ecosystem. Philip Howard discussed this in a paper in 2014.
The result is that data professionals believe that they are stuck with traditional, mostly manual, methods for finding and using the metadata in their packaged systems.
These include trying to locate documentation, asking technical specialists or hiring external consultants or searching the internet. They may try to use adaptors or templates which provide no intelligence about their actual application and occasionally resort to guesswork.
What’s needed is a tool designed to deliver Application Metadata Intelligence to the people who need to be able to access and understand the data models in large complex and customised packaged systems.
If you are interested, the diagram (right) shows what those collections of letters mean in terms of the content of the table which is actually called BOM (short for Bill of Materials) Status and its fields. There is also valuable information about the related Parent and Child tables provided.
This is the sort of information our product Safyr® provides. Safyr extracts metadata from an application as implemented (including the customisations) and stores it in a repository.
The user can then search and filter the data model to locate what is needed quickly and easily in the context of their project. Facilities exist to share the results with other software environments as well as to compare the data model of two instances of the same application.
It provides our customers and partners with an accelerator to the data discovery or data preparation process and thereby making project delivery faster and at both a reduced cost and risk. More information is available here.
Leave a Reply
Want to join the discussion?Feel free to contribute!