Converging Data Science and Data Governance
How do data management and data analytics tools complement each other?
When talking about data and data management, I often refer to the two characteristics of data management defined by DalleMule and Davenport in their Harvard Business Review article: data offense and data defense.
Data offense supports business objectives, such as sales, and is often associated with big data. Data defense minimizes downside risk, for example through compliance and data integrity.
The defense side of data management is about fostering data quality and data integrity. Data quality activities aimed at defense were long considered necessary purely for legal and regulatory reasons, and of minor importance to the business itself.
However, this view is starting to change. Data governance and its direct contribution to data science are increasingly regarded as complementary elements in creating data insights and the corresponding value: data science relies on reliable and valid data, and the better the quality of the input data, the more easily it can be used for analytical purposes.
Interestingly, it is advanced analytics and machine learning that are driving this change as the big data tools and supporting tool sets can also be used for data governance.
Traditional processes and tools used to monitor data quality are still often rule-based. Considerable effort goes into specifying data quality rules, which are then implemented in data quality measurement systems. At the same time, the number of attributes to be monitored grows as data governance matures. Scaling up resources 1:1 to keep pace with this trend cannot be the answer.
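A rule-based check of this kind can be sketched in a few lines of Python. The attributes and rules below are purely illustrative, not those from the PoC; the point is that every rule must be written and maintained by hand:

```python
# Minimal sketch of rule-based data quality monitoring.
# Records, attribute names, and rules are illustrative examples.

records = [
    {"customer_id": "C001", "email": "a@example.com", "age": 34},
    {"customer_id": "C002", "email": "not-an-email", "age": -5},
    {"customer_id": "",     "email": "b@example.com", "age": 51},
]

# Each rule is hand-specified per attribute -- this is the effort
# that grows with every newly monitored attribute.
rules = {
    "customer_id must not be empty": lambda r: bool(r["customer_id"]),
    "email must contain '@'":        lambda r: "@" in r["email"],
    "age must be between 0 and 120": lambda r: 0 <= r["age"] <= 120,
}

def check(records, rules):
    """Return a list of (record index, violated rule name)."""
    violations = []
    for i, record in enumerate(records):
        for name, rule in rules.items():
            if not rule(record):
                violations.append((i, name))
    return violations

for idx, rule_name in check(records, rules):
    print(f"record {idx}: {rule_name}")
```

Every new attribute type needs its own rule, which is exactly the 1:1 scaling problem described above.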
In searching for scalability, data governance teams quickly discovered the benefit of the tools used in data engineering and analytics – above all, machine learning.
Within a proof of concept (PoC) phase lasting three months, we explored how machine learning can be used to monitor data quality.
For the PoC we chose 14 data attributes for which we had already set up a traditional data quality measurement to have a baseline for comparison purposes.
The PoC had to meet three challenges:
- Data errors identified with the classical approach must be replicable with machine learning.
- Data errors can be identified with unsupervised machine learning.
- Higher productivity can be achieved by increasing the level of automation and reusability of models for other data attributes of the same type.
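The article does not describe the PoC's models in detail. As a minimal, hedged illustration of the unsupervised idea in the second goal, even a simple median-absolute-deviation outlier check flags anomalous values without any labelled errors or hand-written rules (the data and threshold below are illustrative):

```python
import statistics

def mad_outliers(values, threshold=3.5):
    """Flag values whose modified z-score exceeds the threshold.

    Unsupervised: no labelled errors are needed -- the data's own
    distribution defines what counts as anomalous.
    """
    median = statistics.median(values)
    mad = statistics.median(abs(v - median) for v in values)
    if mad == 0:  # no spread, nothing to flag
        return []
    return [
        (i, v) for i, v in enumerate(values)
        if abs(0.6745 * (v - median) / mad) > threshold
    ]

# Invoice amounts with one suspicious entry (illustrative data).
amounts = [102.5, 98.0, 101.2, 99.9, 100.4, 97.8, 5000.0, 101.1]
print(mad_outliers(amounts))  # only the 5000.0 entry stands out
```

Because the check learns from the data itself, the same function can be reused for any numeric attribute of the same type, which is the reusability goal in the third bullet.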
The result of the PoC is very encouraging as all of the defined goals were achieved:
- The models used were able to reproduce almost 100 percent of the errors; with further training, we expect to close the remaining gap.
- The models could also identify previously unknown data errors.
- A substantial productivity increase can be expected if machine learning is used.
This is an encouraging result given the data management effort involved, and it supports the aim of applying machine learning to detective data quality monitoring.
In summary, applying advanced analytical tools and machine learning can facilitate the work of governance and quality experts. Machine learning can be used to define data quality thresholds and even to support data lineage. The data offense and defense teams not only share the same tools but are also teaming up more often to solve business problems, as skill sets converge and business purposes become more aligned.
This post was originally published on CIO Applications Europe.