Real-Time Data Quality Through MDM
By Bob Wall, Senior Consultant
One of the biggest challenges on the data migration, data integration and data warehousing projects I’ve worked on is the requirement for clean data. Typically, one of a plethora of data quality/cleansing tools is deployed. The tool employs sophisticated heuristic, probabilistic, deterministic, phonetic, linguistic and empirical methods and algorithms to perform data quality analysis. For example, for customer data being integrated into a data warehouse, we want the tool to reconcile Easthartford, Hartford East, Hartford and East Hartford to the same physical address. The data quality process is usually run in batch, first for the initial load and then on a recurring schedule after that.
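To make the reconciliation example concrete, here is a minimal sketch of one deterministic technique: reducing each spelling to an order-independent key. It is not how any particular commercial tool works. The names KNOWN_TOKENS, split_known and city_key are hypothetical, and the two-word vocabulary stands in for a real reference list of place names.

```python
import re

# Hypothetical vocabulary of known place-name tokens (an assumption for this
# sketch); a real tool would draw on a much larger reference list.
KNOWN_TOKENS = ("east", "hartford")


def split_known(word):
    """Split a run-together word such as 'easthartford' into known tokens."""
    tokens = []
    rest = word
    while rest:
        for tok in KNOWN_TOKENS:
            if rest.startswith(tok):
                tokens.append(tok)
                rest = rest[len(tok):]
                break
        else:
            # No known token matches the remainder; keep the word whole.
            return [word]
    return tokens


def city_key(raw):
    """Return a case-, spacing- and order-independent key for a city name."""
    words = re.findall(r"[a-z]+", raw.lower())
    tokens = []
    for word in words:
        tokens.extend(split_known(word))
    return tuple(sorted(tokens))


variants = ["Easthartford", "Hartford East", "East Hartford"]
keys = {city_key(v) for v in variants}
```

All three variants collapse to a single key. A bare "Hartford" produces a different key, so matching it to the same physical address would need more evidence, such as the street or postal code, which is where the probabilistic and heuristic methods come in.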
Today, operational MDM hubs present an even bigger challenge. We have to ensure synchronization across multiple source systems, and all data must be consistently correct. The data quality checks must be applied at various stages within the master data lifecycle, and must support federated sharing across systems, databases and applications. In a sense, operational MDM requires real-time data quality.
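The difference from the batch approach can be sketched in a few lines: the check runs inline, at the moment a record is written, rather than in a later cleansing pass. This is a toy illustration under stated assumptions, not a real MDM product API. MasterHub, upsert and require_city are hypothetical names, and the hub is just an in-memory dictionary.

```python
class MasterHub:
    """Toy in-memory hub (hypothetical): every write is checked before it is stored."""

    def __init__(self, checks):
        self.checks = checks  # functions returning an error message or None
        self.records = {}

    def upsert(self, key, record):
        """Store the record only if all checks pass; return the list of errors."""
        errors = []
        for check in self.checks:
            message = check(record)
            if message:
                errors.append(message)
        if errors:
            return errors  # rejected at write time, nothing is stored
        self.records[key] = record
        return []


def require_city(record):
    """Example rule: a customer record must carry a non-blank city."""
    if record.get("city", "").strip():
        return None
    return "city is missing"


hub = MasterHub([require_city])
rejected = hub.upsert("c1", {"name": "Acme", "city": " "})
accepted = hub.upsert("c1", {"name": "Acme", "city": "East Hartford"})
```

The first write is refused with an error list and never reaches the hub; the second is stored. The same hook is where matching and standardization rules would be applied at each stage of the master data lifecycle.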
Digressing for a moment, I have also worked on projects that employed various search engine technologies. On one in particular, a client wanted an automated way to categorize structured and unstructured web data (emails, documents, images, etc.) and to search it for certain conditions. We investigated what was referred to as semantic web technology, which incorporated advanced semantic and linguistic analysis with classification schemes (using Hyper Text Markup Language (HTML), eXtensible Markup Language (XML), Resource Description Framework (RDF), and Web Ontology Language (OWL)) to render web content machine-readable and capable of being searched automatically.
Software tools that support inline SOA data quality services combined with advanced semantic and linguistic analysis/machine learning capabilities are beginning to evolve. Microsoft’s purchase last year of Zoomix is a testament to the strategic value of these types of products. The convergence of data quality and semantic web technologies may provide operational MDM projects with the ability to automate ongoing classification, matching, and standardization of master data records. I think this convergence is definitely worth keeping an eye on to see if it leads to real-time data quality capabilities embedded in MDM tools.
photo by Blude (via Flickr)
Bob Wall is a senior consultant with Baseline Consulting. He is an information technology specialist with 30 years of experience in all areas of data warehouse administration, data architecture, data resource management, training, and applications systems development, as well as in corporate management.
