Revisiting data quality in the age of AI and ChatGPT

Five key dimensions of data quality every developer should understand.

Lost in the excitement about AI, agents, and artificial general intelligence is the fact that AI is only as good as the underlying training data used to build the models. This is especially true for companies looking to leverage AI models with their internal data.

Data quality is a nebulous concept – good-quality data is taken for granted, but everyone notices when the data is poor. In this blog we will look at a systematic framework for building and maintaining a quality program for your data, so that your investments in automation and AI pay off.

Ensuring data quality across operations

The typical data landscape combines internal systems (like ERPs, CRMs, and the general ledger) with many departmental systems and applications, each consuming from its own independent sources. Nowadays most companies also gather information about their customers and products from social media platforms and other online resources.

In most companies, organic data originates in one of several critical core systems. These usually have a robust set of guardrails to ensure accurate, high-quality information is being captured and distributed. However, because these systems are tightly controlled, data from them gets distributed outward for more localized usage and augmentation.

Enterprise or departmental operational and decisioning systems typically blend data from the core systems with additional data from sources inside and outside the company. This usually happens recursively from one system to the next, and along the way the data shows up in a customer experience like an app or a website.

Each of these systems can introduce a quality defect as data moves through it. Quality can be compromised in several ways: different representations of similar data (e.g., product numbers, customer identifiers, or customer demographics being captured and stored differently), human errors of commission or omission, or simply code defects in data handling.

On the surface, quality defects may appear random and sporadic, which can lead to a reactive approach of treating each discovery as a unique event. Defining what good quality means is not always straightforward. Data quality is always in the context of a certain consumption requirement – is the data fit for the intended use? Understanding this basic premise is important; otherwise we would be in a perpetual hunt for perfect quality, and that simply doesn’t exist. Cue the pot of gold at the end of the rainbow!

Fortunately, data management practice offers an efficient, time-tested framework that not only puts a system of quality in place to identify data problems, but also prevents many of them from occurring in the first place.

5 key dimensions of data quality

The first step is to more tangibly define data quality along a set of dimensions – five to be precise. 

Accuracy

Let’s start with the accuracy dimension. This refers to the factual correctness of the data element for its intended purpose: the data value respects the real-world concept it represents. For example, a date should only have a month between 1 and 12, and a day between 1 and 28, 29, 30, or 31 depending on the month and year (assuming, of course, we are operating on a Gregorian calendar). The same applies to numeric values – for example, a rent amount should always be greater than 0.

Accuracy is universally understood and is the most basic form of quality control: unlike the other four dimensions, it does not depend on a specific context, and it is the most easily automated.
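
To make this concrete, here is a minimal sketch of an automated accuracy check in Python. The record layout and field names are hypothetical, and Python’s own calendar logic does the heavy lifting for the date rules described above:

```python
from datetime import date

def check_accuracy(record: dict) -> list[str]:
    """Return a list of accuracy violations for a single record."""
    errors = []

    # Constructing a real date enforces months 1-12 and the correct
    # day count per month (including leap years) in one step.
    try:
        date(record["year"], record["month"], record["day"])
    except (KeyError, ValueError) as exc:
        errors.append(f"invalid date: {exc}")

    # Numeric sanity check: a rent amount must be greater than zero.
    if record.get("rent_amount", 0) <= 0:
        errors.append("rent_amount must be greater than 0")

    return errors

# February 30 does not exist, so this record fails the check.
print(check_accuracy({"year": 2024, "month": 2, "day": 30, "rent_amount": 1200}))
```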

Validity

The next dimension is validity: is the data valid within the business context? For instance, in the earlier date example, the year should always be greater than 1900 and less than 2100 because we are working on insurance claims, and the life span of our target customer population is guaranteed to fall within that 200-year window. Similarly, our business may insist that rent payments be received only from a recognized financial institution (as identified by the ABA routing number).

Validity is especially useful for identifying cross-dependencies. For example, if I am aggregating data between my ERP system and a customer feedback database, did I match the right customer with their feedback? This is typically done via a universal customer identifier, so building referential checks between similar data in two places becomes necessary.
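
A minimal sketch of both kinds of validity check, assuming hypothetical business rules (the 1900–2100 window, a set of recognized ABA routing numbers) and hypothetical field names:

```python
def check_validity(claim: dict, recognized_routing_numbers: set[str]) -> list[str]:
    """Apply business-context rules to a single claim record."""
    errors = []
    # Business rule: claim years must fall inside the 200-year window.
    if not 1900 < claim["year"] < 2100:
        errors.append(f"year {claim['year']} is outside the business range")
    # Business rule: payments must come from a recognized institution.
    if claim["routing_number"] not in recognized_routing_numbers:
        errors.append("payment is not from a recognized financial institution")
    return errors

def orphaned_feedback(feedback_rows: list[dict], erp_customer_ids: set[str]) -> list[dict]:
    """Referential check: flag feedback rows whose customer identifier
    has no matching customer in the ERP system."""
    return [row for row in feedback_rows if row["customer_id"] not in erp_customer_ids]
```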

Consistency

After that comes the consistency dimension – and this is where real contextual intelligence becomes relevant. Are the data items internally consistent, both within themselves and with one another? For example, if the monthly rent is $1,200, then do all the rent payments for the month roughly fall in that range (accommodating hardships or advance payments)? So, if I see a rent payment for $12,000, I have reason to be dubious about the quality of that data.

Similarly, if the debt-to-salary ratio for a renter should not exceed 30%, then I’d expect the salary amount to be somewhere near $4,000 per month – anything drastically less would indicate a potential quality problem. (We could extend this analogy: maybe there was a waiver or a promotion, in which case some other data value should reflect that, and a separate check can be implemented for it.)
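
A consistency check compares related values against each other. The thresholds in this sketch are illustrative assumptions – a real system would tune them to its own data:

```python
def check_consistency(payment: dict, expected_rent: float = 1200.0) -> list[str]:
    """Flag values that are wildly out of line with related data."""
    warnings = []

    # A payment far above the expected monthly rent (here, 5x) is
    # suspect -- it may be a data entry error like $12,000 for $1,200.
    if payment["amount"] > expected_rent * 5:
        warnings.append(f"payment of {payment['amount']} far exceeds expected rent")

    # If debt-to-salary must not exceed 30%, the reported debt implies
    # a minimum plausible salary; a salary drastically below it (here,
    # under half) is inconsistent with the other data.
    implied_minimum_salary = payment["monthly_debt"] / 0.30
    if payment["monthly_salary"] < implied_minimum_salary * 0.5:
        warnings.append("salary is inconsistent with the reported debt ratio")

    return warnings
```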

Timeliness

The last two dimensions round out the fit-for-use aspect. The timeliness dimension asks whether the data is fresh enough for the business problem at hand. Sometimes older data is sufficient for making directional judgments, but most business interactions rely on the latest available information – the last transaction, the last interaction, or the latest promotion. Knowing whether you are using the latest available data or older data can make all the difference between customer delight and customer defection.
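
Timeliness checks reduce to comparing a record’s age against a freshness requirement. The 24-hour staleness rule below is a hypothetical example of such a requirement:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical freshness rule: data older than 24 hours is too stale
# for a real-time customer interaction.
MAX_STALENESS = timedelta(hours=24)

def is_fresh(last_updated: datetime, now: datetime | None = None) -> bool:
    """Return True if the data is recent enough for the intended use."""
    now = now or datetime.now(timezone.utc)
    return now - last_updated <= MAX_STALENESS
```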

Completeness

Finally, the completeness dimension ensures that you are not working with partial information. For example, knowing how many units of a product (say, cupcakes) sold daily at each of your three franchises is required to plan your flour inventory for the next day. If only two stores sent in their information, then your flour inventory is a best guess (of course, the prior day’s information could be used as a proxy). But what if a festival in the neighborhood of that third store was driving up demand for your cupcakes beyond your average sales? It sure would be helpful to know about that spike in demand!
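
A completeness check boils down to comparing the sources you heard from against the sources you expected. A minimal sketch, using hypothetical store identifiers for the cupcake example:

```python
EXPECTED_STORES = {"store_a", "store_b", "store_c"}  # hypothetical franchise IDs

def missing_reports(daily_sales: dict[str, int]) -> set[str]:
    """Return the stores whose daily sales report has not arrived."""
    return EXPECTED_STORES - daily_sales.keys()

# Only two of three stores reported: the inventory plan is a best guess.
gaps = missing_reports({"store_a": 120, "store_b": 95})
if gaps:
    print(f"Missing sales reports from: {sorted(gaps)}")
```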

Conclusion

Every business process can benefit from viewing its input data through the above framework and building a quality control process that monitors and detects data anomalies before they reach your product. The framework is deliberately broad to accommodate a variety of scenarios without getting bogged down in specifics.

At the end of the day, the best checks are those that never find anything, because the underlying process is so reliable. In the absence of perfection, a systematic approach to data quality is often the secret sauce that delivers superior customer outcomes and helps retain and grow your business’ reputation.

About the Author

Ganesh is a semi-retired product veteran specializing in Data and Analytics with 30+ years of experience in financial services. Ganesh is available for Fractional Product Management or Fractional Data & Analytics Leader roles in a consulting or advisory capacity.
