By: Editorial Team · Published: January 18, 2024
Overview
The single most common reason AI projects fail in production is data quality, not model capability. Evaluating data readiness before committing to a build saves months of rework and prevents launching a system that cannot be trusted.
Key takeaways
- Data quality problems discovered late cost ten times more to fix than problems discovered early.
- Governance and compliance questions must be answered before data reaches a model.
- Labeling effort is almost always underestimated; plan for it explicitly.
- A data readiness review is a 1–2 day exercise, not a multi-week audit.
The data readiness review
A data readiness review answers four questions before any model training or prompt engineering begins:
- Is the data accessible?
- Is it representative?
- Is it clean enough?
- Is it legally permissible to use?
The checklists below operationalize each question.
Checklist 1: Accessibility
- [ ] Source data can be exported or queried without manual intervention
- [ ] Access credentials are documented and not tied to a single person's account
- [ ] Data volume is confirmed — enough records exist to cover edge cases
- [ ] Refresh frequency matches the model's expected update cadence
- [ ] A backup or snapshot exists so experiments can be reproduced
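Two of the accessibility items above are easy to automate: confirming data volume and fingerprinting a snapshot so experiments are reproducible. A minimal sketch, assuming records arrive as a list of JSON-serializable dicts; the `min_records` threshold and the `check_accessibility` helper name are illustrative, not a standard:

```python
import hashlib
import json

def check_accessibility(records, min_records=10_000):
    """Confirm volume and produce a snapshot fingerprint so any later
    experiment can verify it ran against the same data.
    `min_records` is an illustrative threshold; set your own."""
    volume_ok = len(records) >= min_records
    # Hash a canonical serialization (sorted keys) so the fingerprint
    # is stable across runs and machines.
    snapshot = json.dumps(records, sort_keys=True).encode("utf-8")
    fingerprint = hashlib.sha256(snapshot).hexdigest()
    return {"volume_ok": volume_ok,
            "record_count": len(records),
            "snapshot_sha256": fingerprint}
```

Recording the snapshot hash alongside each experiment makes "which data did this model see?" answerable months later.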
Checklist 2: Representativeness
- [ ] The training dataset covers all the input types the model will encounter in production
- [ ] Rare cases (outliers, edge cases) are explicitly included, not just common ones
- [ ] Temporal distribution is checked — old data may not reflect current behavior
- [ ] Geographic, demographic, or domain variation is accounted for if relevant
- [ ] Class imbalance has been measured and a handling strategy is documented
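The last item, measuring class imbalance, fits in a few lines. A sketch assuming labels arrive as a flat list; what counts as "too imbalanced" is project-specific, so document whichever threshold you choose:

```python
from collections import Counter

def class_balance_report(labels):
    """Per-class share of the dataset, plus the ratio of the most
    common class to the rarest (1.0 = perfectly balanced)."""
    counts = Counter(labels)
    total = sum(counts.values())
    shares = {cls: n / total for cls, n in counts.items()}
    imbalance_ratio = max(counts.values()) / min(counts.values())
    return shares, imbalance_ratio
```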
Checklist 3: Data quality
- [ ] Duplicate records have been identified and a deduplication strategy is defined
- [ ] Missing value rates are below an agreed threshold per field
- [ ] Format inconsistencies (date formats, unit variations, encoding issues) are catalogued
- [ ] Outliers have been reviewed — some are errors, some are valid edge cases
- [ ] Labels (if supervised learning) have been reviewed for consistency; inter-annotator agreement is measured
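Missing-value rates and exact duplicates are the cheapest of these checks to automate. A sketch assuming dict-shaped records; the 5% default for `missing_threshold` is a placeholder, and real deduplication usually also needs a fuzzy-matching strategy this does not cover:

```python
def quality_report(records, fields, missing_threshold=0.05):
    """Per-field missing-value rates and an exact-duplicate count.
    `missing_threshold` is a placeholder; agree on a real value per
    field with the data owner."""
    n = len(records)
    missing = {
        f: sum(1 for r in records if r.get(f) in (None, "")) / n
        for f in fields
    }
    failing = [f for f, rate in missing.items() if rate > missing_threshold]
    # Exact-match duplicates only; near-duplicates need fuzzy matching.
    seen, dupes = set(), 0
    for r in records:
        key = tuple(sorted(r.items()))
        if key in seen:
            dupes += 1
        seen.add(key)
    return {"missing_rates": missing,
            "fields_over_threshold": failing,
            "duplicate_count": dupes}
```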
Checklist 4: Governance and compliance
- [ ] Data provenance is documented — where did each dataset come from?
- [ ] Consent or licensing terms permit the intended use
- [ ] Personally identifiable information (PII) has been identified and a handling policy is in place
- [ ] Applicable regulations (GDPR, HIPAA, CCPA) have been identified and their requirements are documented
- [ ] Retention and deletion policies are defined and technically enforceable
- [ ] Third-party data sharing (including sending data to external APIs) is cleared legally
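A first pass at the PII item can be a simple pattern scan flagging fields for human review. This is a minimal sketch with two illustrative patterns; a production scan needs a vetted library and a legally approved PII taxonomy, because regexes like these miss many cases:

```python
import re

# Illustrative patterns only — not a complete or legally sufficient
# PII taxonomy.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scan_for_pii(text):
    """Return which illustrative PII categories appear in `text`."""
    return [name for name, pat in PII_PATTERNS.items() if pat.search(text)]
```

Treat hits as candidates for the handling policy above, not as a definitive inventory.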
Red flags that require a stop
If any of the following are true, the project should pause before proceeding:
- Source data is owned by a third party and no data-sharing agreement exists
- PII is present and no legal basis for processing has been confirmed
- Labels were generated by a process that cannot be explained or reproduced
- Data cannot be accessed without manual exports from a legacy system with no API
These are not problems to work around on the way to a demo. They become blockers in production.
Labeling effort planning
Supervised learning requires labeled examples. Labeling is consistently the slowest and most expensive part of an AI project when it is not planned upfront.
A realistic labeling plan documents:
- How many labeled examples are needed (start with 500–1000 for classification tasks)
- Who will label them and what their subject-matter expertise level is
- What labeling tool will be used and how disagreements are resolved
- How long labeling will take at a realistic throughput rate
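The throughput question reduces to back-of-envelope arithmetic worth writing down explicitly. A sketch where every default is an assumption: `overlap` > 1 accounts for double-labeling a fraction of examples to measure inter-annotator agreement, and `hours_per_day` reflects that annotators rarely label full-time:

```python
def labeling_timeline(n_examples, seconds_per_label,
                      overlap=1.2, hours_per_day=4):
    """Back-of-envelope labeling effort estimate.
    All defaults are illustrative assumptions, not benchmarks."""
    total_labels = n_examples * overlap          # includes agreement overlap
    total_hours = total_labels * seconds_per_label / 3600
    return {"total_hours": round(total_hours, 1),
            "annotator_days": round(total_hours / hours_per_day, 1)}
```

For example, 1,000 examples at 30 seconds each with 20% overlap comes to 10 annotator-hours, or 2.5 working days at 4 labeling hours per day; numbers like these make the "labeling is underestimated" failure mode visible before it happens.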
Conclusion
Data readiness is unglamorous work that pays for itself many times over. A day or two spent on these checklists before development begins prevents months of rework after launch. Treat the review as the first engineering deliverable, not a prerequisite to skip.