A team at a mid-sized fintech company spent four months collecting transaction records. They went from 500K labeled examples to 10 million. Their model got worse. Not marginally worse; meaningfully worse, enough that the 500K-record baseline outperformed the scaled version on every production metric that mattered.

The post-mortem revealed the new data came from a slightly different acquisition channel with inconsistent labeling conventions. They had not noticed because their offline test set had the same problem. This failure pattern appears frequently in our experience. The instinct to collect more data feels rational; volume implies coverage, coverage implies confidence. But data quantity and data quality are not interchangeable levers, and treating them as if they are is one of the more expensive mistakes in applied machine learning.
What these terms actually mean

Data quantity is straightforward on the surface: row count, sample size, coverage across your feature space. But precision matters. A dataset with 10 million rows that covers only three of your seven relevant customer segments is not large in any useful sense; it is dense in the wrong places.
Data quality is harder because it is multidimensional. Accuracy refers to whether labels and measurements reflect ground truth. Completeness captures missing values and sparse features. Consistency covers schema drift and unit mismatches across data sources. Timeliness addresses staleness and distribution shift as the world changes around your model. And representativeness — the most underappreciated dimension — asks whether your data actually reflects the real-world distribution you’re predicting on.
That last one is where many audits fall short. Teams check for nulls, validate schemas, and run class balance checks. They less frequently ask whether the data-generating process that produced their training set matches the process that will generate their inference requests. In machine learning, these two dimensions interact in ways that make the quality problem structurally harder than the quantity problem. You can measure quantity with a single integer. Quality requires five different lenses, minimum.
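One way to ask that question concretely is to compare the training distribution of a feature against the serving distribution. Below is a minimal sketch using the population stability index (PSI); the bucket edges, the toy data, and the 0.2 "investigate" threshold are illustrative conventions, not anything prescribed above.

```python
import math
from collections import Counter

def psi(train, serving, edges):
    """Population Stability Index between two samples of one feature.

    Buckets both samples with the same edges, then sums
    (p_train - p_serve) * ln(p_train / p_serve) over buckets.
    """
    def bucket_shares(values):
        counts = Counter()
        for v in values:
            # index of the first edge the value falls below; last bucket is open-ended
            idx = next((i for i, e in enumerate(edges) if v < e), len(edges))
            counts[idx] += 1
        total = len(values)
        # small floor avoids log(0) on empty buckets
        return [max(counts[i], 1) / total for i in range(len(edges) + 1)]

    p = bucket_shares(train)
    q = bucket_shares(serving)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

# Toy check: serving data shifted upward relative to training data.
train = [i % 100 for i in range(1000)]          # roughly uniform on 0..99
serving = [30 + i % 100 for i in range(1000)]   # same shape, shifted by 30
score = psi(train, serving, edges=[25, 50, 75])
print(f"PSI = {score:.3f}")  # > 0.2 is a common "investigate" threshold
```

Run per feature between training and a recent serving window; identical distributions score near zero, and a large score is exactly the train-versus-inference mismatch the audit is meant to catch.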
When volume genuinely wins
The case for data quantity is not wrong; it is incomplete. Scaling laws in deep learning are empirically well-documented. For large language models and vision systems, performance tends to scale predictably with data volume, given a sufficient quality floor. That caveat matters, but the underlying relationship is real.
For high-variance, low-signal problems, more data genuinely helps. Rare event prediction is the clearest example. If you’re modeling fraud on a transaction set where 0.1% of events are positive, you need volume just to see enough signal. Long-tail classification has the same property. You cannot learn a robust decision boundary from twelve examples of a rare class.
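The volume requirement for rare events can be made precise. A rough sketch, approximating the count of positives in N rows as Poisson(N × rate), finds the smallest dataset that yields at least k positives with a given confidence; the 0.1% rate and k = 100 are illustrative numbers, not a recommendation.

```python
import math

def min_rows_for_positives(rate, k, confidence=0.95):
    """Smallest N with P(at least k positives) >= confidence,
    approximating the positive count as Poisson(N * rate)."""
    def p_at_least_k(n):
        lam = n * rate
        # P(X >= k) = 1 - sum_{i < k} e^-lam * lam^i / i!
        return 1 - sum(math.exp(-lam) * lam**i / math.factorial(i)
                       for i in range(k))
    # doubling search for an upper bound, then binary search
    lo, hi = 1, 1
    while p_at_least_k(hi) < confidence:
        hi *= 2
    while lo < hi:
        mid = (lo + hi) // 2
        if p_at_least_k(mid) >= confidence:
            hi = mid
        else:
            lo = mid + 1
    return lo

# At a 0.1% fraud rate, seeing 100 fraud cases with 95% confidence
# already takes over a hundred thousand transactions.
n = min_rows_for_positives(rate=0.001, k=100)
print(n)
```

The point of the sketch: at rare-event rates, even modest requirements on positive counts translate into six-figure row counts, which is why volume is non-negotiable in this regime.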
The data flywheel argument also holds in specific contexts. When collection is cheap and labeling is automated — clickstream data, sensor telemetry, system logs — quantity compounds with relatively low marginal cost. Semi-supervised settings extend this further; unlabeled data provides useful structural information even without clean labels, particularly for representation learning. For certain model families, the quantity advantage is documented and reproducible. Large neural networks and gradient boosting on tabular data both show real performance gains with more data, holding quality constant. More data works when the conditions support it.
The hidden costs of dirty data at scale
The conditions don’t always hold, and the failure mode is insidious because it is quiet. Label noise is the most studied version of this problem. Research by Northcutt et al. (2021) found pervasive label errors across widely used benchmark datasets, including ImageNet, where estimated error rates exceeded 6%. Their finding, consistent with earlier theoretical work, is that a 10% label error rate degrades model accuracy more than halving your dataset size.
More data with noisy labels does not average out the noise; it reinforces the wrong decision boundary with more evidence. Bias amplification follows the same logic. If your dataset systematically underrepresents a demographic, geographic region, or behavioral pattern, scaling that dataset does not dilute the bias. It compounds it. You’re not adding new information; you’re adding more of the same distorted signal with higher statistical confidence. The model becomes more certain about the wrong thing.
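A tiny simulation makes the "more evidence for the wrong boundary" point concrete. With symmetric 10% flip noise (the rates here are illustrative), the estimated positive rate converges as data grows, but to the noise-biased value, not the true one.

```python
import random

random.seed(0)

TRUE_RATE = 0.30   # true share of positives
FLIP = 0.10        # each label flipped with 10% probability

def noisy_positive_rate(n):
    """Estimate the positive rate from n labels with symmetric flip noise."""
    positives = 0
    for _ in range(n):
        label = random.random() < TRUE_RATE
        if random.random() < FLIP:      # annotator error flips the label
            label = not label
        positives += label
    return positives / n

# The estimator converges -- but to the biased value
# TRUE_RATE * (1 - FLIP) + (1 - TRUE_RATE) * FLIP = 0.34, not to 0.30.
est = noisy_positive_rate(200_000)
print(round(est, 3))
```

Scaling n shrinks the variance around 0.34, not the distance to 0.30: exactly the "more certain about the wrong thing" failure described above.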
The silent failure mode is the one that burns teams most badly. A model trained on low-quality data sometimes appears to perform well on held-out test sets because the test set was drawn from the same flawed distribution. The divergence surfaces in production, where the real-world distribution does not share your dataset’s systematic errors. Offline metrics look fine; production metrics fall apart. Debugging that gap is expensive in time, compute, and organizational trust.
Consider a fraud detection model trained on inconsistently labeled transactions, where “fraud” was defined differently by two annotation vendors across different time periods. Adding five times more data from those same sources does not fix the labeling inconsistency; it widens the confidence intervals around a decision boundary that was miscalibrated from the start. The operational cost is not just the bad model; it is the three retraining cycles, the escalating debugging sessions, and the stakeholders who stop trusting the system.
Strategies that actually help
The goal is not to choose quality over quantity. It is to stop treating quantity as a substitute for understanding your data. Here is what that looks like in practice.
- Audit before you augment. Before scaling collection, run distributional checks on what you already have: class imbalance, feature drift between training and serving, and label consistency across annotators or time periods. Tools like Great Expectations or Deepchecks make this tractable; so does custom statistical profiling if you know your domain well enough to write the checks. Quality issues tend to cluster in roughly 20% of your features or data slices. Finding that 20% before you 10x your dataset is significantly cheaper than finding it after.
- Use confident learning for label noise. Rather than discarding suspected noisy examples — which destroys quantity — identify likely mislabeled instances and re-examine them. CleanLab is the most practical implementation of this approach; it uses the model’s own predicted probabilities to surface examples where the label is inconsistent with the learned distribution. You do not need to remove noisy data. You need to find it, correct it, and preserve the underlying quantity. These goals are not in conflict.
- Sample strategically when you must reduce volume. If you need to work with a subset of a large dataset, random sampling is typically the wrong default. Stratified sampling preserves representativeness across subgroups: imbalanced classes, geographic diversity, temporal coverage. Random sampling from an already-skewed distribution produces a smaller, equally skewed dataset. Stratified sampling requires knowing your subgroup structure, which itself requires the audit you should have run first.
- Treat data validation as infrastructure. Schema validation, range checks, and consistency tests belong at ingestion, not as post-hoc diagnostics. Data quality degrades over time; schema drift is the default state of any production pipeline that lives long enough. Treat data validation failures as blocking the same way a failing unit test blocks a deploy. That framing elevates quality checks from optional hygiene to required infrastructure.
- Use synthetic data to fill gaps, not inflate volume. When real data is scarce and high-quality, targeted synthetic generation extends a clean dataset by filling underrepresented slices. Synthetic data inherits the assumptions of its generator. A generative model trained on your existing data reproduces your existing distribution’s blind spots. Validate synthetic augmentation on real-world holdout sets, and use it to address specific coverage gaps rather than to inflate overall volume. Wholesale replacement of real data with synthetic data is rarely the right move.
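The stratified-sampling idea above can be sketched in a few lines of plain Python; the `segment` key and the 10% fraction are placeholders for whatever subgroup structure your audit surfaced.

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, fraction, seed=0):
    """Down-sample to `fraction` of rows while preserving each
    subgroup's share. `key(row)` defines the stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))  # keep rare strata alive
        sample.extend(rng.sample(group, k))
    return sample

# Toy dataset: 90% segment "a", 10% segment "b".
rows = [{"segment": "a"}] * 900 + [{"segment": "b"}] * 100
sample = stratified_sample(rows, key=lambda r: r["segment"], fraction=0.1)
share_b = sum(r["segment"] == "b" for r in sample) / len(sample)
print(len(sample), round(share_b, 2))
```

A uniform random sample would preserve the skew by accident on average and lose rare strata by accident on bad draws; stratifying makes the preservation deterministic, and the `max(1, ...)` floor guarantees no stratum vanishes entirely.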
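As for validation-as-infrastructure, here is a deliberately minimal sketch of ingestion-time checks that raise instead of warn; the field names, ranges, and currency set are hypothetical. A real pipeline would likely reach for a framework such as Great Expectations, but the blocking behavior is the point.

```python
def validate_batch(rows):
    """Run blocking checks on an incoming batch; raise on the first
    failure, the same way a failing unit test blocks a deploy."""
    required = {"amount", "currency", "timestamp"}   # hypothetical schema
    for i, row in enumerate(rows):
        missing = required - row.keys()
        if missing:
            raise ValueError(f"row {i}: missing fields {sorted(missing)}")
        if not (0 <= row["amount"] <= 1_000_000):    # illustrative range check
            raise ValueError(f"row {i}: amount out of range: {row['amount']}")
        if row["currency"] not in {"USD", "EUR", "GBP"}:
            raise ValueError(f"row {i}: unknown currency {row['currency']!r}")
    return len(rows)  # batch size admitted to the pipeline

good = [{"amount": 12.5, "currency": "USD", "timestamp": 1700000000}]
validate_batch(good)
bad = [{"amount": -5, "currency": "USD", "timestamp": 1700000000}]
try:
    validate_batch(bad)
except ValueError as err:
    print("blocked:", err)
```

Wiring `validate_batch` in front of the training-set writer, rather than running it as an occasional notebook diagnostic, is what turns these checks from hygiene into infrastructure.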
Which lever to pull
The diagnostic question before any data collection sprint is: would doubling this dataset make my problem better, or would it make my problem bigger? Pull the quantity lever when you have a quality floor established, your model family demonstrably benefits from scale, and collection cost is low relative to the expected performance gain. Large neural networks on well-curated data, rare event prediction with automated labeling, and semi-supervised settings with abundant unlabeled structure are the cases where volume compounds.
Pull the quality lever when validation performance is inconsistent across runs, when production metrics diverge from offline metrics, or when your dataset has known labeling provenance issues. Any time you cannot explain why your model makes a specific class of errors, that is a data quality problem to diagnose, not a data quantity problem to outspend.
In many production settings, the answer is both, sequentially. Fix quality first; then scale. The reverse order is how teams end up spending four months collecting 10 million records that make their model worse. Start with an audit. Validate your data-generating process against your inference distribution. Identify the 20% of issues that account for 80% of the quality gap. Only then pull the quantity lever. Dirty data at scale does not improve models; it makes them confidently wrong.