Thinking

A team at a mid-sized fintech company spent four months collecting transaction records, growing their training set from 500K labeled examples to 10 million. Their model got worse. Not marginally worse, but meaningfully worse: the 500K-record baseline outperformed the scaled version on every production metric that mattered.

The post-mortem revealed the new data came from a slightly different acquisition channel with inconsistent labeling conventions. They had not noticed because their offline test set had the same problem. This failure pattern appears frequently in our experience. The instinct to collect more data feels rational; volume implies coverage, coverage implies confidence. But data quantity and data quality are not interchangeable levers, and treating them as if they are is one of the more expensive mistakes in applied machine learning.

What these terms actually mean

Data quantity is straightforward on the surface: row count, sample size, coverage across your feature space. But precision matters. A dataset with 10 million rows that covers only three of your seven relevant customer segments is not large in any useful sense; it is dense in the wrong places.
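Coverage is easy to make concrete as a count rather than a row total. A toy sketch, with hypothetical segment ids and row counts standing in for a real customer taxonomy:

```python
from collections import Counter

# Hypothetical segment ids; a real audit would pull these from your taxonomy.
RELEVANT_SEGMENTS = {"s1", "s2", "s3", "s4", "s5", "s6", "s7"}

# Toy stand-in for a dataset that is dense in only three segments.
rows = ["s1"] * 600 + ["s2"] * 300 + ["s3"] * 100

counts = Counter(rows)
covered = RELEVANT_SEGMENTS & counts.keys()
missing = RELEVANT_SEGMENTS - counts.keys()
print(f"{len(rows)} rows, but only {len(covered)}/7 segments covered")
```

A dashboard that reports both numbers, rows and covered segments, makes "dense in the wrong places" visible at a glance.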

Data quality is harder because it is multidimensional. Accuracy refers to whether labels and measurements reflect ground truth. Completeness captures missing values and sparse features. Consistency covers schema drift and unit mismatches across data sources. Timeliness addresses staleness and distribution shift as the world changes around your model. And representativeness — the most underappreciated dimension — asks whether your data actually reflects the real-world distribution you’re predicting on.
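Several of these dimensions can be screened cheaply before any modeling. A minimal sketch in pandas, assuming a toy transaction frame with hypothetical columns (`amount`, `currency`, `label`), covering three of the five lenses as summary statistics:

```python
import pandas as pd

# Toy transaction frame; column names are hypothetical.
df = pd.DataFrame({
    "amount":   [12.5, None, 43.0, 7.2, 99.9],
    "currency": ["USD", "USD", "usd", "USD", "EUR"],
    "label":    [0, 1, 0, 0, 1],
})

def quality_report(df: pd.DataFrame) -> dict:
    """Three of the five lenses, reduced to cheap summary statistics."""
    return {
        # Completeness: share of non-null cells per column.
        "completeness": df.notna().mean().to_dict(),
        # Consistency: rows whose currency code breaks the canonical casing.
        "inconsistent_currency": int((df["currency"] != df["currency"].str.upper()).sum()),
        # Representativeness proxy: a positive rate you can sanity-check
        # against the rate you expect in production.
        "positive_rate": float(df["label"].mean()),
    }

report = quality_report(df)
```

Accuracy and timeliness need external references (ground-truth audits, event timestamps), which is exactly why they are checked less often.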

That last one is where many audits fall short. Teams check for nulls, validate schemas, and run class balance checks. They less frequently ask whether the data-generating process that produced their training set matches the process that will generate their inference requests. In machine learning, these two dimensions interact in ways that make the quality problem structurally harder than the quantity problem. You can measure quantity with a single integer. Quality requires five different lenses, minimum.
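One way to ask that question empirically is a two-sample test between a feature as it appears in training data and the same feature drawn from live inference requests. A sketch using SciPy's Kolmogorov-Smirnov test, with synthetic samples standing in for both sources:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Synthetic stand-ins: one feature as training saw it vs. as serving sees it.
train_feature = rng.normal(loc=0.0, scale=1.0, size=2000)
serve_feature = rng.normal(loc=0.5, scale=1.0, size=2000)  # shifted channel

stat, p_value = ks_2samp(train_feature, serve_feature)
if p_value < 0.01:
    print(f"distribution shift detected (KS={stat:.3f}, p={p_value:.2g})")
```

Run per feature, this is a crude but cheap first pass; it will not catch multivariate shifts, but it catches the embarrassing univariate ones before a model does.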

When volume genuinely wins

The case for data quantity is not wrong; it is incomplete. Scaling laws in deep learning are empirically well-documented. For large language models and vision systems, performance tends to scale predictably with data volume, given a sufficient quality floor. That caveat matters, but the underlying relationship is real.

For high-variance, low-signal problems, more data genuinely helps. Rare event prediction is the clearest example. If you’re modeling fraud on a transaction set where 0.1% of events are positive, you need volume just to see enough signal. Long-tail classification has the same property. You cannot learn a robust decision boundary from twelve examples of a rare class.
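The arithmetic behind that volume requirement is worth making explicit. A small helper (the function name is illustrative) for the expected row count needed to observe a target number of positive events:

```python
import math

def rows_needed(target_positives: int, positive_rate: float) -> int:
    """Expected rows required to observe a given number of positive events."""
    return math.ceil(target_positives / positive_rate)

# At a 0.1% fraud rate, even a modest positive count demands serious volume.
print(rows_needed(1_000, 0.001))   # 1,000,000 rows for ~1,000 fraud cases
```

This is expectation only; sampling variance means you may see fewer, which argues for padding the estimate rather than trimming it.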

The data flywheel argument also holds in specific contexts. When collection is cheap and labeling is automated—clickstream data, sensor telemetry, system logs—quantity compounds with relatively low marginal cost. Semi-supervised settings extend this further; unlabeled data provides useful structural information even without clean labels, particularly for representation learning. For certain model families, the quantity advantage is documented and reproducible. Large neural networks and gradient boosting on tabular data both show real performance gains with more data, holding quality constant. More data works when the conditions support it.

The hidden costs of dirty data at scale

The conditions don’t always hold, and the failure mode is insidious because it is quiet. Label noise is the most studied version of this problem. Research by Northcutt et al. (2021) found pervasive label errors across widely used benchmark datasets, including ImageNet, where estimated error rates exceeded 6%. Their finding, consistent with earlier theoretical work, is that a 10% label error rate degrades model accuracy more than halving your dataset size.
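A lightweight version of this kind of label-error hunt (the core idea behind Northcutt et al.'s confident learning, not their full procedure) flags examples whose out-of-fold predicted probability contradicts their recorded label. A sketch on synthetic data with 10% of labels deliberately flipped:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(42)
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Deliberately flip 10% of labels to simulate inconsistent annotation.
flip_idx = rng.choice(len(y), size=100, replace=False)
y_noisy = y.copy()
y_noisy[flip_idx] ^= 1

# Out-of-fold probabilities: each example is scored by a model that never
# saw it, so labels the model confidently contradicts are suspects.
proba = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y_noisy, cv=5, method="predict_proba"
)
suspect = np.where(proba[np.arange(len(y_noisy)), y_noisy] < 0.2)[0]

# What fraction of the flagged rows are genuinely mislabeled?
precision = float(np.isin(suspect, flip_idx).mean())
```

On real data you do not know `flip_idx`, of course; the flagged rows go to a human re-labeling queue instead of a precision calculation.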

More data with noisy labels does not average out the noise; it reinforces the wrong decision boundary with more evidence. Bias amplification follows the same logic. If your dataset systematically underrepresents a demographic, geographic region, or behavioral pattern, scaling that dataset does not dilute the bias. It compounds it. You’re not adding new information; you’re adding more of the same distorted signal with higher statistical confidence. The model becomes more certain about the wrong thing.
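That dynamic is easy to demonstrate numerically: as a biased sample grows, the estimate's uncertainty shrinks around the wrong value rather than moving toward the truth. A sketch with made-up rates:

```python
import numpy as np

rng = np.random.default_rng(1)
true_rate = 0.30     # hypothetical real-world positive rate at inference time
biased_rate = 0.20   # rate the skewed collection channel actually produces

for n in (1_000, 100_000):
    sample = rng.binomial(1, biased_rate, size=n)
    est = sample.mean()                      # converges to the biased rate
    se = sample.std(ddof=1) / np.sqrt(n)     # standard error shrinks with n
    print(f"n={n:>7}: estimate={est:.3f} +/- {2 * se:.3f} (truth: {true_rate})")
```

At 100x the data, the confidence interval is 10x tighter and still excludes the true rate entirely.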

The silent failure mode is the one that burns teams most badly. A model trained on low-quality data sometimes appears to perform well on held-out test sets because the test set was drawn from the same flawed distribution. The divergence surfaces in production, where the real-world distribution does not share your dataset’s systematic errors. Offline metrics look fine; production metrics fall apart. Debugging that gap is expensive in time, compute, and organizational trust.
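Catching that divergence early means monitoring the serving distribution against the training distribution, not just the offline metrics. One common tool is the Population Stability Index; a minimal NumPy sketch, with synthetic samples standing in for training and serving traffic:

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a reference and a serving sample."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    # Clamp serving values into the reference range so every row lands in a bin.
    actual = np.clip(actual, edges[0], edges[-1])
    e = np.histogram(expected, edges)[0] / len(expected)
    a = np.histogram(actual, edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 5000)     # offline training sample
same = rng.normal(0.0, 1.0, 5000)      # serving traffic, no shift
shifted = rng.normal(0.6, 1.0, 5000)   # serving traffic from a drifted channel
# A common rule of thumb: PSI above roughly 0.2 signals meaningful drift.
```

Run per feature on a schedule, the point is to see the drift before the production metrics do.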

Consider a fraud detection model trained on inconsistently labeled transactions, where “fraud” was defined differently by two annotation vendors across different time periods. Adding five times more data from those same sources does not fix the labeling inconsistency; it widens the confidence intervals around a decision boundary that was miscalibrated from the start. The operational cost is not just the bad model; it is the three retraining cycles, the escalating debugging sessions, and the stakeholders who stop trusting the system.

Strategies that actually help

The goal is not to choose quality over quantity. It is to stop treating quantity as a substitute for understanding your data. Here is what that looks like in practice.

Which lever to pull

The diagnostic question before any data collection sprint is: would doubling this dataset make my problem better, or would it make my problem bigger? Pull the quantity lever when you have a quality floor established, your model family demonstrably benefits from scale, and collection cost is low relative to the expected performance gain. Large neural networks on well-curated data, rare event prediction with automated labeling, and semi-supervised settings with abundant unlabeled structure are the cases where volume compounds.

Pull the quality lever when validation performance is inconsistent across runs, when production metrics diverge from offline metrics, or when your dataset has known labeling provenance issues. Any time you cannot explain why your model is making a specific class of errors, that is a data quality problem, not a data quantity problem.
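One cheap empirical version of this diagnostic is a learning curve: refit the same model on growing slices of the training set and watch the validation score. A sketch on synthetic data with scikit-learn:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for your training table.
X, y = make_classification(n_samples=4000, n_features=20, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=0)

# Refit on growing slices of the training set. A curve that is still rising
# says the quantity lever has room; a flat curve says more of the same rows
# will not help, and the remaining gap is likely a quality problem.
scores = []
for frac in (0.1, 0.25, 0.5, 1.0):
    n = int(len(X_tr) * frac)
    model = LogisticRegression(max_iter=1000).fit(X_tr[:n], y_tr[:n])
    scores.append(model.score(X_va, y_va))
```

The validation set must be trustworthy for this to mean anything, which is itself an argument for auditing quality before running the experiment.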

In many production settings, the answer is both, sequentially. Fix quality first; then scale. The reverse order is how teams end up spending four months collecting 10 million records that make their model worse. Start with an audit. Validate your data-generating process against your inference distribution. Identify the 20% of issues that account for 80% of the quality gap. Only then pull the quantity lever. Dirty data at scale does not improve models; it makes them confidently wrong.