How Bad Data Hides

The challenge is that bad data doesn’t always announce itself. Some failures are obvious: missing values, type mismatches, columns that should contain prices but contain strings. Practitioners catch these early, and most data pipelines have at least rudimentary checks for them. But even obvious failures are underestimated in their downstream effects.

A 5% missing rate in a single feature appears manageable; a 5% missing rate across eight correlated features can compound into something far less predictable, especially when imputation assumptions differ between training and serving.

The genuinely dangerous failures are the silent ones. Label noise is the canonical example. Consider a sentiment classifier trained on historical customer reviews where a regex preprocessing step labeled “not bad” as negative because it matched on “bad.” The model trains. Evaluation metrics look reasonable. It isn’t until production, where edge cases accumulate, that the classifier’s systematic confusion becomes visible.
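
A minimal sketch of how that kind of labeling bug looks in code; the labeler and review strings are hypothetical, but the failure mode is the same:

```python
import re

def naive_label(review: str) -> str:
    # Hypothetical keyword labeler: any match on "bad" is scored negative,
    # so negations like "not bad" are systematically mislabeled.
    if re.search(r"\bbad\b", review.lower()):
        return "negative"
    return "positive"

print(naive_label("The checkout flow was bad"))   # negative -- correct
print(naive_label("Honestly, not bad at all"))    # negative -- silently wrong
```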

Distribution drift baked into historical data is similarly insidious; a model trained on pre-pandemic e-commerce behavior contains a version of the world that no longer exists, and that mismatch is often a data quality issue before it’s a modeling issue. Proxy variables encoding demographic bias fall into this category too; the data is technically accurate, it’s just accurately recording a biased process.

Then there are pipeline-induced failures, which are uncomfortable because the team introduces them. Join errors that silently drop records; timezone inconsistencies between event logs and transaction tables; feature store values that are stale by hours or days relative to the label timestamp. These don’t show up as obvious corruption. They show up as mysteriously degraded model performance and very long debugging sessions.
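
A small illustration of the silent-join failure, using hypothetical tables; the same check generalizes to any merge you expect to be complete:

```python
import pandas as pd

events = pd.DataFrame({"user_id": ["a1", "b2", "c3"], "clicks": [5, 3, 9]})
labels = pd.DataFrame({"user_id": ["a1", "b2"], "converted": [1, 0]})

# An inner join would silently drop user "c3"; an outer join with indicator=True
# makes the loss visible before it turns into mysteriously degraded performance.
merged = events.merge(labels, on="user_id", how="outer", indicator=True)
unmatched = merged[merged["_merge"] != "both"]
print(f"{len(unmatched)} of {len(merged)} merged rows lack a match")
```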

The compounding problem is worth dwelling on. A 5% label error rate combined with 10% imputation errors doesn’t typically produce 15% degraded performance in any linear sense. It produces unpredictable interactions that are hard to attribute and harder to fix after the fact.

Auditing Before You Model

Treating data quality seriously starts with building an audit habit before model development begins; not a one-time checklist, but a repeatable profiling practice that runs every time you touch a new dataset or a dataset you haven’t touched in a while.

Four dimensions are worth profiling systematically:

Completeness is the obvious one, but it extends beyond counting NaN values. Structurally missing data — events that should exist in logs but don’t — is harder to detect and often more damaging. If your clickstream data shows no activity for a user segment during a two-hour window, is that real behavior or a logging gap? You can’t answer that without understanding the data generation process.
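
One way to probe for structural gaps is to count events per segment per hour and look for empty windows; a toy sketch with hypothetical clickstream data:

```python
import pandas as pd

# Hypothetical clickstream slice: one segment goes quiet for two hours.
clicks = pd.DataFrame({
    "event_ts": pd.to_datetime([
        "2024-03-01 09:05", "2024-03-01 09:40", "2024-03-01 12:10",
        "2024-03-01 09:15", "2024-03-01 10:20", "2024-03-01 11:30", "2024-03-01 12:05",
    ]),
    "segment": ["mobile", "mobile", "mobile", "web", "web", "web", "web"],
})

hourly = (
    clicks.set_index("event_ts")
          .groupby("segment")
          .resample("1h")
          .size()
)
# Hours with zero events are candidates for logging gaps rather than real behavior.
print(hourly[hourly == 0])
```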

Consistency catches the same entity represented differently across sources: customer IDs that use different formats between the CRM and the transaction table, units that differ between datasets (one team logs latency in milliseconds, another in seconds), categorical values with inconsistent capitalization or spelling variants. These issues rarely crash pipelines; they typically corrupt joins and aggregations in ways that are difficult to detect.
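
A few cheap probes catch most of this before a join ever runs; the tables below are hypothetical:

```python
import pandas as pd

# Hypothetical extracts from two systems that are supposed to share customer IDs.
crm  = pd.DataFrame({"customer_id": ["C-001", "C-002"], "plan": ["Pro", "pro"]})
txns = pd.DataFrame({"customer_id": ["c001", "c002"], "amount": [42.0, 17.5]})

id_format = r"C-\d{3}"
print(crm["customer_id"].str.fullmatch(id_format).mean())   # 1.0
print(txns["customer_id"].str.fullmatch(id_format).mean())  # 0.0 -> a join on customer_id matches nothing

# Case variants that would silently split one category into several.
print(crm["plan"].nunique(), crm["plan"].str.lower().nunique())  # 2 vs 1
```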

Accuracy is harder to measure because it requires ground truth. The practical approach is cross-validation against known-good subsets — records where you have high confidence in the labels or values — combined with domain expert spot-checks on a random sample. Spot-checks feel low-tech, but a domain expert reviewing 50 records often surfaces systematic errors that automated profiling misses.
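
The known-good-subset check is simple to mechanize; a sketch with hypothetical records, alongside pulling a fixed-seed sample for expert review:

```python
import pandas as pd

records = pd.DataFrame({
    "record_id": range(6),
    "label": ["pos", "neg", "neg", "pos", "neg", "pos"],
})
# Hypothetical gold subset: records whose labels were verified by hand.
gold = pd.DataFrame({"record_id": [0, 2, 4], "label_verified": ["pos", "pos", "neg"]})

checked = records.merge(gold, on="record_id")
agreement = (checked["label"] == checked["label_verified"]).mean()
print(f"agreement with known-good subset: {agreement:.0%}")

# A fixed-seed sample for a domain expert to spot-check by hand.
records.sample(n=min(50, len(records)), random_state=0).to_csv("spot_check.csv", index=False)
```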

Temporal validity is the dimension most commonly overlooked. Is this data still representative of the problem you’re solving? Concept drift is usually framed as a modeling problem, but it’s equally a data quality problem; the historical data you’re training on accurately reflects a world that no longer exists. Asking “when was this data collected, and what has changed since then?” should be a standard part of any audit.

Tools like Great Expectations, ydata-profiling, and deequ for Spark pipelines make systematic profiling faster. The tool matters less than the habit. What matters more is what you do with the findings: a data quality scorecard — a lightweight document tracking known issues, their estimated impact on model behavior, and mitigation status — becomes institutional memory. Without it, the same issues get rediscovered by the next person who touches the dataset, or worse, they get silently inherited by a production model.
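
The scorecard needs no special tooling; a hypothetical, version-controlled entry like the one below is enough, whether it lives in Python, YAML, or a wiki table:

```python
# A hypothetical scorecard entry -- the fields matter more than the format.
scorecard = [
    {
        "dataset": "orders_v3",
        "issue": "order_ts recorded in local time before 2023-06, UTC afterwards",
        "estimated_impact": "time-of-day features shifted for ~9 months of history",
        "mitigation": "normalize to UTC at ingestion",
        "status": "open",
        "owner": "data-eng",
    },
]
```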

One organizational note: assessment findings need to leave the notebook. A data quality issue that lives in a Jupyter cell comment is invisible to the team. Data integrity is a team sport, and that requires communication.

Data Cleaning Without Destroying Signal

Once you’ve profiled the data, the temptation is to clean aggressively. Resist it. Aggressive data cleaning can remove signal. Outliers are sometimes the most important data points in the dataset; removing everything beyond three standard deviations is statistically tidy but can be analytically destructive.

Imputation may introduce systematic bias if the assumptions don’t match the missingness mechanism. The first principle is to document every transformation. Cleaning decisions made under deadline pressure become invisible technical debt. A versioned transformation log — even a simple one — means that when a model behaves unexpectedly in production six months later, you can typically trace back exactly what was done to the training data and why. Without it, you’re debugging blind.
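
A transformation log can be as simple as appending one record per cleaning step and committing the result alongside the training code; a minimal sketch:

```python
import json
from datetime import datetime, timezone

transformation_log = []

def log_step(description: str, rows_before: int, rows_after: int) -> None:
    # One reviewable record per cleaning decision made on the training data.
    transformation_log.append({
        "when": datetime.now(timezone.utc).isoformat(),
        "step": description,
        "rows_before": rows_before,
        "rows_after": rows_after,
    })

# Example usage after a hypothetical cleaning step:
# cleaned = raw.dropna(subset=["price"])
# log_step("dropped rows with missing price", len(raw), len(cleaned))

with open("transformation_log.json", "w") as f:
    json.dump(transformation_log, f, indent=2)
```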

The second principle is to treat imputation as a modeling decision, not a preprocessing step. Mean imputation, model-based imputation, and indicator variables for missingness each carry different assumptions: mean imputation assumes the values are missing completely at random; model-based imputation assumes the missing values can be predicted from the features you do observe; an explicit missingness indicator assumes the absence itself may carry signal.

The mechanism matters; choose accordingly.
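
As one concrete illustration, scikit-learn lets you pair a fill strategy with an explicit missingness indicator, which keeps the "absence is signal" option open:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [np.nan, 40.0]])

# Mean imputation plus indicator columns, so a downstream model can learn from
# the fact that a value was absent rather than only from the filled-in mean.
imputer = SimpleImputer(strategy="mean", add_indicator=True)
X_imputed = imputer.fit_transform(X)
print(X_imputed.shape)  # (4, 4): two original columns + two missingness indicators
```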

The third principle is to preserve a raw snapshot. Generally maintain an untouched version of the data before any cleaning. You will likely need to re-examine cleaning decisions when a model fails in unexpected ways, and you cannot do that if the original data is gone.

Class imbalance deserves a brief mention here because it’s usually treated as a modeling problem when it’s often a data quality signal. Before reaching for oversampling or undersampling, ask why the imbalance exists. If fraud represents 0.1% of transactions in your training data because the labeling process missed a class of fraud, resampling may amplify rather than fix the biased signal. Understanding the imbalance’s origin changes what you should do about it.

Finally, clean in collaboration with domain experts wherever possible. A data scientist might impute a missing hospital admission date using the surrounding records; a clinician knows that a missing admission date may mean the patient never arrived. That distinction can change the label entirely.

Validation in Pipelines and Production

The bigger mistake is treating data quality as a pre-training gate rather than a continuous concern. By the time a model is in production, the data feeding it has typically already started to drift. The question is whether you have the infrastructure to notice.

Data validation belongs in CI/CD pipelines alongside code tests. Schema checks catch structural regressions; statistical distribution checks catch subtler shifts — a feature whose mean has moved significantly since last week, a categorical variable that has gained a new value the model has never seen. These checks don’t replace monitoring, but they often catch regressions before deployment rather than after.
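
These checks don’t require a heavyweight framework; even plain pytest-style assertions over an incoming batch catch a lot. A sketch, with hypothetical columns and an intentionally crude shift threshold:

```python
import pandas as pd

EXPECTED_COLUMNS = {"user_id": "object", "amount": "float64", "country": "object"}

def test_schema(batch: pd.DataFrame) -> None:
    # Structural regressions: missing columns or changed dtypes fail the pipeline.
    for col, dtype in EXPECTED_COLUMNS.items():
        assert col in batch.columns, f"missing column: {col}"
        assert str(batch[col].dtype) == dtype, f"{col} dtype changed"

def test_distribution(batch: pd.DataFrame, reference: pd.DataFrame) -> None:
    # Subtler shifts: a mean that moved too far, or a category the model has never seen.
    shift = abs(batch["amount"].mean() - reference["amount"].mean())
    assert shift < 3 * reference["amount"].std(), "amount mean shifted"
    new_values = set(batch["country"]) - set(reference["country"])
    assert not new_values, f"unseen categories: {new_values}"
```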

In production, tracking input feature distributions over time gives you early warning before model performance degrades visibly. A shift in the data is often detectable before it propagates into prediction errors. If you’re only monitoring model outputs, you’re reacting; if you’re monitoring inputs, you have a better chance of getting ahead of it.
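
Population stability index is one common way to quantify input drift; a rough, self-contained sketch (the conventional 0.2 alert threshold is a rule of thumb, not a law):

```python
import numpy as np

def population_stability_index(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Crude PSI for one numeric feature; values above ~0.2 are commonly flagged as drift."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)
    ref_pct = np.clip(ref_pct, 1e-6, None)  # avoid log(0)
    cur_pct = np.clip(cur_pct, 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

rng = np.random.default_rng(0)
train_feature = rng.normal(0.0, 1.0, 10_000)
live_feature = rng.normal(0.4, 1.0, 10_000)  # simulated shift in production inputs
print(population_stability_index(train_feature, live_feature))
```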

The feedback loop runs the other direction too. Unexplained prediction errors are often data quality signals. Building a process to investigate model failures — not just log them — creates a mechanism for surfacing upstream problems that automated checks may miss. Data contracts deserve attention here: formal agreements between data producers and consumers about expected schema, freshness, and quality thresholds. They’re gaining traction in data engineering precisely because they make implicit assumptions explicit and create accountability across team boundaries.
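
A contract doesn’t have to be elaborate; even a hypothetical sketch like the one below, reviewed by producer and consumer alike, makes the expectations concrete (real implementations often live in YAML):

```python
# Hypothetical data contract for an "orders" dataset, expressed as plain configuration.
orders_contract = {
    "dataset": "orders",
    "owner": "data-eng",
    "schema": {"order_id": "string", "user_id": "string", "amount": "float", "order_ts": "timestamp"},
    "freshness": "partitions land within 2 hours of event time",
    "quality_thresholds": {
        "null_rate_amount": 0.01,
        "duplicate_order_id_rate": 0.0,
    },
    "breaking_change_policy": "30 days notice to consumers",
}
```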

The Organizational Problem

Most data quality problems aren’t technical. They’re incentive problems. Data is collected by systems optimized for operations — CRMs, clickstream trackers, electronic health records — where quality is rarely a stated goal. The collecting system doesn’t care whether the timestamp format is consistent across environments; it cares whether the form submission succeeded.

ML teams typically don’t own the pipelines they depend on. This creates a gap in accountability that no amount of technical tooling fully closes. The practical implication is that ML engineers need to become advocates upstream, not just consumers downstream. That means building working relationships with data engineering, product, and domain teams; understanding how the data is generated, where the known issues are, and what changes are coming.

A recurring data quality review between ML and data engineering — even 30 minutes monthly — can create shared ownership that doesn’t exist otherwise. This isn’t soft advice. It’s a significant force multiplier on every technical improvement described above, because even the most comprehensive monitoring system has limited impact if the team producing the data doesn’t know what you need from it.

Start Before You Open the Notebook

Before you open a Jupyter notebook this week, spend 30 minutes profiling the dataset you’re about to use as if you’ve never seen it before. Run ydata-profiling. Check the label distribution. Look at the timestamp range. Find one thing that surprises you. There’s almost always something. The question is whether you find it before training or after deployment.
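
One possible version of that 30 minutes, assuming a tabular dataset with hypothetical label and event_ts columns:

```python
import pandas as pd
from ydata_profiling import ProfileReport  # pip install ydata-profiling

df = pd.read_parquet("training_data.parquet")  # hypothetical path to the dataset at hand

ProfileReport(df, title="Pre-modeling audit").to_file("audit.html")
print(df["label"].value_counts(normalize=True))    # label distribution
print(df["event_ts"].min(), df["event_ts"].max())  # timestamp range
```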
