Data Validation Techniques: What Catches Bad Data Early

The data validation techniques that actually stop bad data hitting production: schema validation, business-rule checks, profiling, and validating LLM outputs.

Updated May 2026. Refreshed with the validation libraries we use in 2026, including patterns for validating LLM outputs that simply did not exist in earlier versions of this guide.

Bad data is cheaper to catch at the door than to fix in production. Every dollar spent at validation time saves something like ten dollars in remediation, downstream re-runs, and the customer-facing apologies that follow. The numbers vary; the asymmetry is real. Validation is the cheapest form of insurance most data teams have available.

We are Osher Digital, an automation and AI consultancy in Brisbane. We build pipelines that move data between SaaS systems, custom backends, and AI agents for clients in healthcare, recruitment, finance, and professional services. The patterns below are what we actually use. Not every technique applies to every project. The ones that consistently pay off are the boring ones, applied early and consistently.

This guide covers the validation techniques that earn their keep, the modern tooling worth knowing about, and the new wrinkle that AI workflows have added: validating outputs from systems that are non-deterministic by design. For broader context on data quality programs, see our work on data integration and our automation consulting work.


Why validation is not optional in 2026

Three forces have made validation harder than it used to be.

SaaS schemas drift faster. Vendors push API changes monthly. The integration that worked last quarter may be silently dropping a new field this quarter. Without schema validation on inbound payloads, you find out when someone in finance asks why a number looks wrong.

Pipelines run more often. The nightly batch is dead. Streaming, event-driven, and minute-level scheduled jobs mean validation has to be cheap enough to run on every record without slowing the pipeline. Manual review at the end of the day is no longer an option.

AI is now part of the data stack. LLMs produce structured outputs that look correct and are sometimes nonsense. The old assumption that “data from a deterministic system is correct unless proven otherwise” does not hold for an extraction agent that hallucinated a totals field. Validation is now a runtime check on AI behaviour, not just a defence against typos.

The combination of these three pressures means validation has moved from “nice to have” to “the difference between a pipeline you trust and one you do not.”


Layer 1. Schema validation

The first layer is the cheapest and the most valuable. Every payload entering or leaving a system is validated against an explicit schema. The schema enforces field names, types, nullability, and basic constraints (string length, number ranges, enum membership).

The libraries we reach for, by language:

  • Python: Pydantic v2. Fast, mature, the de facto standard. Pairs naturally with FastAPI and the Anthropic and OpenAI SDKs.
  • TypeScript: Zod. Schema-first with type inference. Hono and tRPC integrate it cleanly.
  • JSON Schema: the cross-language fallback. Use when the schema needs to be shared across services or stored as a contract.
  • Database side: Postgres CHECK constraints, NOT NULL, foreign keys. The last line of defence and the one that survives application-layer bugs.

A simple Pydantic model that catches most of what schema validation should catch:

from datetime import date
from typing import Literal
from pydantic import BaseModel, EmailStr, Field, PositiveInt

class Customer(BaseModel):
    customer_id: PositiveInt
    email: EmailStr
    full_name: str = Field(min_length=2, max_length=200)
    country: Literal["AU", "NZ", "US", "UK"]
    signup_date: date
    tier: Literal["free", "pro", "enterprise"]

Six lines of constraint definitions catch field-name typos, missing fields, malformed emails, country codes the system was not designed for, and tier values that have drifted. Almost no validation code we write is more elaborate than this for the schema layer.


Layer 2. Business-rule validation

Schema validation catches structural problems. It does not catch a discount that is greater than the order total, or a project end date before its start date, or an invoice line that does not sum to the invoice total.

This is where domain knowledge enters the code. The rules are not generic. They are written from a list of business constraints a stakeholder gave you. The discipline is in capturing them somewhere they can be reviewed, not buried inside an if-statement nobody finds when the rule changes.

Pydantic v2 validators handle the per-record case cleanly:

from pydantic import BaseModel, model_validator

class Invoice(BaseModel):
    subtotal: float
    discount: float
    total: float

    @model_validator(mode="after")
    def check_totals(self):
        if self.discount > self.subtotal:
            raise ValueError("Discount cannot exceed subtotal")
        expected = round(self.subtotal - self.discount, 2)
        if abs(self.total - expected) > 0.01:
            raise ValueError(f"Total {self.total} does not match {expected}")
        return self

Two rules, both real, both seen in client data: discount exceeded subtotal, and total did not match subtotal minus discount. Each one is the kind of error that survives schema validation and lands as a finance team query at end of month.

For pipeline-scale rules (thousands or millions of rows), Great Expectations or its commercial sibling is still the right tool. Express the rules as expectations, run them as part of the pipeline, and fail loudly when expectations are not met. The test report doubles as data documentation.
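To make the "express the rules, run them, fail loudly" idea concrete without depending on any particular framework, here is a minimal standard-library sketch. It is not the Great Expectations API. The Rule class, the RULES list, and the field names are hypothetical, chosen to mirror the invoice example above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    """A named, reviewable business rule (hypothetical structure)."""
    name: str
    description: str
    check: Callable[[dict], bool]  # returns True when the row passes


# Illustrative rules only; real ones come from the stakeholder's constraint list.
RULES = [
    Rule(
        "discount_within_subtotal",
        "Discount cannot exceed subtotal",
        lambda r: r["discount"] <= r["subtotal"],
    ),
    Rule(
        "total_matches",
        "Total equals subtotal minus discount, to the cent",
        lambda r: abs(r["total"] - (r["subtotal"] - r["discount"])) <= 0.01,
    ),
]


def run_rules(rows, rules=RULES):
    """Return {rule name: [indexes of failing rows]} for rules with failures."""
    failures = {}
    for index, row in enumerate(rows):
        for rule in rules:
            if not rule.check(row):
                failures.setdefault(rule.name, []).append(index)
    return failures


def assert_rules(rows, rules=RULES):
    """Fail loudly: raise if any rule has at least one failing row."""
    failures = run_rules(rows, rules)
    if failures:
        raise ValueError(f"Business rules failed: {failures}")
```

Keeping each rule as a named object with a plain-language description means the list itself can be handed to a stakeholder for review, which is the point of not burying rules in if-statements.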


Layer 3. Cross-record and aggregate checks

Some bad data is invisible at the record level. A duplicate customer record passes every per-record check. A sudden 30 percent drop in daily volume cannot be seen if you only look at one record at a time. A sales region missing entirely from today’s batch looks like every other day at the row level.

The aggregate checks we run on most pipelines:

  • Row count vs. expected range (today’s batch should be within ±25 percent of the trailing 7-day average; alert otherwise)
  • Distinct values per categorical field (alert if a known category disappears)
  • Null rate per field (alert if a field that is usually 5 percent null hits 50 percent today)
  • Duplicate detection on natural keys
  • Aggregate sum reconciliation against the source system (the total amount in the destination should match the total amount in the source within a small tolerance)

None of these catch every problem. Together they catch most of the silent corruption that no per-record validation will find. We have caught vendor-side incidents this way before the vendor noticed.
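The checks in the list above are each a few lines of code. The following is a rough standard-library sketch, assuming rows arrive as a list of dicts. The function names, field names, and default tolerances are illustrative assumptions, not a fixed recipe.

```python
from collections import Counter
from statistics import mean


def row_count_in_range(today_count, trailing_counts, tolerance=0.25):
    """True if today's count is within +/- tolerance of the trailing average."""
    baseline = mean(trailing_counts)
    return abs(today_count - baseline) <= tolerance * baseline


def null_rate(rows, field):
    """Fraction of rows where the field is missing or None."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(field) is None)
    return missing / len(rows)


def missing_categories(rows, field, expected):
    """Known categories that do not appear anywhere in this batch."""
    seen = {row.get(field) for row in rows}
    return set(expected) - seen


def duplicate_keys(rows, key_fields):
    """Natural-key tuples that occur more than once."""
    counts = Counter(tuple(row.get(f) for f in key_fields) for row in rows)
    return [key for key, n in counts.items() if n > 1]


def sums_reconcile(source_total, destination_total, tolerance=0.01):
    """True if source and destination totals agree within the tolerance."""
    return abs(source_total - destination_total) <= tolerance
```

In practice the trailing counts and source totals would come from wherever the pipeline already records run metadata; the sketch only shows the comparisons themselves.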


Layer 4. Validating LLM outputs

This layer is new. It barely existed when most data validation guides were written. It is now the layer that matters most for any pipeline that has an AI agent or extraction step in it.

LLM outputs fail in three ways that traditional validation does not catch:

Hallucinated values. The model returned a totals field. The number is wrong. Schema validation passes. Business-rule validation passes if the number is internally consistent. The only check that catches it is comparison against the source document.

Format compliance under stress. Most modern models produce structured output reliably most of the time. Edge cases (long documents, ambiguous inputs, retries after errors) produce malformed JSON or schema-violating fields more often than the happy path suggests. Always validate the model output against a schema at runtime.

Confident wrong answers. The model returns the wrong category, the wrong supplier name, the wrong invoice number. Confidence is high. The only way to catch this systematically is with eval data and a sampling-based human review.
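One way to implement sampling-based review is to pick records deterministically from a hash of their ID, so the same record is always in or out of the sample regardless of when the job runs. This is a sketch under that assumption; the function name and the 3 percent default are hypothetical.

```python
import hashlib


def selected_for_review(record_id: str, sample_rate: float = 0.03) -> bool:
    """Deterministically select roughly `sample_rate` of records for human review."""
    digest = hashlib.sha256(record_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate
```

Hash-based selection avoids storing sampling state and makes reruns reproducible; a random draw per record would also work if reproducibility does not matter.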

The patterns we use:

  • Native structured-output features (Anthropic tool use, OpenAI structured outputs) so the model is constrained to a schema at generation time, not just validated after
  • Pydantic models that the SDK validates the response against; failed validation triggers a retry with the validation error fed back to the model
  • Source-anchored verification on numerical fields. Extracted invoice total must match the sum of extracted line items. If it does not, escalate to human review.
  • Sampled human review on 2 to 5 percent of outputs, with results fed back into a small eval suite that runs on every model or prompt change
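The retry-with-feedback and source-anchored patterns in the list above can be sketched without any SDK. In this illustration call_model is a hypothetical callable that takes a prompt and returns the raw model text, and the hand-rolled parse_invoice stands in for what a Pydantic model would do. The field names are assumptions.

```python
import json
from typing import Callable

# Hypothetical minimal schema: field name -> accepted Python type(s).
REQUIRED_FIELDS = {"invoice_number": str, "total": (int, float), "line_items": list}


def parse_invoice(raw: str) -> dict:
    """Parse and structurally validate model output; raise ValueError on failure."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("Output must be a JSON object")
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            raise ValueError(f"Missing field: {field}")
        if not isinstance(data[field], expected_type):
            raise ValueError(f"Wrong type for field: {field}")
    for item in data["line_items"]:
        if not isinstance(item, dict) or not isinstance(item.get("amount"), (int, float)):
            raise ValueError("Each line item needs a numeric amount")
    return data


def extract_with_retry(call_model: Callable[[str], str], prompt: str,
                       max_attempts: int = 3) -> dict:
    """Call the model, feeding each validation error back into the next attempt."""
    feedback = ""
    for _ in range(max_attempts):
        raw = call_model(prompt + feedback)
        try:
            return parse_invoice(raw)
        except ValueError as exc:
            feedback = f"\n\nYour previous reply was rejected: {exc}. Return corrected JSON only."
    raise ValueError(f"No valid output after {max_attempts} attempts")


def needs_human_review(invoice: dict, tolerance: float = 0.01) -> bool:
    """Source-anchored check: the extracted total must match the sum of extracted lines."""
    line_sum = sum(item["amount"] for item in invoice["line_items"])
    return abs(invoice["total"] - line_sum) > tolerance
```

Structural failures trigger a retry because the model can usually fix them when told what went wrong. A total that does not match its line items is escalated instead, since another attempt may simply produce a different wrong number.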

For a deeper take on AI agent design and where these checks fit, see our piece on building AI agents in Python.


Data profiling: the before-you-start step

Before you can validate data sensibly, you have to know what it actually looks like. Data profiling is a one-off (or periodic) deep look at a dataset to capture distributions, null rates, value frequencies, and outliers. The output drives the validation rules you write.

Tools we use:

  • ydata-profiling (formerly pandas-profiling) for ad-hoc Python profiling. Generates an HTML report you can hand to a stakeholder.
  • DuckDB for fast SQL-based profiling against CSVs or Parquet without spinning up a database.
  • Soda Core or Great Expectations for ongoing profile-based validation.

The discipline. Profile every new data source before you write a single line of integration code. The 30 minutes spent on profiling saves hours of debugging built on wrong assumptions about what the data contains.
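For a first look at a CSV before reaching for a dedicated tool, a few lines of standard-library Python already give null rates, distinct counts, and the most common values per column. This is a minimal sketch, not a substitute for the tools listed above. The profile_csv name is hypothetical, and it treats empty strings as nulls, which is an assumption that may not suit every source.

```python
import csv
from collections import Counter


def profile_csv(source, top_n=3):
    """Per-column null rate, distinct count, and most common values.

    `source` is any open text file or iterable of CSV lines with a header row.
    """
    rows = list(csv.DictReader(source))
    columns = rows[0].keys() if rows else []
    profile = {}
    for column in columns:
        values = [row[column] for row in rows]
        # Assumption: an empty string means the value is missing.
        present = [v for v in values if v not in (None, "")]
        profile[column] = {
            "null_rate": 1 - len(present) / len(values),
            "distinct": len(set(present)),
            "top_values": Counter(present).most_common(top_n),
        }
    return profile
```

Called with an open file, for example `profile_csv(open("customers.csv", newline=""))` where the filename is a placeholder, the result is a plain dict that can be printed or diffed against last month's profile.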


Where in the pipeline to validate

Validate at every boundary. The cost of validation is low. The cost of bad data crossing a boundary undetected is high.

The boundaries that matter:

  • User input (web forms, CSV uploads, API requests)
  • Inbound integrations (webhooks, API polling, file drops)
  • LLM and AI tool outputs
  • Cross-stage handoffs in multi-step pipelines (extraction → transformation → load)
  • Database writes (DDL constraints as the floor)

If you only validate at one boundary, validate at ingestion. The earlier bad data is rejected, the smaller the blast radius. By the time bad data has propagated through three downstream systems, the cleanup costs have already exceeded what validation would have prevented.
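The database boundary is the easiest one to demonstrate. The sketch below uses Python's built-in sqlite3 as a stand-in for Postgres so that it runs anywhere; the NOT NULL and CHECK constraint syntax shown is standard SQL, but the table and column names are hypothetical.

```python
import sqlite3

# In-memory SQLite database, used here only as a stand-in for Postgres.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE invoice (
        invoice_id     INTEGER PRIMARY KEY,
        customer_email TEXT NOT NULL,
        subtotal       REAL NOT NULL CHECK (subtotal >= 0),
        discount       REAL NOT NULL CHECK (discount >= 0),
        CHECK (discount <= subtotal)
    )
    """
)


def try_insert(row):
    """Insert one (email, subtotal, discount) row; return the error text if rejected."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO invoice (customer_email, subtotal, discount) VALUES (?, ?, ?)",
                row,
            )
        return None
    except sqlite3.IntegrityError as exc:
        return str(exc)
```

Even if an application bug skips every earlier check, a row with a discount larger than its subtotal or a missing email never reaches the table.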


When validation is overkill

Three cases where the validation effort exceeds the payback.

Throwaway one-off scripts. The data load that runs once for a migration, by a developer who watches it complete, and is then deleted. Quick null-and-type checks are fine. A full validation framework is overkill.

Read-only analytical reports where the source is already trusted. Pulling data from a well-governed warehouse to build a dashboard. The warehouse already validates. Adding another layer in the dashboard tool just slows it down.

Pre-pilot prototypes. The five-line script that proves whether an idea has legs. Building validation into a prototype kills the prototype. Build it correctly when the prototype graduates.

Conversely, anything that touches money, customers, regulated data, or feeds another system needs validation. The asymmetry of cost between catching errors and explaining them later is the test.


Failure handling and dead-letter queues

Validation is half the job. The other half is what happens to records that fail.

Three patterns we use:

Reject and surface immediately. The right pattern for user input. The user sees the validation message, fixes the data, resubmits.

Quarantine to a dead-letter queue. The right pattern for batch and integration data. Failed records go to a separate store with the validation error attached. A dashboard shows the quarantine. A human reviews and either fixes the data, fixes the validation rule, or marks the record as legitimately bad.

Soft fail with alerting. The right pattern for known-noisy sources where you cannot afford to lose records but need awareness. The record loads, but a flag is set and an alert fires for review.

Silently dropping records is never the right answer. We have inherited pipelines that were happily losing 5 to 10 percent of their input for years before anyone noticed. Whatever your handling pattern, make sure failed records leave a trail.
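A quarantine step does not need much machinery. The sketch below assumes a validate callable that raises ValueError on bad records and uses in-memory lists in place of a real dead-letter store. The route_records and require_email names are hypothetical.

```python
from datetime import datetime, timezone


def route_records(records, validate):
    """Split records into accepted and quarantined; nothing is dropped."""
    accepted, quarantined = [], []
    for record in records:
        try:
            validate(record)
        except ValueError as exc:
            # Keep the original record and the reason it failed, so a human
            # can fix the data, fix the rule, or mark it as legitimately bad.
            quarantined.append({
                "record": record,
                "error": str(exc),
                "failed_at": datetime.now(timezone.utc).isoformat(),
            })
        else:
            accepted.append(record)
    return accepted, quarantined


def require_email(record):
    """Illustrative validator: reject records without a plausible email."""
    if "@" not in str(record.get("email", "")):
        raise ValueError("email is missing or malformed")
```

The useful invariant is that the accepted and quarantined counts always add up to the input count, which is cheap to assert at the end of every run.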


Ready to fix data quality at the source?

If you have a pipeline that is producing surprises and you want a fresh pair of eyes on the validation gaps, that is the kind of work we do most weeks. We will profile the data, identify the failure modes, and build the validation layer that catches them. Get in touch or book a call. For more on the systems side, see our AI consulting work.


Frequently Asked Questions

What are the main data validation techniques?

Schema validation (types, formats, nullability), business-rule validation (cross-field and domain constraints), aggregate validation (row counts, distributions, reconciliation against source), and increasingly LLM output validation for AI pipelines. The right combination is layered: cheap checks at every boundary, expensive checks where the data is most valuable.

What is a data validation framework?

A library or platform that lets you express validation rules in code or config, run them automatically against data, and report on results. In Python the practical choices are Pydantic for per-record validation and Great Expectations or Soda for pipeline-scale validation. For TypeScript, Zod plays the same role as Pydantic. JSON Schema is the cross-language fallback when contracts are shared between services.

What are the types of validation checks?

The categories that come up most often: type checks (is this a string or a number), format checks (does this match the expected pattern), range checks (is this number within plausible bounds), completeness checks (is this required field populated), uniqueness checks (no duplicates), referential checks (does the foreign key resolve), business-rule checks (does the discount exceed the subtotal), and aggregate checks (does this batch’s row count look right).

How do I validate data produced by an AI model?

Use the model’s structured-output mode (Anthropic tool use, OpenAI structured outputs) so the response is constrained to a schema. Validate the response against that schema with Pydantic; on validation failure, retry with the error fed back to the model. For numerical or factual fields, anchor the output to the source (extracted invoice total must equal the sum of extracted lines). For everything else, run periodic human review on a sample and feed the results into an eval suite.

What are data validation methods in research?

Research data validation tends to focus on construct validity (does the measure capture what it claims to measure), internal consistency (do the items agree), reliability across measurements, and outlier detection. The technical patterns overlap with engineering data validation: schema enforcement, range checks, completeness checks, and statistical profiling. The difference is in interpretation rather than mechanics.

How much does setting up data validation cost?

For a single integration or pipeline, adding Pydantic-style schema validation and a handful of business-rule checks is a few hours of development and effectively no ongoing cost. For a whole platform with Great Expectations or Soda, expect 1 to 4 weeks of initial setup and ongoing maintenance proportional to how often the underlying data changes. AUD-priced consulting engagements for validation framework implementations typically run $5,000 to $30,000 depending on scope.

Where in a pipeline should I validate?

At every boundary that matters: ingestion, every cross-system handoff, and the final write to a system of record. If you can only validate in one place, validate at ingestion. The earlier bad data is caught, the smaller the cleanup. Add database-level constraints (NOT NULL, CHECK, foreign keys) as the irreducible floor: they survive application bugs.

What should I do with records that fail validation?

For user input, surface the validation error immediately and let the user fix it. For batch and integration data, send failed records to a quarantine or dead-letter queue with the validation error attached, then review and either fix the data, adjust the validation rule, or accept the rejection. Never silently drop failed records; you will discover months later that the pipeline has been losing data the whole time.
