Data Engineering

Observability Was Built for Humans. Agents Need Reliability.

Thiru Arunachalam, Founder & CEO, WALT
June 19, 2026

A wrong number reaches the dashboard at 9 AM. By 9:05 AM, an agent has used it to re-rank a supplier list, fire three alerts, and update a forecast. Nobody caught it, because the dashboard was green.

Data failures are often silent. Schemas drift and sources change shape without warning. A join that worked last quarter goes stale when an upstream team renames a column at 2 AM.

The surface area of things that can be subtly wrong in a pipeline is far larger than in a service, and the failures rarely announce themselves. Broken data returns a number that looks fine, lands in a report, and shapes a decision before anyone thinks to ask. That is why data quality, not code, is the dominant failure mode.

In my Apple days, roughly half of production support tickets were data quality issues. Most large data organizations are in a similar shape, even if they don't measure it. Gartner estimates that poor data quality costs organizations an average of $12.9 million a year.

The default conclusion is "we need better tools." The right conclusion is "we have a paradigm problem," as tools alone aren’t going to fix the issue.

What observability gave us, and where it stops

The industry's answer to data quality failure was observability, and it helped. We got freshness checks, volume anomaly detection, lineage maps, schema-change alerts, and dashboards that turn red when something drifts.

Every one of those is built around the same assumption: a human is in the loop reading the alert, deciding what it means, and routing it to whoever can fix it. That assumption doesn’t hold in the era of autonomous agents.

Why observability is the wrong primitive for the agent era

Agents don't read dashboards. They query the data, take the numbers at face value, and make a decision in milliseconds.

That removes the one thing the whole model depended on: the human catch. Telling an agent that a metric looks anomalous is useless, because no one is reading the alert. The agent needs the data layer either to hold the bad value back, or to hand it over with a confidence signal it can weigh.
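To make the two options concrete, here is a minimal sketch of the agent's side of that exchange. Everything in it is an assumption for illustration: the `ServedValue` shape, the `agent_decide` function, the 0.9 threshold, and the metric name are hypothetical, not an existing API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ServedValue:
    """A value as handed to an agent (hypothetical shape, not a real API)."""
    name: str
    value: Optional[float]  # None means the data layer held the value back
    confidence: float       # 0.0 to 1.0, assigned by the data layer
    reason: str = ""        # machine-readable explanation when degraded


def agent_decide(served: ServedValue, min_confidence: float = 0.9) -> str:
    """Return the action an agent takes for one served value."""
    if served.value is None:
        # Option 1: the data layer withheld the bad value.
        return "skip: value withheld (" + served.reason + ")"
    if served.confidence < min_confidence:
        # Option 2: the value arrived, but with a signal the agent can weigh.
        return "defer: confidence %.2f below %.2f" % (served.confidence, min_confidence)
    return "act"


good = ServedValue("supplier_on_time_rate", 0.97, confidence=0.99)
shaky = ServedValue("supplier_on_time_rate", 0.41, confidence=0.55, reason="volume_anomaly")
held = ServedValue("supplier_on_time_rate", None, confidence=0.0, reason="schema_contract_violated")

for served in (good, shaky, held):
    print(agent_decide(served))
```

The point of the sketch is that the decision happens in code, at read time, with no human reading an alert in between.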

That’s why observability won’t do: it’s descriptive, whereas agents need commitments.

We need to shift from data observability to reliability.

Reliability borrows from SRE thinking: explicit service-level objectives and error budgets on data products, automatic circuit-breaking when a contract is violated, served confidence intervals, and machine-actionable lineage. The data layer commits to behaviors rather than just exposing signals. Here's what this looks like in practice:

- Data products with published SLOs (freshness, completeness, accuracy).

- Contract enforcement at serve time, not just at ingest.

- Confidence metadata travels with every value.

- Failures degrade gracefully and machine-readably.

- Agents can query "is this safe to act on?" and get a real answer.

The agent stops asking "what does this number say" and starts asking "can I trust this value."
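A few of the ideas above (published SLOs, contract enforcement at serve time, machine-readable degradation, and an error budget) can be sketched in a few dozen lines. This is an illustration under stated assumptions, not a reference implementation: the `SLO` and `DataProduct` classes, the thresholds, and the response fields are all hypothetical, accuracy checks are left out, and the value is withheld per request rather than by a stateful circuit breaker.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class SLO:
    """Published objectives for one data product (hypothetical thresholds)."""
    max_staleness: timedelta  # freshness objective
    min_completeness: float   # share of expected rows present, 0.0 to 1.0
    target: float             # share of serve-time checks that must pass (< 1.0)


@dataclass
class DataProduct:
    name: str
    slo: SLO
    checks_total: int = 0
    checks_failed: int = 0

    def serve(self, value, loaded_at, completeness, now):
        """Check the contract at serve time and answer 'is this safe to act on?'."""
        violations = []
        if now - loaded_at > self.slo.max_staleness:
            violations.append("stale")
        if completeness < self.slo.min_completeness:
            violations.append("incomplete")
        self.checks_total += 1
        if violations:
            self.checks_failed += 1
            # Degrade machine-readably: withhold the value and say why.
            return {"product": self.name, "safe_to_act": False,
                    "value": None, "violations": violations}
        return {"product": self.name, "safe_to_act": True,
                "value": value, "violations": []}

    def error_budget_remaining(self) -> float:
        """Fraction of allowed failures still unspent; negative means overspent."""
        if self.checks_total == 0:
            return 1.0
        allowed_rate = 1.0 - self.slo.target
        observed_rate = self.checks_failed / self.checks_total
        return 1.0 - observed_rate / allowed_rate


now = datetime(2026, 6, 19, 9, 0, tzinfo=timezone.utc)
revenue = DataProduct("daily_revenue", SLO(timedelta(hours=6), 0.98, target=0.95))

fresh = revenue.serve(1250.0, loaded_at=now - timedelta(hours=1), completeness=0.999, now=now)
stale = revenue.serve(1250.0, loaded_at=now - timedelta(hours=30), completeness=0.90, now=now)

print(fresh["safe_to_act"], fresh["value"])
print(stale["safe_to_act"], stale["violations"])
print(revenue.error_budget_remaining() < 0)  # budget overspent after one failure in two checks
```

The design choice worth noticing is that the check runs when the value is served, so a table that was fine at ingest but has since gone stale is still caught before an agent acts on it.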

What this asks of data organizations

This is a shift in what you sell internally. Stop selling dashboards as the deliverable. Start selling reliability guarantees, and budget for the engineering it takes to honor them.

Skepticism is fair here. Reliability can sound like one more tool for a stack that already groans under too many. It is not. Reliability layers onto the freshness checks, lineage, and quality signals you already pay for, and it works with the observability stack you have rather than replacing it.

A common question we get is, “Would this replace my data engineering team?” It won’t. It replaces the grunt work of chasing a late-night pipeline break, tracing lineage by hand, and arguing over which "revenue" is the real revenue. That is exactly the work an agent should own.

How WALT’s Operator helps you with platform reliability

Your data engineers move from firefighting to building, leaving WALT's Operator to detect, diagnose, and fix data quality issues.

Here are some fires WALT’s Operator has put out for our clients:

1. A wrong dashboard number: The Operator traced the lineage from dashboard to mart to staging to source in seconds, found the broken pipeline, fixed it, and backfilled.

2. A 3 AM Black Friday Kafka lag: It auto-scaled, backfilled, and validated, so the engineer read the post-mortem at standup and the business impact was zero.

3. 847 alerts in a month: It surfaced the 12 that were real and dropped the other 835, because it monitors only the attributes downstream consumers actually query.
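The idea behind the third example, alerting only on attributes that downstream consumers actually query, can be illustrated with a small sketch. This is not a description of how the Operator is implemented; the `Alert` shape, the `triage` function, and the table and column names are hypothetical, and the set of queried columns is assumed to come from something like warehouse query logs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    table: str
    column: str
    kind: str  # e.g. "null_rate", "schema_change", "volume"


def triage(alerts, queried_columns):
    """Split alerts into (actionable, suppressed) by downstream usage."""
    actionable, suppressed = [], []
    for alert in alerts:
        if (alert.table, alert.column) in queried_columns:
            actionable.append(alert)
        else:
            suppressed.append(alert)
    return actionable, suppressed


# Hypothetical usage data: (table, column) pairs that consumers actually read.
queried = {("orders", "amount"), ("orders", "customer_id")}
alerts = [
    Alert("orders", "amount", "null_rate"),
    Alert("orders", "legacy_flag", "null_rate"),
    Alert("orders_tmp", "amount", "volume"),
]

actionable, suppressed = triage(alerts, queried)
print(len(actionable), len(suppressed))  # prints "1 2"
```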

The Operator works with your existing observability stack, or stands one up if you do not have one.

Summing up

Observability told humans what was wrong. Reliability tells agents what is safe. The next decade of data engineering is the move between those two sentences.

If your AI roadmap is approved and your data still cannot promise an agent a number it can trust, that is the gap to close first. See how the Operator handles it by booking a demo with us today.