A new discipline for the agent era

Autonomous Data Engineering

The discipline of building, operating, and evolving the data platform with agents as the primary actors. Humans contribute intent and judgment — not plumbing.

Data lakes are raw material. Insight lakes are the finished good. The whole establishment exists for the latter.

The physics of data

Joe walks into Starbucks. He orders a latte. The barista taps his card. A transaction happens — and the company has to record it now, or Joe is the customer staring at a spinning POS screen. At the end of the week, somebody at headquarters wants to know: how many lattes did we sell? That is a different question. It scans millions of rows across thousands of stores. It cannot run on Joe's terminal — and even if it could, doing so would make every customer wait.

These are two fundamentally different workloads. Transactional writes are single rows, low latency, point-in-time accurate. Analytical reads are many rows, high latency tolerance, summarized across history. You cannot serve both well on the same machine, with the same storage layout, on the same compute. This is the physics of data. Everything we call "data engineering" exists because of it.
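The contrast between the two workloads can be sketched in a few lines. This is only an illustration: the `sales` table, its columns, and the helper names are hypothetical, and a single in-memory SQLite database stands in for what would in practice be two separate systems.

```python
import sqlite3

# Hypothetical point-of-sale table; all names here are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales ("
    "id INTEGER PRIMARY KEY, store_id INTEGER, item TEXT, amount_cents INTEGER)"
)


def record_sale(store_id: int, item: str, amount_cents: int) -> int:
    """Transactional workload: write exactly one row and commit right away."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO sales (store_id, item, amount_cents) VALUES (?, ?, ?)",
            (store_id, item, amount_cents),
        )
    return cur.lastrowid


def lattes_sold() -> int:
    """Analytical workload: scan every row to produce one summary number."""
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM sales WHERE item = 'latte'"
    ).fetchone()
    return count


record_sale(1, "latte", 495)
record_sale(2, "espresso", 325)
record_sale(2, "latte", 495)
print(lattes_sold())
```

The write touches one row and must return quickly; the read touches every row and can tolerate a delay. At three rows the difference is invisible, but at millions of rows across thousands of stores the two access patterns pull the storage layout in opposite directions.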

Even after the data moves — through CDC, zero-ETL, or any other modern movement pattern — it is still the wrong shape for analytics. Operational systems store data the way that makes writes efficient: nested JSON, dates as varchars, inconsistent null representations, no shared vocabulary across SaaS sources, PII interleaved into operational records. Analytics needs the opposite. The path from raw operational data to a trustworthy business answer is never one hop. The physics gives you two systems. The data itself gives you all the work in between.
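A minimal sketch of the reshaping work described above, assuming a hypothetical raw order record. The list of null spellings, the accepted date formats, and the underscore-joined column naming are all illustrative assumptions, not a prescribed standard.

```python
from datetime import date, datetime
from typing import Any, Dict, Optional

# Assumed spellings of "missing" seen across sources; extend per source.
NULL_TOKENS = {"", "null", "none", "n/a", "na", "-"}


def normalize_null(value: Any) -> Any:
    """Map the many textual spellings of 'missing' to a single None."""
    if isinstance(value, str) and value.strip().lower() in NULL_TOKENS:
        return None
    return value


def parse_date(value: Any) -> Optional[date]:
    """Parse a date stored as text, trying a few assumed formats in order."""
    value = normalize_null(value)
    if value is None:
        return None
    for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%Y%m%d"):
        try:
            return datetime.strptime(value, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {value!r}")


def flatten(record: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into one level of underscore-joined column names."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}_"))
        else:
            flat[name] = normalize_null(value)
    return flat


raw = {
    "order": {"id": 17, "placed_on": "03/09/2024"},
    "customer": {"email": "N/A", "loyalty_tier": "gold"},
}
row = flatten(raw)
row["order_placed_on"] = parse_date(row["order_placed_on"])
print(row)
```

Even this toy record needs three separate fixes (structure, types, nulls) before it is usable in an analytical table, which is the sense in which the path is never one hop.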

Autonomous Data Engineering is the discipline of building, operating, and evolving the data platform with agents as the primary actors. Humans contribute intent and judgment, not plumbing.

Six defining principles

Agent-native

Agents are first-class producers and consumers of the platform, not bolt-on assistants.

Declarative and view-first

Definitions live as code and views, not as opaque pipelines.
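One way to picture a view-first definition, as a sketch only: the `raw_orders` table, the `net_revenue` view, and the rule that only paid orders count are hypothetical, and SQLite is used purely because it runs anywhere.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE raw_orders (order_id INTEGER, status TEXT, amount_cents INTEGER);
    INSERT INTO raw_orders VALUES
        (1, 'paid', 1200), (2, 'refunded', 800), (3, 'paid', 500);

    -- The (assumed) business definition of net revenue lives in one view.
    CREATE VIEW net_revenue AS
    SELECT SUM(amount_cents) AS net_revenue_cents
    FROM raw_orders
    WHERE status = 'paid';
    """
)

# Consumers query the definition by name.
(net_revenue_cents,) = conn.execute(
    "SELECT net_revenue_cents FROM net_revenue"
).fetchone()

# The definition itself stays inspectable as text, unlike an opaque pipeline.
(definition_sql,) = conn.execute(
    "SELECT sql FROM sqlite_master WHERE type = 'view' AND name = 'net_revenue'"
).fetchone()

print(net_revenue_cents)
print(definition_sql)
```

Because the definition is plain text, a human or an agent can read it, diff it, and review a change to it the same way as any other code.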

Self-evolving

Schema change, source drift, and definitional evolution are normal operations, not exceptions.
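Treating schema change as a normal operation starts with detecting it routinely. The sketch below compares an expected schema with an observed one; the column names, type labels, and the `diff_schema` helper are hypothetical, and what an agent does with the result is left out.

```python
from typing import Dict, List


def diff_schema(
    expected: Dict[str, str], observed: Dict[str, str]
) -> Dict[str, List[str]]:
    """Report columns that were added, removed, or changed type."""
    shared = expected.keys() & observed.keys()
    return {
        "added": sorted(observed.keys() - expected.keys()),
        "removed": sorted(expected.keys() - observed.keys()),
        "retyped": sorted(c for c in shared if expected[c] != observed[c]),
    }


# Hypothetical contract versus what the source delivered today.
expected = {"order_id": "integer", "placed_on": "date", "amount": "decimal"}
observed = {"order_id": "integer", "placed_on": "varchar", "channel": "varchar"}

drift = diff_schema(expected, observed)
print(drift)
```

An empty result on all three keys means no drift; anything else becomes an input to a routine evolution step rather than an incident.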

Reliability-first

The platform commits to SLOs and serves confidence signals that humans and agents can both act on.
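A freshness SLO is one simple form such a commitment could take. In this sketch the `FreshnessSLO` and `ConfidenceSignal` types, the dataset name, and the six-hour threshold are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class FreshnessSLO:
    dataset: str
    max_staleness: timedelta


@dataclass(frozen=True)
class ConfidenceSignal:
    dataset: str
    within_slo: bool
    staleness: timedelta


def evaluate(
    slo: FreshnessSLO, last_loaded_at: datetime, now: datetime
) -> ConfidenceSignal:
    """Compare how stale the dataset is against what the SLO allows."""
    staleness = now - last_loaded_at
    return ConfidenceSignal(slo.dataset, staleness <= slo.max_staleness, staleness)


slo = FreshnessSLO(dataset="daily_sales", max_staleness=timedelta(hours=6))
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
last_load = datetime(2024, 1, 2, 3, 0, tzinfo=timezone.utc)

signal = evaluate(slo, last_load, now)
print(signal)
```

The signal is a plain value, so a dashboard can display it for a person while an agent branches on `within_slo` before answering a question from that dataset.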

Human-in-the-loop on intent

Humans decide what data should mean. Agents handle implementation, operations, and evolution.

Topology-independent

Works on any stack, on-prem or cloud, legacy or modern. ADE is not a migration project.

The anatomy of an autonomous data platform

A small number of agent roles, each owning a slice of the physics. They work as one crew — discovering, reshaping, reasoning, monitoring, and governing the data platform end-to-end.

Ingestor

Discovers sources, negotiates contracts, lands data.

Transformer

Reshapes raw operational data into trustworthy analytical form.

Reasoner

Builds the data context graph and resolves questions deterministically.

Operator

Monitors, detects drift, runs reliability checks, self-heals.

Governor

Enforces PII handling, jurisdiction rules, access policy.

You don't care whether your Amazon package was carried by FedEx or UPS. You care that it arrived on time, in the right condition, at the right address. An agent-native data platform aims to earn the same kind of indifference.

Ingestion

The connector is already a commodity. What changes is everything around it — discovery, drift, backfills, monitoring.

Transformation

Agents profile patterns, propose transformations, and ship through CI/CD. Pipelines stop being human-crafted artifacts.

Modeling (Gold)

Consumption, not upfront specification, shapes Gold. Definitions emerge from what people actually ask.

MDM

Vector embeddings plus LLM reasoning collapse the time needed for identity resolution from years to weeks.
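The matching step can be sketched as a similarity search over record vectors. Note the stand-in: the `embed` function below uses character trigram counts rather than a learned embedding model, the records and the 0.5 threshold are invented for illustration, and the LLM review of candidate pairs is not shown.

```python
import math
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple


def embed(text: str) -> Counter:
    """Stand-in embedding: character trigram counts (NOT a learned model)."""
    padded = f"  {text.lower().strip()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def candidate_matches(
    records: Dict[str, str], threshold: float = 0.5
) -> List[Tuple[str, str, float]]:
    """Return record pairs similar enough to send on for review."""
    vectors = {rid: embed(name) for rid, name in records.items()}
    matches = []
    for left, right in combinations(sorted(vectors), 2):
        score = cosine(vectors[left], vectors[right])
        if score >= threshold:
            matches.append((left, right, score))
    return matches


# Hypothetical records from two source systems.
records = {
    "crm:1": "Acme Corporation",
    "billing:9": "ACME Corp.",
    "crm:2": "Globex Industries",
}
matches = candidate_matches(records)
print(matches)
```

Comparing every pair is quadratic and only workable for small inputs; at scale a nearest-neighbor index over real embeddings would take the place of the `combinations` loop.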

Analytics & reasoning

Agents build the dashboards. The insight lake is pre-computed before anyone asks.

Operations

The shift from observability (descriptive) to reliability (acted-on contracts).

Governance

Enforced at deployment, not discovered in production. Compliance becomes a SQL query.
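"Compliance becomes a SQL query" can be pictured as a check against a column catalog that runs before deployment. Everything here is a hypothetical sketch: the `column_catalog` table, its classification labels, and the rule that every PII column needs a masking policy.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical catalog: one row per column about to be deployed.
    CREATE TABLE column_catalog (
        table_name TEXT,
        column_name TEXT,
        classification TEXT,
        masking_policy TEXT
    );
    INSERT INTO column_catalog VALUES
        ('gold.customers', 'email',        'pii',    'hash_sha256'),
        ('gold.customers', 'loyalty_tier', 'public', NULL),
        ('gold.orders',    'ship_address', 'pii',    NULL);
    """
)

# Assumed policy: a PII column without a masking policy is a violation.
VIOLATIONS_SQL = """
    SELECT table_name, column_name
    FROM column_catalog
    WHERE classification = 'pii' AND masking_policy IS NULL
    ORDER BY table_name, column_name
"""


def deployment_allowed(connection: sqlite3.Connection) -> bool:
    """Gate a deployment: allow it only when the violations query is empty."""
    return not connection.execute(VIOLATIONS_SQL).fetchall()


violations = conn.execute(VIOLATIONS_SQL).fetchall()
print(violations)
print(deployment_allowed(conn))
```

Run as a CI step, a non-empty result blocks the release, so the gap is caught at deployment rather than found later in production.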

Data Migration

The same agent crew that runs day-to-day data engineering can migrate the platform underneath it.

Agent-native first. Let the agents modernize the rest.

The most important thing for a data leader to internalize is that modernization is not a precondition for ADE. ADE is a substitute for modernization-first as a strategy. You do not need to be on the latest lakehouse, the newest table format, or the cleanest stack to become agent-native. Every topology can host the discipline: on-prem SQL Server with stored procedures, legacy data warehouses, flat Parquet on object storage, modern lakehouses. The shift is about how the platform is operated, not about which compute engine sits underneath.

Once you are agent-native, the agents themselves can drive the modernization on a timeline that makes sense for your business: opportunistically, piece by piece, where the economics justify it. Modernization stops being a 24-month program ahead of value and becomes a downstream consequence of an operating model that is already working.

Get the full handbook

The complete playbook — every discipline walked through, real-world examples, and the operating model CDOs are adopting now.
