Data Engineering's Shift from Imperative to Declarative - and Why It Matters More in the Age of AI

If you've been paying attention, you'll have noticed a distinct shift in data engineering - away from imperative, towards declarative.

What started with dbt in 2016 has since spread across the full ELT stack - on the transformation side: Databricks Delta Live Tables, Snowflake Dynamic Tables, Microsoft Fabric Materialized Lake Views, and Spark Declarative Pipelines; on the ingestion side: Fivetran, Airbyte, and Fabric's Copy Job.

Data engineering isn't alone here. Infrastructure as code made this move years ago - Terraform, Kubernetes, Helm. We're following a pattern software engineering set long before us.

The appeal for human engineers is straightforward. In an imperative setup, every loading pattern - append-only, upsert, merge, SCD Type 2 - means writing and maintaining the logic yourself.

In a federated model, where multiple domain teams each own their own pipelines, this compounds quickly: how do you ensure the finance team's SCD Type 2 logic matches the central data team's?

Declarative frameworks remove that problem. Say "I want SCD Type 2" and the engine handles the rest - the same tested, version-controlled implementation for every team.

What started as a way to abstract complexity away from engineers has, coincidentally, done the same for AI agents.

As a rule of thumb, I'd rather spend engineering effort on business logic than rebuilding framework capabilities that already exist elsewhere. Everything else should be a well-known public framework - one that AI tools were trained on.

If you were building a house, would you tell the bricklayer how and where to lay each brick, how to mix the mortar - or would you just say "I want brick veneer"?

It's a question I've been working through in practice - and Microsoft Fabric is where I've been putting it to the test.

1. What: Imperative vs. Declarative

Procedural programming (a sub-class of imperative) - the way most data engineers have worked for years - means step-by-step instructions.

Read this table, filter these rows, join here, write the output there. You own the order, the control flow, and every decision in between.

Declarative programming flips it. Describe what the output should look like; the system works out how to get there.

One thing worth clarifying: SQL has always been declarative. SELECT * FROM orders WHERE status = 'pending' describes what you want, not how to retrieve it. Early ETL tools like DataStage and Informatica had visual drag-and-drop interfaces that were also declarative in spirit.
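
To make the contrast concrete, here's a minimal sketch of the same filter written both ways, using Python's built-in sqlite3 module. The sample rows and function names are invented for illustration; only the query shape comes from the example above.

```python
import sqlite3

# Hypothetical sample data, invented for this illustration.
ORDERS = [
    (1, "pending"),
    (2, "shipped"),
    (3, "pending"),
]


def pending_imperative(rows):
    """Imperative: we spell out the loop, the check, and the accumulation."""
    result = []
    for order_id, status in rows:
        if status == "pending":
            result.append((order_id, status))
    return result


def pending_declarative(rows):
    """Declarative: we state the result we want; SQLite decides how to get it."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
        conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT id, status FROM orders WHERE status = 'pending' ORDER BY id"
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    print(pending_imperative(ORDERS))
    print(pending_declarative(ORDERS))
```

Both functions return the same rows. The difference is who owns the "how": in the first, you do; in the second, the query planner does.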

The shift we're talking about isn't SQL the language. It's the framework built around SQL - how model dependencies get managed, how testing gets enforced, how documentation stays current, whether your business logic is tied to a specific execution engine. That's what's changed.

Take SCD Type 2. Imperatively, you write the logic yourself: join new data to the existing table on the business key, check whether tracked columns changed, end-date the old row, insert a new version.

In PySpark, that's roughly 30 lines of Delta merge logic - and you own every edge case.
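
For a feel of what "owning the logic" means, here's a rough sketch of those four steps in plain Python rather than PySpark and Delta, so it runs anywhere. The column names (email, city, valid_from, valid_to) and the function name are my assumptions, and it deliberately ignores deletes, late-arriving data, and concurrency - exactly the edge cases you'd have to handle yourself.

```python
from datetime import date

TRACKED_COLUMNS = ("email", "city")  # hypothetical tracked attributes


def apply_scd2(dimension, incoming, load_date):
    """Return a new dimension list with SCD Type 2 changes applied.

    dimension: list of dicts with customer_id, the tracked columns,
               valid_from and valid_to (valid_to is None for the current row).
    incoming:  list of dicts with customer_id and the tracked columns.
    """
    result = [dict(row) for row in dimension]  # copy so the input is not mutated

    # Step 1: index the current version of each business key.
    current = {row["customer_id"]: row for row in result if row["valid_to"] is None}

    for new in incoming:
        key = new["customer_id"]
        old = current.get(key)

        # Step 2: check whether any tracked column changed.
        if old is not None and all(old[c] == new[c] for c in TRACKED_COLUMNS):
            continue  # unchanged, nothing to do

        # Step 3: end-date the old version, if there is one.
        if old is not None:
            old["valid_to"] = load_date

        # Step 4: insert the new version as the current row.
        version = {"customer_id": key, "valid_from": load_date, "valid_to": None}
        version.update({c: new[c] for c in TRACKED_COLUMNS})
        result.append(version)
        current[key] = version

    return result


if __name__ == "__main__":
    dim = [{"customer_id": 1, "email": "a@example.com", "city": "Perth",
            "valid_from": date(2024, 1, 1), "valid_to": None}]
    batch = [{"customer_id": 1, "email": "a@example.com", "city": "Sydney"},
             {"customer_id": 2, "email": "b@example.com", "city": "Hobart"}]
    for row in apply_scd2(dim, batch, date(2025, 1, 1)):
        print(row)
```

Even this toy version has four distinct decisions baked into it. Multiply that by every table and every team, and the maintenance cost becomes clear.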

Declaratively:

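snapshots:
  - name: customers
    # relation is required for YAML-defined snapshots; this ref() target is a placeholder
    relation: ref('stg_customers')
    config:
      unique_key: customer_id
      strategy: timestamp
      updated_at: updated_at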

Run dbt snapshot. The framework handles the comparison, the end-dating, and the insert. You declared the outcome - track history on this table using this key - and the system worked out the how.

2. When: A Decade in the Making

Date	Procedural / Imperative	Declarative
Jan 1997	DataStage GA - hand-coded ETL jobs become the enterprise standard
2005	SSIS ships with SQL Server 2005 - drag-and-drop pipelines dominate Microsoft shops
Apr 2006	Apache Hadoop - teams write Java MapReduce jobs to transform data at scale
2009	Spark created at UC Berkeley AMPLab
2010	Spark open-sourced - Python replaces Java, but logic stays procedural
2013	Databricks founded; Spark donated to Apache - PySpark notebooks begin replacing ETL tools
Feb 2014	Spark Top-Level Apache Project - notebooks become the default transformation environment
2016	PySpark notebooks are the de facto T layer across cloud data platforms	Fishtown Analytics (later dbt Labs) founded
Jan 2019	Notebooks remain dominant for most teams	"Analytics Engineer" coined - naming the practitioner role the new frameworks were enabling
Nov 2020	Notebooks still the default at most organisations
May 2021		Databricks previews Delta Live Tables at Data + AI Summit
Jun 2021		dbt Labs reaches unicorn status at a $1.5B valuation - market validation for declarative transformation
Feb 2022		dbt Labs raises $222M at a $4.2B valuation - Snowflake and Databricks among investors
Apr 2022		Databricks Delta Live Tables reaches GA
Jun 2022		Snowflake Dynamic Tables debut at Data Cloud Summit
May 2025	Notebooks remain the default for the majority of teams	Microsoft Fabric previews Materialized Lake Views at Microsoft Build
Jun 2025		Databricks contributes Spark Declarative Pipelines (SDP) to Apache Spark - a new open-source framework informed by years of running DLT in production
Oct 2025		Fivetran merges with dbt Labs - EL and T combine at ~$600M ARR
Dec 2025		Apache Spark 4.1.0 ships with SDP as a headline feature
Mar 2026		Microsoft Fabric Materialized Lake Views reach GA at FabCon Atlanta

The two columns overlap, which is intentional. Declarative didn't replace procedural overnight - it grew alongside it. PySpark notebooks didn't stop being used when dbt launched in 2016, and they haven't stopped today.

The direction of travel for new work is what's changing, and the cost of staying put is rising.

The February 2022 entry is worth a closer look. Snowflake and Databricks both invested in dbt Labs - and then, within months, brought out their own native declarative tools. They backed the independent framework to validate the market, then built platform-native versions of the same idea.

Databricks took Delta Live Tables to GA in April 2022, roughly six weeks after the funding round. That's not a coincidence.

Figure: Chart showing the rise of declarative transformation tooling from 1994 to 2026. The declarative shift in context - from legacy ETL tools through the PySpark era to the current convergence on declarative frameworks across every major platform.

3. Why: The Cost of Procedural at Scale

The problem with notebooks

PySpark notebooks have their place - ML feature engineering, complex Python logic, bespoke ingestion that dedicated EL tools don't cover. Plenty of teams use Fivetran, Azure Data Factory, or Airbyte for ingestion and never open a notebook for it.

And notebooks have worked fine for transformation too. Teams have been shipping production pipelines with them for years. The question isn't whether notebooks work - they do - it's whether declarative frameworks do it better.

For the transformation layer, here's how they compare:

	Procedural notebooks	Declarative frameworks
Dependencies	Execution order implied by cell position. Downstream impact when you change an upstream model is yours to track manually.	You declare relationships between models; the framework determines execution order automatically.
Error detection	Fails at runtime - a broken column reference, a schema change, a typo in a join key only surfaces when the job actually executes.	Errors surface at compile time, before a single row is processed.
Testing	Whatever someone remembered to write. No standard framework, no consistent enforcement.	Data quality constraints declared alongside the model definition. Run automatically.
Documentation	Wikis and comments go stale as code evolves. No mechanism to keep them in sync.	Generated from the same source as the logic. Always in sync.
Version control	Notebooks stored as JSON. Diffs are noisy and hard to review - cell metadata and output state mix with logic. Merge conflicts in multi-developer teams are painful.	Plain SQL files. Clean diffs. Meaningful code review.
Portability	Business logic coupled to the execution engine - PySpark DataFrames don't travel.	Portable SQL. Swap the adapter or platform; the logic stays intact.

At small scale, notebooks are perfectly manageable. As the codebase grows and more people touch it, the gaps in the left column compound - and that's where declarative frameworks pull ahead.
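
The Dependencies row is the easiest one to sketch. Below is a minimal illustration of how a framework can derive execution order and downstream impact from declared relationships, using Python's standard-library graphlib. The model names are hypothetical, and real frameworks do considerably more than this - it's only meant to show the idea.

```python
from graphlib import TopologicalSorter

# Hypothetical model names; each model maps to the set of models it depends on.
MODEL_DEPENDENCIES = {
    "stg_customers": set(),
    "stg_orders": set(),
    "dim_customers": {"stg_customers"},
    "fct_orders": {"stg_orders", "dim_customers"},
}


def execution_order(dependencies):
    """Return one valid run order; raises graphlib.CycleError on circular references."""
    return list(TopologicalSorter(dependencies).static_order())


def downstream_of(model, dependencies):
    """Return every model that directly or indirectly depends on `model`."""
    impacted = set()
    frontier = [model]
    while frontier:
        node = frontier.pop()
        for candidate, parents in dependencies.items():
            if node in parents and candidate not in impacted:
                impacted.add(candidate)
                frontier.append(candidate)
    return impacted


if __name__ == "__main__":
    print(execution_order(MODEL_DEPENDENCIES))
    print(sorted(downstream_of("stg_customers", MODEL_DEPENDENCIES)))
```

Nobody wrote the run order down. It falls out of the declared relationships, and a circular reference fails before anything executes rather than halfway through a job.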

Why every major platform arrived at the same answer

The clearest evidence that this shift is real isn't any one tool - it's that competing platforms appear to have arrived at the same conclusion independently.

Snowflake built Dynamic Tables. Databricks built Delta Live Tables (now Lakeflow). Microsoft built Materialized Lake Views.
In February 2022, Snowflake and Databricks both invested in dbt Labs - then shipped or announced their own competing native tools within months.
In June 2025, Databricks contributed Spark Declarative Pipelines (SDP) to Apache Spark - not a straight open-sourcing of DLT, but a new framework informed by years of running DLT in production. It shipped as a headline feature in Spark 4.1.0 in December 2025.
In October 2025, Fivetran - the leading EL platform - merged with dbt Labs at ~$600M ARR. The company owning Extract and Load merged with the company owning Transform.

These aren't coordinated moves. When competitors build the same thing without talking to each other, that's usually a signal the underlying idea is right.

Why AI amplifies the shift

There's one more layer that wasn't in play when dbt launched in 2016.

LLMs don't know your custom metadata framework. They don't know your in-house state store or your bespoke pipeline config. You describe your schema and hope the model infers the rest.

Declarative frameworks give AI a structural safety net on top: the framework validates the declaration before execution begins. Most major declarative tools offer some form of dry-run validation, either built in or as an add-on:

dbt: dbt parse and dbt compile - check that models parse and references resolve without running them (the community dbt-dry-run package goes a step further)
Spark Declarative Pipelines: spark-pipelines dry-run - checks pipeline structure before execution
Kubernetes / Helm: --dry-run on kubectl apply and helm install

When an LLM generates declarative config and gets something wrong, the framework can catch it at compile time - before a single row moves. With imperative notebook code, the same mistake typically only surfaces at runtime.
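
Here's a toy sketch of that safety net: a validator that checks a snapshot-style declaration before anything runs. The rules are loosely modelled on the snapshot example earlier and are my own simplification, not dbt's actual validation logic; the function name and the known_columns argument are hypothetical.

```python
ALLOWED_STRATEGIES = {"timestamp", "check"}
REQUIRED_KEYS = {"unique_key", "strategy"}


def validate_snapshot(declaration, known_columns):
    """Return a list of problems found in a snapshot-style declaration.

    An empty list means the declaration passed this (simplified) dry run.
    """
    errors = []
    config = declaration.get("config", {})

    # Structural check: are the required keys present at all?
    for key in sorted(REQUIRED_KEYS - set(config)):
        errors.append(f"missing required key: {key}")

    # Vocabulary check: is the strategy one the framework understands?
    strategy = config.get("strategy")
    if strategy is not None and strategy not in ALLOWED_STRATEGIES:
        errors.append(f"unknown strategy: {strategy}")

    if strategy == "timestamp" and "updated_at" not in config:
        errors.append("timestamp strategy requires updated_at")

    # Reference check: do the named columns exist in the source schema?
    for key in ("unique_key", "updated_at"):
        column = config.get(key)
        if column is not None and column not in known_columns:
            errors.append(f"{key} refers to unknown column: {column}")

    return errors


if __name__ == "__main__":
    generated = {"name": "customers",
                 "config": {"unique_key": "cust_id", "strategy": "timestmp"}}
    for problem in validate_snapshot(generated, {"customer_id", "updated_at"}):
        print(problem)
```

A declaration has a small, closed vocabulary, so a misspelt strategy or a made-up column name is cheap to detect up front. Arbitrary notebook code has no equivalent shape to check against.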

In the agent-based workflows I've been experimenting with, this structural property matters more than I initially expected. The constraint that the only custom code should be your business logic isn't just a design preference - it's what makes the whole thing debuggable when something goes wrong.

4. How: Moving Forward

Scope first

The declarative shift is happening across the whole ELT stack, but the tooling is most mature in the transformation layer.

Ingestion is catching up. Fivetran, Airbyte, Fabric's Copy Job, and Lakeflow Connect all follow the same principle: declare what you want moved and where, and the platform handles execution, schema changes, and incremental state.

Python and PySpark still have a role where they're genuinely needed: complex API ingestion, streaming, ML inference, iterative algorithms, deeply nested data. Declarative tools don't try to replace them there.

The transformation layer is where the shift is most pronounced: stable, SQL-expressible business logic turning landed data into clean, trusted models. That's where notebook costs compound most visibly, and where the declarative payoff is clearest.

The rest of this series is focused on Microsoft Fabric specifically - not because it's the only viable platform, but because it's where I've been doing the work. And the honest answer to whether it's production-ready is more nuanced than most of the marketing material suggests.

The declarative spectrum

Not all declarative tools are declarative in the same way.

Interface-level declarative - dbt and similar frameworks
You describe the desired end state; the framework handles dependency resolution, materialisation, and testing. Worth being transparent: at the execution level, dbt runs an imperative sequence of SQL mutations against a mutable database. It's declarative in what you write, not necessarily in how it executes - think Makefiles. For most teams moving away from notebooks, this is the right starting point.

Execution-level declarative - SQLMesh
SQLMesh takes a more rigorous approach. MODEL blocks are designed to declare properties, dependencies, update methods, and schedules directly in code. Its plan/apply workflow aims to show exactly what will change before anything executes, and virtual environments are intended to allow pipeline changes to be tested in isolation. For teams that want declarativeness at the execution level - not just the interface - it's worth evaluating, though Fabric-specific support is limited at time of writing.
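
The plan/apply idea is easier to see in miniature. The sketch below is a generic illustration of the pattern - diff the desired state against the current state, review the result, then apply it. It is not SQLMesh's API; the function names and the {model_name: definition} representation are assumptions made for the example.

```python
def build_plan(current, desired):
    """Compare two {model_name: definition} mappings and list the pending changes."""
    changes = []
    for name in sorted(desired.keys() - current.keys()):
        changes.append(("create", name))
    for name in sorted(current.keys() & desired.keys()):
        if current[name] != desired[name]:
            changes.append(("update", name))
    for name in sorted(current.keys() - desired.keys()):
        changes.append(("drop", name))
    return changes


def apply_plan(current, desired, changes):
    """Return the new state after applying a previously reviewed plan."""
    new_state = dict(current)  # leave the caller's state untouched
    for action, name in changes:
        if action == "drop":
            new_state.pop(name)
        else:  # "create" and "update" both take the desired definition
            new_state[name] = desired[name]
    return new_state


if __name__ == "__main__":
    deployed = {"stg_orders": "select 1", "old_model": "select 2"}
    wanted = {"stg_orders": "select 1 as id", "fct_orders": "select 3"}
    pending = build_plan(deployed, wanted)
    print(pending)  # review this before anything executes
    print(apply_plan(deployed, wanted, pending))
```

The point of separating the two steps is that the plan is data: a human, or an agent, can inspect it before a single change is made.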

Platform-native declarative - Dynamic Tables, Lakeflow, MLVs, Spark Declarative Pipelines
If you're committed to a single platform, the native tooling is worth a look. No framework to install, no adapter to configure. The trade-off is portability - these tools are tied to their host platform in ways open-source frameworks aren't. Spark Declarative Pipelines is the partial exception: it's open source, but still tied to Spark as the engine.

One caveat across all of the above - leaky abstractions
Declarative tools abstract away execution, which is mostly the point. But abstractions leak, and when they do you feel it. You give up some granular control - specific compute configurations, spot instance selection, fine-grained performance tuning. For most transformation workloads that doesn't matter. For teams with complex performance or cost requirements, it's worth knowing upfront: you're trading control for convenience, and occasionally you'll hit the edges of what the abstraction covers.

Where to start

If you're on…	Worth considering…
SSIS / legacy ETL tools	A SQL-first declarative framework - dbt Core is the most widely adopted starting point. The jump is significant but the payoff is proportional.
PySpark notebooks on Databricks	Databricks Lakeflow (platform-native, minimal setup) or an open-source framework if portability matters
PySpark notebooks on Microsoft Fabric	Fabric's native Materialized Lake Views for low friction, or dbt for richer testing, CI/CD, and portability
Already on dbt, want more rigour	SQLMesh - plan/apply, virtual environments, execution-level state management

You don't need to migrate everything at once. The typical pattern is to start with one domain - a staging layer, a single subject area - run it alongside existing notebooks, and compare. The difference tends to be obvious pretty quickly.

Notebooks don't disappear - they just get used where they're actually needed: ingestion, streaming, ML, complex Python logic. Going back to the brick veneer analogy: saying you want brick veneer doesn't mean the bricklayer goes home. They're still on site - laying bricks where bricks need to be laid.

Coming Next

Over the past few months I've been experimenting with what declarative ELT actually looks like in Microsoft Fabric. Some aspects work surprisingly well. Others expose gaps that require augmentation with external tooling. In the next article I'll walk through that experiment, the trade-offs I encountered, and what I learned.

- Brad Coles

Sources

IBM InfoSphere DataStage history: Wikipedia
SQL Server Integration Services history: Wikipedia
Apache Hadoop history: Wikipedia
Apache Spark history: Wikipedia
"Analytics Engineer" origin: dbt Labs blog
dbt Labs funding history: Tracxn
Databricks Delta Live Tables GA: Databricks blog, April 2022
Databricks contributes SDP to Apache Spark: Databricks blog, June 2025
Snowflake Dynamic Tables: Snowflake blog, August 2022
Microsoft Fabric MLVs announced: Microsoft Fabric blog, May 2025
Apache Spark 4.1.0 release: Apache Spark News
Spark Declarative Pipelines programming guide: Apache Spark docs
Fivetran + dbt Labs merger: Fivetran press release, October 2025
Databricks on procedural vs. declarative: Databricks docs
dbt's declarative approach: Migrating from stored procedures - dbt Labs
SQLMesh as a declarative framework: Orchestra, Synq dbt vs SQLMesh comparison
"dbt isn't declarative": Jenny Kwan

Brad Coles is a Senior Consultant and Data Engineering Capability Lead at Synechron Australia, specialising in Microsoft Fabric and modern data platform engineering.