$49
Delta Lake Patterns Library
Delta Lake table design patterns, merge operations, time travel queries, vacuum strategies, and Z-ORDER optimization.
Python · YAML · Markdown · JSON · Databricks · PySpark · Spark · Delta Lake
📁 File Structure 17 files
delta-lake-patterns/
├── LICENSE
├── README.md
├── configs/
│ ├── table_maintenance.yaml
│ └── table_properties.yaml
├── guides/
│ └── delta-lake-best-practices.md
├── notebooks/
│ ├── cdf_processor.py
│ ├── maintenance_runner.py
│ └── setup_tables.py
├── src/
│ ├── change_data_feed.py
│ ├── liquid_clustering.py
│ ├── merge_patterns.py
│ ├── optimization.py
│ ├── table_utilities.py
│ └── time_travel.py
└── tests/
├── conftest.py
└── test_merge_patterns.py
📖 Documentation Preview README excerpt
Delta Lake Patterns
Production-ready Delta Lake merge, optimization, and maintenance patterns for Databricks.
Master the full spectrum of Delta Lake operations — from SCD Type 2 merges to Liquid Clustering migration, Change Data Feed processing, and automated table maintenance.
---
What You Get
- 5 merge strategies — SCD1, SCD2, upsert, delete+insert, and conditional merge with full PySpark implementations
- Table optimization toolkit — OPTIMIZE, ZORDER, vacuum scheduling, and ANALYZE TABLE automation
- Time travel operations — Version history queries, point-in-time restore, and audit trail generation
- Change Data Feed processing — Incremental CDF readers with watermark tracking and replay support
- Liquid Clustering — Setup, migration from ZORDER, and monitoring utilities
- Table utilities — Clone, convert-to-delta, property management, and schema inspection
- Maintenance scheduler — YAML-driven maintenance configs for bronze/silver/gold layers
- Runnable notebooks — Setup, maintenance runner, and CDF processor ready for Databricks
- Tests included — Merge pattern tests with sample data and pytest fixtures
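All five merge strategies build on Delta Lake's `MERGE INTO` semantics. As a rough illustration of the basic upsert shape (the helper name and signature below are hypothetical, not the package's actual API, which operates on DataFrames), the statement can be generated like this:

```python
def build_upsert_sql(target: str, source_view: str, keys: list, columns: list) -> str:
    """Build a Delta Lake MERGE INTO statement for a basic upsert.

    Hypothetical helper for illustration only; the package's real merge
    functions work with DataFrames rather than SQL strings.
    """
    on = " AND ".join(f"t.{k} = s.{k}" for k in keys)
    set_clause = ", ".join(f"t.{c} = s.{c}" for c in columns)
    insert_cols = ", ".join(keys + columns)
    insert_vals = ", ".join(f"s.{c}" for c in keys + columns)
    return (
        f"MERGE INTO {target} t USING {source_view} s ON {on} "
        f"WHEN MATCHED THEN UPDATE SET {set_clause} "
        f"WHEN NOT MATCHED THEN INSERT ({insert_cols}) VALUES ({insert_vals})"
    )

# Example: upsert staged customers keyed on customer_id
sql = build_upsert_sql("silver.dim_customer", "staged_customers",
                       ["customer_id"], ["email", "phone"])
```

SCD2 adds effective-date and current-flag handling on top of this same matched/not-matched skeleton.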
File Tree
delta-lake-patterns/
├── README.md
├── manifest.json
├── LICENSE
├── src/
│ ├── merge_patterns.py # SCD1, SCD2, upsert, delete+insert, conditional
│ ├── optimization.py # OPTIMIZE, ZORDER, vacuum, maintenance scheduler
│ ├── time_travel.py # Version history, restore, audit trail
│ ├── change_data_feed.py # CDF reader, incremental processing
│ ├── table_utilities.py # Clone, convert-to-delta, describe history
│ └── liquid_clustering.py # Liquid clustering setup and migration
├── configs/
│ ├── table_maintenance.yaml # Maintenance schedule per layer
│ └── table_properties.yaml # Standard table properties
├── notebooks/
│ ├── setup_tables.py # Create Delta tables with configs
│ ├── maintenance_runner.py # Run maintenance across schemas
│ └── cdf_processor.py # Process Change Data Feed
├── tests/
│ ├── test_merge_patterns.py # Test SCD1/SCD2 merge logic
│ └── conftest.py # Pytest fixtures
└── guides/
└── delta-lake-best-practices.md
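To give a sense of the YAML-driven maintenance configs, a per-layer schedule could take roughly this shape (field names here are illustrative, not the package's actual schema; see `configs/table_maintenance.yaml` for the real one):

```yaml
# Illustrative shape only -- not the package's actual schema
layers:
  bronze:
    optimize: weekly
    vacuum_retention_hours: 168   # 7 days, Delta's default retention threshold
  silver:
    optimize: daily
    zorder_by: [customer_id]
    vacuum_retention_hours: 168
```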
Getting Started
1. Run an SCD Type 2 Merge
```python
from src.merge_patterns import scd2_merge

scd2_merge(
    target_table="catalog.silver.dim_customer",
    source_df=incoming_customers,
    merge_keys=["customer_id"],
    tracked_columns=["email", "address", "phone"],
)
```
*... continues with setup instructions, usage examples, and more.*
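The time travel operations listed above rely on Delta's `VERSION AS OF` / `TIMESTAMP AS OF` SQL syntax. A minimal sketch of a point-in-time query builder (the function name is hypothetical; see `src/time_travel.py` for the package's actual API):

```python
from typing import Optional

def time_travel_query(table: str,
                      version: Optional[int] = None,
                      timestamp: Optional[str] = None) -> str:
    """Build a point-in-time SELECT against a Delta table.

    Exactly one of version or timestamp must be given. Hypothetical
    helper for illustration only.
    """
    if (version is None) == (timestamp is None):
        raise ValueError("Provide exactly one of version or timestamp")
    if version is not None:
        return f"SELECT * FROM {table} VERSION AS OF {version}"
    return f"SELECT * FROM {table} TIMESTAMP AS OF '{timestamp}'"
```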
📄 Code Sample .py preview
src/change_data_feed.py
"""
Delta Lake Change Data Feed (CDF) Processing
=============================================
Read, filter, and process change data from Delta tables with CDF enabled.
Supports incremental processing with checkpoint-based watermarking.
Datanest Digital | https://datanest.dev
"""
from __future__ import annotations
import json
import logging
from datetime import datetime
from typing import Optional
from pyspark.sql import DataFrame
from pyspark.sql import functions as F
# Databricks globals
spark = spark # type: ignore[name-defined]
dbutils = dbutils # type: ignore[name-defined]
logger = logging.getLogger(__name__)
def enable_cdf(table_name: str) -> None:
"""Enable Change Data Feed on a Delta table.
Args:
table_name: Fully qualified table name.
"""
spark.sql(
f"ALTER TABLE {table_name} SET TBLPROPERTIES "
f"(delta.enableChangeDataFeed = true)"
)
logger.info("CDF enabled on %s", table_name)
def read_cdf_changes(
table_name: str,
starting_version: Optional[int] = None,
starting_timestamp: Optional[str] = None,
ending_version: Optional[int] = None,
ending_timestamp: Optional[str] = None,
) -> DataFrame:
"""Read Change Data Feed from a Delta table.
You must provide either starting_version or starting_timestamp.
# ... 124 more lines ...
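Under the DataFrame API, those arguments map onto Delta's CDF reader options (`readChangeFeed`, `startingVersion`, `startingTimestamp`, `endingVersion`), e.g. `spark.read.format("delta").options(**opts).table(name)`. A sketch of how such an option dict might be assembled (the helper below is illustrative, not code from the package):

```python
from typing import Optional

def cdf_read_options(
    starting_version: Optional[int] = None,
    starting_timestamp: Optional[str] = None,
    ending_version: Optional[int] = None,
) -> dict:
    """Assemble Delta CDF reader options.

    Illustrative helper; mirrors the constraint in read_cdf_changes
    that a starting version or timestamp is required.
    """
    if starting_version is None and starting_timestamp is None:
        raise ValueError("Provide starting_version or starting_timestamp")
    opts = {"readChangeFeed": "true"}
    if starting_version is not None:
        opts["startingVersion"] = str(starting_version)
    if starting_timestamp is not None:
        opts["startingTimestamp"] = starting_timestamp
    if ending_version is not None:
        opts["endingVersion"] = str(ending_version)
    return opts
```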