Delta Lake Patterns Library

$49

Delta Lake table design patterns, merge operations, time travel queries, vacuum strategies, and Z-ORDER optimization.

📁 17 files · 🏷 v1.0.0
Python · YAML · Markdown · JSON · Databricks · PySpark · Spark · Delta Lake

📁 File Structure 17 files

delta-lake-patterns/
├── LICENSE
├── README.md
├── configs/
│   ├── table_maintenance.yaml
│   └── table_properties.yaml
├── guides/
│   └── delta-lake-best-practices.md
├── notebooks/
│   ├── cdf_processor.py
│   ├── maintenance_runner.py
│   └── setup_tables.py
├── src/
│   ├── change_data_feed.py
│   ├── liquid_clustering.py
│   ├── merge_patterns.py
│   ├── optimization.py
│   ├── table_utilities.py
│   └── time_travel.py
└── tests/
    ├── conftest.py
    └── test_merge_patterns.py

📖 Documentation Preview README excerpt

Delta Lake Patterns

Production-ready Delta Lake merge, optimization, and maintenance patterns for Databricks.

Master the full spectrum of Delta Lake operations — from SCD Type 2 merges to Liquid Clustering migration, Change Data Feed processing, and automated table maintenance.

---

What You Get

  • 5 merge strategies — SCD1, SCD2, upsert, delete+insert, and conditional merge with full PySpark implementations
  • Table optimization toolkit — OPTIMIZE, ZORDER, vacuum scheduling, and ANALYZE TABLE automation
  • Time travel operations — Version history queries, point-in-time restore, and audit trail generation
  • Change Data Feed processing — Incremental CDF readers with watermark tracking and replay support
  • Liquid Clustering — Setup, migration from ZORDER, and monitoring utilities
  • Table utilities — Clone, convert-to-delta, property management, and schema inspection
  • Maintenance scheduler — YAML-driven maintenance configs for bronze/silver/gold layers
  • Runnable notebooks — Setup, maintenance runner, and CDF processor ready for Databricks
  • Tests included — Merge pattern tests with sample data and pytest fixtures
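To illustrate the SCD Type 2 semantics that the merge strategies implement, here is a minimal sketch in plain Python (no Spark): when a tracked column changes, the current row version is expired and a new current version is appended. The `scd2_apply` helper and the `effective_from`/`effective_to`/`is_current` column names are illustrative assumptions; the packaged `scd2_merge` performs the equivalent logic via a Delta MERGE.

```python
from datetime import date

def scd2_apply(dim_rows: list[dict], incoming: dict, key: str,
               tracked: list[str], today: date) -> list[dict]:
    """Apply one source record to a dimension using SCD Type 2 rules.

    dim_rows : existing dimension rows, each carrying the bookkeeping
               columns effective_from, effective_to, is_current
    incoming : one source record identified by `key`
    """
    out = []
    matched = False
    for row in dim_rows:
        if row[key] == incoming[key] and row["is_current"]:
            matched = True
            if any(row[c] != incoming[c] for c in tracked):
                # change detected: expire the old version...
                out.append({**row, "is_current": False, "effective_to": today})
                # ...and append the new current version
                out.append({**incoming, "is_current": True,
                            "effective_from": today, "effective_to": None})
            else:
                out.append(row)          # no tracked change: keep as-is
        else:
            out.append(row)              # other keys / historical versions
    if not matched:                      # brand-new key: plain insert
        out.append({**incoming, "is_current": True,
                    "effective_from": today, "effective_to": None})
    return out
```

The same three branches (expire + insert, no-op, insert) map one-to-one onto the `WHEN MATCHED` / `WHEN NOT MATCHED` clauses of a Delta MERGE.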

File Tree


delta-lake-patterns/
├── README.md
├── manifest.json
├── LICENSE
├── src/
│   ├── merge_patterns.py          # SCD1, SCD2, upsert, delete+insert, conditional
│   ├── optimization.py            # OPTIMIZE, ZORDER, vacuum, maintenance scheduler
│   ├── time_travel.py             # Version history, restore, audit trail
│   ├── change_data_feed.py        # CDF reader, incremental processing
│   ├── table_utilities.py         # Clone, convert-to-delta, describe history
│   └── liquid_clustering.py       # Liquid clustering setup and migration
├── configs/
│   ├── table_maintenance.yaml     # Maintenance schedule per layer
│   └── table_properties.yaml      # Standard table properties
├── notebooks/
│   ├── setup_tables.py            # Create Delta tables with configs
│   ├── maintenance_runner.py      # Run maintenance across schemas
│   └── cdf_processor.py           # Process Change Data Feed
├── tests/
│   ├── test_merge_patterns.py     # Test SCD1/SCD2 merge logic
│   └── conftest.py                # Pytest fixtures
└── guides/
    └── delta-lake-best-practices.md
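A YAML-driven maintenance scheduler like the one in `configs/table_maintenance.yaml` typically decides which operations are overdue by comparing each table's last run against a per-layer interval. A minimal sketch of that decision logic, in plain Python; the config keys and the `due_operations` helper are hypothetical, not the package's actual API:

```python
from datetime import datetime, timedelta

# Hypothetical per-layer schedule, as might be loaded from table_maintenance.yaml
CONFIG = {
    "bronze": {"optimize_every_days": 7, "vacuum_every_days": 30},
    "silver": {"optimize_every_days": 1, "vacuum_every_days": 7},
}

def due_operations(layer: str, last_runs: dict, now: datetime) -> list[str]:
    """Return which maintenance operations are overdue for a layer.

    last_runs maps an operation name ("optimize", "vacuum") to the
    datetime it last completed; a missing entry means it never ran.
    """
    cfg = CONFIG[layer]
    due = []
    for op in ("optimize", "vacuum"):
        interval = timedelta(days=cfg[f"{op}_every_days"])
        last = last_runs.get(op)
        if last is None or now - last >= interval:
            due.append(op)
    return due
```

A runner would then issue the corresponding `OPTIMIZE` / `VACUUM` statements only for the operations returned.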

Getting Started

1. Run an SCD Type 2 Merge


from src.merge_patterns import scd2_merge

scd2_merge(
    target_table="catalog.silver.dim_customer",
    source_df=incoming_customers,
    merge_keys=["customer_id"],
    tracked_columns=["email", "address", "phone"],
)
*... continues with setup instructions, usage examples, and more.*

📄 Code Sample .py preview

src/change_data_feed.py

"""
Delta Lake Change Data Feed (CDF) Processing
=============================================

Read, filter, and process change data from Delta tables with CDF enabled.
Supports incremental processing with checkpoint-based watermarking.

Datanest Digital | https://datanest.dev
"""
from __future__ import annotations

import json
import logging
from datetime import datetime
from typing import Optional

from pyspark.sql import DataFrame
from pyspark.sql import functions as F

# Databricks globals
spark = spark  # type: ignore[name-defined]
dbutils = dbutils  # type: ignore[name-defined]

logger = logging.getLogger(__name__)


def enable_cdf(table_name: str) -> None:
    """Enable Change Data Feed on a Delta table.

    Args:
        table_name: Fully qualified table name.
    """
    spark.sql(
        f"ALTER TABLE {table_name} SET TBLPROPERTIES "
        f"(delta.enableChangeDataFeed = true)"
    )
    logger.info("CDF enabled on %s", table_name)


def read_cdf_changes(
    table_name: str,
    starting_version: Optional[int] = None,
    starting_timestamp: Optional[str] = None,
    ending_version: Optional[int] = None,
    ending_timestamp: Optional[str] = None,
) -> DataFrame:
    """Read Change Data Feed from a Delta table.

    You must provide either starting_version or starting_timestamp.

# ... 124 more lines ...