
Part 2: Building a Strong Data Engineering Team on Databricks and Google Cloud

CI/CD in Action (Expanded Edition)

Deep dive into CI/CD practices for data engineering teams using Databricks and Google Cloud, covering packaging, pipelines, security, and practical implementation roadmap.

The Content Team
September 16, 2025
15 min read

Introduction

In Part 1, we described the target state for a modern data engineering team and how to structure people and roles so work gets done. In Part 2, we go deep on the engine that keeps that target state healthy day after day: CI/CD.

CI/CD stands for Continuous Integration and Continuous Delivery. In simple terms, CI/CD is the habit of packaging, testing, and deploying changes automatically and reliably. When CI/CD is in place, you reduce human error, move faster, and keep production stable—even while multiple teams are shipping new pipelines and features.

This article explains not just what to set up on Databricks + Google Cloud, but also why each choice matters, with practical examples and a step-by-step rollout plan.

What is CI/CD and Why Does It Matter?

Continuous Integration (CI)

Every time a developer changes code, the system automatically checks that change. It can:

  • Run quick tests (for Python, SQL, PySpark)
  • Lint code (check style and basic mistakes)
  • Validate schemas and configurations
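
For example, a CI check in this spirit might be a small pytest-style test of a PySpark function that also asserts the output schema. This is a minimal sketch; the function and column names are illustrative, not from a real codebase:

    # test_orders.py - example CI check: a unit test plus a schema assertion (illustrative names)
    from pyspark.sql import SparkSession, functions as F

    def add_net_amount(df):
        # Transformation under test: net amount = gross amount minus discount.
        return df.withColumn("net_amount", F.col("amount") - F.col("discount"))

    def test_add_net_amount():
        spark = SparkSession.builder.master("local[1]").getOrCreate()
        df = spark.createDataFrame([(1, 100.0, 10.0)], ["order_id", "amount", "discount"])
        result = add_net_amount(df)

        # Schema check: fail fast if an expected column is missing or renamed.
        assert {"order_id", "amount", "discount", "net_amount"} <= set(result.columns)
        # Logic check: 100.0 - 10.0 == 90.0
        assert result.first()["net_amount"] == 90.0

A linter such as ruff or flake8 typically runs alongside these tests in the same CI job.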

Continuous Delivery (CD)

Tested changes get packaged and deployed to dev, then staging, then production, using repeatable steps.

Dev → Staging → Production

Why CI/CD is Critical for Data Teams

Schema drift is real

A column gets renamed upstream, and jobs break. CI catches this early with schema tests, before production fails at 2 a.m.

Many cooks in the kitchen

Multiple squads push code. CI/CD keeps environments consistent so one team's change doesn't silently break another team's job.

Speed with safety

The business wants new data and features quickly. CI/CD gives you both speed (automated steps) and safety (tests and approvals).

Traceability & compliance

Auditors and FinOps want to know who changed what, when, and how. CI/CD plus version control makes this easy.

Mental model: CI/CD is the paved road your teams drive on. It's smoother and safer than everyone taking their own side streets.

CI/CD on Databricks + Google Cloud

Five key components that work together to create a robust, secure, and scalable CI/CD pipeline

1. Packaging with Databricks Asset Bundles (DABs)

The shipping container for your data pipelines

What it is:

Databricks Asset Bundles (DABs) package your notebooks, libraries, and configuration into a single, consistent unit. Think of a DAB as a shipping box that carries your pipeline from dev to prod without repacking.

What goes inside a bundle:

  • Code (notebooks, Python modules)
  • Job/workflow definitions
  • Environment-specific settings
  • Dependencies (library versions)

Why it matters:

Consistency

The same bundle that worked in dev runs in prod with only environment variables changed—no hand edits.

Fewer mistakes

Configuration lives next to code, reducing mismatch errors.

Faster rollbacks

Re-deploy a known-good bundle if something goes wrong.

Example:

In dev, a job writes to dev_sales.silver.orders. In prod, it writes to prod_sales.silver.orders. The same bundle switches targets by reading an environment variable.
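
A minimal sketch of that pattern inside the job's code, assuming the bundle's job configuration injects a TARGET_CATALOG environment variable per environment (the variable, catalog, and table names are illustrative):

    # orders_silver.py - same code in every environment; only configuration changes
    import os
    from pyspark.sql import SparkSession

    # Injected by the bundle's job configuration: "dev_sales" in dev, "prod_sales" in prod.
    catalog = os.environ.get("TARGET_CATALOG", "dev_sales")

    spark = SparkSession.builder.getOrCreate()
    orders = spark.table(f"{catalog}.bronze.orders")
    deduplicated = orders.dropDuplicates(["order_id"])
    deduplicated.write.mode("overwrite").saveAsTable(f"{catalog}.silver.orders")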

2. Pipelines with GitHub Actions or Cloud Build

The automation engine that runs your CI/CD steps

What a typical pipeline does:

CI Stage → Build Stage → Deploy Dev → Promote Staging → Promote Production

When to prefer which:

GitHub Actions

If your code lives in GitHub and you want tight integration and simple setup.

Cloud Build

If you prefer a GCP-native service and want deeper Google Cloud integrations.

Why it matters:

Automation reduces errors

The computer follows the same steps every time.

Guardrails

Approvals and environment protection rules ensure sensitive steps are controlled.

Speed

Engineers spend time on features, not on manual deployments.

3. Infrastructure as Code (IaC) with Terraform

Declare and manage infrastructure through code

What to manage with Terraform:

  • Workspaces and workspace settings
  • Unity Catalog: metastore, catalogs, schemas
  • Cluster policies and pools
  • Service principals and permissions

Why it matters:

Consistency

Environments (dev/stage/prod) are clones, not hand-built cousins.

Auditability

Infra changes are reviewed via PR like code.

Rebuilds & recovery

If something breaks, re-apply state to restore a known configuration.

Best Practices:
  • Store Terraform state in a locked GCS bucket
  • Create reusable modules
  • Use plan → review → apply via CI

4. Security with Workload Identity Federation (WIF)

Keyless authentication for CI/CD

What it is:

Workload Identity Federation lets your CI runners (GitHub Actions or Cloud Build) authenticate to GCP without long-lived keys. The runner exchanges its own identity token for short-lived Google Cloud credentials.

Why it matters:

  • No secrets in repo
  • Rotation by design
  • Least privilege

High-level flow:

  1. Runner proves identity
  2. Google trusts that identity
  3. GCP issues short-lived token
  4. Pipeline performs allowed actions

5. Service Principals and OAuth (Databricks)

Non-human accounts for pipeline access

What they are:

A service principal is a non-human account your pipelines use to talk to Databricks. OAuth provides tokens for that access.

Why it matters:

Separation of duties:

Deployments do not depend on personal accounts

Traceability:

Actions are clearly attributed to "CI/CD bot"

Least privilege:

Principal gets only the roles it needs

Practice notes:
  • Use Unity Catalog roles to scope access
  • Rotate tokens automatically
  • Prefer OAuth M2M over PATs
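
A minimal sketch of OAuth machine-to-machine authentication with the Databricks Python SDK, assuming the service principal's credentials are injected as environment variables by CI (the workspace and variable names are placeholders):

    # ci_client.py - authenticate to Databricks as a service principal via OAuth M2M (no PATs)
    import os
    from databricks.sdk import WorkspaceClient

    # The client ID and secret belong to the service principal; inject them from the
    # CI secret store at runtime, never commit them to the repository.
    w = WorkspaceClient(
        host=os.environ["DATABRICKS_HOST"],
        client_id=os.environ["DATABRICKS_CLIENT_ID"],
        client_secret=os.environ["DATABRICKS_CLIENT_SECRET"],
    )

    # Example call: list the jobs this principal is allowed to see.
    for job in w.jobs.list():
        print(job.settings.name)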

What "Good" Looks Like in CI/CD

Every Commit Triggers Tests

Push code → run unit tests, linting, and schema checks. Problems surface immediately.

Examples: Unit test PySpark functions, schema tests, contract tests

Bundles Build Automatically

DABs get created by the pipeline with correct configs. No manual repacking.

Examples: Single bundle with per-env overrides, includes job JSON/YAML

Environment Flow

Dev → Staging → Prod, with automated checks and approvals.

Practices: Trunk-based development, manual approvals at boundaries

Secure Secrets

Secrets in GCP Secret Manager; CI uses WIF; Databricks uses service principals.

Practices: Never store secrets in Git, rotate regularly

Everything Versioned

Code, infra, configs, and data-model definitions in version control.

Extras: Use tags/releases, semantic versioning, Delta Lake time travel
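
As a small illustration of Delta Lake time travel on Databricks, assuming an existing spark session and a Delta table (the table name and version number are placeholders):

    # Compare an earlier table version with the current one, e.g. for audits or rollback checks.
    previous = spark.sql("SELECT * FROM prod_sales.silver.orders VERSION AS OF 42")
    latest = spark.table("prod_sales.silver.orders")
    print(previous.count(), latest.count())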

Health Checklist

  • Tests run on every PR
  • One-click deployments
  • No secrets in code
  • Auto rollback capability

Roles and Responsibilities

Clear ownership and accountability across the CI/CD pipeline

Platform Team

Data Platform + DevEx

What:

Owns the paved road—DAB templates, Terraform modules, cluster policies, and CI standards.

Why:

Gives squads a secure, fast, and consistent way to ship.

Day-to-day:

  • Improve templates (better logging, testing harnesses)
  • Tune cluster policies to reduce cost
  • Maintain reference implementations

DataOps/DevOps Engineer

Pipeline Automation

What:

Builds and maintains CI/CD pipelines, runners, and deployment logic.

Why:

Ensures automation is reliable and secure.

Day-to-day:

  • Add new checks (e.g., SQL linting)
  • Fix flaky tests/pipelines
  • Set up WIF and Secret Manager integrations

Stream-Aligned Squads

Domain Owners

What:

Own their domain's pipelines end-to-end using the paved road.

Why:

They deliver business value while following common standards.

Day-to-day:

  • Write tests with each feature
  • Keep configs environment-aware
  • Use PR reviews and follow promotion rules

System Architect

Standards & Coherence

What:

Defines standards for environments, naming, lineage, and non-functional requirements (NFRs).

Why:

Keeps the platform coherent as teams scale.

Day-to-day:

  • Approves major changes to pipeline patterns
  • Guides decisions on catalog structure and access

RTE

Release Train Engineer

What:

Coordinates release cadence across teams; manages dependencies and change windows.

Why:

Avoids collisions (e.g., three squads changing the same table this week).

Day-to-day:

  • Publish release calendar
  • Facilitate go/no-go gates for prod deployments
  • Track risks and rollback drills

Common Pitfalls and How to Avoid Them

Learn from common mistakes to build a more robust CI/CD pipeline

1. Skipping Tests

Smell: "We'll test in staging."
Fix: Enforce minimum test coverage; block merges if tests fail.

2. Hardcoding Secrets

Smell: Tokens in notebooks or YAML.
Fix: Secret Manager + WIF; environment variables injected at runtime.
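
A minimal sketch of fetching a secret at runtime with the Google Cloud Secret Manager client library, with placeholder project and secret names:

    # get_secret.py - fetch a secret at runtime instead of hardcoding it
    from google.cloud import secretmanager

    def get_secret(project_id: str, secret_id: str, version: str = "latest") -> str:
        client = secretmanager.SecretManagerServiceClient()
        name = f"projects/{project_id}/secrets/{secret_id}/versions/{version}"
        response = client.access_secret_version(request={"name": name})
        return response.payload.data.decode("utf-8")

    # Example: token = get_secret("my-gcp-project", "databricks-sp-client-secret")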

3. Manual Promotions & Hotfixes

Smell: SSH into prod to "just fix it."
Fix: Require promotions via the pipeline; add an emergency hotfix path that still uses the pipeline.

4. Irreproducible Notebooks

Smell: Code works only in one user's workspace.
Fix: Treat notebooks as code (in Git); package as DABs; test with CI.

5. No Test Data Management

Smell: Tests fail randomly because data changed.
Fix: Use fixtures or snapshot small, stable datasets for CI.
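
One way to do this is a pytest fixture that builds a tiny, deterministic DataFrame so CI tests never depend on live data. A sketch with illustrative names:

    # conftest.py - small, stable test data so CI results do not depend on live tables
    import pytest
    from pyspark.sql import SparkSession

    @pytest.fixture(scope="session")
    def spark():
        return SparkSession.builder.master("local[1]").appName("ci-tests").getOrCreate()

    @pytest.fixture
    def orders_fixture(spark):
        # Tiny, deterministic dataset; column names are illustrative.
        rows = [(1, "2025-01-01", 120.0), (2, "2025-01-02", 75.5)]
        return spark.createDataFrame(rows, ["order_id", "order_date", "amount"])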

6. Over-Engineering Early

Smell: Months building "perfect" pipelines before first value.
Fix: Ship a "hello-world" deploy in week 1–2; iterate.

7. Ignoring Cost & Performance

Smell: CI runs spin up huge clusters for tiny tests.
Fix: Use small clusters for CI; larger clusters only for staging/prod perf tests.

8. No Rollback Plan

Smell: "If it breaks, we'll figure it out."
Fix: Document rollback steps per job; canary releases; keep last known-good bundle.

Quick Checklist: "Is our CI/CD healthy?"

  • Tests run on every PR and on main
  • One command (or one click) can deploy dev → staging → prod
  • Secrets never live in code
  • Failed prod deploys roll back automatically or with one action
  • We can answer who changed what and when

Ready to Transform Your Data Engineering?

Let our experts help you implement these strategies and build a world-class, scalable data platform.


Stay Connected with KData

Follow us on LinkedIn to get the latest insights on data engineering, Databricks, Snowflake, AI strategies, and cloud best practices. Join our professional community of data experts.
