DevOps from Zero to Hero: Database Migrations and Zero-Downtime Deployments
Support this blog
If you find this content useful, consider supporting the blog.
Introduction
Welcome to article seventeen of the DevOps from Zero to Hero series. In the previous articles we built a complete CI/CD pipeline, set up observability, and deployed our TypeScript API to Kubernetes. Everything works, the pipeline is green, and deploys are smooth. But there is one topic we have been quietly avoiding: the database.
Deploying new application code is relatively straightforward. You build a new image, roll it out, and if something goes wrong, you roll back. But database changes are different. They are stateful. You cannot just “undo” a column drop. They affect every instance of your application simultaneously. And if you get the ordering wrong, you can take down your entire service.
In this article we will cover what database migrations are and how they work, how to write safe migrations using Prisma, the expand-contract pattern for making backwards-compatible schema changes, zero-downtime deployment strategies in Kubernetes, health checks and readiness probes, and rollback strategies for when things go wrong. By the end, you will know how to ship database changes confidently, even under production traffic.
Let’s get into it.
Why database changes are the riskiest part of deployments
Application code is stateless. If you deploy a bad version, you roll back to the previous container image and the problem is gone. The old code runs exactly as it did before. But databases are stateful. Once you drop a column, that data is gone. Once you rename a table, every query that references the old name breaks instantly.
Here is what makes database changes so dangerous:
- They are shared state: Every pod, every instance, every replica reads from the same database. A schema change affects all of them at once. You cannot do a gradual rollout of a database change the way you can with application code.
- They are hard to reverse: Adding a column is easy to undo (just drop it). But dropping a column, changing a column type, or deleting data? Those operations are destructive. You cannot “undelete” a column and get the data back.
- Ordering matters: If your application code expects a column that does not exist yet, it crashes. If your migration removes a column that old application code still references, it crashes. The sequencing between code deploys and schema changes is critical.
- They hold locks: Many schema changes (especially on large tables) acquire locks that block reads or writes. A migration that takes 30 seconds to run on your dev database might take 30 minutes on a production table with millions of rows, locking out all traffic.
The core challenge is this: during a deployment, you will have old code and new code running at the same time. Your database schema must be compatible with both versions simultaneously. This constraint drives every decision we will make in this article.
Migration fundamentals
A database migration is a versioned, incremental change to your database schema. Instead of manually running SQL statements against your database, you write migration files that describe the change, and a migration tool applies them in order.
Every migration has two parts:
- Up: The forward change. Create a table, add a column, create an index. This is what runs when you apply the migration.
- Down: The reverse change. Drop the table, remove the column, remove the index. This is what runs when you roll back the migration.
Migration files are typically named with a timestamp or sequence number so the tool knows what order to run them in:
migrations/
20260601120000_create_users_table/
migration.sql
20260602090000_add_email_to_orders/
migration.sql
20260603140000_create_audit_log/
migration.sql
The migration tool keeps track of which migrations have been applied in a special table (usually
called _prisma_migrations or schema_migrations). When you run migrate, it checks which
migrations are pending and applies them in order. This gives you a complete, auditable history of
every schema change, the ability to reproduce your database schema from scratch, and a mechanism
to roll back changes when needed.
Setting up migrations with Prisma
We will use Prisma as our TypeScript ORM and migration tool. Prisma takes a different approach from traditional migration tools: you define your schema in a declarative file, and Prisma generates the SQL migrations for you.
First, install Prisma in your project:
npm install prisma --save-dev
npm install @prisma/client
# Initialize Prisma with PostgreSQL
npx prisma init --datasource-provider postgresql
This creates a prisma/schema.prisma file. Let’s define our first model:
// prisma/schema.prisma
generator client {
provider = "prisma-client-js"
}
datasource db {
provider = "postgresql"
url = env("DATABASE_URL")
}
model User {
id Int @id @default(autoincrement())
name String
email String @unique
createdAt DateTime @default(now())
updatedAt DateTime @updatedAt
orders Order[]
}
model Order {
id Int @id @default(autoincrement())
amount Float
status String @default("pending")
userId Int
user User @relation(fields: [userId], references: [id])
createdAt DateTime @default(now())
}
Now generate the first migration:
npx prisma migrate dev --name create_users_and_orders
Prisma compares your schema file to the current database state, generates the SQL, and applies it. The generated migration looks like this:
-- CreateTable
CREATE TABLE "User" (
"id" SERIAL NOT NULL,
"name" TEXT NOT NULL,
"email" TEXT NOT NULL,
"createdAt" TIMESTAMP(3) NOT NULL DEFAULT CURRENT_TIMESTAMP,
"updatedAt" TIMESTAMP(3) NOT NULL,
CONSTRAINT "User_pkey" PRIMARY KEY ("id")
);
-- CreateTable
CREATE TABLE "Order" (
"id" SERIAL NOT NULL,
"amount" DOUBLE PRECISION NOT NULL,
"status" TEXT NOT NULL DEFAULT 'pending',
"userId" INTEGER NOT NULL,
"createdAt" TIMESTAMP(3) NOT NULL DEFAULT CURRENT_TIMESTAMP,
CONSTRAINT "Order_pkey" PRIMARY KEY ("id")
);
-- CreateIndex
CREATE UNIQUE INDEX "User_email_key" ON "User"("email");
-- AddForeignKey
ALTER TABLE "Order" ADD CONSTRAINT "Order_userId_fkey"
FOREIGN KEY ("userId") REFERENCES "User"("id") ON DELETE RESTRICT ON UPDATE CASCADE;
Now let’s add a column. Say we need a phone field on the User model:
model User {
id Int @id @default(autoincrement())
name String
email String @unique
phone String? // nullable, so existing rows are fine
createdAt DateTime @default(now())
updatedAt DateTime @updatedAt
orders Order[]
}
npx prisma migrate dev --name add_phone_to_users
The generated SQL:
-- AlterTable
ALTER TABLE "User" ADD COLUMN "phone" TEXT;
Notice that the column is nullable (TEXT without NOT NULL). This is important. If we made it
required, the migration would fail because existing rows would not have a value for the new column.
Making new columns nullable (or giving them a default value) is one of the most basic safe migration
patterns.
Now let’s rename a column. Say we want to rename name to fullName. In Prisma, you use the
@map attribute to rename the database column without changing the Prisma field name:
model User {
id Int @id @default(autoincrement())
fullName String @map("full_name")
email String @unique
phone String?
createdAt DateTime @default(now())
updatedAt DateTime @updatedAt
orders Order[]
@@map("users")
}
But hold on. If we just rename the column directly, any application code that still queries the old column name will break. This is exactly the kind of dangerous operation we need to handle with the expand-contract pattern, which we will cover next.
Safe migration patterns
The golden rule of safe migrations is: never make a breaking change in a single deploy. Instead, break it into multiple steps where each step is backwards-compatible.
Here are the patterns you should follow:
Adding a column (safe)
Adding a nullable column or a column with a default value is always safe. Old code ignores the new column. New code can use it.
-- Safe: nullable column, old code ignores it
ALTER TABLE "User" ADD COLUMN "phone" TEXT;
-- Safe: column with default, old code ignores it
ALTER TABLE "Order" ADD COLUMN "currency" TEXT NOT NULL DEFAULT 'USD';
Adding an index (safe, but watch lock time)
Creating an index is safe from a compatibility standpoint, but on large tables it can lock the table
for a long time. Use CONCURRENTLY in PostgreSQL to avoid blocking writes:
-- This locks the table until the index is built (dangerous on large tables)
CREATE INDEX idx_orders_user_id ON "Order" ("userId");
-- This builds the index without locking (safe for production)
CREATE INDEX CONCURRENTLY idx_orders_user_id ON "Order" ("userId");
Creating a new table (safe)
New tables do not affect existing code at all. Always safe.
Dropping a column (dangerous if done wrong)
If you drop a column that old application code still reads from, those queries will fail. Never drop a column in the same deploy that still references it.
Renaming a column (dangerous if done in one step)
A column rename is essentially a drop-and-add from the application’s perspective. Old code queries the old name. New code queries the new name. During a rolling update, both versions run simultaneously, and one of them will always be broken.
The expand-contract pattern
The expand-contract pattern is the standard way to make breaking schema changes safely. It works in three phases:
Phase 1: Expand (add the new thing)
Add the new column alongside the old one. Update your application code to write to both columns. Deploy this change.
-- Migration 1: Add the new column
ALTER TABLE "users" ADD COLUMN "full_name" TEXT;
// Application code writes to both columns
async function updateUser(id: number, name: string) {
await prisma.user.update({
where: { id },
data: {
name: name, // old column (for old code still reading it)
fullName: name, // new column (for new code)
},
});
}
Phase 2: Migrate data
Backfill the new column with data from the old column. This can be a migration script or a background job.
-- Migration 2: Backfill existing data
UPDATE "users" SET "full_name" = "name" WHERE "full_name" IS NULL;
At this point, both columns have the same data. Old code reads from name, new code reads from
full_name, and everything works.
Phase 3: Contract (remove the old thing)
Once all application code has been updated to use the new column, and you have verified that no queries reference the old column, you can drop it.
-- Migration 3: Drop the old column (only after all code uses full_name)
ALTER TABLE "users" DROP COLUMN "name";
This three-phase approach means that at every point during the rollout, the database schema is compatible with both the old and new versions of your code. No downtime, no errors, no data loss.
Here is the timeline:
# Expand-contract timeline
#
# Deploy 1: Add "full_name" column, write to both columns
# Old code: reads "name" -> works (column still exists)
# New code: reads "full_name" -> works (column was just added)
#
# Deploy 2: Backfill "full_name" from "name"
# All rows now have both columns populated
#
# Deploy 3: Remove all reads from "name", drop column
# Old code: gone (fully rolled out)
# New code: reads "full_name" -> works (only column left)
Yes, this takes three deploys instead of one. That is the trade-off. Safety costs velocity, but it saves you from 3am incidents.
Dangerous patterns to avoid
Here are the schema changes that cause the most outages, and how to handle them instead:
- Renaming a column in a single deploy: This is a drop plus add. Use expand-contract instead. Add the new column, backfill, update code, then drop the old column.
- Changing a column type in place: Changing
VARCHAR(50)toTEXTmight seem harmless, but changingTEXTtoINTEGERwill fail if any rows contain non-numeric data. Add a new column with the new type, backfill with type conversion, switch code, then drop the old column.- Adding a NOT NULL constraint without a default: If you add
NOT NULLto an existing column that has null values, the migration will fail. First backfill all nulls, then add the constraint.- Dropping a table that is still referenced: Foreign key constraints will block the drop, but application code will crash. Remove all code references first, then drop.
- Running large data migrations in the main transaction: Updating millions of rows in a single transaction locks the table and can cause timeouts. Batch your updates (1000-5000 rows at a time) with small pauses between batches.
A useful rule of thumb: if a migration cannot be reversed with a simple “undo” migration, it is dangerous and needs the expand-contract treatment.
Zero-downtime deployments in Kubernetes
Now that we know how to handle database changes safely, let’s look at the other half of the puzzle: deploying application code without dropping requests. Kubernetes gives you several mechanisms for this.
Rolling updates
The default Kubernetes deployment strategy is RollingUpdate. It gradually replaces old pods with
new pods, ensuring that some pods are always available to serve traffic.
apiVersion: apps/v1
kind: Deployment
metadata:
name: myapp
spec:
replicas: 3
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1 # At most 1 extra pod during rollout
maxUnavailable: 0 # Never have fewer than 3 healthy pods
selector:
matchLabels:
app: myapp
template:
metadata:
labels:
app: myapp
spec:
containers:
- name: myapp
image: myapp:v2
ports:
- containerPort: 3000
With maxSurge: 1 and maxUnavailable: 0, Kubernetes will:
- Step 1: Create 1 new pod (v2). Now you have 3 old + 1 new = 4 pods total.
- Step 2: Wait until the new pod passes its readiness probe.
- Step 3: Terminate 1 old pod (v1). Now you have 2 old + 1 new = 3 pods.
- Step 4: Create another new pod (v2). Now you have 2 old + 2 new = 4 pods.
- Repeat until all pods are running v2.
During this process, both v1 and v2 pods are serving traffic. This is exactly why your database schema must be compatible with both versions.
The Recreate strategy
The Recreate strategy kills all old pods before creating new ones. This means downtime, so you
should only use it when your application cannot run two versions simultaneously (for example, if it
holds an exclusive lock on a resource).
strategy:
type: Recreate
For almost all web applications, RollingUpdate is what you want.
Health checks: liveness, readiness, and startup probes
Probes are how Kubernetes knows if your pod is healthy. There are three types, and each one serves a different purpose:
- Readiness probe: “Is this pod ready to receive traffic?” Kubernetes only sends traffic to pods that pass their readiness probe. During a deployment, new pods will not receive traffic until they are ready. This is the most important probe for zero-downtime deployments.
- Liveness probe: “Is this pod still alive?” If a pod fails its liveness probe, Kubernetes restarts it. Use this to recover from deadlocks or stuck processes. Be careful: if your liveness probe is too aggressive, Kubernetes will restart pods that are just slow, creating a crash loop.
- Startup probe: “Has this pod finished starting up?” This is for applications with a slow startup (loading large caches, running migrations). The startup probe runs first, and liveness/readiness probes do not start until it passes.
Here is how to configure all three:
containers:
- name: myapp
image: myapp:v2
ports:
- containerPort: 3000
readinessProbe:
httpGet:
path: /health/ready
port: 3000
initialDelaySeconds: 5
periodSeconds: 10
failureThreshold: 3
livenessProbe:
httpGet:
path: /health/live
port: 3000
initialDelaySeconds: 15
periodSeconds: 20
failureThreshold: 3
startupProbe:
httpGet:
path: /health/started
port: 3000
failureThreshold: 30
periodSeconds: 10
And here is what the health check endpoints look like in your TypeScript API:
// Health check endpoints
app.get("/health/live", (req, res) => {
// Liveness: is the process running?
// Keep this simple. If this endpoint responds, the process is alive.
res.status(200).json({ status: "alive" });
});
app.get("/health/ready", async (req, res) => {
// Readiness: can this pod serve traffic?
// Check that all dependencies are reachable.
try {
await prisma.$queryRaw`SELECT 1`; // database is reachable
res.status(200).json({ status: "ready" });
} catch (error) {
res.status(503).json({ status: "not ready", error: "database unreachable" });
}
});
app.get("/health/started", (req, res) => {
// Startup: has the app finished initializing?
if (appIsInitialized) {
res.status(200).json({ status: "started" });
} else {
res.status(503).json({ status: "starting" });
}
});
A common mistake is making the liveness probe too strict. If your liveness probe checks the database, and the database has a brief network hiccup, Kubernetes will restart all your pods at once, making the situation much worse. Keep liveness probes simple (just “is the process running?”) and use readiness probes for dependency checks.
Graceful shutdown: preStop hooks and connection draining
When Kubernetes terminates a pod during a rolling update, it sends a SIGTERM signal. Your
application should catch this signal and stop accepting new requests while finishing in-flight
requests. But there is a race condition: Kubernetes removes the pod from the service endpoints at
the same time it sends SIGTERM, and the endpoint removal takes a moment to propagate. During that
window, traffic can still be routed to a pod that is shutting down.
The fix is a preStop hook that adds a small delay:
containers:
- name: myapp
image: myapp:v2
lifecycle:
preStop:
exec:
command: ["sh", "-c", "sleep 10"]
This tells Kubernetes to wait 10 seconds before sending SIGTERM. During those 10 seconds, the pod
is removed from the service endpoints, so no new traffic is routed to it. After the sleep, SIGTERM
is sent and the application can shut down gracefully.
In your TypeScript application, handle the shutdown signal:
// Graceful shutdown handler
process.on("SIGTERM", () => {
console.log("SIGTERM received. Starting graceful shutdown...");
// Stop accepting new connections
server.close(() => {
console.log("HTTP server closed. Cleaning up...");
// Close database connections
prisma.$disconnect().then(() => {
console.log("Database disconnected. Exiting.");
process.exit(0);
});
});
// Force exit after 30 seconds if graceful shutdown hangs
setTimeout(() => {
console.error("Graceful shutdown timed out. Forcing exit.");
process.exit(1);
}, 30000);
});
Also, set terminationGracePeriodSeconds on the pod spec to give your application enough time to
drain. The default is 30 seconds, but adjust it based on how long your longest requests take:
spec:
terminationGracePeriodSeconds: 60
containers:
- name: myapp
# ...
PodDisruptionBudgets
A PodDisruptionBudget (PDB) tells Kubernetes how many pods must remain available during voluntary disruptions like node drains, cluster upgrades, or autoscaler scale-downs. Without a PDB, Kubernetes could drain all your nodes at once during a cluster upgrade, taking down every pod simultaneously.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: myapp-pdb
spec:
minAvailable: 2 # At least 2 pods must always be running
selector:
matchLabels:
app: myapp
You can also use maxUnavailable instead of minAvailable:
spec:
maxUnavailable: 1 # At most 1 pod can be down at a time
For a deployment with 3 replicas, minAvailable: 2 and maxUnavailable: 1 are equivalent. Use
whichever reads more clearly for your team.
Blue-green deployments
Blue-green deployments take a different approach from rolling updates. Instead of gradually replacing pods, you run two complete environments simultaneously: the “blue” environment (current production) and the “green” environment (new version). Once the green environment is validated, you switch traffic from blue to green in one step.
Here is how to implement blue-green with Kubernetes services:
# Blue deployment (current production, running v1)
apiVersion: apps/v1
kind: Deployment
metadata:
name: myapp-blue
spec:
replicas: 3
selector:
matchLabels:
app: myapp
version: blue
template:
metadata:
labels:
app: myapp
version: blue
spec:
containers:
- name: myapp
image: myapp:v1
---
# Green deployment (new version, running v2)
apiVersion: apps/v1
kind: Deployment
metadata:
name: myapp-green
spec:
replicas: 3
selector:
matchLabels:
app: myapp
version: green
template:
metadata:
labels:
app: myapp
version: green
spec:
containers:
- name: myapp
image: myapp:v2
---
# Service (points to blue initially)
apiVersion: v1
kind: Service
metadata:
name: myapp
spec:
selector:
app: myapp
version: blue # Change to "green" to switch traffic
ports:
- port: 80
targetPort: 3000
To switch traffic, you update the service selector from version: blue to version: green. All
traffic moves at once. If something is wrong, you switch back to blue.
- When to use blue-green: When you need instant rollback (one label change), when you want to validate the new version with real traffic patterns before committing, or when your application cannot tolerate mixed versions.
- When to stick with rolling updates: For most standard deployments. Blue-green requires double the resources (both environments run simultaneously) and adds operational complexity.
Rollback strategies
No matter how careful you are, things will go wrong. Having a rollback plan is not optional. Here are the three main strategies:
Kubernetes rollback
Kubernetes keeps a history of your deployments. You can roll back to a previous version with a single command:
# See rollout history
kubectl rollout history deployment/myapp
# Roll back to the previous version
kubectl rollout undo deployment/myapp
# Roll back to a specific revision
kubectl rollout undo deployment/myapp --to-revision=3
# Watch the rollback progress
kubectl rollout status deployment/myapp
This only rolls back the application code (the container image). It does not roll back database migrations. If your migration was additive (adding a column), the old code simply ignores the new column, and there is nothing to roll back. If your migration was destructive (dropping a column), you need a database rollback.
Database rollback
Prisma does not have a built-in “undo last migration” command for production. In production, you write a new migration that reverses the change:
# In development, you can reset (destroys all data)
npx prisma migrate reset
# In production, create a new "undo" migration
npx prisma migrate dev --name undo_add_phone_to_users
The “undo” migration is just another migration that reverses the previous change:
-- Undo migration: remove the phone column
ALTER TABLE "User" DROP COLUMN "phone";
This is another reason to prefer additive migrations. Adding a column is easy to undo (drop it). Dropping a column is impossible to undo (the data is gone). If you follow the expand-contract pattern, your “undo” is always just “drop the column you added.”
Feature flags as an alternative to rollbacks
Instead of rolling back code or database changes, you can use feature flags to disable the new functionality without changing the deployed code:
// Feature flag check
app.get("/api/orders", async (req, res) => {
const orders = await prisma.order.findMany({
include: {
user: true,
},
});
if (featureFlags.isEnabled("show-order-currency")) {
// New behavior: include currency field
return res.json(orders.map(o => ({
...o,
currency: o.currency ?? "USD",
})));
}
// Old behavior: no currency field
return res.json(orders);
});
Feature flags let you decouple deployment from release. You deploy the code (including the migration), but the new feature is behind a flag. If something goes wrong, you flip the flag off. No rollback needed, no redeployment, no database undo.
Practical example: adding a column under production traffic
Let’s put it all together with a real scenario. We need to add a currency column to the Order
table. The API is serving traffic, and we cannot afford any downtime.
Step 1: Write the migration
Update the Prisma schema:
model Order {
id Int @id @default(autoincrement())
amount Float
status String @default("pending")
currency String @default("USD") // new column with a default
userId Int
user User @relation(fields: [userId], references: [id])
createdAt DateTime @default(now())
}
Generate and review the migration:
npx prisma migrate dev --name add_currency_to_orders
Generated SQL:
ALTER TABLE "Order" ADD COLUMN "currency" TEXT NOT NULL DEFAULT 'USD';
This is safe because the column has a default value, so existing rows get 'USD' automatically,
and old application code that does not know about the column will simply ignore it.
Step 2: Update the application code
Update the API to use the new column:
// Updated order creation endpoint
app.post("/api/orders", async (req, res) => {
const { amount, userId, currency } = req.body;
const order = await prisma.order.create({
data: {
amount,
userId,
currency: currency ?? "USD", // use provided currency or default
},
});
res.status(201).json(order);
});
Step 3: Run the migration in CI/CD
Add a migration step to your CI/CD pipeline that runs before the application deployment:
# In your GitHub Actions workflow
jobs:
migrate:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Setup Node
uses: actions/setup-node@v4
with:
node-version: "20"
- name: Install dependencies
run: npm ci
- name: Run database migrations
run: npx prisma migrate deploy
env:
DATABASE_URL: ${{ secrets.DATABASE_URL }}
deploy:
needs: migrate # Deploy only after migrations succeed
runs-on: ubuntu-latest
steps:
- name: Update deployment image
run: |
kubectl set image deployment/myapp \
myapp=myapp:${{ github.sha }}
- name: Wait for rollout
run: kubectl rollout status deployment/myapp --timeout=300s
Step 4: Verify
After the deploy, verify everything is working:
# Check the migration was applied
npx prisma migrate status
# Test the API
curl -X POST https://api.example.com/api/orders \
-H "Content-Type: application/json" \
-d '{"amount": 29.99, "userId": 1, "currency": "EUR"}'
# Verify the response includes the currency
curl https://api.example.com/api/orders/1
Because we used an additive change with a default value, the entire process was zero-downtime. Old
pods that do not know about the currency column kept serving traffic while new pods rolled out.
No conflicts, no errors, no interruption.
Migration checklist
Before running any migration in production, go through this checklist:
- Is the migration additive? Adding columns (nullable or with defaults), adding tables, and adding indexes are safe. Everything else needs extra care.
- Can old code work with the new schema? During a rolling update, old and new code run simultaneously. Make sure the old code will not break.
- Can new code work with the old schema? If the migration fails or is delayed, can the new code still function?
- Have you tested the migration on a copy of production data? Your dev database has 100 rows. Production has 10 million. What takes 1 second in dev might take 10 minutes in production.
- Do you have a rollback plan? What SQL would you run to undo this migration? Write it down before you deploy.
- Are you using
CONCURRENTLYfor index creation? On large tables, index creation locks the table. UseCREATE INDEX CONCURRENTLYin PostgreSQL.- Are you batching large data migrations? Do not update millions of rows in a single transaction. Batch them.
What comes next
We now know how to make database changes safely, deploy application code without downtime, and roll back when things go wrong. In the next article, we will explore security in the CI/CD pipeline: scanning for vulnerabilities, managing secrets, and hardening your deployment process.
Hope you found this useful and enjoyed reading it, until next time!
Errata
If you spot any error or have any suggestion, please send me a message so it gets fixed.
Also, you can check the source code and changes in the sources here
$ Comments
Online: 0Please sign in to be able to write comments.