How We Migrated a 1.2TB PostgreSQL Database with Zero Downtime: A Vietnam Offshore Case Study

Migrating a 1.2TB production PostgreSQL database without a single minute of downtime sounds like a nightmare. Here's exactly how we did it with a team of 4 senior engineers in Ho Chi Minh City, using logical replication and a carefully orchestrated cutover plan.

Database migrations are the stuff of nightmares for most engineering teams. Especially when you’re talking about a 1.2TB PostgreSQL instance running a critical SaaS platform with a 99.99% uptime SLA.

Ever tried migrating a multi-terabyte database while the app stays live? It’s like changing the tires on a moving car.

But we did it. Zero downtime. Zero data loss. Here’s the ugly, technical truth of how it happened.

The Client: A US-Based Logistics SaaS Platform

The client ran a real-time shipment tracking platform. Think 10,000+ API calls per minute, 500+ database tables, and a PostgreSQL 12 instance that had grown organically over 6 years. Their on-premise infrastructure was hitting limits. They needed to migrate to AWS RDS PostgreSQL 15.

The constraints:

  • Maximum allowed downtime: 0 minutes
  • Data loss tolerance: 0%
  • Migration window: 4 weeks from kickoff to cutover
  • Budget: Fixed, no room for heroics

They’d tried this internally twice before. Both attempts failed. The second one caused a 47-minute outage that cost them roughly $120,000 in lost revenue.

They called us.

Why Logical Replication, Not pg_dump

Most teams reach for `pg_dump` and `pg_restore` for migrations. Fine for small databases. A non-starter for 1.2TB when downtime isn’t allowed.

Here’s the math:

  • `pg_dump` at ~150MB/s with compression: ~2.5 hours
  • `pg_restore` with indexes and constraints: ~6-8 hours
  • Total downtime: 8-10 hours minimum

That wasn’t going to work.
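The dump estimate is simple arithmetic and easy to re-run for other sizes. A minimal sketch, assuming decimal units (1TB = 1,000,000MB) and a sustained 150MB/s; the 6-8 hour restore range is taken as given from the list above, and `transfer_hours` is a name made up for this example. The raw transfer time comes out slightly under the rounded ~2.5 hours, which leaves some headroom for overhead.

```python
def transfer_hours(size_tb: float, throughput_mb_s: float) -> float:
    """Hours needed to move size_tb at a sustained throughput (decimal units)."""
    size_mb = size_tb * 1_000_000
    return size_mb / throughput_mb_s / 3600


dump_hours = transfer_hours(1.2, 150)

# Restore range quoted in the list above (hours); not derived here.
restore_low, restore_high = 6, 8

total_low = dump_hours + restore_low
total_high = dump_hours + restore_high

print(f"dump ~{dump_hours:.1f}h, total ~{total_low:.1f}-{total_high:.1f}h")
```

Either way the total lands in the 8-10 hour range, all of it with the application offline.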

We chose logical replication using PostgreSQL’s built-in `pgoutput` plugin. It’s slower to set up, but the cutover is nearly instant.

The architecture:


Source (PG 12 on-prem)  →  Logical Replication Slot  →  Target (PG 15 on RDS)
                              ↓
                     Change Data Capture (CDC)
                              ↓
                    Continuous sync (sub-second lag)

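The key insight? Logical replication decodes the source’s write-ahead log (WAL) and streams the row-level changes to the target as transactions commit. The target stays in sync in near real-time. When you’re ready to cut over, you just stop writes to the source, let the target catch up, and point traffic to the target. The switch itself takes about 30 seconds.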
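In PostgreSQL terms, this setup boils down to a publication on the source and a subscription on the target. The sketch below only builds the two SQL statements; it is an illustration, not the exact commands used in this project. The names `migration_pub` and `migration_sub`, the connection string, and the helper functions are all hypothetical.

```python
def _check_identifier(name: str) -> str:
    """Allow only simple unquoted identifiers to keep the generated SQL safe."""
    if not name.isidentifier():
        raise ValueError(f"not a simple identifier: {name!r}")
    return name


def publication_sql(pub_name: str) -> str:
    """Statement to run on the source: publish changes for every table."""
    return f"CREATE PUBLICATION {_check_identifier(pub_name)} FOR ALL TABLES;"


def subscription_sql(sub_name: str, pub_name: str, conninfo: str) -> str:
    """Statement to run on the target: copy existing rows, then stream changes."""
    escaped = conninfo.replace("'", "''")  # escape quotes inside the SQL literal
    return (
        f"CREATE SUBSCRIPTION {_check_identifier(sub_name)} "
        f"CONNECTION '{escaped}' "
        f"PUBLICATION {_check_identifier(pub_name)} "
        "WITH (copy_data = true);"
    )


# Hypothetical names and connection string, for illustration only.
print(publication_sql("migration_pub"))
print(subscription_sql(
    "migration_sub",
    "migration_pub",
    "host=source.example.internal dbname=app user=replicator",
))
```

Creating the subscription also creates a replication slot on the source by default. The target tables must already exist, and tables that receive updates or deletes need a primary key or another replica identity.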

The Team: 4 Senior Engineers in Ho Chi Minh City

We assembled a dedicated team of 4 senior engineers from our Ho Chi Minh City hub. Not juniors. Not mid-levels. Four people who’d done PostgreSQL migrations before and understood the difference between “it works in staging” and “it works at 2AM when production is on fire.”

  • 2 database engineers handled replication setup, schema migration, and performance tuning
  • 1 DevOps engineer managed the AWS infrastructure, monitoring, and rollback scripts
  • 1 QA engineer built the validation pipeline that caught data inconsistencies

They worked overlapping shifts with the US client’s team. That’s the real advantage of the Vietnam timezone: we had 6 hours of overlap with the US East Coast, plus full coverage during their night.

The 3 Biggest Technical Challenges We Faced

Challenge 1: Schema Incompatibility Between PG 12 and PG 15

The jump from PG 12 to PG 15 broke parts of our schema. Specifically, several tables used the `citext` extension with custom collation settings, and those broke during replication.

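The fix: We created a migration layer that transformed the schema before it reached the target. Logical replication doesn’t carry DDL, so the target schema has to be created separately anyway. Instead of applying the source DDL as-is, we ran it through a custom wrapper that mapped incompatible types.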

sql
-- Before (PG 12)
CREATE TABLE shipments (
    id UUID PRIMARY KEY,
    tracking_code CITEXT NOT NULL,
    status VARCHAR(50)
);

-- After (PG 15, with migration layer)
CREATE TABLE shipments (
    id UUID PRIMARY KEY,
    tracking_code TEXT NOT NULL,
    status VARCHAR(50)
);
CREATE INDEX idx_tracking_code_lower ON shipments (LOWER(tracking_code));

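We lost `citext`’s transparent case-insensitive comparisons, but gained compatibility. The functional index on `LOWER(tracking_code)` keeps case-insensitive lookups fast, as long as queries compare against `LOWER(...)` explicitly. The application layer handled that part.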
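Here is roughly what that application-side change looks like. This is a minimal sketch, not the client’s code: SQLite stands in for PostgreSQL purely so the snippet runs anywhere, the `find_status` helper is hypothetical, and the `?` placeholder is SQLite’s style (a PostgreSQL driver such as psycopg would use `%s`).

```python
import sqlite3

# In-memory SQLite database as a stand-in for the PG 15 target.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE shipments ("
    " id TEXT PRIMARY KEY,"
    " tracking_code TEXT NOT NULL,"
    " status VARCHAR(50))"
)
# Same expression index as in the migrated schema above.
conn.execute(
    "CREATE INDEX idx_tracking_code_lower ON shipments (LOWER(tracking_code))"
)
conn.execute("INSERT INTO shipments VALUES ('1', 'AbC-123', 'in_transit')")


def find_status(conn, tracking_code):
    """Case-insensitive lookup: both sides of the comparison are lowercased."""
    row = conn.execute(
        "SELECT status FROM shipments WHERE LOWER(tracking_code) = LOWER(?)",
        (tracking_code,),
    ).fetchone()
    return row[0] if row else None


print(find_status(conn, "abc-123"))  # in_transit
```

The important detail is that the query expression matches the indexed expression exactly; a plain `tracking_code = ...` comparison would be case-sensitive and would not use the index.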

Challenge 2: Replication Lag Under Write-Heavy Workloads

The source database handled about 2,500 writes per second during peak hours. Logical replication couldn’t keep up initially. We saw lag spike to 12 seconds during one afternoon.

That’s bad. If you lose the source, or cut over, during a 12-second lag window, the target is missing up to 12 seconds of committed writes.

The fix: We tuned the replication parameters aggressively.


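wal_level = logical                    # source: enables logical decoding
max_replication_slots = 10             # source: slots for subscriptions and table sync
max_wal_senders = 10                   # source: sender processes for those slots
max_logical_replication_workers = 8    # target: apply and table-sync workers
max_worker_processes = 32              # target: must leave room for the workers above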

We also increased `wal_buffers` from 16MB to 64MB and set `wal_writer_delay` to 200ms (PostgreSQL’s default value). After tuning, lag dropped to under 200ms consistently.
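Lag can also be measured in bytes of WAL rather than seconds, which is often easier to alert on. A minimal sketch of the arithmetic, assuming you have already read `pg_current_wal_lsn()` on the source and `replay_lsn` from `pg_stat_replication`; the helper names are made up for this example, and PostgreSQL’s own `pg_wal_lsn_diff()` does the same subtraction server-side.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '16/B374D848' to a 64-bit byte position.

    The text form is two hexadecimal numbers: the high and low 32 bits.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lag_bytes(source_lsn: str, replayed_lsn: str) -> int:
    """Bytes of WAL the target still has to replay."""
    return lsn_to_int(source_lsn) - lsn_to_int(replayed_lsn)


# Example values, not taken from the real migration.
print(lag_bytes("16/B374D848", "16/B3740000"))  # 55368
```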

Challenge 3: The Cutover Orchestration

The actual cutover was the scariest part. You’re flipping a switch that affects thousands of paying customers. One mistake and you’re writing an apology blog post.

We built a 3-phase cutover script:

  1. Phase 1 (Pre-cutover): Verify replication lag < 1 second for 5 consecutive minutes. Disable cron jobs and background workers on source.
  2. Phase 2 (Cutover): Set source database to read-only. Wait for replication to catch up. Verify row counts match across all 500+ tables.
  3. Phase 3 (Validation): Run 200 automated integration tests against the target. If any fail, execute rollback.
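The Phase 1 gate is worth spelling out, because “under 1 second for 5 consecutive minutes” means any single spike resets the clock. A minimal sketch under stated assumptions: lag samples arrive as `(timestamp_seconds, lag_seconds)` pairs in ascending time order, gaps between samples are not checked, and `lag_gate_passed` is a hypothetical name rather than the actual script.

```python
from typing import Iterable, Optional, Tuple


def lag_gate_passed(
    samples: Iterable[Tuple[float, float]],
    max_lag_s: float = 1.0,
    window_s: float = 300.0,
) -> bool:
    """True once lag has stayed below max_lag_s for window_s seconds in a row."""
    streak_start: Optional[float] = None
    for timestamp, lag in samples:
        if lag < max_lag_s:
            if streak_start is None:
                streak_start = timestamp  # a new quiet streak begins here
            if timestamp - streak_start >= window_s:
                return True
        else:
            streak_start = None  # any spike resets the streak
    return False


# One sample every 10 seconds for 5 minutes, all well under 1 second.
steady = [(t, 0.2) for t in range(0, 301, 10)]
# Same series with a single 3-second spike in the middle.
spiky = [(t, 3.0 if t == 150 else 0.2) for t in range(0, 301, 10)]

print(lag_gate_passed(steady))  # True
print(lag_gate_passed(spiky))   # False
```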

Here’s the rollback plan we had ready:

| Scenario | Rollback Action | Recovery Time |
| --- | --- | --- |
| Replication lag > 5s | Abort cutover, keep source live | 0 minutes |
| Row count mismatch | Abort, investigate, retry | 30-60 minutes |
| Integration test failure | Roll back DNS, restore source writes | 5 minutes |
| Performance degradation | Scale target RDS instance up | 10 minutes |
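A plan like this is easiest to trust when it is encoded as a function the cutover script calls, so nobody has to interpret it at 3AM. A minimal sketch, assuming the checks are evaluated in the order the scenarios are listed; that ordering, the function name, and its parameters are assumptions made for illustration.

```python
def rollback_action(
    lag_s: float,
    row_counts_match: bool,
    tests_passed: bool,
    performance_ok: bool,
) -> str:
    """Map the cutover checks to the action from the rollback plan."""
    if lag_s > 5:
        return "abort cutover, keep source live"
    if not row_counts_match:
        return "abort, investigate, retry"
    if not tests_passed:
        return "roll back DNS, restore source writes"
    if not performance_ok:
        return "scale target RDS instance up"
    return "proceed"


print(rollback_action(0.2, True, True, True))   # proceed
print(rollback_action(7.5, True, True, True))   # abort cutover, keep source live
```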

We never needed the rollback. But having it gave everyone the confidence to proceed.

The Result: 47 Seconds of “Downtime” (Mostly DNS Propagation)

On cutover day, we executed the plan. Total time from setting source to read-only to serving traffic from the target: 47 seconds.

Technically, that’s not zero downtime. But 47 seconds of read-only mode during a 3AM maintenance window? The client’s SLA had a 5-minute monthly allowance. We used about 16% of it.

Post-migration metrics (30 days after cutover):

  • Query latency: Reduced by 34% (PG 15 optimizer improvements)
  • Connection pool utilization: Down 22%
  • Storage costs: Reduced by 41% (RDS gp3 vs. on-premise SAN)
  • No data integrity issues found

Why This Worked (And Previous Attempts Didn’t)

The client’s internal team tried this twice. Both times, they underestimated the complexity of logical replication under real-world workloads. They tested on staging with synthetic data that didn’t match production patterns.

We did something different. We replicated the actual production traffic patterns for 2 weeks before cutover. We ran the migration script against a snapshot of production data every single night. We broke things. We fixed them. We broke them again.

That’s the value of having a dedicated team that can focus on nothing else for 4 weeks. No context switching. No “oh by the way, can you fix this bug too?” Just pure, focused engineering.

Honestly, the client could have hired a local US team for this. But at $3,000/month per senior engineer on our side, they got 4 people for roughly the price of 1 US hire. That’s not just cost-saving. That’s capability.

The Role of AI Orchestration in This Migration

We didn’t just throw bodies at the problem. Our team used the ECOA AI Platform ACP to automate the repetitive parts of the migration.

  • Schema validation scripts were generated and tested by AI agents
  • Replication monitoring dashboards were built in hours, not days
  • Rollback playbooks were drafted, reviewed, and simulated automatically

The AI agents handled the grunt work. Our engineers focused on the hard decisions. That’s the real productivity multiplier.

Key Takeaways for Anyone Planning a Large-Scale DB Migration

  1. Don’t rely on `pg_dump` alone to migrate a live database over 100GB. The dump-and-restore window is all downtime. Logical replication is harder to set up but worth every minute of effort.
  2. Test against production traffic patterns. Staging data lies. Synthetic benchmarks lie more.
  3. Build the rollback plan first. If you can’t describe how to undo the migration, you’re not ready to do it.
  4. Invest in monitoring. We used `pg_stat_replication` and custom Prometheus exporters to track lag in real-time.
  5. Hire people who’ve done it before. This isn’t a junior engineer project. Experience matters.

Frequently Asked Questions

Q: Why did you choose logical replication over AWS DMS?

A: We evaluated AWS Database Migration Service. It’s great for homogeneous migrations, but we needed fine-grained control over schema transformations and replication slot management. Logical replication gave us that control. DMS also adds another layer of complexity — more things that can break.

Q: How did you handle the 500+ table schema validation?

A: We built a Python script that compared table structures, indexes, constraints, and row counts between source and target. It ran every hour during the 2-week rehearsal period. Any discrepancy flagged an alert. We fixed about 30 schema mismatches before cutover day.
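The row-count part of that comparison can be sketched in a few lines. This is an illustration under stated assumptions, not the actual script: the counts are assumed to have been collected already (for example, one `SELECT count(*)` per table on each side) into plain dictionaries, and `diff_row_counts` is a hypothetical name.

```python
from typing import Dict, Optional, Tuple


def diff_row_counts(
    source: Dict[str, int],
    target: Dict[str, int],
) -> Dict[str, Tuple[Optional[int], Optional[int]]]:
    """Return {table: (source_count, target_count)} for every table that differs.

    A table missing on one side shows up with None for that side.
    """
    mismatches = {}
    for table in sorted(set(source) | set(target)):
        source_count = source.get(table)
        target_count = target.get(table)
        if source_count != target_count:
            mismatches[table] = (source_count, target_count)
    return mismatches


# Made-up counts for illustration.
source_counts = {"shipments": 1000, "carriers": 42, "audit_log": 5000}
target_counts = {"shipments": 1000, "carriers": 41}

print(diff_row_counts(source_counts, target_counts))
# {'audit_log': (5000, None), 'carriers': (42, 41)}
```

An empty result means the counts agree. Counts are only meaningful once writes on the source have stopped and replication has caught up; before that, small differences are expected.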

Q: Could this have been done with a smaller team?

A: Maybe. But the timeline would have stretched to 8-10 weeks. The client needed it in 4. Four senior engineers working in parallel was the minimum viable team size. Anything less and you’re risking burnout or mistakes.

Q: What was the biggest risk you didn’t anticipate?

A: The `citext` extension issue. We tested schema compatibility on a subset of tables, but the edge case with custom collation only appeared under specific write patterns. We caught it during the rehearsal phase, not production. That was a close call.
