From 100 to 100,000 Concurrent Users: How a Real-Time SaaS Scaled with a Vietnamese Team and AI Orchestration

You’re cruising along with 100 happy users. Then a single tweet from an influencer hits. Suddenly, you’re staring at a 1000x traffic spike. Your database screams. Your WebSocket connections drop. Your ops team panics.

That’s exactly what happened to a US-based real-time collaboration platform we’ll call “SyncFlow.” They built a whiteboard tool for remote teams. It worked beautifully for small groups. But when a major tech publication featured them, the floodgates opened.

This isn’t a generic “we scaled” story. It’s a detailed technical case study of how SyncFlow partnered with a Vietnamese engineering team in Ho Chi Minh City and used the ECOA AI Platform ACP to go from 100 to 100,000 concurrent users in just 4 weeks. No magic. Just solid engineering and smart orchestration.

The Problem: WebSocket Meltdown and DB Timeouts

SyncFlow’s architecture was simple: a Node.js server with Socket.IO, a PostgreSQL database, and Redis for pub/sub. At 100 concurrent users, it ran fine. At 5,000, it started choking. At 20,000, it died.

Key bottlenecks we identified:

Single WebSocket server couldn’t handle more than 2,000 connections without memory exhaustion.
Database writes per second exceeded 5,000 – PostgreSQL replication lag skyrocketed.
No connection draining or graceful shutdown – deployments caused user disconnects.
Monitoring was basic – no real-time metrics on connection health.

The CTO called us in a panic. “We have 50,000 users trying to connect, and we’re losing them every minute.”

The Solution: A Two-Pronged Approach

We assembled a team of 5 senior engineers from our Ho Chi Minh City hub, each vetted for real-time systems experience. Then we added the ECOA AI Platform ACP to automate testing, deployment, and rollback decisions. That’s where the 5x efficiency gain kicked in.

1. Horizontal Scaling with Redis Pub/Sub

We moved from a single Socket.IO server to a cluster behind an AWS Network Load Balancer. Each server instance handled up to 2,000 connections. To broadcast events across servers, we used Redis Pub/Sub with a dedicated channel per room.

javascript
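// Socket.IO Redis adapter (node-redis v4 client); the endpoint is a placeholder
const { Server } = require('socket.io');
const { createAdapter } = require('@socket.io/redis-adapter');
const { createClient } = require('redis');

// Created here for completeness; attach it to your HTTP server as needed
const io = new Server();

const pubClient = createClient({ url: 'redis://cluster-endpoint:6379' });
const subClient = pubClient.duplicate();

// node-redis v4 clients must be connected before the adapter can use them
Promise.all([pubClient.connect(), subClient.connect()]).then(() => {
  io.adapter(createAdapter(pubClient, subClient));
});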

Simple, but we needed to handle Redis failover. We configured a Redis Cluster with 3 primary nodes and 3 replicas. The ECOA AI Platform ACP automatically detected connection drops and triggered a failover script within 2 seconds.

2. Database Sharding by Tenant

PostgreSQL couldn’t keep up with 5,000 writes/second. We sharded by tenant ID using Citus. Each shard handled a subset of tenants. The ECOA AI agent monitored query latency and automatically redistributed shards when any node hit 80% CPU.

Table: Shard distribution after optimization

| Shard ID | Tenants | Avg writes/sec | CPU usage |
| --- | --- | --- | --- |
| shard-01 | 1-500 | 1,200 | 45% |
| shard-02 | 501-1000 | 1,100 | 42% |
| shard-03 | 1001-1500 | 1,150 | 43% |
| shard-04 | 1501-2000 | 1,180 | 44% |
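
To make the layout in the table concrete, here is a minimal sketch of a tenant-to-shard lookup and the 80% CPU rebalance check. It is an illustration of the reported ranges only: Citus places rows by hashing the distribution column, and the names `SHARD_RANGES`, `shardForTenant`, and `nodesNeedingRebalance` are assumptions made for this example, not part of Citus or the ECOA platform.

```javascript
// Illustrative only: contiguous tenant ranges taken from the table above.
// This models the reported layout, not Citus's own placement logic.
const SHARD_RANGES = [
  { shard: 'shard-01', from: 1, to: 500 },
  { shard: 'shard-02', from: 501, to: 1000 },
  { shard: 'shard-03', from: 1001, to: 1500 },
  { shard: 'shard-04', from: 1501, to: 2000 },
];

// Return the shard that owns a tenant ID, or null if no range covers it.
function shardForTenant(tenantId) {
  const match = SHARD_RANGES.find((r) => tenantId >= r.from && tenantId <= r.to);
  return match ? match.shard : null;
}

// Hypothetical rebalance trigger: 80% CPU, per the article.
const CPU_REBALANCE_THRESHOLD = 0.8;

// Given { shardName: cpuFraction }, list the shards at or above the threshold.
function nodesNeedingRebalance(cpuByShard) {
  return Object.entries(cpuByShard)
    .filter(([, cpu]) => cpu >= CPU_REBALANCE_THRESHOLD)
    .map(([shard]) => shard);
}
```

With the CPU figures from the table, `nodesNeedingRebalance` returns an empty list, since every shard sits well under the threshold.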

We also introduced a write-behind cache with Redis. The ECOA AI agent orchestrated each write in two steps: write to Redis first, then asynchronously flush to PostgreSQL. If a flush failed, the agent retried with exponential backoff.
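
The retry behavior can be sketched roughly as follows. This is a minimal illustration, not the agent's actual code: `flush` stands in for whatever function writes a buffered entry to PostgreSQL, and the base delay, cap, and retry count are assumed values.

```javascript
// Resolve after `ms` milliseconds; injectable so tests can skip real waiting.
const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Delay before retry number `attempt` (0-based): base * 2^attempt, capped.
function backoffDelay(attempt, baseMs = 100, maxMs = 5000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Try to flush one buffered write to the durable store.
// `flush` is a hypothetical async function that throws on failure.
async function flushWithRetry(flush, write, options = {}) {
  const { maxRetries = 5, baseMs = 100, maxMs = 5000, sleep = defaultSleep } = options;
  for (let attempt = 0; ; attempt += 1) {
    try {
      return await flush(write);
    } catch (err) {
      // Out of retries: surface the error so the caller keeps the write buffered.
      if (attempt >= maxRetries) throw err;
      await sleep(backoffDelay(attempt, baseMs, maxMs));
    }
  }
}
```

Capping the delay keeps a long outage from pushing retries arbitrarily far apart; adding random jitter would be a reasonable next step if many writers retry at once.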

3. AI-Powered Deployment and Rollback

Here’s where things got interesting. Manual deployments were causing 30-second connection drops. We built an AI agent using the ECOA AI Platform ACP that:

Analyzed traffic patterns to pick the lowest-traffic window.
Drained connections from old instances before shutting them down.
Canaried the new deployment on 5% of users first.
Automatically rolled back if error rate exceeded 0.1% within 2 minutes.
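
Two of these steps reduce to small, checkable rules: who lands in the 5% canary, and when to roll back. The sketch below shows one common way to do each, assuming users are bucketed by a hash of their ID. The function names are hypothetical and the article does not say how the platform actually selects the cohort.

```javascript
const crypto = require('node:crypto');

// Map a user ID to a stable bucket in [0, 100) so the same user always
// lands in the same cohort across requests.
function bucketFor(userId) {
  const digest = crypto.createHash('sha256').update(String(userId)).digest();
  return digest.readUInt32BE(0) % 100;
}

// A user is in the canary when their bucket falls below the percentage.
function inCanary(userId, percentage = 5) {
  return bucketFor(userId) < percentage;
}

// Roll back when the observed error rate exceeds the threshold (0.1% = 0.001).
function shouldRollback(errorCount, requestCount, threshold = 0.001) {
  if (requestCount === 0) return false; // no traffic yet, nothing to judge
  return errorCount / requestCount > threshold;
}
```

Hashing the ID rather than sampling randomly per request keeps each user on one version for the whole canary window, which matters for a stateful WebSocket session.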

The agent ran as a state machine with three working states: `DRAINING`, `CANARY`, and `FULL_ROLLOUT`, plus the terminal `COMPLETE` and `ROLLBACK` outcomes that the workflow transitions to. It used the ECOA AI Platform’s built-in health check hooks.

yaml
# ECOA AI agent workflow snippet
states:
  - name: DRAINING
    action: "drain_connections"
    timeout: 30s
    on_success: CANARY
    on_failure: ROLLBACK

  - name: CANARY
    action: "deploy_canary"
    percentage: 5
    monitor_duration: 120s
    threshold_error_rate: 0.001
    on_success: FULL_ROLLOUT
    on_failure: ROLLBACK

  - name: FULL_ROLLOUT
    action: "deploy_all"
    on_success: COMPLETE
    on_failure: ROLLBACK

The result? Deployments went from 30-second downtime to zero downtime for 99.97% of releases. The AI agent caught 3 bad deployments in the first week alone.
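
For readers without the platform, the workflow's control flow is small enough to sketch in plain JavaScript. This is an assumption-laden illustration, not the ECOA runtime: the transition table mirrors the workflow snippet, `COMPLETE` and `ROLLBACK` are treated as terminal, and the `actions` object is a hypothetical map from action name to an async function that resolves to `true` on success.

```javascript
// Transition table mirroring the workflow snippet.
const WORKFLOW = {
  DRAINING: { action: 'drain_connections', onSuccess: 'CANARY', onFailure: 'ROLLBACK' },
  CANARY: { action: 'deploy_canary', onSuccess: 'FULL_ROLLOUT', onFailure: 'ROLLBACK' },
  FULL_ROLLOUT: { action: 'deploy_all', onSuccess: 'COMPLETE', onFailure: 'ROLLBACK' },
};
const TERMINAL = new Set(['COMPLETE', 'ROLLBACK']);

// Run the workflow and return the list of states visited, ending in a
// terminal state. A false result or a thrown error counts as failure.
async function runDeployment(actions, start = 'DRAINING') {
  const visited = [];
  let state = start;
  while (!TERMINAL.has(state)) {
    visited.push(state);
    const step = WORKFLOW[state];
    let ok = false;
    try {
      ok = (await actions[step.action]()) === true;
    } catch {
      ok = false;
    }
    state = ok ? step.onSuccess : step.onFailure;
  }
  visited.push(state);
  return visited;
}
```

Returning the visited path makes each run easy to log or post to a channel, and keeps the rollback decision in one place instead of scattered across deploy scripts.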

The Results: 4 Weeks, 100 to 100,000 Users

We rolled out the changes incrementally. Each week, we stress-tested with simulated traffic from AWS Distributed Load Testing. The ECOA AI agent adjusted instance counts and shard allocations automatically.

Week 1: 100 → 5,000 users. No issues.

Week 2: 5,000 → 20,000 users. One Redis failover triggered automatically – agent recovered in 1.2 seconds.

Week 3: 20,000 → 50,000 users. Database shard rebalancing needed – agent redistributed in 4 minutes.

Week 4: 50,000 → 100,000 concurrent users. Smooth sailing.

Key metrics at 100,000 concurrent users:

Average WebSocket message latency: < 50ms
Database write latency (p99): 120ms
Instance count: 50 (c5.xlarge)
Monthly infrastructure cost: $12,000 (down from projected $40,000 without optimization)
Team size: 5 senior developers + ECOA AI Platform ACP
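
The instance count follows directly from the per-server limit mentioned earlier: 100,000 connections at 2,000 per instance is 50 instances. A small sizing helper makes that arithmetic reusable; the `headroom` parameter is an assumption added for illustration and is not something the article reports using.

```javascript
// Connections one instance is sized for, per the article.
const CONNECTIONS_PER_INSTANCE = 2000;

// Instances needed for a target number of concurrent connections.
// `headroom` is a hypothetical safety margin (0.2 = keep 20% spare capacity).
function instancesNeeded(concurrentUsers, headroom = 0) {
  const usable = CONNECTIONS_PER_INSTANCE * (1 - headroom);
  return Math.ceil(concurrentUsers / usable);
}
```

`instancesNeeded(100000)` gives 50, matching the reported fleet. With 20% headroom the same target would need 63 instances.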

Why the Vietnamese Team Mattered

Honestly, we could have hired locally in the US. But the cost would have been 3x higher, and finding engineers with real-time scaling experience is tough. The team in Ho Chi Minh City had deep experience with Socket.IO, Redis, and PostgreSQL sharding. They’d built similar systems for Vietnamese fintech companies handling millions of transactions.

More importantly, they adapted quickly to the ECOA AI Platform. Within a week, they were writing custom agent workflows. One engineer even built an agent that automatically generated deployment reports and posted them to Slack. That’s the kind of ownership you don’t always get with offshore teams.

Lessons Learned

Don’t over-optimize early, but don’t wait to plan either. SyncFlow’s original architecture was fine for 100 users; the mistake was waiting too long to plan for scale. Have a scaling playbook ready before you need it.
AI orchestration isn’t just for chatbots. Using the ECOA AI Platform to automate deployment and rollback saved us hours of manual work. It’s like having a senior DevOps engineer on autopilot.
Vietnamese engineering talent is world-class. The team we worked with didn’t just follow instructions – they proactively suggested improvements. That’s rare in offshore arrangements.

What’s next for SyncFlow? They’re now targeting 1 million concurrent users. The same team, the same platform, and a lot more AI agents.

Frequently Asked Questions

How did the ECOA AI Platform ACP handle Redis failover automatically?

The platform includes a health-check agent that monitors Redis cluster nodes every second. If a primary node goes down, the agent triggers a failover script that promotes a replica and updates the Socket.IO adapter configuration. The entire process takes under 2 seconds, and connections are re-established transparently.

What was the biggest technical challenge during the scaling process?

Handling connection draining during deployments. Without it, users experienced 30-second disconnects. We solved it by implementing a custom Socket.IO middleware that tracked active connections and refused new ones during drain mode. The ECOA AI agent coordinated the timing across all instances.
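
A drain gate along these lines might look like the sketch below. It is a simplified stand-in for the middleware described above, with hypothetical names: `middleware` has the `(socket, next)` shape that Socket.IO's `io.use()` expects, and passing an `Error` to `next` refuses the connection. The 30-second default mirrors the `DRAINING` timeout in the workflow.

```javascript
const { EventEmitter } = require('node:events');

// Tracks live sockets and refuses new ones once draining has started.
function createDrainGate() {
  const events = new EventEmitter();
  let active = 0;
  let draining = false;

  // Shaped like a Socket.IO middleware: (socket, next).
  function middleware(socket, next) {
    if (draining) return next(new Error('server is draining'));
    active += 1;
    socket.once('disconnect', () => {
      active -= 1;
      if (active === 0) events.emit('idle');
    });
    return next();
  }

  // Stop accepting connections, then resolve true once every tracked socket
  // has disconnected, or false if `timeoutMs` passes first.
  function drain(timeoutMs = 30000) {
    draining = true;
    if (active === 0) return Promise.resolve(true);
    return new Promise((resolve) => {
      const timer = setTimeout(() => {
        events.off('idle', onIdle);
        resolve(false);
      }, timeoutMs);
      function onIdle() {
        clearTimeout(timer);
        resolve(true);
      }
      events.once('idle', onIdle);
    });
  }

  return { middleware, drain, activeCount: () => active };
}
```

In a server this would be registered with `io.use(gate.middleware)` and `gate.drain()` awaited before shutdown. One caveat: if a later middleware rejects the socket, it never emits `disconnect`, so counting inside the `connection` handler may be the safer choice in production.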

How did you ensure data consistency with the write-behind cache?

We used Redis Streams with consumer groups. Each write was added to a stream and acknowledged only after PostgreSQL confirmed the insert. If a consumer failed, the ECOA AI agent restarted it and reprocessed unacknowledged entries. We also ran periodic reconciliation jobs every hour to catch any missed writes.
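
The ordering that matters here is insert first, acknowledge second. The sketch below shows that ordering with injected dependencies so it runs without a live Redis or PostgreSQL: `stream.readGroup()` stands in for an `XREADGROUP` call, `stream.ack(id)` for `XACK`, and `db.insert(row)` for the PostgreSQL write. All three wrappers are hypothetical names chosen for this example.

```javascript
// Process one batch from a stream consumer group.
// `stream` and `db` are hypothetical thin wrappers supplied by the caller.
async function processBatch(stream, db) {
  const entries = await stream.readGroup(); // expected shape: [{ id, row }, ...]
  const result = { acked: [], pending: [] };
  for (const { id, row } of entries) {
    try {
      await db.insert(row); // durable write first
      await stream.ack(id); // acknowledge only after the insert succeeded
      result.acked.push(id);
    } catch {
      // Left unacknowledged, so the entry stays in the group's pending list
      // and can be re-read and retried by this or another consumer.
      result.pending.push(id);
    }
  }
  return result;
}
```

Because an entry can be redelivered when the acknowledgment itself fails after a successful insert, this gives at-least-once delivery. Making the insert idempotent, for example by keying on the stream entry ID, keeps a retry from creating duplicates.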

Can I replicate this setup with my own team?

Absolutely. The architecture patterns are standard: Redis Pub/Sub for WebSocket scaling, Citus for database sharding, and a state machine for deployment orchestration. The key differentiator was the AI agent automating the decision-making. You can build a similar agent using the ECOA AI Platform ACP or even open-source tools like Temporal or AWS Step Functions. But the pre-built workflows saved us weeks of development.
