From 100 to 100,000 Concurrent Users: How a Real-Time SaaS Scaled with a Vietnamese Team and AI Orchestration
You’re cruising along with 100 happy users. Then a single tweet from an influencer hits. Suddenly, you’re staring at a 1000x traffic spike. Your database screams. Your WebSocket connections drop. Your ops team panics.
That’s exactly what happened to a US-based real-time collaboration platform we’ll call “SyncFlow.” They built a whiteboard tool for remote teams. It worked beautifully for small groups. But when a major tech publication featured them, the floodgates opened.
This isn’t a generic “we scaled” story. It’s a detailed technical case study of how SyncFlow paired a Vietnamese engineering team in Ho Chi Minh City with the ECOA AI Platform ACP to go from 100 to 100,000 concurrent users in just 4 weeks. No magic. Just solid engineering and smart orchestration.
The Problem: WebSocket Meltdown and DB Timeouts
SyncFlow’s architecture was simple: a Node.js server with Socket.IO, a PostgreSQL database, and Redis for pub/sub. At 100 concurrent users, it ran fine. At 5,000, it started choking. At 20,000, it died.
Key bottlenecks we identified:
- Single WebSocket server couldn’t handle more than 2,000 connections without memory exhaustion.
- Database writes per second exceeded 5,000 – PostgreSQL replication lag skyrocketed.
- No connection draining or graceful shutdown – deployments caused user disconnects.
- Monitoring was basic – no real-time metrics on connection health.
The CTO called us in a panic. “We have 50,000 users trying to connect, and we’re losing them every minute.”
The Solution: A Two-Pronged Approach
We assembled a team of 5 senior engineers from our Ho Chi Minh City hub, each vetted for real-time systems experience. Then we added the ECOA AI Platform ACP to automate testing, deployment, and rollback decisions. That’s where the 5x efficiency gain kicked in.
1. Horizontal Scaling with Redis Pub/Sub
We moved from a single Socket.IO server to a cluster behind an AWS Network Load Balancer. Each server instance handled up to 2,000 connections. To broadcast events across servers, we used Redis Pub/Sub with a dedicated channel per room.
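```javascript
// Socket.IO Redis adapter. node-redis v4 clients must be connected before use.
// A single endpoint is shown here; a Redis Cluster would use createCluster instead.
const { Server } = require('socket.io');
const { createAdapter } = require('@socket.io/redis-adapter');
const { createClient } = require('redis');

const io = new Server();
const pubClient = createClient({ url: 'redis://cluster-endpoint:6379' });
const subClient = pubClient.duplicate();

Promise.all([pubClient.connect(), subClient.connect()]).then(() => {
  io.adapter(createAdapter(pubClient, subClient));
});
```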
Simple, but we needed to handle Redis failover. We configured a Redis Cluster with 3 primary nodes and 3 replicas. The ECOA AI Platform ACP automatically detected connection drops and triggered a failover script within 2 seconds.
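With a ceiling of 2,000 connections per instance, sizing the fleet is simple arithmetic. The helper below is a minimal sketch of that calculation. The function name and the `headroomPct` parameter are assumptions for illustration, not part of SyncFlow’s tooling.

```javascript
// Hypothetical sizing helper. The 2,000-connection ceiling is from the text;
// headroomPct (spare capacity kept free on each instance) is an assumption.
function instancesNeeded(concurrentUsers, maxPerInstance = 2000, headroomPct = 0) {
  if (headroomPct < 0 || headroomPct >= 100) {
    throw new RangeError('headroomPct must be in [0, 100)');
  }
  // Integer math avoids floating-point surprises in the usable capacity.
  const usable = Math.floor((maxPerInstance * (100 - headroomPct)) / 100);
  if (usable < 1) throw new RangeError('no usable capacity per instance');
  return Math.ceil(concurrentUsers / usable);
}

console.log(instancesNeeded(100000));           // 50
console.log(instancesNeeded(100000, 2000, 20)); // 63
```

With no headroom, 100,000 users need 50 instances, which matches the instance count reported in the results. Reserving 20% headroom would raise that to 63.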
2. Database Sharding by Tenant
PostgreSQL couldn’t keep up with 5,000 writes/second. We sharded by tenant ID using Citus. Each shard handled a subset of tenants. The ECOA AI agent monitored query latency and automatically redistributed shards when any node hit 80% CPU.
Table: Shard distribution after optimization
| Shard ID | Tenant IDs | Avg Writes/sec | CPU Usage |
|---|---|---|---|
| shard-01 | 1-500 | 1,200 | 45% |
| shard-02 | 501-1000 | 1,100 | 42% |
| shard-03 | 1001-1500 | 1,150 | 43% |
| shard-04 | 1501-2000 | 1,180 | 44% |
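To make the table concrete, here is a small sketch that maps a tenant ID to its shard and flags shards at or above the 80% CPU rebalance threshold. It is illustrative only: Citus routes queries itself based on the distribution column, so application code would not normally do this lookup, and the names `shardFor` and `shardsNeedingRebalance` are hypothetical.

```javascript
// Tenant-ID ranges copied from the table above; the lookup itself is illustrative.
const SHARD_RANGES = [
  { shard: 'shard-01', from: 1, to: 500 },
  { shard: 'shard-02', from: 501, to: 1000 },
  { shard: 'shard-03', from: 1001, to: 1500 },
  { shard: 'shard-04', from: 1501, to: 2000 },
];

// Returns the shard whose range contains the tenant ID.
function shardFor(tenantId) {
  const hit = SHARD_RANGES.find((r) => tenantId >= r.from && tenantId <= r.to);
  if (!hit) throw new RangeError(`no shard covers tenant ${tenantId}`);
  return hit.shard;
}

// Lists shards whose CPU fraction is at or above the threshold (80% in the text).
function shardsNeedingRebalance(cpuByShard, threshold = 0.8) {
  return Object.entries(cpuByShard)
    .filter(([, cpu]) => cpu >= threshold)
    .map(([shard]) => shard);
}

console.log(shardFor(750)); // shard-02
console.log(shardsNeedingRebalance({ 'shard-01': 0.45, 'shard-02': 0.85 })); // [ 'shard-02' ]
```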
We also introduced a write-behind cache with Redis. The ECOA AI agent orchestrated a two-step write path (write-behind, not a two-phase commit in the distributed-transaction sense): each write lands in Redis first and is then flushed asynchronously to PostgreSQL. If a flush failed, the agent retried with exponential backoff.
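The write-behind flow with exponential backoff can be sketched with an in-memory buffer standing in for Redis. This is a simplified illustration, not the production code. The class name, the delay defaults, and the `persist` callback (a stand-in for the PostgreSQL insert) are all assumptions.

```javascript
// Hypothetical sketch: an in-memory stand-in for the Redis write-behind buffer.
class WriteBehindBuffer {
  constructor({ baseDelayMs = 100, maxDelayMs = 5000 } = {}) {
    this.pending = [];        // writes accepted but not yet persisted
    this.baseDelayMs = baseDelayMs;
    this.maxDelayMs = maxDelayMs;
    this.failedAttempts = 0;  // consecutive failed flushes
  }

  // Step 1: accept the write immediately.
  enqueue(write) {
    this.pending.push(write);
  }

  // Delay before the next flush attempt: base * 2^(failures - 1), capped.
  nextDelayMs() {
    if (this.failedAttempts === 0) return 0;
    return Math.min(this.baseDelayMs * 2 ** (this.failedAttempts - 1), this.maxDelayMs);
  }

  // Step 2: try to persist everything pending. `persist` should throw on failure.
  flush(persist) {
    try {
      persist(this.pending);
      this.pending = [];
      this.failedAttempts = 0;
      return true;
    } catch (err) {
      this.failedAttempts += 1; // entries stay queued so the retry resends them
      return false;
    }
  }
}
```

A caller would schedule the next `flush` after `nextDelayMs()` milliseconds, for example with `setTimeout`. Failed batches stay in the buffer, so a retry resends the same writes, which means the database insert needs to be idempotent.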
3. AI-Powered Deployment and Rollback
Here’s where things got interesting. Manual deployments were causing 30-second connection drops. We built an AI agent using the ECOA AI Platform ACP that:
- Analyzed traffic patterns to pick the lowest-traffic window.
- Drained connections from old instances before shutting them down.
- Canaried the new deployment on 5% of users first.
- Automatically rolled back if error rate exceeded 0.1% within 2 minutes.
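The rollback rule in the last bullet comes down to a threshold check over a time window. Here is a minimal sketch, assuming the agent receives periodic samples of request and error counts. The sample shape and the verdict names (`ROLLBACK`, `PROMOTE`, `WAIT`) are hypothetical.

```javascript
// Threshold (0.1%) and window (2 minutes) are from the list above; names are hypothetical.
const ERROR_RATE_THRESHOLD = 0.001;
const MONITOR_WINDOW_MS = 2 * 60 * 1000;

// samples: [{ atMs, requests, errors }] collected from the canary instances.
function canaryVerdict(samples, deployedAtMs, nowMs) {
  const windowEnd = deployedAtMs + MONITOR_WINDOW_MS;
  const inWindow = samples.filter((s) => s.atMs >= deployedAtMs && s.atMs <= windowEnd);
  const requests = inWindow.reduce((sum, s) => sum + s.requests, 0);
  const errors = inWindow.reduce((sum, s) => sum + s.errors, 0);

  // Roll back as soon as the observed error rate exceeds the threshold.
  if (requests > 0 && errors / requests > ERROR_RATE_THRESHOLD) return 'ROLLBACK';
  // Promote only after the full window has passed without a breach.
  if (nowMs - deployedAtMs >= MONITOR_WINDOW_MS) return 'PROMOTE';
  return 'WAIT';
}
```

Three errors in 2,000 requests is a 0.15% error rate and returns `ROLLBACK`. One error in 2,000 is 0.05%, which returns `WAIT` until the two minutes are up and `PROMOTE` after that.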
The agent ran as a state machine with three working states, `DRAINING`, `CANARY`, and `FULL_ROLLOUT`, plus two terminal outcomes, `COMPLETE` and `ROLLBACK`. It used the ECOA AI Platform’s built-in health check hooks.
```yaml
# ECOA AI agent workflow snippet
states:
  - name: DRAINING
    action: "drain_connections"
    timeout: 30s
    on_success: CANARY
    on_failure: ROLLBACK
  - name: CANARY
    action: "deploy_canary"
    percentage: 5
    monitor_duration: 120s
    threshold_error_rate: 0.001
    on_success: FULL_ROLLOUT
    on_failure: ROLLBACK
  - name: FULL_ROLLOUT
    action: "deploy_all"
    on_success: COMPLETE
    on_failure: ROLLBACK
```
The result? Deployments went from 30-second downtime to zero downtime for 99.97% of releases. The AI agent caught 3 bad deployments in the first week alone.
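The workflow reads naturally as a transition table. The sketch below walks that table in plain JavaScript to show how each state resolves to `COMPLETE` or `ROLLBACK`. It is not the ECOA runtime. The `runDeployment` function and the shape of the `actions` map are assumptions for illustration.

```javascript
// Transition table mirroring the on_success / on_failure edges of the workflow.
const TRANSITIONS = {
  DRAINING: { success: 'CANARY', failure: 'ROLLBACK' },
  CANARY: { success: 'FULL_ROLLOUT', failure: 'ROLLBACK' },
  FULL_ROLLOUT: { success: 'COMPLETE', failure: 'ROLLBACK' },
};
const TERMINAL = new Set(['COMPLETE', 'ROLLBACK']);

// `actions` maps each state name to a function returning true (success) or false (failure).
// Returns the list of states visited, ending in a terminal state.
function runDeployment(actions, start = 'DRAINING') {
  const visited = [start];
  let state = start;
  while (!TERMINAL.has(state)) {
    const ok = actions[state]();
    state = TRANSITIONS[state][ok ? 'success' : 'failure'];
    visited.push(state);
  }
  return visited;
}

const path = runDeployment({
  DRAINING: () => true,
  CANARY: () => false, // simulate a canary that breaches the error threshold
  FULL_ROLLOUT: () => true,
});
console.log(path.join(' -> ')); // DRAINING -> CANARY -> ROLLBACK
```

Because every state has exactly one success edge and one failure edge, a run always terminates, and the path taken doubles as an audit trail for the release.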
The Results: 4 Weeks, 100 to 100,000 Users
We rolled out the changes incrementally. Each week, we stress-tested with simulated traffic from AWS Distributed Load Testing. The ECOA AI agent adjusted instance counts and shard allocations automatically.
Week 1: 100 → 5,000 users. No issues.
Week 2: 5,000 → 20,000 users. One Redis failover triggered automatically – agent recovered in 1.2 seconds.
Week 3: 20,000 → 50,000 users. Database shard rebalancing needed – agent redistributed in 4 minutes.
Week 4: 50,000 → 100,000 concurrent users. Smooth sailing.
Key metrics at 100,000 concurrent users:
- Average WebSocket message latency: < 50ms
- Database write latency (p99): 120ms
- Instance count: 50 (c5.xlarge)
- Monthly infrastructure cost: $12,000 (down from projected $40,000 without optimization)
- Team size: 5 senior developers + ECOA AI Platform ACP
Why the Vietnamese Team Mattered
Honestly, we could have hired locally in the US. But the cost would have been 3x higher, and finding engineers with real-time scaling experience is tough. The team in Ho Chi Minh City had deep experience with Socket.IO, Redis, and PostgreSQL sharding. They’d built similar systems for Vietnamese fintech companies handling millions of transactions.
More importantly, they adapted quickly to the ECOA AI Platform. Within a week, they were writing custom agent workflows. One engineer even built an agent that automatically generated deployment reports and posted them to Slack. That’s the kind of ownership you don’t always get with offshore teams.
Lessons Learned
- Don’t over-optimize early, but don’t wait too long either. SyncFlow’s original architecture was fine for 100 users; the mistake was having no plan for scale. Have a scaling playbook ready before you need it.
- AI orchestration isn’t just for chatbots. Using the ECOA AI Platform to automate deployment and rollback saved us hours of manual work. It’s like having a senior DevOps engineer on autopilot.
- Vietnamese engineering talent is world-class. The team we worked with didn’t just follow instructions – they proactively suggested improvements. That’s rare in offshore arrangements.
What’s next for SyncFlow? They’re now targeting 1 million concurrent users. The same team, the same platform, and a lot more AI agents.
Frequently Asked Questions
How did the ECOA AI Platform ACP handle Redis failover automatically?
The platform includes a health-check agent that monitors Redis cluster nodes every second. If a primary node goes down, the agent triggers a failover script that promotes a replica and updates the Socket.IO adapter configuration. The entire process takes under 2 seconds, and connections are re-established transparently.
What was the biggest technical challenge during the scaling process?
Handling connection draining during deployments. Without it, users experienced 30-second disconnects. We solved it by implementing a custom Socket.IO middleware that tracked active connections and refused new ones during drain mode. The ECOA AI agent coordinated the timing across all instances.
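The core of that drain logic is a small piece of bookkeeping. Below is a minimal sketch of it, kept free of Socket.IO so it runs on its own. The `DrainGate` class and its method names are hypothetical, not the middleware SyncFlow shipped.

```javascript
// Hypothetical connection gate: tracks active connections and refuses new ones while draining.
class DrainGate {
  constructor() {
    this.active = new Set(); // IDs of currently connected clients
    this.draining = false;
  }

  // Called for each incoming connection; returns false once drain mode is on.
  tryAccept(connectionId) {
    if (this.draining) return false;
    this.active.add(connectionId);
    return true;
  }

  // Called when a client disconnects.
  release(connectionId) {
    this.active.delete(connectionId);
  }

  // Flip the instance into drain mode ahead of shutdown.
  startDrain() {
    this.draining = true;
  }

  // True once drain mode is on and every tracked client has disconnected.
  isDrained() {
    return this.draining && this.active.size === 0;
  }
}
```

In a Socket.IO server this could plausibly be wired in through `io.use((socket, next) => ...)`, calling `next()` when `tryAccept(socket.id)` succeeds and `next(new Error('draining'))` otherwise, with `release` called from the socket’s `disconnect` handler. The instance is safe to stop once `isDrained()` returns true or the drain timeout expires.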
How did you ensure data consistency with the write-behind cache?
We used Redis Streams with consumer groups. Each write was added to a stream and acknowledged only after PostgreSQL confirmed the insert. If a consumer failed, the ECOA AI agent restarted it and reprocessed unacknowledged entries. We also ran periodic reconciliation jobs every hour to catch any missed writes.
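The acknowledge-after-persist pattern can be modelled in memory to show why unacknowledged entries survive a consumer crash. This is a simplified stand-in, not a Redis client. In Redis, the corresponding commands are `XADD`, `XREADGROUP`, `XACK`, and `XAUTOCLAIM`. The `PendingStream` class and its method names are hypothetical.

```javascript
// Hypothetical in-memory model of a stream with one consumer group.
class PendingStream {
  constructor() {
    this.nextId = 1;
    this.undelivered = [];    // entries no consumer has read yet
    this.pending = new Map(); // id -> { payload, consumer }: delivered but not acked
  }

  // Append a write to the stream (XADD in Redis).
  add(payload) {
    const id = this.nextId++;
    this.undelivered.push({ id, payload });
    return id;
  }

  // Deliver all new entries to a consumer and mark them pending (XREADGROUP).
  read(consumer) {
    const batch = this.undelivered.splice(0);
    for (const { id, payload } of batch) this.pending.set(id, { payload, consumer });
    return batch;
  }

  // Acknowledge an entry only after the database insert is confirmed (XACK).
  ack(id) {
    return this.pending.delete(id);
  }

  // Hand a failed consumer's unacknowledged entries to another consumer (XAUTOCLAIM).
  reclaim(fromConsumer, toConsumer) {
    const reclaimed = [];
    for (const [id, entry] of this.pending) {
      if (entry.consumer === fromConsumer) {
        entry.consumer = toConsumer;
        reclaimed.push({ id, payload: entry.payload });
      }
    }
    return reclaimed;
  }
}
```

If a consumer reads two entries, acknowledges the first, and then crashes, the second stays pending. A restarted consumer picks it up through `reclaim`. The same entry can therefore be processed twice, which is one reason an hourly reconciliation job is a sensible safety net.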
Can I replicate this setup with my own team?
Absolutely. The architecture patterns are standard: Redis Pub/Sub for WebSocket scaling, Citus for database sharding, and a state machine for deployment orchestration. The key differentiator was the AI agent automating the decision-making. You can build a similar agent using the ECOA AI Platform ACP or even open-source tools like Temporal or AWS Step Functions. But the pre-built workflows saved us weeks of development.