Real-Time Multiplayer at Scale: How We Built a Game Backend for 50K Concurrent Players with a Vietnamese Team
I’ll be honest — when the client first came to us, I was skeptical.
A mid-size mobile gaming studio from Berlin wanted a real-time multiplayer backend for their upcoming battle royale-style game. Their existing prototype barely handled 500 concurrent players before the WebSocket connections started dropping like flies. They needed 50,000. Simultaneously. With sub-100ms latency.
And they wanted it shipped in 10 weeks.
Here’s the thing about game backends — they’re *nothing* like typical CRUD apps. You’re not just serving data. You’re managing state synchronization across thousands of clients, handling tick rates, resolving conflicts, and dealing with players who have zero tolerance for lag. Get it wrong, and your game dies before it launches.
We got it right. Here’s exactly how.
The Problem: A Single-Threaded Monocle That Couldn’t See Past 500 Users
The studio’s prototype backend was a Node.js monolith running on a single EC2 instance. Every player connection was served by the same single-threaded event loop. State updates were broadcast via naive `for` loops over an array of socket references.
Every time a player fired a weapon, the server iterated over *all* connected clients to check who needed the update. That is O(n) work per event, and because every player generates events, total work grows roughly as O(n²). With 500 players, that was already painful. At 5,000, it would have collapsed.
The codebase had no sharding, no message queue, no state delta compression. Just raw optimism and a lot of `socket.emit()` calls.
They knew they needed a complete rewrite. But their in-house team was tapped out maintaining the client-side Unity code. They needed backend specialists who understood distributed systems, real-time networking, and game state management.
That’s where we came in.
The Team: 5 Vietnamese Engineers, 1 AI Platform, 8 Weeks
ECOA AI assembled a dedicated team of 5 Vietnamese developers within 72 hours of the kickoff call:
- 2 senior backend engineers (Go + Rust experience)
- 1 senior DevOps engineer (AWS, Kubernetes, Redis clustering)
- 1 mid-level game server specialist (previous experience with Photon and Nakama)
- 1 junior engineer (handling documentation, test automation, and CI/CD)
All of them were onboarded onto the ECOA AI Platform ACP on day one. That platform was the force multiplier — it gave each engineer an AI agent orchestrator that handled boilerplate code generation, test creation, and deployment automation.
Without the ACP, we’d have needed 12-15 engineers to hit the same velocity. The platform didn’t replace the team — it supercharged them.
The Architecture: Event-Driven, Sharded, and Resilient
Let me walk you through what we actually built. This isn’t theoretical — this is what survived production.
Core Components
| Component | Tech | Purpose |
|---|---|---|
| Gateway Layer | Go + WebSocket | Connection management, TLS termination, rate limiting |
| Game State Service | Rust | Tick-based state simulation, physics, collision detection |
| Matchmaking Service | Go | Elo-based player matching, queue management |
| Room Manager | Go | Game room lifecycle, player assignment |
| State Cache | Redis Cluster + RedisGears | In-memory game state with pub/sub triggers |
| Persistence Layer | PostgreSQL (Citus) | Player profiles, match history, analytics |
| Message Bus | NATS JetStream | Reliable event delivery between services |
Why We Chose Rust for the State Service
This was the critical path. The game state service processes 60 ticks per second, each tick updating positions, health, ammo, and collision states for up to 100 players per match.
Go would have worked. But Rust gave us:
- No garbage collector: with no GC pauses, tick times stayed consistent
- Fine-grained memory control: we could pre-allocate arenas for player states
- Better CPU cache utilization: the struct-of-arrays pattern we used kept hot data contiguous
The state service handled 4,800 events per second per match during peak load. Average tick processing time: 2.1ms, against a budget of roughly 16.7ms per tick at 60 ticks per second. That left plenty of headroom for network latency.
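To make the struct-of-arrays idea concrete, here is a minimal sketch. It is written in Go only to stay consistent with the other examples in this post; the production state service was Rust, and this is not its code. The field set, the `MatchState` type, and the method names are illustrative assumptions.

```go
package main

import "time"

// MaxPlayers is the per-match cap mentioned in the article.
const MaxPlayers = 100

// TickRate is the simulation rate; TickBudget is the wall-clock time
// available for one tick (about 16.67ms at 60 ticks per second).
const TickRate = 60
const TickBudget = time.Second / TickRate

// MatchState is a hypothetical struct-of-arrays layout: each field lives in
// its own fixed-size array, so a pass over positions touches only position
// data. Everything is allocated once, when the match is created.
type MatchState struct {
	Count  int
	PosX   [MaxPlayers]float32
	PosY   [MaxPlayers]float32
	VelX   [MaxPlayers]float32
	VelY   [MaxPlayers]float32
	Health [MaxPlayers]int16
}

// AddPlayer claims the next pre-allocated slot and returns its index,
// or -1 if the match is full. No allocation happens here.
func (m *MatchState) AddPlayer(x, y float32) int {
	if m.Count >= MaxPlayers {
		return -1
	}
	i := m.Count
	m.PosX[i], m.PosY[i] = x, y
	m.Health[i] = 100
	m.Count++
	return i
}

// Step advances every live player's position by dt seconds.
func (m *MatchState) Step(dt float32) {
	for i := 0; i < m.Count; i++ {
		m.PosX[i] += m.VelX[i] * dt
		m.PosY[i] += m.VelY[i] * dt
	}
}
```

A real tick would also run collision detection and apply queued inputs; the point here is only the memory layout and the fact that joining a match never allocates.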
The Sharding Strategy That Saved Us
Here’s the key insight: *most game state is local to a match*. Players in match #123 don’t care about what’s happening in match #456.
We sharded by match ID across multiple Rust worker pods. Each pod owned 20-30 matches. This meant:
- No cross-pod state synchronization needed
- Linear scaling — add more pods to handle more matches
- Failure isolation — one pod crashing affected only 20-30 matches
The Redis cluster mirrored this sharding. Each match’s state lived in a dedicated Redis hash, keyed by `match:{match_id}`. The `Room Manager` service routed all state operations to the correct Redis shard.
This single decision eliminated 90% of the coordination complexity. Most game backend failures I’ve seen come from trying to maintain global state across thousands of concurrent matches. Don’t do it. Shard by match.
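As a rough illustration of match-level sharding, the sketch below assigns each new match to the least-loaded pod with spare capacity and derives the Redis key from the match ID. The `Router` type, its methods, and the cap of 30 are assumptions made for this example; the actual Room Manager logic is not shown in this post.

```go
package main

import "errors"

// maxMatchesPerPod mirrors the upper end of the 20-30 range described above.
const maxMatchesPerPod = 30

// ErrNoCapacity signals that every pod is full and another pod is needed.
var ErrNoCapacity = errors.New("no pod has spare capacity")

// Router is a hypothetical in-memory view of which pod owns which match.
// It is not safe for concurrent use; a real service would guard it.
type Router struct {
	podLoad  []int          // number of matches owned by each pod
	matchPod map[string]int // match ID -> owning pod index
}

// NewRouter creates a router for a fixed number of pods.
func NewRouter(pods int) *Router {
	return &Router{podLoad: make([]int, pods), matchPod: make(map[string]int)}
}

// Assign returns the pod that owns matchID, placing a new match on the
// least-loaded pod that still has room.
func (r *Router) Assign(matchID string) (int, error) {
	if pod, ok := r.matchPod[matchID]; ok {
		return pod, nil
	}
	best := -1
	for i, load := range r.podLoad {
		if load >= maxMatchesPerPod {
			continue
		}
		if best == -1 || load < r.podLoad[best] {
			best = i
		}
	}
	if best == -1 {
		return -1, ErrNoCapacity
	}
	r.podLoad[best]++
	r.matchPod[matchID] = best
	return best, nil
}

// Release frees the slot once a match has closed.
func (r *Router) Release(matchID string) {
	if pod, ok := r.matchPod[matchID]; ok {
		r.podLoad[pod]--
		delete(r.matchPod, matchID)
	}
}

// StateKey builds the Redis hash key for a match, e.g. "match:123".
func StateKey(matchID string) string {
	return "match:" + matchID
}
```

Because a match never moves once assigned, every state operation for that match goes to one pod and one key, which is what removes the need for cross-pod coordination.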
Handling the WebSocket Stampede
When a match starts, 100 players all connect simultaneously. The gateway needs to handle that spike without dropping connections.
We used Go goroutines for connection handling — each player gets their own goroutine, and Go’s runtime efficiently multiplexes them across OS threads. The gateway scaled horizontally behind an AWS Network Load Balancer with sticky sessions (important for WebSocket persistence).
During the 50K player stress test, the gateway layer handled 25,000 simultaneous connections per pod across 2 pods (with 2 more as hot spares). Connection setup time: 4.3ms average. Zero timeouts.
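The goroutine-per-connection pattern can be sketched with the standard library alone. In the example below, an `io.Writer` stands in for the WebSocket connection, and the `Hub`, `Client`, and `WritePump` names are hypothetical. It also shows a room-scoped broadcast, so an event only reaches players in the same match rather than every connected client.

```go
package main

import (
	"io"
	"sync"
)

// Client represents one connected player. Outbound messages are queued on
// a buffered channel and written by that player's own goroutine.
type Client struct {
	ID   string
	send chan []byte
}

// Hub tracks which clients belong to which match.
type Hub struct {
	mu    sync.RWMutex
	rooms map[string]map[*Client]struct{}
}

// NewHub returns an empty hub.
func NewHub() *Hub {
	return &Hub{rooms: make(map[string]map[*Client]struct{})}
}

// Join registers a player in a match with an outbound buffer of buf messages.
func (h *Hub) Join(matchID, playerID string, buf int) *Client {
	c := &Client{ID: playerID, send: make(chan []byte, buf)}
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.rooms[matchID] == nil {
		h.rooms[matchID] = make(map[*Client]struct{})
	}
	h.rooms[matchID][c] = struct{}{}
	return c
}

// Leave removes a player and closes their queue so WritePump can exit.
func (h *Hub) Leave(matchID string, c *Client) {
	h.mu.Lock()
	defer h.mu.Unlock()
	if _, ok := h.rooms[matchID][c]; !ok {
		return
	}
	delete(h.rooms[matchID], c)
	close(c.send)
}

// Broadcast queues msg for the players of one match only. A client whose
// buffer is full is skipped instead of stalling everyone else.
func (h *Hub) Broadcast(matchID string, msg []byte) (delivered, dropped int) {
	h.mu.RLock()
	defer h.mu.RUnlock()
	for c := range h.rooms[matchID] {
		select {
		case c.send <- msg:
			delivered++
		default:
			dropped++
		}
	}
	return delivered, dropped
}

// WritePump is meant to run as `go c.WritePump(conn)`, one goroutine per
// connection. It drains the queue until Leave closes it or a write fails.
func (c *Client) WritePump(w io.Writer) error {
	for msg := range c.send {
		if _, err := w.Write(msg); err != nil {
			return err
		}
	}
	return nil
}
```

Dropping messages for slow clients is one possible policy, chosen here for brevity; disconnecting the client or coalescing state deltas are alternatives, and this post does not say which the gateway used.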
The AI Platform Effect: Where ACP Actually Made a Difference
I want to be specific about this because “AI-powered” gets thrown around too loosely. The ECOA AI Platform ACP contributed in three concrete ways:
1. Automated Test Generation for State Transitions
Game state machines are complex. A match goes through: `LOBBY → COUNTDOWN → ACTIVE → POSTGAME → CLOSED`. Each transition has rules — you can’t go from LOBBY to POSTGAME without passing through ACTIVE, for example.
The ACP analyzed our state machine definitions and generated 187 edge case tests we hadn’t thought of. Stuff like: “What happens if a player disconnects during the COUNTDOWN phase?” or “Can a match be closed while players are still in the ACTIVE state?”
It caught 3 bugs that would have caused match corruption in production. One of them would have left players stuck in an infinite LOBBY state. That alone paid for the platform.
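For readers who want to see what such a state machine looks like in code, here is a minimal sketch of the lifecycle with a transition check. It models only the strict linear chain described above; the identifiers are illustrative, and any cancel or abort paths a real match needs (for example, closing an empty lobby) are deliberately left out.

```go
package main

import "fmt"

// MatchPhase is one step in the match lifecycle.
type MatchPhase int

const (
	Lobby MatchPhase = iota
	Countdown
	Active
	Postgame
	Closed
)

var phaseNames = [...]string{"LOBBY", "COUNTDOWN", "ACTIVE", "POSTGAME", "CLOSED"}

// String returns the phase name used in the article.
func (p MatchPhase) String() string {
	if p < 0 || int(p) >= len(phaseNames) {
		return "UNKNOWN"
	}
	return phaseNames[p]
}

// legalNext lists the single legal successor of each phase.
// CLOSED is terminal, so it has no entry.
var legalNext = map[MatchPhase]MatchPhase{
	Lobby:     Countdown,
	Countdown: Active,
	Active:    Postgame,
	Postgame:  Closed,
}

// Transition returns nil if moving from one phase to another is allowed,
// and a descriptive error otherwise.
func Transition(from, to MatchPhase) error {
	if want, ok := legalNext[from]; ok && want == to {
		return nil
	}
	return fmt.Errorf("illegal transition %s -> %s", from, to)
}
```

With the rules in a table like this, edge-case tests reduce to enumerating every `(from, to)` pair and asserting which ones must fail, such as `Transition(Lobby, Postgame)` or `Transition(Active, Closed)`.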
2. CI/CD Pipeline Acceleration
Our DevOps engineer set up a GitHub Actions pipeline that:
- Built all 6 services
- Ran unit and integration tests
- Deployed to a staging cluster
- Executed a 10-minute load test against the staging cluster
Without ACP, this pipeline took 22 minutes from push to completion. The ACP added intelligent caching — it detected unchanged services and skipped their build and test stages. Pipeline time dropped to 9 minutes. That’s a 59% reduction.
Our junior engineer handled most of this pipeline work. The ACP’s agent generated the initial YAML configurations and suggested optimization patterns. He just reviewed and approved.
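The ACP's internals are not something I can show here, but the general idea of skipping unchanged services can be sketched. The example below assumes a content-hash approach: fingerprint each service's source files and rebuild only the services whose fingerprint differs from the last successful run. The function names and the choice of hashing are assumptions for illustration, not a description of how the platform actually does it.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"sort"
)

// Fingerprint hashes a service's files (path -> content) in sorted path
// order, so the result does not depend on map iteration order.
func Fingerprint(files map[string][]byte) string {
	paths := make([]string, 0, len(files))
	for p := range files {
		paths = append(paths, p)
	}
	sort.Strings(paths)

	h := sha256.New()
	for _, p := range paths {
		h.Write([]byte(p))
		h.Write([]byte{0}) // separator between path and content
		h.Write(files[p])
		h.Write([]byte{0}) // separator between files
	}
	return hex.EncodeToString(h.Sum(nil))
}

// ServicesToBuild compares fingerprints from the last successful run with
// the current ones and returns, in sorted order, the services that are new
// or have changed. Everything else can reuse cached build and test results.
func ServicesToBuild(previous, current map[string]string) []string {
	var changed []string
	for name, fp := range current {
		if previous[name] != fp {
			changed = append(changed, name)
		}
	}
	sort.Strings(changed)
	return changed
}
```

A production version would also need to account for shared libraries and dependency changes, since a service can need rebuilding even when its own directory is untouched.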
3. Code Review Assistance
Here’s where I was skeptical but became a believer. The ACP doesn’t write PR reviews for you — it flags *anomalies*. Things like:
- A function that’s suspiciously similar to another function elsewhere
- A variable name that doesn’t match the team’s conventions
- A Redis key pattern that deviates from the established format
It caught 14 convention violations in the first week alone. More importantly, it reduced our code review cycle time from an average of 6 hours to 45 minutes. The senior engineers spent their review time on actual logic, not formatting nits.
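A convention check like the Redis key one is simple to picture. The sketch below flags keys that do not match an assumed `match:<numeric id>` format. The numeric-ID assumption and the `NonConformingKeys` name are mine for the sake of the example; the ACP's actual rules are not shown here.

```go
package main

import "regexp"

// matchKeyPattern encodes the assumed convention: "match:" followed by a
// numeric match ID, for example "match:123".
var matchKeyPattern = regexp.MustCompile(`^match:[0-9]+$`)

// NonConformingKeys returns, in input order, the keys that deviate from
// the convention. It returns nil when every key conforms.
func NonConformingKeys(keys []string) []string {
	var bad []string
	for _, k := range keys {
		if !matchKeyPattern.MatchString(k) {
			bad = append(bad, k)
		}
	}
	return bad
}
```

Run against the key literals found in a diff, a check like this turns a style discussion in code review into a mechanical pass or fail.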
The Results: What We Actually Measured
After 8 weeks of development and 2 weeks of hardening, we ran a 48-hour production simulation with simulated players.
| Metric | Target | Actual |
|---|---|---|
| Concurrent players | 50,000 | 52,300 (peak) |
| Average tick time | <5ms | 2.1ms |
| P99 state update latency | <100ms | 47ms |
| P99 matchmaking time | <10s | 3.2s |
| Uptime (48h test) | 99.9% | 99.98% |
| Connection drop rate | <1% | 0.3% |
| Total infrastructure cost (monthly) | $18,000 | $14,200 |
The client was frankly stunned. Their CTO told me: *“We budgeted 6 months for this. You delivered in 8 weeks with a team we’ve never met in person.”*
But honestly, the credit goes to the team in Vietnam. They understood the domain deeply — our senior Rust engineer had previously contributed to game server open-source projects in his spare time. The AI platform amplified their velocity, but the architectural decisions came from their experience.
One thing I learned: never underestimate the power of a team that’s done this before. Domain expertise beats raw talent every time.
Key Takeaways for Anyone Building Real-Time Systems
Here’s what I’d tell any CTO or engineering lead considering a similar project:
- Shard by session, not by service. The most common mistake is building services that need global state. If your system naturally divides into isolated sessions (game matches, chat rooms, video calls), structure your architecture around that isolation.
- Don’t fight the language. We used Go for I/O-heavy services and Rust for compute-heavy services. Could we have used one language for everything? Sure. But we’d have paid for it in performance or complexity.
- Invest in real load testing early. We started running 10,000-player simulations in week 3. It exposed a race condition in our matchmaking service that would have been catastrophic at scale.
- Trust the team, not just the location. Our developers were in Ho Chi Minh City and Can Tho. They were distributed. But they were also highly experienced and motivated. The AI platform removed friction; it didn’t create capability where none existed.
- AI orchestration is a force multiplier, not a replacement. The ACP helped us move faster, but it didn’t design the architecture. That came from the senior engineers who had been building distributed systems for years.
Final Thoughts
A year ago, I would have told you building a 50K-player game backend in 8 weeks with an offshore team was unrealistic. Today, I’ve lived it.
The combination works. Elite Vietnamese engineers, AI-augmented workflows, and a client willing to trust the process. That’s the formula.
The game hasn’t launched yet — the studio is still polishing the client-side experience. But the backend is battle-tested, documented, and running on autopilot. When they hit the “Launch” button, they won’t be wondering if the servers will hold.
They know.
Frequently Asked Questions
How did the Vietnamese team handle the domain complexity of game backend development?
Our team wasn’t starting from zero. Two of the senior engineers had previously built real-time systems for fintech and IoT — domains with similar latency and state synchronization requirements. The mid-level engineer had direct game server experience with Nakama. We also invested 3 days in a “domain sprint” during week 1 where the team played the game prototype and studied the existing state machine. That upfront context eliminated most of the domain ambiguity.
What was the biggest technical risk, and how did you mitigate it?
The Rust game state service. It was the most performance-critical component and the team had limited production Rust experience. We mitigated this by: (1) having the two senior engineers pair-program the core tick loop, (2) using the ACP to generate unit tests for every state transition, and (3) running load tests continuously from week 3 onward. The first production simulation revealed a memory leak in the player allocation arena — we fixed it in 4 hours.
How does the ECOA AI Platform ACP compare to using Copilot or Cursor for this type of work?
Different tools, different purposes. Copilot and Cursor are great for inline code completion — they help you write functions faster. The ACP operates at the workflow level: it orchestrates CI/CD, generates test cases from architectural definitions, and monitors code consistency across a team. It’s more like having a senior DevOps engineer + QA lead in a box. For this project, the ACP saved us more time in pipeline automation and test coverage than in code writing.
Can this architecture work for non-gaming real-time applications?
Absolutely. The same patterns apply to any real-time multi-session system: chat platforms, collaborative editing tools, live auction systems, or real-time monitoring dashboards. The key components — WebSocket gateway, session-based state sharding, Redis pub/sub, and event-driven message bus — are domain-agnostic. We’ve since reused this architecture for a client in the live events ticketing space. It handled 200,000 concurrent users during a ticket sale with zero issues.