How We Built a Custom Internal Developer Platform for a Fintech Unicorn — A Vietnam Offshore + AI Orchestration Case Study
Platform engineering is having its moment. And for good reason.
When your organization runs 12+ microservice teams, each with their own deployment pipelines, environment configurations, and CI/CD quirks, you don’t have a scaling problem. You have a chaos problem.
That’s exactly where a fintech unicorn—let’s call them PayFlow—found themselves in early 2024. They had 180 engineers across 12 squads, processing $4.2B in annual transaction volume. But their developer experience was a nightmare.
The numbers before we started:
- Average deployment lead time: 4.2 days
- Environment-related failure rate: 23% of all deployments
- New developer onboarding time: 3-4 weeks
- Infrastructure cost per dev team: $18K/month (duplicated tooling)
PayFlow’s CTO had tried internal tools before. Two in-house platform teams had attempted to build an Internal Developer Platform (IDP) over 18 months. Both failed. The first was too rigid. The second tried to do everything at once and burned out.
They needed a fresh approach. So they came to us.
Why a Vietnamese Team with AI Orchestration Was the Right Call
Let’s be direct about this.
Building an IDP isn’t a feature project. It’s a product that serves developers. You need engineers who understand developer workflows deeply, can write clean infrastructure code, and work at the speed of a startup—not a consulting firm.
We staffed the project with a team of 6 engineers from our Ho Chi Minh City hub:
- 1 Senior Platform Engineer (team lead, Kubernetes + Go)
- 2 Middle Backend Engineers (Go, PostgreSQL, gRPC)
- 1 Senior DevOps Engineer (Terraform, Crossplane, ArgoCD)
- 1 Middle Frontend Engineer (React, Backstage)
- 1 Junior Developer (automation scripts, test infra)
All 6 were vetted for English fluency and had prior experience building internal tooling for product companies. None of them needed “ramp-up time.” They shipped on day one.
We also equipped them with the ECOA AI Platform ACP for agent orchestration. This wasn’t a gimmick. It was a force multiplier.
Here’s why: Building an IDP involves dozens of repetitive, interconnected tasks—generating Terraform modules, writing scaffolded service templates, validating YAML configurations, creating documentation stubs. ACP allowed us to deploy specialized AI agents that handled these tasks autonomously, while the senior engineers focused on architecture and code review.
The Architecture: What We Actually Built
You can’t build a good IDP without understanding what developers actually hate. So we started with a survey.
Top 3 pain points from PayFlow’s engineers:
- “Deploying a new microservice takes 3 days of manual config work.”
- “Staging environments are never in sync with production.”
- “I have no idea which services depend on mine.”
We designed the IDP around these three problems.
Here’s the high-level architecture:
┌────────────────────────────────────────────────────┐
│                  Developer Portal                  │
│            (Backstage + Custom Plugins)            │
├────────────────────────────────────────────────────┤
│             Service Catalog & Scoring              │
├──────────────────────────┬─────────────────────────┤
│    Provisioning Layer    │    Deployment Layer     │
│ (Crossplane + Terraform) │     (ArgoCD + K8s)      │
├──────────────────────────┴─────────────────────────┤
│               Golden Path Templates                │
│    (Go service, Python service, Cron job, etc.)    │
├────────────────────────────────────────────────────┤
│            ECOA AI Platform ACP Agents             │
│  (Template scaffold → Config gen → Docs → Review)  │
└────────────────────────────────────────────────────┘
The key insight? We didn’t build everything from scratch. We extended Backstage with custom plugins and wrapped it with ACP agents that automated the grunt work.
Golden Path Templates with AI Scaffolding
Every new service at PayFlow now starts with a single command:
```bash
idp create service payment-gateway --type go-service
```
Behind the scenes, ACP orchestrates a sequence of agents:
- Template Agent pulls the latest Go service template from a curated registry
- Config Agent generates Kubernetes manifests, Helm charts, and Terraform modules based on service type and team namespace
- Dependency Agent scans the service catalog and auto-registers the new service with its upstream dependencies
- Docs Agent generates an OpenAPI spec stub, a README, and a local development docker-compose file
- Review Agent submits a PR to the platform team’s GitHub repo with all generated code
The entire process takes 12-14 minutes. Previously, it took a senior engineer a full day.
```go
// Example: A simplified agent orchestration task on ACP
task := acp.NewTask("scaffold_service").
	WithInput(map[string]interface{}{
		"service_name": "payment-gateway",
		"language":     "go",
		"team":         "payments",
		"dependencies": []string{"auth-service", "ledger-service"},
	}).
	WithAgents([]string{"template-agent", "config-agent", "docs-agent"}).
	WithPostHook("pr_submission")

result, err := orchestrator.Execute(ctx, task)
```
The Self-Service Environment Manager
Environment conflicts were killing PayFlow’s velocity. One team’s staging deployment would overwrite another team’s database migration. Classic multi-tenant problem.
We built an environment manager that uses Kubernetes namespaces with resource quotas and TTL-based cleanup.
Each developer gets a personal ephemeral environment that lives for 8 hours (configurable). These environments are provisioned on demand via Crossplane, which creates the namespace, deploys the service, and wires up dependencies using service mesh routing.
The AI twist: ACP monitors environment usage patterns. If a developer’s environment hasn’t received traffic in 2 hours, an agent sends a Slack reminder. If it’s idle for 4 hours, it auto-destroys the environment and archives the logs. This alone cut infrastructure costs by 37%.
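The idle policy described above reduces to a small decision function. The sketch below is one possible way to express it, assuming the caller already knows an environment's age and how long it has gone without traffic. The type and constant names are illustrative; the 2-hour, 4-hour, and 8-hour thresholds are the ones quoted in this section.

```go
package main

import (
	"fmt"
	"time"
)

// Action is what the cleanup agent should do with an environment.
type Action int

const (
	Keep Action = iota
	Remind
	Destroy
)

func (a Action) String() string {
	return [...]string{"keep", "remind", "destroy"}[a]
}

// Thresholds from the article; the names are illustrative.
const (
	RemindAfterIdle  = 2 * time.Hour
	DestroyAfterIdle = 4 * time.Hour
	MaxLifetime      = 8 * time.Hour
)

// Decide picks the cleanup action for one environment.
// age is the time since creation; idle is the time since the last observed traffic.
func Decide(age, idle time.Duration) Action {
	switch {
	case age >= MaxLifetime, idle >= DestroyAfterIdle:
		return Destroy
	case idle >= RemindAfterIdle:
		return Remind
	default:
		return Keep
	}
}

func main() {
	fmt.Println(Decide(1*time.Hour, 30*time.Minute)) // keep
	fmt.Println(Decide(3*time.Hour, 2*time.Hour))    // remind
	fmt.Println(Decide(5*time.Hour, 4*time.Hour))    // destroy
	fmt.Println(Decide(8*time.Hour, 10*time.Minute)) // destroy (lifetime reached)
}
```

Keeping the decision pure (no Slack or Kubernetes calls inside it) makes the thresholds easy to test and to change per team.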
Service Catalog with Real-Time Dependency Graphs
We built a Backstage plugin that ingests data from multiple sources:
- Kubernetes custom resource definitions (CRDs) for service-to-service communication
- OpenTelemetry trace data for runtime dependencies
- GitHub repository metadata for ownership and code quality
The result? A live dependency graph that shows exactly which services talk to each other, their latency percentiles, and their deployment status.
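To make the "which services depend on mine" question concrete, here is a minimal sketch of the underlying graph query. It assumes runtime calls have already been reduced to caller/callee pairs (for example from trace spans); the `Edge` and `Graph` types are hypothetical, and `checkout` is an invented service name used only for the demo.

```go
package main

import (
	"fmt"
	"sort"
)

// Edge is one observed call from Caller to Callee.
type Edge struct{ Caller, Callee string }

// Graph maps each service to the set of services that call it.
type Graph map[string]map[string]bool

// Build indexes edges by callee so dependents can be looked up quickly.
func Build(edges []Edge) Graph {
	g := Graph{}
	for _, e := range edges {
		if g[e.Callee] == nil {
			g[e.Callee] = map[string]bool{}
		}
		g[e.Callee][e.Caller] = true
	}
	return g
}

// Dependents returns every service that calls svc directly or
// transitively, sorted by name. Cycles are handled by the seen set.
func (g Graph) Dependents(svc string) []string {
	seen := map[string]bool{}
	queue := []string{svc}
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		for caller := range g[cur] {
			if !seen[caller] && caller != svc {
				seen[caller] = true
				queue = append(queue, caller)
			}
		}
	}
	out := make([]string, 0, len(seen))
	for s := range seen {
		out = append(out, s)
	}
	sort.Strings(out)
	return out
}

func main() {
	g := Build([]Edge{
		{"payment-gateway", "auth-service"},
		{"payment-gateway", "ledger-service"},
		{"checkout", "payment-gateway"}, // hypothetical service
	})
	fmt.Println(g.Dependents("auth-service")) // [checkout payment-gateway]
}
```

A breadth-first walk over reversed edges answers the blast-radius question before a change ships: everything in the result may be affected if `auth-service` misbehaves.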
ACP agents run a daily health check on every service in the catalog. If a service hasn’t been deployed in 30 days, the agent flags it for potential decommissioning. PayFlow’s platform team used to do this manually once a quarter. Now it happens every night.
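The nightly staleness check can be sketched as a simple filter over catalog entries. This is an assumption about the shape of the check rather than the agent's real code: the `Service` type is hypothetical, `legacy-reports` is an invented name, and only the 30-day threshold comes from the text.

```go
package main

import (
	"fmt"
	"time"
)

// Service is a minimal, hypothetical catalog entry.
type Service struct {
	Name         string
	LastDeployed time.Time
}

// StaleAfter is the 30-day threshold mentioned in the article.
const StaleAfter = 30 * 24 * time.Hour

// Stale returns the names of services whose last deployment is
// more than StaleAfter before now.
func Stale(services []Service, now time.Time) []string {
	var flagged []string
	for _, s := range services {
		if now.Sub(s.LastDeployed) > StaleAfter {
			flagged = append(flagged, s.Name)
		}
	}
	return flagged
}

func main() {
	now := time.Date(2024, time.June, 30, 0, 0, 0, 0, time.UTC)
	services := []Service{
		{"ledger-service", now.AddDate(0, 0, -3)},
		{"legacy-reports", now.AddDate(0, 0, -45)}, // hypothetical service
	}
	fmt.Println(Stale(services, now)) // [legacy-reports]
}
```

Passing `now` in as a parameter, instead of calling `time.Now()` inside the function, keeps the check deterministic and easy to test.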
The Metrics That Matter
We shipped the IDP in 11 weeks with the Vietnamese team. Here’s what changed in the first 90 days after rollout:
| Metric | Before | After | Improvement |
|---|---|---|---|
| Deployment lead time (new service) | 4.2 days | 45 minutes | 99.3% |
| Environment failure rate | 23% | 2.1% | 90.9% |
| New dev onboarding time | 3-4 weeks | 2 days | 90%+ |
| Infra cost per team/month | $18K | $5.3K | 70.6% |
| Change failure rate | 17% | 4.6% | 72.9% |
The onboarding number surprised even us; we had not planned for a 2-day onboarding. The Golden Path templates and self-service environments eliminated the “setup hell” that usually consumes the first week, so new developers could deploy a working service on day one and explore the architecture using the dependency graph.
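For readers who want to check the Improvement column, each figure is the relative reduction, (before - after) / before. The short sketch below recomputes four of the rows. One assumption is needed for the lead-time row: the 4.2 days are treated as 24-hour calendar days so they can be compared with 45 minutes.

```go
package main

import "fmt"

// Reduction returns the relative decrease from before to after, in percent.
func Reduction(before, after float64) float64 {
	return (before - after) / before * 100
}

func main() {
	// Lead time is converted to minutes, assuming 24-hour days.
	fmt.Printf("deployment lead time: %.1f%%\n", Reduction(4.2*24*60, 45)) // 99.3%
	fmt.Printf("env failure rate:     %.1f%%\n", Reduction(23, 2.1))       // 90.9%
	fmt.Printf("infra cost per team:  %.1f%%\n", Reduction(18, 5.3))       // 70.6%
	fmt.Printf("change failure rate:  %.1f%%\n", Reduction(17, 4.6))       // 72.9%
}
```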
Where the AI Orchestration Actually Shone
Not every AI tool lives up to the hype. But ACP earned its place in this project.
The most impactful use case wasn’t code generation. It was configuration validation.
PayFlow’s infrastructure stack includes 14 different types of configuration files per service, among them Helm values, Terraform variables, Crossplane claims, ArgoCD application specs, service mesh policies, monitoring dashboards, and alerting rules. Before our IDP, a single misconfigured field could block a deployment for hours.
We built an ACP agent that checks every generated config file against a set of 147 validation rules. These rules cover security policies (no hardcoded secrets), resource limits (every container must have CPU/memory limits), and naming conventions (service names must match a regex pattern).
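To give a flavor of the three rule families, the sketch below checks one simplified container definition. It is only an illustration: the `Container` type, the naming regex, and the keyword heuristic for secrets are assumptions made for the example, not the actual 147 rules.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Container is a heavily simplified stand-in for a parsed workload config.
type Container struct {
	Name        string
	CPULimit    string
	MemoryLimit string
	Env         map[string]string
}

// serviceName is an assumed convention: lowercase words joined by hyphens.
var serviceName = regexp.MustCompile(`^[a-z][a-z0-9]*(-[a-z0-9]+)*$`)

// looksSecret is a naive heuristic based on the variable name.
func looksSecret(key string) bool {
	upper := strings.ToUpper(key)
	return strings.Contains(upper, "PASSWORD") ||
		strings.Contains(upper, "SECRET") ||
		strings.Contains(upper, "TOKEN")
}

// Validate returns one message per violated rule.
func Validate(service string, c Container) []string {
	var violations []string
	if !serviceName.MatchString(service) {
		violations = append(violations, "naming: "+service+" does not match the convention")
	}
	if c.CPULimit == "" || c.MemoryLimit == "" {
		violations = append(violations, "resources: container "+c.Name+" is missing CPU or memory limits")
	}
	for key, value := range c.Env {
		if value != "" && looksSecret(key) {
			violations = append(violations, "security: env var "+key+" looks like a hardcoded secret")
		}
	}
	return violations
}

func main() {
	bad := Container{Name: "app", Env: map[string]string{"DB_PASSWORD": "hunter2"}}
	for _, v := range Validate("Payment_Gateway", bad) {
		fmt.Println(v)
	}
}
```

Real secret detection usually needs more than key names (entropy checks, known token formats), so treat the heuristic here as a placeholder.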
When the agent finds a violation, it doesn’t just report it. It auto-fixes the config and re-validates. If the fix is ambiguous, it comments on the PR with the exact line and suggested correction.
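The fix-then-re-validate behavior can be modeled as rules that carry an optional repair function. The sketch below is a hedged approximation, not the agent's implementation: `Config`, `Rule`, and both example rules are hypothetical, and the `500m` default is an assumed value. A rule without a `Fix` stands in for the "ambiguous" case that ends up as a PR comment.

```go
package main

import (
	"fmt"
	"strings"
)

// Config is a simplified, hypothetical view of one generated config.
type Config struct {
	CPULimit string
	Image    string
}

// Rule pairs a check with an optional fix.
// Fix is nil when no safe default exists.
type Rule struct {
	Name  string
	Check func(Config) bool   // true when the config passes
	Fix   func(Config) Config // may be nil
}

// Result separates what was repaired from what needs a human.
type Result struct {
	Config      Config
	Fixed       []string
	NeedsReview []string
}

// Apply runs each rule, tries its fix on failure, and accepts the
// fix only if the same rule passes on the repaired config.
func Apply(cfg Config, rules []Rule) Result {
	res := Result{Config: cfg}
	for _, r := range rules {
		if r.Check(res.Config) {
			continue
		}
		if r.Fix != nil {
			candidate := r.Fix(res.Config)
			if r.Check(candidate) {
				res.Config = candidate
				res.Fixed = append(res.Fixed, r.Name)
				continue
			}
		}
		res.NeedsReview = append(res.NeedsReview, r.Name)
	}
	return res
}

var rules = []Rule{
	{
		Name:  "cpu-limit-set",
		Check: func(c Config) bool { return c.CPULimit != "" },
		Fix:   func(c Config) Config { c.CPULimit = "500m"; return c }, // assumed default
	},
	{
		Name:  "image-pinned",
		Check: func(c Config) bool { return c.Image != "" && !strings.HasSuffix(c.Image, ":latest") },
		// No Fix: picking a version is ambiguous, so a human decides.
	},
}

func main() {
	res := Apply(Config{Image: "payment-gateway:latest"}, rules)
	fmt.Println("fixed:", res.Fixed)              // fixed: [cpu-limit-set]
	fmt.Println("needs review:", res.NeedsReview) // needs review: [image-pinned]
}
```

This version only re-checks the rule that triggered the fix. A fuller implementation would re-run the whole rule set after each repair, since one fix can break another rule.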
In the first month alone, this agent caught 843 configuration errors before they hit production. That’s 843 potential incidents that never happened.
The Hard Lessons
It wasn’t all smooth sailing. A few things we’d do differently:
Don’t over-automate too early. In weeks 1-3, we tried to make ACP agents handle everything—including PR descriptions and commit messages. Developers hated it. The AI-generated descriptions were too verbose and missed context. We scaled back and only automated what was truly repetitive. Let the humans write the narrative.
Golden Paths need regular pruning. Our initial set of service templates was too broad (12 templates). Developers kept picking the wrong one. We reduced it to 4 base types and made the rest composable extensions. Adoption jumped from 34% to 89% overnight.
The team in Can Tho saved our timeline. Halfway through the project, we hit a staffing bottleneck. PayFlow needed an additional Kafka integration that our HCMC team didn’t have capacity for. We spun up a 2-person sub-team from our Can