Stop Hardcoding Agent Names: How a Dynamic Agent Registry Saved Our Multi-Agent System from Rot
I’ve seen it happen more times than I’d like to admit. A team builds a multi-agent system with three agents. They hardcode the agent names, URLs, and capabilities directly into the orchestrator’s config file. It works beautifully for the demo.
Then they add a fourth agent. Then a fifth. Six months later, someone fat-fingers a config, an agent goes offline for maintenance, and the whole pipeline silently drops tasks.
Hardcoding agent metadata is tech debt with a smile.
Here’s the problem: static agent definitions don’t scale. They’re brittle, require manual updates, and turn your orchestrator into a fragile monolith that must know everything about every agent at startup. In production, agents come and go – during scaling events, rolling deployments, or partial outages. Your orchestrator needs to handle that without a config push.
Let me show you what we built for a logistics client in Ho Chi Minh City. They had 27 specialized agents processing shipments, fraud checks, route optimization, etc. Hardcoding wasn’t just painful—it was dangerous.
The Anatomy of a Dynamic Agent Registry
The core idea is simple: agents register themselves with a central store (Redis) at startup, send periodic heartbeats, and the orchestrator discovers agents via the registry instead of config files.
We chose Redis for three reasons:
- Built-in TTL – automatic agent expiration if heartbeat misses
- Pub/Sub – real-time notifications of agent join/leave
- Sorted sets – for capability-based routing with scores
Here’s the registration flow:
- Agent starts → connects to Redis → sets a hash with its metadata (name, endpoint, capabilities) + a TTL of 15 seconds
- Agent spawns a background thread that refreshes the TTL every 5 seconds
- Orchestrator queries Redis for agents matching required capabilities, sorted by load
- If orchestrator can’t find a capable agent, it either waits for a Pub/Sub “agent joined” notification or fails fast
Let’s look at the code.
Agent Registration (Python)
```python
import json
import threading
import time

import redis


class AgentRegistry:
    def __init__(self, agent_name: str, endpoint: str, capabilities: list[str],
                 redis_client: redis.Redis, ttl: int = 15, heartbeat_interval: int = 5):
        self.agent_name = agent_name
        self.endpoint = endpoint
        self.capabilities = capabilities
        self.redis = redis_client
        self.ttl = ttl
        self.heartbeat_interval = heartbeat_interval
        self.running = False

    def register(self):
        key = f"agent:{self.agent_name}"
        data = {
            "name": self.agent_name,
            "endpoint": self.endpoint,
            "capabilities": json.dumps(self.capabilities),
            "load": 0,
            "last_heartbeat": time.time(),
        }
        self.redis.hset(key, mapping=data)
        self.redis.expire(key, self.ttl)
        # Add to capability-based sets for routing (score = current load)
        for cap in self.capabilities:
            self.redis.zadd(f"capability:{cap}", {self.agent_name: 0})
        # Notify orchestrators
        self.redis.publish(
            "agent:registered",
            json.dumps({"name": self.agent_name, "capabilities": self.capabilities}),
        )

    def start_heartbeat(self):
        self.running = True

        def beat():
            while self.running:
                key = f"agent:{self.agent_name}"
                load = self.current_load()  # implement load logic
                self.redis.hset(key, mapping={"last_heartbeat": time.time(), "load": load})
                self.redis.expire(key, self.ttl)
                # Keep the sorted-set score in sync so routing sees the real load
                for cap in self.capabilities:
                    self.redis.zadd(f"capability:{cap}", {self.agent_name: load})
                time.sleep(self.heartbeat_interval)

        threading.Thread(target=beat, daemon=True).start()

    def current_load(self):
        # Return number of in-flight tasks or CPU usage – simplified here
        return 0

    def deregister(self):
        self.running = False
        key = f"agent:{self.agent_name}"
        self.redis.delete(key)
        for cap in self.capabilities:
            self.redis.zrem(f"capability:{cap}", self.agent_name)
        self.redis.publish("agent:deregistered", json.dumps({"name": self.agent_name}))
```
Orchestrator Discovery (Python)
```python
import json
import time

import redis


class Orchestrator:
    def __init__(self, redis_client: redis.Redis):
        # Assumes the client was created with decode_responses=True, so keys,
        # members, and Pub/Sub payloads come back as str rather than bytes.
        self.redis = redis_client
        self.pubsub = redis_client.pubsub()
        self.pubsub.subscribe("agent:registered", "agent:deregistered")
        self.agent_cache = {}

    def find_agent(self, required_capability: str):
        # Try cache first, 50ms TTL
        cached = self.agent_cache.get(required_capability)
        if cached:
            ts, agents = cached
            if time.time() - ts < 0.05 and agents:
                return agents[0]  # lowest load

        # Query Redis sorted set: members ordered by load score, lowest first
        cap_key = f"capability:{required_capability}"
        for name, _load in self.redis.zrange(cap_key, 0, -1, withscores=True):
            # Get agent details from hash
            details = self.redis.hgetall(f"agent:{name}")
            if details:
                self.agent_cache[required_capability] = (time.time(), [details])
                return details
            # Hash expired (missed heartbeats): drop the stale set member
            self.redis.zrem(cap_key, name)

        # Listen for new agents (waits up to 2 seconds in total)
        deadline = time.time() + 2
        while (remaining := deadline - time.time()) > 0:
            message = self.pubsub.get_message(timeout=remaining)
            if not message or message["type"] != "message":
                continue  # timed out or got a subscribe confirmation
            if message["channel"] != "agent:registered":
                continue
            data = json.loads(message["data"])
            if required_capability in data.get("capabilities", []):
                details = self.redis.hgetall(f"agent:{data['name']}")
                if details:
                    return details
        return None
```
Why TTL Matters More Than You Think
In our first version, we used explicit deregistration on agent shutdown. It worked fine until a container was killed by Kubernetes HPA without a chance to clean up. The orchestrator continued routing tasks to a dead agent for minutes.
Switching to TTL-based registration with heartbeat refresh solved that. If an agent dies suddenly, its metadata auto-expires within 15 seconds of its last heartbeat. The orchestrator then routes elsewhere.
But there's a subtlety: you need to handle the race between TTL expiry and heartbeat refresh. We set TTL = 3 * heartbeat_interval to give enough slack. In our setup, 15-second TTL with 5-second heartbeat works reliably even under network jitter.
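One way to narrow that race, sketched here under assumptions rather than taken from our production code: write the full agent record on every heartbeat and send the write and the TTL refresh in a single MULTI/EXEC transaction. If the key expired just before the beat, the beat rebuilds the whole entry instead of leaving a partial hash. The function name `heartbeat_once` and its parameters are hypothetical; `client` is assumed to be a redis-py client.

```python
import json
import time


def heartbeat_once(client, agent_name, endpoint, capabilities, load, ttl=15):
    """Rewrite the full agent record and refresh its TTL in one transaction.

    `client` is assumed to be a redis-py client; only pipeline(), and the
    pipeline's hset/expire/zadd/execute, are used.
    """
    key = f"agent:{agent_name}"
    record = {
        "name": agent_name,
        "endpoint": endpoint,
        "capabilities": json.dumps(capabilities),
        "load": load,
        "last_heartbeat": time.time(),
    }
    pipe = client.pipeline(transaction=True)  # commands are queued, sent on execute()
    pipe.hset(key, mapping=record)
    pipe.expire(key, ttl)
    for cap in capabilities:
        # Score mirrors the reported load so the lowest score is the least loaded
        pipe.zadd(f"capability:{cap}", {agent_name: load})
    pipe.execute()
    return record
```

Because every beat is a complete, idempotent write, a late heartbeat behaves like a fresh registration. The cost is a slightly larger payload per beat.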
Capability-Based Routing with Load Balancing
Using Redis sorted sets for capabilities was a game-changer. Each capability key holds agents sorted by their current load score. When the orchestrator picks an agent, it grabs the one with the lowest score (least loaded). This gives us least-loaded routing for free, as long as each heartbeat also updates the agent's score in the sorted set.
We also added an `agent:online` sorted set to track all alive agents for fallback tasks that don't require a specific capability.
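A minimal sketch of how such an `agent:online` set could work, assuming the score is the agent's last heartbeat timestamp (the scoring scheme isn't spelled out above, so treat it as an assumption). The helper names `mark_online` and `alive_agents` are hypothetical; `client` is assumed to be a redis-py client with `decode_responses=True`.

```python
import time

ONLINE_KEY = "agent:online"


def mark_online(client, agent_name, now=None):
    """Record a heartbeat: the score is the time the agent was last seen."""
    now = time.time() if now is None else now
    client.zadd(ONLINE_KEY, {agent_name: now})


def alive_agents(client, ttl=15, now=None):
    """Drop agents not seen within `ttl` seconds, then list the survivors."""
    now = time.time() if now is None else now
    # Sorted-set members never expire on their own, so prune by score first
    client.zremrangebyscore(ONLINE_KEY, "-inf", now - ttl)
    return client.zrange(ONLINE_KEY, 0, -1)  # oldest heartbeat first
```

Each agent would call `mark_online` from its heartbeat loop. The orchestrator calls `alive_agents` when a task has no capability requirement.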
The Results
After deploying this pattern for our client in Ho Chi Minh City:
- Zero configuration changes during scaling events
- Agent replacement time dropped from minutes to < 10 seconds (including health check)
- Missed task rate fell from ~2% to 0.01% (the 0.01% were race conditions we later fixed with a retry)
- Incident response improved because we could kill a misbehaving agent and let auto-registration bring up a fresh one without touching the orchestrator
We did this with a team of 4 developers (3 in Ho Chi Minh City, 1 in the US) over a 3-week sprint. Honestly, the Vietnamese engineers on the team caught most of the edge cases around concurrent registration and Redis transaction safety.
FAQ: Dynamic Agent Registry
Q: Should I use Redis or etcd for the registry?
A: Redis wins for simplicity and Pub/Sub. etcd is better if you already have a Kubernetes-native stack and need strong consistency with watchers. For most teams, Redis gets you 90% of the way there with fewer dependencies.
Q: What happens if Redis goes down?
A: Bad things. We run Redis Sentinel with automatic failover. The orchestrator falls back to a cached agent list (stale, but better than nothing) for 30 seconds. If still down, we fail tasks with a clear "orchestrator unavailable" error instead of silent drops.
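The fallback can be sketched roughly like this. It's an illustration under assumptions, not our exact code: `CachedAgentLookup` and `RegistryUnavailable` are hypothetical names, `fetch` stands in for whatever call reads the agent list from Redis, and staleness is measured from the last successful read. With redis-py you would pass `errors=(redis.exceptions.ConnectionError,)`, since that class does not inherit from Python's built-in `ConnectionError`.

```python
import time


class RegistryUnavailable(Exception):
    """Raised when the registry is down and the cached list is too old."""


class CachedAgentLookup:
    def __init__(self, fetch, max_stale=30.0, clock=time.monotonic,
                 errors=(ConnectionError,)):
        self.fetch = fetch          # callable returning the current agent list
        self.max_stale = max_stale  # seconds a cached list may be served
        self.clock = clock
        self.errors = errors
        self._cached = None
        self._cached_at = 0.0

    def agents(self):
        try:
            result = self.fetch()
        except self.errors as exc:
            age = self.clock() - self._cached_at
            if self._cached is not None and age <= self.max_stale:
                return self._cached  # stale, but better than nothing
            # Fail loudly instead of silently dropping tasks
            raise RegistryUnavailable("orchestrator unavailable") from exc
        self._cached, self._cached_at = result, self.clock()
        return result
```

The clock is injectable so the 30-second window can be tested without sleeping.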
Q: How do you handle agent overload without a central rate limiter?
A: Each agent reports its own load (e.g., number of in-flight tasks). The orchestrator picks the least-loaded agent. For backpressure, agents can reject tasks by returning a 429, and the orchestrator retries on another agent. Works well in practice.
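As a rough sketch of that retry loop (the function name `dispatch`, the `send` callable, and the `max_attempts` cap are assumptions for illustration, not our actual interface): sort candidates by reported load, try them in order, and move on whenever one answers 429.

```python
def dispatch(task, candidates, send, max_attempts=3):
    """Try the least-loaded agents in order, skipping any that return 429.

    candidates: agent dicts with at least a "load" field (as stored in the registry)
    send: callable (agent, task) -> HTTP status code; transport is up to the caller
    Returns (agent, status) for the first agent that accepts, or (None, 429).
    """
    ordered = sorted(candidates, key=lambda agent: float(agent["load"]))
    for agent in ordered[:max_attempts]:
        status = send(agent, task)
        if status == 429:
            continue  # agent is shedding load; try the next one
        return agent, status
    return None, 429
```

Capping attempts keeps one overloaded capability pool from turning a single task into a retry storm.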
Q: Can I use this with gRPC instead of HTTP?
A: Absolutely. Just store the gRPC endpoint in the registry. The discovery pattern is protocol-agnostic. We used HTTP for simplicity, but one of our later iterations switched to gRPC for internal agent-to-agent calls.