Stop Hardcoding Agent Names: How a Dynamic Agent Registry Saved Our Multi-Agent System from Rot

I’ve seen it happen more times than I’d like to admit. A team builds a multi-agent system with three agents. They hardcode the agent names, URLs, and capabilities directly into the orchestrator’s config file. It works beautifully for the demo.

Then they add a fourth agent. Then a fifth. Six months later, someone fat-fingers a config, an agent goes offline for maintenance, and the whole pipeline silently drops tasks.

Hardcoding agent metadata is tech debt with a smile.

Here’s the problem: static agent definitions don’t scale. They’re brittle, require manual updates, and turn your orchestrator into a fragile monolith that must know everything about every agent at startup. In production, agents come and go – during scaling events, rolling deployments, or partial outages. Your orchestrator needs to handle that without a config push.

Let me show you what we built for a logistics client in Ho Chi Minh City. They had 27 specialized agents processing shipments, fraud checks, route optimization, etc. Hardcoding wasn’t just painful—it was dangerous.

The Anatomy of a Dynamic Agent Registry

The core idea is simple: agents register themselves with a central store (Redis) at startup, send periodic heartbeats, and the orchestrator discovers agents via the registry instead of config files.

We chose Redis for three reasons:

Built-in TTL – automatic agent expiration if heartbeat misses
Pub/Sub – real-time notifications of agent join/leave
Sorted sets – for capability-based routing with scores

Here’s the registration flow:

Agent starts → connects to Redis → sets a hash with its metadata (name, endpoint, capabilities) + a TTL of 15 seconds
Agent spawns a background thread that refreshes the TTL every 5 seconds
Orchestrator queries Redis for agents matching required capabilities, sorted by load
If orchestrator can’t find a capable agent, it either waits for a Pub/Sub “agent joined” notification or fails fast

Let’s look at the code.

Agent Registration (Python)

python
import redis
import json
import time
import threading

class AgentRegistry:
    def __init__(self, agent_name: str, endpoint: str, capabilities: list[str],
                 redis_client: redis.Redis, ttl: int = 15, heartbeat_interval: int = 5):
        self.agent_name = agent_name
        self.endpoint = endpoint
        self.capabilities = capabilities
        self.redis = redis_client
        self.ttl = ttl
        self.heartbeat_interval = heartbeat_interval
        self.running = False

    def register(self):
        key = f"agent:{self.agent_name}"
        data = {
            "name": self.agent_name,
            "endpoint": self.endpoint,
            "capabilities": json.dumps(self.capabilities),
            "load": 0,
            "last_heartbeat": time.time()
        }
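        # Write the hash, its TTL, and the routing entries in one MULTI/EXEC transaction,
        # so a crash mid-registration cannot leave a hash without an expiry
        pipe = self.redis.pipeline()
        pipe.hset(key, mapping=data)
        pipe.expire(key, self.ttl)
        # Add to capability-based sets for routing
        for cap in self.capabilities:
            pipe.zadd(f"capability:{cap}", {self.agent_name: 0})
        pipe.execute()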
        # Notify orchestrators
        self.redis.publish("agent:registered", json.dumps({"name": self.agent_name, "capabilities": self.capabilities}))

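    def start_heartbeat(self):
        self.running = True
        def beat():
            key = f"agent:{self.agent_name}"
            while self.running:
                try:
                    if not self.redis.exists(key):
                        # Hash expired (e.g. after a long network stall): write it again in full
                        self.register()
                    load = self.current_load()  # implement load logic
                    pipe = self.redis.pipeline()  # MULTI/EXEC: apply the refresh atomically
                    pipe.hset(key, mapping={"last_heartbeat": time.time(), "load": load})
                    pipe.expire(key, self.ttl)
                    # Keep the routing score in sync with the reported load
                    for cap in self.capabilities:
                        pipe.zadd(f"capability:{cap}", {self.agent_name: load})
                    pipe.execute()
                except redis.RedisError:
                    pass  # transient Redis failure: retry on the next beat
                time.sleep(self.heartbeat_interval)
        threading.Thread(target=beat, daemon=True).start()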

    def current_load(self):
        # Return number of in-flight tasks or CPU usage – simplified here
        return 0

    def deregister(self):
        self.running = False
        key = f"agent:{self.agent_name}"
        self.redis.delete(key)
        for cap in self.capabilities:
            self.redis.zrem(f"capability:{cap}", self.agent_name)
        self.redis.publish("agent:deregistered", json.dumps({"name": self.agent_name}))
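
A minimal sketch of how an agent process might drive this class through its lifecycle. The names `run_agent` and `install_stop_handlers` are hypothetical, not part of the code above; the sketch only assumes an object exposing `register()`, `start_heartbeat()`, and `deregister()`. A graceful stop (SIGTERM or SIGINT) deregisters explicitly, while a hard kill is left to the TTL.

```python
import signal
import threading


def run_agent(registry, stop: threading.Event) -> None:
    """Register, heartbeat until `stop` is set, then deregister."""
    registry.register()
    registry.start_heartbeat()
    try:
        # Wake up once a second so a pending signal handler gets a chance to run
        while not stop.wait(timeout=1.0):
            pass
    finally:
        # Graceful path only; after SIGKILL the TTL removes the agent instead
        registry.deregister()


def install_stop_handlers(stop: threading.Event) -> None:
    """Set `stop` on SIGTERM/SIGINT. Must be called from the main thread."""
    for sig in (signal.SIGTERM, signal.SIGINT):
        signal.signal(sig, lambda signum, frame: stop.set())
```

In a real agent, the main function would create the event, call `install_stop_handlers(stop)`, and then `run_agent(AgentRegistry(...), stop)`.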

Orchestrator Discovery (Python)

python
import json
import time

import redis

class Orchestrator:
    def __init__(self, redis_client: redis.Redis):
        # Assumes the client was created with decode_responses=True, so names and channels are str
        self.redis = redis_client
        # Skip the subscribe confirmations, which would otherwise be read as the first messages
        self.pubsub = redis_client.pubsub(ignore_subscribe_messages=True)
        self.pubsub.subscribe("agent:registered", "agent:deregistered")
        self.agent_cache = {}

    def find_agent(self, required_capability: str, wait: float = 2.0):
        # Try cache first, 50ms TTL
        cached = self.agent_cache.get(required_capability)
        if cached and time.time() - cached[0] < 0.05:
            return cached[1]
        cap_key = f"capability:{required_capability}"
        # Walk the sorted set from the lowest load score upward
        for name in self.redis.zrange(cap_key, 0, -1):
            # Get agent details from hash
            details = self.redis.hgetall(f"agent:{name}")
            if details:
                self.agent_cache[required_capability] = (time.time(), details)
                return details
            # Hash expired via TTL but the set entry lingers: prune it and try the next agent
            self.redis.zrem(cap_key, name)
        # No live agent: wait up to `wait` seconds for a matching "agent:registered" event
        deadline = time.time() + wait
        while (remaining := deadline - time.time()) > 0:
            message = self.pubsub.get_message(timeout=remaining)
            if not message or message["type"] != "message" or message["channel"] != "agent:registered":
                continue
            data = json.loads(message["data"])
            if required_capability in data.get("capabilities", []):
                details = self.redis.hgetall(f"agent:{data['name']}")
                if details:
                    return details
        return None

Why TTL Matters More Than You Think

In our first version, we used explicit deregistration on agent shutdown. It worked fine until a container was killed by Kubernetes HPA without a chance to clean up. The orchestrator continued routing tasks to a dead agent for minutes.

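Switching to TTL-based registration with heartbeat refresh solved that. If an agent dies suddenly, its metadata hash expires on its own within 15 seconds of the last heartbeat, and the orchestrator routes elsewhere. One caveat: the TTL covers only the `agent:<name>` hash. Members of the `capability:*` sorted sets have no per-member expiry, so the orchestrator has to skip (and ideally remove) any name whose hash is gone.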

But there's a subtlety: you need to handle the race between TTL expiry and heartbeat refresh. We set TTL = 3 * heartbeat_interval to give enough slack. In our setup, 15-second TTL with 5-second heartbeat works reliably even under network jitter.
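
To make the slack concrete, here is a small back-of-the-envelope helper. It is a sketch under an idealized model (last successful refresh at t=0, later beats at exact multiples of the interval, no jitter, no time spent inside the loop), and `safe_missed_heartbeats` is a hypothetical name, not something from our codebase.

```python
import math


def safe_missed_heartbeats(ttl: float, interval: float) -> int:
    """Consecutive heartbeats that can be lost while the next one still lands before expiry.

    Idealized model: the last successful refresh is at t=0, later beats fire at
    interval, 2*interval, ... and the key expires at t=ttl.
    """
    if interval <= 0 or ttl <= interval:
        raise ValueError("ttl must be greater than a positive heartbeat interval")
    # After m lost beats, beat m+1 must land strictly before ttl: (m + 1) * interval < ttl
    return math.ceil(ttl / interval) - 2


print(safe_missed_heartbeats(15, 5))  # 1
print(safe_missed_heartbeats(16, 5))  # 2
```

Under this model, 15 s / 5 s absorbs one lost heartbeat outright, and after a second loss the third beat races the expiry. If you need to survive two full misses, a TTL slightly above 3x the interval may be the safer choice.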

Capability-Based Routing with Load Balancing

Using Redis sorted sets for capabilities was a game-changer. Each capability key holds agents sorted by their current load score, which means each heartbeat has to write the agent's load back as its score. When the orchestrator picks an agent, it grabs the one with the lowest score (least loaded). This gives us least-loaded routing for free.
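
One detail worth checking if you always take the first member: Redis orders members with equal scores lexicographically, so when several agents report the same load (for example, all at 0 right after a deploy), the same name wins every time until the next heartbeat. The sketch below picks at random among the agents tied for the lowest score. `choose_least_loaded` is a hypothetical helper; it takes the `(name, score)` pairs that redis-py's `zrange(key, 0, -1, withscores=True)` returns, so it runs without a Redis server.

```python
import random


def choose_least_loaded(scored, rng=random):
    """Pick one agent name among those tied for the lowest load.

    `scored` is a list of (name, load) pairs, e.g. the result of
    client.zrange("capability:fraud", 0, -1, withscores=True).
    Returns None when no agent advertises the capability.
    """
    if not scored:
        return None
    lowest = min(load for _, load in scored)
    tied = [name for name, load in scored if load == lowest]
    # Spread requests across equally loaded agents instead of always taking the first
    return rng.choice(tied)


print(choose_least_loaded([("fraud-a", 0.0), ("fraud-b", 0.0), ("fraud-c", 3.0)]))
```

The trade-off is reading the whole set instead of one member, which is cheap at a few dozen agents per capability.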

We also added an `agent:online` sorted set to track all alive agents for fallback tasks that don't require a specific capability.
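
A sketch of one way such a set could be maintained. The scoring scheme is an assumption, not something described above: each member's score is the time of its last heartbeat, and readers trim entries older than the TTL before listing. `mark_online` and `live_agents` are hypothetical helpers; `client` is expected to behave like a redis-py client (`zadd`, `zremrangebyscore`, `zrange`).

```python
import time
from typing import Optional

ONLINE_KEY = "agent:online"


def mark_online(client, agent_name: str, now: Optional[float] = None) -> None:
    """Record a heartbeat: the score is the time the agent was last seen."""
    client.zadd(ONLINE_KEY, {agent_name: time.time() if now is None else now})


def live_agents(client, max_age: float = 15.0, now: Optional[float] = None) -> list:
    """Drop entries not seen for max_age seconds, then return the remaining names."""
    now = time.time() if now is None else now
    # Sorted-set members have no TTL of their own, so stale ones are trimmed by score
    client.zremrangebyscore(ONLINE_KEY, "-inf", now - max_age)
    return client.zrange(ONLINE_KEY, 0, -1)
```

An agent would call `mark_online` from its heartbeat loop, and the orchestrator would call `live_agents` when a task has no capability requirement.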

The Results

After deploying this pattern for our client in Ho Chi Minh City:

Zero configuration changes during scaling events
Agent replacement time dropped from minutes to < 10 seconds (including health check)
Missed task rate fell from ~2% to 0.01% (the 0.01% were race conditions we later fixed with a retry)
Incident response improved because we could kill a misbehaving agent and let auto-registration bring up a fresh one without touching the orchestrator

We did this with a team of 4 developers (3 in Ho Chi Minh City, 1 in the US) over a 3-week sprint. Honestly, the Vietnamese engineers on the team caught most of the edge cases around concurrent registration and Redis transaction safety.

FAQ: Dynamic Agent Registry

Q: Should I use Redis or etcd for the registry?

A: Redis wins for simplicity and Pub/Sub. etcd is better if you already have a Kubernetes-native stack and need strong consistency with watchers. For most teams, Redis gets you 90% of the way there with fewer dependencies.

Q: What happens if Redis goes down?

A: Bad things. We run Redis Sentinel with automatic failover. The orchestrator falls back to a cached agent list (stale, but better than nothing) for 30 seconds. If still down, we fail tasks with a clear "orchestrator unavailable" error instead of silent drops.
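
A rough sketch of that fallback, assuming a lookup callable such as `Orchestrator.find_agent`. The class name `StaleOnErrorLookup`, the exception `OrchestratorUnavailable`, and the injectable clock are illustrative, not our production code. By default it catches the built-in `ConnectionError`; with redis-py you would pass `errors=(redis.exceptions.ConnectionError,)`, since that class does not inherit from the built-in one.

```python
import time
from typing import Callable, Optional, Tuple, Type


class OrchestratorUnavailable(RuntimeError):
    """Raised once the registry has been unreachable longer than the grace period."""


class StaleOnErrorLookup:
    def __init__(self, lookup: Callable[[str], Optional[dict]],
                 grace_seconds: float = 30.0,
                 errors: Tuple[Type[Exception], ...] = (ConnectionError,),
                 clock: Callable[[], float] = time.monotonic):
        self._lookup = lookup
        self._grace = grace_seconds
        self._errors = errors
        self._clock = clock
        self._last_good = {}      # capability -> last agent details seen
        self._down_since = None   # time of the first failure in the current outage

    def find(self, capability: str) -> Optional[dict]:
        try:
            agent = self._lookup(capability)
        except self._errors as exc:
            now = self._clock()
            if self._down_since is None:
                self._down_since = now
            cached = self._last_good.get(capability)
            # Serve the stale answer only while the outage is younger than the grace period
            if cached is not None and now - self._down_since <= self._grace:
                return cached
            raise OrchestratorUnavailable("orchestrator unavailable: registry unreachable") from exc
        self._down_since = None  # registry answered, so the outage (if any) is over
        if agent is not None:
            self._last_good[capability] = agent
        return agent
```

Callers treat `OrchestratorUnavailable` as a hard failure for the task, which keeps the error visible instead of dropping work silently.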

Q: How do you handle agent overload without a central rate limiter?

A: Each agent reports its own load (e.g., number of in-flight tasks). The orchestrator picks the least-loaded agent. For backpressure, agents can reject tasks by returning a 429, and the orchestrator retries on another agent. Works well in practice.
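
The retry loop can be sketched roughly like this. `dispatch_with_backpressure` and the `send` callable are hypothetical: `send(agent, task)` stands in for whatever HTTP call your orchestrator makes and is assumed to return the response status code.

```python
from typing import Callable, Iterable, Optional


def dispatch_with_backpressure(
    task: dict,
    candidates: Iterable[dict],
    send: Callable[[dict, dict], int],
    max_attempts: int = 3,
) -> Optional[dict]:
    """Try candidates in order (least loaded first); skip any that answer 429.

    Returns the agent that accepted the task, or None if every attempt was rejected.
    """
    for attempt, agent in enumerate(candidates):
        if attempt >= max_attempts:
            break
        status = send(agent, task)
        if status == 429:
            continue  # agent is shedding load: move on to the next candidate
        if 200 <= status < 300:
            return agent
        # Anything else is a real failure, not backpressure
        raise RuntimeError(f"agent {agent.get('name')} failed with HTTP {status}")
    return None
```

A `None` result means every candidate pushed back, which is a good point to queue the task or surface the overload to the caller.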

Q: Can I use this with gRPC instead of HTTP?

A: Absolutely. Just store the gRPC endpoint in the registry. The discovery pattern is protocol-agnostic. We used HTTP for simplicity, but one of our later iterations switched to gRPC for internal agent-to-agent calls.