
ADR 010: Inter-session agent communication

Date: 2026-05-27 (drafted) · Ratified: 2026-05-30 (by José, after the sC field-input + ASCII-name-style folds)

Status: Accepted. Rollout in progress: active.md schema (Presence column + Session · Name cell), start-session.sh name-pool, session-heartbeat.sh post-commit hook, lefthook.yml registration. The git-log fallback for liveness (per § 4) is always-on regardless of hook installation.

Context

2–3 Claude Code orchestrator sessions (sA/sB/sC) work the same repo concurrently, each in its own git worktree, committing direct to main. They are separate processes, not always running simultaneously (gaps of hours), with no shared memory. They must coordinate: hand off work, claim ownership, ask/answer, and relay human (José) instructions.

Today's mechanisms (git-committed queue/to-<session>.md files, the active.md claim scoreboard, and Telegram, which reaches only one session) work but leak in seven documented ways. The first five: the shared-tree race (closed by Layer 0); the Telegram single-consumer gap (José hand-relayed two tasks to sA on 2026-05-27 because Telegram is single-holder); lossy human relay; ownership ambiguity; and a tracker outage (today: tasks-prod tunnel down ~10:00→10:58Z, ~60 min, backlog.md bridged). sC surfaced two more failure modes from the field on 2026-05-28:

  • (6) Stale-claim / liveness gap: an active.md row can be "true" for hours across breaks, crashes, or sleep with no mechanical liveness signal (today: a session row stood for ~36h across a domestic break); the 6h reap rule is a heuristic, not enforcement.
  • (7) Identity-vs-process drift: sX is a per-process prefix, not a stable identity; a fresh Claude that re-claims sA shares the prefix but not the prior process's memory. Cross-session continuity rides on git history + memory files, not on a real identity.

The findings doc surveys A2A, MCP-as-transport, multi-agent frameworks, RabbitMQ, Postgres, and the git/Riff incumbents against nine requirements.

Decisive constraint: sessions are not always connected (hours-long gaps). This rules out liveness/broker transports (RabbitMQ, A2A daemons) and favours a durable shared store — the blackboard pattern. Git already provides durability, worktree-isolated concurrency (Layer 0), audit (git log), and a human bridge, with no new always-on infra. The only genuine gaps are presence and a lossless ack chain.

Decision

Adopt a git-blackboard spine + Riff task-channel + Telegram human-bridge three-tier protocol. Reject RabbitMQ as the bus (solves live fan-out, not our async-durable need). Defer A2A to scale; defer Postgres session_bus to a felt real-time need.

Identity & addressing (foundation)

A session has two identities, both addressable:

  • Technical id sX (sA/sB/sC, …) — load-bearing: worktree suffix, commit-msg lefthook check, scoreboard column, branch convention, [sX] commit prefix. Ephemeral per-process — a fresh Claude that re-claims sA inherits the ID but not the prior process's memory.
  • Display name (per-project, short, memorable, project-themed) — durable role label for human coordination and Riff assignee. Mitigates failure mode 7 (identity drift): a new process re-claims sA and adopts the name from active.md, so the cross-session role identity is preserved even when in-memory continuity is not. Suggested pools per José's style ratification (2026-05-28): po-platform = Portuguese places (Sintra, Douro, Algarve, Tejo, Madeira, Porto, Lisboa, Coimbra, Cascais, Faro) or seafaring (Caravela, Bussola, Sextante, Quadrante, Cabo); codecomedy-platform = comedy-themed (owner's call); generic fallback = constellations / weather / herbs / board-game pieces (still ASCII-only). Mechanical rules (firm): plain ASCII letters [A-Za-z] only — no accents, diacritics, or special chars (US-keyboard ergonomics: accented characters cost 2 strokes apiece and compound across a session) — ≤12 chars, no spaces, unique within the project's pool. Style within those bounds is the project owner's call.
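The mechanical naming rules above are simple enough to check in code. The following is a minimal sketch, not the start-session.sh implementation: the function name validate_display_name is hypothetical, and treating uniqueness as case-insensitive is an assumption (the rules only say "unique within the project's pool").

```python
import re

# Assumed encoding of the firm rules: plain ASCII letters only, 1 to 12 characters.
NAME_RE = re.compile(r"[A-Za-z]{1,12}")


def validate_display_name(name: str, taken: set[str]) -> list[str]:
    """Return the list of rule violations; an empty list means the name is acceptable."""
    problems = []
    if not NAME_RE.fullmatch(name):
        problems.append("must be 1-12 plain ASCII letters (no accents, digits, or spaces)")
    # Assumption: uniqueness is compared case-insensitively.
    if name.lower() in {t.lower() for t in taken}:
        problems.append("already used in this project's pool")
    return problems
```

For example, "Sintra" passes against a pool that already holds "Douro", while an accented spelling such as "Bússola" is rejected by the ASCII rule.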

Address resolution. A message's to: may use either form, plus a lane alias or broadcast — all resolve via the active.md row:

| Form | Resolves via | Example |
|---|---|---|
| to:sA (technical id) | active.md row matching sX | direct, exact session |
| to:Sintra (display name) | active.md row matching name | same target, human-friendly |
| to:web-presence-owner (lane alias) | active.md row whose capabilities claim the lane | role-based, no need to know who holds it |
| to:all | every active row | broadcast |
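The four address forms can be resolved with one lookup over the parsed active.md rows. This is a sketch under stated assumptions: the Row type and resolve function are hypothetical, rows are assumed to be already parsed, and the mapping from a lane alias such as web-presence-owner to a capability such as web-presence-lane (swapping the "-owner" suffix for "-lane") is a guess at the convention, not something the ADR specifies.

```python
from dataclasses import dataclass, field


@dataclass
class Row:
    sid: str                       # technical id, e.g. "sA"
    name: str                      # display name, e.g. "Sintra"
    capabilities: set[str] = field(default_factory=set)


def resolve(to: str, rows: list[Row]) -> list[Row]:
    """Resolve a to: value to the active rows it addresses (possibly none)."""
    if to == "all":
        return list(rows)          # broadcast: every active row
    exact = [r for r in rows if to in (r.sid, r.name)]
    if exact:
        return exact               # technical id or display name
    if to.endswith("-owner"):
        # Assumed alias convention: "<lane>-owner" matches capability "<lane>-lane".
        lane = to[: -len("-owner")] + "-lane"
        return [r for r in rows if lane in r.capabilities]
    return []
```

Returning a list rather than a single row keeps broadcast and "nobody currently holds this lane" in the same return type, so a sender can tell an unresolvable address from a delivered one.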

start-session.sh is the canonical place to assign/prompt a name from the project's pool and persist it in the worktree's active.md row alongside sX, last_seen, and capabilities.

1. Message schema (Tier 1 — queue/to-<session>.md, and queue/to-all.md for broadcast)

Each message is a markdown block with a structured header:

## <ISO-8601-ts> · from:<sX> · to:<sY|all> · type:<msg|task|handoff|ack|question|answer|relay> · ref:<Riff#|commit|path|—> · status:<sent>
<body — markdown>
→ status:seen <ts> · acked <ts> · resolved <ts>     (owner advances in place)
  • from/to — session IDs; to:all is broadcast; a lane alias (to:web-presence-owner) resolves to the current owner via active.md.
  • type: the relay value marks a human→agent message injected via Telegram (keeps the bridge lossless + audited).
  • ref — the Riff #, commit SHA, or file path the message is about.
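Because the header is a single delimited line, a reader can parse it without a markdown library. The sketch below assumes fields are separated by " · " exactly as in the template and that no field value contains that separator; parse_header is a hypothetical helper, and the sample header in the usage note is illustrative, not taken from a real queue file.

```python
# Message types listed in the header template.
TYPES = {"msg", "task", "handoff", "ack", "question", "answer", "relay"}
REQUIRED = {"from", "to", "type", "ref", "status"}


def parse_header(line: str) -> dict[str, str]:
    """Parse one '## <ts> · key:value · ...' header line into a dict."""
    if not line.startswith("## "):
        raise ValueError("not a message header")
    # The timestamp comes first and contains colons, so it is split off before key:value parsing.
    ts, *pairs = [part.strip() for part in line[3:].split(" · ")]
    fields = {"ts": ts}
    for pair in pairs:
        key, _, value = pair.partition(":")   # split on the first colon only
        fields[key] = value
    missing = REQUIRED - set(fields)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if fields["type"] not in TYPES:
        raise ValueError(f"unknown type: {fields['type']}")
    return fields
```

Parsing "## 2026-05-28T10:15:00Z · from:sA · to:sB · type:handoff · ref:#221 · status:sent" yields a dict whose "from" is "sA" and whose "type" is "handoff".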

2. State machine (the ack chain)

sent ──► seen ──► acked ──► resolved
                   └──► superseded

The recipient/owner advances the status (edits the → status: line in their own commit). sent→seen proves the message was caught; acked = will act; resolved = done; superseded = obsoleted by a later message. A relay is lossless because the sender can later read the advanced status, not just assume delivery.
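The ack chain can be written down as a transition table, which also makes illegal jumps (for instance sent straight to resolved) detectable by a hook or linter. This is an illustrative sketch: reading the diagram as branching to superseded from acked only is an assumption, and advance is a hypothetical helper rather than part of any existing tooling.

```python
# Allowed status transitions, read off the diagram.
# Assumption: "superseded" is reachable from "acked" only.
TRANSITIONS = {
    "sent": {"seen"},
    "seen": {"acked"},
    "acked": {"resolved", "superseded"},
    "resolved": set(),      # terminal
    "superseded": set(),    # terminal
}


def advance(current: str, new: str) -> str:
    """Return the new status if the transition is allowed, else raise ValueError."""
    if current not in TRANSITIONS:
        raise ValueError(f"unknown status: {current}")
    if new not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current} -> {new}")
    return new
```

If superseded should be reachable from earlier states as well, only the table changes; the check itself stays the same.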

3. Ownership (single-owner invariant)

  • A unit of work (Riff or lane) has at most one owner at a time. The owner is recorded in both the active.md claim row and the Riff assignee (durable record-of-truth for task-scoped work).
  • Verbs: claim (write the row / set assignee), release (clear it), handoff(to:sX) (a type:handoff message + reassign). No distributed lock — cooperative; the visible claim is the lock.
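The three verbs and the single-owner invariant can be modelled in a few lines. The sketch below is an in-memory illustration only: in the protocol itself the claim lives in the active.md row and the Riff assignee, and enforcement is cooperative. The Claims class and OwnershipError are hypothetical names.

```python
from typing import Optional


class OwnershipError(Exception):
    """Raised when a verb would violate the single-owner invariant."""


class Claims:
    def __init__(self) -> None:
        self._owner: dict[str, str] = {}   # unit of work -> owning session id

    def owner(self, unit: str) -> Optional[str]:
        return self._owner.get(unit)

    def claim(self, unit: str, sid: str) -> None:
        current = self._owner.get(unit)
        if current not in (None, sid):
            raise OwnershipError(f"{unit} is already owned by {current}")
        self._owner[unit] = sid

    def release(self, unit: str, sid: str) -> None:
        if self._owner.get(unit) != sid:
            raise OwnershipError(f"{sid} does not own {unit}")
        del self._owner[unit]

    def handoff(self, unit: str, from_sid: str, to_sid: str) -> None:
        # A handoff is a release by the current owner followed by a claim by the new one.
        self.release(unit, from_sid)
        self.claim(unit, to_sid)
```

A second session calling claim on an owned unit gets an error instead of silently overwriting the row, which is the behaviour the visible-claim convention relies on.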

4. Presence beacon (mechanical, not hand-maintained)

active.md rows carry last_seen (UTC) + capabilities (e.g. has-telegram, web-presence-lane). A lefthook post-commit hook bumps last_seen for the worktree's sX row on each commit → presence ≈ recent commit activity. Fallback when hooks are absent: derive liveness from the latest [sX] commit in git log (the existing 6h reap rule is the coarse form). Capabilities let a sender route by ability ("who has Telegram?") without asking.
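The presence check reduces to comparing a timestamp against the reap window. The sketch below assumes timezone-aware UTC datetimes and uses the 6h reap rule as the threshold; is_live and effective_last_seen are hypothetical helper names, and taking the later of the row value and the latest [sX] commit time is one possible way to combine the beacon with the git-log fallback.

```python
from datetime import datetime, timedelta
from typing import Optional

REAP_AFTER = timedelta(hours=6)   # the existing 6h reap rule


def effective_last_seen(
    row_last_seen: Optional[datetime],
    latest_commit: Optional[datetime],
) -> Optional[datetime]:
    """Combine the active.md beacon with the git-log fallback; None if neither exists."""
    candidates = [t for t in (row_last_seen, latest_commit) if t is not None]
    return max(candidates) if candidates else None


def is_live(last_seen: Optional[datetime], now: datetime,
            reap_after: timedelta = REAP_AFTER) -> bool:
    """A session counts as live if it was seen within the reap window."""
    return last_seen is not None and now - last_seen <= reap_after
```

With hooks absent, row_last_seen is simply None and the commit timestamp alone decides liveness.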

5. Riff as the active channel (task-scoped)

Task-coupled coordination lives in tasks-prod (DB-backed, multi-client, no single-holder, human web UI): assignee = owner, labels = lane/role, comments = ack/decision log, dependencies = handoff ordering. MVP needs no schema change. Atomic handoff = assignee + status swap in a single update_task call (per sC's field practice — the doctrine "comment is human narrative; status is the protocol" applies). The active-channel feature upgrade (a to_session field, an "addressed to me, unread" inbox query backed by a per-(task, user) last_seen_at so unread = updated_at > last_seen_at, and handoff/ack comment kinds) is specified in the findings doc §4 and filed as Riff #221.
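The "addressed to me, unread" rule (unread = updated_at > last_seen_at, tracked per (task, user)) can be illustrated outside the database. This is a sketch, not the tasks-prod query: the dict shapes and the unread_for name are hypothetical, and treating a task with no last_seen_at entry as unread is an assumption.

```python
from datetime import datetime


def unread_for(
    user: str,
    tasks: list[dict],
    last_seen_at: dict[tuple[int, str], datetime],
) -> list[int]:
    """Return ids of tasks assigned to `user` that changed since the user last looked."""
    unread = []
    for task in tasks:
        if task["assignee"] != user:
            continue
        seen = last_seen_at.get((task["id"], user))
        # Assumption: a task the user has never opened counts as unread.
        if seen is None or task["updated_at"] > seen:
            unread.append(task["id"])
    return unread
```

Keying last_seen_at on the (task, user) pair is what lets two sessions have different unread sets for the same task.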

Implementation note (2026-05-31, sA review of codecomedy-platform PRs #233–#236, draft). The feature was built as five slices, and the code was reviewed against this ADR. Two "owner's-call" items were resolved by the implementer and accepted on review, plus one schema note:

  • P2 addressing = approach B (name-shape inference; a single assignee TEXT field where a registered session's id sA and display name Sintra are interchangeable, resolved via a session_presence registry). The illustrative "to_session field" wording above is superseded: there is no new column, and the requirement (both forms resolve to the same target) is met by resolveAssigneeForms expanding the filter to both forms, falling back to a literal match for humans / unregistered strings.
  • P5 presence = explicit-only (sessions call upsert_session_presence; no server-side auto-bump on other MCP calls). ⚠️ This partially deviates from §4's "mechanical, not hand-maintained" intent on the Riff tier: an active session that never self-registers is invisible to resolve_session, the P2 assignee expansion, and list_active_sessions. Accepted for this round (registration must be explicit anyway, since it carries display_name/lane/capabilities), with a recommended fast-follow: auto-bump last_seen_at on cheap authenticated calls (list_tasks/get_task/update_task/add_comment) for already-registered sessions, so working sessions don't silently fall off presence. That would make the Riff beacon as mechanical as the git-side post-commit hook.
  • Schema keys on username TEXT (Authentik subject ids), not uuid. This is fine for po-platform, which addresses sessions by setting TASKS_USER=<session-id> per MCP.

Integration prerequisites (po-platform side; gated on cc-platform merge + prod-deploy + MCP restart):

1. Each session's MCP must run with a distinct TASKS_USER (sA/sB/sC), or all sessions share one inbox/presence row and the per-session semantics collapse.
2. Sessions must self-register via upsert_session_presence at start and heartbeat it (the in-Riff mirror of the active.md claim + post-commit beacon).

Until the PRs deploy, the git blackboard (Tier 1) remains the sole live channel. The migrations land additively on the shared cc_prod instance (po-platform's Riff project lives there too); merge + deploy timing is the human's go.

6. Human bridge

Telegram remains the single-holder human↔agent bridge — appropriate for a bridge, unfit as the bus. The holding session relays José's instructions into Tier 1 as type:relay blocks. José can also edit any queue file / Riff directly to inject or arbitrate.

Consequences

Positive: no new always-on infrastructure; durable + concurrent + auditable by construction; the lossy relay becomes tracked (relay + status); presence stops being guesswork; ownership has explicit verbs + a single-owner invariant; everything degrades gracefully (a session with no MCP/Telegram can still read/write the bus on git pull).

Negative / costs: still poll-on-pull — no push; near-real-time handoffs wait for the next git pull. Status advancement is a discipline backed only by a hook, not hard-enforced. The Riff channel depends on tunnel uptime (down ~2h on 2026-05-27) → it complements, never replaces, the git spine.

Upgrade triggers (revisit this ADR), in increasing weight:

  • (a) Lightweight push: file-watcher notify (sC's primary ask, 2026-05-28). A tiny per-session loop watching git log origin/main..main -- docs/ai/sessions/queue/to-<me>.md surfaces "N new messages" on the next interactive turn. Zero new infra, zero schema change; closes the wake-up gap that made the Telegram→sA relay doubly painful today. Recommended first upgrade if poll-on-pull latency hurts.
  • (b) Heavier real-time: Postgres session_bus + LISTEN/NOTIFY for true push + row-locks; costs a daemon + "only up with the dev stack."
  • (c) At scale: A2A (Agent Cards + Tasks) for cross-machine / many-agent / untrusted peers.
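Option (a) could be as small as the two functions below. This is a sketch under assumptions: fetch_log and notice are hypothetical names, the revision range is the one quoted above, and the count is of commits touching the queue file, which is only a proxy for the number of messages (one commit may add several). A per-session loop would call fetch_log on an interval and surface whatever notice returns.

```python
import subprocess
from typing import Optional


def fetch_log(queue_path: str) -> str:
    """Run git log over the quoted revision range, restricted to one queue file."""
    result = subprocess.run(
        ["git", "log", "--oneline", "origin/main..main", "--", queue_path],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def notice(log_output: str) -> Optional[str]:
    """Turn one-line-per-commit log output into a short notice, or None if nothing is new."""
    commits = [line for line in log_output.splitlines() if line.strip()]
    if not commits:
        return None
    return f"{len(commits)} new queue commit(s)"
```

Keeping the git call and the formatting separate means the notice logic can be exercised without a repository.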

RabbitMQ stays rejected: sC's field check confirmed a session shell can reach it (docker exec po-rabbitmq …), but (i) mixing with the business messaging vhost is a category error; (ii) the persistent-consumer requirement clashes with async/idle sessions that spawn, work in bursts, and respawn with zero in-memory state; (iii) auditability beats real-time for our workflow — git log -- queue/to-sX.md remains grep-able months later, broker messages are ephemeral once consumed.

Rollout: operationalized as Layer 6 of parallel-sessions.md v4 (message schema, status lifecycle, presence cells, ownership verbs). The post-commit heartbeat hook and the active.md last_seen/capabilities columns are the only mechanical additions; the message-header + status convention is documentation + discipline, adoptable immediately.

Cross-project extension (Layer 7, 2026-05-31)

Layers 0–6 cover sessions within one repo (sA/sB/sC on po-platform). A second axis surfaced once the Riff product itself was built by an agent in a different project (codecomedy-platform): how do sessions in different projects coordinate?

The constraint that picks the channel is the same one as §Decision, one level up: Tier 1 (the git blackboard) is per-repo — a session in repo A cannot git pull repo B's queue/; Tier 3 (Telegram) is a single-holder human bridge. Only the shared tasks-prod instance (Tier 2) is reachable from sessions in different projects. So cross-project coordination must ride Tier 2 — it is the only cross-project bus we have.

Identity must be project-qualified. sX and display names are unique only within a project's pool (po:sA ≠ fit:sA). Cross-project addressing therefore uses a project-qualified handle: <project-tag>:<sX> or <project-tag>:<DisplayName> (po:sA, cc:Bicho), with <tag>:* to broadcast to a project's sessions.
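A project-qualified handle splits cleanly on its first colon. The following is a minimal sketch; Handle and parse_handle are hypothetical names, and rejecting an unqualified handle such as a bare sA is an assumption about how strict cross-project addressing should be.

```python
from typing import NamedTuple


class Handle(NamedTuple):
    project: str      # project tag, e.g. "po"
    target: str       # sX, display name, or "*"
    broadcast: bool   # True when the target is "*"


def parse_handle(handle: str) -> Handle:
    """Parse '<project-tag>:<target>' into its parts."""
    project, sep, target = handle.partition(":")
    if not sep or not project or not target:
        raise ValueError(f"not a project-qualified handle: {handle!r}")
    return Handle(project, target, target == "*")
```

Under this sketch po:sA and cc:Bicho parse to distinct (project, target) pairs, and cc:* is flagged as a broadcast to that project's sessions.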

Decision (José, 2026-05-31): a convention-only channel now; the mechanical backing later.

  • Now (convention): a dedicated tasks-prod project, "Agent Comms (cross-project)" (9cf68a60-fa35-4707-9371-c775f2542bf5), is the shared board. One task per thread; title <from> → <to>: <subject>; labels from:<tag>/to:<tag>; body carries the ADR §1 header at cross-project scope; comments carry replies + ack. Zero new code; reuses tasks/comments/labels. Its pinned CONVENTION task is the normative usage doc.
  • Later (mechanical): project-qualified session_presence (PK (project_id, session_id)) + cross-project resolve_session + a typed cross-project message primitive. Filed as RIFF-004 in the "Riff - Agents Feedback" product backlog. This is also why migration 059's missing project_id is a pre-merge blocker (flagged on PR #236 / Riff #221): the same shared-instance collision, one layer down.
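The thread convention is mechanical enough to generate. The sketch below builds the title and labels for one thread; thread_fields is a hypothetical helper, and using the full project-qualified handles in the title while deriving the label tags from the part before the colon is an assumption about the convention (the pinned CONVENTION task is the normative source).

```python
def thread_fields(sender: str, recipient: str, subject: str) -> dict:
    """Build the title and labels for one cross-project thread task."""
    # Assumption: handles are project-qualified ("po:sA"); the tag is the part before ":".
    from_tag = sender.partition(":")[0]
    to_tag = recipient.partition(":")[0]
    return {
        "title": f"{sender} → {recipient}: {subject}",
        "labels": [f"from:{from_tag}", f"to:{to_tag}"],
    }
```

For a message from po:sA to cc:Bicho this produces the labels from:po and to:cc, which is what lets either project filter the board for its own traffic.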

This extension is additive to Layers 0–6: intra-project sessions keep using the per-repo git blackboard as their spine; only genuinely cross-project traffic goes to the Agent Comms board.