Why inter-session communication exists (and how it works)
This is the understanding document. It explains the problem, the design choices,
and the model — but contains no commands. For the install procedure see
how-to-onboard.md; for a hands-on walkthrough see
tutorial.md.
The situation: more than one agent, same repo
When you run a non-trivial project with Claude Code, you quickly discover that one session is often not enough. You want to ship a frontend slice while a backend refactor proceeds, or have one session chase a UAT loop while another drains the backlog. So you open a second terminal, and a third, each running its own Claude Code orchestrator against the same git repository.
These sessions have three properties that make them surprisingly hard to coordinate:
- They are separate processes with no shared memory. Session A cannot "see" what session B is thinking. The only thing they share is the filesystem and the git history.
- They are not always running at the same time. A session sleeps when you close the laptop, crashes, or simply finishes for the day. Gaps of hours are normal. Any coordination mechanism that assumes "both endpoints are online right now" will fail.
- A "session" is not a stable identity. Tomorrow's fresh Claude that picks up where session A left off is a brand-new process — it inherits the name but none of the in-memory context.
Naïvely sharing a repo between two live agents goes wrong fast. The protocol exists because we watched it go wrong, repeatedly, and catalogued exactly how.
The seven failure modes
Every rule in the protocol traces back to a concrete failure we hit on po-platform. Naming them is the fastest way to understand the design.
1. Shared-tree race. Two sessions sharing one working tree also share one git index (.git/index). When session A runs git add and session B runs git commit a few seconds later, B's commit silently captures A's staged files. This happened five times (May 17–22, 2026); in one case, 31 of one session's files landed inside another session's "wordmark" commit. Discipline ("always git add <paths>, never -A") helped but never fully stopped it.
2. Telegram single-consumer gap. Telegram is a single-holder connection: only one session can hold it, and connecting a second drops the first. When the human sends an instruction over Telegram, only one session hears it, and that session must hand-relay it to the others. On 2026-05-27 the human had to relay the same two tasks by hand because the addressed session wasn't the one holding the bridge.
3. Lossy human relay. Following from #2: relayed instructions get summarised, dropped, or misattributed. There was no record that a relay even happened.
4. Ownership ambiguity. Two sessions both "fix" a red build, producing the "who actually fixed it?" confusion that destroys trust in the whole setup.
5. Tracker outage. The task tracker (Riff / tasks-prod) runs over a tunnel that went down for ~60 minutes on 2026-05-27. Any protocol that depends on the tracker being up is fragile.
6. Stale-claim / liveness gap. A scoreboard row can say "session A is working on X" and be technically true for 36 hours across a domestic break, with no mechanical signal that A is actually alive. A time-based "reap after 6h" rule is a heuristic, not a fact.
7. Identity-vs-process drift. The sA/sB/sC prefix is per-process. A fresh Claude that re-claims sA shares the label but not the memory. Without a durable role identity, cross-session continuity is fragile.
A good mental test for any proposed mechanism: does it survive all seven? The protocol does.
The decisive constraint, and what it rules out
The single most important fact combines failure mode #2 with the hours-long gaps: sessions are not always connected.
That one constraint eliminates a whole class of otherwise-attractive solutions:
- Message brokers (RabbitMQ). A broker is built for liveness — fan-out to connected consumers. A session that sleeps for six hours and respawns with zero in-memory state is the opposite of a persistent consumer. (We confirmed a session's shell can reach the in-stack RabbitMQ, then rejected it anyway: wrong fit, plus mixing with the business-messaging vhost is a category error.)
- Agent-to-agent live protocols (A2A daemons). Same problem — they assume peers are online and discoverable now.
- A real-time database bus (Postgres
LISTEN/NOTIFY). Needs a daemon that's "only up with the dev stack."
What does survive hours-long gaps is a durable shared store — the classic blackboard pattern. One agent writes a note; it stays written; another agent reads it whenever it next wakes up. And we already have a perfect durable shared store sitting in the repo: git.
Git gives us, for free:
- Durability: a committed message is there until someone removes it.
- Concurrency isolation: git worktrees give each session its own checkout and its own index (this is what kills failure mode #1).
- Audit: git log is a permanent, grep-able record of every message and every claim, months later.
- A human bridge: the human can edit any queue file or claim directly.
- No new always-on infrastructure: nothing to deploy, monitor, or keep alive.
The only things git doesn't give us out of the box are presence ("is A actually alive?") and a lossless ack chain ("did B actually catch my message?"). The protocol adds exactly those two, mechanically, and nothing more.
The three tiers
The protocol is a git-blackboard spine + a task channel + a human bridge:
Tier 1: the git blackboard (the spine)
Two kinds of plain-markdown files, committed to the repo:
- The claim scoreboard (docs/ai/sessions/active.md): a single table of who is working on what, right now. One row per live session, carrying its identity, what it's doing, when it was last seen, and its capabilities.
- Addressed message queues (docs/ai/sessions/queue/to-<sX>.md): one file per recipient. To send session C a message, you append a block to to-sC.md, then commit and push it. C reads it on its next git pull.
This tier is always available — if you can git pull, you can receive; if you
can git commit && git push, you can send. It is the durability spine, and every
other tier complements it without replacing it.
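As a rough illustration of the append-only queue, the sketch below writes one message block to a recipient's file. The block layout and the field names in the header line are assumptions made for this example only; the real grammar is the one in reference.md. The git commit and push that make the message durable are left out.

```python
from pathlib import Path


def append_message(queue_dir: Path, sender: str, recipient: str,
                   msg_type: str, body: str) -> Path:
    """Append one message block to to-<recipient>.md and return its path.

    The header layout is hypothetical, not the protocol's real grammar.
    """
    queue_dir.mkdir(parents=True, exist_ok=True)
    queue_file = queue_dir / f"to-{recipient}.md"
    block = (
        f"\n## from:{sender} to:{recipient} type:{msg_type}\n\n"
        f"{body.strip()}\n"
    )
    # Append only: earlier messages are never rewritten, so the file's
    # git history stays a faithful record of what was sent and when.
    with queue_file.open("a", encoding="utf-8") as handle:
        handle.write(block)
    return queue_file
```

Appending rather than rewriting keeps each message a small, separate diff in the file's history, which suits the audit goal described earlier.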
Tier 2: the Riff task channel
The task tracker (Riff, backed by tasks-prod) is task-coupled coordination:
who owns task #N, what blocks it, the decision log. It maps onto existing tracker
primitives — assignee = owner, labels = lane/role, comments = ack/decision
log, dependencies = handoff order — so it needs no schema change to start.
The key invariant: an atomic handoff is an assignee + status swap in a
single tracker call. The doctrine that keeps it clean: "the comment is the human
narrative; the status is the protocol." You read the status to know the state;
you read the comment to know the story.
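The atomic-handoff invariant can be pictured with a small sketch. The Task shape, the field names, and the status strings below are hypothetical and say nothing about Riff's actual API; the point is only that owner and status change together in one operation, never in two steps.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Task:
    number: int
    assignee: str
    status: str


def handoff(task: Task, new_owner: str, new_status: str) -> Task:
    """Return the task with assignee and status swapped in one step.

    Building a single replacement value mirrors sending one tracker call:
    no reader can observe the new owner paired with the old status.
    """
    return replace(task, assignee=new_owner, status=new_status)
```

If the swap were two separate updates, a session reading between them would see an inconsistent state, which is the ownership ambiguity the protocol is trying to avoid.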
Tier 2 complements, never replaces Tier 1 — because the tracker tunnel can be
down (failure mode #5). A planned feature upgrade (a real "addressed to me,
unread" inbox; handoff/ack comment kinds) is tracked as Riff #221; until
it lands, the git queue carries the load.
Tier 3: Telegram (the human bridge, never the bus)
Telegram is single-holder, so it is unfit to be the inter-session bus. But it is
exactly right as a human↔agent bridge: the human sends an instruction, the
session currently holding the connection relays it into Tier 1 as a type:relay
message (which makes the relay auditable — see failure mode #3), and from there it
reaches every session through the durable blackboard.
The load-bearing rule: all inter-session traffic goes through the git blackboard. Telegram and Riff feed into it; they never become a parallel, connection-held channel that some sessions can hear and others can't.
Identity: two names, on purpose
Failure mode #7 (identity drift) is solved by giving each session two identities:
- A technical id: sX (sA, sB, sC, …). This is load-bearing machinery: it's the worktree directory suffix, the commit-msg hook check, the scoreboard column, the branch name, and the [sX] commit-subject prefix. It is ephemeral per-process: a fresh Claude that re-claims sA gets the id but not the prior memory.
- A display name: a short, memorable, project-themed label (po-platform uses Portuguese places: Sintra, Douro, Algarve, …, and seafaring instruments: Caravela, Bussola, …). This is the durable role identity. When a new process re-claims sA, it also adopts the same name from the scoreboard, so the human-facing role ("Sintra is doing the comms work") survives even though the in-memory continuity didn't.
A message can be addressed by either form — to:sA (exact) or to:Sintra
(human-friendly) — and both resolve to the same row. A lane alias like
to:web-presence-owner resolves via the capabilities cell, so you can address
the role without knowing which session currently holds it.
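The three addressing forms can be sketched as a lookup over scoreboard rows. The Row type and its field names are hypothetical, and so is the choice to try ids and names before lane aliases and to match case-sensitively; the actual resolution rules are whatever reference.md specifies.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Row:
    session_id: str          # technical id, e.g. "sA"
    display_name: str        # durable role name, e.g. "Sintra"
    capabilities: frozenset  # lane aliases, e.g. {"web-presence-owner"}


def resolve(address: str, rows: list[Row]) -> Optional[Row]:
    """Resolve 'to:<target>' to a scoreboard row, or None if nothing matches."""
    target = address.removeprefix("to:")
    # First pass: exact technical id or display name.
    for row in rows:
        if target in (row.session_id, row.display_name):
            return row
    # Second pass: lane alias held in the capabilities cell.
    for row in rows:
        if target in row.capabilities:
            return row
    return None
```

Because the alias pass reads the current scoreboard, a message addressed to a lane follows the role when a different session takes it over.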
Why ASCII-only names matter (a real ergonomic ruling). Names must be plain
ASCII letters [A-Za-z], ≤12 chars, no spaces, no accents or diacritics. The
reason is keyboard ergonomics: an accented character costs two keystrokes apiece
and that cost compounds every time a human or agent types the name across a
session. Style within those bounds (places vs. weather vs. comedy) is the project
owner's call.
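The naming rule is simple enough to check mechanically. A minimal sketch follows; the function name is made up for illustration, and the minimum length of one character is an assumption, since the rule above only states the upper bound.

```python
import re

# ASCII letters only, at most 12 characters, nothing else.
# The lower bound of 1 is an assumption; the rule only gives the maximum.
NAME_PATTERN = re.compile(r"[A-Za-z]{1,12}")


def is_valid_display_name(name: str) -> bool:
    """Return True if the name satisfies the ASCII-only naming rule."""
    return NAME_PATTERN.fullmatch(name) is not None
```

Under this check, Sintra and Bussola pass, while a name with a space, a diacritic, or a thirteenth character is rejected.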
The cooperation layers
The three tiers describe how messages move. Layered on top is a set of
cooperation rules that prevent the harmful collisions in the first place. They
are numbered (Layers 0–6) in parallel-sessions.md;
the essentials:
- Layer 0: worktree isolation. Each session runs in its own git worktree (../<project>-sX), so the shared-index race (#1) is structurally impossible, not merely discouraged. A commit-msg hook enforces that a [sX] commit can only happen from the matching -sX worktree.
- Layers 1–2: identity & naming. Claim a session id; prefix everything (branches, commit subjects, PR titles, plan slugs, tracker labels) with [sX].
- Layer 3: the claim scoreboard. Write your row when you start, remove it when you stop; a 6-hour reap rule clears abandoned rows.
- Layer 4: broken-main ownership. Your commit broke main → you fix or revert within 30 minutes. Another session's commit broke it → you ping the breaker and stand down. Never two sessions speculatively fixing the same break.
- Layer 5: shared-doc partition. Files that many sessions touch (backlog.md, CHANGELOG.md, ADRs, the memory index) have one editor per cycle; everyone else queues a request.
- Layer 6: messaging + presence. The message header, the ack state machine, and the mechanical presence beacon described above.
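The 6-hour reap rule from Layer 3 amounts to an age filter over the scoreboard. The sketch below assumes the rows have already been parsed into a mapping from session id to last-seen time; that mapping, the function name, and the strict "older than" boundary are illustrative choices, not part of the protocol text.

```python
from datetime import datetime, timedelta

REAP_AFTER = timedelta(hours=6)


def abandoned_ids(last_seen: dict[str, datetime], now: datetime) -> list[str]:
    """Return the session ids whose last_seen is older than the reap window."""
    return sorted(
        session_id
        for session_id, seen in last_seen.items()
        if now - seen > REAP_AFTER
    )
```

As failure mode #6 notes, an old timestamp is a heuristic for abandonment rather than proof of it, so a filter like this is best read as a list of candidates for reaping.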
How the mechanics enforce it (three small scripts + two hooks)
The protocol is mostly documentation and discipline — but three pieces are mechanical, because discipline alone failed five times on failure mode #1:
- start-session.sh: run once at session start. It picks the lowest unused sX, creates the ../<project>-sX worktree from origin/main, picks an unused display name from the pool, and atomically commits your claim row. This is the only reliable way to get worktree isolation without thinking about it.
- check-worktree-prefix.sh (wired as a commit-msg hook): aborts any commit whose [sX] subject doesn't match the worktree's -sX suffix. This is the backstop that makes Layer 0 enforced rather than merely recommended.
- session-heartbeat.sh (wired as a post-commit hook): bumps the last_seen timestamp in your scoreboard row on every commit, so presence ≈ recent commit activity (closing failure mode #6). It deliberately does not auto-commit the bump (that would cause hook recursion and doctored history); the change sits in your working tree for the next commit to pick up. And if the hook isn't installed at all, liveness still derives from the latest [sX] commit in git log. The mechanism degrades to the git-log fallback; it never becomes load-bearing in a way that breaks if absent.
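Two of the checks these scripts perform are small enough to sketch. This is not the scripts' actual code: the function names are invented, ids are assumed to be a single uppercase letter after "s" (sA through sZ), and letting untagged commit subjects pass is an assumption, since the text only says what happens to mismatched [sX] subjects.

```python
import re
from pathlib import PurePosixPath
from string import ascii_uppercase


def lowest_unused_id(claimed: set[str]) -> str:
    """Return the first of sA, sB, ... sZ that is not already claimed."""
    for letter in ascii_uppercase:
        candidate = f"s{letter}"
        if candidate not in claimed:
            return candidate
    raise RuntimeError("all session ids from sA to sZ are claimed")


def prefix_matches_worktree(subject: str, worktree: str) -> bool:
    """Check that a '[sX] ...' subject matches a '<project>-sX' worktree path."""
    tag = re.match(r"\[(s[A-Z])\]", subject)
    if tag is None:
        # Assumption: subjects without an [sX] tag are outside this check.
        return True
    suffix = re.search(r"-(s[A-Z])$", PurePosixPath(worktree).name)
    return suffix is not None and tag.group(1) == suffix.group(1)
```

A commit-msg hook built on the second function would exit non-zero when it returns False, which is what turns the worktree convention into an enforced rule.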
What we deliberately deferred
Good design is also about what you don't build. The protocol explicitly defers (see ADR-010 "Upgrade triggers"):
- Lightweight push (a file-watcher). A tiny per-session loop that fetches and then watches git log main..origin/main -- queue/to-<me>.md (the commits on the remote that the local branch does not have yet) could surface "N new messages" on the next turn, with zero new infra. This is the recommended first upgrade if poll-on-pull latency starts to hurt. It is not built yet.
- Heavier real-time (Postgres session_bus + LISTEN/NOTIFY): only if the file-watcher proves insufficient.
- A2A / cross-machine multi-agent: only at a scale (untrusted peers, many machines) we are nowhere near.
The current protocol is poll-on-pull: you receive messages when you next
git pull. For sessions that already pull constantly, that latency is small and
the simplicity is worth it.
The one honest limitation
This is not a real-time system. A handoff is seen on the recipient's next
git pull, not the instant it's sent. Status advancement is backed by a hook and
discipline, not hard enforcement. The Riff tier depends on tunnel uptime. These
are deliberate trade-offs in exchange for zero new infrastructure and durability
by construction — and every one of them has a documented upgrade path if the
trade stops being worth it.
For the exact grammar and schemas, see reference.md. To bring
this to your own project, see how-to-onboard.md.