# Multi-Agent Coordination for Engram

**Date:** 2026-04-06
**Status:** Design (not yet implemented)
**Priority:** High — largest architectural gap in Engram as an agent memory system
**Depends on:** Agent Work-Claim & Liveness System (separate spec)

---

## Problem Statement

Engram was built as a single-agent memory vault. It now needs to serve as the shared knowledge layer for multiple concurrent AI agents (Claude Code, Gemini CLI, Copilot, Rook, and future agents) across multiple devices (desktop, Jarvis Nano, the client's Jetson).

Current state:
- All MCP tool calls are anonymous — no concept of caller identity
- No concurrent write protection beyond atomic single-file writes
- All 49K entities in one flat bucket with no agent namespacing
- No way for agents to discover what other agents have learned
- No conflict detection for contradictory memories
- No session handoff mechanism between agents

---

## Design Principles

1. **Read-all, write-own** — Every agent can search and read everything. Agents can only modify memories they created. This is a coordination guard, not a security boundary.
2. **Identity is required, not optional** — Anonymous writes are rejected in production. Dev mode allows anonymous for backward compatibility.
3. **Eventual consistency over locking** — Agents operate independently. Conflicts are detected and surfaced, not prevented with locks.
4. **Filesystem-first** — Agent identity lives in frontmatter, not a separate database. The Markdown files remain the source of truth.
5. **Graceful degradation** — If identity isn't configured, Engram still works (single-agent mode). Multi-agent features activate when identity is present.

---

## Component 1: Agent Identity

### How Agents Identify Themselves

Each agent sets an environment variable before connecting:

```bash
export ENGRAM_AGENT_ID=claude-code       # Claude Code sessions
export ENGRAM_AGENT_ID=gemini-cli        # Gemini CLI sessions
export ENGRAM_AGENT_ID=rook              # Rook agent on Jetson
export ENGRAM_AGENT_ID=copilot           # GitHub Copilot
```

For MCP-over-SSE connections from remote devices, the agent ID can also be passed as a query parameter on the SSE endpoint:

```
http://100.74.129.19:8001/mcp/sse?agent_id=rook
```

### Identity Resolution Order

1. MCP request header `X-Engram-Agent-ID` (if present)
2. SSE query parameter `agent_id` (if present)
3. Environment variable `ENGRAM_AGENT_ID`
4. Default: `"anonymous"` (allowed in dev, rejected in production)
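
The resolution order above can be sketched as a single lookup chain. This is a minimal illustration, not the final `_get_agent_id()`; the names `resolve_agent_id` and `require_identity_for_write`, and the idea of passing headers, query parameters, and environment as plain mappings, are assumptions made for the example.

```python
import os
from typing import Mapping, Optional

ANONYMOUS = "anonymous"


def resolve_agent_id(
    headers: Optional[Mapping[str, str]] = None,
    query_params: Optional[Mapping[str, str]] = None,
    environ: Optional[Mapping[str, str]] = None,
) -> str:
    """Return the caller's agent ID using the documented precedence."""
    headers = headers or {}
    query_params = query_params or {}
    environ = os.environ if environ is None else environ

    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {key.lower(): value for key, value in headers.items()}
    candidates = (
        lowered.get("x-engram-agent-id"),   # 1. request header
        query_params.get("agent_id"),       # 2. SSE query parameter
        environ.get("ENGRAM_AGENT_ID"),     # 3. environment variable
    )
    for value in candidates:
        if value and value.strip():
            return value.strip()
    return ANONYMOUS                        # 4. default


def require_identity_for_write(agent_id: str, production: bool) -> None:
    """Hypothetical guard: reject anonymous writes outside dev mode."""
    if production and agent_id == ANONYMOUS:
        raise PermissionError("anonymous writes are rejected in production")
```

Keeping resolution separate from the write guard lets read-only tools call `resolve_agent_id` for audit logging without ever raising.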

### Agent Registry

File: `~/.engram/agents.json` (auto-populated on first contact)

```json
{
  "claude-code": {
    "first_seen": "2026-04-06T14:30:00Z",
    "last_seen": "2026-04-06T16:45:00Z",
    "memory_count": 342,
    "device": "desktop"
  },
  "rook": {
    "first_seen": "2026-03-28T09:00:00Z",
    "last_seen": "2026-04-06T12:00:00Z",
    "memory_count": 1205,
    "device": "jarvis"
  },
  "gemini-cli": {
    "first_seen": "2026-04-05T20:00:00Z",
    "last_seen": "2026-04-06T15:30:00Z",
    "memory_count": 87,
    "device": "desktop"
  }
}
```

This is informational only — not an allowlist. Any agent ID is accepted.
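
A rough sketch of how the registry could be auto-populated on first contact follows. The function name `touch_agent` and its parameters are hypothetical, and the sketch assumes a single writer (the Engram server process); it does not guard against two processes updating the file at once.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def touch_agent(
    registry_path: Path,
    agent_id: str,
    device: str,
    now: Optional[datetime] = None,
) -> dict:
    """Record that `agent_id` was seen; create its entry on first contact."""
    # `now` is expected to be timezone-aware UTC; it is injectable for tests.
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y-%m-%dT%H:%M:%SZ")

    registry = json.loads(registry_path.read_text()) if registry_path.exists() else {}
    entry = registry.setdefault(
        agent_id,
        {"first_seen": stamp, "memory_count": 0, "device": device},
    )
    entry["last_seen"] = stamp
    entry["device"] = device

    # Write to a temp file and rename so readers never see a partial file.
    tmp_path = registry_path.with_suffix(".tmp")
    tmp_path.write_text(json.dumps(registry, indent=2))
    tmp_path.replace(registry_path)
    return entry
```

`memory_count` is left untouched here; in this sketch it would be incremented by the save path rather than on every contact.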

---

## Component 2: Memory Ownership

### Frontmatter Changes

Every memory gets an `owner_agent` field:

```yaml
---
topics: [coordination, Engram]
importance: high
memory_type: semantic
owner_agent: claude-code
created_at: 2026-04-06T14:30:00+00:00
access_count: 0
---
```

### Enforcement Rules

| Operation | Rule |
|-----------|------|
| `save_memory` | Auto-sets `owner_agent` from caller identity. Cannot be overridden. |
| `update_memory` | Rejects if `owner_agent` doesn't match caller. Returns error with owner info. |
| `delete_memory` | Rejects if `owner_agent` doesn't match caller. Returns error with owner info. |
| `search_memory` | No restriction. Full read access for all agents. |
| `memory_at` | No restriction. |
| `memory_history` | No restriction. |
| `graph_query` | No restriction. |
| `consolidate_memory` | Only consolidates memories owned by the calling agent. Cross-agent consolidation requires explicit `force=True` parameter. |

### Backward Compatibility

Existing 49K entities have no `owner_agent` field. These are treated as `owner_agent: "legacy"`. Any agent can modify legacy entities. A migration script can optionally assign ownership based on content analysis or creation patterns.

### Error Response

When an agent tries to modify another agent's memory:

```json
{
  "error": "ownership_mismatch",
  "message": "Memory owned by 'gemini-cli', you are 'claude-code'. Use search_memory to read it.",
  "owner_agent": "gemini-cli",
  "your_agent_id": "claude-code",
  "tip": "To save your own version, use save_memory with the content you want to preserve."
}
```

This teaches the agent the correct pattern: read theirs, save your own.
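
The enforcement rules, the legacy fallback, and the error payload fit together roughly as below. This is a sketch under assumptions: `effective_owner` and `check_can_modify` are illustrative names, and frontmatter is modeled as an already-parsed dict.

```python
from typing import Optional

LEGACY = "legacy"


def effective_owner(frontmatter: dict) -> str:
    """Memories without an `owner_agent` field are treated as legacy."""
    return frontmatter.get("owner_agent") or LEGACY


def check_can_modify(frontmatter: dict, caller: str) -> Optional[dict]:
    """Return None if `caller` may update or delete, else an error payload."""
    owner = effective_owner(frontmatter)
    if owner == LEGACY or owner == caller:
        return None
    return {
        "error": "ownership_mismatch",
        "message": (
            f"Memory owned by '{owner}', you are '{caller}'. "
            "Use search_memory to read it."
        ),
        "owner_agent": owner,
        "your_agent_id": caller,
        "tip": (
            "To save your own version, use save_memory with the "
            "content you want to preserve."
        ),
    }
```

Both `update_memory` and `delete_memory` could call the same check before touching the file, so the two rules in the table cannot drift apart.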

---

## Component 3: Concurrent Write Safety

### Problem

Two agents can generate the same filename at the same millisecond, and the second write silently replaces the first:
```
20260406_user_preferences_a1b2.md  ← Claude
20260406_user_preferences_a1b2.md  ← Gemini (overwrites!)
```

### Solution: Agent Prefix in Filenames

```
20260406_claude-code_user_preferences_a1b2.md
20260406_gemini-cli_user_preferences_a1b2.md
```

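The agent ID becomes part of the filename, so two different agents can never produce the same name and no lock is needed. Collisions between two writes by the same agent still depend on the random suffix.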
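
A filename builder along these lines would produce the names shown above. It is a sketch only: `memory_filename` is a hypothetical helper, and the slug rules and the 4-hex-character suffix are inferred from the example filenames rather than taken from Engram's code.

```python
import re
import secrets
from datetime import date
from typing import Optional


def memory_filename(
    agent_id: str,
    title: str,
    today: Optional[date] = None,
    suffix: Optional[str] = None,
) -> str:
    """Build `<YYYYMMDD>_<agent>_<slug>_<suffix>.md`."""
    today = today or date.today()
    # Keep the agent ID filesystem-safe; hyphens are preserved.
    safe_agent = re.sub(r"[^a-z0-9-]+", "-", agent_id.lower()).strip("-") or "anonymous"
    # Collapse anything that is not a letter or digit into underscores.
    slug = re.sub(r"[^a-z0-9]+", "_", title.lower()).strip("_")
    # Two random bytes give a 4-character hex suffix.
    suffix = suffix or secrets.token_hex(2)
    return f"{today:%Y%m%d}_{safe_agent}_{slug}_{suffix}.md"


name = memory_filename("claude-code", "User Preferences", date(2026, 4, 6), "a1b2")
```

To also rule out same-agent collisions, the writer could open the target with mode `"x"` (exclusive create) and retry with a fresh suffix on `FileExistsError`.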

### Consolidation Locking

The Dream Cycle (consolidation) is the one operation that must be serialized. Two concurrent consolidations can archive the same file.

Solution: Advisory file lock at `ENGRAM_ROOT/.consolidation.lock`

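```python
import fcntl  # POSIX only; not available on Windows


def run_consolidation_exclusive(consolidate) -> dict:
    """Run `consolidate()` under a non-blocking advisory lock.

    `consolidate` is a placeholder for the actual Dream Cycle entry point;
    ENGRAM_ROOT is the vault root (a pathlib.Path) from config.
    """
    lock_path = ENGRAM_ROOT / ".consolidation.lock"
    with open(lock_path, "w") as lock_fd:  # closed on every exit path
        try:
            fcntl.flock(lock_fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return {"error": "Consolidation already running", "status": "skipped"}

        try:
            return consolidate()
        finally:
            fcntl.flock(lock_fd, fcntl.LOCK_UN)
```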

Non-blocking: if another agent is consolidating, the second one gets a clear "already running" response instead of waiting.

---

## Component 4: Memory Namespacing

### Query Filtering

New optional parameter on `search_memory`:

```python
# Filter to one agent's memories
search_memory(query="user preferences", owner_agent="claude-code")

# Explicit: search all agents (same as omitting the parameter)
search_memory(query="user preferences", owner_agent="*")
```

### Stats by Agent

`get_memory_stats()` returns per-agent breakdown:

```json
{
  "entity_count": 49342,
  "by_agent": {
    "claude-code": 342,
    "rook": 1205,
    "gemini-cli": 87,
    "legacy": 47708
  }
}
```
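
Both the `owner_agent` filter and the per-agent breakdown reduce to the same ownership lookup. The sketch below assumes frontmatter has already been parsed into dicts; `owner_of`, `filter_by_owner`, and `stats_by_agent` are illustrative names, not existing Engram functions.

```python
from collections import Counter
from typing import Iterable, List, Mapping

LEGACY = "legacy"


def owner_of(frontmatter: Mapping) -> str:
    """Memories with no `owner_agent` count as legacy."""
    return frontmatter.get("owner_agent") or LEGACY


def filter_by_owner(frontmatters: Iterable[Mapping], owner_agent: str = "*") -> List[Mapping]:
    """Keep one agent's memories, or everything when owner_agent is '*'."""
    return [
        fm for fm in frontmatters
        if owner_agent == "*" or owner_of(fm) == owner_agent
    ]


def stats_by_agent(frontmatters: Iterable[Mapping]) -> dict:
    """Build the `entity_count` / `by_agent` shape returned by the stats tool."""
    counts = Counter(owner_of(fm) for fm in frontmatters)
    return {"entity_count": sum(counts.values()), "by_agent": dict(counts)}
```

In a real vault the counts would more likely come from the search index than from re-reading 49K files on every call.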

### Directory Structure (Optional Enhancement)

For large-scale deployments, entities could be organized by agent:

```
entities/
├── claude-code/
│   └── 20260406_user_preferences_a1b2.md
├── gemini-cli/
│   └── 20260406_user_preferences_c3d4.md
├── rook/
│   └── 20260406_tool_routing_e5f6.md
└── legacy/
    └── (existing 49K files)
```

This is optional — the flat structure with agent prefix in filename works fine. Subdirectories are a future optimization if the vault grows past 100K entities.

---

## Component 5: Cross-Agent Event Notifications

### Problem

Agent A saves a critical memory. Agent B won't know unless it searches.

### Solution: Event Stream via SSE

New endpoint: `GET /events/stream?agent_id=claude-code`

Returns Server-Sent Events for memory mutations:

```
event: memory_saved
data: {"agent": "gemini-cli", "path": "...", "topics": ["preferences"], "importance": "high"}

event: memory_updated
data: {"agent": "rook", "path": "...", "changed_fields": ["topics"]}

event: memory_deleted
data: {"agent": "claude-code", "path": "...", "archived": true}

event: consolidation_complete
data: {"agent": "rook", "merged": 3, "pruned": 7}
```

### Filtering

Agents can filter the event stream:

```
/events/stream?topics=preferences,security    # only events touching these topics
/events/stream?importance=high,critical        # only important changes
/events/stream?exclude_self=true               # don't echo own mutations
```
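
One way to evaluate these filters per event is sketched below. `EventFilter` is a hypothetical helper, and the matching semantics are assumptions: a topic filter matches when any listed topic appears on the event, and an event that carries no `topics` or `importance` field is dropped when the corresponding filter is set.

```python
from dataclasses import dataclass
from typing import FrozenSet, Mapping


@dataclass(frozen=True)
class EventFilter:
    agent_id: str
    topics: FrozenSet[str] = frozenset()
    importance: FrozenSet[str] = frozenset()
    exclude_self: bool = False

    @classmethod
    def from_query(cls, agent_id: str, params: Mapping[str, str]) -> "EventFilter":
        """Parse comma-separated query parameters into a filter."""
        def split(name: str) -> FrozenSet[str]:
            raw = params.get(name, "")
            return frozenset(part.strip() for part in raw.split(",") if part.strip())

        return cls(
            agent_id=agent_id,
            topics=split("topics"),
            importance=split("importance"),
            exclude_self=params.get("exclude_self", "false").lower() == "true",
        )

    def matches(self, event: Mapping) -> bool:
        """Return True if the event should be delivered to this subscriber."""
        if self.exclude_self and event.get("agent") == self.agent_id:
            return False
        if self.topics and not self.topics & set(event.get("topics", [])):
            return False
        if self.importance and event.get("importance") not in self.importance:
            return False
        return True
```

An empty filter matches every event, which keeps the unfiltered `/events/stream` behavior as the default.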

### Implementation

The audit log we already built provides the data. The event stream is a thin SSE wrapper over audit log entries, broadcast to connected clients.

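```python
# In server.py
import json
from asyncio import Queue

from fastapi.responses import StreamingResponse  # assumes `app` is a FastAPI app

_event_subscribers: dict[str, Queue] = {}

@app.get("/events/stream")
async def event_stream(agent_id: str = "anonymous"):
    queue: Queue = Queue()
    _event_subscribers[agent_id] = queue  # one live stream per agent ID

    async def generate():
        # Clean up inside the generator: a try/finally around the return
        # would unsubscribe before the first event is ever streamed.
        try:
            while True:
                event = await queue.get()
                yield f"event: {event['operation']}\ndata: {json.dumps(event)}\n\n"
        finally:
            # Only remove our own queue; a reconnect may have replaced it.
            if _event_subscribers.get(agent_id) is queue:
                del _event_subscribers[agent_id]

    return StreamingResponse(generate(), media_type="text/event-stream")
```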

The `log_operation()` function in `audit.py` also pushes to all subscriber queues.
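
The fan-out step might look like the sketch below. It also illustrates the bounded, oldest-dropped queue policy raised under Open Questions; the `publish` helper and the idea of creating each subscriber queue with `Queue(maxsize=...)` are assumptions, not existing code.

```python
import asyncio
from typing import Iterable


def publish(queues: Iterable[asyncio.Queue], event: dict) -> None:
    """Push `event` to every subscriber queue without ever blocking.

    Queues are expected to be bounded (asyncio.Queue(maxsize=N)). When a
    slow consumer's queue is full, its oldest event is discarded to make
    room, so memory use stays capped per subscriber.
    """
    for queue in queues:
        if queue.full():
            queue.get_nowait()   # drop the oldest pending event
        queue.put_nowait(event)


async def demo() -> list:
    slow_consumer = asyncio.Queue(maxsize=2)
    for n in range(3):
        publish([slow_consumer], {"operation": "memory_saved", "n": n})
    # Only the two most recent events remain queued.
    return [slow_consumer.get_nowait()["n"] for _ in range(2)]
```

`log_operation()` could call `publish(_event_subscribers.values(), entry)` after writing the audit line; because `publish` never awaits, it is safe to call from synchronous code running on the event loop thread.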

---

## Component 6: Contradiction Detection

### Problem

```
Memory A (claude-code): "User prefers verbose, detailed explanations"
Memory B (gemini-cli):  "User wants short, concise responses"
```

Both are "true" — they reflect different moments or contexts. But agents consuming both get confused.

### Approach: Semantic Similarity + Conflict Flagging

**Not automatic resolution** — flag conflicts for the user to resolve.

### Detection

During consolidation or on-demand via a new MCP tool:

```python
@mcp.tool()
def detect_contradictions(
    topics: list[str] | None = None,
    limit: int = 20,
) -> list[dict]:
    """Find potentially contradictory memories across agents."""
```

Algorithm:
1. Group memories by topic overlap (same as consolidation)
2. Within each group, compute semantic similarity (dense search)
3. High similarity + different agents + different sentiment/stance = potential contradiction
4. Return pairs with explanation of why they might conflict
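
The four steps can be sketched as a pairwise scan. Everything here is illustrative: the `embedding` field, the caller-supplied `stances_differ` callback, and the `0.8` threshold are placeholders (the threshold in particular is an open question), and the quadratic loop would need topic bucketing before it could run over a large vault.

```python
import math
from itertools import combinations
from typing import Callable, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def candidate_contradictions(
    memories: List[dict],
    stances_differ: Callable[[dict, dict], bool],
    threshold: float = 0.8,   # placeholder value, needs tuning
    limit: int = 20,
) -> List[dict]:
    """Return cross-agent pairs that are similar in topic but differ in stance."""
    pairs = []
    for a, b in combinations(memories, 2):
        if a["owner_agent"] == b["owner_agent"]:
            continue                      # same agent: not a cross-agent conflict
        shared = set(a["topics"]) & set(b["topics"])
        if not shared:
            continue                      # step 1: require topic overlap
        score = cosine(a["embedding"], b["embedding"])   # step 2
        if score >= threshold and stances_differ(a, b):  # step 3
            pairs.append({                               # step 4
                "a": a["path"],
                "b": b["path"],
                "similarity": round(score, 3),
                "reason": f"shared topics {sorted(shared)}, differing stance",
            })
    pairs.sort(key=lambda pair: pair["similarity"], reverse=True)
    return pairs[:limit]
```

Stance comparison is deliberately left as a callback, since the spec does not yet say whether it would be a sentiment model, an LLM judgment, or a heuristic.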

### Resolution Options

When contradictions are found, the user (not the agent) decides:

1. **Keep both** — mark as "context-dependent" (e.g., verbose for learning, concise for status updates)
2. **Keep newer** — archive the older one
3. **Merge** — create a new memory that reconciles both (e.g., "User prefers concise for status, detailed for explanations")
4. **Override** — one agent's memory takes precedence

Resolution is stored as a relationship in the knowledge graph:

```
Memory_A --[supersedes]--> Memory_B
Memory_A --[context: learning]--> (no conflict, different context)
```

---

## Component 7: Session Handoff

### Problem

The user is in a Claude Code session working on a feature and wants to switch to Gemini CLI to continue. Currently, Gemini starts cold, with no context about what Claude was doing.

### Solution: Handoff Memory

New MCP tool:

```python
@mcp.tool()
def create_handoff(
    target_agent: str,
    context: str,
    active_files: list[str] | None = None,
    decisions_made: list[str] | None = None,
    next_steps: list[str] | None = None,
) -> dict:
    """Create a handoff memory for another agent to pick up your work."""
```

This creates a special memory with `memory_type: "handoff"`:

```yaml
---
topics: [handoff, engram-refactor]
importance: critical
memory_type: handoff
owner_agent: claude-code
target_agent: gemini-cli
handoff_status: pending  # pending → accepted → completed
created_at: 2026-04-06T16:00:00+00:00
---

## Handoff: Engram Bug Fixes

### Context
Working on Engram code review. Fixed 10 bugs in category 1, 6 security fixes in category 2.

### Active Files
- src/engram/consolidator.py (modified — tag filtering added)
- src/engram/server.py (modified — rate limiting added)
- src/engram/audit.py (new file)
- src/engram/ratelimit.py (new file)

### Decisions Made
- Write-own, read-all memory model for multi-agent
- Token bucket rate limiting over slowapi (no external dep)
- JSONL audit log format

### Next Steps
1. Implement multi-agent identity layer
2. Add observability (metrics endpoint)
3. Write production ops runbooks
```

The receiving agent can search for pending handoffs:

```python
search_memory(query="handoff", tags=["handoff"], mode="keyword")
# Filter for target_agent matching self
```

And mark it accepted:

```python
update_memory(path, handoff_status="accepted")
# Ownership check skipped for handoff acceptance
```
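
The pickup and acceptance steps could be made explicit as below. This sketch assumes that only the `target_agent` may advance a handoff, that the target is also the one who marks it completed, and that status moves strictly `pending` to `accepted` to `completed`; `pending_handoffs_for` and `advance_handoff` are hypothetical helper names.

```python
from typing import List

# Allowed status transitions: current status -> next status.
HANDOFF_TRANSITIONS = {"pending": "accepted", "accepted": "completed"}


def pending_handoffs_for(agent_id: str, memories: List[dict]) -> List[dict]:
    """Select handoff memories addressed to `agent_id` that nobody accepted yet."""
    return [
        m for m in memories
        if m.get("memory_type") == "handoff"
        and m.get("target_agent") == agent_id
        and m.get("handoff_status") == "pending"
    ]


def advance_handoff(frontmatter: dict, caller: str, new_status: str) -> dict:
    """Return updated frontmatter, bypassing the usual owner check.

    The ownership exception is narrow: it applies only to handoff
    memories, only to the target agent, and only to the status field.
    """
    if frontmatter.get("memory_type") != "handoff":
        raise ValueError("not a handoff memory")
    if caller != frontmatter.get("target_agent"):
        raise PermissionError(f"handoff is addressed to {frontmatter.get('target_agent')!r}")
    current = frontmatter.get("handoff_status")
    if HANDOFF_TRANSITIONS.get(current) != new_status:
        raise ValueError(f"cannot move handoff from {current!r} to {new_status!r}")
    return {**frontmatter, "handoff_status": new_status}
```

Restricting the exception to the status field keeps the handoff body immutable for the receiver, which preserves the write-own rule for everything except the acknowledgement itself.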

---

## Implementation Phases

### Phase 1: Identity + Ownership (Quick Wins)
**Effort:** ~2 hours
**Files:** `mcp_tools.py`, `config.py`, `audit.py`

- Add `_get_agent_id()` resolution chain
- Add `owner_agent` to frontmatter on save
- Enforce ownership on update/delete
- Add agent prefix to filenames
- Backward compat for legacy entities (no owner = modifiable by all)

### Phase 2: Concurrent Safety + Namespacing
**Effort:** ~1 hour
**Files:** `mcp_tools.py`, `consolidator.py`

- Advisory lock for consolidation
- `owner_agent` filter parameter on `search_memory`
- Per-agent stats in `get_memory_stats`

### Phase 3: Event Stream
**Effort:** ~2 hours
**Files:** `server.py`, `audit.py`

- SSE endpoint for memory mutation events
- Topic/importance filtering
- Wire `log_operation()` to broadcast to subscribers

### Phase 4: Contradiction Detection
**Effort:** ~3 hours
**Files:** New `contradiction.py`, `mcp_tools.py`

- Semantic similarity comparison across agents
- Conflict flagging with explanation
- User-facing resolution workflow
- KG relationship storage for resolutions

### Phase 5: Session Handoff
**Effort:** ~2 hours
**Files:** `mcp_tools.py`

- `create_handoff` and `accept_handoff` MCP tools
- Special `memory_type: handoff` with status tracking
- Ownership exception for handoff acceptance

---

## Open Questions

1. **Agent ID spoofing** — Should we care? Any agent can claim to be any other agent. Since this is a coordination guard (not security), probably not. But worth noting.

2. **Legacy migration** — Should we batch-assign ownership to the 49K existing entities? Options:
   - Leave as `legacy` (simplest)
   - Assign based on creation date patterns (before Engram MCP = legacy, after = analyze)
   - Ask user to run a one-time migration script

3. **Cross-device identity** — Claude Code on desktop and Claude Code on Jarvis are both `claude-code`. Should they be `claude-code@desktop` and `claude-code@jarvis`? Probably yes for auditability.

4. **Handoff expiry** — How long should a pending handoff stay active? If nobody picks it up in 24 hours, should it downgrade to a regular memory?

5. **Event stream backpressure** — If an agent subscribes to events but doesn't consume them, the queue grows unbounded. Need max queue size with oldest-dropped policy.

6. **Contradiction threshold** — What semantic similarity score constitutes a "contradiction" vs. "related but different"? Needs tuning with real data.

---

## Success Criteria

- [ ] Any agent can save a memory and it's tagged with their identity
- [ ] No agent can modify another agent's memory (except legacy entities)
- [ ] `search_memory` returns results from all agents by default
- [ ] `get_memory_stats` shows per-agent breakdown
- [ ] Two concurrent consolidations don't corrupt the archive
- [ ] Agent filenames never collide even at millisecond granularity
- [ ] Audit log shows who did what and when
- [ ] Event stream delivers real-time mutation notifications
- [ ] Contradictions between agents are flagged for user review
- [ ] Session handoff works between Claude and Gemini
