# Engram Architecture

Technical architecture document for the Engram MCP memory server.

Last updated: 2026-04-03

---

## 1. Overview

Engram is a local-first, recovery-hardened memory server for autonomous AI agents.
It exposes persistent, searchable memory over the Model Context Protocol (MCP) and
a FastAPI HTTP interface. The system was built to recover 49,000+ corrupted session
files from a real infrastructure failure and to ensure that recovered metadata is
trustworthy through ground-truth verification.

Three design pillars govern every decision:

- **Local-first.** All data lives on the local filesystem as Markdown files with
  YAML frontmatter. No cloud dependency. No external database in the critical path.
- **Recovery-first.** Every pipeline is resumable, checkpointed, and fault-tolerant.
  Corrupted files are quarantined, not retried. Progress is never lost.
- **Verified-retrieval.** Tags proposed by LLMs are verified against source text
  using ripgrep before they enter the index. Hallucinated metadata is pruned at
  ingest time, not discovered at query time.

---

## 2. System Architecture Diagram

```
                        +---------------------------+
                        |   AI Agent                |
                        |   (Claude, etc.)          |
                        +------------+--------------+
                                     |
                          MCP stdio  |  HTTP POST /search
                                     |
                        +------------v--------------+
                        |       server.py           |
                        |   FastAPI + MCP hub       |
                        |                           |
                        |  +---------------------+  |
                        |  | models.py           |  |
                        |  | Pydantic validation |  |
                        |  +---------------------+  |
                        +---+------------------+----+
                            |                  |
               metadata     |                  |  text search
               filters      |                  |
          +-----------------v---+    +---------v-----------+
          |     index.py        |    |   ripgrep (rg)      |
          | SQLite FTS5 index   |    |   subprocess exec   |
          | vault_index.sqlite  |    |   async, no shell   |
          +---------------------+    +---------+-----------+
                                               |
                        +----------------------v--------------------+
                        |                entities/                   |
                        |  Markdown vault with YAML frontmatter     |
                        |  (49k+ session files, keyword-tagged)     |
                        +----+-------------+---------------+--------+
                             |             |               |
                    +--------v---+  +------v--------+  +---v-----------+
                    | corrupted/ |  | agent_memory/ |  | telemetry/    |
                    | (DLQ)      |  | (personas)    |  | (CLI traces)  |
                    +------------+  +---------------+  +---------------+

  OFFLINE PIPELINES (batch, cron-driven)

  +----------------+     +----------------+     +-----------------+
  | librarian.py   |     | guardrail.py   |     | consolidator.py |
  | ETL indexer    |     | tag verifier   |     | dream cycle     |
  | checkpointed   |     | alias + rg     |     | merge + prune   |
  +-------+--------+     +-------+--------+     +--------+--------+
          |                       |                       |
          v                       v                       v
    vault_index.sqlite     YAML frontmatter        .archive/
    librarian_state.json   PRUNED_DICTIONARY.md    consolidation_report.json

  +----------------+
  | segregation.py |
  | block splitter |
  +-------+--------+
          |
          +-----> agent_memory/{Planning,Coding,Review,Research}/
          +-----> telemetry/*.log
```

---

## 3. Core Modules

### 3.1 server.py -- FastAPI Hub

The central HTTP server and MCP endpoint. Provides four routes:

| Route          | Method | Purpose                                    |
|----------------|--------|--------------------------------------------|
| `/`            | GET    | Service info and endpoint listing           |
| `/health`      | GET    | Liveness check (status, timestamp, version) |
| `/search`      | POST   | Full-text and metadata search               |
| `/status`      | GET    | Configuration paths and service state       |

**Search flow:**

1. Accepts a `SessionSearchInput` (Pydantic-validated).
2. If metadata filters (tags, date range) are present and the SQLite index exists,
   queries `index.py` first to narrow candidate file paths.
3. If a text query is present, spawns `rg` (ripgrep) as an async subprocess
   (`asyncio.create_subprocess_exec`) over the candidate paths or the full
   `entities/` directory.
4. If only metadata filters are present (no text query), reads the first 500 chars
   of each candidate file and returns snippets.
5. Truncates output to `MAX_RESPONSE_CHARS` (4000) to fit LLM context budgets.
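Steps 3 and 5 above can be sketched as follows. `run_ripgrep` is a hypothetical helper name (the actual function lives in `server.py`), but the argument-list subprocess call, the ripgrep flags, and the truncation budget mirror the flow described:

```python
import asyncio

MAX_RESPONSE_CHARS = 4000  # truncation budget for LLM context (from config)

async def run_ripgrep(pattern: str, paths: list[str]) -> str:
    """Spawn rg as an async subprocess (argument list, never shell=True)."""
    proc = await asyncio.create_subprocess_exec(
        "rg", "--no-heading", "--with-filename", "--", pattern, *paths,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    stdout, _stderr = await proc.communicate()
    # Step 5: truncate to fit the LLM context budget
    return stdout.decode("utf-8", errors="replace")[:MAX_RESPONSE_CHARS]
```

Passing the pattern after `--` prevents a query beginning with `-` from being parsed as a ripgrep flag.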

**Security:**

- Optional Bearer token auth via `ENGRAM_API_TOKEN` env var.
- CORS middleware with configurable origins.
- All subprocess calls use argument lists -- never `shell=True`.
- Session IDs validated against `^[A-Za-z0-9_.-]+$` to prevent path traversal.
- Query length capped at 500 chars by the Pydantic model, with a 1000-char hard
  cap server-side as a second layer.

### 3.2 models.py -- Pydantic Validation

Defines `SessionSearchInput`, the single input schema for search:

| Field        | Type          | Constraints                              |
|--------------|---------------|------------------------------------------|
| `query`      | str           | Max 500 chars, valid regex               |
| `session_id` | str or None   | Alphanumeric + dots/hyphens/underscores  |
| `tags`       | list[str]     | OR-semantics filter against index        |
| `date_from`  | str or None   | YYYY-MM-DD format enforced               |
| `date_to`    | str or None   | YYYY-MM-DD format enforced               |

A model-level validator ensures at least one filter is provided (no empty queries).

### 3.3 librarian.py -- Generator-Based ETL

The Librarian is the batch indexer. It scans `entities/`, parses YAML frontmatter,
applies keyword-based topic tagging, and populates the SQLite metadata index.

**Key properties:**

- **Generator scan.** Uses `os.scandir()` recursively. The full 49k+ path list
  never materialises in memory. Yields one `Path` at a time.
- **Idempotent checkpointing.** A JSON checkpoint (`librarian_state.json`) tracks
  every processed file by relative path. Interruption at file N resumes at N+1.
  Checkpoints are written every `--batch-size` files (default 100) using atomic
  `os.replace`.
- **Dead Letter Queue.** Files that fail to read or write are moved to
  `entities/corrupted/` and logged. They are never retried automatically.
- **Keyword tagging.** Regex patterns from `config.KEYWORD_TOPICS` are matched
  against file content. Matching topics are written into YAML frontmatter.
- **Index population.** Each processed file is upserted into `vault_index.sqlite`
  via `index.upsert_entity`.
- **Reconciliation.** After the scan, `index.reconcile` removes index entries
  whose files no longer exist on disk.
- **Thematic profile.** Generates `MASTER_PROFILE.md` with topic frequency counts.

**Single-threaded by design.** At ~1ms per file, the full 49k corpus completes in
~90 seconds. `ProcessPoolExecutor` was evaluated and rejected: IPC serialisation
overhead, checkpoint race conditions, and increased memory pressure (the exact
failure mode being recovered from) outweigh the marginal throughput gain.

### 3.4 guardrail.py -- Anti-Hallucination Verification

The Verified-Retrieval Layer. Prevents LLM-hallucinated tags from entering the
index.

**Algorithm:**

1. For each proposed tag, look up its aliases in `config.ALIAS_MAP`.
   Example: `"edge"` maps to `["edge", "nano", "nvidia", "orin", "board", "jetpack"]`.
2. Run `rg -iw <alias> <file>` for each alias. If any alias produces a match,
   the tag is verified.
3. Tags with zero evidence hits are pruned.
4. Verified tags are written back to YAML frontmatter. Pruned tags are appended
   to `PRUNED_DICTIONARY.md` with occurrence counts for audit.
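Steps 1 and 2 can be sketched like this; the alias map is an illustrative excerpt of `config.ALIAS_MAP`, and `verify_tag` is a hypothetical function name:

```python
import subprocess

# Illustrative excerpt of config.ALIAS_MAP
ALIAS_MAP = {
    "edge": ["edge", "nano", "nvidia", "orin", "board", "jetpack"],
}

def verify_tag(tag: str, file_path: str) -> bool:
    """A tag survives if ANY of its aliases appears as a whole word in the file."""
    for alias in ALIAS_MAP.get(tag, [tag]):
        # -i: case-insensitive, -w: whole word, --quiet: exit code only
        result = subprocess.run(
            ["rg", "-iw", "--quiet", "--", alias, file_path],
            capture_output=True,
        )
        if result.returncode == 0:  # rg exits 0 on match, 1 on no match
            return True
    return False
```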

**Why ripgrep for verification:** The check is binary and deterministic -- either
the alias appears in the source text or it does not. This is not a probabilistic
embedding similarity. It cannot hallucinate.

### 3.5 index.py -- SQLite FTS5 Metadata Index

Provides structured queries (tag filtering, date ranges) that ripgrep cannot
efficiently express.

**Schema:**

```sql
CREATE TABLE entities (
    path       TEXT PRIMARY KEY,
    mtime      REAL NOT NULL,
    topics     TEXT,          -- JSON array
    summary    TEXT,
    created_at TEXT
);

-- FTS5 virtual table (if available in the SQLite build)
CREATE VIRTUAL TABLE entities_fts USING fts5(
    path, topics, summary,
    content='entities',
    content_rowid='rowid'
);
```

**Features:**

- **Graceful FTS5 degradation.** If the SQLite build lacks FTS5, the module falls
  back to `LIKE` queries. The `_fts5_available` flag is probed once and cached.
- **WAL mode.** `PRAGMA journal_mode=WAL` for concurrent read access while the
  Librarian writes.
- **Integrity check on open.** `PRAGMA integrity_check` runs at connection time.
  Corrupt databases return `None` instead of a connection, allowing callers to
  proceed without the index.
- **Reconciliation.** `reconcile()` deletes index entries for files that no longer
  exist on disk, keeping the index consistent with the filesystem.

### 3.6 consolidator.py -- Dream Cycle

Biologically inspired memory consolidation. Merges high-overlap entities and prunes
stale, low-importance files.

**Three phases:**

1. **Discovery.** Walks `entities/` (excluding `.archive/`) and loads each file
   into an `Entity` dataclass (path, frontmatter, body, tags).
2. **Consolidation.** For each pair of entities, computes similarity:
   - Tag overlap via Jaccard similarity (threshold: 0.8).
   - Content keyword overlap for words longer than 5 characters (threshold: 0.7).
   Groups above threshold are merged. The first entity (sorted by filename) becomes
   the master. Fragments are archived to `.archive/` with a citation chain
   recording original filenames and timestamps.
3. **Pruning.** Files with `importance: low` in frontmatter and `mtime` older than
   30 days are moved to `.archive/`.

A JSON report is saved to `telemetry/consolidation_report.json` after each cycle.
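The tag-overlap test in phase 2 can be sketched as follows (`tags_overlap` is a hypothetical name; the 0.8 threshold is the one stated above):

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: |intersection| / |union| (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

TAG_THRESHOLD = 0.8  # consolidation threshold from phase 2

def tags_overlap(entity_a_tags: list[str], entity_b_tags: list[str]) -> bool:
    return jaccard(set(entity_a_tags), set(entity_b_tags)) >= TAG_THRESHOLD
```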

### 3.7 segregation.py -- Agent/CLI Block Splitter

Extracts agent reasoning blocks and CLI command traces from entity files into
typed subdirectories.

- **Agent blocks.** Regex patterns from `config.AGENT_PATTERNS` identify agent
  personas (Planning, Coding, Review, Research). Matched blocks are extracted to
  `agent_memory/{persona}/` and replaced with back-references in the source file.
- **CLI traces.** Regex `config.CLI_REGEX` matches command-line tool invocations
  (`rg`, `ls`, `sqlite3`, `crontab`, `git`, `pip`, `npm`). Matched blocks are
  extracted to `telemetry/` as `.log` files.

Uses the same checkpoint pattern as the Librarian (`segregation_state.json`).
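The extract-and-back-reference mechanic can be sketched as follows. The pattern shown is an invented stand-in for `config.AGENT_PATTERNS` (the real personas are matched differently), and the back-reference marker text is an assumption:

```python
import re

# Invented stand-in for config.AGENT_PATTERNS
AGENT_PATTERNS = {
    "Planning": re.compile(r"^## Planning\n(.*?)(?=^## |\Z)",
                           re.MULTILINE | re.DOTALL),
}

def extract_blocks(text: str, persona: str) -> tuple[list[str], str]:
    """Pull out matched blocks; replace each with a back-reference marker."""
    pattern = AGENT_PATTERNS[persona]
    blocks = [m.group(0) for m in pattern.finditer(text)]
    remainder = pattern.sub(f"[moved to agent_memory/{persona}/]\n", text)
    return blocks, remainder
```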

### 3.8 config.py -- Centralized Configuration

All paths, regex patterns, and constants in one module:

| Constant            | Value / Purpose                                    |
|---------------------|----------------------------------------------------|
| `ENGRAM_DATA`       | `~/.local/share/engram/` -- top-level data directory |
| `ENTITIES_DIR`      | `~/.local/share/engram/entities/` -- Markdown vault              |
| `TELEMETRY_DIR`     | `~/.local/share/engram/telemetry/` -- extracted CLI traces       |
| `AGENT_MEMORY_DIR`  | `~/.local/share/engram/agent_memory/` -- extracted agent blocks  |
| `LOG_DIR`           | `~/.local/share/engram/logs/` -- pipeline logs                   |
| `INDEX_PATH`        | `~/.local/share/engram/vault_index.sqlite` -- metadata index     |
| `KEYWORD_TOPICS`    | Regex-to-topic mapping for keyword tagging          |
| `AGENT_PATTERNS`    | Regex-to-persona mapping for block segregation      |
| `ALIAS_MAP`         | Canonical-tag-to-aliases for guardrail verification |
| `MAX_QUERY_LENGTH`  | 500 characters                                      |
| `MAX_RESPONSE_CHARS`| 4000 characters                                     |

### 3.9 utils.py -- Shared Utilities

| Function            | Purpose                                            |
|---------------------|----------------------------------------------------|
| `setup_logging`     | File + stderr logging with named logger            |
| `parse_frontmatter` | Extract YAML frontmatter dict and body from MD     |
| `atomic_dump`       | Write JSON via temp file + `os.replace` (no partial writes) |

---

## 4. Data Flow: Search Request

```
Agent sends: POST /search {"query": "edge", "tags": ["Hardware"]}
                |
                v
     [1] Pydantic validation (models.SessionSearchInput)
         - query length check
         - regex validity check
         - session_id whitelist check
         - at-least-one-filter check
                |
                v
     [2] Metadata pre-filter (index.py)
         - Open vault_index.sqlite
         - query_index(tags=["Hardware"]) -> candidate paths
         - Filter to paths that exist on disk
                |
                v
     [3] Text search (ripgrep subprocess)
         - Build command: rg --no-heading --with-filename "edge" <candidates>
         - asyncio.create_subprocess_exec (no shell)
         - Await stdout/stderr
                |
                v
     [4] Response assembly
         - Decode stdout, truncate to MAX_RESPONSE_CHARS
         - Return SearchResponse(query, results, elapsed_ms)
```

When only metadata filters are provided (no text query), step 3 is skipped.
Instead, the first 500 characters of each candidate file are read and returned
as snippets.

---

## 5. Storage Layout

```
~/.local/share/engram/           (ENGRAM_DATA — runtime data, separate from git repo)
|
+-- entities/                    Primary vault. Markdown files with YAML frontmatter.
|   +-- Chatrecall_Session_*.md  Session transcripts (49k+ files)
|   +-- reflect-*.md             Periodic reflection entries
|   +-- corrupted/               Dead Letter Queue for unprocessable files
|   +-- .archive/                Consolidated/pruned files (from dream cycle)
|
+-- agent_memory/                Extracted agent reasoning blocks
|   +-- Planning/                Planning-persona blocks
|   +-- Coding/                  Coding-persona blocks
|   +-- Review/                  Review-persona blocks
|   +-- Research/                Research-persona blocks
|
+-- telemetry/                   Extracted CLI command traces (.log files)
|
+-- logs/                        Pipeline execution logs
|   +-- librarian.log
|   +-- consolidator.log
|   +-- ingestion.log
|
+-- vault_index.sqlite           SQLite FTS5 metadata index (WAL mode)
+-- librarian_state.json         Librarian checkpoint (processed file set)
+-- segregation_state.json       Segregation checkpoint
```

**File format -- entity Markdown:**

```yaml
---
topics:
  - edge
  - Hardware
summary: edge device provisioning session.
related_ids: []
created_at: "2026-03-15"
---
(body text)
```

---

## 6. Design Decisions

### 6.1 Why ripgrep over SQLite FTS for primary search

The primary search mechanism is ripgrep, not SQLite FTS5. This is deliberate:

- **No stale index.** The filesystem IS the index. If a file is modified outside
  the pipeline (by an agent, a script, or a human), the next ripgrep search
  reflects the change immediately. SQLite FTS requires re-indexing.
- **Performance is sufficient.** At 50k files on NVMe, ripgrep returns in ~45ms
  (p50). At 200k files, ~165ms. Both are under the 200ms LLM tool-call budget.
- **Operational simplicity.** No index rebuild, no migration, no schema versioning.
  `rg` is a single static binary with no dependencies.

SQLite FTS5 is used as a **secondary index** for structured metadata queries (tag
filtering, date ranges) that ripgrep cannot express. The two systems complement
each other: ripgrep for text, SQLite for metadata.

### 6.2 Why filesystem-as-vault

- **Auditability.** Every entity is a plain-text Markdown file. `git log`, `diff`,
  `grep` all work natively. No proprietary format to decode.
- **Recovery.** When the system failed, files were recoverable with standard Unix
  tools. A database crash requires database-specific recovery tooling.
- **Agent interoperability.** Any agent or tool that can read files can read the
  vault. No adapter required.
- **Air-gap friendly.** Files can be transferred via USB, rsync, or sneakernet
  to edge nodes without database export/import.

### 6.3 Why generator-based pipelines

The original failure was caused by `sorted(glob("*.md"))` materialising 49k+ Path
objects into a list, triggering swap pressure under low-memory conditions. Every
Engram pipeline uses generators (`os.scandir` + `yield`) to process files one at
a time. The full file list never exists in memory.
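The generator pattern looks like this; the excluded directory names follow the layout in section 5, and the function name is an assumption:

```python
import os
from pathlib import Path
from typing import Iterator

def iter_markdown(root: Path) -> Iterator[Path]:
    """Yield .md files one at a time; the full path list never exists in memory."""
    with os.scandir(root) as it:
        for entry in it:
            if entry.is_dir(follow_symlinks=False):
                # Skip the DLQ and the dream-cycle archive
                if entry.name not in ("corrupted", ".archive"):
                    yield from iter_markdown(Path(entry.path))
            elif entry.name.endswith(".md"):
                yield Path(entry.path)
```

Contrast with `sorted(glob("*.md"))`, which must build and sort the entire list before the first file can be processed.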

### 6.4 Why single-threaded processing

At ~1ms per file (YAML parse + keyword scan + write-back), 49k files complete in
~90 seconds single-threaded. Multiprocessing would add:

- IPC serialisation overhead per file
- Checkpoint race conditions requiring locking
- Increased memory pressure from worker processes

The bottleneck is disk I/O, not CPU. Simplicity and correctness win.

---

## 7. Recovery Patterns

### 7.1 Checkpointing

Both the Librarian and Segregation engine use identical checkpoint patterns:

1. A JSON file tracks the set of processed relative paths.
2. Checkpoints are written every N files (default 100) via `atomic_dump`
   (write to `.tmp`, then `os.replace`).
3. On startup, the checkpoint is loaded. Previously-processed files are skipped.
4. `--full-index` flag resets the checkpoint for a clean re-run.

**Crash at file 25,000 resumes at 25,001.** No reprocessing. No data loss.
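The four steps above can be sketched as a single loop; function names and the checkpoint file format (a JSON array of paths) are assumptions consistent with the description:

```python
import json
import os
from pathlib import Path

def load_checkpoint(path: Path) -> set[str]:
    if path.exists():
        return set(json.loads(path.read_text()))
    return set()

def save_checkpoint(path: Path, processed: set[str]) -> None:
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(sorted(processed)))
    os.replace(tmp, path)  # atomic on POSIX: old file or new file, never partial

def run(files: list[str], checkpoint: Path, batch_size: int = 100) -> None:
    processed = load_checkpoint(checkpoint)
    for n, f in enumerate(f for f in files if f not in processed):
        # ... process file f ...
        processed.add(f)
        if (n + 1) % batch_size == 0:
            save_checkpoint(checkpoint, processed)
    save_checkpoint(checkpoint, processed)  # final flush
```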

### 7.2 Idempotency

- **Librarian:** Frontmatter is parsed, topics are de-duplicated (`set`), and
  the file is rewritten. Running the Librarian twice on the same file produces
  the same output. The checkpoint prevents redundant work, but correctness does
  not depend on it.
- **Index upserts:** `INSERT OR REPLACE` ensures repeated indexing of the same
  file overwrites rather than duplicates.
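A minimal sketch of what `index.upsert_entity` might look like against the `entities` schema in section 3.5 (the exact signature is an assumption):

```python
import json
import sqlite3

def upsert_entity(conn: sqlite3.Connection, path: str, mtime: float,
                  topics: list[str], summary: str, created_at: str) -> None:
    """INSERT OR REPLACE keyed on path: re-indexing overwrites, never duplicates."""
    conn.execute(
        "INSERT OR REPLACE INTO entities (path, mtime, topics, summary, created_at) "
        "VALUES (?, ?, ?, ?, ?)",
        (path, mtime, json.dumps(topics), summary, created_at),
    )
    conn.commit()
```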

### 7.3 Dead Letter Queue (DLQ)

Files that fail to read or write (encoding errors, permission issues, corrupted
YAML) are moved to `entities/corrupted/`. This:

- Prevents known-bad files from being retried on every run.
- Provides an audit trail for manual inspection.
- Keeps the main pipeline moving -- one bad file does not block 48,999 good ones.

### 7.4 Atomic Writes

All checkpoint and report files use `atomic_dump`: write to a `.tmp` file, then
`os.replace` to the final path. This is an atomic filesystem operation on POSIX
systems. A crash during write leaves either the old file or the new file -- never
a partial, corrupted file.

### 7.5 SQLite Integrity

`index.open_index` runs `PRAGMA integrity_check` at connection time. If the
database is corrupt, it returns `None` and the caller proceeds without the index
(ripgrep search still works). WAL mode provides crash-safe writes and allows
concurrent readers.

---

## 8. CLI Entry Points

Defined in `pyproject.toml`:

| Command         | Module                  | Purpose                          |
|-----------------|-------------------------|----------------------------------|
| `engram-serve`  | `engram.server`         | Start FastAPI/MCP server         |
| `engram-index`  | `engram.librarian:main` | Run batch indexer                |
| `engram-dream`  | `engram.consolidator`   | Run dream cycle consolidation    |

---

## 9. Dependencies

| Dependency  | Role                                      |
|-------------|-------------------------------------------|
| FastAPI     | HTTP server and route definitions          |
| Pydantic    | Input validation and serialization         |
| uvicorn     | ASGI server                                |
| PyYAML      | YAML frontmatter parsing                   |
| ripgrep     | External binary for full-text search       |
| SQLite      | Metadata index (stdlib, no extra install)  |
