# Phase 3: Metadata Indexing — Design Spec

**Goal:** Add a SQLite metadata index that makes date/tag/session_id filtering O(log n) while preserving the filesystem-as-source-of-truth invariant.

**Architecture:** A new `engram.index` module manages a WAL-mode SQLite database (`vault_index.sqlite`) with an FTS5 virtual table for topic search. The Librarian upserts index rows during its existing processing loop. The MCP server queries the index to generate a candidate path whitelist, then passes those paths to ripgrep for the actual search. If the index is missing or corrupt, the server falls back to full-vault ripgrep.

**Tech Stack:** Python `sqlite3` (stdlib), FTS5 (compiled into most CPython builds, with LIKE fallback), existing Pydantic/ripgrep stack.

---

## 1. Storage Format

**SQLite with WAL mode.** A single `vault_index.sqlite` file at `INDEX_PATH` (defined in `engram.config` as `ENGRAM_ROOT / "vault_index.sqlite"`), added to `.gitignore` (derivative data, never version-controlled).

### Schema

```sql
CREATE TABLE entities (
    path       TEXT PRIMARY KEY,
    mtime      REAL NOT NULL,
    topics     TEXT,
    summary    TEXT,
    created_at TEXT
);

CREATE INDEX idx_created ON entities(created_at);
CREATE INDEX idx_mtime   ON entities(mtime);
```

If FTS5 is available:

```sql
CREATE VIRTUAL TABLE entities_fts USING fts5(
    path, topics, summary,
    content='entities', content_rowid='rowid'
);
```

If FTS5 is unavailable, topic queries fall back to `WHERE topics LIKE '%tag%'`.

**FTS5 content-sync:** Because `entities_fts` uses `content='entities'`, FTS5 stores no row content of its own and the index must be kept in sync manually. External-content FTS entries are keyed by the content table's rowid, so on upsert the implementation must:
1. Remove the stale FTS entry via FTS5's `'delete'` command while the old row is still readable from `entities`: `INSERT INTO entities_fts(entities_fts, rowid, path, topics, summary) SELECT 'delete', rowid, path, topics, summary FROM entities WHERE path = ?`
2. `INSERT OR REPLACE INTO entities ...` (update the main table; note that `REPLACE` may assign a new rowid)
3. Rebuild the FTS entry with the matching rowid: `INSERT INTO entities_fts(rowid, path, topics, summary) SELECT rowid, path, topics, summary FROM entities WHERE path = ?`

On reconciliation, a delete from `entities` likewise requires the `'delete'` command against `entities_fts` first, while the old values are still readable for FTS5 to un-index them.
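The upsert steps above can be sketched in Python, matching the `upsert_entity` signature from section 2. Topic serialization is an assumption here (stored as a space-separated string); the `'delete'` command and explicit-rowid insert are the documented FTS5 pattern for external-content tables:

```python
import sqlite3


def upsert_entity(conn: sqlite3.Connection, rel_path: str, mtime: float,
                  topics: list[str], summary: str, created_at: str) -> None:
    """Upsert one entity row and keep the external-content FTS5 index in sync."""
    topics_str = " ".join(topics)  # assumption: topics serialized space-separated
    # 1. Un-index the stale FTS entry while the old row is still in `entities`.
    conn.execute(
        "INSERT INTO entities_fts(entities_fts, rowid, path, topics, summary) "
        "SELECT 'delete', rowid, path, topics, summary FROM entities WHERE path = ?",
        (rel_path,),
    )
    # 2. Replace the main-table row (REPLACE may assign a new rowid).
    conn.execute(
        "INSERT OR REPLACE INTO entities(path, mtime, topics, summary, created_at) "
        "VALUES (?, ?, ?, ?, ?)",
        (rel_path, mtime, topics_str, summary, created_at),
    )
    # 3. Rebuild the FTS entry keyed by the (possibly new) rowid.
    conn.execute(
        "INSERT INTO entities_fts(rowid, path, topics, summary) "
        "SELECT rowid, path, topics, summary FROM entities WHERE path = ?",
        (rel_path,),
    )
```

On a first insert, step 1 is a no-op (the `SELECT` matches nothing), so the same code path serves both insert and update.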

**Connection pragmas:**

```python
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("PRAGMA synchronous=NORMAL")
```

WAL + NORMAL = crash-safe writes + fast commits. A killed process leaves a WAL file that SQLite auto-recovers on next open.

---

## 2. Module: `src/engram/index.py`

Single-responsibility module with four public functions:

| Function | Purpose |
|----------|---------|
| `open_index(path) -> Connection \| None` | Open (or create) the index. Returns `None` if corrupt — caller falls back to ripgrep-only. |
| `upsert_entity(conn, rel_path, mtime, topics, summary, created_at)` | Insert or replace one entity row + FTS5 sync. Called by Librarian per file. |
| `query_index(conn, tags, date_from, date_to) -> list[str]` | Return candidate relative paths matching the structured filters. |
| `reconcile(conn, entities_root) -> int` | Purge orphaned rows where `(entities_root / row.path).exists()` is False. Returns count removed. Commits after purge. |

Internal helpers:
- `_has_fts5(conn) -> bool` — probe FTS5 availability by attempting `CREATE VIRTUAL TABLE _fts5_probe USING fts5(x)` in a try/except, then dropping it. Cached as a module-level bool after first call.
- `_create_schema(conn, use_fts5)` — idempotent schema creation (uses `CREATE TABLE IF NOT EXISTS`).

---

## 3. Ingestion Flow

### 3.1 Return type change for `_process_file()`

Currently `_process_file()` returns `dict[str, int] | None` (topic counts). To surface frontmatter data for indexing, the return type changes to a dataclass:

```python
@dataclasses.dataclass
class ProcessResult:
    topics_delta: dict[str, int]     # existing: topic → count
    topics: list[str]                # full topic list for index
    summary: str                     # frontmatter summary
    created_at: str                  # ISO-8601 date or "" if absent
```

Returns `None` on failure (unchanged). All existing callers that check `result is None` continue to work. Topic-count aggregation uses `result.topics_delta` instead of iterating `result.items()`.

### 3.2 `created_at` extraction

Most legacy files lack a `created_at` frontmatter field. The Librarian adds a `setdefault`:

```python
fm.setdefault("created_at", "")
```

For files without an explicit date, `created_at` is stored as `""` in the index. Date-range queries treat `""` as "no date" and exclude these rows (via `WHERE created_at != '' AND created_at >= ?`). This is an honest representation — we don't fabricate dates.

### 3.3 Normal (incremental)

The Librarian's existing processing loop gains one call after successful processing:

```python
result = _process_file(filepath, ...)
if result is not None:
    upsert_entity(conn, rel, filepath.stat().st_mtime,
                  result.topics, result.summary, result.created_at)
```

Index commits (`conn.commit()`) happen at the same checkpoint interval as the JSON checkpoint (every `--batch-size` files, default 100). A final `conn.commit()` is called before `conn.close()` at the end of `main()` to flush any uncommitted rows: Python's `sqlite3` rolls back, rather than commits, any open transaction when a connection is closed.

### 3.4 Connection lifecycle

The Librarian opens **one connection** at the start of `main()` and closes it at the end. Commits happen at each checkpoint interval (every N files). No connection pooling, no per-file open/close. Regular commits also let SQLite's automatic WAL checkpointing (by default every 1000 WAL pages) keep the WAL file bounded during long batch runs.

### 3.5 Full rebuild (`--full-index`)

1. Delete the `vault_index.sqlite` file entirely (cleanest reset, avoids schema drift).
2. `open_index()` recreates a fresh database with the current schema.
3. All files are reprocessed (existing behavior) and re-indexed.

### 3.6 Reconciliation

After the main loop, call `reconcile(conn, ENTITIES_DIR)`. This iterates all index rows, checks `os.path.exists()` for each, and deletes orphans. Cost: one `stat()` per row (~50ms at 50k files on NVMe).
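The reconciliation pass can be sketched as follows (FTS5 cleanup, described in section 1, is omitted here for brevity):

```python
import sqlite3
from pathlib import Path


def reconcile(conn: sqlite3.Connection, entities_root: Path) -> int:
    """Delete index rows whose backing file no longer exists; return count removed."""
    orphans = [
        row[0]
        for row in conn.execute("SELECT path FROM entities")
        if not (entities_root / row[0]).exists()  # one stat() per row
    ]
    conn.executemany("DELETE FROM entities WHERE path = ?", [(p,) for p in orphans])
    conn.commit()  # spec: commit after purge
    return len(orphans)
```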

---

## 4. Search Flow

### 4.1 Unstructured query (no tags/date filters)

Unchanged. Ripgrep across full `entities/`. Index not consulted.

### 4.1a Metadata-only query (tags/dates but no text query)

When `query=""` with tags or dates present, the server skips ripgrep entirely:

1. `query_index()` returns candidate paths.
2. Filter through `os.path.exists()`.
3. Read the first N lines of each matching file (frontmatter + first paragraph).
4. Return concatenated results (capped at `MAX_RESPONSE_CHARS`).

No ripgrep call is made — the index result IS the answer. This is the key performance win for structured queries.
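The assembly step can be sketched as below. `MAX_RESPONSE_CHARS` exists per the spec but its value here is illustrative, and the 20-line head is an assumed approximation of "frontmatter + first paragraph":

```python
from pathlib import Path

MAX_RESPONSE_CHARS = 20_000  # illustrative; the real value lives in engram.config
HEAD_LINES = 20              # assumed: enough for frontmatter + first paragraph


def metadata_only_answer(candidates: list[str], entities_root: Path,
                         cap: int = MAX_RESPONSE_CHARS) -> str:
    """Assemble a response from index candidates without invoking ripgrep."""
    chunks: list[str] = []
    total = 0
    for rel in candidates:
        path = entities_root / rel
        if not path.exists():  # query-time stale-cache guard
            continue
        head = "\n".join(path.read_text(encoding="utf-8").splitlines()[:HEAD_LINES])
        chunk = f"--- {rel} ---\n{head}\n"
        if total + len(chunk) > cap:  # stop once the response cap is reached
            break
        chunks.append(chunk)
        total += len(chunk)
    return "".join(chunks)
```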

### 4.2 Structured query (tags and/or date filters present)

1. `query_index(conn, tags, date_from, date_to)` returns candidate relative paths.
2. Filter candidates through `os.path.exists()` (query-time stale-cache guard).
3. Pass surviving paths to ripgrep. Since `rg --glob` takes glob patterns (not literal paths), the implementation uses **one `--glob` per candidate** with the exact filename: `--glob "session_042.md"`. For large candidate sets (>100), pass the candidate paths to `rg` as explicit positional file arguments instead of globs, chunking the invocation if needed to stay under the OS argument-length limit.
4. If index is `None` (missing/corrupt), skip steps 1-3 and run full-vault ripgrep.
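A sketch of the argv construction for step 3 (pure command-building, no invocation; the threshold, flags beyond `--glob`, and chunk-free large-set handling are assumptions):

```python
from pathlib import Path

GLOB_THRESHOLD = 100  # assumed cutoff: beyond this, use explicit file arguments


def build_rg_command(pattern: str, candidates: list[str],
                     entities_root: Path) -> list[str]:
    """Build the ripgrep argv for a structured query's candidate whitelist."""
    if len(candidates) <= GLOB_THRESHOLD:
        cmd = ["rg", "--no-heading", "--line-number"]
        for rel in candidates:
            # One exact-filename glob per candidate; a slash-free glob
            # matches against the basename at any depth.
            cmd += ["--glob", Path(rel).name]
        return cmd + [pattern, str(entities_root)]
    # Large sets: explicit positional file arguments, no glob matching at all.
    return ["rg", "--no-heading", "--line-number", pattern] + [
        str(entities_root / rel) for rel in candidates
    ]
```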

### 4.3 Graceful degradation

```python
def _open_index(path):
    try:
        conn = sqlite3.connect(str(path))
        # Probe the expected schema; raises sqlite3.DatabaseError if the
        # file is corrupt, truncated, or not a SQLite database at all.
        conn.execute("SELECT 1 FROM entities LIMIT 1")
        return conn
    except sqlite3.DatabaseError:  # covers OperationalError, a subclass
        return None
```

The system never fails because the index is unavailable.

---

## 5. MCP Schema Extension

Three new optional fields on `SessionSearchInput`:

```python
query: str = Field(
    default="",           # CHANGED: was required (min_length=1)
    max_length=500,
    description="Search term (regex OK). Empty string with tags/dates = metadata-only query.",
)
tags: list[str] | None = Field(default=None, description="Filter by topic tags")
date_from: str | None = Field(default=None, description="ISO-8601 date lower bound")
date_to: str | None = Field(default=None, description="ISO-8601 date upper bound")
```

**Contract change:** `query` becomes optional (default `""`). This enables metadata-only queries like "all sessions tagged #Jetson" without requiring a text search term. A custom validator ensures at least one of `query`, `tags`, `date_from`, `date_to`, or `session_id` is provided — fully empty requests are still rejected.

```python
@model_validator(mode="after")
def at_least_one_filter(self) -> "SessionSearchInput":
    if not any((self.query, self.tags, self.date_from, self.date_to, self.session_id)):
        raise ValueError(
            "At least one of query, tags, date_from, date_to, or session_id is required"
        )
    return self
```

Date fields validated with `field_validator` to reject non-ISO-8601 strings (`YYYY-MM-DD` format).

Backward compatible — existing calls with `query` and optional `session_id` work identically.

---

## 6. Stale-Cache Mitigation

Two-layer defense:

| Layer | When | What |
|-------|------|------|
| Batch reconciliation | End of every `engram-index` run | `reconcile()` purges orphaned index rows |
| Query-time existence check | Every structured query | `os.path.exists()` on each candidate before passing to ripgrep |

**Accepted eventual consistency:** If a file's frontmatter is edited outside the pipeline, the index will be stale until the next `engram-index` run. This is acceptable — the alternative (mtime comparison on every query-time candidate) adds latency for a scenario that shouldn't happen in normal operation.

---

## 7. Atomic Safety

SQLite WAL mode handles crash safety natively:

| Failure | Recovery |
|---------|----------|
| Process killed mid-write | WAL journal replays on next `sqlite3.connect()` |
| Index file deleted | Librarian recreates on next `engram-index` run |
| Index corrupted beyond repair | `open_index()` returns `None`; server uses ripgrep-only |

---

## 8. Files Changed

| File | Action |
|------|--------|
| `src/engram/index.py` | **Create** — SQLite index module |
| `src/engram/config.py` | **Modify** — add `INDEX_PATH` constant |
| `src/engram/librarian.py` | **Modify** — integrate `upsert_entity()` and `reconcile()`; refactor `_process_file()` return type to `ProcessResult` |
| `src/engram/models.py` | **Modify** — add `tags`, `date_from`, `date_to` fields; make `query` optional; add `model_validator` |
| `src/engram/server.py` | **Modify** — open index connection at startup; query index for structured filters; generate path whitelist for ripgrep; handle metadata-only queries |
| `tests/test_index.py` | **Create** — index module tests |
| `tests/test_server.py` | **Modify** — update tests for optional `query` field |
| `tests/test_librarian.py` | **Modify** — update tests for `ProcessResult` return type |
| `.gitignore` | **Modify** — add `vault_index.sqlite` |

---

## 9. Testing Strategy

| Test | What it verifies |
|------|-----------------|
| `test_empty_index` | `query_index` on empty DB returns `[]`, no crash |
| `test_upsert_and_query` | Insert entity, query by tag → returns correct path |
| `test_reconcile_removes_orphans` | Delete file on disk, reconcile → row removed from index |
| `test_fts5_fallback` | When FTS5 unavailable, tag queries still work via LIKE |
| `test_graceful_degradation` | Corrupt/missing index file → `open_index` returns None |
| `test_date_range_filtering` | Query with date_from/date_to returns correct subset |

---

## 10. What This Does NOT Do

- Does not replace ripgrep. The index narrows; ripgrep searches.
- Does not add external dependencies. `sqlite3` is stdlib.
- Does not change the Librarian's execution model (single-threaded, sync, generator-based).
- Does not require the index to exist. Graceful degradation to current behavior.
- Does not trust the index blindly. Query-time existence checks + batch reconciliation.
