# Engram Operations Runbook

**Last updated:** 2026-04-06
**Audience:** Anyone operating Engram in production (desktop, Jetson, or both)

---

## Table of Contents

1. [Service Architecture](#1-service-architecture)
2. [Starting and Stopping](#2-starting-and-stopping)
3. [Health Checks](#3-health-checks)
4. [Backup and Restore](#4-backup-and-restore)
5. [Disaster Recovery](#5-disaster-recovery)
6. [Performance Baselines](#6-performance-baselines)
7. [Capacity Planning](#7-capacity-planning)
8. [Troubleshooting](#8-troubleshooting)
9. [Maintenance Tasks](#9-maintenance-tasks)
10. [Monitoring and Alerting](#10-monitoring-and-alerting)

---

## 1. Service Architecture

### What's Running

| Component | Port | Process | Purpose |
|-----------|------|---------|---------|
| Engram Hub (Docker) | 8000 | `uvicorn engram.server:app` | System-level server, Docker container `engram-hub` |
| Engram Hub (user) | 8001 | `python -m uvicorn engram.server:app` | User-level server for MCP tools |
| MCP SSE | 8001/mcp/sse | Mounted inside user server | Claude Code / agent MCP endpoint |

**Which is canonical?** The user process on :8001 is the primary for MCP tool use. The Docker container on :8000 is the system-level service for dashboard and remote API access. Both share the same data directory.

### Data Locations

| Data | Path | Size (approx) | Backing |
|------|------|---------------|---------|
| Entity files | `~/engram/entities/` | ~3.5 GB | 49K+ Markdown files |
| SQLite metadata index | `~/engram/vault_index.sqlite` | ~81 MB | FTS5 full-text search |
| Knowledge graph DB | `~/engram/memory_store.db` | ~37 KB | kg_entities + kg_edges tables |
| Worker tracking | `~/engram/workers.sqlite` | ~12 KB | Jetson fleet heartbeats |
| Version snapshots | `~/engram/entities/.versions/` | Varies | Immutable content snapshots |
| Archive (deleted/merged) | `~/engram/entities/.archive/` | Varies | Soft-deleted entities |
| Quarantine (corrupted) | `~/engram/entities/corrupted/` | Varies | Files that failed parsing |
| Logs | `~/engram/logs/` | Varies | Text + JSONL logs |
| Audit log | `~/engram/logs/audit.jsonl` | Grows | Every save/update/delete |
| Telemetry | `~/engram/telemetry/` | Varies | Consolidation reports, agent analytics |
| Encryption keys | `~/engram/.encryption/` | ~100 bytes | Fernet key (if enabled) |
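The audit log is append-only JSONL, so it is easy to summarize from Python. A minimal sketch, assuming each record carries an `action` field (`save`, `update`, `delete`); verify the key name against your actual audit schema before depending on it:

```python
# Summarize the audit log by action type. The "action" field name is an
# assumption about the audit schema -- adjust it if records differ.
import json
from collections import Counter
from pathlib import Path

def summarize_audit(path: str) -> Counter:
    counts = Counter()
    with open(Path(path).expanduser(), encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip truncated or partially written lines
            counts[record.get("action", "unknown")] += 1
    return counts

if __name__ == "__main__":
    print(summarize_audit("~/engram/logs/audit.jsonl"))
```

Skipping unparseable lines matters here: a crash mid-write can leave a truncated final line, and the summary should not fail on it.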

### Configuration

All configuration lives in `src/engram/config.py`; there is no external config file. Environment variables override the defaults:

| Variable | Default | Purpose |
|----------|---------|---------|
| `ENGRAM_API_TOKEN` | None (no auth) | Bearer token for API endpoints |
| `ENGRAM_HOST` | 127.0.0.1 | Server bind address |
| `ENGRAM_PORT` | 8000 | Server port |
| `ENGRAM_CORS_ORIGINS` | localhost:3000 | Comma-separated allowed origins |
| `ENGRAM_AGENT_ID` | anonymous | Agent identity for audit/ownership |
| `ENGRAM_ENCRYPTION_KEY` | None | Fernet encryption key (base64) |
| `ANTHROPIC_API_KEY` | None | For KG entity extraction + reflection |

---

## 2. Starting and Stopping

### User Process (Primary for MCP)

```bash
# Start
cd ~/engram
python -m uvicorn engram.server:app --host 0.0.0.0 --port 8001 &

# Stop
pkill -f "uvicorn engram.server:app.*8001"

# Restart (pkill exits non-zero when nothing matched, so don't chain it with &&)
pkill -f "uvicorn engram.server:app.*8001"; sleep 2
cd ~/engram && python -m uvicorn engram.server:app --host 0.0.0.0 --port 8001 &
```

### Systemd Service

```bash
# If engram-server.service is installed:
sudo systemctl start engram-server
sudo systemctl stop engram-server
sudo systemctl restart engram-server
sudo systemctl status engram-server
journalctl -u engram-server -f   # follow logs
```

### Docker Container

```bash
# Status
docker ps | grep engram

# Restart
docker restart engram-hub

# Stop
docker stop engram-hub

# Logs
docker logs engram-hub --tail 50 -f
```

### Safe Restart (avoid interrupting consolidation)

Before restarting, check if consolidation is running:

```bash
# Check metrics for consolidation state
curl -s localhost:8001/metrics | grep consolidation_running

# If engram_consolidation_running = 1.0, wait for it to finish
# Or check logs:
tail -20 ~/engram/logs/consolidator.log
```

If you must restart during consolidation, it's safe — consolidation uses atomic writes and checkpointing. The next run will re-process any incomplete batch.
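For scripted restarts, the wait can be automated. A sketch that polls `/metrics` until the consolidation flag clears; the metric name comes from the check above, and the timeout is an arbitrary default:

```python
# Poll /metrics and wait for engram_consolidation_running to leave 1.0
# before restarting. Parses the Prometheus text exposition format.
import time
import urllib.request

def metric_value(text, name):
    """Return the value of a metric from Prometheus text output, or None."""
    for line in text.splitlines():
        if line.startswith(name):
            try:
                return float(line.split()[-1])
            except ValueError:
                pass
    return None

def wait_for_consolidation(url="http://localhost:8001/metrics", timeout=600):
    deadline = time.time() + timeout
    while time.time() < deadline:
        with urllib.request.urlopen(url, timeout=5) as resp:
            text = resp.read().decode()
        if metric_value(text, "engram_consolidation_running") != 1.0:
            return True
        time.sleep(10)
    return False  # still running after the timeout; restart is safe anyway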

---

## 3. Health Checks

### Quick Check

```bash
curl -s localhost:8001/health | python3 -m json.tool
```

Expected healthy response:
```json
{
  "status": "healthy",
  "version": "1.0.0",
  "checks": {
    "entities_dir": true,
    "index_rows": 48523,
    "index_readable": true,
    "ripgrep_installed": true,
    "disk_free_gb": 45.23,
    "index_age_seconds": 3600
  }
}
```

### Status Meanings

| Status | Meaning | Action |
|--------|---------|--------|
| `healthy` | All subsystems operational | None |
| `degraded` | Missing index, no ripgrep, or entities dir gone | Check `checks` for details |
| `warning` | Disk space < 1 GB | Free space immediately |

### Metrics

```bash
curl -s localhost:8001/metrics
```

Key metrics to watch:

| Metric | Normal Range | Concern If |
|--------|-------------|------------|
| `engram_entity_count` | 48K-50K | Drops suddenly (data loss) |
| `engram_disk_free_gb` | > 10 GB | < 2 GB |
| `engram_index_age_seconds` | < 86400 (1 day) | > 604800 (1 week) |
| `engram_search_latency_ms` (p95) | < 500 ms keyword, < 2000 ms hybrid | > 5000 ms |
| `engram_quarantine_count` | 0-10 | > 100 (mass corruption) |
| `engram_archive_count` | Grows slowly | Sudden spike (runaway consolidation) |
| `engram_consolidation_duration_ms` | < 60000 (1 min) | > 300000 (5 min) |
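For automation, the thresholds above can be checked programmatically. This sketch assumes you have already parsed `/metrics` into a name-to-value dict; the thresholds are transcribed from this table and should be kept in sync with it:

```python
# Compare scraped metric values against the runbook thresholds.
# ("min", x) means alert when value < x; ("max", x) when value > x.
THRESHOLDS = {
    "engram_disk_free_gb": ("min", 2.0),
    "engram_index_age_seconds": ("max", 604800),
    "engram_quarantine_count": ("max", 100),
}

def check_metrics(metrics):
    """Return a list of (name, value, bound) threshold violations."""
    problems = []
    for name, (kind, bound) in THRESHOLDS.items():
        value = metrics.get(name)
        if value is None:
            continue  # metric absent from this scrape; not a violation
        if (kind == "min" and value < bound) or (kind == "max" and value > bound):
            problems.append((name, value, bound))
    return problems
```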

---

## 4. Backup and Restore

### What to Back Up

**Critical (data loss = unrecoverable):**
- `~/engram/entities/` — all memory files (3.5 GB)
- `~/engram/entities/.versions/` — version snapshots
- `~/engram/.encryption/` — Fernet key (if encryption enabled)

**Important (can be rebuilt, but slow):**
- `~/engram/vault_index.sqlite` — metadata index (rebuild with `engram-index`)
- `~/engram/memory_store.db` — knowledge graph (rebuild with `graph_ingest`)

**Nice to have:**
- `~/engram/logs/audit.jsonl` — audit trail
- `~/engram/telemetry/` — consolidation reports

### Manual Backup

```bash
# Full backup (compressed)
BACKUP_DATE=$(date +%Y%m%d_%H%M)
tar czf ~/engram-backup-${BACKUP_DATE}.tar.gz \
  --exclude='entities/.archive' \
  ~/engram/entities/ \
  ~/engram/vault_index.sqlite \
  ~/engram/memory_store.db \
  ~/engram/.encryption/ \
  ~/engram/logs/audit.jsonl

echo "Backup: ~/engram-backup-${BACKUP_DATE}.tar.gz"
ls -lh ~/engram-backup-${BACKUP_DATE}.tar.gz
```

### Automated Backup (Cron)

Add to crontab (`crontab -e`):

```cron
# Daily Engram backup at 3 AM
0 3 * * * BACKUP_DATE=$(date +\%Y\%m\%d) && tar czf ~/backups/engram-${BACKUP_DATE}.tar.gz --exclude='entities/.archive' ~/engram/entities/ ~/engram/vault_index.sqlite ~/engram/memory_store.db ~/engram/.encryption/ && find ~/backups/engram-*.tar.gz -mtime +7 -delete
```

This creates daily backups and deletes backups older than 7 days.
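A backup is only useful if it restores, so verify archives periodically. A stdlib-only sketch that confirms the archive opens and actually contains entity files:

```python
# Verify a backup archive: confirm it opens, count entity files, and
# check the SQLite index was included.
import tarfile

def verify_backup(path):
    with tarfile.open(path, "r:gz") as tar:
        names = tar.getnames()
    entities = [n for n in names if "/entities/" in n and n.endswith(".md")]
    return {
        "total_members": len(names),
        "entity_files": len(entities),
        "has_index": any(n.endswith("vault_index.sqlite") for n in names),
    }
```

Run it against last night's archive and alert if `entity_files` is far below the expected ~49K.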

### Restore from Backup

```bash
# 1. Stop Engram
pkill -f "uvicorn engram.server:app"
docker stop engram-hub 2>/dev/null

# 2. Move current data aside (don't delete yet)
mv ~/engram ~/engram.broken.$(date +%s)

# 3. Restore from backup
mkdir -p ~/engram
cd ~/engram
# Archive members are stored as home/USER/engram/... (tar strips the
# leading "/" on create), so drop the first three path components:
tar xzf ~/engram-backup-YYYYMMDD_HHMM.tar.gz --strip-components=3

# 4. Rebuild index if not included in backup
cd ~/engram && python -m engram.librarian

# 5. Restart
cd ~/engram && python -m uvicorn engram.server:app --host 0.0.0.0 --port 8001 &

# 6. Verify
curl -s localhost:8001/health | python3 -m json.tool
```

### Remote Backup (Google Drive via rclone)

```bash
# If rclone is configured:
rclone sync ~/engram/entities/ gdrive:engram-backup/entities/ \
  --exclude=".archive/**" \
  --transfers=4 \
  --progress

rclone copy ~/engram/vault_index.sqlite gdrive:engram-backup/
rclone copy ~/engram/memory_store.db gdrive:engram-backup/
```

**WARNING:** Do NOT background `rclone authorize` during OAuth — it causes Google security loops (known issue).

---

## 5. Disaster Recovery

### Scenario: SQLite Index Corrupted

**Symptoms:** `/health` shows `index_readable: false`, search returns 503.

**Recovery:**

```bash
# 1. Check corruption
sqlite3 ~/engram/vault_index.sqlite "PRAGMA integrity_check;"
# If "ok" → not corrupted, likely permissions. Check: ls -la ~/engram/vault_index.sqlite

# 2. If corrupted, move aside and rebuild
mv ~/engram/vault_index.sqlite ~/engram/vault_index.sqlite.corrupted
cd ~/engram && python -m engram.librarian
# This rebuilds the index from the entity Markdown files (source of truth)
# Takes ~5-10 minutes for 49K files

# 3. Verify
curl -s localhost:8001/health | python3 -m json.tool
```

**Time to recovery:** 5-10 minutes. No data loss — entities are the source of truth, index is derived.
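The same check is scriptable for cron-driven verification, ahead of any visible symptoms:

```python
# Programmatic version of the integrity check above; returns True when
# SQLite reports "ok". Read-only, so safe to run while the server is up.
import sqlite3

def sqlite_ok(path):
    conn = sqlite3.connect(path)
    try:
        result = conn.execute("PRAGMA integrity_check;").fetchone()[0]
    finally:
        conn.close()
    return result == "ok"
```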

### Scenario: Entity Files Deleted

**Symptoms:** `engram_entity_count` drops, searches return nothing.

**Recovery (from backup):**

```bash
# 1. Restore entities from most recent backup
tar xzf ~/backups/engram-YYYYMMDD.tar.gz -C ~/engram/ --strip-components=3 --wildcards '*/entities/*'

# 2. Rebuild index
cd ~/engram && python -m engram.librarian

# 3. Rebuild vector index (if using semantic search)
cd ~/engram && python -m engram.vector
```

**Recovery (from archive):** If files were archived (not deleted), they're in `entities/.archive/`:

```bash
# List archived files
ls ~/engram/entities/.archive/ | head -20

# Restore specific file
mv ~/engram/entities/.archive/filename.md ~/engram/entities/

# Restore all (careful — includes intentionally consolidated files)
mv ~/engram/entities/.archive/*.md ~/engram/entities/
```

### Scenario: Disk Full

**Symptoms:** Saves fail, server returns 500, `disk_free_gb: 0`.

**Immediate actions:**

```bash
# 1. Check what's using space
du -sh ~/engram/entities/ ~/engram/entities/.archive/ ~/engram/entities/.versions/ ~/engram/logs/

# 2. Quick wins (safe to delete):
# Archive: merged/pruned files, safe to remove
rm -rf ~/engram/entities/.archive/*

# Old version snapshots: keep the 20 most recent per entity, delete the rest
cd ~/engram && python -c "
from engram.versions import prune_versions
from engram.config import INDEX_PATH
import sqlite3
conn = sqlite3.connect(str(INDEX_PATH))
paths = [r[0] for r in conn.execute('SELECT DISTINCT entity_path FROM entity_versions').fetchall()]
conn.close()
total = 0
for p in paths:
    total += prune_versions(p, max_versions=20)
print(f'Pruned {total} old snapshots')
"

# Old logs:
find ~/engram/logs/ -name "*.log" -mtime +7 -delete

# 3. Verify space freed
df -h ~/engram/
```

### Scenario: Knowledge Graph Database Corrupted

**Symptoms:** `graph_query` returns empty, `graph_ingest` fails.

**Recovery:**

```bash
# 1. Check corruption
sqlite3 ~/engram/memory_store.db "PRAGMA integrity_check;"

# 2. If corrupted, rebuild from entities
mv ~/engram/memory_store.db ~/engram/memory_store.db.corrupted

# Re-ingest entities by walking the vault (capped at 1000 to limit API cost)
cd ~/engram && python -c "
from engram.graph import _get_graph
from engram.config import ENTITIES_DIR
import os

kg = _get_graph()
count = 0
for root, dirs, files in os.walk(str(ENTITIES_DIR)):
    if '.archive' in dirs:
        dirs.remove('.archive')
    for f in files:
        if f.endswith('.md'):
            path = os.path.join(root, f)
            rel = os.path.relpath(path, str(ENTITIES_DIR))
            content = open(path, encoding='utf-8', errors='ignore').read()
            kg.ingest(rel, content)
            count += 1
            if count % 100 == 0:
                print(f'Ingested {count} entities...')
            if count >= 1000:  # limit to avoid API costs
                break
    if count >= 1000:
        break
kg.close()
print(f'Done. Ingested {count} entities.')
"
```

**Note:** Full KG rebuild requires ANTHROPIC_API_KEY and costs API tokens. Rebuild selectively.

### Scenario: Encryption Key Lost

**If `~/engram/.encryption/engram.key` is deleted:**

- Plaintext entities: unaffected, still readable
- Encrypted entities (`.enc` files in `.versions/`): **unrecoverable**
- Encrypted version snapshots: lost forever

**Prevention:** Back up `~/engram/.encryption/` to a secure offsite location. This is the ONE file that, if lost, causes permanent data loss.
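A quick way to confirm the offsite copy still matches the live key, comparing SHA-256 digests rather than key bytes directly (the backup path is an example, not a prescribed location):

```python
# Compare the live Fernet key against an offsite copy by digest, so the
# check can log results without ever printing key material.
import hashlib
from pathlib import Path

def same_key(live, backup):
    def digest(p):
        return hashlib.sha256(Path(p).expanduser().read_bytes()).hexdigest()
    return digest(live) == digest(backup)

# Example: same_key("~/engram/.encryption/engram.key", "/mnt/offsite/engram.key")
```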

### Scenario: Server Won't Start

**Checklist:**

```bash
# 1. Port already in use?
ss -tlnp | grep -E '800[01]'
# Kill conflicting process:
# fuser -k 8001/tcp

# 2. Python path issue?
cd ~/engram && python -c "import engram; print(engram.__file__)"

# 3. Missing dependency?
cd ~/engram && pip install -e .

# 4. Corrupted __pycache__?
find ~/engram -name __pycache__ -type d -exec rm -rf {} + 2>/dev/null
cd ~/engram && python -m uvicorn engram.server:app --host 0.0.0.0 --port 8001

# 5. Check logs for the actual error
tail -50 ~/engram/logs/*.log
```
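For scripted pre-flight checks, step 1 of the checklist can be done from Python without `ss` or `fuser`:

```python
# Check whether a port is already bound before starting the server.
# connect_ex returns 0 when something on the host accepted the connection.
import socket

def port_in_use(port, host="127.0.0.1"):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1)
        return s.connect_ex((host, port)) == 0
```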

---

## 6. Performance Baselines

Measured on desktop (AMD/Intel, NVMe SSD, 32GB RAM) with ~49K entities.

### Search Latency (Expected)

| Mode | p50 | p95 | p99 | Notes |
|------|-----|-----|-----|-------|
| Keyword (ripgrep) | 30-50 ms | 100-200 ms | 500 ms | Depends on query selectivity |
| Semantic (vector KNN) | 100-200 ms | 500 ms | 1500 ms | First query loads model (~3s) |
| Hybrid (dense+sparse) | 200-400 ms | 800 ms | 2000 ms | Both pipelines + merge |
| Pyramid | 300-600 ms | 1500 ms | 3000 ms | Hybrid + progressive expansion |
| KG retrieval | 400-800 ms | 2000 ms | 5000 ms | BFS traversal + hybrid merge |

### Indexing Throughput

| Operation | Speed | Notes |
|-----------|-------|-------|
| Librarian (metadata index) | ~5K files/min | CPU-bound, YAML parsing |
| Vector bulk index | ~500 files/min | Model inference, batch size 256 |
| KG ingest (per file) | ~2-3 sec/file | API call to Anthropic |

### Memory Usage

| Component | RSS | Notes |
|-----------|-----|-------|
| Server (idle) | ~80 MB | FastAPI + uvicorn |
| Server (during search) | ~120 MB | ripgrep subprocess |
| Server (semantic search) | ~350 MB | FastEmbed model loaded |
| Consolidation (running) | ~200-500 MB | Depends on batch size |

### Disk I/O

| Operation | Pattern |
|-----------|---------|
| Keyword search | Sequential read (ripgrep streams) |
| Semantic search | Random read (KNN results → file reads) |
| Save memory | Single file write + index update |
| Consolidation | Batch read + merge write + archive move |

### When to Worry

- Keyword search > 1 second consistently → check disk I/O, ripgrep version
- Semantic search > 5 seconds → vector index may need rebuild
- Entity count drops > 5% between checks → investigate immediately
- Consolidation > 5 minutes → check batch size, entity count growth

---

## 7. Capacity Planning

### Current State (as of 2026-04-06)

| Metric | Value |
|--------|-------|
| Total entities | ~49,000 |
| Total storage | ~3.5 GB |
| SQLite index | ~81 MB |
| Average entity size | ~70 KB |
| Growth rate | ~50-100 entities/day (varies with usage) |

### Projections

| Timeframe | Estimated Entities | Estimated Storage | Action Needed |
|-----------|-------------------|-------------------|---------------|
| 6 months | ~58K | ~4.2 GB | None |
| 1 year | ~67K | ~4.8 GB | None |
| 2 years | ~85K | ~6.0 GB | Consider archive pruning |
| 5 years | ~130K | ~9.5 GB | SQLite performance review |
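The projections are straight-line extrapolations from the current state. A worked check at the low end of the observed growth rate:

```python
# Linear growth projection from the current-state table. At 50/day:
# 49,000 + 50*365 = 67,250 entities after one year, ~4.5 GB at the
# ~70 KB average (the table's ~4.8 GB allows for some growth in
# average entity size).
CURRENT_ENTITIES = 49_000
AVG_ENTITY_KB = 70
RATE_PER_DAY = 50

def project(days):
    entities = CURRENT_ENTITIES + RATE_PER_DAY * days
    storage_gb = entities * AVG_ENTITY_KB / 1024 / 1024
    return entities, round(storage_gb, 1)
```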

### Known Limits

| Component | Soft Limit | Hard Limit | Symptom |
|-----------|-----------|------------|---------|
| SQLite FTS5 | ~500K rows | ~2M rows | Search latency degrades |
| sqlite-vec (KNN) | ~100K vectors | ~1M vectors | Memory pressure |
| Filesystem (ext4) | ~1M files/dir | ~10M | `ls` and `os.walk` slow |
| ripgrep | ~1M files | Unknown | Timeout at 30s |
| Single entity size | ~1 MB | ~10 MB | Frontmatter parsing slow |

### Scaling Strategies (When Needed)

1. **Archive pruning schedule** — Run monthly, delete `.archive/` files older than 90 days
2. **Entity directory sharding** — Move from flat to `entities/YYYY/MM/` hierarchy (requires librarian update)
3. **Vector index partitioning** — Split by date range, search recent first
4. **SQLite WAL checkpoint** — Run `PRAGMA wal_checkpoint(TRUNCATE)` weekly to reclaim WAL space
5. **Read replica** — For multi-device setups, sync a read-only copy to Jetson via rsync
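Strategy 4 can be run from Python on hosts without the `sqlite3` CLI:

```python
# Checkpoint and truncate the WAL file to reclaim its disk space.
# PRAGMA wal_checkpoint returns (busy, wal_pages, checkpointed_pages);
# busy=0 means the checkpoint was not blocked by other connections.
import sqlite3

def checkpoint(path):
    conn = sqlite3.connect(path)
    try:
        return conn.execute("PRAGMA wal_checkpoint(TRUNCATE);").fetchone()
    finally:
        conn.close()
```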

---

## 8. Troubleshooting

### Common Issues

**"ripgrep (rg) not installed"**
```bash
# Ubuntu/Debian
sudo apt install ripgrep

# Verify
rg --version
```

**"sqlite3.OperationalError: database is locked"**
```bash
# Check who has the database open
fuser ~/engram/vault_index.sqlite

# Force WAL checkpoint to release locks
sqlite3 ~/engram/vault_index.sqlite "PRAGMA wal_checkpoint(TRUNCATE);"
```

**"FastEmbed model download failed"**
```bash
# First-time use downloads BAAI/bge-small-en-v1.5 (~130 MB)
# If behind proxy or no internet:
# 1. Download model manually on a connected machine
# 2. Place in ~/.cache/fastembed/

# Or disable vector search and use keyword mode only
```

**"ANTHROPIC_API_KEY not set — skipping entity extraction"**
This is informational. KG features require the API key. Without it, graph tools return empty results but everything else works.

**MCP connection refused from Claude Code**
```bash
# Check server is running
curl -s localhost:8001/health

# Check .mcp.json points to correct URL
cat ~/Jetson/.mcp.json
# Should show: "url": "http://127.0.0.1:8001/mcp/sse"

# Check Claude Code settings enable engram
cat ~/Jetson/.claude/settings.local.json
# Should include: "enabledMcpjsonServers": ["engram"]
```

**Consolidation produces empty report**
```bash
# Check the entities directory has files (including subdirectories)
find ~/engram/entities -name "*.md" -not -path "*/.archive/*" -not -path "*/.versions/*" | wc -l

# Check consolidation log
tail -50 ~/engram/logs/consolidator.log

# Run manually with debug logging
cd ~/engram && python -c "
import logging
logging.basicConfig(level=logging.DEBUG)
from engram.consolidator import MemoryConsolidator
c = MemoryConsolidator()
c.run(batch_size=100)
print(c.report)
"
```

**Search returns truncated results**
Results are capped at `MAX_RESPONSE_CHARS = 4000`. This is by design to prevent overwhelming MCP responses. If you need full results, use the REST API directly with a custom limit.
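If you do need more than the MCP cap, a direct REST call with a custom limit looks roughly like this. The `/search` path and the `q`/`limit` parameter names are ASSUMPTIONS; verify them against the route definitions in `engram/server.py` before relying on this:

```python
# Build a direct REST search URL with a custom result limit. Endpoint
# path and parameter names are assumptions, not confirmed API.
from urllib.parse import urlencode

def search_url(query, limit=50, base="http://localhost:8001"):
    return f"{base}/search?" + urlencode({"q": query, "limit": limit})

# Fetch with: urllib.request.urlopen(search_url("jetson heartbeat", 200))
```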

---

## 9. Maintenance Tasks

### Daily (Automated via Cron/Systemd Timer)

| Task | Command | Schedule |
|------|---------|----------|
| Backup entities + index | See backup section | 3:00 AM |
| Log rotation | `find ~/engram/logs/ -name "*.log" -mtime +7 -delete` | 4:00 AM |
| Audit log rotation | `gzip -S ".$(date +\%Y\%m\%d).gz" ~/engram/logs/audit.jsonl && touch ~/engram/logs/audit.jsonl` | Weekly |

### Weekly

| Task | Command | Why |
|------|---------|-----|
| WAL checkpoint | `sqlite3 ~/engram/vault_index.sqlite "PRAGMA wal_checkpoint(TRUNCATE);"` | Reclaim WAL disk space |
| Version snapshot pruning | See disk full recovery section | Prevent version bloat |
| Archive cleanup | `find ~/engram/entities/.archive/ -type f -mtime +30 -delete` | Free disk space |

### Monthly

| Task | Command | Why |
|------|---------|-----|
| Vector index rebuild | `cd ~/engram && python -m engram.vector` | Catch unindexed new entities |
| Metadata index reconciliation | `cd ~/engram && python -m engram.librarian` | Fix stale/orphan index entries |
| Disk usage review | `du -sh ~/engram/entities/ ~/engram/entities/.versions/ ~/engram/entities/.archive/` | Capacity planning |

### Quarterly

| Task | Command | Why |
|------|---------|-----|
| Full consolidation | Run Dream Cycle with large batch | Merge accumulated duplicates |
| KG rebuild (selective) | Re-ingest high-importance entities | Refresh entity relationships |
| Performance baseline check | Run `scripts/benchmark.py` | Detect degradation trends |
| Dependency update | `pip install -U engram` or `pip install -e .` | Security patches |

---

## 10. Monitoring and Alerting

### Prometheus Scrape Config

If you have Prometheus running:

```yaml
scrape_configs:
  - job_name: 'engram'
    scrape_interval: 60s
    static_configs:
      - targets: ['localhost:8001']
    metrics_path: '/metrics'
```

### Manual Monitoring Script

Save as `~/engram/scripts/health_check.sh`:

```bash
#!/bin/bash
# Quick health check — run from cron or manually

HEALTH=$(curl -s --max-time 5 localhost:8001/health 2>/dev/null)
if [ $? -ne 0 ]; then
    echo "CRITICAL: Engram not responding on :8001"
    exit 2
fi

STATUS=$(echo "$HEALTH" | python3 -c "import sys,json; print(json.load(sys.stdin).get('status','unknown'))" 2>/dev/null)
case "$STATUS" in
    healthy)
        echo "OK: Engram healthy"
        exit 0
        ;;
    degraded)
        echo "WARNING: Engram degraded — $(echo "$HEALTH" | python3 -m json.tool)"
        exit 1
        ;;
    warning)
        echo "WARNING: Engram warning — check disk space"
        exit 1
        ;;
    *)
        echo "UNKNOWN: Engram status=$STATUS"
        exit 3
        ;;
esac
```

### Key Alerts to Configure

| Alert | Condition | Severity |
|-------|-----------|----------|
| Engram down | `/health` unreachable for > 2 min | Critical |
| Disk space low | `engram_disk_free_gb < 2` | Warning |
| Disk space critical | `engram_disk_free_gb < 0.5` | Critical |
| Entity count dropped | `engram_entity_count` decreased > 5% | Critical |
| Index stale | `engram_index_age_seconds > 604800` (1 week) | Warning |
| Quarantine growing | `engram_quarantine_count > 50` | Warning |
| Search latency high | `engram_search_latency_ms` p95 > 5000 | Warning |
| Consolidation stuck | `engram_consolidation_running = 1` for > 10 min | Warning |
| Audit log large | `engram_audit_log_size_mb > 100` | Info |

### Integration with rook-doctor (Jetson)

If running on Jetson with the rook-doctor watchdog:

```bash
# Add Engram health check to doctor.sh
curl -sf --max-time 5 localhost:8001/health > /dev/null || systemctl restart engram-server
```

This gives automatic recovery: if Engram crashes, rook-doctor restarts it within one watchdog cycle (default 60s).
