# Engram Troubleshooting Guide

This document captures real-world issues encountered during Engram Hub and Jetson fleet deployment, with exact reproduction steps and fixes.

---

## 1. Docker "Space vs. Dash" Trap

### Problem
Modern Docker installations (v20.10+ with the Compose V2 plugin) no longer include the standalone `docker-compose` command. Users attempting to run `docker-compose up` will see:
```
zsh: command not found: docker-compose
```

### Root Cause
Docker Compose V2 ships as a Docker CLI plugin and is invoked as a subcommand (`docker compose` with a space, not a hyphen). The standalone Compose V1 binary (`docker-compose`) is end-of-life and is no longer included in newer Docker installations.

### Fix
**Always use `docker compose` (space), not `docker-compose` (hyphen):**

```bash
# ✓ CORRECT
docker compose up -d
docker compose down
docker compose logs -f

# ✗ WRONG
docker-compose up -d
docker-compose down
docker-compose logs -f
```

### Prevention
- Update all documentation to use `docker compose` (space)
- Remove `version: '3.8'` from docker-compose.yml — it is now obsolete and Docker ignores it with a warning
- Add a note in README about Docker version requirements (v20.10+)
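
As a pre-flight guard, the deployment script can verify that the Compose V2 plugin is available before doing anything else. A minimal sketch (the `require_compose` helper name is hypothetical, not an existing part of the deploy scripts):

```bash
# Hypothetical pre-flight helper: fail fast if `docker compose` (V2) is unavailable.
require_compose() {
  if docker compose version >/dev/null 2>&1; then
    return 0
  fi
  echo "error: 'docker compose' not found; install Docker v20.10+ with the Compose plugin" >&2
  return 1
}
```

Calling `require_compose || exit 1` at the top of a script turns the "space vs. dash" trap into a clear error instead of a confusing `command not found` mid-deployment.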

**Status:** ✅ Fixed in docker-compose.yml (version tag removed)

---

## 2. Synapse "Identity Crisis" — Container Name & Config Mismatch

### Problem
When the Matrix service was renamed from `matrix_hub` to `engram-synapse` in docker-compose.yml, the container kept restarting with:
```
Config file '/data/homeserver.yaml' does not exist.
The synapse docker image no longer supports generating a config file on-the-fly
```

### Root Cause
1. The old `matrix_hub` container left dangling resources in the Docker network
2. Synapse's modern image (v1.x+) **requires** a pre-existing `homeserver.yaml` config file
3. No automatic config generation happens on container startup

### Fix: Force-Clear the Dangling Service

```bash
# Kill and remove the old container explicitly
docker compose down --remove-orphans

# Remove all associated data
sudo rm -rf matrix/

# If port 8008 is still bound, kill any lingering processes
lsof -i :8008  # Check what's using it
kill -9 <PID>   # Force kill if needed

# Restart fresh
docker compose up -d
```

### Prevention
1. **First-Run Check:** Add a validation step in the deployment script to verify `homeserver.yaml` exists before starting Synapse
2. **Document the Setup Flow:** Create a separate `matrix/homeserver.yaml` during initial deployment, not as part of container startup
3. **Use `--remove-orphans`:** Always include this flag when stopping to clean up renamed services
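
The first-run check from step 1 can be a small guard at the top of the deployment script. A sketch, assuming the config lives at `matrix/homeserver.yaml` (the `check_synapse_config` name is illustrative):

```bash
# Hypothetical guard: refuse to start Synapse unless its config already exists.
check_synapse_config() {
  local config="${1:-matrix/homeserver.yaml}"
  if [ ! -f "$config" ]; then
    echo "error: $config not found; generate it before 'docker compose up'" >&2
    return 1
  fi
}
```

This fails fast with an actionable message instead of letting the container enter a restart loop.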

**Status:** ⚠️ Partially fixed — Synapse is currently disabled; re-enable with proper config generation flow

---

## 3. Port Squatting (8008) — Zombie Processes

### Problem
After killing a Synapse container, port 8008 remained "occupied" even though no process showed in `lsof`:
```bash
$ docker compose up -d
Error: Port 8008 is already in use
```

Running `lsof -i :8008` showed nothing, but the port didn't respond to connections.

### Root Cause
Old Matrix/Synapse containers sometimes leave lingering Docker network namespaces. The port appears bound but no userland process holds it. Docker's internal network driver keeps the reservation.

### Fix: Complete Port Cleanup

```bash
# 1. Kill the service
docker compose down

# 2. Remove orphan containers
docker compose down --remove-orphans

# 3. Check if port 8008 is actually bound
lsof -i :8008
netstat -tupln | grep 8008   # or, where netstat is absent: ss -tlnp | grep 8008

# 4. If port still appears bound, inspect Docker networks
docker network inspect engram_engram-mesh

# 5. Force-prune unused networks and containers
docker system prune -f
docker network prune -f

# 6. Restart
docker compose up -d
```

### Prevention
- Always use `docker compose down --remove-orphans` when stopping services
- Document port usage in docker-compose.yml comments (who uses 8000, 8008, 8448)
- Add a "Port Check" section to pre-flight validation in deployment scripts
- Consider using explicit network cleanup in automated deployments
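
The "Port Check" pre-flight from the list above can even avoid lsof entirely by using bash's built-in `/dev/tcp` probe. A sketch (the `port_free` name is illustrative, and `/dev/tcp` is a bash feature, not POSIX):

```bash
# Hypothetical probe: returns 0 if nothing accepts TCP connections on localhost:$1.
port_free() {
  ! (exec 3<>"/dev/tcp/127.0.0.1/$1") 2>/dev/null
}
```

For example, `port_free 8008 || echo "port 8008 still bound" >&2` before `docker compose up -d` catches the zombie-port case early.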

**Status:** ✅ Fixed via `--remove-orphans` during deployment

---

## 4. Tailscale Node ID Discrepancy — Manifest vs. Runtime

### Problem
The `get_mesh_auth.py` script generated a manifest with node ID `jetson-0addff44`, but after provisioning, `tailscale status` showed the Jetson as `jetson-41064a3d` — a completely different ID:

```bash
# Expected from manifest
jetson-0addff44

# Actually appeared in mesh
100.x.x.x   jetson-41064a3d  mtaylor@  linux  -
```

This caused confusion about whether provisioning succeeded or failed.

### Root Cause
Tailscale generates the node ID **at first authentication** based on the machine's hostname and a random component. The manifest script generates a predicted ID based on available info, but this ID is not finalized until the node actually authenticates via `tailscale up --authkey`.

This is **not a bug**; it's expected behavior. The manifest ID is a best guess for tracking; the authenticated ID is ground truth.

### Fix: Verify Using Authenticated Node ID

```bash
# After provisioning, always verify the node in the mesh:
tailscale status | grep jetson

# Output should show:
# <Tailscale IP>   <Actual Node ID>  <User>  <OS>  -

# Verify connectivity via Tailscale IP (not the predicted ID)
ssh howsa@<TAILSCALE_IP>

# Example from this deployment:
ssh howsa@100.x.x.x
```

### Prevention
1. **Document the ID Flip:** Add a section in README explaining that manifest IDs are predictions
2. **Update Deployment Output:** Have `deploy_to_node.sh` print the **actual authenticated node ID** from `tailscale status` instead of the manifest ID
3. **Update Polling Logic:** Poll for the **Tailscale IP address** becoming available, not the predicted node ID
4. **Add Verification Step:** After polling succeeds, print the confirmed node ID so the user sees ground truth
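
Points 3 and 4 above can be combined into one polling helper that scans `tailscale status` for the hostname prefix and prints the node's Tailscale IP once it appears. A sketch (the `wait_for_node` name and 120-second default are assumptions, not existing script behavior):

```bash
# Hypothetical poll: wait for a node matching $1 to join the mesh, print its IP.
wait_for_node() {
  local pattern="$1" timeout="${2:-120}" elapsed=0 ip
  while [ "$elapsed" -lt "$timeout" ]; do
    # tailscale status lines look like: <IP> <node name> <user> <os> -
    ip=$(tailscale status 2>/dev/null | awk -v p="$pattern" '$2 ~ p {print $1; exit}')
    if [ -n "$ip" ]; then
      echo "$ip"
      return 0
    fi
    sleep 5
    elapsed=$((elapsed + 5))
  done
  return 1
}
```

Usage: `IP=$(wait_for_node jetson) && ssh "howsa@$IP"` — this keys on the ground-truth mesh entry rather than the predicted manifest ID.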

**Status:** ⚠️ Documented — polling still uses manifest ID; should be updated to use Tailscale IP detection

---

## 5. SSH Password Prompts in Headless Deployments

### Problem
The deployment script (`deploy_to_node.sh`) ran `ssh` and `rsync` commands that prompted for passwords, blocking the deployment:

```
howsa@192.168.x.x's password:
[stuck waiting for input]
```

### Root Cause
1. Passwordless SSH key exchange wasn't configured
2. The remote `prep_node.sh` tried to run `sudo` commands without passwordless sudo setup
3. No SSH agent was forwarding keys

### Fix: Setup SSH Key Trust and Passwordless Sudo

```bash
# 1. Ensure your SSH key is in the agent
ssh-add ~/.ssh/id_rsa

# 2. Copy your public key to the Jetson (you'll be prompted for password once)
ssh-copy-id -i ~/.ssh/id_rsa.pub howsa@192.168.x.x

# 3. Verify passwordless SSH works
ssh howsa@192.168.x.x "echo 'SSH works without password'"

# 4. Setup passwordless sudo on the Jetson (one-time setup with password)
ssh -t howsa@192.168.x.x \
  "echo 'howsa ALL=(ALL) NOPASSWD: ALL' | sudo tee /etc/sudoers.d/engram-prep"

# 5. Verify passwordless sudo works
ssh howsa@192.168.x.x "sudo tailscale status"
```

### Prevention
1. **Pre-Flight Check:** Add validation to `deploy_to_node.sh` that verifies passwordless SSH before attempting deployment
2. **Document Setup:** Add a "Prerequisites" section to README with exact SSH setup commands
3. **SSH Agent Check:** Have the deployment script check `ssh-add -l` before proceeding
4. **Fail Fast:** Print clear error messages if SSH/sudo aren't passwordless-enabled
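
The pre-flight check from step 1 is essentially a one-liner: `BatchMode=yes` makes ssh fail instead of prompting for a password. A sketch (the helper name is illustrative):

```bash
# Hypothetical pre-flight: fail fast if SSH to $1 would prompt for a password.
check_passwordless_ssh() {
  ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true 2>/dev/null
}
```

For example: `check_passwordless_ssh howsa@192.168.x.x || { echo "run ssh-copy-id first" >&2; exit 1; }`.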

**Status:** ✅ Fixed with one-time setup; should be documented in README prerequisites

---

## 6. Rsync Exclusion Errors — Permission Denied on Temporary Files

### Problem
Rsync failed with permission errors on files that shouldn't have been transferred:

```
rsync: [sender] send_files failed to open "/home/geodesix/engram/synapse_data/engram-hub.signing.key"
rsync error: some files/attrs were not transferred (code 23)
```

### Root Cause
The `synapse_data/` directory contained root-owned files with restrictive permissions (such as the signing key) that the deploying user could not read, and rsync tried to scan them even though they shouldn't be deployed to the Jetson.

### Fix: Add Exclusions to Rsync

```bash
rsync -aqz \
  --exclude='venv' \
  --exclude='.venv' \
  --exclude='__pycache__' \
  --exclude='.git' \
  --exclude='synapse_data' \
  --exclude='matrix' \
  --exclude='REFLECTIONS' \
  --exclude='MEMORIES' \
  --exclude='node_modules' \
  --exclude='.env' \
  --exclude='.env.tailscale' \
  --exclude='REPORTS' \
  --exclude='.claude' \
  --exclude='.mypy_cache' \
  --exclude='.pytest_cache' \
  --exclude='.ruff_cache' \
  --exclude='logs' \
  --exclude='artifacts' \
  "$SOURCE_DIR" "${TARGET_USER}@${TARGET_HOSTNAME}:${TARGET_DIR}"
```

### Prevention
1. **Centralize Exclusion List:** Create a `.rsyncignore` file with all excludes, reference it with `--exclude-from`
2. **Pre-Flight Check:** Verify the deploying user can read everything before rsync (`find . ! -readable -print` lists the entries rsync would fail on; requires GNU find)
3. **Document Deployment Directory Structure:** Add a section listing which directories should never be synced
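
The centralized exclusion list from step 1 could look like this. Note that `.rsyncignore` is just a naming convention, not a file rsync discovers automatically, so it must be passed via `--exclude-from` (list trimmed to a few entries for illustration):

```bash
# Write the shared exclusion list once
cat > .rsyncignore <<'EOF'
venv
.venv
__pycache__
.git
synapse_data
matrix
node_modules
.env
EOF

# deploy_to_node.sh then references it instead of repeating --exclude flags:
#   rsync -aqz --exclude-from=.rsyncignore "$SOURCE_DIR" "${TARGET_USER}@${TARGET_HOSTNAME}:${TARGET_DIR}"
```

Keeping one file means new exclusions land in a single place instead of drifting between scripts.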

**Status:** ✅ Fixed in deploy_to_node.sh

---

## Quick Reference: One-Handed Recovery

If the stack breaks, use this sequence:

```bash
# Complete reset
docker compose down --remove-orphans
sudo rm -rf matrix/ synapse_data/
docker system prune -f
docker network prune -f

# Rebuild and restart
docker compose build --no-cache
docker compose up -d

# Verify
curl http://localhost:8000/health
docker compose ps
```
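
If the hub takes a moment to come up after a rebuild, the final curl check can be wrapped in a short retry loop. A sketch (the 6-try/30-second budget is an assumption):

```bash
# Hypothetical: retry the health endpoint a few times before declaring failure.
wait_for_health() {
  local url="${1:-http://localhost:8000/health}" tries="${2:-6}" i
  for i in $(seq "$tries"); do
    curl -fsS "$url" >/dev/null 2>&1 && return 0
    sleep 5
  done
  return 1
}
```

Usage: `wait_for_health || echo "hub unhealthy after rebuild" >&2`.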

---

## When to Report Issues

- **Port conflicts persist after `docker system prune -f`?** → Check if Docker daemon needs restart: `sudo systemctl restart docker`
- **Jetson doesn't appear in mesh after 120s?** → SSH in and check `/tmp/prep_node.log`
- **Synapse keeps crashing?** → Currently disabled; document your homeserver.yaml before re-enabling
- **Rsync still failing?** → Add directory to exclusions list in deploy_to_node.sh and retry
