# Engram Security Hardening: Data Leakage Prevention

**Status:** Complete implementation with multi-layer secret scanning
**Last Updated:** 2026-03-29

---

## Overview

This document describes the comprehensive security infrastructure added to the Engram onboarding wizard to prevent data leakage, credential exposure, and unauthorized repository visibility changes.

### Three-Layer Defense

```
┌────────────────────────────────────────────────────────────────┐
│ Layer 1: GitHub Privacy Controls (Prevention)                 │
│ ├─ Private-by-default repositories                           │
│ ├─ Fine-grained tokens (limited permissions)                 │
│ └─ Agents cannot change visibility without user approval    │
├────────────────────────────────────────────────────────────────┤
│ Layer 2: Pre-Commit Secret Scanning (Automated Detection)     │
│ ├─ git-secrets: Regex pattern detection                      │
│ ├─ gitleaks: Entropy-based + pattern detection              │
│ └─ ggshield: Enterprise-grade ML-based detection            │
├────────────────────────────────────────────────────────────────┤
│ Layer 3: Credential Validation (Format Verification)         │
│ ├─ Validates all token formats on save                      │
│ ├─ Checks file permissions (credentials.json = 600)        │
│ └─ Reports format errors with fixes                         │
└────────────────────────────────────────────────────────────────┘
```

---

## Layer 1: GitHub Privacy Controls

### Default Behavior

- **All new repositories: PRIVATE** (not public)
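- **Fine-grained tokens expected** (`github_pat_`); the format check also accepts classic `ghp_` tokens, which cannot be scoped per repository
- **Limited permissions:**
  - `Contents`: Read + Write (for code)
  - `Workflows`: Read + Write (for CI/CD)
  - **NOT** `Administration`, organization, or account-level permissions (`admin:org` in classic-scope terms)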

### Prevents Agents From:

- ❌ Making repositories public
- ❌ Changing repository visibility
- ❌ Accessing organization settings
- ❌ Managing access control
- ❌ Deleting repositories

### Implementation

```python
# Token format validation (interview.py); token_kind is an illustrative name
if github_token and github_token.startswith("github_pat_"):
    token_kind = "fine-grained"  # preferred: permissions are scoped per repository
elif github_token and github_token.startswith("ghp_"):
    token_kind = "classic"       # also accepted, but scopes are broader
else:
    token_kind = None            # unrecognized format
```

### User Confirmation

```
🔒 PRIVACY BY DEFAULT:
  • All repositories will be PRIVATE by default
  • Agents CANNOT make repos public without your explicit confirmation
  • You must approve any public visibility changes
```
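
The approval step described above could be enforced in code with a small gate. The following is a minimal sketch, not the wizard's actual implementation: the function name `confirm_visibility_change` and the "type the repository name" convention are assumptions made for illustration.

```python
def confirm_visibility_change(repo: str, new_visibility: str, ask=input) -> bool:
    """Return True only if the change is non-public or the user confirmed it.

    Hypothetical helper: `ask` is injectable so the prompt can be tested.
    """
    if new_visibility != "public":
        # Only public exposure needs explicit approval in this sketch.
        return True
    answer = ask(f"Type the repository name to make '{repo}' PUBLIC: ")
    # Require the exact repository name, not just "y", to avoid accidental approval.
    return answer.strip() == repo
```

Requiring the repository name instead of a yes/no answer makes it harder for an automated agent or a distracted user to approve the change by accident.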

---

## Layer 2: Pre-Commit Secret Scanning

### Three-Tool Strategy

#### 1. **git-secrets** (Local Pattern Detection)

**What it detects:**
- AWS/GCP credentials
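- Custom patterns (written here in PCRE-style notation; git-secrets takes egrep-compatible patterns, so `(?i)` and `\d` may need ERE equivalents such as `[0-9]`):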
  - Proprietary formulas: `(?i)(custom_indicator|proprietary_formula)[\s_]*=[\s_]*["\']?[A-Za-z0-9\-_]{20,}`
  - API keys: `(?i)(api[_-]?key|apikey)[\s]*[=:]\s*["\']?[A-Za-z0-9\-_]{32,}`
  - Passwords: `(?i)(password|passwd|pwd)[\s]*[=:]\s*["\']?.{8,}`
  - Private keys: `(?i)(private|secret)[_-]?(key|token)`
  - User identifiers (PII): `user[_-]?(id|name)` patterns
  - SSN: `\d{3}-\d{2}-\d{4}` format
  - Credit cards: `\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}`
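
The patterns above can be sanity-checked outside git-secrets before relying on them. The sketch below uses Python's `re` module, which accepts the `(?i)` and `\d` syntax as written; the `PATTERNS` dictionary and the `find_matches` helper are hypothetical names, not part of Engram or git-secrets.

```python
import re

# Patterns copied from the list above; the dictionary keys are illustrative.
PATTERNS = {
    "proprietary_formula": r"(?i)(custom_indicator|proprietary_formula)[\s_]*=[\s_]*[\"']?[A-Za-z0-9\-_]{20,}",
    "api_key": r"(?i)(api[_-]?key|apikey)[\s]*[=:]\s*[\"']?[A-Za-z0-9\-_]{32,}",
    "password": r"(?i)(password|passwd|pwd)[\s]*[=:]\s*[\"']?.{8,}",
    "ssn": r"\d{3}-\d{2}-\d{4}",
    "credit_card": r"\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}",
}


def find_matches(line: str) -> list[str]:
    """Return the names of all patterns that match somewhere in the line."""
    return [name for name, pattern in PATTERNS.items() if re.search(pattern, line)]


if __name__ == "__main__":
    for sample in (
        'proprietary_formula = "custom_algorithm_ABC123DEFGHIJKLMNOP"',
        'api_key = "nvapi-fake123"',  # too short for the 32+ character rule
    ):
        print(sample, "->", find_matches(sample))
```

Note that the API key pattern only fires on values of 32 or more characters, so short placeholder keys pass through unflagged.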

**Installation:**
```bash
sudo apt install -y git-secrets
git secrets --install -f
git secrets --register-aws
# Custom patterns added automatically
```

**How it runs:**
```bash
# Runs before each commit
git commit -m "your message"
# → git-secrets scans staged files
# → If secret found, commit is blocked
```

#### 2. **gitleaks** (Entropy-Based Detection)

**What it detects:**
- High-entropy strings (base64, hex)
- Password-like patterns
- Known secret patterns (API keys, tokens)
- Complexity analysis (flags suspicious data)

**Installation:**
```bash
sudo apt install -y gitleaks
# OR manual from: https://github.com/gitleaks/gitleaks/releases
```

**How it runs:**
```bash
gitleaks protect --staged --verbose
# Runs before each commit on staged changes
# Entropy rules can flag encoded strings (base64, hex) that fixed patterns miss
```
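
To make "high-entropy strings" concrete, the sketch below computes Shannon entropy in bits per character. It only illustrates the idea; it is not gitleaks' implementation, and the `threshold` and `min_length` defaults are assumed values chosen for the example.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Return the Shannon entropy of text in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_random(text: str, threshold: float = 3.5, min_length: int = 16) -> bool:
    """Flag strings that are long enough and have high per-character entropy.

    threshold and min_length are illustrative defaults, not gitleaks settings.
    """
    return len(text) >= min_length and shannon_entropy(text) >= threshold
```

Repetitive strings such as `"aaaaaaaaaaaaaaaaaaaa"` score close to zero, while random-looking tokens with many distinct characters score well above the threshold.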

#### 3. **GitGuardian ggshield** (ML-Based Enterprise Detection)

**What it detects:**
- Real secrets (validated against hashsums)
- Incident context (was this secret ever leaked?)
- Custom organizational patterns
- Business logic secrets (proprietary formulas)

**Installation:**
```bash
pipx install ggshield
ggshield auth login  # Interactive auth setup
```

**Authentication:**
```bash
# Get GitGuardian API key from:
# https://dashboard.gitguardian.com/settings/api

# Store in: ~/.config/ggshield/config.yml
#   gitguardian:
#     api_key: YOUR_API_KEY_HERE
# ggshield also reads the key from the GITGUARDIAN_API_KEY environment variable
```

**How it runs:**
```bash
ggshield secret scan pre-commit
# Pre-commit hook automatically calls this
# Checks against GitGuardian's database of real incidents
```

### Pre-Commit Hook Setup

**File:** `.pre-commit-config.yaml` (automatically created)

```yaml
repos:
  # git-secrets
  - repo: local
    hooks:
      - id: git-secrets
        name: git-secrets
        entry: git secrets --scan
        language: system
        stages: [commit]

  # gitleaks
  - repo: local
    hooks:
      - id: gitleaks
        name: gitleaks
        entry: gitleaks protect --staged
        language: system
        pass_filenames: false  # gitleaks reads the staged diff itself
        stages: [commit]

  # Private key and large file detection
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.4.0
    hooks:
      - id: detect-private-key
      - id: check-added-large-files
        args: ['--maxkb=1000']
```

**How to run manually:**
```bash
# Test secret scanning
git secrets --scan

# Scan specific file
gitleaks detect --no-git --source myfile.py

# Full repo scan (including git history)
gitleaks detect --source .

# GitGuardian scan
ggshield secret scan repo .
```
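
The manual commands above can also be driven from a script, for example as a scheduled check. This is a rough sketch under stated assumptions: `SCANNERS` and `run_scanners` are hypothetical names, the commands mirror the manual invocations in this section, and a scanner that is not on `PATH` is reported as `None` instead of raising an error.

```python
import shutil
import subprocess

# Executable name -> full-repository scan command (mirrors the manual commands).
SCANNERS = {
    "git-secrets": ["git", "secrets", "--scan"],
    "gitleaks": ["gitleaks", "detect", "--source", "."],
    "ggshield": ["ggshield", "secret", "scan", "repo", "."],
}


def run_scanners(repo_dir: str = ".", which=shutil.which, run=subprocess.run) -> dict:
    """Run each installed scanner; map name -> True (clean), False (findings or error), None (missing)."""
    results = {}
    for name, cmd in SCANNERS.items():
        if which(name) is None:
            results[name] = None  # tool not installed, nothing to report
            continue
        # A non-zero exit code is treated as "not clean" in this sketch.
        results[name] = run(cmd, cwd=repo_dir).returncode == 0
    return results


if __name__ == "__main__":
    for tool, status in run_scanners().items():
        print(f"{tool}: {'missing' if status is None else 'clean' if status else 'FINDINGS'}")
```

The `which` and `run` parameters exist only so the function can be exercised without the real tools installed.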

---

## Layer 3: Credential Validation

### Format Checks (On Save)

```
Token/Key Format Validation:
├─ Nvidia:    nvapi-* (64+ chars)
├─ Tailscale: tskey-* (64+ chars)
├─ Matrix:    syt_* (variable length)
├─ GitHub:    ghp_* (classic) or github_pat_* (fine-grained)
└─ Email:     *@*.* (basic email format)

File Permissions:
└─ credentials.json: 600 (read/write owner only)
```
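
The format table above can be expressed as a small validator. The following is a sketch only, not the code in `interview.py`: `TOKEN_RULES`, `EMAIL_RE`, and `validate_credential` are hypothetical names, and the "64+ chars" requirement is assumed to mean total length including the prefix.

```python
import re

# kind -> (accepted prefixes, minimum total length). Lengths come from the
# table above, interpreted as total length including the prefix (an assumption).
TOKEN_RULES = {
    "nvidia": (("nvapi-",), 64),
    "tailscale": (("tskey-",), 64),
    "matrix": (("syt_",), 0),
    "github": (("ghp_", "github_pat_"), 0),
}
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")  # basic *@*.* shape only


def validate_credential(kind: str, value: str) -> list[str]:
    """Return human-readable problems; an empty list means the format looks valid."""
    if kind == "email":
        return [] if EMAIL_RE.fullmatch(value) else ["email is not in *@*.* form"]
    prefixes, min_length = TOKEN_RULES[kind]
    problems = []
    if not value.startswith(prefixes):
        problems.append(f"{kind} value doesn't start with {' or '.join(prefixes)}")
    if len(value) < min_length:
        problems.append(f"{kind} value is shorter than {min_length} characters")
    return problems
```

Returning a list of problems, not a boolean, lets the caller print every format error with its fix in one pass.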

### Reported Issues

If a secret fails validation:

```
❌ Nvidia key doesn't start with 'nvapi-' (invalid format)
→ Fix: Regenerate from https://docs.nvidia.com/...

❌ credentials.json is readable by others (mode: 644)
→ Fix: chmod 600 ~/engram/credentials.json
```
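
The permission report shown above could be produced by a check along these lines. This is a minimal sketch for POSIX systems; `check_owner_only` is a hypothetical helper name, and the message wording only approximates the example output.

```python
import os
import stat
from typing import Optional


def check_owner_only(path: str) -> Optional[str]:
    """Return a fix hint if group or others have any access to path, else None."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & 0o077:  # any group/other permission bit is set
        return f"{path} is accessible by others (mode: {mode:o}); fix: chmod 600 {path}"
    return None
```

For a file with mode 644 the function returns a hint containing `mode: 644`; after `chmod 600` it returns `None`.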

### What Gets Validated

- ✅ All credential formats
- ✅ File permissions (600)
- ✅ Token/key structure
- ❌ Does NOT validate token expiry
- ❌ Does NOT validate token scope (user responsibility)

---

## Preventing Data Leakage: Real-World Scenarios

### Scenario 1: Agent Tries to Commit Proprietary Formula

```bash
# Agent writes code that includes proprietary formula
echo "proprietary_formula = 'custom_algorithm_ABC123DEFGHIJKLMNOP'" >> config.py
git add config.py
git commit -m "Add proprietary algorithm"

# What happens:
# → git-secrets matches "proprietary_formula" pattern
# → Commit BLOCKED with error:
#    "[ERROR] Matched one or more prohibited patterns"
# → Agent cannot push code with proprietary data
```

### Scenario 2: Agent Tries to Make Repo Public

```bash
# Agent has GitHub token with limited permissions
# Agent tries: gh repo edit --visibility public

# What happens:
# → GitHub API returns 403 Forbidden
# → Token permissions only include: Contents, Workflows
# → Visibility changes need the repository Administration permission, which the token lacks
# → Agent must ask user to change visibility manually
```

### Scenario 3: Agent Commits Password in Config

```bash
# Agent writes: password = "MySecurePass123"
git add myconfig.py
git commit -m "Add default password"

# What happens:
# → git-secrets matches the password pattern
# → gitleaks and ggshield also scan the staged change
# → Commit BLOCKED
# → Agent must remove secret before retry
```

### Scenario 4: Agent Tries to Leak PII

```bash
# Agent writes: user_ssn = "123-45-6789"
# or: credit_card = "4111-1111-1111-1111"

git commit -m "Store user data"

# What happens:
# → git-secrets matches SSN/CC patterns
# → Commit BLOCKED immediately
# → PII never reaches repository
```

---

## Operational Procedures

### Initial Setup (Wizard Runs These)

```bash
# 1. GitHub account setup
#    - Create account
#    - Enable 2FA (REQUIRED)
#    - Generate fine-grained token with limited permissions

# 2. Secret scanning tools
#    - Install: git-secrets, gitleaks, ggshield
#    - Configure: Custom patterns, GitGuardian API key
#    - Install hooks: Pre-commit framework

# 3. Credentials saved
#    - credentials.json created (mode 600)
#    - All tokens format-validated
#    - File permissions checked
```

### Daily Operations

```bash
# Developer workflow (hooks run automatically):
# 1. Edit code
# 2. Stage files
git add .
# 3. Commit; pre-commit hooks run automatically:
#    → git-secrets scans for patterns
#    → gitleaks scans for entropy
#    → ggshield checks against incident database
git commit -m "message"
# 4. If secrets found: commit BLOCKED, developer fixes
# 5. If no secrets: commit succeeds, can push

# Developer can also manually scan:
git secrets --scan              # Local patterns
gitleaks detect --source .      # Entropy + patterns
ggshield secret scan repo .     # Enterprise database
```

### Monitoring & Alerts

```bash
# Check what secret scanning is active
cat ~/.git-hooks/pre-commit-gitleaks
cat ~/.git-hooks/pre-commit-ggshield

# View ggshield config
cat ~/.config/ggshield/config.yml

# Check git-secrets patterns
git secrets --list

# Test a pattern match (should block; the API key pattern needs 32+ characters)
echo 'api_key = "nvapi-0123456789abcdef0123456789abcdef"' > test.py
git add test.py
git commit -m "test"  # Should fail
```

### If Secret Gets Committed (Emergency)

```bash
# 1. Immediately revoke the compromised secret
#    (e.g., regenerate GitHub token, reset API key)

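# 2. Remove from history
#    (Git's documentation recommends git-filter-repo over filter-branch)
git filter-branch --force --index-filter \
  'git rm --cached --ignore-unmatch <filename>' \
  --prune-empty --tag-name-filter cat -- --all

# 3. Force push branches and tags (only if you own the repo)
git push origin --force --all
git push origin --force --tags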

# 4. Notify team (if shared repo)
# 5. Regenerate all affected credentials
```

---

## Testing the Security Implementation

### Test 1: Verify git-secrets

```bash
cd ~/engram
echo 'api_key = "nvapi-0123456789abcdef0123456789abcdef"' > test.py
git add test.py

# Should fail (the key name and 32+ character value match the API key pattern):
git commit -m "test"
# Error: "[ERROR] Matched one or more prohibited patterns"
```

### Test 2: Verify gitleaks

```bash
echo 'password = "SuperSecret123!"' > test.py
git add test.py

# Should fail:
git commit -m "test"
# Error: gitleaks reports the number of leaks found (for example "leaks found: 1")
```

### Test 3: Verify ggshield

```bash
echo 'api_key = "test_key_with_high_entropy_xyzabc"' > test.py
git add test.py

# Should fail:
git commit -m "test"
# Error: ggshield detected potential secrets
```

### Test 4: Verify GitHub Permissions

```bash
# Try to make repo public with limited-permission token
gh repo edit --visibility public

# Should fail:
# Error: "Resource not accessible by personal access token"
# (token doesn't have permission to change settings)
```

---

## Configuration & Customization

### Add Custom Pattern (git-secrets)

```bash
# Note: -a / --allowed would whitelist the pattern instead of prohibiting it
git secrets --add 'my-custom-pattern'

# Example: Detect internal ID format
git secrets --add 'INTERNAL_ID=[0-9]{10}'
```

### Configure ggshield

```bash
# Interactive setup
ggshield auth login

# Manual config at ~/.config/ggshield/config.yml
# gitguardian:
#   api_key: YOUR_KEY
#   only_verified: true   # Only real secrets
#   verbose: true         # Show details
```

### Update gitleaks Rules

```bash
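# Default rules ship inside the gitleaks binary; upgrade gitleaks to get the latest
sudo apt install --only-upgrade -y gitleaks

# Custom rules in: ~/.gitleaks.toml (pass explicitly with --config)
gitleaks detect --config ~/.gitleaks.toml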
```

---

## Troubleshooting

### Issue: Pre-commit hook not running

```bash
# Check if installed
pre-commit run --all-files

# Reinstall
pre-commit install
pre-commit install-hooks
```

### Issue: False positive (legitimate string flagged)

```bash
# Add to git-secrets whitelist
git secrets --add --allowed 'my-test-string'

# Or keep a local-only test file untracked so it is never staged or scanned
echo 'testfile.py' >> .git/info/exclude
```

### Issue: ggshield authentication failed

```bash
# Check API key
cat ~/.config/ggshield/config.yml

# Re-authenticate
ggshield auth login

# Test connection
ggshield api-status
```

### Issue: git-secrets installed but not in PATH

```bash
# Check location
which git-secrets

# If not found, install in PATH
sudo apt install -y git-secrets
# OR
git clone https://github.com/awslabs/git-secrets
cd git-secrets && sudo make install
```

---

## Summary: Layers of Defense

| Layer | Tool | Detection | Prevention |
|-------|------|-----------|-----------|
| **Pre-Commit Hook** | git-secrets | Regex patterns | Blocks commit |
| **Pre-Commit Hook** | gitleaks | Entropy + patterns | Blocks commit |
| **Pre-Commit Hook** | ggshield | Real incident database | Blocks commit |
| **API Permissions** | GitHub fine-grained | Limited scope | Token can't change visibility |
| **Validation** | Format checker | Token structure | Alerts on save |
| **File Permissions** | OS-level | 600 mode | Only owner reads |

**Result:** Leaking data requires getting past several independent checks, which is difficult to do without the user noticing.

---

## Next Steps

1. **Run the wizard:** `python3 scripts/interview.py`
2. **Verify hooks installed:** `pre-commit run --all-files`
3. **Test a pattern:** Try to commit a fake secret (should fail)
4. **Configure ggshield:** Add your API key for enterprise detection
5. **Review history:** `git log --all --source --remotes` lists every commit; `gitleaks detect` scans that history for secrets

---

**Questions?** Refer to tool documentation:
- git-secrets: https://github.com/awslabs/git-secrets
- gitleaks: https://github.com/gitleaks/gitleaks
- ggshield: https://docs.gitguardian.com/ggshield
