Operations Runbook

Overview

This runbook covers the procedures for operating and maintaining the CyberAi control plane: daily checks, common tasks, incident response, maintenance windows, and recovery.

Daily Operations

Health Check

# Run audit to verify system health
./tools/audit/audit.sh

# Check workflow status (enabled/disabled) and recent run results
gh workflow list
gh run list --limit 10

# Verify site deployment
curl -I https://cyberai.network
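
The checks above can be wrapped in one script that reports each result and exits non-zero if any check fails. The following is a minimal sketch, not something shipped in the repository: run_check and health_check are hypothetical helper names, and it assumes the script is run from the repository root with gh and curl installed.

```shell
#!/bin/sh
# Hypothetical helper: run one named check and report PASS or FAIL.
# Output of the check itself is discarded to keep the summary readable.
run_check() {
  name="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS $name"
  else
    echo "FAIL $name"
    return 1
  fi
}

# Run the checks from the Health Check section and count failures.
health_check() {
  failures=0
  run_check "audit" ./tools/audit/audit.sh || failures=$((failures + 1))
  run_check "workflows" gh workflow list || failures=$((failures + 1))
  # -f makes curl exit non-zero on HTTP 4xx/5xx responses.
  run_check "site" curl -fsSI https://cyberai.network || failures=$((failures + 1))
  echo "$failures check(s) failed"
  [ "$failures" -eq 0 ]
}
```

Calling health_check from a scheduled job gives a single exit status to alert on; drop the output redirection in run_check when the details of a failing check are needed.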

Monitoring

Common Tasks

Adding a New Contract

  1. Create the contract JSON in the appropriate directory (contracts/agents/ or contracts/repos/)
  2. Validate against schema: ajv validate -s contracts/contract.schema.json -d "contracts/agents/your-contract.json"
  3. Create PR with the new contract
  4. Wait for automated validation to pass
  5. Merge after review
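
Step 2 can be scripted so that every contract is validated in one pass before the PR is opened. This is a sketch only: is_contract_path, validate_contract, and validate_all_contracts are hypothetical names, and it assumes ajv (ajv-cli) is on the PATH and the script is run from the repository root.

```shell
# Hypothetical helper: accept only .json files under one of the two
# contract directories named in step 1.
is_contract_path() {
  case "$1" in
    contracts/agents/*.json|contracts/repos/*.json) return 0 ;;
    *) return 1 ;;
  esac
}

# Validate one contract against the schema; returns 2 for non-contract paths.
validate_contract() {
  if ! is_contract_path "$1"; then
    echo "SKIP (not a contract path): $1" >&2
    return 2
  fi
  ajv validate -s contracts/contract.schema.json -d "$1"
}

# Validate every contract; returns non-zero if any contract is invalid.
validate_all_contracts() {
  status=0
  for file in contracts/agents/*.json contracts/repos/*.json; do
    [ -e "$file" ] || continue   # skip a glob pattern that matched nothing
    validate_contract "$file" || { echo "INVALID: $file" >&2; status=1; }
  done
  return "$status"
}
```

Running validate_all_contracts locally should surface the same schema violations that the automated validation in step 4 would report, assuming the workflow uses the same schema file.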

Updating Documentation

  1. Edit files in site/src/pages/docs/
  2. Test locally: cd site && npm run dev
  3. Build: npm run build
  4. Commit and push to trigger deployment

Deploying Site Updates

# Build the site
cd site
npm run build

# Commit and push
git add .
git commit -m "Update site"
git push origin main

# Deployment happens automatically via GitHub Actions
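
Run as written, the commands above continue to the push even if the build fails. One way to guard against that is to chain the steps and add a dry-run mode for rehearsing the sequence. This is a sketch under assumptions: run, deploy_site, and the DRY_RUN variable are hypothetical and not part of the repository.

```shell
# Hypothetical wrapper: print the command when DRY_RUN=1, otherwise run it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Build, commit, and push; each step runs only if the previous one succeeded.
# The body is a subshell so the cd does not leak into the calling shell.
deploy_site() (
  message="${1:-Update site}"
  run cd site &&
  run npm run build &&
  run git add . &&
  run git commit -m "$message" &&
  run git push origin main
)
```

For example, DRY_RUN=1 deploy_site "Fix typo" prints the five commands without executing them, and deploy_site "Fix typo" runs them for real.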

Incident Response

Site Down

  1. Check GitHub Pages status: https://www.githubstatus.com
  2. Verify DNS records for cyberai.network
  3. Check latest workflow runs for deployment failures
  4. Review recent commits for breaking changes
  5. Roll back if necessary: git revert <commit-sha>, then push to main to trigger redeployment
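
The first triage decision can be partly automated by looking at the HTTP status the site returns. The mapping below is an illustrative assumption, not an established policy: triage_hint and site_status are hypothetical helpers, and the suggested next steps simply point back to the numbered steps above.

```shell
# Hypothetical helper: map an HTTP status code to a suggested next step.
# 000 is what curl's %{http_code} reports when no response was received.
triage_hint() {
  case "$1" in
    2??|3??) echo "site responding: check content and recent commits" ;;
    404)     echo "not found: check the latest deployment workflow run" ;;
    5??)     echo "server error: check GitHub Pages status" ;;
    000)     echo "no response: check DNS records and GitHub Pages status" ;;
    *)       echo "unexpected status $1: review workflow runs" ;;
  esac
}

# Fetch the status code for the site (defaults to the production URL).
site_status() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 10 "${1:-https://cyberai.network}"
}
```

A typical call is triage_hint "$(site_status)".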

Contract Validation Failures

  1. Review workflow logs in GitHub Actions
  2. Identify failing contracts
  3. Validate manually: ./tools/audit/audit.sh
  4. Fix schema violations or contract issues
  5. Re-run validation

Security Alert

  1. Review alert details in GitHub Security tab
  2. Assess severity and impact
  3. Create incident ticket
  4. Apply fix or mitigation
  5. Verify fix with audit tool
  6. Document incident and resolution

Maintenance Windows

Scheduled Maintenance

For planned maintenance:

  1. Announce maintenance window in advance
  2. Create maintenance branch
  3. Perform updates and testing
  4. Merge to main during maintenance window
  5. Verify deployment
  6. Announce completion

Emergency Maintenance

For critical issues requiring immediate attention:

  1. Assess impact and urgency
  2. Create hotfix branch from main
  3. Apply minimal fix
  4. Fast-track review and merge
  5. Monitor post-deployment
  6. Schedule follow-up for complete fix
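
Step 2 can be sketched as a small helper that derives a branch name from a short description and creates the branch from the latest main. The hotfix/ prefix and the names slugify, hotfix_branch_name, and start_hotfix are assumptions for illustration; this runbook does not specify a naming convention.

```shell
# Hypothetical helper: lowercase the text and collapse every run of
# non-alphanumeric characters into a single hyphen.
slugify() {
  printf '%s' "$1" | tr '[:upper:]' '[:lower:]' |
    sed -E 's/[^a-z0-9]+/-/g; s/^-+//; s/-+$//'
}

# Assumed naming convention: hotfix/<slug>.
hotfix_branch_name() {
  echo "hotfix/$(slugify "$1")"
}

# Create the hotfix branch from the current tip of origin/main.
# --no-track keeps the new branch from tracking main as its upstream.
start_hotfix() {
  branch="$(hotfix_branch_name "$1")"
  git fetch origin main &&
    git switch --no-track -c "$branch" origin/main
}
```

For example, start_hotfix "Fix Broken Deploy!" would create and switch to hotfix/fix-broken-deploy.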

Backup and Recovery

Data Backup

All data is version-controlled in Git:

Recovery Procedures

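# Restore contracts/ from a specific commit (the restored files are staged)
git checkout <commit-sha> -- contracts/

# Rebuild site
cd site && npm run build

# Commit the restored files, then redeploy
git commit -m "Restore contracts from <commit-sha>"
git push origin main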
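
The recovery steps can be wrapped in a function that checks the commit reference before touching any files. This is a sketch under assumptions: looks_like_sha and restore_contracts are hypothetical names, the function expects to run from the repository root, and it commits the restored files before pushing, since a push without a new commit would deploy nothing.

```shell
# Hypothetical helper: accept 7 to 40 lowercase hexadecimal characters.
looks_like_sha() {
  printf '%s\n' "$1" | grep -Eq '^[0-9a-f]{7,40}$'
}

# Restore contracts/ from the given commit, rebuild, commit, and push.
restore_contracts() {
  sha="$1"
  looks_like_sha "$sha" || { echo "not a commit SHA: $sha" >&2; return 2; }
  # Confirm the commit exists in this repository before changing anything.
  git cat-file -e "${sha}^{commit}" || return 1
  git checkout "$sha" -- contracts/ &&
    (cd site && npm run build) &&
    git commit -m "Restore contracts/ from $sha" &&
    git push origin main
}
```

If the restored files are identical to what is already on main, git commit exits non-zero and the push is skipped, which is the desired outcome in that case.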

Escalation

Escalate issues when:

Contact: See SECURITY.md for escalation contacts.

Related Documentation