
Observability

Risoluto provides multiple observability surfaces: Prometheus metrics, real-time event streams, audit logs, request tracing, and structured logs.

Prometheus Metrics

Available at GET /metrics in Prometheus exposition format. Scrape at 10-15 second intervals.
curl -s http://127.0.0.1:4000/metrics
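
For ad-hoc checks outside Prometheus, the text returned by /metrics can be parsed directly. The following is a minimal sketch, assuming the endpoint emits standard exposition-format sample lines; `parse_metrics` is a hypothetical helper rather than part of Risoluto, and it ignores HELP/TYPE comments and any optional timestamps.

```python
import re

# One sample line: name, optional {labels}, value, optional timestamp.
SAMPLE_RE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$'
)
# One label pair inside the braces: key="value" (with backslash escapes).
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # skip anything that is not a plain sample line
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples
```

Passing the output of the curl command above to `parse_metrics` yields one tuple per series, which is convenient for quick assertions in smoke tests.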

Metric reference

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| risoluto_http_requests_total | Counter | method, status | Total HTTP requests |
| risoluto_http_request_duration_seconds | Histogram | method | Request latency distribution |
Useful queries:
# Error rate (5xx)
sum(rate(risoluto_http_requests_total{status=~"5.."}[5m]))
/ sum(rate(risoluto_http_requests_total[5m]))

# P95 latency
histogram_quantile(0.95, rate(risoluto_http_request_duration_seconds_bucket[5m]))
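
The P95 value is an estimate: histogram_quantile works from cumulative bucket counts and interpolates linearly inside the bucket that contains the target rank. The sketch below is a simplified Python rendering of that idea, not Prometheus's actual implementation; the bucket bounds and counts in the usage note are made-up illustrative values.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper_bound,
    with the last entry being the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations in the window
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # first bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            if count == prev_count:
                return prev_bound  # empty bucket, nothing to interpolate
            # Linear interpolation between the bucket's lower and upper bound.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan
```

With buckets `[(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]` and `q=0.95`, the rank of 95 falls in the 0.5 to 1.0 second bucket and the estimate is approximately 0.8125 seconds. The accuracy of the result therefore depends on how finely the bucket boundaries are chosen.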
| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| risoluto_orchestrator_polls_total | Counter | status | Poll cycles (success, error, skipped) |
Useful queries:
# Polls per minute
rate(risoluto_orchestrator_polls_total{status="success"}[5m]) * 60

# Stall detection: no polls of any status in 10 minutes
sum(increase(risoluto_orchestrator_polls_total[10m])) == 0
| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| risoluto_agent_runs_total | Counter | outcome | Agent completions (completed, failed, oom, stalled) |
Useful queries:
# Success rate (1h)
sum(rate(risoluto_agent_runs_total{outcome="completed"}[1h]))
/ sum(rate(risoluto_agent_runs_total[1h]))

# Runs by outcome (stacked chart)
sum by (outcome) (increase(risoluto_agent_runs_total[1h]))
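
The success-rate query is a ratio of counter growth over the window. As a rough worked example, assuming two snapshots of risoluto_agent_runs_total taken at the start and end of the window, the arithmetic looks like the sketch below. `success_ratio` is a hypothetical helper; Prometheus's rate() additionally handles counter resets and extrapolation, which this sketch does not.

```python
def success_ratio(start, end):
    """Share of runs that completed between two counter snapshots.

    start, end: dicts mapping outcome label -> counter value.
    Returns None when no runs finished in the window.
    """
    # Growth of each outcome's counter over the window.
    deltas = {outcome: end[outcome] - start.get(outcome, 0) for outcome in end}
    total = sum(deltas.values())
    if total == 0:
        return None  # avoid dividing by zero when nothing ran
    return deltas.get("completed", 0) / total
```

For instance, snapshots of `{"completed": 100, "failed": 10, "stalled": 2}` and `{"completed": 118, "failed": 11, "stalled": 3}` (illustrative numbers) give 18 completed runs out of 20, a ratio of 0.9.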

Scrape configuration

scrape_configs:
  - job_name: risoluto
    static_configs:
      - targets: ["risoluto:4000"]
    metrics_path: /metrics
    scrape_interval: 10s

Alert rules

groups:
  - name: risoluto
    rules:
      - alert: HighErrorRate
        expr: >
          sum(rate(risoluto_http_requests_total{status=~"5.."}[5m]))
          / sum(rate(risoluto_http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "HTTP error rate above 5%"

      - alert: AgentRunFailures
        expr: rate(risoluto_agent_runs_total{outcome="failed"}[15m]) > 0
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Agent runs failing consistently"

      - alert: PollStalled
        expr: sum(increase(risoluto_orchestrator_polls_total[10m])) == 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Orchestrator hasn't polled in 10 minutes"

Request Tracing

Every request gets an X-Request-ID header:
  • Incoming: If the client sends X-Request-ID, it is preserved
  • Generated: Otherwise a UUID v4 is assigned
  • Response: The ID is always returned in the response headers
Use this ID to correlate logs, metrics, and audit entries for a single request.
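
The preserve-or-generate rule above can be sketched as follows. This is an illustration of the described behaviour, not Risoluto's actual middleware; `resolve_request_id` is a hypothetical name, and treating an empty header value as absent is an assumption.

```python
import uuid


def resolve_request_id(headers):
    """Return the client's X-Request-ID if present, else a fresh UUID v4."""
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == "x-request-id" and value:
            return value  # incoming ID is preserved unchanged
    return str(uuid.uuid4())  # otherwise a new ID is generated
```

Whatever the function returns is the value that would be echoed back in the response headers, so a client that supplies its own ID can search logs and audit entries for it afterwards.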

Error Tracking

Sentry-compatible error tracking is enabled when SENTRY_DSN is set:
export SENTRY_DSN=https://your-key@sentry.io/project-id
When enabled:
  • Exceptions are captured with full stack traces
  • Breadcrumbs track the last 100 operations
  • Context (issue identifier, attempt count) is attached to every error
  • The DSN is redacted in log output
When SENTRY_DSN is not set, a no-op tracker is used with zero overhead.
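
The pattern described here, a bounded breadcrumb buffer plus a no-op fallback, can be sketched as below. All class and function names are hypothetical, nothing is sent to Sentry, and this is not Risoluto's implementation; it only illustrates how a 100-entry breadcrumb limit and an environment-driven switch typically fit together.

```python
import os
from collections import deque

MAX_BREADCRUMBS = 100  # from the text: the last 100 operations are kept


class NoopTracker:
    """Stand-in used when no DSN is configured; every call does nothing."""

    def add_breadcrumb(self, message):
        pass

    def capture(self, error, context=None):
        return None


class BufferingTracker:
    """Keeps recent breadcrumbs and builds an event dict on capture."""

    def __init__(self, dsn):
        self.dsn = dsn
        # deque with maxlen silently drops the oldest entry when full.
        self.breadcrumbs = deque(maxlen=MAX_BREADCRUMBS)

    def add_breadcrumb(self, message):
        self.breadcrumbs.append(message)

    def capture(self, error, context=None):
        return {
            "error": repr(error),
            "context": dict(context or {}),
            "breadcrumbs": list(self.breadcrumbs),
        }


def make_tracker(env=None):
    """Pick the tracker based on SENTRY_DSN, defaulting to the process env."""
    env = os.environ if env is None else env
    dsn = env.get("SENTRY_DSN")
    return BufferingTracker(dsn) if dsn else NoopTracker()
```

Because both trackers expose the same two methods, calling code never needs to check whether error tracking is enabled.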

Process Logs

Logs are emitted to stdout via Pino in structured format:
| Variable | Values | Default |
| --- | --- | --- |
| RISOLUTO_LOG_FORMAT | logfmt, json | logfmt |
| LOG_LEVEL | trace through fatal | info |
# Persist logs to file
node dist/cli/index.js --port 4000 2>&1 | tee risoluto.log

# JSON format for log aggregators (Loki, Datadog, etc.)
RISOLUTO_LOG_FORMAT=json node dist/cli/index.js --port 4000
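
With JSON output, log lines can be filtered by severity with a few lines of script. The sketch below assumes Pino's default numeric levels (trace 10 through fatal 60) and also accepts string labels in case the level field is emitted that way; whether Risoluto customises the field is not stated here, so treat the field name and shape as assumptions. `filter_logs` is a hypothetical helper.

```python
import json

# Pino's default numeric level values.
PINO_LEVELS = {"trace": 10, "debug": 20, "info": 30, "warn": 40, "error": 50, "fatal": 60}


def filter_logs(lines, min_level="warn"):
    """Return parsed JSON log records at or above min_level."""
    threshold = PINO_LEVELS[min_level]
    kept = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines such as startup banners
        if not isinstance(record, dict):
            continue
        level = record.get("level")
        if isinstance(level, str):
            level = PINO_LEVELS.get(level.lower())  # map label to number
        if isinstance(level, (int, float)) and level >= threshold:
            kept.append(record)
    return kept
```

For example, `filter_logs(open("risoluto.log"), "error")` would return only error and fatal records from a file captured with the tee command above, provided the process was started with RISOLUTO_LOG_FORMAT=json.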

Data Persistence

All attempt and event data is stored in SQLite (risoluto.db) with WAL mode:
.risoluto/
├── risoluto.db           # Attempts, events, issue index
├── risoluto.db-shm       # WAL shared-memory
├── risoluto.db-wal       # Write-ahead log
├── config/               # Operator config overlay
├── secrets.enc           # Encrypted credential store
├── secrets.audit.log     # Access audit trail
└── master.key            # Encryption master key

Querying data

# Via API
curl -s http://127.0.0.1:4000/api/v1/state | jq
curl -s http://127.0.0.1:4000/api/v1/NIN-6/attempts | jq

# Via CLI helper
./risoluto MT-42
./risoluto --attempt <attempt-id>

# Direct SQLite (opened read-only; WAL mode allows reads while Risoluto is writing)
sqlite3 -readonly .risoluto/risoluto.db \
  "SELECT attempt_id, issue_identifier, status FROM attempts ORDER BY started_at DESC LIMIT 10;"

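The same query can be run from a script without risking writes to the database. This is a minimal sketch using Python's standard sqlite3 module with a read-only URI; the table and column names are taken from the sqlite3 command above, while `recent_attempts` is a hypothetical helper.

```python
import sqlite3
from pathlib import Path


def recent_attempts(db_path, limit=10):
    """Return the most recent attempts as (attempt_id, issue_identifier, status)."""
    # mode=ro opens the file read-only; as_uri() handles path escaping.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        cursor = conn.execute(
            "SELECT attempt_id, issue_identifier, status "
            "FROM attempts ORDER BY started_at DESC LIMIT ?",
            (limit,),
        )
        return cursor.fetchall()
    finally:
        conn.close()
```

Calling `recent_attempts(".risoluto/risoluto.db")` returns a list of tuples, newest first. Any attempt to write through this connection raises an error instead of modifying the file.
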
What’s Next

Monitoring Stack

Set up Prometheus and Grafana with pre-built queries and alerts.

Troubleshooting

Diagnose common failures and recovery procedures.

Notifications

Configure Slack notifications for agent lifecycle events.

Dashboard Guide

Board view, issue inspector, and live event stream.
Last modified on March 31, 2026