Telemetry Operations

Internal setup, validation, and provider switching guide for runtime telemetry.

Verification checklist

  1. Set TELEMETRY_ENABLED=true.
  2. Keep module gates enabled (BACKEND_MODULE_OBSERVABILITY_ENABLED=true, VITE_MODULE_OBSERVABILITY_ENABLED=true).
  3. Select provider mode (otlp, posthog_logs, or signoz).
  4. Start backend and confirm no startup telemetry errors.
  5. Trigger API requests and verify log/trace arrival in selected backend.
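The gating steps above can be sketched as a startup check. The flag names TELEMETRY_ENABLED and BACKEND_MODULE_OBSERVABILITY_ENABLED come from the checklist; the provider-mode variable name TELEMETRY_PROVIDER_MODE and the helper itself are illustrative assumptions, not the project's actual bootstrap code.

```typescript
// Illustrative startup gate check for the telemetry env flags above.
// TELEMETRY_PROVIDER_MODE is an assumed variable name for the provider mode.
type ProviderMode = "otlp" | "posthog_logs" | "signoz";

const PROVIDER_MODES: ProviderMode[] = ["otlp", "posthog_logs", "signoz"];

function validateTelemetryEnv(env: Record<string, string | undefined>): string[] {
  const errors: string[] = [];
  if (env.TELEMETRY_ENABLED !== "true") {
    errors.push("TELEMETRY_ENABLED must be true");
  }
  if (env.BACKEND_MODULE_OBSERVABILITY_ENABLED !== "true") {
    errors.push("BACKEND_MODULE_OBSERVABILITY_ENABLED must be true");
  }
  if (!PROVIDER_MODES.includes(env.TELEMETRY_PROVIDER_MODE as ProviderMode)) {
    errors.push("TELEMETRY_PROVIDER_MODE must be one of otlp | posthog_logs | signoz");
  }
  return errors;
}
```

Running this before SDK initialization surfaces misconfiguration as a single list of errors instead of silent no-op telemetry.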

Provider switch model

Switching between PostHog and SigNoz should require only environment variable changes, with no application code edits.

  • Keep app code on generic OpenTelemetry APIs.
  • Use provider mode + endpoint/token vars to redirect telemetry export.
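A minimal sketch of that redirection: resolve the export target from env so application code stays on generic OpenTelemetry APIs. The endpoint and token variable names (SIGNOZ_OTLP_ENDPOINT, SIGNOZ_INGESTION_KEY, POSTHOG_OTLP_ENDPOINT, POSTHOG_API_KEY) and the default URLs are assumptions for illustration, not the project's canonical names.

```typescript
// Sketch: pick the OTLP export target per provider mode from env.
// All variable names and fallback URLs below are illustrative placeholders.
interface ExporterConfig {
  url: string;
  headers: Record<string, string>;
}

function resolveExporterConfig(
  mode: "posthog_logs" | "signoz",
  env: Record<string, string | undefined>,
): ExporterConfig {
  if (mode === "signoz") {
    return {
      url: env.SIGNOZ_OTLP_ENDPOINT ?? "http://localhost:4318/v1/traces",
      headers: env.SIGNOZ_INGESTION_KEY
        ? { "signoz-ingestion-key": env.SIGNOZ_INGESTION_KEY }
        : {},
    };
  }
  return {
    url: env.POSTHOG_OTLP_ENDPOINT ?? "",
    headers: { Authorization: `Bearer ${env.POSTHOG_API_KEY ?? ""}` },
  };
}
```

The resulting config is what gets handed to the OTLP exporter; the exporter class itself never changes between providers.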

p99 baseline focus

  • Prioritize p99 latency and error-rate dashboards by endpoint group.
  • Track queue processing duration/failures as secondary runtime SLOs.

SLI contract (pass 1)

Use one shared latency SLI naming model across providers:

  • service: backend service (athas-backend)
  • route_group: stable endpoint family (auth, billing, notifications, webhooks, users)
  • method: HTTP method
  • status_class: response class (2xx, 4xx, 5xx)
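The shared label set above can be expressed as a type plus a small builder. The attribute names (service, route_group, method, status_class) come from the contract; the helper is a sketch, and folding 3xx and anything unclassified into "other" is an illustrative assumption, since the contract only names 2xx/4xx/5xx.

```typescript
// Illustrative encoding of the shared SLI label set.
// "other" for non-2xx/4xx/5xx responses is an assumption, not contract.
type StatusClass = "2xx" | "4xx" | "5xx" | "other";

interface SliLabels {
  service: string;
  route_group: string;
  method: string;
  status_class: StatusClass;
}

function statusClassOf(status: number): StatusClass {
  if (status >= 200 && status < 300) return "2xx";
  if (status >= 400 && status < 500) return "4xx";
  if (status >= 500 && status < 600) return "5xx";
  return "other";
}

function sliLabels(routeGroup: string, method: string, status: number): SliLabels {
  return {
    service: "athas-backend",
    route_group: routeGroup,
    method,
    status_class: statusClassOf(status),
  };
}
```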

Primary latency indicators per route group:

  • http.server.duration p50
  • http.server.duration p95
  • http.server.duration p99

Error indicator:

  • server error-rate = 5xx / total requests over 5m
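The math behind both indicators is simple enough to pin down in code: a percentile over one 5m bucket of request durations, and 5xx divided by total for error rate. This uses the nearest-rank percentile method as an illustrative choice; providers may interpolate differently.

```typescript
// Nearest-rank percentile over one bucket of durations (ms).
function percentile(durationsMs: number[], p: number): number {
  if (durationsMs.length === 0) return NaN;
  const sorted = [...durationsMs].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length); // nearest-rank method
  return sorted[Math.min(rank, sorted.length) - 1];
}

// Error rate as defined above: 5xx / total requests over the window.
function errorRate(count5xx: number, total: number): number {
  return total === 0 ? 0 : count5xx / total;
}
```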

Dashboard parity contract

Keep panel names and thresholds identical between PostHog and SigNoz:

  1. API Latency p50/p95/p99 by route group (5m window)
  2. Error-rate by route group (5m window)
  3. Slowest endpoints table (p99, count, error-rate)
  4. Queue processing latency and failures

Alert thresholds (initial)

  • Warning: p95 > 800ms for 10m (route group)
  • Critical: p99 > 1500ms for 10m (route group)
  • Critical: error-rate > 2% for 10m (route group)
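The three thresholds above, evaluated for one route group's 10m window, can be sketched as a single function. The `WindowStats` shape is an assumption; the numbers are the initial thresholds from this section.

```typescript
// Sketch: evaluate the initial alert thresholds for one route group.
// Assumes stats are already aggregated over the sustained 10m window.
interface WindowStats {
  p95Ms: number;
  p99Ms: number;
  errorRate: number; // fraction: 0.02 = 2%
}

type AlertLevel = "ok" | "warning" | "critical";

function evaluateAlerts(stats: WindowStats): AlertLevel {
  if (stats.p99Ms > 1500 || stats.errorRate > 0.02) return "critical";
  if (stats.p95Ms > 800) return "warning";
  return "ok";
}
```

Critical conditions are checked first so a window that breaches both levels reports the more severe one.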

Tune these thresholds against real traffic baselines after one week of data.

Provider parity query contract (pass 2)

Maintain equivalent query intent across providers even if syntax differs:

  • Latency percentile query grouped by route_group and method (p50/p95/p99 over 5m buckets)
  • Error-rate query grouped by route_group (5xx / total over 5m buckets)
  • Slow endpoint table sorted by p99 with request count and error-rate

Use this mapping as the canonical panel contract:

  1. api_latency_percentiles_by_route_group
  2. api_error_rate_by_route_group
  3. api_slowest_endpoints_p99
  4. queue_latency_and_failures
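The canonical panel list above can be carried as one typed constant so that provisioning code for either provider iterates the same source of truth. This is a sketch, not the generated contract schema.

```typescript
// The canonical panel contract as a single typed constant.
const PANEL_CONTRACT = [
  "api_latency_percentiles_by_route_group",
  "api_error_rate_by_route_group",
  "api_slowest_endpoints_p99",
  "queue_latency_and_failures",
] as const;

type PanelId = (typeof PANEL_CONTRACT)[number];

// Guard for validating panel ids read from provider configs.
function isPanelId(id: string): id is PanelId {
  return (PANEL_CONTRACT as readonly string[]).includes(id);
}
```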

Concrete provider query templates live in /internal/telemetry-dashboard-queries.

Versioned contract artifacts:

  • docs/telemetry/dashboard-contract.json
  • docs/telemetry/alerts-contract.yaml

Incident runbook (latency)

  1. Confirm spike scope (single route group vs global).
  2. Compare p95/p99 and error-rate for same time window.
  3. Correlate with deployment, queue backlog, and provider incidents.
  4. Mitigate (rollback, feature-flag reduction, or queue drain strategy).
  5. Record timeline and update threshold/runbook notes.

Commands

bun run perf:baseline
bun run perf:baseline:enforce
bun run telemetry:latency:check -- --base-url http://localhost:3000 --paths /health/ready,/health/live
bun run telemetry:contracts:sync
bun run telemetry:contracts:check

Contract sync outputs provider provisioning bundles used by CI automation:

  • docs/telemetry/generated/signoz-provisioning.json
  • docs/telemetry/generated/posthog-provisioning.json

For docs integrity around migration/governance:

bun run check:docs-canonical
bun run check:docs-canonical:report
