Telemetry Operations
internalInternal setup, validation, and provider switching guide for runtime telemetry.
Verification checklist
- Set
TELEMETRY_ENABLED=true. - Keep module gates enabled (
BACKEND_MODULE_OBSERVABILITY_ENABLED=true,VITE_MODULE_OBSERVABILITY_ENABLED=true). - Select provider mode (
otlp,posthog_logs, orsignoz). - Start backend and confirm no startup telemetry errors.
- Trigger API requests and verify log/trace arrival in selected backend.
Provider switch model
Switching between PostHog and SigNoz should only require env changes.
- Keep app code on generic OpenTelemetry APIs.
- Use provider mode + endpoint/token vars to redirect telemetry export.
p99 baseline focus
- Prioritize p99 latency and error-rate dashboards by endpoint group.
- Track queue processing duration/failures as secondary runtime SLOs.
SLI contract (pass 1)
Use one shared latency SLI naming model across providers:
service: backend service (athas-backend)route_group: stable endpoint family (auth,billing,notifications,webhooks,users)method: HTTP methodstatus_class: response class (2xx,4xx,5xx)
Primary latency indicators per route group:
http.server.durationp50http.server.durationp95http.server.durationp99
Error indicator:
- server error-rate =
5xx / total requestsover 5m
Dashboard parity contract
Keep panel names and thresholds identical between PostHog and SigNoz:
- API Latency p50/p95/p99 by route group (5m window)
- Error-rate by route group (5m window)
- Slowest endpoints table (p99, count, error-rate)
- Queue processing latency and failures
Alert thresholds (initial)
- Warning: p95 > 800ms for 10m (route group)
- Critical: p99 > 1500ms for 10m (route group)
- Critical: error-rate > 2% for 10m (route group)
Tune thresholds by real traffic baselines after one week of data.
Provider parity query contract (pass 2)
Maintain equivalent query intent across providers even if syntax differs:
- Latency percentile query grouped by
route_groupandmethod(p50/p95/p99 over 5m buckets) - Error-rate query grouped by
route_group(5xx / totalover 5m buckets) - Slow endpoint table sorted by p99 with request count and error-rate
Use this mapping as the canonical panel contract:
api_latency_percentiles_by_route_groupapi_error_rate_by_route_groupapi_slowest_endpoints_p99queue_latency_and_failures
Concrete provider query templates live in /internal/telemetry-dashboard-queries.
Versioned contract artifacts:
docs/telemetry/dashboard-contract.jsondocs/telemetry/alerts-contract.yaml
Incident runbook (latency)
- Confirm spike scope (single route group vs global).
- Compare p95/p99 and error-rate for same time window.
- Correlate with deployment, queue backlog, and provider incidents.
- Mitigate (rollback, feature-flag reduction, or queue drain strategy).
- Record timeline and update threshold/runbook notes.
Commands
bun run perf:baseline
bun run perf:baseline:enforce
bun run telemetry:latency:check -- --base-url http://localhost:3000 --paths /health/ready,/health/live
bun run telemetry:contracts:sync
bun run telemetry:contracts:checkContract sync outputs provider provisioning bundles used by CI automation:
docs/telemetry/generated/signoz-provisioning.jsondocs/telemetry/generated/posthog-provisioning.json
For docs integrity around migration/governance:
bun run check:docs-canonical
bun run check:docs-canonical:report