Athas Boilerplate

Telemetry Dashboard Queries

internal

Provider-parity query templates and alert mapping for p95/p99 operations.

Purpose

This page provides a query pack for runtime telemetry operations. Every template follows three parity rules:

  • Keep panel IDs identical between providers.
  • Keep threshold semantics identical between providers.
  • Allow syntax differences (PostHog vs SigNoz) without changing SLI meaning.

Panel IDs (canonical)

  1. api_latency_percentiles_by_route_group
  2. api_error_rate_by_route_group
  3. api_slowest_endpoints_p99
  4. queue_latency_and_failures
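The parity rule for panel IDs can be checked mechanically. The following is a minimal sketch, assuming each provider's dashboard can be exported as a plain list of panel IDs; the function name `panel_id_drift` and the export step are hypothetical and not part of the telemetry tooling.

```python
# Canonical panel IDs, copied from the list above.
CANONICAL_PANEL_IDS = (
    "api_latency_percentiles_by_route_group",
    "api_error_rate_by_route_group",
    "api_slowest_endpoints_p99",
    "queue_latency_and_failures",
)


def panel_id_drift(provider_panel_ids):
    """Return (missing, unexpected) panel IDs relative to the canonical list.

    `provider_panel_ids` is any iterable of panel ID strings exported from
    one provider's dashboard (hypothetical input format).
    """
    present = set(provider_panel_ids)
    canonical = set(CANONICAL_PANEL_IDS)
    missing = sorted(canonical - present)
    unexpected = sorted(present - canonical)
    return missing, unexpected
```

Running the check once per provider and requiring two empty lists from both is one way to catch a renamed or forgotten panel before the dashboards diverge.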

SigNoz templates

Use these as PromQL-style templates in SigNoz dashboards. Replace metric names if your collector renames histograms.

# p50 latency by route_group (5m)
histogram_quantile(
  0.50,
  sum by (le, route_group) (
    rate(http_server_request_duration_milliseconds_bucket{service_name="athas-backend"}[5m])
  )
)

# p95 latency by route_group (5m)
histogram_quantile(
  0.95,
  sum by (le, route_group) (
    rate(http_server_request_duration_milliseconds_bucket{service_name="athas-backend"}[5m])
  )
)

# p99 latency by route_group (5m)
histogram_quantile(
  0.99,
  sum by (le, route_group) (
    rate(http_server_request_duration_milliseconds_bucket{service_name="athas-backend"}[5m])
  )
)

# error rate (%) by route_group (5m)
100 * (
  sum by (route_group) (rate(http_server_request_count{service_name="athas-backend",status_class="5xx"}[5m]))
  /
  sum by (route_group) (rate(http_server_request_count{service_name="athas-backend"}[5m]))
)
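To read the percentile panels correctly it helps to know how a bucketed quantile is estimated. The sketch below is a simplified Python re-implementation of the linear interpolation that PromQL's histogram_quantile applies to classic cumulative buckets. It is illustrative only and is not SigNoz code; the function name `estimate_quantile` and the bucket boundaries in the usage note are assumptions.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound_ms, cumulative_count) pairs whose
    last upper bound is math.inf, mirroring the `le` label of a histogram.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    ordered = sorted(buckets)
    total = ordered[-1][1]
    if total == 0:
        return math.nan  # no observations in the window

    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # the lowest bucket is assumed to start at 0
    for bound, count in ordered:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the +Inf bucket: report the highest
                # finite bound, since no upper edge exists to interpolate to.
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound
```

With hypothetical buckets `[(100.0, 50), (250.0, 80), (500.0, 95), (1000.0, 99), (math.inf, 100)]`, the estimate can never exceed 1000 ms, whatever the real latencies were. If the collector's highest finite bucket sits below an alert threshold, that alert cannot fire from this query, so it is worth checking bucket boundaries against the thresholds in use.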

PostHog templates

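Use these as query templates in PostHog logs/traces views. If your ingestion uses different property names, map properties.service, properties.route_group, properties.duration_ms and properties.status_class to their equivalents without changing what each column measures.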

-- p50, p95 and p99 latency by route_group over 5m buckets
SELECT
  toStartOfFiveMinute(timestamp) AS ts,
  properties.route_group AS route_group,
  quantile(0.50)(properties.duration_ms) AS p50_ms,
  quantile(0.95)(properties.duration_ms) AS p95_ms,
  quantile(0.99)(properties.duration_ms) AS p99_ms
FROM logs
WHERE properties.service = 'athas-backend'
GROUP BY ts, route_group
ORDER BY ts DESC, route_group;

-- error rate (%) by route_group over 5m buckets
SELECT
  toStartOfFiveMinute(timestamp) AS ts,
  properties.route_group AS route_group,
  100.0 * sum(if(properties.status_class = '5xx', 1, 0)) / count() AS error_rate_pct
FROM logs
WHERE properties.service = 'athas-backend'
GROUP BY ts, route_group
ORDER BY ts DESC, route_group;

Alert mapping

Both providers alert on the same thresholds, taken from docs/telemetry/alerts-contract.yaml. A condition must hold for the full 10 minutes before the alert fires:

  • warning: p95 > 800ms for 10m
  • critical: p99 > 1500ms for 10m
  • critical: error_rate > 2% for 10m

Validation flow

  1. Execute bun run telemetry:latency:check against the target environment.
  2. Compare the synthetic snapshot with the dashboard p50/p95/p99 trends.
  3. Trigger one controlled latency spike in non-prod and confirm that the alerts fire as mapped above.
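Step 2 of the flow above can be made less subjective with a small comparison helper. The sketch below assumes the synthetic snapshot is available as a list of raw latencies in milliseconds and that the dashboard value is read off manually. The output format of telemetry:latency:check is not documented here, so both function names and the 20% default tolerance are illustrative assumptions.

```python
import math


def nearest_rank_percentile(samples_ms, pct):
    """Return the nearest-rank percentile (pct in 1..100) of raw latency samples."""
    if not samples_ms:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples_ms)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def within_tolerance(synthetic_ms, dashboard_ms, rel_tol=0.2):
    """True when the synthetic value is within rel_tol of the dashboard value."""
    return abs(synthetic_ms - dashboard_ms) <= rel_tol * dashboard_ms
```

Exact agreement should not be expected: the SigNoz panels interpolate over histogram buckets, while this helper works on raw samples, so a relative tolerance is a more realistic check than equality.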
