
SLO Targets and Burn Rates

Service Level Objectives (SLOs) are how Causely translates reliability signals into urgency and action. SLOs define what “good” looks like for reliability and are used by Causely to determine when degradations represent acceptable risk versus issues that require immediate attention.

In Causely, SLOs directly influence how root causes are classified and prioritized. When a root cause puts an SLO at risk or violates it, Causely treats that root cause as more urgent, helping teams focus on the issues most likely to impact users and the business.

SLOs are applied by default at the service level, providing broad coverage with minimal configuration. For teams that need more granular protection, SLOs can also be defined for specific HTTP paths and RPC methods, allowing critical user flows or business transactions to be protected independently of overall service health.

Default SLO Behavior

By default, Causely applies the following SLO targets and burn rate settings to all services:

  • Error rate SLO target: 99.0% (99% of requests must be successful)
  • Latency SLO target: 95.0% (95% of requests must be under the latency threshold)
  • Availability SLO target: 99.0% (99% uptime expected)
  • Burn rate threshold: 4 (budget would be consumed in 6 hours for a 1-day SLO)
  • Burn rate window: 15 minutes (calculation window for burn rate monitoring)

These defaults are designed to catch fast-burning reliability issues early, while avoiding unnecessary noise for brief or low-impact fluctuations.

When SLOs Are Active

SLOs in Causely are evaluated only when traffic is observed for the corresponding entity.

This applies consistently to services, HTTP paths, and RPC methods. SLOs measure how an entity performs when responding to real requests. If no traffic is observed, there is no performance to evaluate, and the SLO remains inactive until requests are seen.

Customizing SLO Behavior

You can customize how SLOs behave in Causely depending on the level of control you need:

  • Service-level SLOs can be customized using labels or service metadata. This is the most common approach and is described in the sections below.
  • HTTP Path and RPC Method SLOs are configured exclusively through the API. These SLOs follow the same core concepts (targets, burn rates, and windows) but apply to specific endpoints rather than entire services. See Setting SLOs on Paths and Methods for details.
  • Default SLO values can also be adjusted programmatically through the API. Documentation and examples for API-based default configuration will be added in a future update.

Supported Labels

You can configure the following SLO-related labels:

  • causely.ai/error-rate-slo-target (default: 99.0)
    Percentage of successful requests expected, for example 99.0. This defines the percentage of requests that must not result in an error to remain within the error SLO.

  • causely.ai/latency-slo-target (default: 95.0)
    Percentage of requests expected to be under the latency threshold, for example 95.0. Note that the latency threshold is automatically learned by Causely, but can be manually adjusted via the Thresholds configuration page.

  • causely.ai/availability-slo-target (default: 99.0)
    Percentage of time the service is expected to be operational, for example 99.0. This defines the proportion of total time the service must remain available, responding successfully to requests without downtime, to remain within the availability SLO.

  • causely.ai/error-rate-burn-rate-threshold (default: 4)
    Rate of error budget burn relative to the SLO target. A default of 4 means that for a 1-day SLO, if errors continue at the current rate, the error budget would be consumed in 6 hours.

  • causely.ai/latency-burn-rate-threshold (default: 4)
    Rate of latency budget burn relative to the SLO target. A default of 4 means that for a 1-day SLO, latency at the current rate would consume the entire budget in 6 hours.

  • causely.ai/availability-burn-rate-threshold (default: 4)
    Rate of availability budget burn relative to the SLO target. A default of 4 means that for a 1-day SLO, availability at the current rate would consume the entire budget in 6 hours.

  • causely.ai/error-rate-burn-rate-window (default: 15)
    Burn rate calculation window (in minutes) used to indicate whether a service is rapidly consuming its error SLO budget.

  • causely.ai/latency-burn-rate-window (default: 15)
    Burn rate calculation window (in minutes) used to indicate whether a service is rapidly consuming its latency SLO budget.

  • causely.ai/availability-burn-rate-window (default: 15)
    Burn rate calculation window (in minutes) used to indicate whether a service is rapidly consuming its availability SLO budget.

Configuration Methods

Using Kubernetes Labels

You can apply labels directly to your Kubernetes services:

# Set error rate SLO target to 99%
kubectl label svc -n <namespace> <service-name> "causely.ai/error-rate-slo-target=99.0"

# Set latency SLO target to 95%
kubectl label svc -n <namespace> <service-name> "causely.ai/latency-slo-target=95.0"

# Set availability SLO target to 99%
kubectl label svc -n <namespace> <service-name> "causely.ai/availability-slo-target=99.0"

# Set error rate burn rate threshold to 2
kubectl label svc -n <namespace> <service-name> "causely.ai/error-rate-burn-rate-threshold=2"

# Set latency burn rate threshold to 2
kubectl label svc -n <namespace> <service-name> "causely.ai/latency-burn-rate-threshold=2"

# Set availability burn rate threshold to 2
kubectl label svc -n <namespace> <service-name> "causely.ai/availability-burn-rate-threshold=2"

# Set error rate burn rate window to 15 minutes
kubectl label svc -n <namespace> <service-name> "causely.ai/error-rate-burn-rate-window=15"

# Set latency burn rate window to 15 minutes
kubectl label svc -n <namespace> <service-name> "causely.ai/latency-burn-rate-window=15"

# Set availability burn rate window to 15 minutes
kubectl label svc -n <namespace> <service-name> "causely.ai/availability-burn-rate-window=15"

Using Nomad Service Tags

If you use Nomad, you can specify these as service tags:

job "example" {
group "app" {
service {
name = "my-service"
port = 8080

tags = [
"causely.ai/error-rate-slo-target=99.0",
"causely.ai/latency-slo-target=95.0",
"causely.ai/availability-slo-target=99.0",
"causely.ai/error-rate-burn-rate-threshold=2",
"causely.ai/latency-burn-rate-threshold=2",
"causely.ai/availability-burn-rate-threshold=2",
"causely.ai/error-rate-burn-rate-window=15",
"causely.ai/latency-burn-rate-window=15",
"causely.ai/availability-burn-rate-window=15"
]
}
}
}
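
After updating the job specification, resubmit the job so the new service tags are registered. Assuming the job above is saved as example.nomad.hcl (a hypothetical filename), a typical sequence looks like:

# Submit (or resubmit) the job with the updated service tags
nomad job run example.nomad.hcl

# Confirm the job is running and the service registration was updated
nomad job status example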

Using Consul Service Metadata

For Consul services, you can configure these using service metadata:

# Register a service with SLO metadata
consul services register \
  -name="my-service" \
  -port=8080 \
  -meta="causely.ai/error-rate-slo-target=99.0" \
  -meta="causely.ai/latency-slo-target=95.0" \
  -meta="causely.ai/availability-slo-target=99.0" \
  -meta="causely.ai/error-rate-burn-rate-threshold=2" \
  -meta="causely.ai/latency-burn-rate-threshold=2" \
  -meta="causely.ai/availability-burn-rate-threshold=2" \
  -meta="causely.ai/error-rate-burn-rate-window=15" \
  -meta="causely.ai/latency-burn-rate-window=15" \
  -meta="causely.ai/availability-burn-rate-window=15"


# Update existing service metadata
consul services register \
  -id="my-service-id" \
  -name="my-service" \
  -port=8080 \
  -meta="causely.ai/error-rate-slo-target=99.0" \
  -meta="causely.ai/latency-slo-target=95.0" \
  -meta="causely.ai/availability-slo-target=99.0" \
  -meta="causely.ai/error-rate-burn-rate-threshold=2" \
  -meta="causely.ai/latency-burn-rate-threshold=2" \
  -meta="causely.ai/availability-burn-rate-threshold=2" \
  -meta="causely.ai/error-rate-burn-rate-window=15" \
  -meta="causely.ai/latency-burn-rate-window=15" \
  -meta="causely.ai/availability-burn-rate-window=15"

Best Practices

  1. Align with SLO policy: Set targets and burn rates that reflect your organization's reliability goals.
  2. Avoid overly aggressive thresholds: High sensitivity may create alert fatigue.
  3. Monitor and adjust: Tune thresholds based on incident reviews and error budget consumption.
  4. Document changes: Record rationale for each SLO configuration.

Example Use Cases

  1. Business-critical services: Set tighter SLO targets, for example 99.9% success and 98% low-latency (see the sketch after this list).
  2. Temporary adjustments: Raise burn rate thresholds during high-traffic events.
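
As a sketch of both cases on a Kubernetes service (the service name, namespace, and values are placeholders), the documented labels can be applied as follows:

# Business-critical service: tighter targets of 99.9% success and 98% of requests under the latency threshold
kubectl label svc -n <namespace> <service-name> "causely.ai/error-rate-slo-target=99.9" --overwrite
kubectl label svc -n <namespace> <service-name> "causely.ai/latency-slo-target=98.0" --overwrite

# High-traffic event: temporarily raise the burn rate threshold so faster budget burn is tolerated
kubectl label svc -n <namespace> <service-name> "causely.ai/error-rate-burn-rate-threshold=8" --overwrite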

Burn Rate Threshold Examples

The burn rate measures how quickly your error or latency budget is being consumed: a burn rate of 1 means the budget is being spent exactly at the rate the SLO allows over its window, while higher values mean faster consumption. The burn rate threshold sets the point at which Causely treats that consumption as a fast-burning issue. Here's a simple example:

Suppose your service has a 1-day SLO budget, meaning it can tolerate a limited amount of errors or latency over 24 hours.

  • Burn rate of 1
    Errors or latency are occurring at exactly the rate the SLO allows, so the full budget would be used up in exactly 24 hours. With the default threshold of 4 this does not raise an alarm, but the service is tracking right at its SLO target.

  • Burn rate of 2
    At the current rate, the service would consume its full 24-hour budget in only 12 hours. A burn rate threshold of 2 would flag this as rapid budget consumption.

  • Burn rate of 4
    Extremely fast-burning behavior: at this pace, the full error or latency budget would be used up in just 6 hours. This is the level the default threshold is tuned to catch.

In practice, burn rate thresholds allow teams to catch reliability problems earlier, before they fully consume the SLO budget.

Burn Rate Window Examples

The burn rate window helps determine how quickly your service is consuming its SLO budget by observing behavior over short time intervals. Below are simple examples to clarify:

  • Short Window (5 minutes)
    Useful for detecting rapid error or latency spikes. For example, if a service suddenly begins failing or slowing down at a high rate, a short burn rate window (like 5 minutes) helps you identify that it's quickly consuming its SLO budget, enabling earlier incident detection. For services where fast detection of degraded performance is critical, consider also shortening the symptom activation delay, which you can manually configure via the Symptom Delay settings.

  • Moderate Window (15 minutes)
    This is the default and provides a good balance between reactivity and noise. It captures bursts of errors or latency that might not last long enough to trigger alerts in a longer window but are still significant.

  • Long Window (60 minutes)
    Best used to detect sustained SLO violations. For example, if a service has a consistent error rate that slowly drains the budget, the longer window provides better confidence that it’s not just a transient blip.
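
For example, a latency-sensitive service that needs the fast detection described in the short-window case above could be given a 5-minute window (the value and service placeholders are illustrative):

# Shorten the latency burn rate window to 5 minutes for quicker detection of spikes
kubectl label svc -n <namespace> <service-name> "causely.ai/latency-burn-rate-window=5" --overwrite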