SLO Targets and Burn Rates

Causely supports customizing SLO (Service Level Objective) targets and burn rate behavior for individual services. You do this by applying labels to your Kubernetes services, tags to your Nomad job files, or metadata to your Consul services. This page describes how to set and tune these labels.

Overview

SLO configuration labels allow you to:

  • Define custom SLO targets for error rates and latency
  • Tune how aggressively burn rate is calculated
  • Override default values that govern budget consumption

Supported Labels

You can configure the following SLO-related labels:

  • causely.ai/error-rate-slo-target (default: 99.0)
    Percentage of successful requests expected, for example 99.0. This defines the percentage of requests that must not result in an error to remain within the error SLO.

  • causely.ai/latency-slo-target (default: 95.0)
    Percentage of requests expected to be under the latency threshold, for example 95.0. The latency threshold itself is learned automatically by Causely, but can be manually adjusted on the Thresholds configuration page.

  • causely.ai/error-rate-burn-rate-threshold (default: 2)
    Rate of error budget burn relative to the SLO target. The default of 2 means that for a 1-day SLO, if errors continue at the current rate, the error budget would be consumed in half a day.

  • causely.ai/latency-burn-rate-threshold (default: 2)
    Rate of latency budget burn relative to the SLO target. The default of 2 means that for a 1-day SLO, latency at the current rate would consume the entire budget in half a day.

  • causely.ai/error-rate-burn-rate-window (default: 15)
    Burn rate calculation window, in minutes, used to indicate whether a service is rapidly consuming its error SLO budget.

  • causely.ai/latency-burn-rate-window (default: 15)
    Burn rate calculation window, in minutes, used to indicate whether a service is rapidly consuming its latency SLO budget.
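
To see how the defaults combine, the quick arithmetic below uses the conventional burn-rate definition (observed error rate divided by the error budget rate implied by the SLO target); the exact formula Causely applies is not spelled out on this page, so treat the result as illustrative only.

# With the default labels:
#   error budget   = 100 - 99.0 = 1% of requests may fail
#   burn rate of 2 = errors arriving at twice the budgeted rate
# So, roughly, a sustained 2% error rate over the 15-minute window
# corresponds to the default burn rate threshold being reached.
slo_target=99.0
threshold=2
echo "scale=2; $threshold * (100 - $slo_target)" | bc   # prints 2.0 (percent of failing requests)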

Configuration Methods

Using Kubernetes Labels

You can apply labels directly to your Kubernetes services:

# Set error rate SLO target to 99%
kubectl label svc -n <namespace> <service-name> "causely.ai/error-rate-slo-target=99.0"

# Set latency SLO target to 95%
kubectl label svc -n <namespace> <service-name> "causely.ai/latency-slo-target=95.0"

# Set error rate burn rate threshold to 2
kubectl label svc -n <namespace> <service-name> "causely.ai/error-rate-burn-rate-threshold=2"

# Set latency burn rate threshold to 2
kubectl label svc -n <namespace> <service-name> "causely.ai/latency-burn-rate-threshold=2"

# Set error rate burn rate window to 15 minutes
kubectl label svc -n <namespace> <service-name> "causely.ai/error-rate-burn-rate-window=15"

# Set latency burn rate window to 15 minutes
kubectl label svc -n <namespace> <service-name> "causely.ai/latency-burn-rate-window=15"
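
Once applied, the labels can be checked, changed, or removed with standard kubectl commands; the service and namespace names remain placeholders, and the 99.5 value is just an example.

# Show the labels currently set on the service
kubectl get svc -n <namespace> <service-name> --show-labels

# Change an existing value (kubectl requires --overwrite to replace a label)
kubectl label svc -n <namespace> <service-name> "causely.ai/error-rate-slo-target=99.5" --overwrite

# Remove a label entirely (note the trailing dash)
kubectl label svc -n <namespace> <service-name> "causely.ai/error-rate-slo-target-"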

Using Nomad Service Tags

If you use Nomad, you can specify these as service tags:

job "example" {
group "app" {
service {
name = "my-service"
port = 8080

tags = [
"causely.ai/error-rate-slo-target=99.0",
"causely.ai/latency-slo-target=95.0",
"causely.ai/error-rate-burn-rate-threshold=2",
"causely.ai/latency-burn-rate-threshold=2",
"causely.ai/error-rate-burn-rate-window=15",
"causely.ai/latency-burn-rate-window=15"
]
}
}
}
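
To roll the tags out, update the job specification and resubmit it with the Nomad CLI; the filename below is just an example.

# Preview the change, then submit the updated job
nomad job plan example.nomad.hcl
nomad job run example.nomad.hcl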

Using Consul Service Metadata

For Consul services, you can configure these using service metadata:

# Register a service with SLO metadata
consul services register \
-name="my-service" \
-port=8080 \
-meta="causely.ai/error-rate-slo-target=99.0" \
-meta="causely.ai/latency-slo-target=95.0" \
-meta="causely.ai/error-rate-burn-rate-threshold=2" \
-meta="causely.ai/latency-burn-rate-threshold=2" \
-meta="causely.ai/error-rate-burn-rate-window=15" \
-meta="causely.ai/latency-burn-rate-window=15"


# Update existing service metadata
consul services register \
-id="my-service-id" \
-name="my-service" \
-port=8080 \
-meta="causely.ai/error-rate-slo-target=99.0" \
-meta="causely.ai/latency-slo-target=95.0" \
-meta="causely.ai/error-rate-burn-rate-threshold=2" \
-meta="causely.ai/latency-burn-rate-threshold=2" \
-meta="causely.ai/error-rate-burn-rate-window=15" \
-meta="causely.ai/latency-burn-rate-window=15"

Best Practices

  1. Align with SLO policy: Make sure per-service targets reflect your organization's reliability goals.
  2. Avoid overly aggressive thresholds: High sensitivity may create alert fatigue.
  3. Monitor and adjust: Tune thresholds based on incident reviews and error budget consumption.
  4. Document changes: Record rationale for each SLO configuration.

Example Use Cases

  1. Business-critical services: Set tighter SLO targets, for example 99.9% success and 98% of requests under the latency threshold (see the sketch after this list).
  2. Temporary adjustments: Temporarily raise burn rate thresholds during high-traffic events to avoid unnecessary alerts.
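
As a sketch of the first use case, the tighter targets could be applied to a hypothetical business-critical service in a single command (service and namespace names are placeholders):

# Tighter targets for a business-critical service; --overwrite replaces any existing values
kubectl label svc -n <namespace> <service-name> \
  "causely.ai/error-rate-slo-target=99.9" \
  "causely.ai/latency-slo-target=98.0" \
  --overwrite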

Burn Rate Threshold Examples

The burn rate threshold controls when Causely flags a service as consuming its error or latency budget too quickly, helping you catch fast-burning issues before the budget is gone. Here's a simple example:

Suppose your service has a 1-day SLO budget, meaning it can tolerate a limited number of errors or slow requests over 24 hours.

  • Burn rate of 1
    Errors or latency at the current rate would use up the entire budget in exactly 24 hours. Nothing is flagged yet, but the service is tracking right at its SLO target.

  • Burn rate of 2
    At the current rate, the service would consume its full 24-hour budget in only 12 hours. With the default threshold of 2, Causely flags this as rapid budget consumption.

  • Burn rate of 4
    Extremely fast-burning behavior: at this pace, the full error or latency budget would be used up in just 6 hours, well above the default threshold.

In practice, burn rate thresholds allow teams to catch reliability problems earlier, before they fully consume the SLO budget.
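
The relationship behind these examples is simply the length of the SLO window divided by the burn rate; the small loop below reproduces the numbers above for a 1-day budget.

# Hours until a 1-day budget is exhausted at a given burn rate: 24 / burn_rate
for rate in 1 2 4; do
  echo "burn rate $rate -> $((24 / rate)) hours"
done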

Burn Rate Window Examples

The burn rate window is the observation interval used to measure how quickly your service is consuming its SLO budget. The examples below illustrate the trade-offs:

  • Short Window (5 minutes)
    Useful for detecting rapid error or latency spikes. For example, if a service suddenly begins failing or slowing down at a high rate, a short burn rate window (like 5 minutes) helps you identify that it's quickly consuming its SLO budget, enabling earlier incident detection (see the label example after this list). For services where fast detection of degraded performance is critical, consider also shortening the symptom activation delay, which you can manually configure via the Symptom Delay settings.

  • Moderate Window (15 minutes)
    This is the default and provides a good balance between reactivity and noise. It captures bursts of errors or latency that might not last long enough to trigger alerts in a longer window but are still significant.

  • Long Window (60 minutes)
    Best used to detect sustained SLO violations. For example, if a service has a consistent error rate that slowly drains the budget, the longer window provides better confidence that it’s not just a transient blip.
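
For example, to use the short window for a service where fast detection matters most, the corresponding label can be lowered to 5 minutes (service and namespace names are placeholders):

# Use a 5-minute latency burn rate window for faster detection
kubectl label svc -n <namespace> <service-name> "causely.ai/latency-burn-rate-window=5" --overwrite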