
SLO Targets and Burn Rates

Service Level Objectives (SLOs) are how Causely translates reliability signals into urgency and action. SLOs define what “good” looks like for reliability and are used by Causely to determine when degradations represent acceptable risk versus issues that require immediate attention.

In Causely, SLOs directly influence how root causes are classified and prioritized. When a root cause puts an SLO at risk or violates it, Causely treats that root cause as more urgent, helping teams focus on the issues most likely to impact users and the business.

SLOs are applied by default at the service level, providing broad coverage with minimal configuration. For teams that need more granular protection, SLOs can also be defined for specific HTTP paths and RPC methods, allowing critical user flows or business transactions to be protected independently of overall service health.

Default SLO Behavior

By default, Causely applies the following SLO targets and burn rate settings to all services:

  • Error rate SLO target: 99.0% (99% of requests must be successful)
  • Latency SLO target: 95.0% (95% of requests must be under the latency threshold)
  • Availability SLO target: 99.0% (99% uptime expected)
  • Burn rate threshold: 4 (budget would be consumed in 6 hours for a 1-day SLO)
  • Burn rate window: 15 minutes (calculation window for burn rate monitoring)

These defaults are designed to catch fast-burning reliability issues early, while avoiding unnecessary noise for brief or low-impact fluctuations.

When SLOs Are Active

SLOs in Causely are evaluated only when traffic is observed for the corresponding entity.

This applies consistently to services, HTTP paths, and RPC methods. SLOs measure how an entity performs when responding to real requests. If no traffic is observed, there is no performance to evaluate, and the SLO remains inactive until requests are seen.

Customizing SLO Behavior

You can customize how SLOs behave in Causely depending on the level of control you need:

  • Service-level SLOs can be customized using labels or service metadata. This is the most common approach and is described in the sections below.
  • HTTP Path and RPC Method SLOs are configured exclusively through the API. These SLOs follow the same core concepts (targets, burn rates, and windows) but apply to specific endpoints rather than entire services. See Setting SLOs on Paths and Methods for details.
  • Default SLO values can also be adjusted programmatically through the API. Documentation and examples for API-based default configuration will be added in a future update.

Supported Labels

You can configure the following SLO-related labels:

  • causely.ai/error-rate-slo-target (default: 99.0)
    Percentage of successful requests expected, for example 99.0. This defines the percentage of requests that must not result in an error to remain within the error SLO.

  • causely.ai/latency-slo-target (default: 95.0)
    Percentage of requests expected to be under the latency threshold, for example 95.0. Note that the latency threshold is automatically learned by Causely, but can be manually adjusted via the Thresholds configuration page.

  • causely.ai/availability-slo-target (default: 99.0)
    Percentage of time the service is expected to be operational, for example 99.0. This defines the proportion of total time the service must remain available, responding successfully to requests without downtime, to remain within the availability SLO.

  • causely.ai/error-rate-burn-rate-threshold (default: 4)
    Rate of error budget burn relative to the SLO target. A default of 4 means that for a 1-day SLO, if errors continue at the current rate, the error budget would be consumed in 6 hours.

  • causely.ai/latency-burn-rate-threshold (default: 4)
    Rate of latency budget burn relative to the SLO target. A default of 4 means that for a 1-day SLO, latency at the current rate would consume the entire budget in 6 hours.

  • causely.ai/availability-burn-rate-threshold (default: 4)
    Rate of availability budget burn relative to the SLO target. A default of 4 means that for a 1-day SLO, availability at the current rate would consume the entire budget in 6 hours.

  • causely.ai/error-rate-burn-rate-window (default: 15)
    Burn rate calculation window (in minutes) used to indicate whether a service is rapidly consuming its error SLO budget.

  • causely.ai/latency-burn-rate-window (default: 15)
    Burn rate calculation window (in minutes) used to indicate whether a service is rapidly consuming its latency SLO budget.

  • causely.ai/availability-burn-rate-window (default: 15)
    Burn rate calculation window (in minutes) used to indicate whether a service is rapidly consuming its availability SLO budget.

Configuration Methods

Using Kubernetes Labels

You can apply labels directly to your Kubernetes services:

# Set error rate SLO target to 99%
kubectl label svc -n <namespace> <service-name> "causely.ai/error-rate-slo-target=99.0"

# Set latency SLO target to 95%
kubectl label svc -n <namespace> <service-name> "causely.ai/latency-slo-target=95.0"

# Set availability SLO target to 99%
kubectl label svc -n <namespace> <service-name> "causely.ai/availability-slo-target=99.0"

# Set error rate burn rate threshold to 2
kubectl label svc -n <namespace> <service-name> "causely.ai/error-rate-burn-rate-threshold=2"

# Set latency burn rate threshold to 2
kubectl label svc -n <namespace> <service-name> "causely.ai/latency-burn-rate-threshold=2"

# Set availability burn rate threshold to 2
kubectl label svc -n <namespace> <service-name> "causely.ai/availability-burn-rate-threshold=2"

# Set error rate burn rate window to 15 minutes
kubectl label svc -n <namespace> <service-name> "causely.ai/error-rate-burn-rate-window=15"

# Set latency burn rate window to 15 minutes
kubectl label svc -n <namespace> <service-name> "causely.ai/latency-burn-rate-window=15"

# Set availability burn rate window to 15 minutes
kubectl label svc -n <namespace> <service-name> "causely.ai/availability-burn-rate-window=15"

Using Nomad Service Tags

If you use Nomad, you can specify these as service tags:

job "example" {
group "app" {
service {
name = "my-service"
port = 8080

tags = [
"causely.ai/error-rate-slo-target=99.0",
"causely.ai/latency-slo-target=95.0",
"causely.ai/availability-slo-target=99.0",
"causely.ai/error-rate-burn-rate-threshold=2",
"causely.ai/latency-burn-rate-threshold=2",
"causely.ai/availability-burn-rate-threshold=2",
"causely.ai/error-rate-burn-rate-window=15",
"causely.ai/latency-burn-rate-window=15",
"causely.ai/availability-burn-rate-window=15"
]
}
}
}
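
After updating the job specification, resubmit the job so the new service tags are registered. Assuming the job above is saved as example.nomad.hcl (a hypothetical filename), a typical sequence looks like:

# Submit (or resubmit) the job with the updated service tags
nomad job run example.nomad.hcl

# Confirm the job is running and the service registration was updated
nomad job status example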

Using Consul Service Metadata

For Consul services, you can configure these using service metadata:

# Register a service with SLO metadata
consul services register \
  -name="my-service" \
  -port=8080 \
  -meta="causely.ai/error-rate-slo-target=99.0" \
  -meta="causely.ai/latency-slo-target=95.0" \
  -meta="causely.ai/availability-slo-target=99.0" \
  -meta="causely.ai/error-rate-burn-rate-threshold=2" \
  -meta="causely.ai/latency-burn-rate-threshold=2" \
  -meta="causely.ai/availability-burn-rate-threshold=2" \
  -meta="causely.ai/error-rate-burn-rate-window=15" \
  -meta="causely.ai/latency-burn-rate-window=15" \
  -meta="causely.ai/availability-burn-rate-window=15"


# Update existing service metadata
consul services register \
  -id="my-service-id" \
  -name="my-service" \
  -port=8080 \
  -meta="causely.ai/error-rate-slo-target=99.0" \
  -meta="causely.ai/latency-slo-target=95.0" \
  -meta="causely.ai/availability-slo-target=99.0" \
  -meta="causely.ai/error-rate-burn-rate-threshold=2" \
  -meta="causely.ai/latency-burn-rate-threshold=2" \
  -meta="causely.ai/availability-burn-rate-threshold=2" \
  -meta="causely.ai/error-rate-burn-rate-window=15" \
  -meta="causely.ai/latency-burn-rate-window=15" \
  -meta="causely.ai/availability-burn-rate-window=15"

Best Practices

  1. Align with SLO policy: Set targets and burn rates that reflect your organization's reliability goals.
  2. Avoid overly aggressive thresholds: High sensitivity may create alert fatigue.
  3. Monitor and adjust: Tune thresholds based on incident reviews and error budget consumption.
  4. Document changes: Record rationale for each SLO configuration.

Example Use Cases

  1. Business-critical services: Set tighter SLO targets, for example 99.9% success and 98% low-latency (see the sketch after this list).
  2. Temporary adjustments: Raise burn rate thresholds during high-traffic events.
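
As a sketch of both cases on a Kubernetes service (the service name, namespace, and values are placeholders), the documented labels can be applied as follows:

# Business-critical service: tighter targets of 99.9% success and 98% of requests under the latency threshold
kubectl label svc -n <namespace> <service-name> "causely.ai/error-rate-slo-target=99.9" --overwrite
kubectl label svc -n <namespace> <service-name> "causely.ai/latency-slo-target=98.0" --overwrite

# High-traffic event: temporarily raise the burn rate threshold so faster budget burn is tolerated
kubectl label svc -n <namespace> <service-name> "causely.ai/error-rate-burn-rate-threshold=8" --overwrite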

Burn Rate Threshold Examples

The burn rate measures how quickly your error or latency budget is being consumed: a burn rate of 1 means the budget is being spent exactly at the rate the SLO allows over its window, while higher values mean faster consumption. The burn rate threshold sets the point at which Causely treats that consumption as a fast-burning issue. Here's a simple example:

Suppose your service has a 1-day SLO budget, meaning it can tolerate a limited amount of errors or latency over 24 hours.

  • Burn rate of 1
    Errors or latency are occurring at exactly the rate the SLO allows, so the full budget would be used up in exactly 24 hours. With the default threshold of 4 this does not raise an alarm, but the service is tracking right at its SLO target.

  • Burn rate of 2
    At the current rate, the service would consume its full 24-hour budget in only 12 hours. A burn rate threshold of 2 would flag this as rapid budget consumption.

  • Burn rate of 4
    Extremely fast-burning behavior: at this pace, the full error or latency budget would be used up in just 6 hours. This is the level the default threshold is tuned to catch.

In practice, burn rate thresholds allow teams to catch reliability problems earlier, before they fully consume the SLO budget.

Burn Rate Window Examples

The burn rate window helps determine how quickly your service is consuming its SLO budget by observing behavior over short time intervals. Below are simple examples to clarify:

  • Short Window (5 minutes)
    Useful for detecting rapid error or latency spikes. For example, if a service suddenly begins failing or slowing down at a high rate, a short burn rate window (like 5 minutes) helps you identify that it's quickly consuming its SLO budget, enabling earlier incident detection. For services where fast detection of degraded performance is critical, consider also shortening the symptom activation delay, which you can manually configure via the Symptom Delay settings.

  • Moderate Window (15 minutes)
    This is the default and provides a good balance between reactivity and noise. It captures bursts of errors or latency that might not last long enough to trigger alerts in a longer window but are still significant.

  • Long Window (60 minutes)
    Best used to detect sustained SLO violations. For example, if a service has a consistent error rate that slowly drains the budget, the longer window provides better confidence that it’s not just a transient blip.
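
For example, a latency-sensitive service that needs the fast detection described in the short-window case above could be given a 5-minute window (the value and service placeholders are illustrative):

# Shorten the latency burn rate window to 5 minutes for quicker detection of spikes
kubectl label svc -n <namespace> <service-name> "causely.ai/latency-burn-rate-window=5" --overwrite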