Thresholds for Service Symptoms

Causely detects service and infrastructure symptoms based on a wide range of metrics, using a combination of defaults and learned behavior. However, you may want to customize these thresholds to better match your specific requirements and SLO definitions. This document explains how to configure custom thresholds for your services.

Overview

Causely uses thresholds to detect service and infrastructure symptoms across a wide range of metrics, including latency, error rates, throughput, and resource utilization.

These thresholds serve two related purposes:

  • Symptom detection, where crossing a threshold activates a Causely issue or risk
  • SLO evaluation, where the same metrics act as Service Level Indicators (SLIs) that determine SLO health

Causely provides sensible defaults for all supported thresholds. For some metrics, Causely can also learn thresholds automatically based on historical and real-time behavior. You can optionally configure manual thresholds or tune learned thresholds to better reflect business requirements, reliability targets, or known system constraints.

Key Concepts

Threshold Sources

Each threshold in Causely has a source, which determines how its value is set and maintained:

  • Default
    A system-provided threshold value that applies when no learning or manual override is configured. Defaults are designed to be safe and broadly applicable.

  • Learned
    For some metrics, Causely can automatically learn a threshold based on historical and real-time behavior. Learned thresholds adapt over time as normal behavior changes.

  • Manual
    A user-configured threshold that explicitly overrides the default or learned value. Manual thresholds remain fixed until changed or removed.

Only one source is active for a given threshold at any time. Configuring a minimum learned threshold does not change the source; the threshold remains learned.

How Thresholds Are Evaluated

Thresholds are always evaluated against a specific metric and aggregation, even when multiple series are shown for context.

For example:

  • Request Duration thresholds are evaluated against P95 latency
    A symptom becomes active when the P95 latency exceeds the configured threshold.
  • Other percentiles (such as P90 or P99) may be displayed to provide additional context but do not drive activation.

The evaluation aggregation is fixed per metric and does not change when thresholds are overridden.

Learned Threshold Minimums

For metrics that support learning, you can optionally configure a Minimum Learned Threshold.

A minimum learned threshold:

  • Sets a lower bound on how low a learned threshold can go
  • Allows Causely to continue learning above that value
  • Does not create a manual override

This is useful when you want adaptive behavior while preventing learned thresholds from becoming unrealistically low due to traffic patterns or short-term anomalies.

Thresholds and SLOs

The same metrics used for symptom detection are also used as Service Level Indicators (SLIs) when evaluating SLOs.

This means:

  • A threshold crossing may activate a symptom
  • The same metric contributes to SLO health calculations

Configuring thresholds affects both operational detection and SLO evaluation, so changes should be made with awareness of their broader impact.

How to Configure Thresholds

Causely supports configuring thresholds through multiple mechanisms, allowing you to choose the approach that best fits your workflow and environment. Regardless of the method used, the outcome is the same: the value is applied as a manual threshold, unless you configure a minimum learned threshold instead.

Thresholds can be configured at different levels (for example, service, workload, or infrastructure resource), depending on the metric and entity type.

Configuration Options

You can configure thresholds using the following methods:

  • Causely UI
    Best for inspecting learned behavior, understanding how thresholds relate to observed metrics, and making targeted adjustments.

  • Service metadata
    Supports a subset of commonly used service-level thresholds and lets you configure them declaratively using:

    • Kubernetes labels
    • Nomad service tags
    • Consul service metadata

    This approach is well suited for version-controlled, environment-specific configuration that travels with your service definition.

  • Causely API
    Best for programmatic configuration, automation, and integration with internal tooling or workflows.

All configuration methods support setting manual thresholds where applicable and, for metrics that support learning, setting a minimum learned threshold to bound adaptive learning.

Choosing the Right Method

Use the UI when you want to:

  • understand why a symptom is activating,
  • compare learned thresholds against real traffic,
  • experiment or iterate quickly.

Use service metadata when you want to:

  • manage thresholds as code,
  • apply consistent thresholds across environments,
  • ensure thresholds are applied automatically during deployment.

Use the API when you want to:

  • automate threshold management,
  • integrate threshold changes into CI/CD or internal systems,
  • apply changes across many entities programmatically.
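
For illustration only, an API-based update might look like the sketch below. The endpoint path, payload fields, and authentication shown here are hypothetical placeholders, not the documented Causely API schema; consult the Causely API reference for the actual request format.

# Hypothetical sketch only: the endpoint path and payload fields are
# illustrative placeholders, not documented Causely API names.
curl -X PUT "https://<causely-host>/api/thresholds/<entity-id>" \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{"metric": "request_duration_p95", "value": 500.0}'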

What Happens When You Configure a Threshold

When you configure a threshold:

  • The threshold source becomes Manual
  • Automatic learning (if supported for that metric) is paused
  • The configured value is used consistently for:
    • symptom activation
    • SLI evaluation for SLOs

If you remove a manual threshold, Causely reverts to the default or learned threshold, depending on the metric.
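
For example, if a threshold was set through a Kubernetes label (see Using Kubernetes Labels below), deleting the label removes the manual override; the trailing hyphen in the command removes the label:

# Remove the manual latency threshold so Causely falls back to the default or learned value
kubectl label svc -n <namespace> <service-name> "causely.ai/latency-threshold-"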

Minimum Learned Thresholds

For metrics that support learning, you can optionally configure a Minimum Learned Threshold instead of a full manual override.

This allows Causely to:

  • continue adapting to changing behavior,
  • while never learning a threshold below the configured minimum.

Minimum learned thresholds do not replace learning and do not create a manual override.

Using the Causely UI

Using the UI allows you to:

  • Inspect the learned threshold alongside observed metrics (for example P90 and P99 for latency)
  • Override thresholds to match documented SLOs or performance requirements
  • Set a minimum value that bounds how low a learned threshold can go while preserving adaptive learning
  • Immediately see how a custom threshold compares to real traffic patterns

To configure thresholds in the UI:

  1. Navigate to the service you want to configure.
  2. Select the Metrics tab for the service, then select the relevant symptom metric (for example, Request Duration or Request Error Rate).
  3. Click the pencil icon next to the threshold to edit the value.
  4. Save the change to apply the override.

UI-based configuration is best suited for teams that want quick iteration, visibility into learned behavior, and explicit control without modifying service metadata or deployment configuration.

(Screenshot: Edit Request Duration Threshold)

Using Service Metadata

note

Not all thresholds can be configured via service metadata. Metadata-based configuration currently supports a subset of service-level thresholds such as request error rate and request latency.

Using Kubernetes Labels

For Kubernetes, the recommended way to configure thresholds is with labels. You can apply these labels to your Service objects:

# Configure error rate threshold (for example, 1% error rate)
kubectl label svc -n <namespace> <service-name> "causely.ai/error-rate-threshold=0.01"

# Configure latency threshold (for example, 500ms)
kubectl label svc -n <namespace> <service-name> "causely.ai/latency-threshold=500.0"
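
If you prefer configuration that lives in version control, the same labels can be set declaratively in the Service manifest. The name, namespace, selector, and port below are placeholders for your own service definition:

# Equivalent declarative configuration on the Service object
apiVersion: v1
kind: Service
metadata:
  name: my-service
  namespace: my-namespace
  labels:
    causely.ai/error-rate-threshold: "0.01"
    causely.ai/latency-threshold: "500.0"
spec:
  selector:
    app: my-service
  ports:
    - port: 8080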

Using Nomad Service Tags

For Nomad services, you can configure thresholds using service tags in your job specification:

job "example" {
group "app" {
service {
name = "my-service"
port = 8080

tags = [
"causely.ai/error-rate-threshold=0.01"
"causely.ai/latency-threshold=500.0"
]
}
}
}
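
After updating the job specification, resubmit the job so the new tags are registered with the service; the file name below is a placeholder for your job file:

# Re-run the job to apply the updated service tags
nomad job run example.nomad.hcl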

Using Consul Service Metadata

For Consul services, you can configure thresholds using service metadata:

# Register a service with threshold metadata
consul services register \
  -name="my-service" \
  -port=8080 \
  -meta="causely.ai/error-rate-threshold=0.01" \
  -meta="causely.ai/latency-threshold=500.0"

# Update existing service metadata
consul services register \
  -id="my-service-id" \
  -name="my-service" \
  -port=8080 \
  -meta="causely.ai/error-rate-threshold=0.01" \
  -meta="causely.ai/latency-threshold=500.0"

Supported Thresholds

Causely supports configurable thresholds across a broad set of service and infrastructure entities. These thresholds are used to detect symptoms and also act as Service Level Indicators (SLIs) when evaluating SLO health.

Some thresholds support automatically learned values, while others use system defaults that can be manually overridden.


Services, Workloads, HTTP Paths and RPC Methods

Metric | Unit | Learned
Request Error Rate | percent | No
Request Duration (P95) | millisecond | Yes
Request Duration P95 (Client) | millisecond | Yes
Request Rate | request/s | Yes
Connections | percent | No
Mutex Wait Time | percent | No
Command Latency | millisecond | No
GC Time | percent | No
Queries Queued | count | No
Transaction Error | percent | No
Transaction Duration | second | No
Transaction IDs Congested | percent | No
Cache Size | bytes | No
Redis Connections Utilization | percent | No
Kafka Message Rate | message/s | No
Server Errors | count | No
User Errors | count | No
File Descriptor Utilization | percent | No
Java Heap Utilization | percent | No
Throttled | count | No
DB Connections Utilization | percent | No

Queues, Topics and Background Operations

Metric | Unit | Learned
Queue Depth | count | No
Dead Letter Count | count | No
Queue Acks | request/s | No
Message Wait Time | second | No
Queue Size Bytes | bytes | No
Lag | count | No
Task Duration | millisecond | No

Database Tables

Metric | Unit | Learned
DB Query Duration | second | Yes
Select Query Duration (P95) | millisecond | Yes
Table Bloat | percent | No
Lock Exclusive Rate | percent | No
DDL Lock Exclusive Rate | percent | No

Application Load Balancers

Metric | Unit | Learned
Request Rate | request/s | Yes
Request 4xx Error | percent | No
Request 5xx Error | percent | No
Request 504 Error | percent | No
ELB Auth Error | count | No
Target Connection Error | count | No

Containers and Controllers

Metric | Unit | Learned
CPU Utilization | percent | No
CPU Throttled | percent | No
Memory Utilization | percent | No
Ephemeral Storage Utilization | percent | No
Frequent Crash | count | No
Frequent OOM Kill | count | No
Frequent Pod Ephemeral Storage Evictions | count | No

Nodes and Virtual Machines

Metric | Unit | Learned
CPU Utilization | percent | No
Memory Utilization | percent | No
Conntrack Table Utilization | percent | No
SNAT Port Utilization | percent | No
Container Ephemeral Storage Utilization | percent | No
Memory Pressure Pod Evictions | count | No
Disk Pressure Pod Evictions | count | No
Disk Read IOPS Utilization | percent | No
Disk Write IOPS Utilization | percent | No
Disk Total IOPS Utilization | percent | No
Disk Read Throughput Utilization | percent | No
Disk Write Throughput Utilization | percent | No
Disk Total Throughput Utilization | percent | No

Disks

Metric | Unit | Learned
Utilization | percent | No
Read IOPS Utilization | percent | No
Write IOPS Utilization | percent | No
Total IOPS Utilization | percent | No
Read Throughput Utilization | percent | No
Write Throughput Utilization | percent | No
Total Throughput Utilization | percent | No
Inodes Utilization | percent | No

Best Practices

  1. Start with Default or Learned Thresholds
    Use Causely’s default or automatically learned thresholds as a baseline before introducing manual overrides.

  2. Override Only When There Is a Clear Requirement
    Configure manual thresholds when you have explicit business, reliability, or compliance requirements that differ from observed behavior.

  3. Prefer Minimum Learned Thresholds Over Full Overrides
    When available, use a minimum learned threshold to bound adaptive learning without disabling it entirely.

  4. Consider Both Symptom Detection and SLO Impact
    Threshold changes affect both symptom activation and SLI evaluation for SLOs. Validate changes in both contexts.

  5. Monitor After Changes
    After updating thresholds, observe how they affect symptom frequency, noise, and SLO evaluation over time.

  6. Document Intent, Not Just Values
    Record why a threshold was changed to support future reviews and adjustments.

Example Use Cases

  1. Strict SLO Requirements
    A critical service requires tighter latency bounds than normal traffic patterns allow. A manual threshold is configured to align symptom detection and SLO evaluation with the defined objective.

  2. Preventing Overly Aggressive Learned Thresholds
    A service with highly variable traffic uses a minimum learned threshold to prevent latency thresholds from adapting too low during off-peak periods while preserving adaptive behavior.

  3. Infrastructure Saturation Detection
    A team configures CPU or disk utilization thresholds on nodes to detect resource saturation early, independent of application-level symptoms.

  4. Queue Backlog Monitoring
    Queue depth and message wait time thresholds are configured to surface processing delays before they impact downstream services.

  5. Temporary Adjustments During Maintenance
    Thresholds are temporarily adjusted during planned maintenance or migrations and reverted afterward.