Types of inferable root causes
Modern cloud native systems evolve quickly, and their complexity makes failures propagate: a small change in one part of the system can ripple across dependencies and surface as symptoms that impact the user experience.
In Causely, a root cause is a causal explanation—an inferred factor that explains why specific symptoms or disruptions occurred. Root causes are not alerts or incidents; they are derived from structured dependency models, observed telemetry, and causal reasoning.
Below you can find a list of root cause types that are captured in our Causal Models. With these, Causely can pinpoint hundreds of thousands of potential issues and their effects within your environment. Each inferred cause is connected in a causal graph to the symptoms it explains, with supporting evidence and impact context.
Root causes can be explored by category, subcategory, or integration source.
Congested
The service is experiencing congestion, resulting in high latency for clients. This suggests that the system is unable to handle the current load efficiently, causing delays in response times. Congestion often occurs when the service receives more requests than it can handle within its capacity, leading to bottlenecks in processing. This may be due to insufficient resources (for example, CPU, memory, or bandwidth), unoptimized code, or a surge in traffic (for example, due to a sudden increase in demand or DDoS attack).
Malfunction
The service is experiencing a high rate of errors, causing disruptions for clients. This can lead to degraded performance, failed requests, or complete service unavailability, significantly affecting the user experience.
Authentication Misconfiguration
Application Load Balancer (ALB) authentication misconfiguration can disrupt secure traffic routing and lead to widespread configuration issues. This misconfiguration may trigger elevated ELB authentication errors, 504 request timeouts, and target connection errors, ultimately impacting service availability and application performance.
Idle Timeout Misconfiguration
Misconfigured idle timeout settings can lead to unintended connection drops and delays, potentially triggering a high frequency of 504 gateway timeout errors. This misconfiguration may also contribute to broader configuration issues that disrupt seamless connectivity between clients and servers.
Unknown Configuration Failure
Application Load Balancer (ALB) misconfiguration can cause widespread connectivity issues, leading to a high frequency of 504 gateway timeout errors and 5xx server errors. These issues indicate that the load balancer's settings are not properly optimized for handling traffic efficiently and reliably.
Network Policy Misconfiguration
Application Load Balancer (ALB) network policy misconfigurations can block or restrict legitimate traffic, leading to widespread configuration issues. These misconfigurations often result in elevated 504 gateway timeout errors and target connection failures, ultimately disrupting service availability.
Congested Azure Event Hub Namespace
When the Azure Event Hub namespace becomes congested, it reaches a point where its processing capacity is exceeded. This leads to consistent throttling of operations, as the system enforces limits to prevent overload. The high rate of throttling not only impacts event ingestion but also cascades into resource starvation, affecting downstream services that rely on timely event processing. Such congestion is typically caused by high message throughput, suboptimal configuration, or insufficient scaling to handle peak loads.
High User Errors
High user error rates in Azure Event Hub often stem from configuration issues, such as mismatched security credentials (for example, Shared Access Signature (SAS) tokens), client SDK version incompatibilities, or throttling from overusing allocated resources. Insufficient permissions or quotas being exceeded can also trigger these errors. Another common cause is incorrect partition or consumer group usage, which can lead to connection limits being breached or messages being inaccessible.
High Server Errors
Common causes for Azure Event Hub errors include quota exceeded (throughput or message size limits have been breached), partition or offset issues (consumers unable to connect or reading from invalid offsets), networking problems (connectivity issues due to firewall rules, DNS misconfigurations, or latency), service outage (regional Azure service disruption), and misconfigured access policies (incorrect SAS tokens, permissions, or authentication methods).
Access Throttled
The application is receiving HTTP 429 "Too Many Requests" responses, indicating that it has exceeded the rate limits set by a downstream service. This can cause degraded functionality, slow performance, or temporary service unavailability for end users. HTTP 429 errors are typically triggered when an API or service imposes rate limits to control the volume of incoming requests.
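One common client-side mitigation is to back off when a 429 arrives instead of retrying immediately. The following is a minimal Go sketch under that assumption: it honors the Retry-After header when present and otherwise falls back to exponential backoff. The endpoint URL and retry count are purely illustrative.

```go
package main

import (
	"fmt"
	"math"
	"net/http"
	"strconv"
	"time"
)

// getWithBackoff retries a GET request when the server responds with
// HTTP 429, honoring the Retry-After header if present and falling
// back to exponential backoff otherwise.
func getWithBackoff(url string, maxRetries int) (*http.Response, error) {
	client := &http.Client{Timeout: 10 * time.Second}
	for attempt := 0; ; attempt++ {
		resp, err := client.Get(url)
		if err != nil {
			return nil, err
		}
		if resp.StatusCode != http.StatusTooManyRequests || attempt >= maxRetries {
			return resp, nil
		}
		// Prefer the server-provided Retry-After delay (in seconds).
		delay := time.Duration(math.Pow(2, float64(attempt))) * time.Second
		if s := resp.Header.Get("Retry-After"); s != "" {
			if secs, err := strconv.Atoi(s); err == nil {
				delay = time.Duration(secs) * time.Second
			}
		}
		resp.Body.Close()
		time.Sleep(delay)
	}
}

func main() {
	// Hypothetical endpoint used only for illustration.
	resp, err := getWithBackoff("https://api.example.com/v1/items", 5)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```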
Database Connection Pool Saturated
The client-side database connection pool is exhausted when all available connections are in use, preventing new database queries from being executed. This can cause application requests to hang or fail, impacting user experience and potentially leading to downtime for database-dependent features.
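A frequent remediation is to size the client-side pool explicitly and watch its wait statistics. Below is a minimal Go sketch using the standard database/sql pool; the driver import, connection string, and numbers are placeholders rather than recommendations.

```go
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq" // hypothetical choice of PostgreSQL driver
)

func main() {
	// The connection string is illustrative only.
	db, err := sql.Open("postgres", "postgres://user:pass@db:5432/app?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Bound the pool explicitly: an oversized pool can overload the
	// database, while an undersized one causes callers to block
	// waiting for a free connection.
	db.SetMaxOpenConns(50)                  // hard cap on concurrent connections
	db.SetMaxIdleConns(10)                  // connections kept warm between bursts
	db.SetConnMaxLifetime(30 * time.Minute) // recycle connections periodically
	db.SetConnMaxIdleTime(5 * time.Minute)  // drop idle connections promptly

	// db.Stats() exposes pool health; WaitCount and WaitDuration rising
	// over time are a strong signal of pool saturation.
	log.Printf("pool stats: %+v", db.Stats())
}
```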
Database Malfunction
The database is returning a high rate of errors or failing to respond to queries, causing disruptions for services and clients that depend on it. This may result in delayed or failed access to one or more tables, leading to degraded application performance, elevated latency, or complete unavailability of database-backed functionality.
Excessive DNS Traffic from Client
The application is generating an unusually high volume of DNS requests, potentially overwhelming DNS servers, increasing latency for users, and risking service disruptions. This behavior may also incur additional costs or trigger rate-limiting from DNS providers. This typically arises when the application initiates DNS lookups more frequently than necessary due to lack of effective caching, redundant DNS resolution logic, or misconfigurations.
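A lightweight mitigation is to cache successful lookups in the client instead of resolving the same name on every request. The Go sketch below memoizes results for a fixed TTL; in practice the cache lifetime should respect record TTLs, and in Kubernetes a node-local DNS cache is often the preferred fix. The host name and TTL are illustrative.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"sync"
	"time"
)

// cachedResolver memoizes successful lookups for a fixed TTL so that
// hot code paths do not hit the DNS server on every request.
type cachedResolver struct {
	mu      sync.Mutex
	ttl     time.Duration
	entries map[string]cacheEntry
}

type cacheEntry struct {
	addrs   []string
	expires time.Time
}

func newCachedResolver(ttl time.Duration) *cachedResolver {
	return &cachedResolver{ttl: ttl, entries: make(map[string]cacheEntry)}
}

func (r *cachedResolver) LookupHost(ctx context.Context, host string) ([]string, error) {
	r.mu.Lock()
	if e, ok := r.entries[host]; ok && time.Now().Before(e.expires) {
		r.mu.Unlock()
		return e.addrs, nil
	}
	r.mu.Unlock()

	addrs, err := net.DefaultResolver.LookupHost(ctx, host)
	if err != nil {
		return nil, err
	}
	r.mu.Lock()
	r.entries[host] = cacheEntry{addrs: addrs, expires: time.Now().Add(r.ttl)}
	r.mu.Unlock()
	return addrs, nil
}

func main() {
	resolver := newCachedResolver(30 * time.Second)
	for i := 0; i < 3; i++ {
		// Only the first iteration triggers an actual DNS query.
		addrs, err := resolver.LookupHost(context.Background(), "example.com")
		fmt.Println(addrs, err)
	}
}
```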
File Descriptor Exhaustion
The application has reached the system-imposed limit on the number of file descriptors it can open. This typically leads to errors such as "Too many open files," preventing the application from creating new connections, reading files, or accessing resources. This can severely impact functionality, particularly in high-concurrency or high-I/O scenarios.
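To see how close a process is to the limit, the soft and hard RLIMIT_NOFILE values can be compared with the number of descriptors currently open. Below is a Linux-only Go sketch; the 80% warning threshold is an arbitrary illustrative choice.

```go
//go:build linux

package main

import (
	"fmt"
	"os"
	"syscall"
)

// main reports how close the process is to its file descriptor limit.
// Counting entries in /proc/self/fd is a Linux-specific approximation.
func main() {
	var rlim syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rlim); err != nil {
		fmt.Fprintln(os.Stderr, "getrlimit:", err)
		return
	}

	fds, err := os.ReadDir("/proc/self/fd")
	if err != nil {
		fmt.Fprintln(os.Stderr, "readdir:", err)
		return
	}

	used := len(fds)
	fmt.Printf("open file descriptors: %d / soft limit %d (hard limit %d)\n",
		used, rlim.Cur, rlim.Max)
	if float64(used) > 0.8*float64(rlim.Cur) {
		fmt.Println("warning: above 80% of the soft limit; connections or files may be leaking")
	}
}
```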
GOMAXPROCS Misconfigured
The environment variable GOMAXPROCS, which controls the maximum number of CPU cores the Go runtime uses, has been set higher than the CPU limit of the container in which the Go application is running. This mismatch can lead to inefficient CPU usage, reduced performance, and potential throttling because the Go runtime attempts to schedule more work than the container is permitted to handle.
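One way to keep the two aligned is to set GOMAXPROCS explicitly from the container's CPU limit. The sketch below assumes the limit is injected as a CPU_LIMIT environment variable (for example via the Kubernetes Downward API with resourceFieldRef: limits.cpu); many teams instead import go.uber.org/automaxprocs, which derives the value from the cgroup CPU quota automatically.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"strconv"
)

// main caps GOMAXPROCS at the container's CPU limit, which is assumed
// to be injected as the CPU_LIMIT environment variable.
func main() {
	if v := os.Getenv("CPU_LIMIT"); v != "" {
		if limit, err := strconv.Atoi(v); err == nil && limit > 0 {
			runtime.GOMAXPROCS(limit)
		}
	}
	// GOMAXPROCS(0) reports the current setting without changing it.
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))

	// ... start the application ...
}
```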
Inefficient DNS Lookup
The application is making an unusually high volume of DNS requests, with over 80% returning NXDomain (non-existent domain) responses. This excessive DNS activity is adding 10 to 20 ms of latency to each request, negatively impacting service performance. The issue is often caused by the service or application attempting to resolve incomplete or unqualified domain names.
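In Kubernetes this pattern is often driven by the resolver's search-domain expansion (the ndots setting): a short name is tried against each search domain in turn, producing a string of NXDomain answers before the right record is found. A fully qualified name with a trailing dot skips the expansion, as in the hedged Go sketch below; the service and namespace names are purely illustrative, and actual behavior depends on the pod's resolv.conf.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// A short name like "payments" is expanded through the resolver's
	// search domains (controlled by ndots), producing several NXDomain
	// answers before the correct record is found.
	//
	// A fully qualified name with a trailing dot resolves in a single
	// query. The service and namespace names here are illustrative.
	addrs, err := net.DefaultResolver.LookupHost(ctx, "payments.prod.svc.cluster.local.")
	fmt.Println(addrs, err)
}
```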
Inefficient Garbage Collection
The application is experiencing high latency and reduced throughput because a significant portion of its runtime is being spent in garbage collection (GC). This leads to frequent pauses, degrading overall performance and causing delays in request handling across all dependent services. This issue usually occurs when the Java Virtual Machine (JVM) or other garbage-collected runtime environments are under memory pressure.
Invalid Client Certificate
The application is failing to connect to a service due to invalid certificate errors, preventing secure communication over HTTPS or TLS. This can cause downtime or degraded functionality for users relying on this service.
Java Heap Saturated
The JVM is operating with limited available heap memory, resulting in degraded performance or potential application crashes. This condition typically leads to frequent or prolonged garbage collection (GC) pauses, slow response times, and, in severe cases, OutOfMemoryError. It often reflects memory leaks, improper heap sizing, or excessive object allocation.
Lock Contention
The application suffers from inefficient locking, where suboptimal lock management leads to excessive contention and prolonged mutex wait times. This inefficiency degrades performance by increasing the risk of thread starvation under heavy load. This can stem from overuse of locks, coarse-grained locking strategies, or improper lock design.
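A standard way to reduce contention is to shrink the critical section or split one hot lock into several independent shards keyed by the data being touched. The Go sketch below illustrates a sharded counter; the shard count and key scheme are arbitrary illustrative choices, not a prescription.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

// shardedCounter splits one hot mutex into several independent shards
// so that goroutines touching different keys rarely contend.
type shardedCounter struct {
	shards [16]struct {
		mu     sync.Mutex
		counts map[string]int
	}
}

func newShardedCounter() *shardedCounter {
	c := &shardedCounter{}
	for i := range c.shards {
		c.shards[i].counts = make(map[string]int)
	}
	return c
}

func (c *shardedCounter) Inc(key string) {
	// Hash the key to pick a shard, so only goroutines working on the
	// same shard ever compete for the same mutex.
	h := fnv.New32a()
	h.Write([]byte(key))
	shard := &c.shards[h.Sum32()%uint32(len(c.shards))]
	shard.mu.Lock()
	shard.counts[key]++
	shard.mu.Unlock()
}

func main() {
	c := newShardedCounter()
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			c.Inc(fmt.Sprintf("key-%d", i%8))
		}(i)
	}
	wg.Wait()
	fmt.Println("done")
}
```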
Memory Exhaustion
The Broker application has exhausted its available memory, resulting in degraded performance and potential service disruption. When memory usage reaches critical levels, the system may experience increased garbage collection (GC) activity, higher processing latency, and, in severe cases, OutOfMemoryError events that halt message processing.
Noisy Client
The application acts as a Noisy Client, generating a high volume of requests that burdens destination services with increased load and elevated request rates. This aggressive request pattern can overwhelm the capacity of destination services and drive up their overall load.
Redis Connection Pool Saturated
Client-side Redis connection pool exhaustion occurs when all available connections in the pool are in use, preventing new requests to Redis. This can lead to request timeouts or failures, causing application disruptions for features relying on Redis for caching, messaging, or other operations.
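When using the widely adopted go-redis client, the pool size can be set explicitly and pool statistics can be watched for timeouts, as in the sketch below. The address and pool numbers are placeholders, and other Redis clients expose equivalent knobs.

```go
package main

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
)

func main() {
	// Pool sizes below are illustrative; the right values depend on
	// application concurrency and on the Redis server's maxclients.
	client := redis.NewClient(&redis.Options{
		Addr:         "redis:6379",
		PoolSize:     50, // maximum sockets to Redis per client
		MinIdleConns: 5,  // keep a few connections warm
	})
	defer client.Close()

	if err := client.Ping(context.Background()).Err(); err != nil {
		fmt.Println("redis unreachable:", err)
		return
	}

	// PoolStats.Timeouts counts callers that waited for a connection
	// and gave up -- a rising value indicates pool saturation.
	stats := client.PoolStats()
	fmt.Printf("total=%d idle=%d timeouts=%d\n", stats.TotalConns, stats.IdleConns, stats.Timeouts)
}
```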
Producer Publish Rate Spike
The application is publishing messages at a rate significantly higher than normal, causing queue depth to grow and producer message rates to climb. This surge in publishing activity creates backpressure that can overwhelm downstream consumers and propagates congestion to downstream destinations.
Slow Consumer
The application is consuming messages slower than they are produced, creating a processing bottleneck. As unprocessed messages accumulate, the system experiences increased queue lag, potential memory pressure, and downstream congestion. This often indicates that one or more instances are unable to keep up due to resource constraints, inefficient processing logic, or external dependencies.
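A common remediation is to process messages with a bounded worker pool so that one slow handler does not serialize the whole stream, while a small channel buffer still applies backpressure. The Go sketch below is illustrative only; the worker count and simulated handler are placeholders.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// process simulates per-message work; in a real consumer this would be
// the handler invoked for each record pulled from the queue or topic.
func process(msg string) {
	time.Sleep(10 * time.Millisecond)
	_ = msg
}

func main() {
	const workers = 8 // sized to the downstream dependency, not unbounded

	messages := make(chan string, workers) // small buffer provides backpressure
	var wg sync.WaitGroup

	// Fan the stream out to a fixed pool of workers so a single slow
	// handler no longer caps throughput at one message at a time.
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for msg := range messages {
				process(msg)
			}
		}()
	}

	for i := 0; i < 100; i++ {
		messages <- fmt.Sprintf("msg-%d", i) // blocks when workers fall behind
	}
	close(messages)
	wg.Wait()
	fmt.Println("drained")
}
```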
Slow Database Queries
The application is experiencing slow database queries that lead to downstream slow consumer behavior and potential resource starvation. This condition affects instance performance, particularly when query execution times become excessively long, degrading overall system responsiveness.
Slow Database Server Queries
Warehouse congestion in Snowflake occurs when the processing capacity is overwhelmed by incoming queries, causing a significant backlog. This leads to queries being queued at high rates, indicating that the system is struggling to process them in a timely manner. The resulting resource starvation further degrades performance.
Transaction ID Congested
In databases like PostgreSQL, transaction IDs (XIDs) are 32-bit integers that count the number of transactions performed. High utilization occurs when the counter nears its maximum usable value (roughly 2 billion transactions), requiring a wraparound to continue operation. Failure to perform routine VACUUM operations can prevent the system from marking old XIDs as reusable, and if the XID age keeps growing, PostgreSQL will eventually refuse new write transactions to protect against wraparound data loss.
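XID consumption can be monitored directly from the catalog: age(datfrozenxid) per database shows how far the oldest unfrozen XID lags behind the current one. A minimal Go sketch running that query through database/sql follows; the driver and connection string are placeholders.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // hypothetical choice of PostgreSQL driver
)

func main() {
	// The connection string is illustrative only.
	db, err := sql.Open("postgres", "postgres://user:pass@db:5432/postgres?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// age(datfrozenxid) is how far each database's oldest unfrozen XID
	// lags behind the current XID; values approaching ~2 billion mean
	// autovacuum is not freezing tuples fast enough.
	rows, err := db.Query(`SELECT datname, age(datfrozenxid) FROM pg_database ORDER BY 2 DESC`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var name string
		var xidAge int64
		if err := rows.Scan(&name, &xidAge); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%-20s xid age: %d (%.1f%% of wraparound threshold)\n",
			name, xidAge, 100*float64(xidAge)/2e9)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```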
Unauthorized Access
The application is receiving numerous "Unauthorized" status codes (typically HTTP 401) when trying to access another service. This prevents the application from successfully retrieving data or performing actions, potentially causing service disruptions or degraded functionality for end users.
CPU Congested
One or more containers in a workload are experiencing CPU congestion, leading to potential throttling. This occurs when containers try to use more CPU than they are allocated, causing degraded performance, longer response times, or application crashes. CPU throttling kicks in when a container exceeds the CPU quota defined by Kubernetes or Docker.
Crash Failure
One or more containers of a workload have crashed with a non-zero exit code, indicating abnormal termination. This disrupts the application's functionality, leading to downtime or degraded performance depending on how the workload is designed. The non-zero exit code signifies an error during the execution of the container's process.
Frequent Crash Failure
One or more containers of a workload are frequently crashing with a non-zero exit code, indicating abnormal termination. This disrupts the application's functionality, leading to downtime or degraded performance depending on how the workload is designed.
Frequent Memory Failure
The application frequently runs out of memory, leading to crashes, performance degradation, or instability. This affects the application's availability and can lead to downtime or poor user experience. The issue is likely due to inefficient memory usage, such as memory leaks, excessive data loading into memory, or improper garbage collection.
Memory Failure
Containers running out of memory can lead to service crashes or degraded performance, resulting in errors for end users or failed service requests. This typically occurs when a container's allocated memory is insufficient for the workload it is handling, causing out-of-memory (OOM) errors and potential system instability.
Ephemeral Storage Congested
A container is experiencing ephemeral storage congestion when its ephemeral storage usage becomes critically high, leading to failures in operations that depend on temporary storage. This may be triggered by factors such as excessive logging, inadequate cleanup of temporary files, or unexpected bursts in data processing.
Ephemeral Storage Noisy Neighbor
A container acting as a noisy neighbor consumes excessive ephemeral storage, resulting in abnormally high storage usage and contributing to node-level disk pressure that can trigger pod evictions. This issue arises when a container consistently uses more ephemeral storage than expected.
Memory Noisy Neighbor
A container acting as a noisy neighbor consumes excessive memory, leading to abnormally high memory usage and contributing to node-level memory pressure that can trigger pod evictions. This issue occurs when a container consistently uses more memory than expected, which adversely impacts both the container and its hosting node.
Frequent Pod Ephemeral Storage Evictions
A Kubernetes workload is experiencing frequent pod evictions due to ephemeral storage exhaustion. This disrupts application availability and performance, as pods are terminated when they exceed their allocated storage limits or when node-level storage is under pressure.
Image Pull Errors
Kubernetes controllers may encounter image pull errors when they cannot download container images from a registry, causing Pods to fail in starting or remain in an ImagePullBackOff state. This disrupts the deployment of applications and can affect service availability.
Malfunction
Multiple pods for a Kubernetes controller are in a "NotReady" state for an extended period, which can lead to service unavailability or degraded performance.
Disk Pressure
Disk pressure on a Kubernetes node indicates that the node's disk usage is high, potentially causing the eviction of pods, reduced performance, and the inability to schedule new pods. This affects application stability and the node's overall functionality. Disk pressure can arise from insufficient disk space, often caused by log accumulation, container images, temporary files, or application data.
Memory Pressure
Memory pressure on a Kubernetes node occurs when available memory falls below critical levels, potentially causing the eviction of pods and instability for applications running on the node. This reduces the node's capacity to run workloads, potentially leading to service disruptions if insufficient resources are available across the cluster.
Conntrack Table Congested
The conntrack table on a VM is congested, causing new network connections to fail. This typically results in connectivity issues for applications, degraded performance, or downtime for services dependent on network communication. The conntrack table is responsible for tracking active network connections and has a fixed size, which can be exhausted under high connection load.
CPU Congested
A Virtual Machine (VM) experiencing CPU congestion can lead to sluggish application performance, delayed response times, or even timeout errors for users and processes. This typically indicates that the VM's CPU is overutilized, potentially due to high resource demands from applications or insufficient CPU allocation.
Disk Read IOPs Congested
The total disk read IOPS for a cloud VM are congested because the VM has reached its maximum allowable IOPS limit. This results in throttling, which can slow application performance and lead to delays or errors in read-heavy workloads.
Disk Read Throughput Congested
The total disk read throughput for a cloud VM is congested because the VM has reached its maximum allowable read bandwidth. This can lead to slower data transfer rates for read-intensive applications, causing delays in processing and reduced system performance.
Disk Total IOPs Congested
The total disk IOPS for a cloud VM are congested because the VM has reached its maximum allowable IOPS limit. This results in throttling, which can slow application performance and lead to delays or errors in read/write-heavy workloads.
Disk Total Throughput Congested
The total disk throughput for a cloud VM is congested because the VM has reached its maximum allowable bandwidth. This can lead to slower data transfer rates for read/write-intensive applications, causing delays in processing and reduced system performance.
Disk Write IOPs Congested
The total disk write IOPS for a cloud VM are congested because the VM has reached its maximum allowable IOPS limit. This results in throttling, which can slow application performance and lead to delays or errors in write-heavy workloads.
Disk Write Throughput Congested
The total disk write throughput for a cloud VM is congested because the VM has reached its maximum allowable write bandwidth. This can lead to slower data transfer rates for write-intensive applications, causing delays in processing and reduced system performance.
Memory Congested
Memory congestion in a Virtual Machine (VM) leads to slow system performance, application crashes, or even VM instability as the system struggles to allocate memory for running processes. This typically results in frequent swapping or out-of-memory (OOM) errors, impacting applications and user operations.
SNAT Ports Congested
The SNAT (Source Network Address Translation) ports on a virtual machine (VM) are congested, leading to outbound network connection failures or degraded performance for services relying on external APIs or resources. This issue primarily impacts VMs that need to establish multiple concurrent connections to the internet or external systems.
Invalid Server Certificate
The network endpoint is serving an invalid server certificate, resulting in a high rate of client request errors due to certificate validation failures. This issue propagates further, increasing the overall request error rate across the system.
Congested
The disk has reached full capacity, which prevents new data from being written and may cause applications to fail, especially those dependent on free disk space for logs, caching, or temporary files. This can also slow down or halt system operations if critical processes can no longer write to the disk.
Inode Usage Congested
The disk is experiencing inode exhaustion, meaning the file system has run out of inodes (metadata structures for file storage), which prevents new files from being created even if there is free disk space. This often causes errors in applications attempting to create files and can disrupt services reliant on file storage.
IOPs Congested
The disk is experiencing Read/Write Operations Per Second (IOPS) congestion, meaning that the total IOPS capacity is fully utilized. This causes slow performance for applications that rely on disk access, leading to delayed data processing, system lags, or even timeouts.
Read IOPs Congested
The disk is experiencing read IOPS (read operations per second) congestion, meaning that its read IOPS capacity is fully utilized. This causes slow performance for applications that rely on disk access, leading to delayed data processing, system lags, or even timeouts.
Read Throughput Congested
The disk is experiencing congestion specifically in read throughput, which slows down data retrieval from the disk and can degrade the performance of applications reliant on high-speed data access.
Write IOPs Congested
The disk is experiencing write IOPS (write operations per second) congestion, meaning that its write IOPS capacity is fully utilized. This causes slow performance for applications that rely on disk access, leading to delayed data processing, system lags, or even timeouts.
Write Throughput Congested
The disk is experiencing write throughput congestion, leading to slower data write speeds and affecting applications that require high-speed data recording. This issue can cause delays in data availability and reduced performance in write-intensive tasks.
Faulty Error Handling in HTTP Path
The HTTP path is experiencing a high rate of errors, causing disruptions for clients. This can lead to degraded performance, failed requests, or complete service unavailability, significantly affecting the user experience.
Slow Execution in HTTP Path Handler
The HTTP path is experiencing congestion, resulting in high latency for clients. This suggests that the system is unable to handle the current load efficiently, causing delays in response times. Congestion often occurs when the service receives more requests than it can handle within its capacity.
Faulty Error Handling in RPC Method
The RPC Method is experiencing a high rate of errors, causing disruptions for clients. This can lead to degraded performance, failed requests, or complete service unavailability, significantly affecting the user experience.
Slow Execution in RPC Method Handler
The RPC method is experiencing congestion, resulting in high latency for clients. This suggests that the system is unable to handle the current load efficiently, causing delays in response times. Congestion often occurs when the service receives more requests than it can handle within its capacity.
Contention on Database Table Locks
The database table is experiencing an abnormally high rate of exclusive locks, preventing multiple transactions from accessing the table simultaneously. This creates a bottleneck that significantly degrades performance for all client applications depending on this table. Excessive locking typically occurs due to long-running transactions, lock contention between multiple transactions, inefficient transaction design, inappropriate isolation levels, or missing indexes leading to table scans instead of index seeks.
Schema Change Causing Table Lock Contention
The database table is experiencing an unusually high rate of data dictionary (DDL) locks, which block both read and write operations during schema modifications. This significantly impacts all client applications, causing service disruptions and performance degradation.
Table Access Failure in Database
The database table is experiencing performance degradation or errors, causing disruptions for client applications. This results in slow query response times, potential errors, and degraded service performance for systems that depend on this table.
CPU Congestion
After a version upgrade, application containers experience high CPU usage, leading to performance degradation or unresponsiveness. This issue impacts the system's ability to handle requests effectively, potentially causing downtime or delays for end users. High CPU usage post-upgrade typically stems from changes in the application code, dependencies, or configurations.
Database Connection Pool Saturated
After a version upgrade, the client-side database connection pool is exhausted when all available connections are in use, preventing new database queries from being executed. This can cause application requests to hang or fail, impacting user experience and potentially leading to downtime for database-dependent features.
Frequent Crash Failure
One or more containers of a workload are frequently crashing with a non-zero exit code after a version upgrade. This disrupts the application's functionality, leading to downtime or degraded performance depending on the workload design. The issue likely stems from changes introduced in the new version.
Frequent Memory Failure
The application is running out of memory after a version upgrade, leading to crashes, degraded performance, or instability. This impacts availability and user experience, often requiring container restarts or manual intervention to restore functionality. The issue is likely tied to changes in the updated version that increase memory usage or introduce inefficiencies.
Inefficient Garbage Collection
After a version upgrade, the garbage collector is frequently running, leading to performance degradation or crashes. This issue is likely caused by changes in the application code or dependencies that increase memory usage or introduce inefficiencies.
Java Heap Saturated
After a version upgrade, the Java heap is frequently congested, leading to performance degradation or crashes. This issue is likely caused by changes in the application code or dependencies that increase memory usage or introduce inefficiencies.
Lock Contention
After a version upgrade, the application is experiencing frequent locking contention, leading to performance degradation or crashes. This issue is likely caused by changes in the application code or dependencies that increase locking or introduce inefficiencies.
Memory Failure
Memory failures after a code change can cause containers to crash or degrade performance, resulting in errors for end users or failed service requests. These issues occur when newly introduced code leads to unexpected increases in memory usage, triggering out-of-memory (OOM) errors and destabilizing the system.
Redis Connection Pool Saturated
After a version upgrade, the Redis connection pool is frequently congested, leading to performance degradation or crashes. This issue is likely caused by changes in the application code or dependencies that increase Redis usage or introduce inefficiencies.
Slow Database Queries
After a version upgrade, the application is experiencing slow database queries that lead to downstream slow consumer behavior and potential resource starvation. This condition affects instance performance, particularly when query execution times become excessively long.