Types of inferable root causes

Modern cloud native systems evolve quickly, and complexity makes failures propagate: a small change in one part of the system can ripple across dependencies and surface as symptoms that impact user experience.

In Causely, a root cause is a causal explanation—an inferred factor that explains why specific symptoms or disruptions occurred. Root causes are not alerts or incidents; they are derived from structured dependency models, observed telemetry, and causal reasoning.

Below you can find a list of root cause types that are captured in our Causal Models. With these, Causely can pinpoint hundreds of thousands of potential issues and their effects within your environment. Each inferred cause is connected in a causal graph to the symptoms it explains, with supporting evidence and impact context.

Use the search and filters below to explore all root causes by category, subcategory, or integration source.

.NET Unhandled Exception

.NET logs show an unhandled exception, indicating the application failed unexpectedly and may have terminated the process. Unhandled exceptions in .NET typically indicate application logic errors, invalid state, or dependency failures that were not caught. Depending on hosting mode, this can fail the current request, terminate a worker loop, or crash the process entirely.

Category
Application
Integration Sources
Logs

Access Throttled

The application is receiving HTTP 429 "Too Many Requests" responses, indicating that it has exceeded the rate limits set by the destination service. This can cause degraded functionality, slow performance, or temporary service unavailability for end users. HTTP 429 errors are typically triggered when an API or service imposes rate limits to control the volume of incoming requests.
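
Where the client owns the retry behavior, one mitigation is to back off when a 429 arrives and honor the server's Retry-After header. The Go sketch below is illustrative only; the endpoint, retry budget, and backoff values are assumptions, not part of Causely or any specific service.

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
	"time"
)

// getWithBackoff retries a GET request when the server answers 429,
// honoring the Retry-After header when present.
func getWithBackoff(url string, maxRetries int) (*http.Response, error) {
	backoff := 500 * time.Millisecond
	for attempt := 0; ; attempt++ {
		resp, err := http.Get(url)
		if err != nil {
			return nil, err
		}
		if resp.StatusCode != http.StatusTooManyRequests || attempt >= maxRetries {
			return resp, nil
		}
		resp.Body.Close()

		wait := backoff
		if ra := resp.Header.Get("Retry-After"); ra != "" {
			if secs, err := strconv.Atoi(ra); err == nil {
				wait = time.Duration(secs) * time.Second
			}
		}
		time.Sleep(wait)
		backoff *= 2 // exponential backoff between attempts
	}
}

func main() {
	// Hypothetical endpoint used only for illustration.
	resp, err := getWithBackoff("https://api.example.com/items", 3)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```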

Category
Application
Integration Sources
Service Communication

Apache Worker Exhausted

Apache HTTPD has reached its MaxRequestWorkers limit, meaning all worker slots are occupied and new requests cannot be processed. Apache prefork and event MPMs log "server reached MaxRequestWorkers setting, consider raising the MaxRequestWorkers setting" when all worker slots are busy. New connections are queued up to ListenBacklog, then dropped. This indicates either a load spike or slow request processing that keeps workers occupied.

Category
Application
Integration Sources
Logs

Application Error Spike

The service is emitting a high volume of ERROR or CRITICAL log lines over a rolling window, indicating an active application-side failure with no matching specific log signature. Common causes include schema or payload mismatches after a deploy, repeated business-logic exceptions, dependency failures logged as generic errors, and application regressions that produce many error-level logs without a canonical runtime signature. This signal indicates the application is actively failing rather than providing a precise diagnosis of the underlying cause. This root cause activates when generic error-severity log volume exceeds 2,000 occurrences within a one-hour window.

Category
Application
Integration Sources
Logs

Authentication Misconfiguration

Application Load Balancer (ALB) authentication misconfiguration can disrupt secure traffic routing and lead to widespread configuration issues. This misconfiguration may trigger elevated ELB authentication errors, 504 request timeouts, and target connection errors, ultimately impacting service availability and application performance.

Category
Service / Application Load Balancer
Integration Sources
Infrastructure Scraper (AWS)

Cassandra Tombstone Pressure

Cassandra reads are scanning excessive tombstones, indicating partition design or TTL issues that are causing significant read latency and resource pressure. Cassandra logs a warning when a read scans more tombstones than tombstone_warn_threshold (default 1000). Tombstones are markers for deleted data that must be scanned until compaction removes them. Excessive tombstones cause read amplification, increased GC pressure, and can trigger ReadTimeoutExceptions.

Category
Application
Integration Sources
Logs

Causely Agent Down

The Causely agent on a Kubernetes node is unavailable, creating an observability gap for that node. When the agent is down, telemetry and symptom collection from the affected node may be incomplete or missing, reducing Causely's ability to detect and analyze issues originating there. Common causes include the node becoming unreachable or unavailable and taking the agent down with it, the agent container crashing or failing health checks, CPU, memory, or filesystem pressure on the node preventing the agent from running reliably, and configuration or deployment problems such as invalid configuration, rollout failures, or image errors.

Category
Infrastructure / Node
Integration Sources
Infrastructure Scraper

Circuit Breaker Open

A circuit breaker protecting an upstream dependency has opened, causing the service to fail fast on calls to that dependency instead of waiting for timeouts. Circuit breakers such as Resilience4j and Netflix Hystrix open when an upstream dependency exceeds a configured failure rate or slow-call threshold. While open, all calls to the dependency are immediately rejected to prevent cascading failures and reduce latency. This is a protective mechanism, but it surfaces as errors for callers.
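
As a language-neutral sketch of the mechanism described above (not the Resilience4j or Hystrix implementation), the Go code below shows a minimal breaker that opens after a configured number of consecutive failures and fails fast until a cool-down elapses. The thresholds and type names are illustrative; production libraries add half-open probing, sliding windows, and slow-call detection on top of this idea.

```go
package breaker

import (
	"errors"
	"sync"
	"time"
)

var ErrOpen = errors.New("circuit breaker is open: failing fast")

// Breaker is a minimal consecutive-failure circuit breaker.
type Breaker struct {
	mu          sync.Mutex
	failures    int
	maxFailures int
	openUntil   time.Time
	cooldown    time.Duration
}

func New(maxFailures int, cooldown time.Duration) *Breaker {
	return &Breaker{maxFailures: maxFailures, cooldown: cooldown}
}

// Call runs fn unless the breaker is open, in which case it fails fast.
func (b *Breaker) Call(fn func() error) error {
	b.mu.Lock()
	if time.Now().Before(b.openUntil) {
		b.mu.Unlock()
		return ErrOpen
	}
	b.mu.Unlock()

	err := fn()

	b.mu.Lock()
	defer b.mu.Unlock()
	if err != nil {
		b.failures++
		if b.failures >= b.maxFailures {
			// Open: reject all calls for the cooldown period.
			b.openUntil = time.Now().Add(b.cooldown)
			b.failures = 0
		}
		return err
	}
	b.failures = 0 // a success resets the failure streak
	return nil
}
```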

Category
Application
Integration Sources
Logs

Congested

The service is experiencing congestion, resulting in high latency for clients. This suggests that the system is unable to handle the current load efficiently, causing delays in response times. Congestion often occurs when the service receives more requests than it can handle within its capacity, leading to bottlenecks in processing. This may be due to insufficient resources (for example, CPU, memory, or bandwidth), unoptimized code, or a surge in traffic (for example, due to a sudden increase in demand or DDoS attack).

Category
Service
Integration Sources
Service Communication

Congested

The disk has reached full capacity, which prevents new data from being written and may cause applications to fail, especially those dependent on free disk space for logs, caching, or temporary files. This can also slow down or halt system operations if critical processes can no longer write to the disk.

Category
Infrastructure / Disk
Integration Sources
Infrastructure Scraper

Congested Azure Event Hub Namespace

When the Azure Event Hub namespace becomes congested, it reaches a point where its processing capacity is exceeded. This leads to consistent throttling of operations, as the system enforces limits to prevent overload. The high rate of throttling not only impacts event ingestion but also cascades into resource starvation, affecting downstream services that rely on timely event processing. Such congestion is typically caused by high message throughput, suboptimal configuration, or insufficient scaling to handle peak loads.

Category
Service / Messaging / Event Streaming
Integration Sources
Infrastructure Scraper (Azure)

Connection Pool Exhausted

The database or service connection pool is exhausted, and new connection requests are being rejected, causing application errors for all callers. Connection pool exhaustion occurs when all connections in the pool are in use and no connection becomes available within the timeout. Sources include PostgreSQL "sorry, too many clients already", MySQL error 1040, Redis "max number of clients reached", HikariCP timeout, or pgBouncer limit. This typically indicates a connection leak, long-running transactions holding connections, or insufficient pool sizing for the load.
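
On the client side, bounding and recycling the pool makes exhaustion visible as a fast, explicit timeout instead of unbounded connection growth. Below is a minimal sketch using Go's database/sql; the Postgres driver and the pool sizes are illustrative choices, not recommendations from Causely.

```go
package storage

import (
	"context"
	"database/sql"
	"time"

	_ "github.com/lib/pq" // Postgres driver; illustrative choice
)

func openPool(dsn string) (*sql.DB, error) {
	db, err := sql.Open("postgres", dsn)
	if err != nil {
		return nil, err
	}
	// Bound the pool explicitly so a leak or traffic spike surfaces as a
	// clear acquisition timeout rather than ever-growing connections.
	db.SetMaxOpenConns(20)                  // hard ceiling on concurrent connections
	db.SetMaxIdleConns(10)                  // keep some warm connections around
	db.SetConnMaxLifetime(30 * time.Minute) // recycle connections periodically
	db.SetConnMaxIdleTime(5 * time.Minute)  // release idle connections

	// Fail fast if a connection cannot be acquired within a deadline.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	return db, db.PingContext(ctx)
}
```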

Category
Application
Integration Sources
Logs

Conntrack Table Congested

The conntrack table on a VM is congested, causing new network connections to fail. This typically results in connectivity issues for applications, degraded performance, or downtime for services dependent on network communication. The conntrack table is responsible for tracking active network connections and has a fixed size, which can be exhausted under high connection load.

Category
Infrastructure / VirtualMachine
Integration Sources
Infrastructure Scraper

Container Runtime Issue

The node's container runtime is unhealthy or unstable, preventing containers from starting or running reliably. Kubernetes depends on the container runtime to manage the full container lifecycle, including creation, startup, shutdown, and health monitoring. When the runtime is impaired, pods may fail to launch, existing workloads can become unstable, and the node may drift into a degraded state. Common causes include repeated crashes or restart loops in the runtime process that interrupt container lifecycle management, invalid runtime configuration preventing the node from managing containers correctly, overlay filesystem corruption or local disk issues breaking container image operations, and CPU, memory, or I/O saturation destabilizing the runtime.

Category
Infrastructure / Node
Integration Sources
Infrastructure Scraper

Contention on Database Table Locks

The database table is experiencing an abnormally high rate of exclusive locks, preventing multiple transactions from accessing the table simultaneously. This creates a bottleneck that significantly degrades performance for all client applications depending on this table. Excessive locking typically occurs due to long-running transactions, lock contention between multiple transactions, inefficient transaction design, inappropriate isolation levels, or missing indexes leading to table scans instead of index seeks.

Category
Data Pipeline / Database Table
Integration Sources
Metrics

CPU Congested

One or multiple containers in a workload are experiencing CPU congestion, leading to potential throttling. This occurs when the containers use more CPU resources than allocated, causing degraded performance, longer response times, or application crashes. CPU throttling occurs when a container exceeds its CPU quota as defined by Kubernetes or Docker.

Category
Infrastructure / Compute Spec
Integration Sources
Infrastructure Scraper

CPU Congested

A Virtual Machine (VM) experiencing CPU congestion can lead to sluggish application performance, delayed response times, or even timeout errors for users and processes. This typically indicates that the VM's CPU is overutilized, potentially due to high resource demands from applications or insufficient CPU allocation.

Category
Infrastructure / VirtualMachine
Integration Sources
Infrastructure Scraper

CPU Congestion

After a version upgrade, application containers experience high CPU usage, leading to performance degradation or unresponsiveness. This issue impacts the system's ability to handle requests effectively, potentially causing downtime or delays for end users. High CPU usage post-upgrade typically stems from changes in the application code, dependencies, or configurations.

Category
Release Management / Code Change Regression
Integration Sources
Infrastructure Scraper

Crash Failure

One or multiple containers of a workload have crashed with a non-zero exit code, indicating abnormal termination. This disrupts the application's functionality, leading to downtime or degraded performance depending on how the workload is designed. The non-zero exit code signifies an error during the execution of the container's process.

Category
Infrastructure / Compute Spec
Integration Sources
Infrastructure Scraper

Database Connection Pool Saturated

The client-side database connection pool is exhausted when all available connections are in use, preventing new database queries from being executed. This can cause application requests to hang or fail, impacting user experience and potentially leading to downtime for database-dependent features.

Category
Application
Integration Sources
Metrics, Infrastructure Scraper, Symptom Activation

Database Connection Pool Saturated

After a version upgrade, the client-side database connection pool is exhausted when all available connections are in use, preventing new database queries from being executed. This can cause application requests to hang or fail, impacting user experience and potentially leading to downtime for database-dependent features.

Category
Release Management / Code Change Regression
Integration Sources
Metrics, Infrastructure Scraper, Symptom Activation

Database Malfunction

The database is returning a high rate of errors or failing to respond to queries, causing disruptions for services and clients that depend on it. This may result in delayed or failed access to one or more tables, leading to degraded application performance, elevated latency, or complete unavailability of database-backed functionality.

Category
Application
Integration Sources
Service Communication, Metrics

Disk Full

A filesystem used by this service has run out of space (ENOSPC), causing write operations to fail. "No space left on device" errors occur when a write() syscall fails because the filesystem has no free blocks. Logs, data files, temp files, or WAL segments may all contribute to disk exhaustion. This causes immediate write failures and can cause the process to crash or enter a degraded state.

Category
Application
Integration Sources
Logs

Disk Pressure

Disk pressure on a Kubernetes node indicates that the node's disk usage is high, potentially causing the eviction of pods, reduced performance, and the inability to schedule new pods. This affects application stability and the node's overall functionality. Disk pressure can arise from insufficient disk space, often caused by log accumulation, container images, temporary files, or application data.

Category
Infrastructure / Node
Integration Sources
Infrastructure Scraper

Disk Read IOPs Congested

The total disk read IOPS for a cloud VM are congested because the VM has reached its maximum allowable IOPS limit. This results in throttling, which can slow application performance and lead to delays or errors in read-heavy workloads.

Category
Infrastructure / VirtualMachine
Integration Sources
Infrastructure Scraper

Disk Read Throughput Congested

The total disk read throughput for a cloud VM is congested because the VM has reached its maximum allowable read bandwidth. This can lead to slower data transfer rates for read-intensive applications, causing delays in processing and reduced system performance.

Category
Infrastructure / VirtualMachine
Integration Sources
Infrastructure Scraper

Disk Total IOPs Congested

The total disk IOPS for a cloud VM are congested because the VM has reached its maximum allowable IOPS limit. This results in throttling, which can slow application performance and lead to delays or errors in read/write-heavy workloads.

Category
Infrastructure / VirtualMachine
Integration Sources
Infrastructure Scraper

Disk Total Throughput Congested

The total disk throughput for a cloud VM is congested because the VM has reached its maximum allowable bandwidth. This can lead to slower data transfer rates for read/write-intensive applications, causing delays in processing and reduced system performance.

Category
Infrastructure / VirtualMachine
Integration Sources
Infrastructure Scraper

Disk Write IOPs Congested

The total disk write IOPS for a cloud VM are congested because the VM has reached its maximum allowable IOPS limit. This results in throttling, which can slow application performance and lead to delays or errors in write-heavy workloads.

Category
Infrastructure / VirtualMachine
Integration Sources
Infrastructure Scraper

Disk Write Throughput Congested

The total disk write throughput for a cloud VM is congested because the VM has reached its maximum allowable write bandwidth. This can lead to slower data transfer rates for write-intensive applications, causing delays in processing and reduced system performance.

Category
Infrastructure / VirtualMachine
Integration Sources
Infrastructure Scraper

Elasticsearch Cluster Unhealthy

The Elasticsearch cluster health status has transitioned to RED, indicating one or more primary shards are unassigned and data is unavailable for those shards. Elasticsearch reports RED health when at least one primary shard is unassigned. Search and indexing requests for affected indices will fail. This is caused by node failures, insufficient nodes to satisfy the index replication factor, or shard allocation issues.

Category
Application
Integration Sources
Logs

Ephemeral Storage Congested

A container is experiencing ephemeral storage congestion when its ephemeral storage usage becomes critically high, leading to failures in operations that depend on temporary storage. This may be triggered by factors such as excessive logging, inadequate cleanup of temporary files, or unexpected bursts in data processing.

Category
Infrastructure / Container
Integration Sources
Infrastructure Scraper

Ephemeral Storage Noisy Neighbor

A container acting as a noisy neighbor consumes excessive ephemeral storage, resulting in abnormally high storage usage and contributing to node-level disk pressure that can trigger pod evictions. This issue arises when a container consistently uses more ephemeral storage than expected.

Category
Infrastructure / Container
Integration Sources
Infrastructure Scraper

Excessive DNS Traffic from Client

The application is generating an unusually high volume of DNS requests, potentially overwhelming DNS servers, increasing latency for users, and risking service disruptions. This behavior may also incur additional costs or trigger rate-limiting from DNS providers. This typically arises when the application initiates DNS lookups more frequently than necessary due to lack of effective caching, redundant DNS resolution logic, or misconfigurations.

Category
Application
Integration Sources
Service Communication

Faulty Error Handling in HTTP Path

The HTTP path is experiencing a high rate of errors, causing disruptions for clients. This can lead to degraded performance, failed requests, or complete service unavailability, significantly affecting the user experience.

Category
Data Pipeline / HTTP Path
Integration Sources
Service Communication

Faulty Error Handling in RPC Method

The RPC Method is experiencing a high rate of errors, causing disruptions for clients. This can lead to degraded performance, failed requests, or complete service unavailability, significantly affecting the user experience.

Category
Data Pipeline / RPC Method
Integration Sources
Service Communication

File Descriptor Exhaustion

The application has reached the system-imposed limit on the number of file descriptors it can open. This typically leads to errors such as "Too many open files," preventing the application from creating new connections, reading files, or accessing resources. This can severely impact functionality, particularly in high-concurrency or high-I/O scenarios.

Category
Application
Integration Sources
Metrics (Elasticsearch)

File Limit Exhausted

The process has reached the operating system file descriptor limit (EMFILE/ENFILE), preventing it from opening new network connections, files, or sockets. Each open socket, file, or pipe consumes a file descriptor. When the per-process limit (ulimit -n / RLIMIT_NOFILE) or system-wide limit (/proc/sys/fs/file-max) is reached, new open/accept/connect calls fail with EMFILE or ENFILE. This manifests as connection refused errors in network servers or file-open failures in data pipelines.
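
For reference, a process can inspect its file descriptor limit and raise its own soft limit up to the hard cap at startup; raising the hard limit itself still requires ulimit, the systemd unit, or the container spec. The Linux-only Go sketch below is illustrative.

```go
//go:build linux

package main

import (
	"fmt"
	"syscall"
)

// reportAndRaiseNoFile prints the current RLIMIT_NOFILE values and raises
// the soft limit to the hard limit, the most an unprivileged process can do.
func reportAndRaiseNoFile() error {
	var lim syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
		return err
	}
	fmt.Printf("open-file limit: soft=%d hard=%d\n", lim.Cur, lim.Max)

	lim.Cur = lim.Max // raise the soft limit up to the hard cap
	return syscall.Setrlimit(syscall.RLIMIT_NOFILE, &lim)
}

func main() {
	if err := reportAndRaiseNoFile(); err != nil {
		fmt.Println("rlimit adjustment failed:", err)
	}
}
```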

Category
Application
Integration Sources
Logs

Filesystem Issue

The node filesystem is corrupted or has been remounted read-only, preventing normal node and container operation. The kubelet and container runtime depend on writable filesystem access to maintain state, update pod data, and manage container lifecycles, and when that access is lost, workloads on the node may fail entirely. Common causes include host filesystem corruption that breaks kubelet or runtime state management, the kernel remounting the filesystem read-only after detecting storage errors, and underlying disk or cloud volume failures that surface as filesystem instability.

Category
Infrastructure / Node
Integration Sources
Infrastructure Scraper

Frequent Crash Failure

One or multiple containers of a workload are frequently crashing with a non-zero exit code, indicating abnormal termination. This disrupts the application's functionality, leading to downtime or degraded performance depending on how the workload is designed.

Category
Infrastructure / Compute Spec
Integration Sources
Infrastructure Scraper

Frequent Crash Failure

One or multiple containers of a workload are frequently crashing with a non-zero exit code after a version upgrade. This disrupts the application's functionality, leading to downtime or degraded performance depending on the workload design. The issue likely stems from changes introduced in the new version.

Category
Release Management / Code Change Regression
Integration Sources
Infrastructure Scraper

Frequent Memory Failure

The application frequently runs out of memory, leading to crashes, performance degradation, or instability. This affects the application's availability and can lead to downtime or poor user experience. The issue is likely due to inefficient memory usage, such as memory leaks, excessive data loading into memory, or improper garbage collection.

Category
Infrastructure / Compute Spec
Integration Sources
Infrastructure Scraper

Frequent Memory Failure

The application is running out of memory after a version upgrade, leading to crashes, degraded performance, or instability. This impacts availability and user experience, often requiring container restarts or manual intervention to restore functionality. The issue is likely tied to changes in the updated version that increase memory usage or introduce inefficiencies.

Category
Release Management / Code Change Regression
Integration Sources
Infrastructure Scraper

Frequent Pod Ephemeral Storage Evictions

A Kubernetes workload is experiencing frequent pod evictions due to ephemeral storage exhaustion. This disrupts application availability and performance, as pods are terminated when they exceed their allocated storage limits or when node-level storage is under pressure.

Category
Infrastructure / Controller
Integration Sources
Infrastructure Scraper

Go Deadlock

The Go runtime has detected a full deadlock where all goroutines are permanently blocked, causing the process to panic and exit. The Go runtime prints "all goroutines are asleep - deadlock!" and exits when it determines that no goroutine can ever make progress. This is a fatal condition and the process terminates immediately. Common causes include channel send/receive with no corresponding partner, mutex lock with no unlock, or sync.WaitGroup misuse.
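
A minimal reproduction of this condition, for illustration only: the sole goroutine blocks on a channel receive that no sender can ever satisfy, so the runtime aborts with the deadlock message quoted above.

```go
package main

// The only goroutine blocks on an unbuffered channel with no sender,
// so the runtime exits with
// "fatal error: all goroutines are asleep - deadlock!".
func main() {
	ch := make(chan int) // unbuffered channel, no sender anywhere
	<-ch                 // blocks forever; the runtime detects the deadlock and exits
}
```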

Category
Application
Integration Sources
Logs

Go Nil Pointer Panic

The Go runtime has panicked due to an invalid memory address or nil pointer dereference, causing the handler or process to fail. This panic occurs when code dereferences a nil pointer. In Go services, an unhandled panic usually terminates the current goroutine and may crash the whole process unless recovered. Common causes include missing dependency initialization, nil interface assumptions, unchecked map or pointer fields, or races around object lifecycle.
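
Below is a hedged illustration of both the failure mode and a common guard: check for nil explicitly and recover panics at the request boundary so a single bad request does not take down the process. The handler, config type, and port are hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
)

type Config struct {
	Greeting string
}

var cfg *Config // nil until initialization runs; dereferencing it panics

// recoverMiddleware converts an unhandled panic in a handler into a 500
// response instead of crashing the whole process.
func recoverMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		defer func() {
			if p := recover(); p != nil {
				http.Error(w, fmt.Sprintf("internal error: %v", p), http.StatusInternalServerError)
			}
		}()
		next.ServeHTTP(w, r)
	})
}

func handler(w http.ResponseWriter, r *http.Request) {
	// Guard the nil case explicitly instead of assuming initialization ran.
	if cfg == nil {
		http.Error(w, "configuration not loaded", http.StatusServiceUnavailable)
		return
	}
	fmt.Fprintln(w, cfg.Greeting)
}

func main() {
	http.Handle("/", recoverMiddleware(http.HandlerFunc(handler)))
	http.ListenAndServe(":8080", nil)
}
```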

Category
Application
Integration Sources
Logs

GOMAXPROCS Misconfigured

The environment variable GOMAXPROCS, which controls the maximum number of CPU cores the Go runtime uses, has been set higher than the CPU limit of the container in which the Go application is running. This mismatch can lead to inefficient CPU usage, reduced performance, and potential throttling because the Go runtime attempts to schedule more work than the container is permitted to handle.
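
The sketch below shows one way to align GOMAXPROCS with the container's CPU limit at startup. The CPU_LIMIT environment variable is a hypothetical name you would wire up yourself (for example via the Kubernetes downward API); importing go.uber.org/automaxprocs is an alternative that reads the cgroup quota directly.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"strconv"
)

func main() {
	// In a container, runtime.NumCPU() reports the node's CPUs, not the
	// container's CPU limit, so the default GOMAXPROCS can be far too high.
	fmt.Println("node CPUs:", runtime.NumCPU())
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0)) // 0 = read the current value

	// Option 1: have the deployment pass the CPU limit explicitly and apply
	// it at startup. CPU_LIMIT is an illustrative variable name.
	if v := os.Getenv("CPU_LIMIT"); v != "" {
		if n, err := strconv.Atoi(v); err == nil && n > 0 {
			runtime.GOMAXPROCS(n)
		}
	}

	// Option 2: import go.uber.org/automaxprocs, which adjusts GOMAXPROCS
	// from the cgroup CPU quota automatically:
	//   import _ "go.uber.org/automaxprocs"
}
```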

Category
Application
Integration Sources
Metrics, Symptom Activation

HAProxy Backend Unavailable

HAProxy has no available backend servers, as all backends are either down or at their connection limit, causing all incoming requests to be rejected. HAProxy logs "backend has no server available" when every server in the backend pool is in DOWN state or has reached its maxconn limit. Clients receive a 503 Service Unavailable. This can be caused by all backend servers being unhealthy, a misconfigured health check, or a connection storm saturating backends.

Category
Application
Integration Sources
Logs

High Server Errors

Common causes for Azure Event Hub errors include quota exceeded (throughput or message size limits have been breached), partition or offset issues (consumers unable to connect or reading from invalid offsets), networking problems (connectivity issues due to firewall rules, DNS misconfigurations, or latency), service outage (regional Azure service disruption), and misconfigured access policies (incorrect SAS tokens, permissions, or authentication methods).

Category
Service / Messaging / Event Streaming
Integration Sources
Infrastructure Scraper (Azure)

High User Errors

High user error rates in Azure Event Hub often stem from configuration issues, such as mismatched security credentials (for example Shared Access Signature (SAS) tokens), client SDK version incompatibilities, or throttling from overusing allocated resources. Insufficient permissions or quotas being exceeded can also trigger these errors. Another common cause is incorrect partition or consumer group usage, which can lead to connection limits being breached or messages being inaccessible.

Category
Service / Messaging / Event Streaming
Integration Sources
Infrastructure Scraper (Azure)

Idle Timeout Misconfiguration

Misconfigured idle timeout settings can lead to unintended connection drops and delays, potentially triggering a high frequency of 504 gateway timeout errors. This misconfiguration may also contribute to broader configuration issues that disrupt seamless connectivity between clients and servers.

Category
Service / Application Load Balancer
Integration Sources
Infrastructure Scraper (AWS)

Image Pull Errors

Kubernetes controllers may encounter image pull errors when they cannot download container images from a registry, causing Pods to fail to start or remain in an ImagePullBackOff state. This disrupts the deployment of applications and can affect service availability.

Category
Infrastructure / Controller
Integration Sources
Infrastructure Scraper

Inefficient DNS Lookup

The application is making an unusually high volume of DNS requests, with over 80% returning NXDomain (non-existent domain) responses. This excessive DNS activity is adding 10 to 20 ms of latency to each request, negatively impacting service performance. The issue is often caused by the service or application attempting to resolve incomplete or unqualified domain names.

Category
Application
Integration Sources
Service Communication

Inefficient Garbage Collection

The application is experiencing high latency and reduced throughput because a significant portion of its runtime is being spent in garbage collection (GC). This leads to frequent pauses, degrading overall performance and causing delays in request handling through all the dependent services. This issue usually occurs when the Java Virtual Machine (JVM) or other garbage-collected runtime environments are under memory pressure.

Category
Application
Integration Sources
Metrics, Symptom Activation

Inefficient Garbage Collection

After a version upgrade, the garbage collector is frequently running, leading to performance degradation or crashes. This issue is likely caused by changes in the application code or dependencies that increase memory usage or introduce inefficiencies.

Category
Release Management / Code Change Regression
Integration Sources
Metrics, Symptom Activation, Infrastructure Scraper

Inode Usage Congested

The disk is experiencing inode exhaustion, meaning the file system has run out of inodes (metadata structures for file storage), which prevents new files from being created even if there is free disk space. This often causes errors in applications attempting to create files and can disrupt services reliant on file storage.

Category
Infrastructure / Disk
Integration Sources
Infrastructure Scraper

Invalid Client Certificate

The application is failing to connect to a service due to invalid certificate errors, preventing secure communication over HTTPS or TLS. This can cause downtime or degraded functionality for users relying on this service.

Category
Application
Integration Sources
Service Communication

Invalid Server Certificate

The network endpoint is serving an invalid server certificate, resulting in a high rate of client request errors due to certificate validation failures. This issue propagates further, increasing the overall request error rate across the system.

Category
Infrastructure / Network Endpoint
Integration Sources
Service Communication

IOPs Congested

The disk is experiencing read/write IOPS (input/output operations per second) congestion, meaning that the total IOPS capacity is fully utilized. This causes slow performance for applications that rely on disk access, leading to delayed data processing, system lags, or even timeouts.

Category
Infrastructure / Disk
Integration Sources
Infrastructure Scraper

Java GC Pressure

The JVM garbage collector is under pressure, logging allocation failures or to-space exhaustion events, indicating the heap is too small or object allocation rates are too high. GC pressure manifests as "Allocation Failure" in ParallelGC/SerialGC logs or "to-space exhausted" in G1GC logs. Both indicate that GC cannot reclaim space fast enough to keep up with allocation demand. This leads to longer GC pauses, increased latency, and potential OutOfMemoryError if unresolved.

Category
Application
Integration Sources
Logs

Java Heap Saturated

The JVM is operating with limited available heap memory, resulting in degraded performance or potential application crashes. This condition typically leads to frequent or prolonged garbage collection (GC) pauses, slow response times, and, in severe cases, OutOfMemoryError. It often reflects memory leaks, improper heap sizing, or excessive object allocation.

Category
Application
Integration Sources
Metrics, Symptom Activation

Java Heap Saturated

After a version upgrade, the Java heap is frequently congested, leading to performance degradation or crashes. This issue is likely caused by changes in the application code or dependencies that increase memory usage or introduce inefficiencies.

Category
Release Management / Code Change Regression
Integration Sources
Metrics, Symptom Activation, Infrastructure Scraper

Java Null Pointer

The JVM has thrown java.lang.NullPointerException, indicating application code dereferenced a null reference and failed unexpectedly. A NullPointerException typically means application logic assumed an object was present when it was actually null. Depending on exception handling, this can fail an individual request or crash the process. Common causes include missing dependency wiring, invalid state transitions, bad deserialization assumptions, or unguarded optional values.

Category
Application
Integration Sources
Logs

Java Out of Memory

The JVM has thrown java.lang.OutOfMemoryError, indicating the heap, metaspace, or GC overhead limit is exhausted and the process cannot allocate memory. OutOfMemoryError occurs when the JVM cannot satisfy an allocation request. Common subtypes include heap space (object allocation failed), Metaspace (class metadata exhausted), GC overhead limit exceeded (GC spending over 98% of time reclaiming less than 2% of heap), or unable to create native thread. The process may continue in a degraded state or crash.

Category
Application
Integration Sources
Logs

Java Stack Overflow

The JVM has thrown java.lang.StackOverflowError due to runaway recursion or an excessively deep call stack that has exhausted the thread stack space. Each thread has a fixed stack size (-Xss). When recursive calls or deep call chains exceed this limit, StackOverflowError is thrown. Unlike OutOfMemoryError, this is usually a code defect such as infinite recursion or mutual recursion without a base case, or a framework issue with excessive proxy/interceptor wrapping.

Category
Application
Integration Sources
Logs

Java Thread Pool Exhausted

The JVM thread pool is saturated and rejecting new task submissions with java.util.concurrent.RejectedExecutionException. ThreadPoolExecutor throws RejectedExecutionException when both the thread pool is at max capacity and the task queue is full and the rejection policy fires (default: AbortPolicy). This causes request handlers to fail, leading to increased error rates for callers.

Category
Application
Integration Sources
Logs

Kafka ISR Shrink

A Kafka broker has logged ISR (In-Sync Replica) shrinkage, indicating that one or more follower replicas have fallen behind the leader and been removed from the ISR set. Kafka logs "ISR shrunk from [X] to [Y]" when a follower fails to keep up with the leader within replica.lag.time.max.ms. While a replica is out of the ISR, the effective replication factor is reduced, increasing the risk of data loss if the leader fails. Producers with acks=all will also experience increased latency or errors.

Category
Application
Integration Sources
Logs

Kafka Partition Outage

Kafka has one or more offline partitions, meaning the cluster cannot maintain a healthy leader for those partitions and affected topic data is currently unavailable. This is a hard failure condition rather than a performance degradation: produce requests may be rejected and consumers cannot fetch from affected partitions until leadership is restored. Common causes include a broker outage that simultaneously removes leaders and replicas from the in-sync replica set, replication falling far enough behind that no eligible replica can be elected as leader, controller instability during broker flaps or cluster reconfiguration, and storage corruption or severe disk failure on a broker node.

Category
Application
Integration Sources
Metrics, Symptom Activation

Kafka Replication Degraded

One or more Kafka partitions are under-replicated, meaning followers are not keeping pace with the partition leader. This reduces fault tolerance across the cluster: if an additional broker is lost or a leader change is forced, the partition may become unavailable. The cluster may still be serving traffic, but producers and consumers are operating with reduced resilience. Common causes include disk I/O saturation on a follower broker slowing replica fetch or log flush, network congestion between brokers, a follower broker that is restarting, stuck, or under heavy load, and replication traffic competing with high-volume client produce traffic.

Category
Application
Integration Sources
Metrics, Symptom Activation

Kafka Storage Pressure

Kafka broker log storage utilization is high, causing degraded performance across produce, replication, and retention paths. Kafka depends on sequential log I/O, and as data-log disks approach capacity, flush, compaction, and replica synchronization operations become less efficient. This typically creates a starvation pattern of increasing broker latency, slower replication, and delayed producer acknowledgements rather than an immediate hard failure. Common causes include topic retention policies keeping more data than the disk budget supports, write rate growth exceeding storage throughput, a log compaction backlog consuming disk bandwidth, and uneven partition placement concentrating data on a single broker.

Category
Application
Integration Sources
Metrics, Symptom Activation

Kernel Issue

The node kernel is stalled or deadlocked, preventing the host from functioning correctly. Kernel-level failures are severe conditions that can block scheduling, container management, networking, and filesystem operations simultaneously, and a kernel deadlock often precedes full node unavailability. Common causes include bugs in the kernel or kernel modules that deadlock critical host operations, storage or networking drivers triggering hangs under load, and extreme resource contention exposing kernel-level instability.

Category
Infrastructure / Node
Integration Sources
Infrastructure Scraper

Kubelet Issue

The kubelet on this node is unhealthy or restarting frequently, disrupting pod lifecycle operations and making the node unreliable. The kubelet is the primary Kubernetes agent on each node and is responsible for reporting node status, managing pod readiness, and coordinating with the container runtime. When it becomes unstable, the node may stop reporting status correctly, fail to update readiness conditions, and struggle to manage running pods. Common causes include CPU, memory, or disk pressure destabilizing the kubelet process, invalid kubelet flags or node configuration causing repeated failures, problems with the container runtime, networking stack, or host filesystem cascading into kubelet instability, and kernel or VM-level host issues making the kubelet unreliable.

Category
Infrastructure / Node
Integration Sources
Infrastructure Scraper

Lock Contention

The application suffers from inefficient locking, where suboptimal lock management leads to excessive contention and prolonged mutex wait times. This inefficiency degrades performance by increasing the risk of thread starvation under heavy load. This can stem from overuse of locks, coarse-grained locking strategies, or improper lock design.
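
As an illustration of the difference between coarse and finer-grained locking, the Go sketch below contrasts a single mutex that serializes all access with an RWMutex that lets readers proceed in parallel. The cache types are hypothetical and stand in for whatever shared state is under contention.

```go
package cache

import "sync"

// Coarse-grained: every reader and writer serializes on one mutex,
// which shows up as long mutex wait times under load.
type CoarseCache struct {
	mu   sync.Mutex
	data map[string]string
}

func (c *CoarseCache) Get(k string) string {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.data[k]
}

// Finer-grained for read-heavy workloads: an RWMutex lets many readers
// proceed concurrently, and only writers take the exclusive lock.
type ReadMostlyCache struct {
	mu   sync.RWMutex
	data map[string]string
}

func (c *ReadMostlyCache) Get(k string) string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.data[k]
}

func (c *ReadMostlyCache) Set(k, v string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.data == nil {
		c.data = make(map[string]string)
	}
	c.data[k] = v
}
```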

Category
Application
Integration Sources
Metrics, Symptom Activation

Lock Contention

After a version upgrade, the application is experiencing frequent locking contention, leading to performance degradation or crashes. This issue is likely caused by changes in the application code or dependencies that increase locking or introduce inefficiencies.

Category
Release Management / Code Change Regression
Integration Sources
Metrics, Symptom Activation, Infrastructure Scraper

Malfunction

The Service is experiencing a high rate of errors, causing disruptions for clients. This can lead to degraded performance, failed requests, or complete service unavailability, significantly affecting the user experience.

Category
Service
Integration Sources
Service Communication

Malfunction

Multiple pods for a Kubernetes controller are in a "NotReady" state for an extended period, which can lead to service unavailability or degraded performance.

Category
Infrastructure / Controller
Integration Sources
Infrastructure Scraper

Malfunction

A Kubernetes node is unavailable or unresponsive, causing workloads scheduled on it to fail. This typically means the kubelet or the underlying host is no longer operating normally, and pods on the affected node may become unavailable, fail health checks, or stop serving traffic entirely. Common causes include CPU, memory, disk, or kernel resource exhaustion destabilizing the node, the kubelet, container runtime, or other host-level services becoming unhealthy or unreachable, cloud instance failures, network isolation, or underlying VM problems making the node unavailable, and excessive pod concentration or invalid node-level configuration causing the node to become unstable.

Category
Infrastructure / Node
Integration Sources
Infrastructure Scraper

Memory Congested

Memory congestion in a Virtual Machine (VM) leads to slow system performance, application crashes, or even VM instability as the system struggles to allocate memory for running processes. This typically results in frequent swapping or out-of-memory (OOM) errors, impacting applications and user operations.

Category
Infrastructure / VirtualMachine
Integration Sources
Infrastructure Scraper

Memory Exhaustion

The Broker application has exhausted its available memory, resulting in degraded performance and potential service disruption. When memory usage reaches critical levels, the system may experience increased garbage collection (GC) activity, higher processing latency, and, in severe cases, OutOfMemoryError events that halt message processing.

Category
Application
Integration Sources
Infrastructure Scraper

Memory Failure

Containers running out of memory can lead to service crashes or degraded performance, resulting in errors for end users or failed service requests. This typically occurs when a container's allocated memory is insufficient for the workload it is handling, causing out-of-memory (OOM) errors and potential system instability.

Category
Infrastructure / Compute Spec
Integration Sources
Infrastructure Scraper

Memory Failure

Memory failures after a code change can cause containers to crash or degrade performance, resulting in errors for end users or failed service requests. These issues occur when newly introduced code leads to unexpected increases in memory usage, triggering out-of-memory (OOM) errors and destabilizing the system.

Category
Release Management / Code Change Regression
Integration Sources
Infrastructure Scraper

Memory Noisy Neighbor

A container acting as a noisy neighbor consumes excessive memory, leading to abnormally high memory usage and contributing to node-level memory pressure that can trigger pod evictions. This issue occurs when a container consistently uses more memory than expected, which adversely impacts both the container and its hosting node.

Category
Infrastructure / Container
Integration Sources
Infrastructure Scraper

Memory Pressure

Memory pressure on a Kubernetes node occurs when available memory falls below critical levels, potentially causing the eviction of pods and instability for applications running on the node. This reduces the node's capacity to run workloads, potentially leading to service disruptions if insufficient resources are available across the cluster.

Category
Infrastructure / Node
Integration Sources
Infrastructure Scraper

MongoDB Connections Exhausted

MongoDB is at or near its maximum connection limit, and new client operations may fail when they cannot obtain a server connection. MongoDB enforces a cap on concurrent active connections, and when that limit is reached, clients cannot establish or reuse connections fast enough, causing requests to fail before any database work is attempted. Common causes include application connection pools configured with an excessively high maximum size, connection leaks where clients open connections without properly releasing them, traffic spikes that create more concurrent clients than the server can handle, and long-running queries that hold connections for extended periods.

Category
Application
Integration Sources
Metrics, Symptom Activation

MongoDB Cursor Pressure

MongoDB has an unusually high number of open cursors, increasing server-side state management overhead and slowing query execution for callers. Open cursors represent active or partially consumed query result streams, and when too many remain open simultaneously, MongoDB must maintain more concurrent state and expend additional resources servicing them. This typically manifests as increasing latency and resource starvation rather than immediate request failures. Common causes include clients reading through large result sets slowly, applications that do not fully consume or explicitly close cursors, excessive polling or fan-out query patterns that create many simultaneous cursors, and batch sizes configured too small, which keeps individual cursors open longer than necessary.

Category
Application
Integration Sources
Metrics, Symptom Activation

MongoDB Replica Lag

MongoDB replication lag is high, meaning secondaries are applying oplog entries noticeably behind the primary. Read queries directed to secondaries may return stale data, and the replica set's readiness to complete a clean failover is reduced. The database continues to operate, but replication cannot keep pace with the primary write rate, degrading both read consistency and fault tolerance. Common causes include secondary nodes with insufficient disk or CPU resources for the current oplog apply rate, network latency or bandwidth constraints between replica set members, bursty write workloads that generate oplog traffic faster than secondaries can consume, and long-running workloads on secondaries that compete with the replication apply process.

Category
Application
Integration Sources
Metrics, Symptom Activation

MySQL Deadlock

MySQL InnoDB has detected a deadlock between two or more transactions and rolled one back to break the cycle. MySQL InnoDB logs "Deadlock found when trying to get lock; try restarting transaction" when two transactions hold locks that the other needs. The engine automatically selects a victim and rolls it back. Frequent deadlocks indicate lock acquisition order inconsistencies in application code.

Category
Application
Integration Sources
Logs

Network Issue

The node network is unavailable or unstable, causing connectivity failures for node services and workloads. Node-level networking problems can disrupt kubelet communication, service routing, pod-to-pod traffic, and access to external systems, and in practice can make the node appear partially or fully unavailable even when the host itself is still running. Common causes include network interface instability such as resets or repeated unregister events interrupting traffic, misconfiguration in the node's CNI or routing stack isolating the node from the cluster, and underlying VM or cloud network failures cutting the node off from the rest of the infrastructure.

Category
Infrastructure / Node
Integration Sources
Infrastructure Scraper

Network Policy Misconfiguration

Application Load Balancer (ALB) network policy misconfigurations can block or restrict legitimate traffic, leading to widespread configuration issues. These misconfigurations often result in elevated 504 gateway timeout errors and target connection failures, ultimately disrupting service availability.

Category
Service / Application Load Balancer
Integration Sources
Infrastructure Scraper (AWS)

Nginx Upstream Timeout

Nginx is logging upstream connection timeouts, indicating a backend service is not responding within the configured proxy_read_timeout or proxy_connect_timeout. Nginx logs "upstream timed out (110: Connection timed out)" when a backend fails to respond within the configured timeout. This causes nginx to return a 504 Gateway Timeout to the caller. The upstream service may be overloaded, deadlocked, or experiencing a network partition.

Category
Application
Integration Sources
Logs

Nginx Worker Connections Exhausted

Nginx has exhausted its worker_connections limit and cannot accept new connections, meaning incoming requests are being dropped. Nginx's worker_connections directive limits the number of simultaneous connections per worker process. When all connections are in use, nginx logs "worker_connections are not enough" and new connections are refused. Total capacity equals worker_processes multiplied by worker_connections.

Category
Application
Integration Sources
Logs

Noisy Client

The application acts as a Noisy Client, generating a high number of requests that burden destination services with increased load and elevated request rates. This aggressive request pattern directly impacts destination services by driving a request rate that can overwhelm service capacity and contributing to increased load on the destination.

Category
Application
Integration Sources
Infrastructure Scraper

PID Pressure

The node is close to exhausting its process ID capacity, limiting its ability to create new processes and causing workload instability. When PID limits are approached, Kubernetes may be unable to start new processes reliably and workloads can become unstable even when CPU and memory remain available. Common causes include applications or system services creating excessive processes or threads, too many workloads concentrated on a single node, and PID limits configured too low for the workload profile running on the node.

Category
Infrastructure / Node
Integration Sources
Infrastructure Scraper

Postgres Cache Hit Rate Degraded

Postgres shared_buffers cache hit rate has dropped, causing queries to read from disk and significantly increasing query latency. Postgres uses shared_buffers as an in-memory page cache for table and index data, and when the working set fits in shared_buffers queries are served from memory, but when the cache hit rate drops Postgres must read data from disk or OS page cache, which is orders of magnitude slower and causes query latency to increase for all clients. Common causes include shared_buffers being too small for the working set size, new query patterns accessing large table scans that evict hot pages, database growth causing the working set to exceed available memory, sequential scans on large tables, and a cold cache after a server restart until the working set is loaded.

Category
Application
Integration Sources
Metrics, Symptom Activation

Postgres Checkpoint I/O Pressure

Postgres checkpoint I/O write time is high, indicating disk saturation that increases latency for all database operations. Postgres periodically performs checkpoints to flush dirty pages from shared_buffers to disk to ensure durability, and when checkpoint_write_time is high the disk is being saturated by checkpoint I/O, which competes with query I/O for reads and writes and causes elevated query latency for all clients. It may also indicate that checkpoints are too infrequent, leading to large bursts of write I/O. Common causes include a high WAL write rate generating many dirty pages between checkpoints, checkpoint_completion_target being too low and concentrating I/O into a short burst, disk I/O throughput being insufficient for the write rate, min_wal_size or max_wal_size being too small and causing frequent checkpoints, and shared storage throughput limits being hit.

Category
Application
Integration Sources
Metrics, Symptom Activation

Postgres Connection Slots Exhausted

Postgres active connections are approaching or have reached max_connections, and new connection attempts from services will be rejected. Postgres uses a process-per-connection model and limits total connections via max_connections, and when this limit is reached new connection attempts receive "FATAL: sorry, too many clients already," causing services that cannot connect to fail their database operations. Connections also consume shared memory, so running near the limit causes additional resource pressure. Common causes include no connection pooler being in use, connection pools being misconfigured with too high a max_size, connection leaks where connections are opened but not properly closed, sudden traffic spikes creating more service instances than the pool can handle, and long-running queries holding connections that should be idle.

Category
Application
Integration Sources
Metrics, Symptom Activation

Postgres Deadlock

PostgreSQL has detected a deadlock cycle between concurrent transactions and terminated one to resolve it. PostgreSQL logs "ERROR: deadlock detected" when its deadlock detector finds a cycle in the lock wait graph. One transaction is chosen as the victim and receives an error and must be retried. Frequent deadlocks degrade throughput and cause user-visible errors.

Category
Application
Integration Sources
Logs

Postgres Deadlock Storm

Postgres is experiencing a high rate of deadlocks, causing transaction rollbacks and forcing retries in upstream services. A deadlock occurs when two or more transactions each hold a lock the other needs, creating a circular dependency, and Postgres detects this and aborts one transaction with "ERROR: deadlock detected." The aborted transaction must be retried by the application, and a high deadlock rate causes elevated error rates, retry traffic amplification, and query latency spikes. Common causes include application code acquiring locks in inconsistent order across transactions, bulk UPDATE or DELETE operations without consistent row ordering, missing explicit locking where the application assumes order of operations, and high-concurrency workloads on the same rows without retry logic.

Category
Application
Integration Sources
Metrics, Symptom Activation

Postgres Idle-in-Transaction Accumulation

Postgres has idle-in-transaction sessions accumulating, and these sessions hold locks, prevent autovacuum, and cause compounding table bloat and lock contention. An idle-in-transaction session has started a transaction with BEGIN but is not actively executing any query, and it may be waiting for application-side processing, a network call, or simply be a leaked connection. These sessions hold all locks acquired during the transaction and prevent autovacuum from cleaning dead rows, and as they accumulate they cause lock contention, connection slot exhaustion, and eventually table bloat that degrades query performance. Common causes include the application opening a transaction and making an external API call or sleeping, an ORM or framework starting a transaction at request start but not committing promptly, a connection pool returning connections that were left in a transaction, and missing idle_in_transaction_session_timeout configuration.
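
A sketch of the client-side discipline that avoids this pattern in Go: scope the transaction to the database work only, always resolve it promptly, and add a server-side timeout as a backstop. The table name, database name, placeholders, and timeout values below are illustrative.

```go
package store

import (
	"context"
	"database/sql"
	"time"
)

// UpdateStatus keeps the transaction scoped to the database work only:
// any external API call or slow processing happens before BEGIN or after
// COMMIT, so the session never sits idle in transaction holding locks.
func UpdateStatus(ctx context.Context, db *sql.DB, id int64, status string) error {
	ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
	defer cancel()

	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op after a successful Commit; closes the tx on early return

	if _, err := tx.ExecContext(ctx,
		"UPDATE orders SET status = $1 WHERE id = $2", status, id); err != nil {
		return err
	}
	return tx.Commit()
}

// As a server-side backstop, a database-level timeout aborts sessions that
// still manage to idle inside a transaction (value is illustrative):
//
//   ALTER DATABASE app SET idle_in_transaction_session_timeout = '60s';
```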

Category
Application
Integration Sources
Metrics, Symptom Activation

Postgres Lock Contention

Postgres has a high number of sessions waiting to acquire locks, indicating contention from long-running or conflicting transactions. Lock contention occurs when multiple transactions attempt to modify the same rows or tables simultaneously, and sessions that cannot acquire the lock wait in the lock queue, consuming a connection slot and blocking progress. As waiting sessions accumulate, query latency rises and eventually all connection slots may be consumed. Common causes include long-running transactions holding row or table locks, missing or overly broad UPDATE or DELETE statements without appropriate WHERE clauses, idle-in-transaction sessions holding locks without doing any work, DDL operations such as ALTER TABLE or VACUUM FULL taking exclusive locks, and high-concurrency write patterns on hot rows such as counters or status fields.

Category
Application
Integration Sources
Metrics, Symptom Activation

Postgres Query Memory Spill to Disk

Postgres is writing temporary files because queries exceed work_mem, causing significant I/O overhead and query slowdowns. When a sort, hash join, or aggregate operation requires more memory than work_mem allows, Postgres spills intermediate data to temporary files on disk. Disk I/O for temp files is much slower than in-memory operations, and temp files compete with other I/O on the same storage. High temp file creation rates indicate queries are consistently exceeding the configured memory budget. Common causes include work_mem being set too low for actual query complexity and result set sizes, queries using ORDER BY, GROUP BY, or JOIN without appropriate indexes, many concurrent sessions each using their full work_mem allocation, and queries with multiple sort or hash operations where each gets its own work_mem allocation.

Category
Application
Integration Sources
Metrics, Symptom Activation

Postgres Replication Lag

Postgres standby replication lag is high, causing replica reads to return significantly stale data. Postgres streaming replication sends WAL records from primary to standbys, and replication lag is the delay between a write being committed on the primary and the standby applying it. High lag means read queries directed to replicas return stale results, which can cause data consistency issues for applications that read their own writes via replicas. Common causes include network bandwidth saturation between primary and standby, a standby I/O bottleneck where disk cannot apply WAL as fast as it arrives, long-running queries on the standby blocking WAL apply, the primary generating WAL faster than the network can deliver it, and the standby being under-provisioned relative to the primary write rate.

Category
Application
Integration Sources
Metrics, Symptom Activation

Producer Publish Rate Spike

The application is publishing messages at a rate significantly higher than normal, causing queue depth to grow and producer message rate to increase. This surge in publishing activity creates backpressure and can overwhelm downstream consumers. When the application experiences a producer publish rate spike, it generates messages at an abnormally high rate that exceeds the system's normal capacity, leading to queue depth growth and propagation of congestion to downstream destinations.
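
One client-side mitigation is to cap the publish rate at the producer so spikes are smoothed before they reach the broker. The Go sketch below uses golang.org/x/time/rate; the Publisher interface and the limits shown are illustrative, not a Causely or broker API.

```go
package producer

import (
	"context"

	"golang.org/x/time/rate"
)

// Publisher is whatever client actually sends the message; the interface
// here is a hypothetical stand-in.
type Publisher interface {
	Publish(ctx context.Context, topic string, payload []byte) error
}

// ThrottledPublisher caps the outbound publish rate so a misbehaving code
// path cannot flood the queue and build backpressure downstream.
type ThrottledPublisher struct {
	inner   Publisher
	limiter *rate.Limiter
}

func NewThrottledPublisher(inner Publisher, perSecond float64, burst int) *ThrottledPublisher {
	return &ThrottledPublisher{
		inner:   inner,
		limiter: rate.NewLimiter(rate.Limit(perSecond), burst),
	}
}

func (p *ThrottledPublisher) Publish(ctx context.Context, topic string, payload []byte) error {
	// Wait blocks until a token is available or the context is cancelled,
	// smoothing spikes instead of passing them straight to the broker.
	if err := p.limiter.Wait(ctx); err != nil {
		return err
	}
	return p.inner.Publish(ctx, topic, payload)
}
```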

Category
Application
Integration Sources
Service Communication, Metrics

Python Unhandled Exception

Python logs show an unhandled exception or fatal interpreter error, indicating the application failed unexpectedly. An unhandled Python exception usually emits a traceback and terminates the active request, worker, or process. Common causes include missing input validation, bad assumptions about returned objects, dependency failures that are not caught, or unexpected runtime state.

Category
Application
Integration Sources
Logs

RabbitMQ Resource Alarm

RabbitMQ has triggered a memory or disk resource alarm, and all publishers are blocked until the watermark is cleared. RabbitMQ sets a memory alarm when used_memory exceeds vm_memory_high_watermark (default: 40% of available RAM) and a disk alarm when free disk space falls below disk_free_limit. While any alarm is active, all connections that have published are blocked, causing producer services to stall.
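
Whether an alarm is currently active can be checked from the management HTTP API, which reports mem_alarm and disk_free_alarm per node. A hedged sketch with placeholder host and credentials (the management plugin must be enabled):

```python
# Sketch: check for active RabbitMQ memory/disk alarms via the management API.
import requests

resp = requests.get(
    "http://rabbitmq.example.internal:15672/api/nodes",
    auth=("monitor", "secret"),
    timeout=5,
)
resp.raise_for_status()

for node in resp.json():
    if node.get("mem_alarm") or node.get("disk_free_alarm"):
        print(f"{node['name']}: mem_alarm={node.get('mem_alarm')} "
              f"disk_free_alarm={node.get('disk_free_alarm')} -- publishers are blocked")
```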

Category
Application
Integration Sources
Logs

Rate Limited

The service logs show repeated rate-limiting errors, indicating that the service or an upstream dependency is rejecting requests because a rate limit has been reached. Rate limiting is enforced by the service itself or by an upstream dependency such as an API gateway, an ingestion pipeline, or an external API. When the limit is exceeded, new requests are rejected or throttled, causing increased error rates for callers. Common sources of this pattern include ingestion services such as Grafana Tempo when write throughput exceeds the per-tenant ingestion limit, API gateways or proxies such as the nginx ngx_http_limit_req module when request rate exceeds the configured burst, and application-level rate limiters logging "rate limit exceeded" or "rate limit reached" when quota is consumed. The root cause is typically insufficient quota allocation, a sudden traffic spike, or a misconfigured rate limit policy rather than a code defect.
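
On the calling side, the usual mitigation is to retry with exponential backoff and honor any Retry-After header. A minimal sketch, where the URL and retry budget are illustrative:

```python
# Sketch: retry a rate-limited call with exponential backoff, honoring Retry-After.
import time
import requests

def get_with_backoff(url: str, max_attempts: int = 5, base_delay: float = 0.5) -> requests.Response:
    for attempt in range(max_attempts):
        resp = requests.get(url, timeout=10)
        if resp.status_code != 429:
            return resp
        # Retry-After may also be an HTTP date; numeric seconds assumed here.
        retry_after = resp.headers.get("Retry-After")
        delay = float(retry_after) if retry_after else base_delay * (2 ** attempt)
        time.sleep(delay)
    raise RuntimeError(f"still rate limited after {max_attempts} attempts")

# resp = get_with_backoff("https://api.example.com/v1/items")
```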

Category
Application
Integration Sources
Logs

Read IOPs Congested

The disk is experiencing read IOPS (input/output operations per second) congestion, meaning that its read IOPS capacity is fully utilized. This causes slow performance for applications that rely on disk access, leading to delayed data processing, system lags, or even timeouts.

Category
Infrastructure, Disk
Integration Sources
Infrastructure Scraper

Read Throughput Congested

The disk is experiencing congestion specifically in read throughput, which slows down data retrieval from the disk and can degrade the performance of applications reliant on high-speed data access.

Category
Infrastructure, Disk
Integration Sources
Infrastructure Scraper

Redis Cache Miss Storm

Redis cache hit rate has dropped significantly, and upstream services are falling back to the database on most requests, multiplying database load. A cache miss storm occurs when Redis cannot serve requests from memory and callers bypass it to the backing database, dramatically increasing database query volume and latency for all services sharing that database. Common causes include memory pressure and key evictions, cache invalidation or a flush clearing a large portion of the keyspace, a cold start after the cache was restarted and not yet warmed, keys expiring simultaneously in a TTL cliff, and an application bug writing cache keys with wrong names that prevents future hits.
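
The hit rate behind this signal can be computed from the keyspace_hits and keyspace_misses counters in INFO. A minimal redis-py sketch (the host and the 80% threshold are illustrative, not Causely defaults):

```python
# Sketch: compute the Redis cache hit ratio from INFO stats counters.
import redis

r = redis.Redis(host="redis.example.internal", port=6379)
stats = r.info("stats")
hits, misses = stats["keyspace_hits"], stats["keyspace_misses"]
total = hits + misses
ratio = hits / total if total else 1.0
print(f"hit ratio: {ratio:.2%} ({hits} hits / {misses} misses)")
if ratio < 0.8:  # example threshold only
    print("hit ratio is low -- check evictions, TTL cliffs, or a recent flush/restart")
```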

Category
Application
Integration Sources
Metrics, Symptom Activation

Redis Command Queue Saturation

Redis's single-threaded command pipeline is saturated: slow O(N) commands are blocking the event loop while BLPOP or BRPOP clients accumulate, causing upstream timeouts. Redis processes commands on a single thread, and when a slow command such as KEYS, SMEMBERS, SORT, or LRANGE on a large collection runs, it blocks all other commands for its duration. Clients waiting on BLPOP or BRPOP accumulate because they cannot be served while the thread is busy, and as the blocked client count rises, upstream services begin timing out. This root cause requires both signals to be present simultaneously: redis_slowlog_last_id or slow command log growth, and redis_blocked_clients above threshold. Common causes include application use of KEYS * or SMEMBERS on large keysets in production, large sorted sets or lists being iterated with single-command scans, and BLPOP or BRPOP patterns with no timeout or a long timeout combined with slow producers.
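
The usual remediation is to replace full-keyspace commands with incremental SCAN and to watch blocked_clients and the slowlog. A redis-py sketch with placeholder host and key pattern:

```python
# Sketch: use incremental SCAN instead of blocking full-keyspace commands,
# and check blocked clients and recent slow commands.
import redis

r = redis.Redis(host="redis.example.internal", port=6379)

# KEYS "session:*" would block the single command thread on a large keyspace;
# scan_iter walks the keyspace in small batches instead.
matching = [key for key in r.scan_iter(match="session:*", count=500)]

blocked = r.info("clients")["blocked_clients"]
print(f"{len(matching)} matching keys, {blocked} blocked clients")

# Recent slow commands recorded in the slowlog:
for entry in r.slowlog_get(10):
    print(entry["duration"], entry["command"])
```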

Category
Application
Integration Sources
Metrics, Symptom Activation

Redis Connection Pool Saturated

Client-side Redis connection pool exhaustion occurs when all available connections in the pool are in use, preventing new requests to Redis. This can lead to request timeouts or failures, causing application disruptions for features relying on Redis for caching, messaging, or other operations.
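
A common way to make pool exhaustion explicit on the client side is to bound the pool and wait a short, fixed time for a free connection. A redis-py sketch in which the pool size, timeout, and host are illustrative:

```python
# Sketch: bound the client-side pool and fail fast when it is exhausted.
import redis

pool = redis.BlockingConnectionPool(
    host="redis.example.internal",
    port=6379,
    max_connections=50,  # must cover the peak concurrency of this process
    timeout=2,           # seconds to wait for a free connection before raising
)
r = redis.Redis(connection_pool=pool)

try:
    r.get("feature:flags")
except redis.ConnectionError as exc:
    # Pool exhausted (or server unreachable): surface it instead of hanging callers.
    print(f"Redis unavailable: {exc}")
```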

Category
Application
Integration Sources
Metrics, Symptom Activation

Redis Connection Pool Saturated

After a version upgrade, the Redis connection pool is frequently congested, leading to performance degradation or crashes. This issue is likely caused by changes in the application code or dependencies that increase Redis usage or introduce inefficiencies.

Category
Release Management, Code Change Regression
Integration Sources
Metrics, Symptom Activation, Infrastructure Scraper

Redis Memory Exhausted

Redis is rejecting commands because used_memory has exceeded the configured maxmemory limit, and clients receive OOM errors and writes fail. When Redis reaches its maxmemory limit, it enforces the configured eviction policy (for example allkeys-lru or volatile-lru). If the eviction policy cannot free enough space, or if maxmemory-policy is noeviction, all write commands are rejected with "OOM command not allowed when used memory > maxmemory".
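
Memory headroom and the eviction policy can be read from INFO and CONFIG GET, and the OOM rejection surfaces to clients as a response error. A redis-py sketch with placeholder connection details:

```python
# Sketch: compare used_memory to maxmemory and catch the OOM error that writes
# receive when the eviction policy cannot free space (e.g. noeviction).
import redis

r = redis.Redis(host="redis.example.internal", port=6379)

mem = r.info("memory")
policy = r.config_get("maxmemory-policy")["maxmemory-policy"]
print(f"used={mem['used_memory_human']} max={mem.get('maxmemory_human')} policy={policy}")

try:
    r.set("cache:item:123", "value")
except redis.ResponseError as exc:
    if "OOM" in str(exc):
        print("write rejected: Redis is at maxmemory and cannot evict")
    else:
        raise
```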

Category
Application
Integration Sources
Logs

Redis Memory Pressure

Redis server memory is at or near its configured limit, causing active key evictions that degrade cache effectiveness and increase load on downstream databases. Redis evicts keys when memory usage exceeds the configured maxmemory limit, and although the eviction policy determines which keys are removed, under heavy eviction the cache hit rate drops sharply and forces callers to read from the database instead, compounding into a database load spike. Common causes include the data set growing beyond allocated memory, no TTL being set on keys and causing unbounded growth, memory fragmentation consuming effective capacity, and a sudden traffic spike filling the keyspace faster than evictions can keep up.

Category
Application
Integration Sources
Metrics, Symptom Activation

Redis Server Connections Exhausted

Redis has reached its maxclients limit and is rejecting new connection attempts, causing connection refused errors in upstream services. Redis enforces a maximum number of simultaneous client connections via the maxclients configuration parameter, and when this limit is reached new connection attempts are immediately rejected with "ERR max number of clients reached," causing upstream services to receive connection errors and fail their Redis-dependent operations. Common causes include maxclients being set too low for the number of services connecting to the Redis instance, connection pool misconfiguration where pools do not release idle connections, sudden traffic spikes creating many short-lived connections without pooling, and connection leaks in application code.
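
Connection usage against the limit can be checked from INFO clients and CONFIG GET maxclients. A minimal redis-py sketch (the host and the 90% warning threshold are illustrative):

```python
# Sketch: compare current client connections to the server's maxclients limit.
import redis

r = redis.Redis(host="redis.example.internal", port=6379)
connected = r.info("clients")["connected_clients"]
maxclients = int(r.config_get("maxclients")["maxclients"])
print(f"{connected}/{maxclients} client connections in use")
if connected > 0.9 * maxclients:
    print("approaching maxclients -- look for connection leaks or missing pooling")
```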

Category
Application
Integration Sources
Metrics, Symptom Activation

Schema Change Causing Table Lock Contention

The database table is experiencing an unusually high rate of DDL (data definition language) lock waits, in which schema modifications block both read and write operations on the table. This significantly impacts all client applications, causing service disruptions and performance degradation.
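
In PostgreSQL, for example, one mitigation is to run the schema change with a short lock_timeout so the DDL fails fast instead of queueing behind (and in front of) regular traffic. A hedged sketch in which the table and column names are illustrative:

```python
# Sketch: run a schema change with a short lock_timeout so the DDL fails fast
# rather than blocking all other access while it waits for the table lock.
import psycopg2

with psycopg2.connect("dbname=app user=migrator") as conn, conn.cursor() as cur:
    cur.execute("SET lock_timeout = '2s'")
    try:
        cur.execute("ALTER TABLE orders ADD COLUMN shipped_at timestamptz")
        conn.commit()
    except psycopg2.errors.LockNotAvailable:
        conn.rollback()
        print("could not acquire the table lock in 2s; retry during a quieter window")
```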

Category
Data Pipeline, Database Table
Integration Sources
Metrics

Slow Consumer

The application is consuming messages slower than they are produced, creating a processing bottleneck. As unprocessed messages accumulate, the system experiences increased queue lag, potential memory pressure, and downstream congestion. This often indicates that one or more instances are unable to keep up due to resource constraints, inefficient processing logic, or external dependencies.

Category
Application
Integration Sources
Service Communication, Metrics, Infrastructure Scraper, Symptom Activation (Infrastructure Scraper for SQS only)

Slow Database Queries

The application is experiencing slow database queries that lead to downstream slow consumer behavior and potential resource starvation. This condition affects instance performance, particularly when query execution times become excessively long, degrading overall system responsiveness.

Category
Application
Integration Sources
Service Communication

Slow Database Queries

After a version upgrade, the application is experiencing slow database queries that lead to downstream slow consumer behavior and potential resource starvation. This condition affects instance performance, particularly when query execution times become excessively long.

Category
Release Management, Code Change Regression
Integration Sources
Service Communication, Infrastructure Scraper

Slow Database Server Queries

Warehouse congestion in Snowflake occurs when the processing capacity is overwhelmed by incoming queries, causing a significant backlog. This leads to queries being queued at high rates, indicating that the system is struggling to process them in a timely manner. The resulting resource starvation further degrades performance.

Category
Application
Integration Sources
Metrics (Snowflake)

Slow Execution in HTTP Path Handler

The HTTP path is experiencing congestion, resulting in high latency for clients. This suggests that the system is unable to handle the current load efficiently, causing delays in response times. Congestion often occurs when the service receives more requests than it can handle within its capacity.

Category
Data Pipeline, HTTP Path
Integration Sources
Service Communication

Slow Execution in RPC Method Handler

The RPC method is experiencing congestion, resulting in high latency for clients. This suggests that the system is unable to handle the current load efficiently, causing delays in response times. Congestion often occurs when the service receives more requests than it can handle within its capacity.

Category
Data Pipeline, RPC Method
Integration Sources
Service Communication

SNAT Ports Congested

The SNAT (Source Network Address Translation) ports on a virtual machine (VM) are congested, leading to outbound network connection failures or degraded performance for services relying on external APIs or resources. This issue primarily impacts VMs that need to establish multiple concurrent connections to the internet or external systems.
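
A frequent client-side mitigation is to reuse outbound connections rather than opening a new one per request. A sketch using requests with keep-alive pooling (the host and pool sizes are illustrative):

```python
# Sketch: reuse outbound connections with a pooled session instead of opening a
# new TCP connection per request, which multiplies SNAT port consumption.
import requests
from requests.adapters import HTTPAdapter

session = requests.Session()
# Cap pooled connections per host; values here are illustrative.
session.mount("https://", HTTPAdapter(pool_connections=10, pool_maxsize=20))

for item_id in range(100):
    # All 100 calls share a handful of keep-alive connections (and SNAT ports)
    # instead of 100 short-lived ones.
    session.get(f"https://api.example.com/items/{item_id}", timeout=10)
```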

Category
Infrastructure, VirtualMachine
Integration Sources
Infrastructure Scraper

Table Access Failure in Database

The database table is experiencing performance degradation or errors, causing disruptions for client applications. This results in slow query response times, potential errors, and degraded service performance for systems that depend on this table.

Category
Data Pipeline, Database Table
Integration Sources
Service Communication

Transaction ID Congested

In databases like PostgreSQL, transaction IDs (XIDs) are 32-bit integers that count the transactions performed. Utilization is high when the counter nears its maximum usable range (roughly 2 billion transactions), at which point old XIDs must be frozen before the counter can safely wrap around. If routine VACUUM operations are not performed, old XIDs are never marked as frozen and reusable, and the database will eventually refuse new transactions to prevent wraparound data loss.
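
Proximity to wraparound can be checked per database with age(datfrozenxid). A minimal psycopg2 sketch (connection details are placeholders):

```python
# Sketch: check how close each database is to transaction ID wraparound.
# autovacuum_freeze_max_age defaults to 200 million; ~2 billion is the hard ceiling.
import psycopg2

sql = """
SELECT datname,
       age(datfrozenxid) AS xid_age,
       round(100.0 * age(datfrozenxid) / 2000000000, 1) AS pct_of_wraparound
FROM pg_database
ORDER BY xid_age DESC;
"""
with psycopg2.connect("dbname=postgres user=monitor") as conn, conn.cursor() as cur:
    cur.execute(sql)
    for datname, xid_age, pct in cur.fetchall():
        print(f"{datname}: age={xid_age} ({pct}% of the wraparound limit)")
```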

Category
Application
Integration Sources
Infrastructure Scraper (GCP Postgres)

Unauthorized Access

The application is receiving numerous "Unauthorized" status codes (typically HTTP 401) when trying to access another service. This prevents the application from successfully retrieving data or performing actions, potentially causing service disruptions or degraded functionality for end users.

Category
Application
Integration Sources
Service Communication

Unknown Configuration Failure

Application Load Balancer (ALB) misconfiguration can cause widespread connectivity issues, leading to a high frequency of 504 gateway timeout errors and 5xx server errors. These issues indicate that the load balancer's settings are not properly optimized for handling traffic efficiently and reliably.

Category
Service, Application Load Balancer
Integration Sources
Infrastructure Scraper(AWS)

Write IOPs Congested

The disk is experiencing write IOPS (input/output operations per second) congestion, meaning that its write IOPS capacity is fully utilized. This causes slow performance for applications that rely on disk access, leading to delayed data processing, system lags, or even timeouts.

Category
Infrastructure, Disk
Integration Sources
Infrastructure Scraper

Write Throughput Congested

The disk is experiencing write throughput congestion, leading to slower data write speeds and affecting applications that require high-speed data recording. This issue can cause delays in data availability and reduced performance in write-intensive tasks.

Category
Infrastructure, Disk
Integration Sources
Infrastructure Scraper