Applications
Software programs designed to perform specific tasks or functions for end-users or other systems.
Note: In the Causely user interface, these root causes appear under the corresponding service entity.
Root Causes
- DB Connections Congested
- File Descriptor Congested
- Go Max Procs Too High
- High DNS Request Rate
- Inefficient DNS Lookup
- Inefficient Garbage Collection
- Inefficient Locking
- Invalid Client Certificate
- Java Heap Congested
- Memory Congested
- Noisy Client
- Redis Connections Congested
- Slow Consumer
- Slow Database Query
- Throttled Access
- Unauthorized Access
DB Connections Congested
The client-side database connection pool is exhausted when all available connections are in use, preventing new database queries from being executed. This can cause application requests to hang or fail, impacting user experience and potentially leading to downtime for database-dependent features.
The exhaustion occurs when the application requires more concurrent connections than the pool's configured maximum.
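A minimal Go sketch, assuming the standard database/sql package with a hypothetical Postgres driver, showing where the pool ceiling that gets exhausted is configured; the limits are illustrative, not recommendations:

```go
package dbpool

import (
	"database/sql"
	"time"

	_ "github.com/lib/pq" // hypothetical driver choice; any database/sql driver works
)

// newDB bounds the client-side pool so exhaustion surfaces as a bounded wait
// rather than an unbounded pile-up of hung requests.
func newDB(dsn string) (*sql.DB, error) {
	db, err := sql.Open("postgres", dsn)
	if err != nil {
		return nil, err
	}
	db.SetMaxOpenConns(50)                  // upper bound on in-use plus idle connections
	db.SetMaxIdleConns(10)                  // keep a small warm pool
	db.SetConnMaxLifetime(30 * time.Minute) // recycle connections periodically
	return db, nil
}
```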
File Descriptor Congested
The application has reached the system-imposed limit on the number of file descriptors it can open. This typically leads to errors such as "Too many open files," preventing the application from creating new connections, reading files, or accessing resources. This can severely impact functionality, particularly in high-concurrency or high-I/O scenarios. File descriptors are a finite resource in operating systems, used to represent open files, network sockets, and other I/O streams. The limit is defined per process and globally by the operating system.
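A minimal Go sketch, assuming a Unix-like operating system, for reading the per-process descriptor limit so the application can log or alert before exhausting it:

```go
package fdlimit

import (
	"fmt"
	"syscall"
)

// printFDLimit reports the soft and hard file descriptor limits for the
// current process, the ceiling behind "too many open files" errors.
func printFDLimit() error {
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		return err
	}
	fmt.Printf("file descriptors: soft limit=%d, hard limit=%d\n", rl.Cur, rl.Max)
	return nil
}
```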
Go Max Procs Too High
The environment variable GOMAXPROCS, which controls the maximum number of CPU cores the Go runtime uses, has been set higher than the CPU limit of the container in which the Go application is running. This mismatch can lead to inefficient CPU usage, reduced performance, and potential throttling because the Go runtime attempts to schedule more work than the container is permitted to handle.
By default, GOMAXPROCS is set to match the number of CPU cores available on the host VM, not the container's CPU limits. When the container's CPU limit is lower than the VM's core count, the Go runtime can overestimate available CPU resources, causing the application to compete for CPUs it doesn't have access to. This leads to inefficiencies, such as task scheduling issues and potential throttling as the application exceeds its allowed CPU usage within the container.
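One common mitigation is to align GOMAXPROCS with the container's CPU quota at startup. The sketch below assumes the go.uber.org/automaxprocs library is acceptable in your build; setting the value manually with runtime.GOMAXPROCS works as well:

```go
package main

import (
	"fmt"
	"runtime"

	_ "go.uber.org/automaxprocs" // adjusts GOMAXPROCS to the container CPU quota at init time
)

func main() {
	// After the automaxprocs side effect runs, this should reflect the
	// container limit rather than the host VM's core count.
	fmt.Println("GOMAXPROCS =", runtime.GOMAXPROCS(0))
}
```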
High DNS Request Rate
The application is generating an unusually high volume of DNS requests, potentially overwhelming DNS servers, increasing latency for users, and risking service disruptions. This behavior may also incur additional costs or trigger rate limiting from DNS providers. A high DNS request rate typically arises when the application initiates DNS lookups more frequently than necessary, often because of a lack of effective caching, redundant DNS resolution logic, or misconfigurations that trigger repeated lookups. If DNS providers enforce rate limits or charge for high query volumes, this behavior also raises operational costs and further degrades application performance.
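A minimal Go sketch of client-side caching, one common mitigation; the TTL-based cache below is an illustrative structure, not a specific library's API:

```go
package dnscache

import (
	"net"
	"sync"
	"time"
)

// cachedResolver memoizes lookups for a fixed TTL so repeated resolutions of
// the same host do not each hit the DNS server.
type cachedResolver struct {
	mu    sync.Mutex
	ttl   time.Duration
	cache map[string]cacheEntry
}

type cacheEntry struct {
	addrs   []string
	expires time.Time
}

func newCachedResolver(ttl time.Duration) *cachedResolver {
	return &cachedResolver{ttl: ttl, cache: make(map[string]cacheEntry)}
}

func (r *cachedResolver) Lookup(host string) ([]string, error) {
	r.mu.Lock()
	if e, ok := r.cache[host]; ok && time.Now().Before(e.expires) {
		r.mu.Unlock()
		return e.addrs, nil // served from cache, no DNS traffic
	}
	r.mu.Unlock()

	addrs, err := net.LookupHost(host) // single upstream query on cache miss
	if err != nil {
		return nil, err
	}

	r.mu.Lock()
	r.cache[host] = cacheEntry{addrs: addrs, expires: time.Now().Add(r.ttl)}
	r.mu.Unlock()
	return addrs, nil
}
```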
Inefficient DNS Lookup
The application is making an unusually high volume of DNS requests, with over 80% returning NXDomain (non-existent domain) responses. This excessive DNS activity is adding 10 to 20 ms of latency to each request, negatively impacting service performance.
The issue is often caused by the service or application attempting to resolve incomplete or unqualified domain names, leading to multiple DNS lookup attempts. By default, when a domain name does not have a trailing dot (indicating it's a fully qualified domain name, or FQDN), the system appends search domains (such as the pod's namespace and the cluster's domain) to the query, triggering multiple DNS requests. For example, a request for serviceA without a trailing dot (.) might result in DNS lookups for:
- serviceA.namespace.svc.cluster.local
- serviceA.svc.cluster.local
- serviceA.cluster.local
- and finally, serviceA

When the service does not exist at any of these levels, the queries return NXDomain errors.
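A short Go sketch of querying the fully qualified name directly; the trailing dot suppresses search-domain expansion, and the service and namespace names are hypothetical:

```go
package dnslookup

import (
	"context"
	"net"
	"time"
)

// resolveFQDN queries the name as fully qualified (note the trailing dot), so
// the resolver does not expand it with search domains and walk through a
// chain of NXDomain responses first.
func resolveFQDN(ctx context.Context) ([]string, error) {
	ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
	defer cancel()
	return net.DefaultResolver.LookupHost(ctx, "serviceA.namespace.svc.cluster.local.")
}
```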
Inefficient Garbage Collection
The application is experiencing high latency and reduced throughput because a significant portion of its runtime is being spent in garbage collection (GC). This leads to frequent pauses, degrading overall performance and causing delays in request handling across dependent services. This issue usually occurs when the Java Virtual Machine (JVM) or other garbage-collected runtimes (for example, .NET or Go) are under memory pressure.
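For Go services specifically, a hedged sketch of how GC pressure can be observed and the collection target relaxed when heap headroom allows; the 10% threshold and GC percent of 200 are illustrative, not recommendations:

```go
package gctuning

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

// logGCPressure prints how much CPU time the runtime has spent in GC and, if
// the fraction is high, relaxes the GC target (equivalent to raising GOGC) so
// collections run less often at the cost of a larger heap.
func logGCPressure() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("GC cycles=%d, GC CPU fraction=%.2f%%\n", m.NumGC, m.GCCPUFraction*100)

	if m.GCCPUFraction > 0.10 {
		debug.SetGCPercent(200)
	}
}
```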
Inefficient Locking
The application suffers from inefficient locking, where suboptimal lock management leads to excessive contention and prolonged mutex wait times. This inefficiency degrades performance by increasing the risk of thread starvation under heavy load. Inefficient locking occurs when the application's concurrency mechanisms are not optimized, resulting in threads waiting longer than necessary for access to shared resources. This can stem from overuse of locks, coarse-grained locking strategies, or improper lock design. The cascading effect of these inefficiencies slows down the processing of tasks and can trigger additional issues such as slow consumption of data and potential thread starvation when mutex wait times remain high.
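A brief Go sketch contrasting a single coarse mutex with a reader/writer lock, one common first step toward reducing contention; the counter types are illustrative:

```go
package locking

import "sync"

// countersCoarse guards every key behind one mutex: readers and writers of
// unrelated keys contend with each other, inflating mutex wait times.
type countersCoarse struct {
	mu sync.Mutex
	m  map[string]int
}

// countersRW lets many readers proceed concurrently and only serializes
// writers, which typically reduces contention for read-heavy workloads.
type countersRW struct {
	mu sync.RWMutex
	m  map[string]int
}

func newCountersRW() *countersRW {
	return &countersRW{m: make(map[string]int)}
}

func (c *countersRW) Get(k string) int {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.m[k]
}

func (c *countersRW) Inc(k string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.m[k]++
}
```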
Invalid Client Certificate
The application is failing to connect to a service due to invalid certificate errors, preventing secure communication over HTTPS or TLS. This can cause downtime or degraded functionality for users relying on this service.
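A minimal Go sketch, assuming mutual TLS over HTTPS, showing where the client certificate and key are loaded; an expired or mismatched pair at this point is a typical source of the handshake failures described above:

```go
package mtlsclient

import (
	"crypto/tls"
	"net/http"
)

// newMTLSClient builds an HTTP client that presents the client certificate
// the upstream service expects during the TLS handshake.
func newMTLSClient(certFile, keyFile string) (*http.Client, error) {
	cert, err := tls.LoadX509KeyPair(certFile, keyFile)
	if err != nil {
		return nil, err
	}
	return &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{Certificates: []tls.Certificate{cert}},
		},
	}, nil
}
```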
Java Heap Congested
Java heap congestion occurs when the Java Virtual Machine (JVM) runs low on available heap memory, leading to performance degradation or application crashes. This can manifest as slow response times, frequent garbage collection (GC) cycles, or an OutOfMemoryError.
Memory Congested
The Broker application is suffering from memory congestion, where excessive memory usage degrades performance and increases processing latency. Memory congestion in the Broker application arises when its memory resources are fully utilized, impairing its ability to process messages efficiently. This condition is often the result of inefficient memory management, an unoptimized message backlog, or suboptimal garbage collection settings. As memory usage approaches critical levels, the application may experience delays in processing, increased frequency of garbage collection cycles, and, in severe cases, out-of-memory errors that disrupt service continuity.
Noisy Client
The application acts as a Noisy Client, generating a high number of requests that burden destination services with increased load and elevated request rates. As a Noisy Client, the application issues requests at a frequency that far exceeds normal operational thresholds (see the rate-limiting sketch after the list below). This aggressive request pattern directly impacts destination services by:
- Driving a high request rate that can overwhelm service capacity.
- Contributing to increased load on the destination, which may degrade performance and stability.
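One common mitigation is client-side rate limiting. A minimal Go sketch using golang.org/x/time/rate; the rate and burst values in the comment are illustrative:

```go
package noisyclient

import (
	"net/http"

	"golang.org/x/time/rate"
)

// limitedClient wraps an http.Client with a token-bucket limiter so the
// application cannot exceed an agreed request rate toward the destination.
type limitedClient struct {
	client  *http.Client
	limiter *rate.Limiter // e.g. rate.NewLimiter(rate.Limit(50), 10) for 50 req/s with burst 10
}

func (c *limitedClient) Do(req *http.Request) (*http.Response, error) {
	// Wait blocks until a token is available or the request context is done.
	if err := c.limiter.Wait(req.Context()); err != nil {
		return nil, err
	}
	return c.client.Do(req)
}
```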
Redis Connections Congested
Client-side Redis connection pool exhaustion occurs when all available connections in the pool are in use, preventing new requests to Redis. This can lead to request timeouts or failures, causing application disruptions for features relying on Redis for caching, messaging, or other operations.
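A minimal sketch assuming the go-redis client; the pool size and timeout values are illustrative:

```go
package redispool

import (
	"time"

	"github.com/redis/go-redis/v9"
)

// newRedisClient sizes the client-side pool explicitly; when every pooled
// connection is busy, additional callers wait up to PoolTimeout for a free
// connection instead of hanging indefinitely.
func newRedisClient(addr string) *redis.Client {
	return redis.NewClient(&redis.Options{
		Addr:         addr,
		PoolSize:     100,             // maximum connections in the pool
		MinIdleConns: 10,              // keep some connections warm
		PoolTimeout:  3 * time.Second, // how long to wait for a free connection
	})
}
```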
Slow Consumer
The application is experiencing slow consumer issues, where components process incoming messages too slowly. This inefficiency leads to a clogged consumption mechanism, resulting in high lag and propagating congestion to downstream destinations. When the application behaves as a slow consumer, its data processing rate cannot keep pace with incoming traffic. This creates a bottleneck that manifests in two key ways:
- Slow consumer instances: Specific components or instances become sluggish in handling data, delaying overall processing.
- Propagation of clogging: The slow consumption causes a buildup of unprocessed data, which in turn results in high lag and can extend congestion to downstream destination services.
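A minimal Go sketch of fanning consumption out to a bounded worker pool, one common way to raise the processing rate; the message and handler types are hypothetical:

```go
package consumer

import "sync"

// consume distributes incoming messages across a fixed pool of workers so a
// single slow handler does not stall the whole consumption pipeline.
func consume(messages <-chan []byte, workers int, handle func([]byte)) {
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for msg := range messages {
				handle(msg) // if handle is slow, lag builds only up to the channel's buffer
			}
		}()
	}
	wg.Wait()
}
```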
Slow Database Query
The application is experiencing slow database queries, which lead to downstream slow consumer behavior and potential resource starvation. Slow database queries indicate that interactions with the database are taking longer than expected, degrading overall system responsiveness, and the effect shows up first on individual instances whose query durations become excessive. When query execution times exceed acceptable thresholds, the resulting backlog is very likely to lead to resource starvation; even under less severe conditions, the delay in database interactions contributes to slow consumer behavior, compounding the overall impact on system performance.
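A minimal Go sketch that bounds query time with a context deadline so a slow statement fails fast instead of holding a pooled connection and starving other callers; the table, column, and placeholder style are hypothetical:

```go
package dbquery

import (
	"context"
	"database/sql"
	"time"
)

// lookupOrderStatus runs a single query under a deadline; when the database
// is slow, the call returns a context error instead of blocking indefinitely.
func lookupOrderStatus(ctx context.Context, db *sql.DB, id int64) (string, error) {
	ctx, cancel := context.WithTimeout(ctx, 500*time.Millisecond)
	defer cancel()

	var status string
	err := db.QueryRowContext(ctx, "SELECT status FROM orders WHERE id = $1", id).Scan(&status)
	return status, err
}
```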
Throttled Access
The application is receiving HTTP 429 "Too Many Requests" responses, indicating that it has exceeded the rate limits set by a destination service. This can cause degraded functionality, slow performance, or temporary service unavailability for end users. HTTP 429 errors are typically triggered when an API or service imposes rate limits to control the volume of incoming requests; the application is likely sending requests faster than the service allows. Rate limits are often enforced to prevent overloading the service and may be defined per time interval (for example, X requests per minute).
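A minimal Go sketch of retrying on 429, honoring an integer Retry-After header when present and falling back to exponential backoff otherwise; it assumes idempotent requests (for example, GET with no body) and illustrative backoff values:

```go
package throttled

import (
	"net/http"
	"strconv"
	"time"
)

// doWithRetry retries a request that is rejected with 429 Too Many Requests,
// waiting the server-suggested interval when one is provided.
func doWithRetry(client *http.Client, req *http.Request, maxRetries int) (*http.Response, error) {
	backoff := time.Second
	for attempt := 0; ; attempt++ {
		resp, err := client.Do(req)
		if err != nil {
			return nil, err
		}
		if resp.StatusCode != http.StatusTooManyRequests || attempt >= maxRetries {
			return resp, nil
		}
		resp.Body.Close()

		wait := backoff
		if s := resp.Header.Get("Retry-After"); s != "" {
			if secs, err := strconv.Atoi(s); err == nil {
				wait = time.Duration(secs) * time.Second
			}
		}
		time.Sleep(wait)
		backoff *= 2
	}
}
```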
Unauthorized Access
The application is receiving numerous "Unauthorized" status codes (typically HTTP 401) when trying to access another service. This prevents the application from successfully retrieving data or performing actions, potentially causing service disruptions or degraded functionality for end users. The unauthorized status codes likely indicate an authentication or authorization issue.
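A minimal Go sketch of one common remediation, refreshing credentials once when a 401 is returned; the token and refresh callbacks are hypothetical stand-ins for whatever authentication provider the application uses, and the retry assumes requests without a body:

```go
package authclient

import "net/http"

// authTransport injects a bearer token and refreshes it once on a 401,
// covering the common case of an expired or rotated credential.
type authTransport struct {
	base    http.RoundTripper
	token   func() string // returns the current token (hypothetical provider)
	refresh func() error  // re-authenticates and rotates the token
}

func (t *authTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	// Clone so the caller's request is not mutated, per the RoundTripper contract.
	authed := req.Clone(req.Context())
	authed.Header.Set("Authorization", "Bearer "+t.token())

	resp, err := t.base.RoundTrip(authed)
	if err != nil || resp.StatusCode != http.StatusUnauthorized {
		return resp, err
	}
	resp.Body.Close()

	if err := t.refresh(); err != nil {
		return nil, err
	}
	retry := req.Clone(req.Context())
	retry.Header.Set("Authorization", "Bearer "+t.token())
	return t.base.RoundTrip(retry)
}
```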