Infrastructure
Servers or cloud resources used to execute tasks, run applications, or perform calculations.
Root Causes
Compute
Compute Spec
Container
Controller
Node
VirtualMachine
- Conntrack Table Congested
- SNAT Ports Congested
- CPU Congested
- Memory Congested
- Disk Total IOPs Congested
- Disk Read IOPs Congested
- Disk Write IOPs Congested
- Disk Total Throughput Congested
- Disk Read Throughput Congested
- Disk Write Throughput Congested
Storage
Disk
- Congested
- Inode Usage Congested
- IOPs Congested
- Read IOPs Congested
- Write IOPs Congested
- Read Throughput Congested
- Write Throughput Congested
Network
Network Endpoint
Compute
Compute Spec
Compute Spec refers to the configuration of containers or VMs, including CPU, memory, and storage resources. It defines the capacity and performance characteristics of the compute resources used to run applications and services.
CPU Congested
One or multiple containers in a workload are experiencing CPU congestion, leading to potential throttling. This occurs when the containers use more CPU resources than allocated, causing degraded performance, longer response times, or application crashes. CPU throttling occurs when a container exceeds its CPU quota as defined by Kubernetes or Docker. The container runtime enforces these limits by restricting access to CPU resources, leading to delays in processing tasks. Common causes include insufficient CPU limits/requests for the workload’s demand, high contention for CPU resources from other containers on the same node, or inefficient application behavior such as busy loops or suboptimal thread management.
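As a rough illustration of how throttling surfaces on the host, the sketch below reads the cgroup v2 cpu.stat counters exposed inside a container and reports what fraction of CFS scheduling periods were throttled. It assumes cgroup v2 is mounted at /sys/fs/cgroup; on cgroup v1 hosts the equivalent counters live under the cpu controller.

```python
# Sketch: estimate CPU throttling inside a container from cgroup v2 counters.
# Assumes cgroup v2 mounted at /sys/fs/cgroup (the path differs on cgroup v1).

CPU_STAT = "/sys/fs/cgroup/cpu.stat"

def read_cpu_stat(path: str = CPU_STAT) -> dict:
    """Parse the flat key/value format of cpu.stat into a dict of ints."""
    stats = {}
    with open(path) as f:
        for line in f:
            key, value = line.split()
            stats[key] = int(value)
    return stats

def throttled_ratio(stats: dict) -> float:
    """Fraction of CFS periods in which the cgroup was throttled."""
    periods = stats.get("nr_periods", 0)
    throttled = stats.get("nr_throttled", 0)
    return throttled / periods if periods else 0.0

if __name__ == "__main__":
    stats = read_cpu_stat()
    ratio = throttled_ratio(stats)
    print(f"throttled in {ratio:.1%} of CFS periods "
          f"({stats.get('throttled_usec', 0)} us total throttled time)")
```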
Memory Failure
Containers running out of memory can lead to service crashes or degraded performance, resulting in errors for end users or failed service requests. This typically occurs when a container's allocated memory is insufficient for the workload it is handling, causing out-of-memory (OOM) errors and potential system instability. The root cause is usually that the container's memory limit is too low, or the application running inside the container is consuming more memory than expected. This can be caused by memory leaks, improper configuration, or a sudden spike in workload that the container wasn't sized for. When the container reaches its memory limit, the system triggers an OOM event, killing the process, which causes the service interruption.
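One hedged way to confirm that the kernel's OOM killer has fired for a container is to read the cumulative counters in the cgroup v2 memory.events file, as sketched below (again assuming cgroup v2 at /sys/fs/cgroup).

```python
# Sketch: check whether processes in this container's cgroup have been OOM-killed.
# memory.events exposes cumulative event counters, including oom_kill.

MEMORY_EVENTS = "/sys/fs/cgroup/memory.events"

def oom_kill_count(path: str = MEMORY_EVENTS) -> int:
    """Return how many times the kernel OOM killer fired in this cgroup."""
    with open(path) as f:
        for line in f:
            key, value = line.split()
            if key == "oom_kill":
                return int(value)
    return 0

if __name__ == "__main__":
    kills = oom_kill_count()
    if kills:
        print(f"warning: {kills} OOM kill(s) recorded; consider raising the "
              "memory limit or investigating a leak")
```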
Frequent Memory Failure
The application frequently runs out of memory, leading to crashes, performance degradation, or instability. This affects the application's availability and can lead to downtime or poor user experience. The issue is likely due to inefficient memory usage, such as memory leaks, excessive data loading into memory, or improper garbage collection in languages like Java or Python. Alternatively, the application may not be appropriately sized for the workload or there could be a misconfiguration of memory limits in containerized environments (for example, Docker, Kubernetes).
Crash Failure
One or multiple containers of a workload have crashed with a non-zero exit code, indicating abnormal termination. This disrupts the application’s functionality, leading to downtime or degraded performance depending on how the workload is designed. The non-zero exit code signifies an error during the execution of the container's process.
Frequent Crash Failure
One or multiple containers of a workload are frequently crashing with a non-zero exit code, indicating abnormal termination. This disrupts the application’s functionality, leading to downtime or degraded performance depending on how the workload is designed. The non-zero exit code signifies an error during the execution of the container's process.
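For illustration, the following sketch uses the official Kubernetes Python client to list containers whose last termination carried a non-zero exit code. The namespace and the kubeconfig-based authentication are assumptions, not part of any specific workload.

```python
# Sketch: list containers that last terminated with a non-zero exit code.
# Requires the "kubernetes" Python client and a reachable kubeconfig.

from kubernetes import client, config

def crashed_containers(namespace: str = "default"):
    config.load_kube_config()  # use config.load_incluster_config() inside a pod
    v1 = client.CoreV1Api()
    for pod in v1.list_namespaced_pod(namespace).items:
        for status in pod.status.container_statuses or []:
            last = status.last_state.terminated
            if last is not None and last.exit_code != 0:
                yield pod.metadata.name, status.name, last.exit_code, last.reason

if __name__ == "__main__":
    for pod, container, code, reason in crashed_containers():
        print(f"{pod}/{container}: exit code {code} ({reason})")
```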
Container
Ephemeral Storage Congested
A container is experiencing ephemeral storage congestion when its ephemeral storage usage becomes critically high, leading to failures in operations that depend on temporary storage. Ephemeral storage congestion in a container occurs when the container's consumption of ephemeral storage consistently exceeds acceptable limits. This state is characterized by:
- EphemeralStorageUtilization_High: The container demonstrates continuously elevated use of ephemeral storage.
- Failure: The unsustainable storage usage ultimately disrupts critical operations, causing the container to fail, and the failure propagates to related entities.

This condition may be triggered by factors such as excessive logging, inadequate cleanup of temporary files, or unexpected bursts in data processing. When the container's ephemeral storage is overwhelmed, it can no longer maintain normal operations, leading to process failures; a rough way to estimate the usage is sketched after this list.
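As an illustrative check only, the sketch below walks a couple of assumed writable paths and compares their combined size against an example ephemeral-storage limit. The paths and the limit are placeholders; authoritative per-container figures come from the kubelet's stats.

```python
# Sketch: approximate a container's ephemeral storage footprint by summing the
# sizes of files under its writable scratch paths.

import os

SCRATCH_PATHS = ["/tmp", "/var/log"]   # example writable locations (assumption)
LIMIT_BYTES = 1 * 1024 ** 3            # example 1 GiB ephemeral-storage limit

def tree_size(path: str) -> int:
    """Total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path, onerror=lambda e: None):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                pass
    return total

if __name__ == "__main__":
    used = sum(tree_size(p) for p in SCRATCH_PATHS)
    print(f"ephemeral usage ~{used / 1024**2:.1f} MiB "
          f"({used / LIMIT_BYTES:.1%} of the assumed limit)")
```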
Ephemeral Storage Noisy Neighbor
A container acting as a noisy neighbor consumes excessive ephemeral storage, resulting in abnormally high storage usage and contributing to node-level disk pressure that can trigger pod evictions. This issue arises when a container consistently uses more ephemeral storage than expected, which adversely impacts both the container and its hosting node. Key symptoms include:
- EphemeralStorage_HighestUsage: The container registers the highest ephemeral storage consumption compared to its peers.
- Node.ContainerEphemeralStorageUtilization.High: The node reports persistently elevated usage of container ephemeral storage.
- Node.DiskPressurePodEvictions.High: Sustained disk pressure leads to frequent pod evictions on the node.
- Node.DiskPressure: The node experiences disk pressure.

These conditions can propagate further, resulting in additional pod evictions due to insufficient ephemeral storage on the node.
Memory Noisy Neighbor
A container acting as a noisy neighbor consumes excessive memory, leading to abnormally high memory usage and contributing to node-level memory pressure that can trigger pod evictions. This issue occurs when a container consistently uses more memory than expected, which adversely impacts both the container and its hosting node. Key symptoms include:
- Memory_HighestUsage: The container registers the highest memory consumption compared to its peers.
- Node.ContainerMemoryUtilization.High: The node consistently exhibits elevated memory usage by containers.
- Node.MemoryPressurePodEvictions.High: Persistent memory pressure on the node results in frequent pod evictions.
- Node.MemoryPressure: There is a significant likelihood that the node experiences generalized memory pressure.

These conditions can propagate further, leading to additional pod evictions as the node struggles to manage overall memory usage.
Controller
Image Pull Errors
Kubernetes controllers may encounter image pull errors when they cannot download container images from a registry, causing Pods to fail to start or remain in an ImagePullBackOff state. This disrupts the deployment of applications and can affect service availability.
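A hedged example of detecting this condition with the Kubernetes Python client is shown below; it inspects each container's waiting reason and flags ImagePullBackOff and ErrImagePull. The namespace is a placeholder.

```python
# Sketch: find pods stuck on image pull errors by inspecting container
# waiting reasons. Requires the "kubernetes" Python client and a kubeconfig.

from kubernetes import client, config

PULL_ERRORS = {"ImagePullBackOff", "ErrImagePull"}

def pods_with_pull_errors(namespace: str = "default"):
    config.load_kube_config()
    v1 = client.CoreV1Api()
    for pod in v1.list_namespaced_pod(namespace).items:
        for status in pod.status.container_statuses or []:
            waiting = status.state.waiting
            if waiting is not None and waiting.reason in PULL_ERRORS:
                yield pod.metadata.name, status.image, waiting.reason

if __name__ == "__main__":
    for pod, image, reason in pods_with_pull_errors():
        print(f"{pod}: cannot pull {image} ({reason})")
```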
Frequent Pod Ephemeral Storage Evictions
A Kubernetes workload is experiencing frequent pod evictions due to ephemeral storage exhaustion. This disrupts application availability and performance, as pods are terminated when they exceed their allocated storage limits or when node-level storage is under pressure. Ephemeral storage evictions occur when pods consume more storage than the configured limits or when the node’s ephemeral storage capacity is insufficient.
Malfunction
Multiple pods for a Kubernetes controller are in a "NotReady" state for an extended period, which can lead to service unavailability or degraded performance.
Node
Disk Pressure
Disk pressure on a Kubernetes node indicates that the node’s disk usage is high, potentially causing the eviction of pods, reduced performance, and the inability to schedule new pods. This affects application stability and the node’s overall functionality. Disk pressure can arise from insufficient disk space, often caused by log accumulation, container images, temporary files, or application data. Kubernetes monitors disk usage through the kubelet, and when usage exceeds certain thresholds, it triggers disk pressure. This may prompt Kubernetes to evict non-essential or low-priority pods to free up disk space.
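As an illustration of the underlying check, the sketch below compares a filesystem's free-space fraction against a 10% floor, mirroring the kubelet's default nodefs.available signal; the actual threshold is cluster configuration and may differ.

```python
# Sketch: compare a node filesystem's free space against an eviction-style
# threshold. The 10% figure is the assumed default; adjust to match the cluster.

import shutil

def disk_pressure(path: str = "/", min_free_fraction: float = 0.10) -> bool:
    usage = shutil.disk_usage(path)
    free_fraction = usage.free / usage.total
    print(f"{path}: {free_fraction:.1%} free of {usage.total / 1024**3:.1f} GiB")
    return free_fraction < min_free_fraction

if __name__ == "__main__":
    if disk_pressure():
        print("node filesystem is below the assumed eviction threshold")
```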
Memory Pressure
Memory pressure on a Kubernetes node occurs when available memory falls below critical levels, potentially causing the eviction of pods and instability for applications running on the node. This reduces the node’s capacity to run workloads, potentially leading to service disruptions if insufficient resources are available across the cluster. Memory pressure typically results from pods consuming more memory than anticipated, memory leaks in applications, or running too many memory-intensive pods on a single node. Kubernetes detects low available memory via the kubelet and triggers evictions of lower-priority pods to free up resources, impacting workload stability.
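The sketch below illustrates the same idea at the node level by reading MemAvailable from /proc/meminfo and comparing it with a 100 MiB floor, which mirrors the kubelet's default memory.available eviction threshold; real clusters may configure a different value.

```python
# Sketch: check available node memory against an eviction-style floor.
# The 100 MiB threshold is an assumption mirroring the kubelet default.

def meminfo_kib(field: str) -> int:
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith(field + ":"):
                return int(line.split()[1])   # values are reported in KiB
    raise KeyError(field)

if __name__ == "__main__":
    available_mib = meminfo_kib("MemAvailable") / 1024
    threshold_mib = 100
    print(f"MemAvailable: {available_mib:.0f} MiB")
    if available_mib < threshold_mib:
        print("below the assumed eviction threshold; expect pod evictions")
```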
VirtualMachine
Conntrack Table Congested
The conntrack table on a VM is congested, causing new network connections to fail. This typically results in connectivity issues for applications, degraded performance, or downtime for services dependent on network communication. The conntrack table is responsible for tracking active network connections and has a fixed size, which can be exhausted under high connection load. Conntrack table congestion occurs when the number of tracked connections exceeds the configured capacity. Linux systems typically manage the conntrack table via the nf_conntrack_max parameter, which sets the maximum number of simultaneous connections that can be tracked.
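A quick, hedged way to gauge how close a Linux VM is to this limit is to compare the live entry count with nf_conntrack_max, as sketched below; both proc files require the nf_conntrack module to be loaded, and the values are instantaneous.

```python
# Sketch: measure conntrack table utilization on a Linux VM.

def read_int(path: str) -> int:
    with open(path) as f:
        return int(f.read().strip())

if __name__ == "__main__":
    count = read_int("/proc/sys/net/netfilter/nf_conntrack_count")
    maximum = read_int("/proc/sys/net/netfilter/nf_conntrack_max")
    print(f"conntrack: {count}/{maximum} entries ({count / maximum:.1%} full)")
    if count / maximum > 0.9:
        print("warning: table nearly full; new connections may be dropped")
```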
SNAT Ports Congested
The SNAT (Source Network Address Translation) ports on a virtual machine (VM) are congested, leading to outbound network connection failures or degraded performance for services relying on external APIs or resources. This issue primarily impacts VMs that need to establish multiple concurrent connections to the internet or external systems.
SNAT port congestion occurs when the available pool of ephemeral ports is exhausted. This is common in scenarios where a large number of outbound connections are initiated from a single VM using a NAT gateway or Azure Load Balancer. The problem is exacerbated by long-lived or idle connections that do not release ports promptly.
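As a local approximation only, the sketch below uses psutil to count established outbound TCP flows per remote endpoint, which is a rough proxy for SNAT port consumption; the authoritative counters live on the NAT gateway or load balancer, not on the VM itself.

```python
# Sketch: estimate SNAT port pressure by counting established TCP flows
# grouped by remote endpoint. Requires the "psutil" package.

from collections import Counter
import psutil

def snat_candidates() -> Counter:
    counts = Counter()
    for conn in psutil.net_connections(kind="tcp"):
        if conn.status == psutil.CONN_ESTABLISHED and conn.raddr:
            counts[(conn.raddr.ip, conn.raddr.port)] += 1
    return counts

if __name__ == "__main__":
    counts = snat_candidates()
    print(f"~{sum(counts.values())} established outbound flows")
    for (ip, port), n in counts.most_common(5):
        print(f"  {ip}:{port} -> {n} connections")
```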
CPU Congested
A Virtual Machine (VM) experiencing CPU congestion can lead to sluggish application performance, delayed response times, or even timeout errors for users and processes. This typically indicates that the VM’s CPU is overutilized, potentially due to high resource demands from applications or insufficient CPU allocation. CPU congestion in a VM is often caused by excessive CPU demand from running processes, high-priority tasks, or inefficient applications consuming more resources than anticipated. It may also result from an incorrect allocation of CPU resources or contention with other VMs if the host is overcommitted, meaning the hypervisor has allocated more virtual CPU resources across VMs than the physical CPU can handle efficiently.
Memory Congested
Memory congestion in a Virtual Machine (VM) leads to slow system performance, application crashes, or even VM instability as the system struggles to allocate memory for running processes. This typically results in frequent swapping or out-of-memory (OOM) errors, impacting applications and user operations. Memory congestion occurs when the VM’s memory demand exceeds the allocated memory, causing the system to use swap space or kill processes to free up memory. This can result from applications with high memory requirements, memory leaks, or an under-provisioned memory configuration for the VM. Additionally, in environments with multiple VMs, overall host memory overcommitment can cause congestion, as VMs compete for shared memory resources.
Disk Total IOPs Congested
The total disk IOPS for a cloud VM are congested because the VM has reached its maximum allowable IOPS limit. This results in throttling, which can slow application performance and lead to delays or errors in read/write-heavy workloads. In cloud environments, the total IOPS capacity is often determined by the VM's size, regardless of the IOPS capacity of attached disks. If the cumulative IOPS demand from all attached disks exceeds the VM's maximum IOPS allocation, the VM will throttle I/O operations to stay within its limit. This issue is unrelated to individual disk performance and entirely tied to the VM’s instance type or configuration.
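The arithmetic behind this is simple to illustrate: sum the per-disk IOPS demand and compare it with the VM-level cap, as in the sketch below. All numbers are placeholders; real limits come from the provider's instance and disk specifications.

```python
# Sketch: check whether the combined IOPS demand of attached disks can exceed
# the VM-level cap. All figures are illustrative assumptions.

VM_IOPS_LIMIT = 12_800                     # example cap for the instance size
DISK_IOPS_DEMAND = {"os-disk": 3_000,      # observed or provisioned per-disk IOPS
                    "data-disk-1": 7_500,
                    "data-disk-2": 7_500}

total_demand = sum(DISK_IOPS_DEMAND.values())
print(f"aggregate disk IOPS demand: {total_demand}, VM cap: {VM_IOPS_LIMIT}")
if total_demand > VM_IOPS_LIMIT:
    print("the VM will throttle I/O even though each disk is within its own limit")
```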
Disk Read IOPs Congested
The total disk read IOPS for a cloud VM are congested because the VM has reached its maximum allowable IOPS limit. This results in throttling, which can slow application performance and lead to delays or errors in read-heavy workloads. In cloud environments, the total IOPS capacity is often determined by the VM's size, regardless of the IOPS capacity of attached disks. If the cumulative read IOPS demand from all attached disks exceeds the VM's maximum IOPS allocation, the VM will throttle I/O operations to stay within its limit. This issue is unrelated to individual disk performance and entirely tied to the VM’s instance type or configuration.
Disk Write IOPs Congested
The total disk write IOPS for a cloud VM are congested because the VM has reached its maximum allowable IOPS limit. This results in throttling, which can slow application performance and lead to delays or errors in write-heavy workloads. In cloud environments, the total IOPS capacity is often determined by the VM's size, regardless of the IOPS capacity of attached disks. If the cumulative write IOPS demand from all attached disks exceeds the VM's maximum IOPS allocation, the VM will throttle I/O operations to stay within its limit. This issue is unrelated to individual disk performance and entirely tied to the VM’s instance type or configuration.
Disk Total Throughput Congested
The total disk throughput for a cloud VM is congested because the VM has reached its maximum allowable bandwidth. This can lead to slower data transfer rates for read/write-intensive applications, causing delays in processing and reduced system performance. In cloud environments, the total throughput (measured in MB/s) is limited by the VM's size, regardless of the capabilities of the attached disks. This bottleneck occurs when the combined throughput demand of all attached disks exceeds the VM’s allocated bandwidth.
Disk Read Throughput Congested
The total disk read throughput for a cloud VM is congested because the VM has reached its maximum allowable read bandwidth. This can lead to slower data transfer rates for read-intensive applications, causing delays in processing and reduced system performance. In cloud environments, the total read throughput (measured in MB/s) is limited by the VM's size, regardless of the capabilities of the attached disks. This bottleneck occurs when the combined read throughput demand of all attached disks exceeds the VM’s allocated bandwidth.
Disk Write Throughput Congested
The total disk write throughput for a cloud VM is congested because the VM has reached its maximum allowable write bandwidth. This can lead to slower data transfer rates for write-intensive applications, causing delays in processing and reduced system performance. In cloud environments, the total write throughput (measured in MB/s) is limited by the VM's size, regardless of the capabilities of the attached disks. This bottleneck occurs when the combined write throughput demand of all attached disks exceeds the VM’s allocated bandwidth.
Storage
Disk
Congested
The disk has reached full capacity, which prevents new data from being written and may cause applications to fail, especially those dependent on free disk space for logs, caching, or temporary files. This can also slow down or halt system operations if critical processes can no longer write to the disk. Disk capacity is exhausted due to an accumulation of data, logs, backups, temporary files, or possibly unmonitored growth in application-generated files. In systems where disk cleanup or archiving hasn’t been implemented, space gradually fills up until no storage is available.
Inode Usage Congested
The disk is experiencing inode exhaustion, meaning the file system has run out of inodes (metadata structures for file storage), which prevents new files from being created even if there is free disk space. This often causes errors in applications attempting to create files and can disrupt services reliant on file storage. Inode congestion usually happens when the disk has a large number of small files that consume inodes without occupying much actual disk space. Each file or directory requires an inode, and file systems have a fixed inode allocation. Once all inodes are used, no new files can be created until inode space is freed up.
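A minimal check for this state reads the filesystem's inode counters via os.statvfs, as sketched below.

```python
# Sketch: report inode utilization for a mounted filesystem.

import os

def inode_usage(path: str = "/") -> float:
    st = os.statvfs(path)
    if st.f_files == 0:        # some filesystems do not report inode counts
        return 0.0
    used = st.f_files - st.f_ffree
    return used / st.f_files

if __name__ == "__main__":
    ratio = inode_usage("/")
    print(f"inodes used: {ratio:.1%}")
    if ratio > 0.9:
        print("warning: inode exhaustion likely; look for directories with many small files")
```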
IOPs Congested
The disk is experiencing Read/Write Operations Per Second (IOPS) congestion, meaning that the total IOPS capacity is fully utilized. This causes slow performance for applications that rely on disk access, leading to delayed data processing, system lags, or even timeouts. IOPS congestion generally occurs when the read and/or write requests to the disk exceed its IOPS capacity. Causes can include a high volume of concurrent read/write operations, resource-intensive applications, or inefficient disk utilization by underlying processes. In virtualized environments, shared storage across multiple VMs can also lead to IOPS saturation if multiple virtual machines have high I/O demands simultaneously.
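For illustration, the sketch below samples the kernel's per-disk I/O counters twice with psutil and derives the current read and write IOPS, comparing them with an assumed provisioned limit.

```python
# Sketch: measure current per-disk IOPS by sampling kernel I/O counters twice.
# The provisioned limit is an assumption; substitute the disk's real spec.

import time
import psutil

PROVISIONED_IOPS = 500        # example limit for the disk class in use
INTERVAL = 5.0                # seconds between samples

before = psutil.disk_io_counters(perdisk=True)
time.sleep(INTERVAL)
after = psutil.disk_io_counters(perdisk=True)

for disk, stats in after.items():
    prev = before.get(disk)
    if prev is None:
        continue
    reads = (stats.read_count - prev.read_count) / INTERVAL
    writes = (stats.write_count - prev.write_count) / INTERVAL
    flag = " <-- near limit" if reads + writes > 0.9 * PROVISIONED_IOPS else ""
    print(f"{disk}: {reads:.0f} read IOPS, {writes:.0f} write IOPS{flag}")
```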
Read IOPs Congested
The disk is experiencing Read Operations Per Second (IOPS) congestion, meaning that the total IOPS capacity is fully utilized. This causes slow performance for applications that rely on disk access, leading to delayed data processing, system lags, or even timeouts. IOPS congestion generally occurs when the read requests to the disk exceed its IOPS capacity. Causes can include a high volume of concurrent read operations, resource-intensive applications, or inefficient disk utilization by underlying processes. In virtualized environments, shared storage across multiple VMs can also lead to IOPS saturation if multiple virtual machines have high I/O demands simultaneously.
Write IOPs Congested
The disk is experiencing Write Operations Per Second (IOPS) congestion, meaning that the total IOPS capacity is fully utilized. This causes slow performance for applications that rely on disk access, leading to delayed data processing, system lags, or even timeouts. IOPS congestion generally occurs when the write requests to the disk exceed its IOPS capacity. Causes can include a high volume of concurrent write operations, resource-intensive applications, or inefficient disk utilization by underlying processes. In virtualized environments, shared storage across multiple VMs can also lead to IOPS saturation if multiple virtual machines have high I/O demands simultaneously.
Read Throughput Congested
The disk is experiencing congestion specifically in read throughput, which slows down data retrieval from the disk and can degrade the performance of applications reliant on high-speed data access. Read throughput congestion typically results from reaching the disk's maximum read bandwidth. This can happen when multiple processes or applications simultaneously request large amounts of data, or when a few processes perform heavy sequential reads. The disk’s read bandwidth limit, once exceeded, restricts further read speeds, causing delays and slower performance.
Write Throughput Congested
The disk is experiencing write throughput congestion, leading to slower data write speeds and affecting applications that require high-speed data recording. This issue can cause delays in data availability and reduced performance in write-intensive tasks. Write throughput congestion occurs when the disk’s maximum write bandwidth is exceeded. This can happen if large amounts of data are being written in quick succession or if multiple applications are performing intensive write operations simultaneously. Once the write bandwidth limit is reached, the disk begins to throttle, slowing down additional write operations.
Network
Network Endpoint
Invalid Server Certificate
The network endpoint is serving an invalid server certificate, resulting in a high rate of client request errors due to certificate validation failures. This issue propagates further, increasing the overall request error rate across the system. An invalid server certificate is detected when the certificate presented by the network endpoint fails to meet expected security standards.
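A hedged client-side reproduction of this failure mode is sketched below: it attempts a verified TLS handshake with the Python standard library and reports either the certificate's expiry or the validation error. The hostname is a placeholder.

```python
# Sketch: reproduce the client-side view of an endpoint's certificate.

import socket
import ssl

def check_certificate(host: str, port: int = 443) -> None:
    ctx = ssl.create_default_context()       # verifies chain and hostname
    try:
        with socket.create_connection((host, port), timeout=5) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                cert = tls.getpeercert()
                print(f"{host}: certificate valid, expires {cert['notAfter']}")
    except ssl.SSLCertVerificationError as exc:
        print(f"{host}: certificate rejected: {exc.verify_message}")

if __name__ == "__main__":
    check_certificate("example.com")   # placeholder hostname
```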