Release Management

Root causes that are triggered by a code change or new release and result in a measurable reliability regression.

Root Causes


Code Change Regression: CPU Congestion

After a version upgrade, application containers experience high CPU usage, leading to performance degradation or unresponsiveness. This issue impacts the system's ability to handle requests effectively, potentially causing downtime or delays for end users.
High CPU usage post-upgrade typically stems from changes in the application code, dependencies, or configurations. Common causes include:

  • Inefficient code introduced in the new version (for example, infinite loops, unoptimized algorithms).
  • Increased resource demands from new features or changes in workload patterns.
  • Memory leaks that cause excessive garbage collection or other inefficiencies in runtime environments such as Java or Node.js.
  • Container CPU limits set too low for the new version, causing the container to be throttled.
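
One way to check whether a CPU increase coincides with a release is to compare utilization samples collected before and after the rollout. The following is a minimal sketch under stated assumptions: the function name `cpu_regressed`, the sample values, and the 1.5x threshold are all hypothetical and would need to be tuned for a real workload.

```python
from statistics import mean


def cpu_regressed(baseline, candidate, ratio_threshold=1.5):
    """Return True if mean CPU usage of the new version exceeds the
    baseline mean by more than ratio_threshold (a hypothetical cutoff)."""
    if not baseline or not candidate:
        raise ValueError("both sample lists must be non-empty")
    base = mean(baseline)
    if base == 0:
        # Any usage counts as a regression against a zero baseline.
        return mean(candidate) > 0
    return mean(candidate) / base > ratio_threshold


# CPU cores used by one container, sampled before and after the upgrade
# (made-up numbers for illustration).
before = [0.42, 0.40, 0.45, 0.41]
after = [0.88, 0.93, 0.90, 0.95]
print(cpu_regressed(before, after))  # prints True
```

A ratio of means is deliberately simple. It will not distinguish a code regression from a traffic increase, so comparing samples taken under similar load gives a more meaningful result.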

Code Change Regression: Database Connection Pool Saturated

After a version upgrade, the client-side database connection pool becomes exhausted: all available connections are in use, so new database queries cannot be executed. This can cause application requests to hang or fail, impacting user experience and potentially leading to downtime for database-dependent features.
Exhaustion occurs when the application's demand for connections exceeds the configured maximum size of the pool. Common contributing factors include:

  • Long-running queries that hold connections for extended periods.
  • Connection leaks where connections are not properly closed or returned to the pool after use.
  • High traffic or spikes in concurrent requests exceeding the pool capacity.
  • Improper pool size configuration for the workload or database limits.
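
Connection leaks in particular are often avoidable by returning the connection in a `finally` path. The sketch below illustrates the idea with a tiny fixed-size pool built on the standard library. `SimplePool` is a hypothetical class written for this example (real database drivers ship their own pools), and SQLite in-memory connections stand in for a real database.

```python
import queue
import sqlite3
from contextlib import contextmanager


class SimplePool:
    """Tiny fixed-size pool (hypothetical; for illustration only)."""

    def __init__(self, size, timeout=0.1):
        self._timeout = timeout
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(sqlite3.connect(":memory:", check_same_thread=False))

    @contextmanager
    def connection(self):
        try:
            conn = self._idle.get(timeout=self._timeout)
        except queue.Empty:
            # All connections are checked out: the pool is saturated.
            raise RuntimeError("connection pool exhausted") from None
        try:
            yield conn
        finally:
            # Always return the connection, even if the query raised.
            self._idle.put(conn)

    def idle_count(self):
        return self._idle.qsize()


pool = SimplePool(size=2)

with pool.connection() as conn:
    value = conn.execute("SELECT 1").fetchone()[0]

# Simulate a failing request: the connection is still returned to the pool.
try:
    with pool.connection() as conn:
        conn.execute("SELECT * FROM missing_table")
except sqlite3.OperationalError:
    pass

print(value, pool.idle_count())  # prints: 1 2
```

Without the `finally` block, the failed query would leave one connection checked out permanently, and a pool of two would be exhausted after two such failures.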

Code Change Regression: Frequent Crash Failure

One or more containers of a workload crash frequently with a non-zero exit code after a version upgrade. This disrupts the application's functionality, leading to downtime or degraded performance depending on the workload design. The issue likely stems from changes in the new version that either introduce new incompatibilities or exacerbate existing problems.
A non-zero exit code indicates abnormal termination, and the fact that the crashes began after a version upgrade points to causes such as:

  • New bugs introduced in the updated code, including unhandled exceptions, invalid logic, or runtime errors.
  • Incompatible configurations that no longer match the updated application’s requirements (for example, new required environment variables).
  • Changes in dependencies, such as a library update causing compatibility issues or stricter API validation.
  • External dependencies (for example, databases or third-party APIs) whose behavior no longer aligns with the updated application.
  • Resource constraints aggravated by the updated version's increased resource demands or changes in workload patterns.
  • Health check behavior changes leading to premature or unnecessary container restarts.
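
For the configuration case, one common mitigation is to validate required settings at startup so that a missing variable produces a clear log message instead of an obscure crash later on. The following is a minimal sketch: the variable names in `REQUIRED_VARS` and the helper names are hypothetical, and exit code 78 is chosen only because it matches the `EX_CONFIG` convention from `sysexits.h`.

```python
import os
import sys

# Hypothetical variable names, for illustration only.
REQUIRED_VARS = ("DATABASE_URL", "CACHE_URL")


def missing_settings(environ, required=REQUIRED_VARS):
    """Return the required variables that are unset or empty."""
    return [name for name in required if not environ.get(name)]


def check_or_exit(environ=os.environ):
    """Exit with a distinct code and a readable message if settings are missing."""
    missing = missing_settings(environ)
    if missing:
        # A clear message plus a distinct exit code makes a crash loop
        # easier to diagnose from container logs.
        print(f"missing required settings: {', '.join(missing)}", file=sys.stderr)
        sys.exit(78)


print(missing_settings({"DATABASE_URL": "postgres://example"}))  # prints ['CACHE_URL']
```

Calling `check_or_exit()` as the first step of the application's entry point would surface the problem on the first start of the new version rather than on the first request that needs the setting.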

Code Change Regression: Frequent Memory Failure

The application is running out of memory after a version upgrade, leading to crashes, degraded performance, or instability. This impacts availability and user experience, often requiring container restarts or manual intervention to restore functionality. The issue is likely tied to changes in the updated version that increase memory usage or introduce inefficiencies.
Post-upgrade, out-of-memory (OOM) errors are typically caused by:

  • New memory leaks in the updated code, where objects are not properly released.
  • Increased memory consumption from new features, changes in algorithms, or handling of larger data sets.
  • Updated dependencies introducing higher memory overhead.
  • Improper memory configuration, such as memory limits that are too restrictive for the updated workload.
  • Workload changes, like higher traffic or larger input data, which were not anticipated during the upgrade.
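
To tell a leak apart from a one-time increase in memory use, it can help to measure how much memory a code path retains after repeated calls. The sketch below uses Python's standard `tracemalloc` module for this. The leaking handler and the `measure_growth` helper are hypothetical and exist only to illustrate the measurement.

```python
import tracemalloc

_cache = []  # hypothetical module-level cache that is never trimmed


def handle_request(payload_size=10_000):
    """Simulated handler that leaks by appending to a global list."""
    _cache.append(bytearray(payload_size))


def measure_growth(func, calls=200):
    """Return the bytes of traced memory still allocated after calling func `calls` times."""
    tracemalloc.start()
    before, _peak = tracemalloc.get_traced_memory()
    for _ in range(calls):
        func()
    after, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return after - before


growth = measure_growth(handle_request)
print(f"retained after 200 calls: {growth} bytes")
```

A handler that releases its allocations shows growth close to zero under the same measurement, while retained memory that scales with the number of calls suggests a leak.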

Code Change Regression: Inefficient Garbage Collection

After a version upgrade, the garbage collector runs frequently, leading to performance degradation or crashes. This issue is likely caused by changes in the application code or dependencies that increase memory usage or introduce inefficiencies.
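
Garbage collection frequency can be observed directly in most managed runtimes. As an illustration only, the sketch below counts collection passes in Python using the standard `gc.callbacks` hook. Python here is a stand-in for whichever runtime the application uses, and the `GcCounter` class and `make_cycles` workload are hypothetical.

```python
import gc


class GcCounter:
    """Counts completed garbage collection passes via gc.callbacks."""

    def __init__(self):
        self.passes = 0
        self.collected = 0

    def __call__(self, phase, info):
        # The interpreter calls this with phase "start" or "stop".
        if phase == "stop":
            self.passes += 1
            self.collected += info["collected"]

    def __enter__(self):
        gc.callbacks.append(self)
        return self

    def __exit__(self, exc_type, exc, tb):
        gc.callbacks.remove(self)


def make_cycles(count):
    """Create reference cycles, which only the cyclic collector can free."""
    for _ in range(count):
        a, b = [], []
        a.append(b)
        b.append(a)


with GcCounter() as counter:
    make_cycles(10_000)
    gc.collect()  # force a final pass so at least one pass is recorded

print(f"{counter.passes} GC passes, {counter.collected} objects collected")
```

Comparing the pass count for the same workload on the old and new versions gives a rough indication of whether the upgrade increased collector pressure.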


Code Change Regression: Java Heap Saturated

After a version upgrade, the Java heap is frequently saturated, leading to performance degradation or crashes. This issue is likely caused by changes in the application code or dependencies that increase memory usage or introduce inefficiencies.


Code Change Regression: Lock Contention

After a version upgrade, the application experiences frequent lock contention, leading to performance degradation or crashes. This issue is likely caused by changes in the application code or dependencies that increase locking or introduce inefficiencies.
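
Lock contention shows up as time that threads spend waiting to acquire a lock rather than doing work. The following is a minimal sketch of how that wait time could be measured. `TimedLock` is a hypothetical wrapper written for this example, and the `time.sleep` call stands in for real work inside the critical section.

```python
import threading
import time


class TimedLock:
    """Lock wrapper that records how long callers wait to acquire it."""

    def __init__(self):
        self._lock = threading.Lock()
        self.total_wait = 0.0
        self.acquisitions = 0

    def __enter__(self):
        start = time.perf_counter()
        self._lock.acquire()
        waited = time.perf_counter() - start
        # Safe to update here: the caller holds the lock at this point.
        self.total_wait += waited
        self.acquisitions += 1
        return self

    def __exit__(self, exc_type, exc, tb):
        self._lock.release()


def worker(lock, hold_seconds):
    with lock:
        time.sleep(hold_seconds)  # stands in for work done while holding the lock


lock = TimedLock()
threads = [threading.Thread(target=worker, args=(lock, 0.02)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"{lock.acquisitions} acquisitions, {lock.total_wait:.3f}s total wait")
```

Because each thread holds the lock for the full duration of its work, later threads queue behind earlier ones. A code change that lengthens the critical section or adds more callers increases the total wait in the same way.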


Code Change Regression: Memory Failure

Memory failures after a code change can cause containers to crash or degrade performance, resulting in errors for end users or failed service requests. These issues occur when newly introduced code leads to unexpected increases in memory usage, triggering out-of-memory (OOM) errors and destabilizing the system.
The root cause is often linked to recent code modifications that introduce memory leaks, inefficient algorithms, or increased resource demands. These changes may cause the application running inside the container to consume more memory than its allocated limit. When the container exceeds this limit, the system triggers an OOM event, terminating the process and causing service disruptions. This is particularly likely if memory usage grows gradually, such as with leaks, or spikes during certain operations introduced by the new code.
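
When memory grows gradually, a simple trend projection can indicate how long a container might run before reaching its limit. The sketch below fits a least-squares line to memory samples and extrapolates from the most recent sample. The function name, sample values, and 256 MiB limit are hypothetical, and real usage is rarely perfectly linear, so the result is a rough estimate.

```python
def seconds_until_limit(samples, limit_bytes):
    """Estimate seconds until memory reaches limit_bytes.

    samples: list of (timestamp_seconds, used_bytes) pairs.
    The growth rate is the least-squares slope over all samples, and the
    projection starts from the last sample. Returns None when usage is
    flat or shrinking.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("timestamps must differ")
    slope = sum((t - mean_t) * (m - mean_m) for t, m in samples) / var_t
    if slope <= 0:
        return None
    _last_t, last_m = samples[-1]
    return max(0.0, (limit_bytes - last_m) / slope)


MIB = 1024 * 1024
# Made-up samples: usage grows by 30 MiB every 60 seconds.
usage = [(0, 100 * MIB), (60, 130 * MIB), (120, 160 * MIB), (180, 190 * MIB)]
print(seconds_until_limit(usage, limit_bytes=256 * MIB))  # prints 132.0
```

In this example the growth rate is 0.5 MiB per second and 66 MiB of headroom remain, which gives the 132 second estimate.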


Code Change Regression: Redis Connection Pool Saturated

After a version upgrade, the Redis connection pool is frequently saturated, leading to performance degradation or crashes. This issue is likely caused by changes in the application code or dependencies that increase Redis usage or introduce inefficiencies.


Code Change Regression: Slow Database Queries

After a version upgrade, the application experiences slow database queries, which cause downstream slow-consumer behavior and can lead to resource starvation. Instance performance suffers most when query execution times become excessively long.
Slow queries mean that interactions with the database take longer than expected, which degrades overall system responsiveness. The effect is felt at the level of individual instances: while a request waits on a query, it continues to hold resources such as threads and connections. When query execution times exceed acceptable thresholds, resource starvation on the affected instance becomes nearly certain. Even below those thresholds, the added latency can make the instance a slow consumer, compounding the impact on system performance.
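
A lightweight way to find which queries crossed a latency threshold after an upgrade is to time each one at the call site. The following sketch does this with a context manager. `timed_query`, the `slow_queries` list, and the 0.5 second threshold are hypothetical, and an in-memory SQLite database stands in for the real one.

```python
import sqlite3
import time
from contextlib import contextmanager

slow_queries = []  # (label, seconds) pairs for queries over the threshold


@contextmanager
def timed_query(label, threshold_seconds, clock=time.perf_counter):
    """Record the labeled block in slow_queries if it runs longer than the threshold."""
    start = clock()
    try:
        yield
    finally:
        elapsed = clock() - start
        if elapsed > threshold_seconds:
            slow_queries.append((label, elapsed))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
conn.executemany(
    "INSERT INTO orders (total) VALUES (?)", [(i * 1.5,) for i in range(1000)]
)

with timed_query("sum_orders", threshold_seconds=0.5):
    total = conn.execute("SELECT SUM(total) FROM orders").fetchone()[0]

print(total, slow_queries)
```

The `clock` parameter exists so the timing source can be replaced in tests. In production code, the recorded durations would more likely be exported as metrics than kept in a list.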