CanaryRolloutFailed
Meaning
Canary rollout has failed.
Full context
A change delivered using a progressive rollout has failed to be enforced on some of the targets. The canary rollout completed with a failure, either because the delivered change was incorrect or because a configured analysis failed. For more information on progressive rollouts, see Rollout Configuration Changes.
Symptom
A failed canary rollout will have the Ready status condition set to False and the Complete status condition set to True.
List all failed canary rollouts.
JSONPATH='{range .items[*]}{@.metadata.name}:{range @.status.conditions[*]}{@.type}={@.status}:{end}{"\n"}{end}'
kubectl get canaryrollout -o jsonpath="$JSONPATH" | grep "Ready=False" | grep "Complete=True" | cut -d':' -f1
Inspect the canary Ready condition message for more details.
kubectl get canaryrollout <name> -o jsonpath='{.status.conditions[?(@.type=="Ready")]}' | jq
Impact
The desired change delivered by the canary was not enforced on some targets.
Inspect the .spec.patch field to determine whether this is a configuration change or a NuoDB product version upgrade.
Changes in the service tiers are typically delivered via canary rollouts by updating the tier revision number (e.g. {"spec":{"type":{"tierRef":{"revision":"2"}}}}).
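For example, to view the patch for the rollout referenced by the alert, you can reuse the same jsonpath shown in the example later on this page (a minimal sketch; <name> and <namespace> are placeholders taken from the alert's labels):
kubectl get canaryrollout <name> -n <namespace> -o jsonpath='{.spec.patch}'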
By default, all failed targets are rolled back automatically to limit the impact of a potentially incorrect change. The canary rollout stops immediately after at least one target fails, which means that the configuration change is not attempted on the remaining targets.
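For a quick overview of the rollout status, including whether it has been rolled back, you can list the resource and read the READY, COMPLETED, and ROLLBACKED columns (the same columns shown in the example later on this page); <name> and <namespace> are placeholders:
kubectl get canaryrollout <name> -n <namespace>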
Diagnosis
- Check the canary rollout state using kubectl describe canaryrollout <name> (the first few checks are combined into a sketch after this list).
- Check the canary rollout Ready condition's state and message.
- Check the canary rollout events as described in Monitoring rollout progress.
- Check the last promoted targets for failed analysis using kubectl get canaryrollout <name> -o jsonpath='{.status.lastPromotedTargets}' | jq
- Diagnose issues with failed databases as described in Diagnosing database component. Review historical information recorded in events, logs, and metrics, because the target database might have already been rolled back.
- Attempt to perform the change manually on a previously failed database or its cloned copy.
- Retry the canary rollout as described in Restart rollout.
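The first few checks can be combined into a small shell sketch (a minimal sketch; <name> and <namespace> are placeholders taken from the alert's labels):
ROLLOUT=<name>          # canary rollout name from the alert's labels
NAMESPACE=<namespace>   # namespace from the alert's labels
# High-level state, conditions, and recent events
kubectl describe canaryrollout "$ROLLOUT" -n "$NAMESPACE"
# Details of the Ready condition
kubectl get canaryrollout "$ROLLOUT" -n "$NAMESPACE" -o jsonpath='{.status.conditions[?(@.type=="Ready")]}' | jq
# Targets promoted in the last step (the usual candidates for a failed analysis)
kubectl get canaryrollout "$ROLLOUT" -n "$NAMESPACE" -o jsonpath='{.status.lastPromotedTargets}' | jq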
Scenarios
Scenario 1: Target analysis failure
The canary rollout executes the analysis configured in the rollout template against all promoted targets. If an analysis does not complete successfully within the predefined timeout, it is marked as failed. Currently, there is no way to determine whether the target is not ready because of the change promoted by the canary rollout or for some unrelated reason.
Possible causes for target analysis failure:
- The change delivered by the canary rollout is invalid or incompatible with some targets.
- The target was already in a failed state. The analysis is not run before target promotion and is not a gating condition for delivering a patch to a target. A target can be unready both before and after the patch, which will fail the entire canary rollout. It is important to monitor and disable domains or databases that remain in a failed state for a long time. Alternatively, exclude such targets from canary rollouts by using the canary label selector. A quick health check for individual targets is sketched after this list.
- The analysis timeout is too short for some targets. Rolling upgrade might take more time for some databases due to the number of engines or the time to perform journal recovery or SYNCING.
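To tell whether a target was already unhealthy independently of the rollout, you can inspect its Ready condition directly. This sketch reuses the condition-inspection pattern from above; <database> and <namespace> are placeholders for one of the rollout targets:
kubectl get database <database> -n <namespace> -o jsonpath='{.status.conditions[?(@.type=="Ready")]}' | jq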
Example
Get the canary rollout name and its namespace from the alert’s labels. Inspect the canary rollout state in the Kubernetes cluster.
kubectl get canaryrollout n0.nano -n nuodb-cp-system
Notice that the READY value is False and COMPLETED is True, which means that the canary rollout has failed.
NAME      PAUSED   READY   COMPLETED   ROLLBACKED   AGE
n0.nano   False    False   True        False        6m50s
Inspect the canary rollout failure message.
kubectl get canaryrollout n0.nano -o jsonpath='{.status.conditions[?(@.type=="Ready")]}' | jq
{
"lastTransitionTime": "2026-01-29T09:23:21Z",
"message": "failed analysis: name=\"ready\", targets=[Database/acme-messaging-drive Database/acme-messaging-store]",
"observedGeneration": 1,
"reason": "CanaryAnalysisRunFailed",
"status": "False",
"type": "Ready"
}
Check the canary rollout events.
kubectl describe canaryrollout n0.nano
The output below shows that a configuration change was promoted successfully to two databases in step 1, followed by another two databases in step 2. Neither database in step 2 became ready within the configured analysis timeout (3 minutes in this case). The “ready” analysis failed for these two targets, and a rollback was performed on them to prevent any database downtime.
Type Reason Age From Message
---- ------ ---- ---- -------
Normal CanaryPromoteStep 5m22s nuodb-cp-operator Promote step (1/4) progressing target Database default/acme-messaging-demo
Normal CanaryPromoteStep 5m22s nuodb-cp-operator Promote step (1/4) progressing target Database default/acme-messaging-disk
Normal CanaryAnalysisRunSucceeded 4m22s nuodb-cp-operator Analysis step (1/4) analysis "ready" succeed for target Database default/acme-messaging-demo
Normal CanaryAnalysisRunSucceeded 4m22s nuodb-cp-operator Analysis step (1/4) analysis "ready" succeed for target Database default/acme-messaging-disk
Normal CanaryAnalysisRunSucceeded 4m22s nuodb-cp-operator Analysis step (1/4) analysis "synced" succeed for target Database default/acme-messaging-demo
Normal CanaryAnalysisRunSucceeded 4m22s nuodb-cp-operator Analysis step (1/4) analysis "synced" succeed for target Database default/acme-messaging-disk
Normal Progressing 4m22s nuodb-cp-operator Step (1/4) completed
Normal CanaryPromoteStep 4m22s nuodb-cp-operator Promote step (2/4) progressing target Database default/acme-messaging-drive
Normal CanaryPromoteStep 4m22s nuodb-cp-operator Promote step (2/4) progressing target Database default/acme-messaging-store
Normal CanaryAnalysisRunSucceeded 3m22s nuodb-cp-operator Analysis step (2/4) analysis "synced" succeed for target Database default/acme-messaging-drive
Normal CanaryAnalysisRunSucceeded 3m22s nuodb-cp-operator Analysis step (2/4) analysis "synced" succeed for target Database default/acme-messaging-store
Warning CanaryAnalysisRunFailed 81s nuodb-cp-operator Analysis step (2/4): analysis "ready" failed for target Database default/acme-messaging-drive: unexpected status for condition Ready expected=True, actual=False: unhealthy components: [transactionEngines]; unhealthy resources: [deployment/te-acme-messaging-drive-fbd7bd9] (timeout after 3m0s)
Warning CanaryAnalysisRunFailed 81s nuodb-cp-operator Analysis step (2/4): analysis "ready" failed for target Database default/acme-messaging-store: unexpected status for condition Ready expected=True, actual=False: unhealthy components: [transactionEngines]; unhealthy resources: [deployment/te-acme-messaging-store-9dz2w4z] (timeout after 3m0s)
Warning CanaryAnalysisRunFailed 81s nuodb-cp-operator Step (2/4) failed: failed analysis: name="ready", targets=[Database/acme-messaging-drive Database/acme-messaging-store]
Normal RollbackSucceededReason 81s nuodb-cp-operator Rollback target Database default/acme-messaging-drive
Normal RollbackSucceededReason 81s nuodb-cp-operator Rollback target Database default/acme-messaging-store
Let’s drill down and list the events for one of the failed deployments and its pods.
kubectl get events | grep te-acme-messaging-drive-fbd7bd9 | grep Warning
6m13s Warning CanaryAnalysisRunFailed canaryrollout/n0.nano Analysis step (2/4): analysis "ready" failed for target Database default/acme-messaging-drive: unexpected status for condition Ready expected=True, actual=False: unhealthy components: [transactionEngines]; unhealthy resources: [deployment/te-acme-messaging-drive-fbd7bd9] (timeout after 3m0s)
...
9m8s Warning FailedScheduling pod/te-acme-messaging-drive-fbd7bd9-6f69c9ffcf-s5wr2 0/1 nodes are available: 1 Insufficient cpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
The error above indicates that there is not enough CPU available in the cluster to schedule the TE pod. This could mean that the change increased the CPU requests, which was fine for the first two databases, but eventually the cluster capacity was exhausted.
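To confirm that the cluster is indeed short on CPU, you can check how much CPU is already requested and in use on the nodes (standard Kubernetes commands; kubectl top nodes additionally requires the metrics server to be installed):
# CPU and memory already requested on each node
kubectl describe nodes | grep -A 8 "Allocated resources"
# Current usage (requires metrics-server)
kubectl top nodes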
Inspect service tier change
To inspect the canary patch, execute:
kubectl get canary n0.nano -o jsonpath='{.spec.patch}'
The output shows a change in the service tier revision, which means that a change was performed in the service tier or in a Helm feature referenced by the service tier. For more information about shared database configuration, see Service tiers.
{"spec":{"type":{"tierRef":{"revision":"100"}}}}
To compare service tier and Helm feature revisions, let’s create a small utility function.
# Compare two recorded revisions of a tier or Helm feature.
# Usage: diff_revisions <kind/name> <new_revision> <previous_revision> [diff_context_lines]
# Each revision in .status.history.revisions stores its spec base64-encoded, hence the base64 -d.
diff_revisions() {
  kind_name=$1
  new_rev=$2
  prev_rev=$3
  context=${4:-"3"}
  diff -U ${context} \
    <(kubectl get "$kind_name" \
      -o jsonpath="{range @.status.history.revisions[?(@.generation==${prev_rev})]}{@.spec}{end}" | base64 -d | jq) \
    <(kubectl get "$kind_name" \
      -o jsonpath="{range @.status.history.revisions[?(@.generation==${new_rev})]}{@.spec}{end}" | base64 -d | jq)
}
Compare the service tier revision 100 with the previous revision (in this case 99) using the following command:
diff_revisions "tier/n0.nano" 100 99--- /dev/fd/11 2026-01-29 12:15:45
+++ /dev/fd/12 2026-01-29 12:15:45
@@ -62,7 +62,7 @@
},
{
"name": "nano-resources",
- "revision": "6"
+ "revision": "7"
},
{
"name": "nano-disk",
In this case, the revision of the nano-resources Helm feature is the only change, so it must have been updated.
Let’s compare its revisions.
diff_revisions "feature/nano-resources" 7 6 10--- /dev/fd/11 2026-01-29 12:16:11
+++ /dev/fd/12 2026-01-29 12:16:11
@@ -27,18 +27,18 @@
}
},
"te": {
"memoryOption": "500Mi",
"resources": {
"limits": {
"cpu": 8,
"memory": "500Mi"
},
"requests": {
- "cpu": "2",
+ "cpu": 4,
"memory": "500Mi"
}
}
}
}
}
}
The CPU request for the TEs has been increased, which affects all databases using the n0.nano service tier.
This aligns with the earlier finding that the database TE pod could not be scheduled due to resource constraints.
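To double-check the effective CPU request on one of the affected TE deployments, you can read it directly from the Deployment spec. This is a sketch using the deployment name from the failing events and assuming it lives in the same namespace as the database (default in this example); after the automatic rollback, the value may already reflect the previous revision:
kubectl get deployment te-acme-messaging-drive-fbd7bd9 -n default -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{"\t"}{.resources.requests.cpu}{"\n"}{end}'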