From 5b37df0deb4b32e9bc935fc3227216caaf41eea7 Mon Sep 17 00:00:00 2001 From: Christopher Haar Date: Wed, 28 Jan 2026 12:47:41 +0100 Subject: [PATCH 1/2] docs(cb): add circuit breaker documentation Signed-off-by: Christopher Haar --- content/master/guides/metrics.md | 107 +++++++++++++++++ .../master/guides/troubleshoot-crossplane.md | 50 ++++++++ content/v2.1/guides/metrics.md | 109 +++++++++++++++++- .../v2.1/guides/troubleshoot-crossplane.md | 52 ++++++++- 4 files changed, 316 insertions(+), 2 deletions(-) diff --git a/content/master/guides/metrics.md b/content/master/guides/metrics.md index 7fa40c1b6..53a10a740 100644 --- a/content/master/guides/metrics.md +++ b/content/master/guides/metrics.md @@ -51,6 +51,113 @@ The Crossplane pod emits these metrics. | {{}}engine_watches_stopped_total{{}} | Total number of watches stopped | {{}} +### Circuit breaker metrics + +The circuit breaker prevents reconciliation thrashing by monitoring and rate-limiting watch events per Composite Resource (XR). These metrics help you identify and respond to excessive reconciliation activity. + +All circuit breaker metrics include a `controller` label formatted as `composite/.` (for example, `composite/xpostgresqlinstances.example.com`), providing visibility per XRD without creating high cardinality from individual XR instances. + +#### circuit_breaker_opens_total + +Tracks when a circuit breaker transitions from closed to open state. An increase indicates an XR is receiving excessive watch events and has triggered throttling. + +**Use this metric to:** +- Alert on XRs experiencing reconciliation thrashing +- Identify which XRD types are prone to excessive watch events +- Track the frequency of circuit breaker activations + +**Example PromQL queries:** +```promql +# Rate of circuit breaker opens over 5 minutes +rate(circuit_breaker_opens_total[5m]) + +# Count of circuit breaker opens by controller +sum by (controller) (circuit_breaker_opens_total) +``` + +#### circuit_breaker_closes_total + +Tracks when a circuit breaker transitions from open to closed state. This indicates an XR has recovered from excessive watch events and returned to normal operation. + +**Use this metric to:** +- Monitor recovery from reconciliation thrashing +- Verify circuit breakers are closing after cooldown periods +- Track overall circuit breaker lifecycle + +#### circuit_breaker_events_total + +Tracks all watch events processed by the circuit breaker, labeled by `result`: +- `Allowed`: Normal operation when circuit is closed - events proceed to reconciliation +- `Dropped`: Events blocked when circuit is fully open - indicates active throttling +- `Halfopen_allowed`: Limited probe events when circuit is half-open - circuit is testing for recovery + +**Use this metric to:** +- Monitor the volume of watch events per XR type +- Detect when events are being dropped (active throttling) +- Alert on high dropped event rates indicating potential issues +- Understand reconciliation pressure on specific controllers + +**Example PromQL queries:** +```promql +# Rate of dropped events (active throttling), aggregated per controller +sum by (controller) ( + rate(circuit_breaker_events_total{result="Dropped"}[5m]) +) + +# Percentage of events being dropped +sum by (controller) (rate(circuit_breaker_events_total{result="Dropped"}[5m])) +/ +sum by (controller) (rate(circuit_breaker_events_total[5m])) * 100 + +# Number of replicas per controller currently dropping events +count by (controller) ( + rate(circuit_breaker_events_total{result="Dropped"}[5m]) > 0 +) + +# Estimated number of circuit breaker opens over 5 minutes +sum by (controller) ( + increase(circuit_breaker_opens_total[5m]) +) + +# Alert condition: controllers under high watch pressure (severe overload) +sum by (controller) ( + rate(circuit_breaker_events_total{result="Dropped"}[5m]) +) > 1 +``` + +**Recommended alerts:** +```yaml +# Alert when circuit breaker is consistently dropping events +- alert: CircuitBreakerDropRatioHigh + expr: | + ( + sum by (controller)(rate(circuit_breaker_events_total{result="Dropped"}[5m])) + / + sum by (controller)(rate(circuit_breaker_events_total[5m])) + ) > 0.2 + for: 5m + labels: + severity: critical + annotations: + summary: "High circuit breaker drop ratio for {{ $labels.controller }}" + description: "More than 20% of events are being dropped by the circuit breaker for {{ $labels.controller }}, indicating sustained overload." + +# Alert when circuit breaker opens frequently +- alert: CircuitBreakerFrequentOpens + expr: | + sum by (controller) ( + rate(circuit_breaker_opens_total[5m]) + ) * 3600 > 6 + for: 15m + labels: + severity: warning + annotations: + summary: "Frequent circuit breaker opens for {{ $labels.controller }}" + description: "Circuit breaker for {{ $labels.controller }} is opening more than 6 times per hour, indicating reconciliation thrashing." +``` + +For more information on the circuit breaker feature and configuration, see [Troubleshooting - Circuit breaker]({{< ref "troubleshoot-crossplane#circuit-breaker-for-reconciliation-thrashing" >}}). + ## Provider metrics Crossplane providers emit these metrics. All providers built with crossplane-runtime emit the `crossplane_managed_resource_*` metrics. diff --git a/content/master/guides/troubleshoot-crossplane.md b/content/master/guides/troubleshoot-crossplane.md index 99f1d37f8..9f73131cb 100644 --- a/content/master/guides/troubleshoot-crossplane.md +++ b/content/master/guides/troubleshoot-crossplane.md @@ -195,6 +195,56 @@ For example, for a `CloudSQLInstance` managed resource (`database.gcp.crossplane kubectl patch cloudsqlinstance my-db -p '{"metadata":{"finalizers": []}}' --type=merge ``` +## Circuit breaker for reconciliation thrashing + +Crossplane includes a circuit breaker mechanism to prevent reconciliation thrashing. Thrashing occurs when controllers fight over composed resource state or enter tight reconciliation loops that could impact cluster performance. + +### How the circuit breaker works + +Each Composite Resource (XR) has its own token bucket-based circuit breaker that monitors reconciliation rates: + +- **Burst (capacity)**: Maximum number of events allowed in quick succession (default: 100) +- **Refill rate**: Sustained event rate after burst is exhausted (default: 1 event per second) +- **Cooldown**: Duration the circuit stays open before attempting recovery (default: 5 minutes) + +When an XR receives too many watch events (exceeding the burst and refill rate), the circuit breaker opens and blocks most reconciliation requests. While the circuit is open, Crossplane allows one request every 30 seconds to probe for recovery. + +### Detecting circuit breaker activation + +When the circuit breaker opens, the affected XR shows a `Responsive` condition: + +```yaml +conditions: +- type: Responsive + status: "False" + reason: WatchCircuitOpen + message: "Too many watch events from ConfigMap/my-config (default). Allowing events periodically." +``` + +The message identifies which resource is causing excessive watch events, helping you pinpoint the source of thrashing. + +### Configuring circuit breaker parameters + +The default circuit breaker settings work well for most environments. However, you may need to adjust them based on your composition patterns and cluster size. For example, increase the burst and refill rate for large-scale deployments with many rapidly updating XRs, or decrease them if you want stricter protection against thrashing. + +Configure circuit breaker parameters using Crossplane startup arguments via Helm: + +```shell +helm install crossplane --namespace crossplane-system --create-namespace crossplane-stable/crossplane \ + --set args='{"--debug","--circuit-breaker-burst=500.0","--circuit-breaker-refill-rate=5.0","--circuit-breaker-cooldown=1m"}' +``` + +Available parameters: +- `--circuit-breaker-burst`: Maximum burst of events (default: 100.0) +- `--circuit-breaker-refill-rate`: Events per second for sustained rate (default: 1.0) +- `--circuit-breaker-cooldown`: Duration to keep circuit open (default: 5m0s) + +### Monitoring with metrics + +Monitor circuit breaker activity using these Prometheus metrics. + +See the [Metrics guide]({{< ref "metrics#circuit-breaker-metrics" >}}) for detailed metric information. + ## Tips, tricks, and troubleshooting This section covers some common tips, tricks, and troubleshooting steps diff --git a/content/v2.1/guides/metrics.md b/content/v2.1/guides/metrics.md index 5282ee685..c7ffd949c 100644 --- a/content/v2.1/guides/metrics.md +++ b/content/v2.1/guides/metrics.md @@ -21,7 +21,7 @@ These Prometheus annotations expose the metrics: prometheus.io/path: /metrics prometheus.io/port: "8080" prometheus.io/scrape: "true" -``` +``` ## Crossplane core metrics @@ -51,6 +51,113 @@ The Crossplane pod emits these metrics. | {{}}engine_watches_stopped_total{{}} | Total number of watches stopped | {{}} +### Circuit breaker metrics + +The circuit breaker prevents reconciliation thrashing by monitoring and rate-limiting watch events per Composite Resource (XR). These metrics help you identify and respond to excessive reconciliation activity. + +All circuit breaker metrics include a `controller` label formatted as `composite/.` (for example, `composite/xpostgresqlinstances.example.com`), providing visibility per XRD without creating high cardinality from individual XR instances. + +#### circuit_breaker_opens_total + +Tracks when a circuit breaker transitions from closed to open state. An increase indicates an XR is receiving excessive watch events and has triggered throttling. + +**Use this metric to:** +- Alert on XRs experiencing reconciliation thrashing +- Identify which XRD types are prone to excessive watch events +- Track the frequency of circuit breaker activations + +**Example PromQL queries:** +```promql +# Rate of circuit breaker opens over 5 minutes +rate(circuit_breaker_opens_total[5m]) + +# Count of circuit breaker opens by controller +sum by (controller) (circuit_breaker_opens_total) +``` + +#### circuit_breaker_closes_total + +Tracks when a circuit breaker transitions from open to closed state. This indicates an XR has recovered from excessive watch events and returned to normal operation. + +**Use this metric to:** +- Monitor recovery from reconciliation thrashing +- Verify circuit breakers are closing after cooldown periods +- Track overall circuit breaker lifecycle + +#### circuit_breaker_events_total + +Tracks all watch events processed by the circuit breaker, labeled by `result`: +- `Allowed`: Normal operation when circuit is closed - events proceed to reconciliation +- `Dropped`: Events blocked when circuit is fully open - indicates active throttling +- `Halfopen_allowed`: Limited probe events when circuit is half-open - circuit is testing for recovery + +**Use this metric to:** +- Monitor the volume of watch events per XR type +- Detect when events are being dropped (active throttling) +- Alert on high dropped event rates indicating potential issues +- Understand reconciliation pressure on specific controllers + +**Example PromQL queries:** +```promql +# Rate of dropped events (active throttling), aggregated per controller +sum by (controller) ( + rate(circuit_breaker_events_total{result="Dropped"}[5m]) +) + +# Percentage of events being dropped +sum by (controller) (rate(circuit_breaker_events_total{result="Dropped"}[5m])) +/ +sum by (controller) (rate(circuit_breaker_events_total[5m])) * 100 + +# Number of replicas per controller currently dropping events +count by (controller) ( + rate(circuit_breaker_events_total{result="Dropped"}[5m]) > 0 +) + +# Estimated number of circuit breaker opens over 5 minutes +sum by (controller) ( + increase(circuit_breaker_opens_total[5m]) +) + +# Alert condition: controllers under high watch pressure (severe overload) +sum by (controller) ( + rate(circuit_breaker_events_total{result="Dropped"}[5m]) +) > 1 +``` + +**Recommended alerts:** +```yaml +# Alert when circuit breaker is consistently dropping events +- alert: CircuitBreakerDropRatioHigh + expr: | + ( + sum by (controller)(rate(circuit_breaker_events_total{result="Dropped"}[5m])) + / + sum by (controller)(rate(circuit_breaker_events_total[5m])) + ) > 0.2 + for: 5m + labels: + severity: critical + annotations: + summary: "High circuit breaker drop ratio for {{ $labels.controller }}" + description: "More than 20% of events are being dropped by the circuit breaker for {{ $labels.controller }}, indicating sustained overload." + +# Alert when circuit breaker opens frequently +- alert: CircuitBreakerFrequentOpens + expr: | + sum by (controller) ( + rate(circuit_breaker_opens_total[5m]) + ) * 3600 > 6 + for: 15m + labels: + severity: warning + annotations: + summary: "Frequent circuit breaker opens for {{ $labels.controller }}" + description: "Circuit breaker for {{ $labels.controller }} is opening more than 6 times per hour, indicating reconciliation thrashing." +``` + +For more information on the circuit breaker feature and configuration, see [Troubleshooting - Circuit breaker]({{< ref "troubleshoot-crossplane#circuit-breaker-for-reconciliation-thrashing" >}}). + ## Provider metrics Crossplane providers emit these metrics. All providers built with crossplane-runtime emit the `crossplane_managed_resource_*` metrics. diff --git a/content/v2.1/guides/troubleshoot-crossplane.md b/content/v2.1/guides/troubleshoot-crossplane.md index 99f1d37f8..17f104b8e 100644 --- a/content/v2.1/guides/troubleshoot-crossplane.md +++ b/content/v2.1/guides/troubleshoot-crossplane.md @@ -100,7 +100,7 @@ spec: spec: containers: - name: package-runtime - args: + args: - --debug --- apiVersion: pkg.crossplane.io/v1 @@ -195,6 +195,56 @@ For example, for a `CloudSQLInstance` managed resource (`database.gcp.crossplane kubectl patch cloudsqlinstance my-db -p '{"metadata":{"finalizers": []}}' --type=merge ``` +## Circuit breaker for reconciliation thrashing + +Crossplane includes a circuit breaker mechanism to prevent reconciliation thrashing. Thrashing occurs when controllers fight over composed resource state or enter tight reconciliation loops that could impact cluster performance. + +### How the circuit breaker works + +Each Composite Resource (XR) has its own token bucket-based circuit breaker that monitors reconciliation rates: + +- **Burst (capacity)**: Maximum number of events allowed in quick succession (default: 100) +- **Refill rate**: Sustained event rate after burst is exhausted (default: 1 event per second) +- **Cooldown**: Duration the circuit stays open before attempting recovery (default: 5 minutes) + +When an XR receives too many watch events (exceeding the burst and refill rate), the circuit breaker opens and blocks most reconciliation requests. While the circuit is open, Crossplane allows one request every 30 seconds to probe for recovery. + +### Detecting circuit breaker activation + +When the circuit breaker opens, the affected XR shows a `Responsive` condition: + +```yaml +conditions: +- type: Responsive + status: "False" + reason: WatchCircuitOpen + message: "Too many watch events from ConfigMap/my-config (default). Allowing events periodically." +``` + +The message identifies which resource is causing excessive watch events, helping you pinpoint the source of thrashing. + +### Configuring circuit breaker parameters + +The default circuit breaker settings work well for most environments. However, you may need to adjust them based on your composition patterns and cluster size. For example, increase the burst and refill rate for large-scale deployments with many rapidly updating XRs, or decrease them if you want stricter protection against thrashing. + +Configure circuit breaker parameters using Crossplane startup arguments via Helm: + +```shell +helm install crossplane --namespace crossplane-system --create-namespace crossplane-stable/crossplane \ + --set args='{"--debug","--circuit-breaker-burst=500.0","--circuit-breaker-refill-rate=5.0","--circuit-breaker-cooldown=1m"}' +``` + +Available parameters: +- `--circuit-breaker-burst`: Maximum burst of events (default: 100.0) +- `--circuit-breaker-refill-rate`: Events per second for sustained rate (default: 1.0) +- `--circuit-breaker-cooldown`: Duration to keep circuit open (default: 5m0s) + +### Monitoring with metrics + +Monitor circuit breaker activity using these Prometheus metrics. + +See the [Metrics guide]({{< ref "metrics#circuit-breaker-metrics" >}}) for detailed metric information. + ## Tips, tricks, and troubleshooting This section covers some common tips, tricks, and troubleshooting steps From 1e48c8d8bbd95db0e565214824dba5016371f62a Mon Sep 17 00:00:00 2001 From: Christopher Haar Date: Thu, 29 Jan 2026 10:03:22 +0100 Subject: [PATCH 2/2] lint(vale): set correct spellings Signed-off-by: Christopher Haar --- content/master/guides/metrics.md | 59 +++++++++++++---- .../master/guides/troubleshoot-crossplane.md | 63 +++++++++++++++++-- content/v2.1/guides/metrics.md | 55 ++++++++++++---- .../v2.1/guides/troubleshoot-crossplane.md | 61 ++++++++++++++++-- 4 files changed, 203 insertions(+), 35 deletions(-) diff --git a/content/master/guides/metrics.md b/content/master/guides/metrics.md index 53a10a740..39944af4e 100644 --- a/content/master/guides/metrics.md +++ b/content/master/guides/metrics.md @@ -21,7 +21,7 @@ These Prometheus annotations expose the metrics: prometheus.io/path: /metrics prometheus.io/port: "8080" prometheus.io/scrape: "true" -``` +``` ## Crossplane core metrics @@ -42,22 +42,35 @@ The Crossplane pod emits these metrics. | {{}}function_run_function_response_cache_bytes_deleted_total{{}} | Total number of RunFunctionResponse bytes deleted from cache | | {{}}function_run_function_response_cache_read_seconds{{}} | Histogram of cache read latency (seconds) | | {{}}function_run_function_response_cache_write_seconds{{}} | Histogram of cache write latency (seconds) | -| {{}}circuit_breaker_opens_total{{}} | Number of times the XR circuit breaker transitioned from closed to open | -| {{}}circuit_breaker_closes_total{{}} | Number of times the XR circuit breaker transitioned from open to closed | -| {{}}circuit_breaker_events_total{{}} | Number of XR watch events handled by the circuit breaker, labeled by outcome | -| {{}}engine_controllers_started_total{{}} | Total number of controllers started | -| {{}}engine_controllers_stopped_total{{}} | Total number of controllers stopped | -| {{}}engine_watches_started_total{{}} | Total number of watches started | -| {{}}engine_watches_stopped_total{{}} | Total number of watches stopped | +| {{}}engine_controllers_started_total{{}} | Total number of controllers started | +| {{}}engine_controllers_stopped_total{{}} | Total number of controllers stopped | +| {{}}engine_watches_started_total{{}} | Total number of watches started | +| {{}}engine_watches_stopped_total{{}} | Total number of watches stopped | {{}} ### Circuit breaker metrics -The circuit breaker prevents reconciliation thrashing by monitoring and rate-limiting watch events per Composite Resource (XR). These metrics help you identify and respond to excessive reconciliation activity. + + +The circuit breaker prevents reconciliation thrashing by monitoring and rate-limiting watch events per Composite Resource (XR). Crossplane core emits these metrics to help you identify and respond to excessive reconciliation activity. + + + +{{< table "table table-hover table-striped table-sm">}} +| Metric Name | Description | +| --- | --- | +| {{}}circuit_breaker_opens_total{{}} | Number of times the XR circuit breaker transitioned from closed to open | +| {{}}circuit_breaker_closes_total{{}} | Number of times the XR circuit breaker transitioned from open to closed | +| {{}}circuit_breaker_events_total{{}} | Number of XR watch events handled by the circuit breaker, labeled by outcome | +{{}} All circuit breaker metrics include a `controller` label formatted as `composite/.` (for example, `composite/xpostgresqlinstances.example.com`), providing visibility per XRD without creating high cardinality from individual XR instances. + + #### circuit_breaker_opens_total + + Tracks when a circuit breaker transitions from closed to open state. An increase indicates an XR is receiving excessive watch events and has triggered throttling. @@ -66,7 +79,9 @@ Tracks when a circuit breaker transitions from closed to open state. An increase - Identify which XRD types are prone to excessive watch events - Track the frequency of circuit breaker activations + **Example PromQL queries:** + ```promql # Rate of circuit breaker opens over 5 minutes rate(circuit_breaker_opens_total[5m]) @@ -75,29 +90,47 @@ rate(circuit_breaker_opens_total[5m]) sum by (controller) (circuit_breaker_opens_total) ``` + + #### circuit_breaker_closes_total + + Tracks when a circuit breaker transitions from open to closed state. This indicates an XR has recovered from excessive watch events and returned to normal operation. **Use this metric to:** + - Monitor recovery from reconciliation thrashing + + - Verify circuit breakers are closing after cooldown periods -- Track overall circuit breaker lifecycle + +- Track circuit breaker lifecycle + + #### circuit_breaker_events_total + + Tracks all watch events processed by the circuit breaker, labeled by `result`: + - `Allowed`: Normal operation when circuit is closed - events proceed to reconciliation + - `Dropped`: Events blocked when circuit is fully open - indicates active throttling -- `Halfopen_allowed`: Limited probe events when circuit is half-open - circuit is testing for recovery + +- `HalfOpenAllowed`: Limited probe events when circuit is half-open - circuit is testing for recovery + **Use this metric to:** -- Monitor the volume of watch events per XR type -- Detect when events are being dropped (active throttling) +- Track the volume of watch events per XR type +- Detect when the circuit drops events (active throttling) - Alert on high dropped event rates indicating potential issues - Understand reconciliation pressure on specific controllers + **Example PromQL queries:** + ```promql # Rate of dropped events (active throttling), aggregated per controller sum by (controller) ( diff --git a/content/master/guides/troubleshoot-crossplane.md b/content/master/guides/troubleshoot-crossplane.md index 9f73131cb..38a5fcd81 100644 --- a/content/master/guides/troubleshoot-crossplane.md +++ b/content/master/guides/troubleshoot-crossplane.md @@ -100,7 +100,7 @@ spec: spec: containers: - name: package-runtime - args: + args: - --debug --- apiVersion: pkg.crossplane.io/v1 @@ -197,21 +197,31 @@ kubectl patch cloudsqlinstance my-db -p '{"metadata":{"finalizers": []}}' --type ## Circuit breaker for reconciliation thrashing + Crossplane includes a circuit breaker mechanism to prevent reconciliation thrashing. Thrashing occurs when controllers fight over composed resource state or enter tight reconciliation loops that could impact cluster performance. + ### How the circuit breaker works + Each Composite Resource (XR) has its own token bucket-based circuit breaker that monitors reconciliation rates: + - **Burst (capacity)**: Maximum number of events allowed in quick succession (default: 100) -- **Refill rate**: Sustained event rate after burst is exhausted (default: 1 event per second) + +- **Refill rate**: Sustained event rate after the burst capacity is exhausted (default: 1 event per second) + + - **Cooldown**: Duration the circuit stays open before attempting recovery (default: 5 minutes) + + When an XR receives too many watch events (exceeding the burst and refill rate), the circuit breaker opens and blocks most reconciliation requests. While the circuit is open, Crossplane allows one request every 30 seconds to probe for recovery. + ### Detecting circuit breaker activation -When the circuit breaker opens, the affected XR shows a `Responsive` condition: +XRs have a `Responsive` condition that tracks circuit breaker state. When the circuit breaker opens, this condition changes to `False`: ```yaml conditions: @@ -223,25 +233,66 @@ conditions: The message identifies which resource is causing excessive watch events, helping you pinpoint the source of thrashing. +### Identifying and fixing root causes + + +Watch events occur when resources change in your cluster. Excessive watch events typically indicate composition patterns that cause loops, such as resources updating each other in cycles or external systems reverting changes made by Crossplane. + +**To identify the source of excessive watch events:** + +The XR's `Responsive` condition message identifies the problematic resource. Monitor this resource for modification events: + + +```shell +kubectl get -n --output-watch-events --watch-only +``` + +**Common root causes and fixes:** + + +- **Feedback loops in patches**: Review Composition patches for logic that creates circular updates where changes trigger more changes +- **External controller conflicts**: Other controllers or operators might modify the same resources, fighting with Crossplane for control +- **Frequent connection detail updates**: Consider if all fields need to be in connection details, as updates to connection secrets trigger watch events + +Investigate and fix the root cause before adjusting circuit breaker thresholds. + + ### Configuring circuit breaker parameters -The default circuit breaker settings work well for most environments. However, you may need to adjust them based on your composition patterns and cluster size. For example, increase the burst and refill rate for large-scale deployments with many rapidly updating XRs, or decrease them if you want stricter protection against thrashing. + + +The default circuit breaker settings work well for most environments. You may need to adjust them based on your composition patterns and cluster size. + + + + +For example, increase the burst and refill rate for large-scale deployments with XRs updating frequently, or decrease them if you want stricter protection against thrashing. + + Configure circuit breaker parameters using Crossplane startup arguments via Helm: ```shell helm install crossplane --namespace crossplane-system --create-namespace crossplane-stable/crossplane \ - --set args='{"--debug","--circuit-breaker-burst=500.0","--circuit-breaker-refill-rate=5.0","--circuit-breaker-cooldown=1m"}' + --set args='{"--circuit-breaker-burst=500.0","--circuit-breaker-refill-rate=5.0","--circuit-breaker-cooldown=1m"}' ``` Available parameters: - `--circuit-breaker-burst`: Maximum burst of events (default: 100.0) - `--circuit-breaker-refill-rate`: Events per second for sustained rate (default: 1.0) + + + - `--circuit-breaker-cooldown`: Duration to keep circuit open (default: 5m0s) + + + ### Monitoring with metrics -Monitor circuit breaker activity using these Prometheus metrics. + +Track circuit breaker activity using these Prometheus metrics. + See the [Metrics guide]({{< ref "metrics#circuit-breaker-metrics" >}}) for detailed metric information. diff --git a/content/v2.1/guides/metrics.md b/content/v2.1/guides/metrics.md index c7ffd949c..6cc1a47a6 100644 --- a/content/v2.1/guides/metrics.md +++ b/content/v2.1/guides/metrics.md @@ -42,22 +42,35 @@ The Crossplane pod emits these metrics. | {{}}function_run_function_response_cache_bytes_deleted_total{{}} | Total number of RunFunctionResponse bytes deleted from cache | | {{}}function_run_function_response_cache_read_seconds{{}} | Histogram of cache read latency (seconds) | | {{}}function_run_function_response_cache_write_seconds{{}} | Histogram of cache write latency (seconds) | -| {{}}circuit_breaker_opens_total{{}} | Number of times the XR circuit breaker transitioned from closed to open | -| {{}}circuit_breaker_closes_total{{}} | Number of times the XR circuit breaker transitioned from open to closed | -| {{}}circuit_breaker_events_total{{}} | Number of XR watch events handled by the circuit breaker, labeled by outcome | -| {{}}engine_controllers_started_total{{}} | Total number of controllers started | -| {{}}engine_controllers_stopped_total{{}} | Total number of controllers stopped | -| {{}}engine_watches_started_total{{}} | Total number of watches started | -| {{}}engine_watches_stopped_total{{}} | Total number of watches stopped | +| {{}}engine_controllers_started_total{{}} | Total number of controllers started | +| {{}}engine_controllers_stopped_total{{}} | Total number of controllers stopped | +| {{}}engine_watches_started_total{{}} | Total number of watches started | +| {{}}engine_watches_stopped_total{{}} | Total number of watches stopped | {{}} ### Circuit breaker metrics -The circuit breaker prevents reconciliation thrashing by monitoring and rate-limiting watch events per Composite Resource (XR). These metrics help you identify and respond to excessive reconciliation activity. + + +The circuit breaker prevents reconciliation thrashing by monitoring and rate-limiting watch events per Composite Resource (XR). Crossplane core emits these metrics to help you identify and respond to excessive reconciliation activity. + + + +{{< table "table table-hover table-striped table-sm">}} +| Metric Name | Description | +| --- | --- | +| {{}}circuit_breaker_opens_total{{}} | Number of times the XR circuit breaker transitioned from closed to open | +| {{}}circuit_breaker_closes_total{{}} | Number of times the XR circuit breaker transitioned from open to closed | +| {{}}circuit_breaker_events_total{{}} | Number of XR watch events handled by the circuit breaker, labeled by outcome | +{{}} All circuit breaker metrics include a `controller` label formatted as `composite/.` (for example, `composite/xpostgresqlinstances.example.com`), providing visibility per XRD without creating high cardinality from individual XR instances. + + #### circuit_breaker_opens_total + + Tracks when a circuit breaker transitions from closed to open state. An increase indicates an XR is receiving excessive watch events and has triggered throttling. @@ -66,7 +79,9 @@ Tracks when a circuit breaker transitions from closed to open state. An increase - Identify which XRD types are prone to excessive watch events - Track the frequency of circuit breaker activations + **Example PromQL queries:** + ```promql # Rate of circuit breaker opens over 5 minutes rate(circuit_breaker_opens_total[5m]) @@ -75,29 +90,47 @@ rate(circuit_breaker_opens_total[5m]) sum by (controller) (circuit_breaker_opens_total) ``` + + #### circuit_breaker_closes_total + + Tracks when a circuit breaker transitions from open to closed state. This indicates an XR has recovered from excessive watch events and returned to normal operation. **Use this metric to:** + - Monitor recovery from reconciliation thrashing + + - Verify circuit breakers are closing after cooldown periods -- Track overall circuit breaker lifecycle + +- Track circuit breaker lifecycle + + #### circuit_breaker_events_total + + Tracks all watch events processed by the circuit breaker, labeled by `result`: + - `Allowed`: Normal operation when circuit is closed - events proceed to reconciliation + - `Dropped`: Events blocked when circuit is fully open - indicates active throttling + - `Halfopen_allowed`: Limited probe events when circuit is half-open - circuit is testing for recovery + **Use this metric to:** -- Monitor the volume of watch events per XR type -- Detect when events are being dropped (active throttling) +- Track the volume of watch events per XR type +- Detect when the circuit drops events (active throttling) - Alert on high dropped event rates indicating potential issues - Understand reconciliation pressure on specific controllers + **Example PromQL queries:** + ```promql # Rate of dropped events (active throttling), aggregated per controller sum by (controller) ( diff --git a/content/v2.1/guides/troubleshoot-crossplane.md b/content/v2.1/guides/troubleshoot-crossplane.md index 17f104b8e..38a5fcd81 100644 --- a/content/v2.1/guides/troubleshoot-crossplane.md +++ b/content/v2.1/guides/troubleshoot-crossplane.md @@ -197,21 +197,31 @@ kubectl patch cloudsqlinstance my-db -p '{"metadata":{"finalizers": []}}' --type ## Circuit breaker for reconciliation thrashing + Crossplane includes a circuit breaker mechanism to prevent reconciliation thrashing. Thrashing occurs when controllers fight over composed resource state or enter tight reconciliation loops that could impact cluster performance. + ### How the circuit breaker works + Each Composite Resource (XR) has its own token bucket-based circuit breaker that monitors reconciliation rates: + - **Burst (capacity)**: Maximum number of events allowed in quick succession (default: 100) -- **Refill rate**: Sustained event rate after burst is exhausted (default: 1 event per second) + +- **Refill rate**: Sustained event rate after the burst capacity is exhausted (default: 1 event per second) + + - **Cooldown**: Duration the circuit stays open before attempting recovery (default: 5 minutes) + + When an XR receives too many watch events (exceeding the burst and refill rate), the circuit breaker opens and blocks most reconciliation requests. While the circuit is open, Crossplane allows one request every 30 seconds to probe for recovery. + ### Detecting circuit breaker activation -When the circuit breaker opens, the affected XR shows a `Responsive` condition: +XRs have a `Responsive` condition that tracks circuit breaker state. When the circuit breaker opens, this condition changes to `False`: ```yaml conditions: @@ -223,25 +233,66 @@ conditions: The message identifies which resource is causing excessive watch events, helping you pinpoint the source of thrashing. +### Identifying and fixing root causes + + +Watch events occur when resources change in your cluster. Excessive watch events typically indicate composition patterns that cause loops, such as resources updating each other in cycles or external systems reverting changes made by Crossplane. + +**To identify the source of excessive watch events:** + +The XR's `Responsive` condition message identifies the problematic resource. Monitor this resource for modification events: + + +```shell +kubectl get -n --output-watch-events --watch-only +``` + +**Common root causes and fixes:** + + +- **Feedback loops in patches**: Review Composition patches for logic that creates circular updates where changes trigger more changes +- **External controller conflicts**: Other controllers or operators might modify the same resources, fighting with Crossplane for control +- **Frequent connection detail updates**: Consider if all fields need to be in connection details, as updates to connection secrets trigger watch events + +Investigate and fix the root cause before adjusting circuit breaker thresholds. + + ### Configuring circuit breaker parameters -The default circuit breaker settings work well for most environments. However, you may need to adjust them based on your composition patterns and cluster size. For example, increase the burst and refill rate for large-scale deployments with many rapidly updating XRs, or decrease them if you want stricter protection against thrashing. + + +The default circuit breaker settings work well for most environments. You may need to adjust them based on your composition patterns and cluster size. + + + + +For example, increase the burst and refill rate for large-scale deployments with XRs updating frequently, or decrease them if you want stricter protection against thrashing. + + Configure circuit breaker parameters using Crossplane startup arguments via Helm: ```shell helm install crossplane --namespace crossplane-system --create-namespace crossplane-stable/crossplane \ - --set args='{"--debug","--circuit-breaker-burst=500.0","--circuit-breaker-refill-rate=5.0","--circuit-breaker-cooldown=1m"}' + --set args='{"--circuit-breaker-burst=500.0","--circuit-breaker-refill-rate=5.0","--circuit-breaker-cooldown=1m"}' ``` Available parameters: - `--circuit-breaker-burst`: Maximum burst of events (default: 100.0) - `--circuit-breaker-refill-rate`: Events per second for sustained rate (default: 1.0) + + + - `--circuit-breaker-cooldown`: Duration to keep circuit open (default: 5m0s) + + + ### Monitoring with metrics -Monitor circuit breaker activity using these Prometheus metrics. + +Track circuit breaker activity using these Prometheus metrics. + See the [Metrics guide]({{< ref "metrics#circuit-breaker-metrics" >}}) for detailed metric information.