diff --git a/content/master/guides/metrics.md b/content/master/guides/metrics.md
index 7fa40c1b6..39944af4e 100644
--- a/content/master/guides/metrics.md
+++ b/content/master/guides/metrics.md
@@ -21,7 +21,7 @@ These Prometheus annotations expose the metrics:
 prometheus.io/path: /metrics
 prometheus.io/port: "8080"
 prometheus.io/scrape: "true"
-``` 
+```
 
 ## Crossplane core metrics
 
@@ -42,15 +42,155 @@ The Crossplane pod emits these metrics.
 | {{}}function_run_function_response_cache_bytes_deleted_total{{}} | Total number of RunFunctionResponse bytes deleted from cache |
 | {{}}function_run_function_response_cache_read_seconds{{}} | Histogram of cache read latency (seconds) |
 | {{}}function_run_function_response_cache_write_seconds{{}} | Histogram of cache write latency (seconds) |
-| {{}}circuit_breaker_opens_total{{}} | Number of times the XR circuit breaker transitioned from closed to open |
-| {{}}circuit_breaker_closes_total{{}} | Number of times the XR circuit breaker transitioned from open to closed |
-| {{}}circuit_breaker_events_total{{}} | Number of XR watch events handled by the circuit breaker, labeled by outcome |
-| {{}}engine_controllers_started_total{{}} | Total number of controllers started |
-| {{}}engine_controllers_stopped_total{{}} | Total number of controllers stopped |
-| {{}}engine_watches_started_total{{}} | Total number of watches started |
-| {{}}engine_watches_stopped_total{{}} | Total number of watches stopped |
+| {{}}engine_controllers_started_total{{}} | Total number of controllers started |
+| {{}}engine_controllers_stopped_total{{}} | Total number of controllers stopped |
+| {{}}engine_watches_started_total{{}} | Total number of watches started |
+| {{}}engine_watches_stopped_total{{}} | Total number of watches stopped |
+{{< /table >}}
+
+### Circuit breaker metrics
+
+The circuit breaker prevents reconciliation thrashing by monitoring and rate-limiting watch events per Composite Resource (XR). Crossplane core emits these metrics to help you identify and respond to excessive reconciliation activity.
+
+{{< table "table table-hover table-striped table-sm">}}
+| Metric Name | Description |
+| --- | --- |
+| {{}}circuit_breaker_opens_total{{}} | Number of times the XR circuit breaker transitioned from closed to open |
+| {{}}circuit_breaker_closes_total{{}} | Number of times the XR circuit breaker transitioned from open to closed |
+| {{}}circuit_breaker_events_total{{}} | Number of XR watch events handled by the circuit breaker, labeled by outcome |
+{{< /table >}}
+
+All circuit breaker metrics include a `controller` label formatted as `composite/<plural>.<group>` (for example, `composite/xpostgresqlinstances.example.com`), providing visibility per XRD without creating high cardinality from individual XR instances.
+
+#### circuit_breaker_opens_total
+
+Tracks when a circuit breaker transitions from the closed state to the open state. An increase indicates an XR is receiving excessive watch events and has triggered throttling.
+
+**Use this metric to:**
+- Alert on XRs experiencing reconciliation thrashing
+- Identify which XRD types are prone to excessive watch events
+- Track the frequency of circuit breaker activations
+
+**Example PromQL queries:**
+
+```promql
+# Rate of circuit breaker opens over 5 minutes
+rate(circuit_breaker_opens_total[5m])
+
+# Count of circuit breaker opens by controller
+sum by (controller) (circuit_breaker_opens_total)
+```
+
+#### circuit_breaker_closes_total
+
+Tracks when a circuit breaker transitions from the open state back to the closed state. This indicates an XR has recovered from excessive watch events and returned to normal operation.
+
+**Use this metric to:**
+- Monitor recovery from reconciliation thrashing
+- Verify circuit breakers close again after their cooldown period
+- Track the circuit breaker lifecycle
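+
+**Example PromQL queries** (a sketch in the style of the `circuit_breaker_opens_total` examples, assuming the shared `controller` label described above):
+
+```promql
+# Rate of circuit breaker closes (recoveries) over 5 minutes
+rate(circuit_breaker_closes_total[5m])
+
+# Controllers whose breakers opened in the past hour without closing
+# again, suggesting a circuit that's stuck open
+sum by (controller) (increase(circuit_breaker_opens_total[1h]))
+- sum by (controller) (increase(circuit_breaker_closes_total[1h])) > 0
+```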
+
+#### circuit_breaker_events_total
+
+Tracks all watch events processed by the circuit breaker, labeled by `result`:
+
+- `Allowed`: Normal operation while the circuit is closed; events proceed to reconciliation
+- `Dropped`: Events blocked while the circuit is fully open; indicates active throttling
+- `HalfOpenAllowed`: Limited probe events while the circuit is half-open, testing for recovery
+
+**Use this metric to:**
+- Track the volume of watch events per XR type
+- Detect when the circuit drops events (active throttling)
+- Alert on high dropped-event rates that indicate potential issues
+- Understand reconciliation pressure on specific controllers
+
+**Example PromQL queries:**
+
+```promql
+# Rate of dropped events (active throttling), aggregated per controller
+sum by (controller) (
+  rate(circuit_breaker_events_total{result="Dropped"}[5m])
+)
+
+# Percentage of events being dropped
+sum by (controller) (rate(circuit_breaker_events_total{result="Dropped"}[5m]))
+/
+sum by (controller) (rate(circuit_breaker_events_total[5m])) * 100
+
+# Number of replicas per controller currently dropping events
+count by (controller) (
+  rate(circuit_breaker_events_total{result="Dropped"}[5m]) > 0
+)
+
+# Estimated number of circuit breaker opens over 5 minutes
+sum by (controller) (
+  increase(circuit_breaker_opens_total[5m])
+)
+
+# Alert condition: controllers under high watch pressure (severe overload)
+sum by (controller) (
+  rate(circuit_breaker_events_total{result="Dropped"}[5m])
+) > 1
+```
+
+**Recommended alerts:**
+
+```yaml
+# Alert when the circuit breaker is consistently dropping events
+- alert: CircuitBreakerDropRatioHigh
+  expr: |
+    (
+      sum by (controller)(rate(circuit_breaker_events_total{result="Dropped"}[5m]))
+      /
+      sum by (controller)(rate(circuit_breaker_events_total[5m]))
+    ) > 0.2
+  for: 5m
+  labels:
+    severity: critical
+  annotations:
+    summary: "High circuit breaker drop ratio for {{ $labels.controller }}"
+    description: "More than 20% of events are being dropped by the circuit breaker for {{ $labels.controller }}, indicating sustained overload."
+
+# Alert when the circuit breaker opens frequently
+- alert: CircuitBreakerFrequentOpens
+  expr: |
+    sum by (controller) (
+      rate(circuit_breaker_opens_total[5m])
+    ) * 3600 > 6
+  for: 15m
+  labels:
+    severity: warning
+  annotations:
+    summary: "Frequent circuit breaker opens for {{ $labels.controller }}"
+    description: "Circuit breaker for {{ $labels.controller }} is opening more than 6 times per hour, indicating reconciliation thrashing."
+```
+
+For more information on the circuit breaker feature and configuration, see [Troubleshooting - Circuit breaker]({{< ref "troubleshoot-crossplane#circuit-breaker-for-reconciliation-thrashing" >}}).
+
 ## Provider metrics
 
 Crossplane providers emit these metrics. All providers built with crossplane-runtime emit the `crossplane_managed_resource_*` metrics.
diff --git a/content/master/guides/troubleshoot-crossplane.md b/content/master/guides/troubleshoot-crossplane.md
index 99f1d37f8..38a5fcd81 100644
--- a/content/master/guides/troubleshoot-crossplane.md
+++ b/content/master/guides/troubleshoot-crossplane.md
@@ -100,7 +100,7 @@ spec:
   spec:
     containers:
     - name: package-runtime
-      args: 
+      args:
       - --debug
 ---
 apiVersion: pkg.crossplane.io/v1
@@ -195,6 +195,107 @@ For example, for a `CloudSQLInstance` managed resource (`database.gcp.crossplane
 kubectl patch cloudsqlinstance my-db -p '{"metadata":{"finalizers": []}}' --type=merge
 ```
 
+## Circuit breaker for reconciliation thrashing
+
+Crossplane includes a circuit breaker mechanism to prevent reconciliation thrashing. Thrashing occurs when controllers fight over composed resource state or enter tight reconciliation loops that could impact cluster performance.
+
+### How the circuit breaker works
+
+Each Composite Resource (XR) has its own token-bucket-based circuit breaker that monitors reconciliation rates:
+
+- **Burst (capacity)**: Maximum number of events allowed in quick succession (default: 100)
+- **Refill rate**: Sustained event rate after the burst capacity is exhausted (default: 1 event per second)
+- **Cooldown**: Duration the circuit stays open before attempting recovery (default: 5 minutes)
+
+When an XR receives too many watch events (exceeding the burst and refill rate), the circuit breaker opens and blocks most reconciliation requests. With the defaults, an XR can absorb a burst of up to 100 events, after which sustained traffic above 1 event per second opens the circuit. While the circuit is open, Crossplane allows one request every 30 seconds to probe for recovery.
+
+### Detecting circuit breaker activation
+
+XRs have a `Responsive` condition that tracks circuit breaker state. When the circuit breaker opens, this condition changes to `False`:
+
+```yaml
+conditions:
+- type: Responsive
+  status: "False"
+  reason: WatchCircuitOpen
+  message: "Too many watch events from ConfigMap/my-config (default). Allowing events periodically."
+```
+
+The message identifies which resource is causing excessive watch events, helping you pinpoint the source of thrashing.
+
+### Identifying and fixing root causes
+
+Watch events occur when resources change in your cluster. Excessive watch events typically indicate composition patterns that cause loops, such as resources updating each other in cycles or external systems reverting changes made by Crossplane.
+
+**To identify the source of excessive watch events:**
+
+The XR's `Responsive` condition message identifies the problematic resource. Monitor this resource for modification events:
+
+```shell
+kubectl get <kind> <name> -n <namespace> --output-watch-events --watch-only
+```
+
+**Common root causes and fixes:**
+
+- **Feedback loops in patches**: Review Composition patches for logic that creates circular updates, where each change triggers another change
+- **External controller conflicts**: Other controllers or operators might modify the same resources, fighting with Crossplane for control
+- **Frequent connection detail updates**: Consider whether all fields need to be in connection details, as updates to connection secrets trigger watch events
+
+Investigate and fix the root cause before adjusting circuit breaker thresholds.
+
+### Configuring circuit breaker parameters
+
+The default circuit breaker settings work well for most environments. You may need to adjust them based on your composition patterns and cluster size.
+
+For example, increase the burst and refill rate for large-scale deployments with XRs updating frequently, or decrease them if you want stricter protection against thrashing.
+
+Configure circuit breaker parameters using Crossplane startup arguments via Helm:
+
+```shell
+helm install crossplane --namespace crossplane-system --create-namespace crossplane-stable/crossplane \
+  --set args='{"--circuit-breaker-burst=500.0","--circuit-breaker-refill-rate=5.0","--circuit-breaker-cooldown=1m"}'
+```
+
+Available parameters:
+- `--circuit-breaker-burst`: Maximum burst of events (default: 100.0)
+- `--circuit-breaker-refill-rate`: Events per second for sustained rate (default: 1.0)
+- `--circuit-breaker-cooldown`: Duration to keep circuit open (default: 5m0s)
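+
+If you manage the Helm release with a values file instead of `--set`, the equivalent configuration is a sketch like the following, assuming the same chart `args` value the command above sets:
+
+```yaml
+# values.yaml -- a sketch mirroring the --set example above
+args:
+- --circuit-breaker-burst=500.0
+- --circuit-breaker-refill-rate=5.0
+- --circuit-breaker-cooldown=1m
+```
+
+Install it with `helm install crossplane crossplane-stable/crossplane --namespace crossplane-system --create-namespace -f values.yaml`.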
+
+### Monitoring with metrics
+
+Track circuit breaker activity using the `circuit_breaker_opens_total`, `circuit_breaker_closes_total`, and `circuit_breaker_events_total` Prometheus metrics.
+
+See the [Metrics guide]({{< ref "metrics#circuit-breaker-metrics" >}}) for detailed metric information.
+
 ## Tips, tricks, and troubleshooting
 
 This section covers some common tips, tricks, and troubleshooting steps
diff --git a/content/v2.1/guides/metrics.md b/content/v2.1/guides/metrics.md
index 5282ee685..6cc1a47a6 100644
--- a/content/v2.1/guides/metrics.md
+++ b/content/v2.1/guides/metrics.md
@@ -21,7 +21,7 @@ These Prometheus annotations expose the metrics:
 prometheus.io/path: /metrics
 prometheus.io/port: "8080"
 prometheus.io/scrape: "true"
-``` 
+```
 
 ## Crossplane core metrics
 
@@ -42,15 +42,155 @@ The Crossplane pod emits these metrics.
 | {{}}function_run_function_response_cache_bytes_deleted_total{{}} | Total number of RunFunctionResponse bytes deleted from cache |
 | {{}}function_run_function_response_cache_read_seconds{{}} | Histogram of cache read latency (seconds) |
 | {{}}function_run_function_response_cache_write_seconds{{}} | Histogram of cache write latency (seconds) |
-| {{}}circuit_breaker_opens_total{{}} | Number of times the XR circuit breaker transitioned from closed to open |
-| {{}}circuit_breaker_closes_total{{}} | Number of times the XR circuit breaker transitioned from open to closed |
-| {{}}circuit_breaker_events_total{{}} | Number of XR watch events handled by the circuit breaker, labeled by outcome |
-| {{}}engine_controllers_started_total{{}} | Total number of controllers started |
-| {{}}engine_controllers_stopped_total{{}} | Total number of controllers stopped |
-| {{}}engine_watches_started_total{{}} | Total number of watches started |
-| {{}}engine_watches_stopped_total{{}} | Total number of watches stopped |
+| {{}}engine_controllers_started_total{{}} | Total number of controllers started |
+| {{}}engine_controllers_stopped_total{{}} | Total number of controllers stopped |
+| {{}}engine_watches_started_total{{}} | Total number of watches started |
+| {{}}engine_watches_stopped_total{{}} | Total number of watches stopped |
+{{< /table >}}
+
+### Circuit breaker metrics
+
+The circuit breaker prevents reconciliation thrashing by monitoring and rate-limiting watch events per Composite Resource (XR). Crossplane core emits these metrics to help you identify and respond to excessive reconciliation activity.
+
+{{< table "table table-hover table-striped table-sm">}}
+| Metric Name | Description |
+| --- | --- |
+| {{}}circuit_breaker_opens_total{{}} | Number of times the XR circuit breaker transitioned from closed to open |
+| {{}}circuit_breaker_closes_total{{}} | Number of times the XR circuit breaker transitioned from open to closed |
+| {{}}circuit_breaker_events_total{{}} | Number of XR watch events handled by the circuit breaker, labeled by outcome |
+{{< /table >}}
+
+All circuit breaker metrics include a `controller` label formatted as `composite/<plural>.<group>` (for example, `composite/xpostgresqlinstances.example.com`), providing visibility per XRD without creating high cardinality from individual XR instances.
+
+#### circuit_breaker_opens_total
+
+Tracks when a circuit breaker transitions from the closed state to the open state. An increase indicates an XR is receiving excessive watch events and has triggered throttling.
+
+**Use this metric to:**
+- Alert on XRs experiencing reconciliation thrashing
+- Identify which XRD types are prone to excessive watch events
+- Track the frequency of circuit breaker activations
+
+**Example PromQL queries:**
+
+```promql
+# Rate of circuit breaker opens over 5 minutes
+rate(circuit_breaker_opens_total[5m])
+
+# Count of circuit breaker opens by controller
+sum by (controller) (circuit_breaker_opens_total)
+```
+
+#### circuit_breaker_closes_total
+
+Tracks when a circuit breaker transitions from the open state back to the closed state. This indicates an XR has recovered from excessive watch events and returned to normal operation.
+
+**Use this metric to:**
+- Monitor recovery from reconciliation thrashing
+- Verify circuit breakers close again after their cooldown period
+- Track the circuit breaker lifecycle
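+
+**Example PromQL queries** (a sketch in the style of the `circuit_breaker_opens_total` examples, assuming the shared `controller` label described above):
+
+```promql
+# Rate of circuit breaker closes (recoveries) over 5 minutes
+rate(circuit_breaker_closes_total[5m])
+
+# Controllers whose breakers opened in the past hour without closing
+# again, suggesting a circuit that's stuck open
+sum by (controller) (increase(circuit_breaker_opens_total[1h]))
+- sum by (controller) (increase(circuit_breaker_closes_total[1h])) > 0
+```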
+
+#### circuit_breaker_events_total
+
+Tracks all watch events processed by the circuit breaker, labeled by `result`:
+
+- `Allowed`: Normal operation while the circuit is closed; events proceed to reconciliation
+- `Dropped`: Events blocked while the circuit is fully open; indicates active throttling
+- `HalfOpenAllowed`: Limited probe events while the circuit is half-open, testing for recovery
+
+**Use this metric to:**
+- Track the volume of watch events per XR type
+- Detect when the circuit drops events (active throttling)
+- Alert on high dropped-event rates that indicate potential issues
+- Understand reconciliation pressure on specific controllers
+
+**Example PromQL queries:**
+
+```promql
+# Rate of dropped events (active throttling), aggregated per controller
+sum by (controller) (
+  rate(circuit_breaker_events_total{result="Dropped"}[5m])
+)
+
+# Percentage of events being dropped
+sum by (controller) (rate(circuit_breaker_events_total{result="Dropped"}[5m]))
+/
+sum by (controller) (rate(circuit_breaker_events_total[5m])) * 100
+
+# Number of replicas per controller currently dropping events
+count by (controller) (
+  rate(circuit_breaker_events_total{result="Dropped"}[5m]) > 0
+)
+
+# Estimated number of circuit breaker opens over 5 minutes
+sum by (controller) (
+  increase(circuit_breaker_opens_total[5m])
+)
+
+# Alert condition: controllers under high watch pressure (severe overload)
+sum by (controller) (
+  rate(circuit_breaker_events_total{result="Dropped"}[5m])
+) > 1
+```
+
+**Recommended alerts:**
+
+```yaml
+# Alert when the circuit breaker is consistently dropping events
+- alert: CircuitBreakerDropRatioHigh
+  expr: |
+    (
+      sum by (controller)(rate(circuit_breaker_events_total{result="Dropped"}[5m]))
+      /
+      sum by (controller)(rate(circuit_breaker_events_total[5m]))
+    ) > 0.2
+  for: 5m
+  labels:
+    severity: critical
+  annotations:
+    summary: "High circuit breaker drop ratio for {{ $labels.controller }}"
+    description: "More than 20% of events are being dropped by the circuit breaker for {{ $labels.controller }}, indicating sustained overload."
+
+# Alert when the circuit breaker opens frequently
+- alert: CircuitBreakerFrequentOpens
+  expr: |
+    sum by (controller) (
+      rate(circuit_breaker_opens_total[5m])
+    ) * 3600 > 6
+  for: 15m
+  labels:
+    severity: warning
+  annotations:
+    summary: "Frequent circuit breaker opens for {{ $labels.controller }}"
+    description: "Circuit breaker for {{ $labels.controller }} is opening more than 6 times per hour, indicating reconciliation thrashing."
+```
+
+For more information on the circuit breaker feature and configuration, see [Troubleshooting - Circuit breaker]({{< ref "troubleshoot-crossplane#circuit-breaker-for-reconciliation-thrashing" >}}).
+
 ## Provider metrics
 
 Crossplane providers emit these metrics. All providers built with crossplane-runtime emit the `crossplane_managed_resource_*` metrics.
diff --git a/content/v2.1/guides/troubleshoot-crossplane.md b/content/v2.1/guides/troubleshoot-crossplane.md
index 99f1d37f8..38a5fcd81 100644
--- a/content/v2.1/guides/troubleshoot-crossplane.md
+++ b/content/v2.1/guides/troubleshoot-crossplane.md
@@ -100,7 +100,7 @@ spec:
   spec:
     containers:
    - name: package-runtime
-      args: 
+      args:
       - --debug
 ---
 apiVersion: pkg.crossplane.io/v1
@@ -195,6 +195,107 @@ For example, for a `CloudSQLInstance` managed resource (`database.gcp.crossplane
 kubectl patch cloudsqlinstance my-db -p '{"metadata":{"finalizers": []}}' --type=merge
 ```
 
+## Circuit breaker for reconciliation thrashing
+
+Crossplane includes a circuit breaker mechanism to prevent reconciliation thrashing. Thrashing occurs when controllers fight over composed resource state or enter tight reconciliation loops that could impact cluster performance.
+
+### How the circuit breaker works
+
+Each Composite Resource (XR) has its own token-bucket-based circuit breaker that monitors reconciliation rates:
+
+- **Burst (capacity)**: Maximum number of events allowed in quick succession (default: 100)
+- **Refill rate**: Sustained event rate after the burst capacity is exhausted (default: 1 event per second)
+- **Cooldown**: Duration the circuit stays open before attempting recovery (default: 5 minutes)
+
+When an XR receives too many watch events (exceeding the burst and refill rate), the circuit breaker opens and blocks most reconciliation requests. With the defaults, an XR can absorb a burst of up to 100 events, after which sustained traffic above 1 event per second opens the circuit. While the circuit is open, Crossplane allows one request every 30 seconds to probe for recovery.
+
+### Detecting circuit breaker activation
+
+XRs have a `Responsive` condition that tracks circuit breaker state. When the circuit breaker opens, this condition changes to `False`:
+
+```yaml
+conditions:
+- type: Responsive
+  status: "False"
+  reason: WatchCircuitOpen
+  message: "Too many watch events from ConfigMap/my-config (default). Allowing events periodically."
+```
+
+The message identifies which resource is causing excessive watch events, helping you pinpoint the source of thrashing.
+
+### Identifying and fixing root causes
+
+Watch events occur when resources change in your cluster. Excessive watch events typically indicate composition patterns that cause loops, such as resources updating each other in cycles or external systems reverting changes made by Crossplane.
+
+**To identify the source of excessive watch events:**
+
+The XR's `Responsive` condition message identifies the problematic resource. Monitor this resource for modification events:
+
+```shell
+kubectl get <kind> <name> -n <namespace> --output-watch-events --watch-only
+```
+
+**Common root causes and fixes:**
+
+- **Feedback loops in patches**: Review Composition patches for logic that creates circular updates, where each change triggers another change
+- **External controller conflicts**: Other controllers or operators might modify the same resources, fighting with Crossplane for control
+- **Frequent connection detail updates**: Consider whether all fields need to be in connection details, as updates to connection secrets trigger watch events
+
+Investigate and fix the root cause before adjusting circuit breaker thresholds.
+
+### Configuring circuit breaker parameters
+
+The default circuit breaker settings work well for most environments. You may need to adjust them based on your composition patterns and cluster size.
+
+For example, increase the burst and refill rate for large-scale deployments with XRs updating frequently, or decrease them if you want stricter protection against thrashing.
+
+Configure circuit breaker parameters using Crossplane startup arguments via Helm:
+
+```shell
+helm install crossplane --namespace crossplane-system --create-namespace crossplane-stable/crossplane \
+  --set args='{"--circuit-breaker-burst=500.0","--circuit-breaker-refill-rate=5.0","--circuit-breaker-cooldown=1m"}'
+```
+
+Available parameters:
+- `--circuit-breaker-burst`: Maximum burst of events (default: 100.0)
+- `--circuit-breaker-refill-rate`: Events per second for sustained rate (default: 1.0)
+- `--circuit-breaker-cooldown`: Duration to keep circuit open (default: 5m0s)
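+
+If you manage the Helm release with a values file instead of `--set`, the equivalent configuration is a sketch like the following, assuming the same chart `args` value the command above sets:
+
+```yaml
+# values.yaml -- a sketch mirroring the --set example above
+args:
+- --circuit-breaker-burst=500.0
+- --circuit-breaker-refill-rate=5.0
+- --circuit-breaker-cooldown=1m
+```
+
+Install it with `helm install crossplane crossplane-stable/crossplane --namespace crossplane-system --create-namespace -f values.yaml`.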
+
+### Monitoring with metrics
+
+Track circuit breaker activity using the `circuit_breaker_opens_total`, `circuit_breaker_closes_total`, and `circuit_breaker_events_total` Prometheus metrics.
+
+See the [Metrics guide]({{< ref "metrics#circuit-breaker-metrics" >}}) for detailed metric information.
+
 ## Tips, tricks, and troubleshooting
 
 This section covers some common tips, tricks, and troubleshooting steps