156 changes: 148 additions & 8 deletions content/master/guides/metrics.md
@@ -21,7 +21,7 @@ These Prometheus annotations expose the metrics:
prometheus.io/path: /metrics
prometheus.io/port: "8080"
prometheus.io/scrape: "true"
```

## Crossplane core metrics

@@ -42,15 +42,155 @@ The Crossplane pod emits these metrics.
| {{<hover label="function_run_function_response_cache_bytes_deleted_total" line="10">}}function_run_function_response_cache_bytes_deleted_total{{</hover>}} | Total number of RunFunctionResponse bytes deleted from cache |
| {{<hover label="function_run_function_response_cache_read_seconds" line="11">}}function_run_function_response_cache_read_seconds{{</hover>}} | Histogram of cache read latency (seconds) |
| {{<hover label="function_run_function_response_cache_write_seconds" line="12">}}function_run_function_response_cache_write_seconds{{</hover>}} | Histogram of cache write latency (seconds) |
| {{<hover label="engine_controllers_started_total" line="13">}}engine_controllers_started_total{{</hover>}} | Total number of controllers started |
| {{<hover label="engine_controllers_stopped_total" line="14">}}engine_controllers_stopped_total{{</hover>}} | Total number of controllers stopped |
| {{<hover label="engine_watches_started_total" line="15">}}engine_watches_started_total{{</hover>}} | Total number of watches started |
| {{<hover label="engine_watches_stopped_total" line="16">}}engine_watches_stopped_total{{</hover>}} | Total number of watches stopped |
{{</table >}}

### Circuit breaker metrics

<!-- vale Crossplane.Spelling = NO -->
<!-- vale write-good.Passive = NO -->
The circuit breaker prevents reconciliation thrashing by monitoring and rate-limiting watch events per Composite Resource (XR). Crossplane core emits these metrics to help you identify and respond to excessive reconciliation activity.
<!-- vale write-good.Passive = YES -->
<!-- vale Crossplane.Spelling = YES -->

{{< table "table table-hover table-striped table-sm">}}
| Metric Name | Description |
| --- | --- |
| {{<hover label="circuit_breaker_opens_total" line="1">}}circuit_breaker_opens_total{{</hover>}} | Number of times the XR circuit breaker transitioned from closed to open |
| {{<hover label="circuit_breaker_closes_total" line="2">}}circuit_breaker_closes_total{{</hover>}} | Number of times the XR circuit breaker transitioned from open to closed |
| {{<hover label="circuit_breaker_events_total" line="3">}}circuit_breaker_events_total{{</hover>}} | Number of XR watch events handled by the circuit breaker, labeled by outcome |
{{</table >}}

All circuit breaker metrics include a `controller` label formatted as `composite/<plural>.<group>` (for example, `composite/xpostgresqlinstances.example.com`), providing visibility per XRD without creating high cardinality from individual XR instances.
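
As a sketch of how the `controller` label can scope a query, the selectors below narrow `circuit_breaker_opens_total` to one XRD or to one API group. The label values reuse the illustrative `xpostgresqlinstances.example.com` XRD from the sentence above; they're assumptions, so substitute your own plural and group.

```promql
# Opens for a single XRD (exact match on the controller label)
circuit_breaker_opens_total{controller="composite/xpostgresqlinstances.example.com"}

# Opens for every XRD in one API group (regular expression match)
circuit_breaker_opens_total{controller=~"composite/.+\\.example\\.com"}
```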

<!-- vale Google.Headings = NO -->
<!-- vale Crossplane.Spelling = NO -->
#### circuit_breaker_opens_total
<!-- vale Crossplane.Spelling = YES -->
<!-- vale Google.Headings = YES -->

Tracks when a circuit breaker transitions from closed to open state. An increase indicates an XR is receiving excessive watch events and has triggered throttling.

**Use this metric to:**
- Alert on XRs experiencing reconciliation thrashing
- Identify which XRD types are prone to excessive watch events
- Track the frequency of circuit breaker activations

<!-- vale Crossplane.Spelling = NO -->
**Example PromQL queries:**
<!-- vale Crossplane.Spelling = YES -->
```promql
# Rate of circuit breaker opens over 5 minutes
rate(circuit_breaker_opens_total[5m])

# Count of circuit breaker opens by controller
sum by (controller) (circuit_breaker_opens_total)
```

<!-- vale Google.Headings = NO -->
<!-- vale Crossplane.Spelling = NO -->
#### circuit_breaker_closes_total
<!-- vale Crossplane.Spelling = YES -->
<!-- vale Google.Headings = YES -->

Tracks when a circuit breaker transitions from open to closed state. This indicates an XR has recovered from excessive watch events and returned to normal operation.

**Use this metric to:**
<!-- vale write-good.TooWordy = NO -->
- Monitor recovery from reconciliation thrashing
<!-- vale write-good.TooWordy = YES -->
<!-- vale Crossplane.Spelling = NO -->
- Verify circuit breakers are closing after cooldown periods
<!-- vale Crossplane.Spelling = YES -->
- Track circuit breaker lifecycle
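
The queries below are a sketch that mirrors the `circuit_breaker_opens_total` examples. The 5-minute and 1-hour windows are arbitrary choices. Treat the opens-minus-closes expression as a rough estimate only: counter resets (for example, after a Crossplane pod restart) skew it, and it returns nothing for controllers that have no `circuit_breaker_closes_total` series yet.

```promql
# Rate of circuit breaker closes over 5 minutes
rate(circuit_breaker_closes_total[5m])

# Rough estimate of circuits that opened in the last hour and haven't closed yet
sum by (controller) (increase(circuit_breaker_opens_total[1h]))
-
sum by (controller) (increase(circuit_breaker_closes_total[1h]))
```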

<!-- vale Google.Headings = NO -->
<!-- vale Crossplane.Spelling = NO -->
#### circuit_breaker_events_total
<!-- vale Crossplane.Spelling = YES -->
<!-- vale Google.Headings = YES -->

Tracks all watch events processed by the circuit breaker, labeled by `result`:
<!-- vale write-good.Passive = NO -->
- `Allowed`: Normal operation when circuit is closed - events proceed to reconciliation
<!-- vale write-good.Passive = YES -->
- `Dropped`: Events blocked when circuit is fully open - indicates active throttling
<!-- vale Crossplane.Spelling = NO -->
- `HalfOpenAllowed`: Limited probe events when circuit is half-open - circuit is testing for recovery
<!-- vale Crossplane.Spelling = YES -->

**Use this metric to:**
- Track the volume of watch events per XR type
- Detect when the circuit drops events (active throttling)
- Alert on high dropped event rates indicating potential issues
- Understand reconciliation pressure on specific controllers

<!-- vale Crossplane.Spelling = NO -->
**Example PromQL queries:**
<!-- vale Crossplane.Spelling = YES -->
```promql
# Rate of dropped events (active throttling), aggregated per controller
sum by (controller) (
  rate(circuit_breaker_events_total{result="Dropped"}[5m])
)

# Percentage of events being dropped
sum by (controller) (rate(circuit_breaker_events_total{result="Dropped"}[5m]))
/
sum by (controller) (rate(circuit_breaker_events_total[5m])) * 100

# Number of replicas per controller currently dropping events
count by (controller) (
  rate(circuit_breaker_events_total{result="Dropped"}[5m]) > 0
)

# Estimated number of circuit breaker opens over 5 minutes
sum by (controller) (
  increase(circuit_breaker_opens_total[5m])
)

# Alert condition: controllers under high watch pressure (severe overload)
sum by (controller) (
  rate(circuit_breaker_events_total{result="Dropped"}[5m])
) > 1
```

**Recommended alerts:**
```yaml
# Alert when circuit breaker is consistently dropping events
- alert: CircuitBreakerDropRatioHigh
  expr: |
    (
      sum by (controller)(rate(circuit_breaker_events_total{result="Dropped"}[5m]))
      /
      sum by (controller)(rate(circuit_breaker_events_total[5m]))
    ) > 0.2
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "High circuit breaker drop ratio for {{ $labels.controller }}"
    description: "More than 20% of events are being dropped by the circuit breaker for {{ $labels.controller }}, indicating sustained overload."

# Alert when circuit breaker opens frequently
- alert: CircuitBreakerFrequentOpens
  expr: |
    sum by (controller) (
      rate(circuit_breaker_opens_total[5m])
    ) * 3600 > 6
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "Frequent circuit breaker opens for {{ $labels.controller }}"
    description: "Circuit breaker for {{ $labels.controller }} is opening more than 6 times per hour, indicating reconciliation thrashing."
```

For more information on the circuit breaker feature and configuration, see [Troubleshooting - Circuit breaker]({{< ref "troubleshoot-crossplane#circuit-breaker-for-reconciliation-thrashing" >}}).

## Provider metrics

Crossplane providers emit these metrics. All providers built with crossplane-runtime emit the `crossplane_managed_resource_*` metrics.
103 changes: 102 additions & 1 deletion content/master/guides/troubleshoot-crossplane.md
@@ -100,7 +100,7 @@ spec:
spec:
containers:
- name: package-runtime
args:
- --debug
---
apiVersion: pkg.crossplane.io/v1
@@ -195,6 +195,107 @@ For example, for a `CloudSQLInstance` managed resource (`database.gcp.crossplane
kubectl patch cloudsqlinstance my-db -p '{"metadata":{"finalizers": []}}' --type=merge
```

## Circuit breaker for reconciliation thrashing

<!-- vale alex.ProfanityUnlikely = NO -->
Crossplane includes a circuit breaker mechanism to prevent reconciliation thrashing. Thrashing occurs when controllers fight over composed resource state or enter tight reconciliation loops that could impact cluster performance.
<!-- vale alex.ProfanityUnlikely = YES -->

### How the circuit breaker works

<!-- vale Crossplane.Spelling = NO -->
Each Composite Resource (XR) has its own circuit breaker, built on a token bucket, that monitors reconciliation rates:
<!-- vale Crossplane.Spelling = YES -->

- **Burst (capacity)**: Maximum number of events allowed in quick succession (default: 100)
<!-- vale write-good.Passive = NO -->
- **Refill rate**: Sustained event rate after the burst capacity is exhausted (default: 1 event per second)
<!-- vale write-good.Passive = YES -->
<!-- vale Crossplane.Spelling = NO -->
- **Cooldown**: Duration the circuit stays open before attempting recovery (default: 5 minutes)
<!-- vale Crossplane.Spelling = YES -->

<!-- vale write-good.Weasel = NO -->
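When an XR receives too many watch events (exceeding the burst and refill rate), the circuit breaker opens and blocks most reconciliation requests. While the circuit is open, Crossplane allows one request every 30 seconds to probe for recovery. For example, assuming standard token bucket behavior and the default settings, an XR receiving a sustained 5 watch events per second loses a net 4 tokens per second and drains its 100-event burst in roughly 25 seconds.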
<!-- vale write-good.Weasel = YES -->

### Detecting circuit breaker activation

XRs have a `Responsive` condition that tracks circuit breaker state. When the circuit breaker opens, this condition changes to `False`:

```yaml
conditions:
- type: Responsive
  status: "False"
  reason: WatchCircuitOpen
  message: "Too many watch events from ConfigMap/my-config (default). Allowing events periodically."
```

The message identifies which resource is causing excessive watch events, helping you pinpoint the source of thrashing.

### Identifying and fixing root causes

<!-- vale write-good.TooWordy = NO -->
Watch events occur when resources change in your cluster. Excessive watch events typically indicate composition patterns that cause loops, such as resources updating each other in cycles or external systems reverting changes made by Crossplane.

**To identify the source of excessive watch events:**

The XR's `Responsive` condition message identifies the problematic resource. Monitor this resource for modification events:
<!-- vale write-good.TooWordy = YES -->

```shell
kubectl get <resource-kind> <resource-name> -n <namespace> --output-watch-events --watch-only
```

**Common root causes and fixes:**

<!-- vale write-good.TooWordy = NO -->
- **Feedback loops in patches**: Review Composition patches for logic that creates circular updates where changes trigger more changes
- **External controller conflicts**: Other controllers or operators might modify the same resources, fighting with Crossplane for control
- **Frequent connection detail updates**: Consider whether every field needs to be in the connection details, because updates to connection secrets trigger watch events

Investigate and fix the root cause before adjusting circuit breaker thresholds.
<!-- vale write-good.TooWordy = YES -->

### Configuring circuit breaker parameters

<!-- vale write-good.Weasel = NO -->
<!-- vale write-good.TooWordy = NO -->
The default circuit breaker settings work well for most environments. You may need to adjust them based on your composition patterns and cluster size.
<!-- vale write-good.Weasel = YES -->
<!-- vale write-good.TooWordy = YES -->
<!-- vale Crossplane.Spelling = NO -->
<!-- vale Microsoft.Adverbs = NO -->
For example, increase the burst and refill rate for large-scale deployments with XRs updating frequently, or decrease them if you want stricter protection against thrashing.
<!-- vale Crossplane.Spelling = YES -->
<!-- vale Microsoft.Adverbs = YES -->

Configure circuit breaker parameters using Crossplane startup arguments via Helm:

```shell
helm install crossplane --namespace crossplane-system --create-namespace crossplane-stable/crossplane \
--set args='{"--circuit-breaker-burst=500.0","--circuit-breaker-refill-rate=5.0","--circuit-breaker-cooldown=1m"}'
```

Available parameters:
- `--circuit-breaker-burst`: Maximum burst of events (default: 100.0)
- `--circuit-breaker-refill-rate`: Events per second for sustained rate (default: 1.0)
<!-- vale Crossplane.Spelling = NO -->
<!-- vale Google.Units = NO -->
<!-- vale gitlab.Units = NO -->
- `--circuit-breaker-cooldown`: Duration to keep circuit open (default: 5m0s)
<!-- vale Crossplane.Spelling = YES -->
<!-- vale Google.Units = YES -->
<!-- vale gitlab.Units = YES -->

### Monitoring with metrics

<!-- vale write-good.TooWordy = NO -->
Track circuit breaker activity using the `circuit_breaker_opens_total`, `circuit_breaker_closes_total`, and `circuit_breaker_events_total` Prometheus metrics.
<!-- vale write-good.TooWordy = YES -->

See the [Metrics guide]({{< ref "metrics#circuit-breaker-metrics" >}}) for detailed metric information.
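
As a first triage step, a query along these lines splits the watch event rate for each controller by `result`, so a non-zero `Dropped` series shows which XR types are throttled right now. It's a sketch, and the 5-minute window is an arbitrary choice.

```promql
# Watch event rate per controller, split by circuit breaker result
sum by (controller, result) (
  rate(circuit_breaker_events_total[5m])
)
```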

## Tips, tricks, and troubleshooting

This section covers some common tips, tricks, and troubleshooting steps