Summary
For enterprise-scale deployments (thousands of repositories, multiple orgs, high webhook throughput), AOF needs native support for horizontal scaling, built on a message queue architecture.
Current Architecture
GitHub Webhook → AOF Daemon (single process) → Execute Agent/Fleet/Flow
Limitations:
- Single process handles all events
- Synchronous webhook processing
- No built-in queue for backpressure handling
- Memory grows with trigger count
- Single point of failure
Proposed Architecture
              ┌─────────────────┐
              │   Ingress/LB    │
              └────────┬────────┘
                       │
       ┌───────────────┼───────────────┐
       ▼               ▼               ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ AOF Gateway │ │ AOF Gateway │ │ AOF Gateway │
│ (stateless) │ │ (stateless) │ │ (stateless) │
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘
       │               │               │
       └───────────────┼───────────────┘
                       ▼
              ┌─────────────────┐
              │   Redis/NATS    │
              │ (message queue) │
              └────────┬────────┘
                       │
       ┌───────────────┼───────────────┐
       ▼               ▼               ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ AOF Worker  │ │ AOF Worker  │ │ AOF Worker  │
│ (executor)  │ │ (executor)  │ │ (executor)  │
└─────────────┘ └─────────────┘ └─────────────┘
Components
- AOF Gateway (stateless)
  - Receives webhooks
  - Validates signatures
  - Matches to trigger
  - Publishes to queue
  - Returns 202 Accepted immediately
- Message Queue (Redis Streams or NATS JetStream)
  - Durable message storage
  - Consumer groups for load distribution
  - Dead letter queue for failed events
  - Retry with exponential backoff
- AOF Worker (stateless)
  - Consumes from queue
  - Executes agents/fleets/flows
  - Reports results back to queue
  - Horizontally scalable
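The gateway steps listed above can be sketched roughly as follows. This is a minimal Python illustration, not AOF's implementation: `SECRET`, `QUEUE`, `signature_is_valid`, and `handle_webhook` are hypothetical names, the in-memory deque stands in for Redis Streams or NATS JetStream, and trigger matching is omitted. The signature check follows GitHub's documented `X-Hub-Signature-256` scheme (HMAC-SHA256 of the raw request body).

```python
import hashlib
import hmac
import json
from collections import deque

SECRET = b"example-webhook-secret"  # hypothetical shared webhook secret
QUEUE = deque()                     # in-memory stand-in for the real message queue


def signature_is_valid(secret: bytes, body: bytes, header: str) -> bool:
    """Compare the header against 'sha256=' + HMAC-SHA256 hex digest of the raw body."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), header.encode())


def handle_webhook(headers: dict, body: bytes) -> int:
    """Validate and enqueue a webhook; return the HTTP status code to send back."""
    if not signature_is_valid(SECRET, body, headers.get("X-Hub-Signature-256", "")):
        return 401
    event = {
        "delivery_id": headers.get("X-GitHub-Delivery"),
        "event_type": headers.get("X-GitHub-Event"),
        "payload": json.loads(body),
    }
    QUEUE.append(event)  # publish only; execution is left to the workers
    return 202           # Accepted: the work has been queued, not completed
```

Because the handler does nothing beyond validation and a queue write, its latency is independent of how long an agent, fleet, or flow takes to run.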
Configuration
apiVersion: aof.dev/v1
kind: DaemonConfig
metadata:
  name: aof-gateway
spec:
  mode: gateway # New: gateway | worker | standalone (default)
  queue:
    type: redis # redis | nats
    url: redis://redis-cluster:6379
    # Or for NATS:
    # type: nats
    # url: nats://nats-cluster:4222
    # Queue settings
    stream: aof-events
    consumer_group: aof-workers
    max_retries: 3
    retry_delay_ms: 1000
    dead_letter_queue: aof-dlq
  # Gateway-specific settings
  gateway:
    ack_timeout_ms: 5000 # Return 202 within 5s
  # Worker-specific settings
  worker:
    concurrency: 10 # Parallel event processing
    prefetch: 5 # Events to prefetch
Implementation Plan
Phase 1: Queue Abstraction
- Define MessageQueue trait
- Implement Redis Streams backend
- Implement NATS JetStream backend
- Add queue configuration to DaemonConfig
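One possible shape for the Phase 1 abstraction is sketched below in Python for brevity. The method names (`publish`, `consume`, `ack`) and the `InMemoryQueue` backend are assumptions made for illustration; the issue does not specify the trait's interface. The idea is that Redis Streams and NATS JetStream backends would implement the same three operations, with an unacknowledged message staying pending until a worker confirms it.

```python
from abc import ABC, abstractmethod
from collections import deque
from typing import Optional, Tuple


class MessageQueue(ABC):
    """Hypothetical backend-agnostic queue interface."""

    @abstractmethod
    def publish(self, event: dict) -> str:
        """Store an event and return its message ID."""

    @abstractmethod
    def consume(self) -> Optional[Tuple[str, dict]]:
        """Return the next (message_id, event), or None if the queue is empty."""

    @abstractmethod
    def ack(self, message_id: str) -> None:
        """Mark a consumed message as successfully processed."""


class InMemoryQueue(MessageQueue):
    """Test backend; a real one would wrap Redis Streams or NATS JetStream."""

    def __init__(self) -> None:
        self._next_id = 0
        self._ready = deque()  # messages waiting for a consumer
        self._pending = {}     # consumed but not yet acknowledged

    def publish(self, event: dict) -> str:
        self._next_id += 1
        message_id = str(self._next_id)
        self._ready.append((message_id, event))
        return message_id

    def consume(self) -> Optional[Tuple[str, dict]]:
        if not self._ready:
            return None
        message_id, event = self._ready.popleft()
        self._pending[message_id] = event
        return message_id, event

    def ack(self, message_id: str) -> None:
        self._pending.pop(message_id, None)

    def pending_count(self) -> int:
        return len(self._pending)
```

An in-memory backend like this would also give the existing standalone mode a queue implementation without requiring Redis or NATS.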
Phase 2: Gateway Mode
- Add mode: gateway option
- Separate webhook handling from execution
- Publish events to queue
- Return 202 Accepted immediately
Phase 3: Worker Mode
- Add mode: worker option
- Consume from queue
- Execute agents/fleets/flows
- Handle failures and retries
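Failure handling in the worker could look something like the sketch below. It reuses `max_retries: 3` and `retry_delay_ms: 1000` from the proposed configuration, but two details are assumptions: that `max_retries` counts retries after the first attempt, and that the delay doubles on each retry. `process_with_retries` and `backoff_delay_ms` are hypothetical names, and a plain list stands in for the dead letter queue.

```python
import time

MAX_RETRIES = 3        # max_retries from the proposed config
RETRY_DELAY_MS = 1000  # retry_delay_ms from the proposed config


def backoff_delay_ms(retry: int, base_ms: int = RETRY_DELAY_MS) -> int:
    """Delay before retry number `retry` (1-based); doubling is an assumed factor."""
    return base_ms * 2 ** (retry - 1)


def process_with_retries(event, execute, dead_letters, sleep=time.sleep) -> bool:
    """Run execute(event); on repeated failure, move the event to dead_letters."""
    for attempt in range(MAX_RETRIES + 1):  # first attempt plus MAX_RETRIES retries
        try:
            execute(event)
            return True
        except Exception:
            if attempt == MAX_RETRIES:
                break  # retries exhausted
            sleep(backoff_delay_ms(attempt + 1) / 1000)
    dead_letters.append(event)
    return False
```

With these values an event that always fails is tried four times, waits 1 s, 2 s, and 4 s between attempts, and then lands in the dead letter queue. A production version would likely add jitter so that workers do not retry in lockstep.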
Phase 4: Observability
- Queue depth metrics
- Processing latency metrics
- Dead letter queue alerting
- Distributed tracing (OpenTelemetry)
Phase 5: Advanced Features
- Priority queues (critical events first)
- Rate limiting per org/repo
- Event deduplication
- Graceful shutdown with drain
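As an illustration of the event deduplication item, one approach is to key on the `X-GitHub-Delivery` header, which GitHub sets to a unique ID per delivery and repeats when a delivery is redelivered. The sketch below is a single-process version with a hypothetical `Deduplicator` class and an assumed one-hour window; with several gateways the seen-set would have to live in a shared store instead.

```python
import time


class Deduplicator:
    """Remember delivery IDs for a limited window and flag repeats."""

    def __init__(self, ttl_seconds: float = 3600.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._seen = {}      # delivery_id -> time first seen

    def is_duplicate(self, delivery_id: str) -> bool:
        now = self._clock()
        # Drop expired entries so memory stays bounded by the TTL window.
        self._seen = {k: t for k, t in self._seen.items() if now - t < self._ttl}
        if delivery_id in self._seen:
            return True
        self._seen[delivery_id] = now
        return False
```

The gateway would call `is_duplicate` before publishing and skip events that return `True`. Pruning on every call is linear in the number of tracked IDs, which is acceptable for a sketch but not for high throughput.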
Kubernetes Deployment
# Abbreviated manifests: selector, pod labels, and image are omitted.
# Gateway Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: aof-gateway
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: aof
          args: [serve, --mode=gateway]
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
---
# Worker Deployment (auto-scaling)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: aof-worker
spec:
  replicas: 5
  template:
    spec:
      containers:
        - name: aof
          args: [serve, --mode=worker]
          resources:
            requests:
              cpu: 500m
              memory: 512Mi
---
# HPA for workers
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: aof-worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: aof-worker
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: External
      external:
        metric:
          name: redis_stream_lag
        target:
          type: AverageValue
          averageValue: 100
Benefits
- Scalability: Add workers to handle more load
- Reliability: Events persisted in queue, survive restarts
- Backpressure: Queue absorbs traffic spikes
- Isolation: Workers can be specialized (frontend, backend, infra)
- Observability: Queue metrics for capacity planning
Related Issues
- #45 - Team/role-based authorization (feat: Add team/role-based user authorization for GitHub triggers)
- #46 - Multi-organization support (feat: Native multi-organization support with per-org credentials)
Acceptance Criteria
- Queue abstraction with Redis and NATS backends
- Gateway mode for webhook ingestion
- Worker mode for event processing
- Kubernetes manifests for horizontal deployment
- Helm chart with scaling options
- Documentation for enterprise deployment
- Benchmark showing 10x throughput improvement over the current single-process daemon