feat: Horizontal scaling with message queue (Redis/NATS) for enterprise deployments #47

@gouravjshah

Summary

For enterprise-scale deployments (thousands of repositories, multiple orgs, high webhook throughput), AOF needs native horizontal scaling built on a message queue architecture.

Current Architecture

GitHub Webhook → AOF Daemon (single process) → Execute Agent/Fleet/Flow

Limitations:

  • Single process handles all events
  • Synchronous webhook processing
  • No built-in queue for backpressure handling
  • Memory grows with trigger count
  • Single point of failure

Proposed Architecture

                    ┌─────────────────┐
                    │   Ingress/LB    │
                    └────────┬────────┘
                             │
              ┌──────────────┼──────────────┐
              ▼              ▼              ▼
     ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
     │ AOF Gateway │ │ AOF Gateway │ │ AOF Gateway │
     │ (stateless) │ │ (stateless) │ │ (stateless) │
     └──────┬──────┘ └──────┬──────┘ └──────┬──────┘
            │               │               │
            └───────────────┼───────────────┘
                            ▼
                   ┌─────────────────┐
                   │   Redis/NATS    │
                   │  (message queue)│
                   └────────┬────────┘
                            │
              ┌─────────────┼─────────────┐
              ▼             ▼             ▼
     ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
     │ AOF Worker  │ │ AOF Worker  │ │ AOF Worker  │
     │ (executor)  │ │ (executor)  │ │ (executor)  │
     └─────────────┘ └─────────────┘ └─────────────┘

Components

  1. AOF Gateway (stateless)

    • Receives webhooks
    • Validates signatures
    • Matches to trigger
    • Publishes to queue
    • Returns 202 Accepted immediately
  2. Message Queue (Redis Streams or NATS JetStream)

    • Durable message storage
    • Consumer groups for load distribution
    • Dead letter queue for failed events
    • Retry with exponential backoff
  3. AOF Worker (stateless)

    • Consumes from queue
    • Executes agents/fleets/flows
    • Reports results back to queue
    • Horizontally scalable
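To make the "Retry with exponential backoff" and "Dead letter queue" bullets concrete, here is a minimal sketch of the decision a queue consumer could make after a failed attempt. The names `RetryDecision` and `on_failure` are hypothetical, and it assumes `retry_delay_ms` acts as the base delay that doubles on each retry; the proposal itself does not pin down those semantics.

```rust
/// What to do with an event after a failed processing attempt.
#[derive(Debug, PartialEq)]
pub enum RetryDecision {
    /// Redeliver after waiting this many milliseconds.
    RetryAfterMs(u64),
    /// Retries exhausted: move the event to the dead letter queue.
    DeadLetter,
}

/// `failed_attempts` counts attempts that have already failed (1 = first failure).
/// Assumption: the delay doubles each time (base, 2*base, 4*base, ...).
pub fn on_failure(failed_attempts: u32, max_retries: u32, base_delay_ms: u64) -> RetryDecision {
    if failed_attempts > max_retries {
        return RetryDecision::DeadLetter;
    }
    // Cap the exponent so the shift below can never overflow.
    let exponent = failed_attempts.saturating_sub(1).min(20);
    RetryDecision::RetryAfterMs(base_delay_ms.saturating_mul(1u64 << exponent))
}
```

With `max_retries: 3` and `retry_delay_ms: 1000`, this would retry after 1 s, 2 s, and 4 s, then send the event to the dead letter queue on the fourth failure.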

Configuration

apiVersion: aof.dev/v1
kind: DaemonConfig
metadata:
  name: aof-gateway

spec:
  mode: gateway  # New: gateway | worker | standalone (default)
  
  queue:
    type: redis  # redis | nats
    url: redis://redis-cluster:6379
    # Or for NATS:
    # type: nats
    # url: nats://nats-cluster:4222
    
    # Queue settings
    stream: aof-events
    consumer_group: aof-workers
    max_retries: 3
    retry_delay_ms: 1000
    dead_letter_queue: aof-dlq
    
  # Gateway-specific settings
  gateway:
    ack_timeout_ms: 5000  # Return 202 within 5s
    
  # Worker-specific settings  
  worker:
    concurrency: 10  # Parallel event processing
    prefetch: 5      # Events to prefetch

Implementation Plan

Phase 1: Queue Abstraction

  • Define MessageQueue trait
  • Implement Redis Streams backend
  • Implement NATS JetStream backend
  • Add queue configuration to DaemonConfig
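One possible shape for the `MessageQueue` trait from Phase 1 is sketched below, with an in-memory stand-in to exercise it. The method set (`publish`, `consume`, `ack`), the `Event` struct, and `InMemoryQueue` are all assumptions for illustration; a real Redis Streams or NATS JetStream backend would likely be async and return errors.

```rust
use std::collections::VecDeque;

/// An event as it travels through the queue (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
pub struct Event {
    pub id: u64,
    pub payload: String,
    pub attempts: u32,
}

/// Hypothetical queue abstraction shared by all backends.
pub trait MessageQueue {
    /// Append a payload and return the ID assigned to it.
    fn publish(&mut self, payload: &str) -> u64;
    /// Hand out the next pending event, if any.
    fn consume(&mut self) -> Option<Event>;
    /// Acknowledge an event so it is not redelivered.
    fn ack(&mut self, id: u64) -> bool;
}

/// In-memory stand-in used only to exercise the trait.
#[derive(Default)]
pub struct InMemoryQueue {
    next_id: u64,
    pending: VecDeque<Event>,
    in_flight: Vec<Event>,
}

impl MessageQueue for InMemoryQueue {
    fn publish(&mut self, payload: &str) -> u64 {
        self.next_id += 1;
        self.pending.push_back(Event {
            id: self.next_id,
            payload: payload.to_string(),
            attempts: 0,
        });
        self.next_id
    }

    fn consume(&mut self) -> Option<Event> {
        let mut event = self.pending.pop_front()?;
        event.attempts += 1;
        // Track the event until it is acknowledged.
        self.in_flight.push(event.clone());
        Some(event)
    }

    fn ack(&mut self, id: u64) -> bool {
        let before = self.in_flight.len();
        self.in_flight.retain(|e| e.id != id);
        self.in_flight.len() < before
    }
}
```

Keeping unacknowledged events in an `in_flight` list loosely mirrors how consumer groups track pending entries until they are acked.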

Phase 2: Gateway Mode

  • Add mode: gateway option
  • Separate webhook handling from execution
  • Publish events to queue
  • Return 202 Accepted immediately
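The gateway path (validate, match, publish, return 202) could look roughly like this. It is a sketch under several assumptions: the signature check is reduced to a boolean computed elsewhere, a `Vec<String>` stands in for the queue, and the 401 and 204 responses for rejected or unmatched webhooks are illustrative choices that the proposal does not specify. `Webhook` and `handle_webhook` are hypothetical names.

```rust
/// Hypothetical inbound webhook, already parsed by the HTTP layer.
pub struct Webhook<'a> {
    pub event_type: &'a str,
    /// Result of the signature check, assumed to be computed elsewhere.
    pub signature_valid: bool,
    pub body: &'a str,
}

/// Returns the HTTP status code the gateway would send back.
pub fn handle_webhook(hook: &Webhook, triggers: &[&str], queue: &mut Vec<String>) -> u16 {
    if !hook.signature_valid {
        return 401; // assumption: reject unsigned or tampered payloads
    }
    if !triggers.contains(&hook.event_type) {
        return 204; // assumption: nothing to do, so nothing is enqueued
    }
    // No execution happens here: the gateway only hands the event off.
    queue.push(hook.body.to_string());
    202
}
```

Because the handler never runs an agent, its latency is bounded by the publish call rather than by agent execution time.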

Phase 3: Worker Mode

  • Add mode: worker option
  • Consume from queue
  • Execute agents/fleets/flows
  • Handle failures and retries
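A worker with bounded parallelism, in the spirit of `worker.concurrency`, might be sketched as follows. A standard-library channel stands in for the real queue, and `run_workers` and the `execute` callback are hypothetical; the callback represents running an agent, fleet, or flow and returns whether it succeeded.

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

/// Process `events` on `concurrency` threads and return how many succeeded.
pub fn run_workers<F>(events: Vec<String>, concurrency: usize, execute: F) -> usize
where
    F: Fn(&str) -> bool + Send + Sync + 'static,
{
    let (tx, rx) = mpsc::channel::<String>();
    let rx = Arc::new(Mutex::new(rx));
    let execute = Arc::new(execute);

    let handles: Vec<_> = (0..concurrency.max(1))
        .map(|_| {
            let rx = Arc::clone(&rx);
            let execute = Arc::clone(&execute);
            thread::spawn(move || {
                let mut succeeded = 0usize;
                loop {
                    // Hold the lock only while taking the next event.
                    let next = rx.lock().unwrap().recv();
                    match next {
                        Ok(event) => {
                            if (*execute)(event.as_str()) {
                                succeeded += 1;
                            }
                        }
                        Err(_) => break, // channel closed and drained
                    }
                }
                succeeded
            })
        })
        .collect();

    for event in events {
        tx.send(event).expect("workers are still running");
    }
    // Closing the channel lets workers exit once the queue is empty.
    drop(tx);

    handles.into_iter().map(|h| h.join().unwrap()).sum()
}
```

A failed `execute` call is only counted here; in the proposed design it would feed the retry and dead letter handling instead.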

Phase 4: Observability

  • Queue depth metrics
  • Processing latency metrics
  • Dead letter queue alerting
  • Distributed tracing (OpenTelemetry)

Phase 5: Advanced Features

  • Priority queues (critical events first)
  • Rate limiting per org/repo
  • Event deduplication
  • Graceful shutdown with drain
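For the "critical events first" item, one simple approach is a max-heap keyed on priority with a sequence number as a tie-breaker, so events of equal priority stay in arrival order. This is an in-process sketch only; `Priority` and `PriorityQueue` are hypothetical names, and a Redis or NATS deployment would more likely use one stream per priority level.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Variants are declared lowest first, so the derived ordering ranks Critical highest.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Priority {
    Low,
    Normal,
    Critical,
}

#[derive(Default)]
pub struct PriorityQueue {
    // Max-heap on (priority, Reverse(seq)): highest priority first,
    // and the oldest event first within the same priority.
    heap: BinaryHeap<(Priority, Reverse<u64>, String)>,
    seq: u64,
}

impl PriorityQueue {
    pub fn push(&mut self, priority: Priority, event: &str) {
        self.seq += 1;
        self.heap.push((priority, Reverse(self.seq), event.to_string()));
    }

    pub fn pop(&mut self) -> Option<String> {
        self.heap.pop().map(|(_, _, event)| event)
    }
}
```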

Kubernetes Deployment

# Gateway Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: aof-gateway
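spec:
  replicas: 3
  selector:              # required by apps/v1
    matchLabels:
      app: aof-gateway   # illustrative label
  template:
    metadata:
      labels:
        app: aof-gateway
    spec:
      containers:
        - name: aof
          image: <aof-image>  # placeholder: image not specified in this proposal
          args: [serve, --mode=gateway]
          resources:
            requests:
              cpu: 100m
              memory: 128Mi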
---
# Worker Deployment (auto-scaling)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: aof-worker
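spec:
  replicas: 5            # initial count; the HPA below adjusts it afterwards
  selector:              # required by apps/v1
    matchLabels:
      app: aof-worker    # illustrative label
  template:
    metadata:
      labels:
        app: aof-worker
    spec:
      containers:
        - name: aof
          image: <aof-image>  # placeholder: image not specified in this proposal
          args: [serve, --mode=worker]
          resources:
            requests:
              cpu: 500m
              memory: 512Mi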
---
# HPA for workers
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: aof-worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: aof-worker
  minReplicas: 2
  maxReplicas: 20
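  metrics:
    # External metrics need a metrics adapter (for example KEDA or
    # prometheus-adapter) that exposes redis_stream_lag to the HPA.
    - type: External
      external:
        metric:
          name: redis_stream_lag
        target:
          type: AverageValue
          averageValue: 100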
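To see what the HPA target above implies, the replica count for an `AverageValue` external metric works out to roughly the total metric value divided by the per-pod target, rounded up and clamped to the min/max bounds. The sketch below reproduces that arithmetic; it assumes `redis_stream_lag` reports the total number of undelivered events, and it ignores the HPA's tolerance band and stabilization window. `desired_replicas` is a hypothetical helper, not a Kubernetes API.

```rust
/// Approximate replica count for an AverageValue target.
/// Panics if `min_replicas > max_replicas` (a misconfigured HPA).
pub fn desired_replicas(
    total_lag: u64,
    target_per_pod: u64,
    min_replicas: u64,
    max_replicas: u64,
) -> u64 {
    let target = target_per_pod.max(1); // avoid division by zero
    // Ceiling division without risking overflow.
    let raw = total_lag / target + u64::from(total_lag % target != 0);
    raw.clamp(min_replicas, max_replicas)
}
```

With the values in the manifest (`averageValue: 100`, `minReplicas: 2`, `maxReplicas: 20`), a lag of 1,250 events would call for 13 workers, an empty stream would hold at 2, and a lag of 5,000 would be capped at 20.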

Benefits

  • Scalability: Add workers to handle more load
  • Reliability: Events persisted in queue, survive restarts
  • Backpressure: Queue absorbs traffic spikes
  • Isolation: Workers can be specialized (frontend, backend, infra)
  • Observability: Queue metrics for capacity planning

Related Issues

Acceptance Criteria

  • Queue abstraction with Redis and NATS backends
  • Gateway mode for webhook ingestion
  • Worker mode for event processing
  • Kubernetes manifests for horizontal deployment
  • Helm chart with scaling options
  • Documentation for enterprise deployment
  • Benchmark showing 10x throughput improvement over the current single-process daemon

Labels: enhancement