-
Notifications
You must be signed in to change notification settings - Fork 0
refactor: remove build queue for immediate build execution #56
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Replace the explicit build queue with immediate build execution. Builds now start immediately when created instead of being queued. Key changes: - Delete queue.go and queue_test.go - Remove StatusQueued - builds start with StatusBuilding - Add ErrResourcesExhausted error for resource limit detection - Add 503 response with Retry-After header to OpenAPI spec - Update manager to start builds in goroutines directly - Simplify CancelBuild to only handle running builds - Update RecoverPendingBuilds to start builds immediately - Remove queue-related metrics (queueLength, activeBuilds) - Update tests and documentation If host resources are exhausted during instance creation, the build will fail with a clear error message. The API includes a 503 response type with Retry-After header for future pre-check implementation.
✱ Stainless preview buildsThis PR will update the Edit this comment to update it. It will appear in the SDK's changelogs. ✅ hypeman-typescript studio · code · diff
✅ hypeman-go studio · code · diff
⏳ These are partial results; builds are still running. This comment is auto-generated by GitHub Actions and is automatically kept up to date as you push. |
|
A couple things to consider with the queue removal:
|
rgarcia
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
| if err != nil { | ||
| // Check if this is a resource exhaustion error | ||
| errStr := err.Error() | ||
| if strings.Contains(errStr, "exceeds") && strings.Contains(errStr, "limit") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unclear how this ErrResourcesExhausted error makes it back to the API handler - since executeBuild is called from runBuild which runs in a background goroutine, the caller has already received a 202 by the time this runs. the 503 path in the handler seems unreachable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're absolutely right - that was exactly the issue. I've fixed this by adding a synchronous preflight resource check in CreateBuild that runs before spawning the async goroutine:
// Preflight check: verify resources are available before accepting the build
builderMemory := int64(policy.MemoryMB) * 1024 * 1024
if err := m.instanceManager.CheckResourceAvailability(ctx, policy.CPUs, builderMemory); err != nil {
if errors.Is(err, instances.ErrResourcesExhausted) {
return nil, fmt.Errorf("%w: %v", ErrResourcesExhausted, err)
}
return nil, fmt.Errorf("check resource availability: %w", err)
}Added a new CheckResourceAvailability() method to instances.Manager that checks per-instance and aggregate limits without actually creating an instance. Now the 503 path is reachable when resources are truly exhausted at build creation time.
Also added proper sentinel errors (ErrResourcesExhausted) to both packages so we can use errors.Is() instead of brittle string matching.
- High: Add synchronous preflight resource check in CreateBuild to return 503 when resources are exhausted (instead of failing asynchronously) - Medium: Handle StatusQueued in recovery for backward compatibility with builds created before queue removal - Medium: Check terminal states in updateStatus to prevent race condition where cancelled builds get overwritten by runBuild goroutine Changes: - Add ErrResourcesExhausted sentinel error to instances package - Add CheckResourceAvailability() method to instances.Manager interface - Add isTerminalStatus() helper and terminal state checks in manager.go - Update CancelBuild to set cancelled status before deleting instance - Handle StatusQueued in listPendingBuilds for backward compat
StatusQueued constant was removed, use "queued" string literal instead.
The exec endpoint uses WebSocket, not REST API. The previous curl POST approach returned HTTP 405 and empty responses. Changes: - Use websocat WebSocket client for exec (auto-downloads if needed) - Write JSON command to temp file to avoid shell escaping issues - Add wait_for_agent parameter to give guest agent time to initialize - Increase sleep before exec from 2s to 5s for reliability - Verify exec success by checking for expected output (Alpine Linux)
… result Fixed a race condition where a build could still execute after being cancelled. The issue was: 1. runBuild checks if build is in terminal state 2. CancelBuild is called, setting status to "cancelled" 3. runBuild calls updateStatus(StatusBuilding) which silently returns 4. runBuild proceeds to executeBuild, wasting compute resources The fix: - updateStatus now returns bool indicating if the update was applied - runBuild checks this return value and aborts if false - This prevents creating builder VMs for already-cancelled builds
| return nil, fmt.Errorf("%w: %v", ErrResourcesExhausted, err) | ||
| } | ||
| return nil, fmt.Errorf("check resource availability: %w", err) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Preflight resource check has TOCTOU race condition
Medium Severity
The CheckResourceAvailability preflight check runs before acquiring createMu lock. Multiple concurrent CreateBuild requests can all pass this check simultaneously before any build actually consumes resources (which happens asynchronously in executeBuild). This means under concurrent load, users may receive 202 responses for builds that immediately fail due to resource exhaustion, defeating the intended 503 fast-fail behavior. Moving the check inside the lock would at least serialize the checks and make them more accurate relative to pending builds.
Summary
lib/builds/queue.goandlib/builds/queue_test.goErrResourcesExhaustederror and 503 response type for resource exhaustionqueue_positionfieldMotivation
The build queue added complexity without providing significant benefits:
Key Changes
Files Changed
lib/builds/queue.go,lib/builds/queue_test.gomanager.go,types.go,errors.go,metrics.go,storage.goopenapi.yaml, updatedcmd/api/api/builds.gomanager_test.goto remove queue testsREADME.mdto reflect new architectureTest plan
go build ./...- project compilesgo test ./lib/builds/...- all builds package tests passbuildingstatusNote
Modernizes the build flow to execute immediately and surface capacity errors early.
lib/builds/queue.goand tests; initial status isbuilding; recovery restarts pending builds directlyinstances.CheckResourceAvailability; introduceErrResourcesExhausted; map to503 Service UnavailablewithRetry-Afterin API (cmd/api/...,oapi); surface from manager and instance layerqueuedstatus, deprecatequeue_position, change 202 description to “started”, and add 503 response; regenerate client/server codehypeman_build_duration_secondsandhypeman_builds_total; remove queue/active gaugesREADME.md, revise tests to expect immediatebuilding, remove queue tests; enhance e2e script to use WebSocket exec (websocat); minor dep tidy (go.mod)Written by Cursor Bugbot for commit 587b767. This will update automatically on new commits. Configure here.