@benjibc benjibc commented Jan 7, 2026

Motivation

  • Capture an explicit start timestamp for each rollout so per-rollout latency and trace alignment can be computed and rendered.
  • The frontend needs an overall latency column for each rollout to help visualize performance (OTEL-style waterfall was requested).
  • Existing created_at timestamps refer to the invocation and cannot be used to compute per-rollout start times.
  • Make rollout timing available in the TS schema so the UI can sort/filter by latency.

Description

  • Added rollout_start_time to ExecutionMetadata in eval_protocol/models.py and extended the TypeScript schema in vite-app/src/types/eval-protocol.ts with rollout_start_time, rollout_duration_seconds, and eval_duration_seconds.
  • Stamp rollout_start_time at rollout start in the main rollout entry points by setting it before the processing timer in processors such as default_single_turn_rollout_process.py, default_pydantic_ai_rollout_processor.py, remote_rollout_processor.py, github_action_rollout_processor.py, openenv_rollout_processor.py, default_klavis_sandbox_rollout_processor.py, default_agent_rollout_processor.py, tinker_rollout_processor.py, priority_scheduler.py, and mcp/execution/manager.py.
  • Surface rollout latency in the frontend by adding a sortable Rollout Latency column in vite-app/src/components/EvaluationTable.tsx, a RowRolloutDuration renderer, and wiring the cell in vite-app/src/components/EvaluationRow.tsx to display execution_metadata.rollout_duration_seconds formatted as seconds.
  • Kept the existing rollout_duration_seconds assignments unchanged so rollout durations are still computed wherever they were before, while the new start timestamp is available for future trace alignment.
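
The stamping step described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the dataclasses are simplified stand-ins for the models in eval_protocol/models.py, and `run_rollout` and its `work` parameter are hypothetical names. Only the guard and the three field names come from this PR.

```python
import time
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, Optional


@dataclass
class ExecutionMetadata:
    # Simplified stand-in (assumption); the real model lives in eval_protocol/models.py.
    rollout_start_time: Optional[datetime] = None
    rollout_duration_seconds: Optional[float] = None
    eval_duration_seconds: Optional[float] = None


@dataclass
class EvaluationRow:
    # Simplified stand-in (assumption) carrying only the metadata used here.
    execution_metadata: ExecutionMetadata = field(default_factory=ExecutionMetadata)


def run_rollout(row: EvaluationRow, work: Callable[[EvaluationRow], None]) -> EvaluationRow:
    """Hypothetical entry point: stamp the start time, then time the work."""
    # Stamp the wall-clock start before the processing timer begins.
    if row.execution_metadata.rollout_start_time is None:
        row.execution_metadata.rollout_start_time = datetime.now(timezone.utc)

    # Use a monotonic clock for the duration so clock adjustments do not skew it.
    timer_start = time.perf_counter()
    work(row)
    row.execution_metadata.rollout_duration_seconds = time.perf_counter() - timer_start
    return row
```

The wall-clock timestamp is what trace alignment would use, while the duration comes from a separate monotonic timer. Whether the real processors use `time.perf_counter()` is an assumption of this sketch.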

Testing

  • Attempted to run pre-commit via make pre-commit to run linters/type checks but it failed because the pre-commit tool is not installed in the environment.
  • Attempted npm install in vite-app to validate frontend dependencies, but npm failed in this environment with the error "Cannot read properties of null (reading 'matches')".
  • No unit test suite ran successfully in this environment because the automated checks above did not complete.
  • All code changes were added and committed locally (git commit) after the edits completed successfully.

Codex Task


Note

Adds explicit rollout timing to enable accurate per-rollout latency and sorting.

  • Protocol: Add rollout_start_time to ExecutionMetadata plus rollout_duration_seconds and eval_duration_seconds fields; keep existing duration plumbing; serialize in models.py and TS ExecutionMetadataSchema.
  • Processors/manager: Set execution_metadata.rollout_start_time at rollout start and compute rollout_duration_seconds in tinker_rollout_processor.py, mcp/execution/manager.py, default_single_turn_rollout_process.py, default_pydantic_ai_rollout_processor.py, default_agent_rollout_processor.py, default_klavis_sandbox_rollout_processor.py, openenv_rollout_processor.py, remote_rollout_processor.py, github_action_rollout_processor.py, and priority_scheduler.py.
  • Frontend: Add sortable "Rollout Latency" column in EvaluationTable.tsx and render with RowRolloutDuration in EvaluationRow.tsx using execution_metadata.rollout_duration_seconds.

Written by Cursor Bugbot for commit 9171bc5. This will update automatically on new commits.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9171bc586c

Comment on lines +66 to +67
if row.execution_metadata.rollout_start_time is None:
    row.execution_metadata.rollout_start_time = datetime.now(timezone.utc)

P2: Reset rollout_start_time on retries

In retry flows (rollout_processor_with_retry reuses the same EvaluationRow), this if ... is None guard means the timestamp is only set on the first attempt. If the first attempt fails and the row is retried, the successful attempt keeps the earlier rollout_start_time while rollout_duration_seconds reflects only the last attempt, so any latency calculation or trace alignment based on rollout_start_time will be too early by the time spent in prior retries. Consider resetting rollout_start_time at the start of each attempt (or in the retry wrapper) to keep these timings consistent.
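
One way to apply this suggestion is to reset the timestamp inside the retry loop. The sketch below assumes a simplified retry wrapper: `rollout_with_retry`, `attempt_fn`, and `max_attempts` are hypothetical names, the dataclasses are stand-ins for the real models, and the actual rollout_processor_with_retry may be structured differently.

```python
import time
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, Optional


@dataclass
class ExecutionMetadata:
    # Simplified stand-in (assumption) for the model in eval_protocol/models.py.
    rollout_start_time: Optional[datetime] = None
    rollout_duration_seconds: Optional[float] = None


@dataclass
class EvaluationRow:
    execution_metadata: ExecutionMetadata = field(default_factory=ExecutionMetadata)


def rollout_with_retry(
    row: EvaluationRow,
    attempt_fn: Callable[[EvaluationRow], None],
    max_attempts: int = 3,
) -> EvaluationRow:
    """Hypothetical wrapper: re-stamp the start time at the top of every attempt."""
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        # Overwrite unconditionally so the timestamp matches the attempt being timed.
        row.execution_metadata.rollout_start_time = datetime.now(timezone.utc)
        timer_start = time.perf_counter()
        try:
            attempt_fn(row)
        except Exception as exc:  # broad catch is for illustration only
            last_error = exc
            continue
        row.execution_metadata.rollout_duration_seconds = time.perf_counter() - timer_start
        return row
    raise RuntimeError(f"rollout failed after {max_attempts} attempts") from last_error
```

With this shape, rollout_start_time and rollout_duration_seconds both describe the final attempt. If the UI should show total time including failed attempts instead, the first-attempt timestamp would need to be kept in a separate field.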