Eval Protocol integration for evaluating VLM browser agents on Kernel browser pools.

This package provides standardized evaluation of browser-based VLM agents through the Eval Protocol framework.

The integration provides:
- `KernelBrowserRolloutProcessor`: a custom rollout processor that runs multi-step browser episodes on Kernel browser pools
- `core/`: vendored agent code (QwenAgent, WebJudge, browser adapter) from kernel-tinker-rl
- `agent_auth/`: Agent Auth benchmark configuration and custom actions
- `test_agent_auth.py`: example evaluation test for the Agent Auth benchmark
```
┌─────────────────────────────────────────────────────────────────┐
│                          Eval Protocol                          │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  @evaluation_test(...)                                   │   │
│  │  async def test_agent_auth(row):                         │   │
│  │      # Rollout already executed by processor             │   │
│  │      trajectory = get_trajectory(row)                    │   │
│  │      score = webjudge.evaluate(trajectory)               │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                │                                │
│                                ▼                                │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  KernelBrowserRolloutProcessor                           │   │
│  │  1. Acquire browser from Kernel pool                     │   │
│  │  2. Navigate to initial URL                              │   │
│  │  3. Run agent loop (QwenAgent)                           │   │
│  │     - Screenshot → VLM predict → Execute → Repeat        │   │
│  │  4. Capture trajectory (screenshots, actions, messages)  │   │
│  │  5. Release browser                                      │   │
│  └──────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
                  ┌─────────────────────────────┐
                  │     Kernel Browser Pool     │
                  │   ┌─────┐ ┌─────┐ ┌─────┐   │
                  │   │ 🌐 │ │ 🌐 │ │ 🌐 │   │
                  │   └─────┘ └─────┘ └─────┘   │
                  └─────────────────────────────┘
```
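The processor's numbered steps can be sketched as a plain Python loop. Everything here (`pool.acquire`, `browser.goto`, `agent.predict`, the action object) is a hypothetical stand-in for the Kernel SDK and QwenAgent APIs, not the vendored implementation:

```python
# Sketch of the rollout lifecycle shown in the diagram; all class and
# method names are illustrative stand-ins, not the real processor.
import base64


def run_episode(pool, agent, task, max_steps=15):
    browser = pool.acquire()                           # 1. Acquire browser from pool
    trajectory = {"screenshots_b64": [], "action_history": []}
    try:
        browser.goto(task["initial_url"])              # 2. Navigate to initial URL
        for _ in range(max_steps):                     # 3. Run agent loop
            png = browser.screenshot()                 #    Screenshot
            action = agent.predict(png, task["task"])  #    VLM predict
            trajectory["screenshots_b64"].append(      # 4. Capture trajectory
                base64.b64encode(png).decode())
            trajectory["action_history"].append(str(action))
            if action.is_terminal():
                break
            browser.execute(action)                    #    Execute, then repeat
    finally:
        pool.release(browser)                          # 5. Release browser
    return trajectory
```

The real processor additionally records the full message history and step results, as described below in the episode metadata.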
Prerequisites:

- Kernel API Key: Get from https://onkernel.com
- OpenAI API Key: Get from https://platform.openai.com (for WebJudge scoring)
- Fireworks API Key: Get from https://fireworks.ai (for VLM inference)
- Browser Pool: Create a Kernel browser pool named `eval-browser-pool`
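Before running anything, it can help to fail fast when a credential is missing. A small helper sketch that checks the three variables listed above:

```python
# Sanity-check the required credentials before running the suite.
# The variable names match the keys listed in the prerequisites.
import os

REQUIRED = ["KERNEL_API_KEY", "OPENAI_API_KEY", "FIREWORKS_API_KEY"]


def missing_keys(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED if not env.get(name)]
```

Calling `missing_keys()` with no argument inspects the current process environment.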
```bash
# Clone this repo
git clone <this-repo> kernel-eval-protocol
cd kernel-eval-protocol

# Install dependencies
pip install -r requirements.txt

# Set environment variables (or create a .env file)
export KERNEL_API_KEY="your-kernel-key"
export OPENAI_API_KEY="your-openai-key"
export FIREWORKS_API_KEY="your-fireworks-key"
```

Or create a `.env` file:

```
KERNEL_API_KEY=your-kernel-key
OPENAI_API_KEY=your-openai-key
FIREWORKS_API_KEY=your-fireworks-key
```

Create the browser pool:

```bash
# Using Kernel CLI
kernel pools create eval-browser-pool --size 20
```

Run the evaluation:

```bash
# Activate venv
source .venv/bin/activate

# Run evaluation
pytest test_agent_auth.py -vs
```

Create your own evaluation test:
```python
from eval_protocol.pytest import evaluation_test
from eval_protocol.models import EvaluateResult
from kernel_browser_rollout_processor import (
    KernelBrowserRolloutProcessor,
    decode_screenshots,
)
from core.reward_models.webjudge import Trajectory, WebJudge
from agent_auth.actions import AGENT_AUTH_ACTIONS
from agent_auth.config import get_agent_auth_system_prompt


@evaluation_test(
    input_dataset=["your_tasks.jsonl"],
    rollout_processor=KernelBrowserRolloutProcessor(
        pool_name="your-pool",
        max_steps=15,
        system_prompt=get_agent_auth_system_prompt(),
        extra_actions=AGENT_AUTH_ACTIONS,
    ),
    completion_params=[{"model": "accounts/fireworks/models/qwen3-vl-30b-a3b-thinking"}],
)
async def test_your_evaluation(row):
    # Trajectory data is in row.execution_metadata.extra
    extra = row.execution_metadata.extra
    screenshots = decode_screenshots(extra["screenshots_b64"])
    actions = extra["action_history"]

    # Message history (including tool_calls) is in row.messages
    messages = row.messages

    # Your evaluation logic here
    score = your_scorer(screenshots, actions)
    row.evaluation_result = EvaluateResult(score=score, reason="...")
    return row
```

Create a reinforcement fine-tuning job using evaluation results:
```bash
ep create rft \
  --base-model accounts/fireworks/models/qwen3-vl-8b-instruct \
  --chunk-size 50 \
  --max-context-length 32768 \
  --batch-size 32768 \
  --epochs 4
```

```
kernel-eval-protocol/
├── core/                                # Vendored from kernel-tinker-rl
│   ├── agent.py                         # QwenAgent - VLM agent with message history
│   ├── agent_loop.py                    # Multi-step agent loop
│   ├── browser.py                       # Kernel browser adapter
│   ├── actions.py                       # Action definitions (click, type, etc.)
│   ├── prompts.py                       # System prompt builder
│   ├── tracking.py                      # Episode tracking utilities
│   ├── utils.py                         # Helper utilities
│   └── reward_models/
│       ├── base.py                      # Base reward model
│       └── webjudge.py                  # WebJudge LLM-as-Judge scorer
├── agent_auth/                          # Agent Auth benchmark
│   ├── actions.py                       # FoundInputsAction for form discovery
│   └── config.py                        # System prompt configuration
├── kernel_browser_rollout_processor.py  # Eval Protocol rollout processor
├── test_agent_auth.py                   # Example evaluation test
├── tasks.jsonl                          # Agent Auth task dataset
├── requirements.txt                     # Python dependencies
├── pytest.ini                           # Pytest configuration
└── README.md
```
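Each line of `tasks.jsonl` holds one JSON task object. A minimal loader sketch (assuming only that one-object-per-line layout):

```python
# Minimal JSONL task loader sketch for tasks.jsonl; skips blank lines.
import json


def load_tasks(path):
    """Yield one task dict per non-empty line of a .jsonl file."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```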
Task dataset format (one JSON object per line):

```json
{
  "id": "gandi-net-register",
  "initial_url": "https://gandi.net",
  "task": "Navigate to gandi.net and find the register page..."
}
```

The agent preserves the full conversation history, including native tool calls:
```python
[
    {"role": "system", "content": "...system prompt..."},
    {"role": "user", "content": [{"type": "image_url", ...}, {"type": "text", ...}]},
    {"role": "assistant", "content": "...", "tool_calls": [{"id": "...", "function": {...}}]},
    ...
]
```

Episode data recorded in `row.execution_metadata.extra`:

```
{
    # For evaluation (WebJudge)
    "screenshots_b64": ["base64...", ...],   # PNG images as base64
    "action_history": ["click(100, 200)", ...],

    # Task info
    "task": "Navigate to...",
    "task_id": "gandi-net-register",
    "initial_url": "https://gandi.net",
    "final_url": "https://gandi.net/register",

    # Episode metadata
    "termination_reason": "terminal_action",
    "terminal_action": "found_inputs(...)",
    "steps_completed": 5,
    "error": null,

    # Step details (for debugging)
    "step_results": [...]
}
```

The `core/` directory is vendored from kernel-tinker-rl with the following modifications:
The original `QwenAgent` only stored text responses, discarding native tool calls. We modified it to preserve the full conversation history:

- Added `messages: list[dict]` to `AgentState`
- `predict()` now stores:
  - The system message (on the first call)
  - User messages (with screenshots + instruction)
  - Assistant responses, including `tool_calls` when present
- Added a `get_messages()` method to retrieve the full history

This enables accurate conversation replay in the Eval Protocol UI.
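The bookkeeping above can be sketched roughly as follows. This is a simplified illustration, not the vendored `QwenAgent` (which stores richer content, e.g. screenshots as `image_url` parts):

```python
# Simplified sketch of the message-history bookkeeping described above.
from dataclasses import dataclass, field


@dataclass
class AgentState:
    messages: list = field(default_factory=list)  # full conversation history


class Agent:
    def __init__(self, system_prompt):
        self.system_prompt = system_prompt
        self.state = AgentState()

    def record_turn(self, user_content, response):
        # System message only on the first call
        if not self.state.messages:
            self.state.messages.append(
                {"role": "system", "content": self.system_prompt})
        # User message (screenshot + instruction parts)
        self.state.messages.append({"role": "user", "content": user_content})
        # Assistant response, preserving native tool_calls when present
        msg = {"role": "assistant", "content": response.get("content", "")}
        if response.get("tool_calls"):
            msg["tool_calls"] = response["tool_calls"]
        self.state.messages.append(msg)

    def get_messages(self):
        return list(self.state.messages)
```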
The original code used `max_tokens` and `temperature=0`, which work with OpenRouter but not with direct OpenAI API calls (newer models like gpt-5-mini):

- Changed `max_tokens` → `max_completion_tokens`
- Removed `temperature=0` (newer OpenAI models only support `temperature=1`)

These changes allow WebJudge to work with direct OpenAI API calls instead of requiring OpenRouter.
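In terms of the chat-completions request, the change amounts to the following sketch (the helper name `judge_request_kwargs` is illustrative, not WebJudge's actual code):

```python
# Sketch of the request-parameter change for direct OpenAI API calls.
# Before (works via OpenRouter):
#   client.chat.completions.create(model=..., messages=...,
#                                  max_tokens=4096, temperature=0)
def judge_request_kwargs(model, messages, max_completion_tokens=4096):
    """Build kwargs for a direct OpenAI chat.completions.create call."""
    return {
        "model": model,
        "messages": messages,
        # max_tokens → max_completion_tokens
        "max_completion_tokens": max_completion_tokens,
        # temperature omitted: newer models accept only the default (1)
    }
```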
- kernel-tinker-rl - VLM RL training for computer use (source of the vendored `core/`)
- Eval Protocol - pytest-based LLM evaluation framework
- Kernel - browser-as-a-service