Kernel Eval Protocol

Eval Protocol integration for evaluating VLM browser agents using Kernel browser pools.

This package provides standardized evaluation of browser-based VLM agents using the Eval Protocol framework.

Overview

The integration provides:

  • KernelBrowserRolloutProcessor: A custom rollout processor that runs multi-step browser episodes using Kernel browser pools
  • core/: Vendored agent code (QwenAgent, WebJudge, browser adapter) from kernel-tinker-rl
  • agent_auth/: Agent Auth benchmark configuration and custom actions
  • test_agent_auth.py: Example evaluation test for the Agent Auth benchmark

How It Works

┌─────────────────────────────────────────────────────────────────┐
│                          Eval Protocol                          │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │  @evaluation_test(...)                                    │  │
│  │  async def test_agent_auth(row):                          │  │
│  │      # Rollout already executed by processor              │  │
│  │      trajectory = get_trajectory(row)                     │  │
│  │      score = webjudge.evaluate(trajectory)                │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                │                                │
│                                ▼                                │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │  KernelBrowserRolloutProcessor                            │  │
│  │    1. Acquire browser from Kernel pool                    │  │
│  │    2. Navigate to initial URL                             │  │
│  │    3. Run agent loop (QwenAgent)                          │  │
│  │       - Screenshot → VLM predict → Execute → Repeat       │  │
│  │    4. Capture trajectory (screenshots, actions, messages) │  │
│  │    5. Release browser                                     │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
                 ┌──────────────────────────────┐
                 │     Kernel Browser Pool      │
                 │  ┌─────┐ ┌─────┐ ┌─────┐     │
                 │  │ 🌐  │ │ 🌐  │ │ 🌐  │     │
                 │  └─────┘ └─────┘ └─────┘     │
                 └──────────────────────────────┘
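
In code, the processor's job per dataset row looks roughly like the sketch below. The helper names (acquire_browser, run_agent_loop, release_browser) are illustrative stand-ins, not the actual Kernel SDK or package API.

async def process_row(row, pool_name: str, max_steps: int):
    browser = await acquire_browser(pool_name)       # 1. acquire from pool
    try:
        await browser.goto(row.initial_url)          # 2. navigate to initial URL
        trajectory = await run_agent_loop(           # 3. screenshot -> predict -> execute loop
            browser, task=row.task, max_steps=max_steps
        )
        row.messages = trajectory.messages           # 4. capture trajectory
        row.execution_metadata.extra = trajectory.to_dict()
    finally:
        await release_browser(browser)               # 5. release back to the pool
    return row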

Setup

Prerequisites

  1. Kernel API Key: Get from https://onkernel.com
  2. OpenAI API Key: Get from https://platform.openai.com (for WebJudge scoring)
  3. Fireworks API Key: Get from https://fireworks.ai (for VLM inference)
  4. Browser Pool: Create a Kernel browser pool named eval-browser-pool

Installation

# Clone this repo
git clone <this-repo> kernel-eval-protocol
cd kernel-eval-protocol

# Create a virtual environment and install dependencies
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Set environment variables (or create a .env file)
export KERNEL_API_KEY="your-kernel-key"
export OPENAI_API_KEY="your-openai-key"
export FIREWORKS_API_KEY="your-fireworks-key"

Create .env file

KERNEL_API_KEY=your-kernel-key
OPENAI_API_KEY=your-openai-key
FIREWORKS_API_KEY=your-fireworks-key
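
Note that something must load the .env file before the keys are read. Whether the test harness does this automatically is framework-dependent; if it does not, a python-dotenv call at the top of your test module is enough (a sketch, assuming python-dotenv is installed):

# Load .env so the API keys land in os.environ before any client is created.
from dotenv import load_dotenv

load_dotenv()  # reads KERNEL_API_KEY, OPENAI_API_KEY, FIREWORKS_API_KEY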

Create Browser Pool

# Using Kernel CLI
kernel pools create eval-browser-pool --size 20

Usage

Run Agent Auth Evaluation

# Activate venv
source .venv/bin/activate

# Run evaluation
pytest test_agent_auth.py -vs

Custom Evaluations

Create your own evaluation test:

from eval_protocol.pytest import evaluation_test
from eval_protocol.models import EvaluateResult
from kernel_browser_rollout_processor import (
    KernelBrowserRolloutProcessor,
    decode_screenshots,
)
from core.reward_models.webjudge import Trajectory, WebJudge
from agent_auth.actions import AGENT_AUTH_ACTIONS
from agent_auth.config import get_agent_auth_system_prompt

@evaluation_test(
    input_dataset=["your_tasks.jsonl"],
    rollout_processor=KernelBrowserRolloutProcessor(
        pool_name="your-pool",
        max_steps=15,
        system_prompt=get_agent_auth_system_prompt(),
        extra_actions=AGENT_AUTH_ACTIONS,
    ),
    completion_params=[{"model": "accounts/fireworks/models/qwen3-vl-30b-a3b-thinking"}],
)
async def test_your_evaluation(row):
    # Trajectory data is in row.execution_metadata.extra
    extra = row.execution_metadata.extra
    screenshots = decode_screenshots(extra["screenshots_b64"])
    actions = extra["action_history"]
    
    # Message history (including tool_calls) is in row.messages
    messages = row.messages
    
    # Your evaluation logic here
    score = your_scorer(screenshots, actions)
    
    row.evaluation_result = EvaluateResult(score=score, reason="...")
    return row
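
The your_scorer call above is a placeholder. As a purely hypothetical illustration, a minimal heuristic scorer might reward episodes that ended with the terminal found_inputs action rather than timing out:

def your_scorer(screenshots, actions) -> float:
    """Toy heuristic, not part of this package: full credit if the
    agent issued the terminal found_inputs(...) action, zero otherwise."""
    if any(a.startswith("found_inputs") for a in actions):
        return 1.0
    return 0.0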

Train with RFT

Create a reinforcement fine-tuning job using evaluation results:

ep create rft \
  --base-model accounts/fireworks/models/qwen3-vl-8b-instruct \
  --chunk-size 50 \
  --max-context-length 32768 \
  --batch-size 32768 \
  --epochs 4

Project Structure

kernel-eval-protocol/
├── core/                          # Vendored from kernel-tinker-rl
│   ├── agent.py                   # QwenAgent - VLM agent with message history
│   ├── agent_loop.py              # Multi-step agent loop
│   ├── browser.py                 # Kernel browser adapter
│   ├── actions.py                 # Action definitions (click, type, etc.)
│   ├── prompts.py                 # System prompt builder
│   ├── tracking.py                # Episode tracking utilities
│   ├── utils.py                   # Helper utilities
│   └── reward_models/
│       ├── base.py                # Base reward model
│       └── webjudge.py            # WebJudge LLM-as-Judge scorer
├── agent_auth/                    # Agent Auth benchmark
│   ├── actions.py                 # FoundInputsAction for form discovery
│   └── config.py                  # System prompt configuration
├── kernel_browser_rollout_processor.py  # Eval Protocol rollout processor
├── test_agent_auth.py             # Example evaluation test
├── tasks.jsonl                    # Agent Auth task dataset
├── requirements.txt               # Python dependencies
├── pytest.ini                     # Pytest configuration
└── README.md

Data Format

Input Dataset (tasks.jsonl)

{
  "id": "gandi-net-register",
  "initial_url": "https://gandi.net",
  "task": "Navigate to gandi.net and find the register page..."
}

Message History (row.messages)

The agent preserves the full conversation history, including native tool calls:

[
    {"role": "system", "content": "...system prompt..."},
    {"role": "user", "content": [{"type": "image_url", ...}, {"type": "text", ...}]},
    {"role": "assistant", "content": "...", "tool_calls": [{"id": "...", "function": {...}}]},
    ...
]
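
Because tool calls are preserved verbatim, you can walk row.messages to reconstruct what the agent did at each step. Assuming the function payload follows the standard OpenAI tool-call shape (name plus JSON arguments):

# Print each native tool call the agent made.
for msg in row.messages:
    if msg["role"] == "assistant":
        for call in msg.get("tool_calls", []) or []:
            fn = call["function"]
            print(fn["name"], fn.get("arguments"))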

Trajectory Data (row.execution_metadata.extra)

{
    # For evaluation (WebJudge)
    "screenshots_b64": ["base64...", ...],  # PNG images as base64
    "action_history": ["click(100, 200)", ...],
    
    # Task info
    "task": "Navigate to...",
    "task_id": "gandi-net-register",
    "initial_url": "https://gandi.net",
    "final_url": "https://gandi.net/register",
    
    # Episode metadata
    "termination_reason": "terminal_action",
    "terminal_action": "found_inputs(...)",
    "steps_completed": 5,
    "error": null,
    
    # Step details (for debugging)
    "step_results": [...]
}
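
The decode_screenshots helper imported earlier turns the base64 strings back into images. A sketch of equivalent decoding, assuming the PNGs are handled with Pillow:

import base64
import io

from PIL import Image

def decode_screenshots_sketch(screenshots_b64: list[str]) -> list[Image.Image]:
    # base64-encoded PNG strings -> PIL images
    return [Image.open(io.BytesIO(base64.b64decode(s))) for s in screenshots_b64]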

Modifications to Vendored Code

The core/ directory is vendored from kernel-tinker-rl with the following modifications:

1. Message History Preservation (core/agent.py)

The original QwenAgent only stored text responses, discarding native tool calls. We modified it to preserve the full conversation history:

  • Added messages: list[dict] to AgentState
  • In predict(), now stores:
    • System message (on first call)
    • User messages (with screenshots + instruction)
    • Assistant responses including tool_calls when present
  • Added get_messages() method to retrieve the full history

This enables accurate conversation replay in the Eval Protocol UI.
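
The pattern, roughly (a sketch of the modification, not the vendored code verbatim):

class AgentState:
    def __init__(self, system_prompt: str):
        # Full conversation history, starting with the system message.
        self.messages: list[dict] = [
            {"role": "system", "content": system_prompt}
        ]

    def record_turn(self, user_content: list[dict], assistant_msg: dict) -> None:
        # user_content carries the screenshot + instruction parts;
        # assistant_msg keeps tool_calls when the model returned any.
        self.messages.append({"role": "user", "content": user_content})
        self.messages.append(assistant_msg)

    def get_messages(self) -> list[dict]:
        return self.messages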

2. WebJudge OpenAI API Compatibility (core/reward_models/webjudge.py)

The original code used max_tokens and temperature=0, which work through OpenRouter but are rejected by direct OpenAI API calls to newer models such as gpt-5-mini:

  • Changed max_tokens → max_completion_tokens
  • Removed temperature=0 (newer OpenAI models only support the default temperature=1)

These changes allow WebJudge to work with direct OpenAI API calls instead of requiring OpenRouter.
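
In terms of the OpenAI client call, the change amounts to the following (a sketch; parameter values are illustrative):

# Before (accepted via OpenRouter, rejected by newer OpenAI models):
# response = client.chat.completions.create(
#     model="gpt-5-mini", messages=msgs, max_tokens=1024, temperature=0,
# )

# After (compatible with the direct OpenAI API):
response = client.chat.completions.create(
    model="gpt-5-mini",
    messages=msgs,
    max_completion_tokens=1024,  # renamed parameter
    # temperature omitted: newer models only accept the default (1)
)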
