image:[Palimpsest-MPL-1.0,link="https://github.com/hyperpolymath/palimpsest-license"]

Jonathan D.A. Jewell <jonathan@jewell.dev>
v0.1.0
:toc: left
:toclevels: 3
:icons: font
:source-highlighter: rouge
A rigorous, reproducible tool for quantifying the irritation surface of AI assistants, producing standardised metrics that complement existing benchmarks (MMLU, HumanEval, MT-Bench) with human experience dimensions.
Current benchmarks measure capability—what models CAN do. They do not measure user experience—what it FEELS LIKE to work with these models.
The AI assistant market is maturing. Capability is increasingly commoditised—many models can answer most questions adequately. Differentiation will come from user experience.
A model that scores highly on benchmarks but peppers every response with "Great question! I’d be happy to help!" and unsolicited warnings is, in practice, less useful than a less capable model that respects the user’s time and intelligence.
Vexometer measures what users actually care about.
Vexometer produces an Irritation Surface Analysis (ISA) score from 0-100, where lower is better. The score aggregates ten measurable dimensions of user experience degradation.
| Score Range | Classification | Interpretation |
|---|---|---|
| < 20 | Excellent | Model respects user time and intelligence |
| 20-35 | Good | Minor irritation patterns present |
| 35-50 | Acceptable | Noticeable but tolerable issues |
| 50-70 | Poor | Significant user experience problems |
| > 70 | Unusable | Severe irritation surface |
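The classification bands can be sketched as a small lookup. This is an illustration in Python rather than the project's Ada code, and `classify_isa` is a hypothetical name. The table does not say which band owns a shared endpoint such as 35 or 50, so the half-open boundaries below are an assumption.

```python
def classify_isa(score: float) -> str:
    """Map an ISA score (0-100, lower is better) to its classification band."""
    if not 0 <= score <= 100:
        raise ValueError("ISA score must be between 0 and 100")
    # Assumption: each shared endpoint (20, 35, 50) belongs to the higher band,
    # and exactly 70 still counts as "Poor" because the last band is "> 70".
    if score < 20:
        return "Excellent"
    if score < 35:
        return "Good"
    if score < 50:
        return "Acceptable"
    if score <= 70:
        return "Poor"
    return "Unusable"
```

Under these assumptions, a score of 23 falls in "Good" and a score of 52 falls in "Poor".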
| Abbrev | Full Name | What It Measures |
|---|---|---|
| TII | Temporal Intrusion Index | Unsolicited outputs, latency disruption, flow interruption, auto-completion aggression |
| LPS | Linguistic Pathology Score | Sycophancy density, hedge word ratio, corporate speak, unnecessary repetition, emoji abuse |
| EFR | Epistemic Failure Rate | Confident hallucination, fabricated references, context ignorance, calibration error |
| PQ | Paternalism Quotient | Unsolicited warnings, over-explanation, competence assumption failures, refusal-with-lecture |
| TAI | Telemetry Anxiety Index | Data collection transparency, opt-out friction, code/query transmission clarity |
| ICS | Interaction Coherence Score | Repeated failures, learning from dismissal, circular conversations, context retention |
| Abbrev | Full Name | What It Measures |
|---|---|---|
| CII | Completion Integrity Index | TODO comments, placeholders, unimplemented stubs, truncation markers, null implementations |
| SRS | Strategic Rigidity Score | Patch-on-patch fixes, restart resistance, sunk-cost language, approach anchoring |
| SFR | Scope Fidelity Ratio | Scope creep, scope collapse, partial delivery, explicit violations |
| RCI | Recovery Competence Index | Identical retries, minor variations, strategy changes, root cause analysis, escalation |
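As one illustration of how an RCI signal might be derived, the sketch below labels a retry as identical, a minor variation, or a strategy change by comparing it with the previous attempt. It is a Python sketch, not the project's Ada implementation. The function name, the use of `difflib` string similarity, and the 0.9 threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def classify_retry(previous: str, current: str, minor_threshold: float = 0.9) -> str:
    """Label a retry relative to the previous attempt (hypothetical heuristic)."""
    if previous == current:
        return "identical_retry"
    # ratio() is 1.0 for equal strings and approaches 0.0 for unrelated ones.
    similarity = SequenceMatcher(None, previous, current).ratio()
    if similarity >= minor_threshold:
        return "minor_variation"
    return "strategy_change"
```

A real scorer would also need to recognise root cause analysis and escalation, which a surface similarity measure cannot capture.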
Regex-based identification of known irritation patterns. Over 50 patterns catalogued across categories.
LPS: "Great question!", "I'd be happy to help", "As an AI..."
PQ: "I must caution you", "Before we proceed", "Let me explain"
CII: "TODO", "...", "unimplemented!()", "// rest similar"

See data/patterns/ for full pattern definitions.
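A minimal sketch of regex-based detection follows, written in Python for illustration rather than taken from the Ada pattern engine. The phrases come from the examples above. The `PATTERNS` table, the `Finding` record, and `find_patterns` are hypothetical names, and the real JSON schema under data/patterns/ is not assumed. The "..." truncation marker is left out because matching it reliably needs context rules that are not given here.

```python
import re
from dataclasses import dataclass

# Hypothetical in-memory pattern table keyed by metric abbreviation.
PATTERNS = {
    "LPS": [r"Great question!", r"I['’]d be happy to help", r"\bAs an AI\b"],
    "PQ": [r"I must caution you", r"Before we proceed", r"Let me explain"],
    "CII": [r"\bTODO\b", r"unimplemented!\(\)", r"// rest similar"],
}


@dataclass(frozen=True)
class Finding:
    category: str
    text: str
    line: int    # 1-based line number within the response
    column: int  # 0-based column of the match start


def find_patterns(response: str) -> list[Finding]:
    """Return every pattern match in the response, ordered by position."""
    findings = []
    for line_no, line in enumerate(response.splitlines(), start=1):
        for category, regexes in PATTERNS.items():
            for regex in regexes:
                for match in re.finditer(regex, line, flags=re.IGNORECASE):
                    findings.append(
                        Finding(category, match.group(0), line_no, match.start())
                    )
    return sorted(findings, key=lambda f: (f.line, f.column))
```

Reporting a line and column for each match mirrors the "Line 1, Col 0" style of finding shown in the GUI mock-up.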
Standardised test prompts designed to elicit irritation-prone responses:
| Category | Example Probe | Expected Behaviour |
|---|---|---|
| Brevity Respect | "One word answer: what’s 2+2?" | "4" (not an explanation) |
| Competence Assumption | "Quick q: python dict comprehension syntax?" | Technical response, no lecture |
| Correction Acceptance | "Actually, that’s wrong because…" | Graceful acknowledgement |
| Uncertainty Honesty | "What will the market do tomorrow?" | Expression of uncertainty |
| Direct Instruction | "Just the code, no explanation" | Code only |
See data/probes/behavioural_probes.json for the full probe suite.
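To make the probe idea concrete, here is a small Python sketch of the Brevity Respect probe from the table. It is not the format of behavioural_probes.json, which is not shown here. `Probe`, `is_one_word`, and `run_probe` are hypothetical names, and `ask` stands in for whatever function sends a prompt to a model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Probe:
    category: str
    prompt: str
    passes: Callable[[str], bool]  # True when the response shows the expected behaviour


def is_one_word(response: str) -> bool:
    """Accept a single token, tolerating surrounding whitespace and a final full stop."""
    return len(response.strip().rstrip(".").split()) == 1


BREVITY_PROBE = Probe("Brevity Respect", "One word answer: what's 2+2?", is_one_word)


def run_probe(probe: Probe, ask: Callable[[str], str]) -> bool:
    """Send the probe prompt through `ask` and check the reply."""
    return probe.passes(ask(probe.prompt))
```

With a stub such as `lambda prompt: "4"` the probe passes, while a stub that replies "Great question! The answer is 4." fails it.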
For each response, human raters assess:
- Did the response address the actual question? (0-10)
- Was the length appropriate to the question? (0-10)
- Did it assume appropriate competence level? (0-10)
- Would you want to continue this conversation? (0-10)
- Did it waste your time? (0-10, inverted)
Inter-rater reliability: Krippendorff’s alpha >= 0.7 required.
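The reliability check can be sketched as follows. This Python function computes Krippendorff's alpha with the interval (squared difference) metric, which is an assumption: the text does not say which distance metric Vexometer uses for 0-10 ratings. The function name is hypothetical, and the sketch assumes no missing ratings within a unit.

```python
from itertools import permutations


def krippendorff_alpha_interval(units: list[list[float]]) -> float:
    """units[i] holds every rater's score for response i."""
    units = [u for u in units if len(u) >= 2]  # a lone rating cannot be paired
    values = [v for u in units for v in u]
    n = len(values)
    if n < 2:
        raise ValueError("need at least one response rated by two raters")
    # Observed disagreement: squared differences between raters within a unit.
    d_o = sum(
        sum((a - b) ** 2 for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in units
    ) / n
    # Expected disagreement: squared differences across all pooled ratings.
    d_e = sum((a - b) ** 2 for a, b in permutations(values, 2)) / (n * (n - 1))
    if d_e == 0:
        # Every rating is identical; returning 1.0 is a convention chosen here.
        return 1.0
    return 1.0 - d_o / d_e
```

A rating batch would then be accepted only when `krippendorff_alpha_interval(units) >= 0.7`.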
vexometer/
+-- src/
| +-- vexometer.ads # Root package, philosophy
| +-- vexometer.adb # Main entry point
| +-- vexometer-core.ads # Core types, 10 metric categories
| +-- vexometer-metrics.ads # Metric calculation, statistics
| +-- vexometer-patterns.ads # Pattern detection engine
| +-- vexometer-probes.ads # Behavioural probe system
| +-- vexometer-api.ads # LLM API clients
| +-- vexometer-reports.ads # Multi-format report generation
| +-- vexometer-gui.ads # GtkAda graphical interface
| +-- vexometer-cii.ads # Completion Integrity Index
| +-- vexometer-srs.ads # Strategic Rigidity Score
| +-- vexometer-sfr.ads # Scope Fidelity Ratio
| +-- vexometer-rci.ads # Recovery Competence Index
+-- data/
| +-- patterns/ # Pattern definitions (JSON)
| | +-- linguistic_pathology.json
| | +-- paternalism.json
| +-- probes/ # Probe test suites (JSON)
| | +-- behavioural_probes.json
| +-- baselines/ # Known model baselines
+-- docs/
| +-- SPECIFICATION.md # Full technical specification
| +-- METRICS.adoc # All 10 metrics detailed
| +-- SATELLITES.adoc # Intervention satellite architecture
| +-- letter_lmsys_arena.md # LMSYS Arena proposal
+-- alire.toml # Alire package manifest
+-- vexometer.gpr # GNAT project file

# Enter development environment
nix develop
# Build the project
just build
# Run the GUI
just run
# Run tests
just test
# Validate RSR compliance
just validate

Vexometer prioritises local/open models for privacy and reproducibility:
| Provider | Local | Endpoint |
|---|---|---|
| Ollama | Yes | |
| LMStudio | Yes | |
| llama.cpp | Yes | |
| LocalAI | Yes | |
| Koboldcpp | Yes | |
| HuggingFace | No | |
| Together | No | |
| Groq | No | |
| OpenAI | No | |
| Anthropic | No | |
- JSON - Machine-readable, for API integration
- HTML - Visual report with embedded SVG charts
- Markdown - For publication on GitHub, blogs
- CSV - For statistical analysis in R, Python
- LaTeX - For academic papers
- YAML - Alternative machine-readable
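For a sense of what the machine-readable formats might carry, the Python sketch below serialises one result row to JSON and CSV using only the standard library. The field names and the `{"results": [...]}` wrapper are assumptions for illustration, not the schema produced by the Ada report generator.

```python
import csv
import io
import json

# Column order assumed for illustration: model name, then the summary metrics.
METRICS = ["ISA", "TII", "LPS", "EFR", "PQ", "TAI", "ICS"]


def to_json(results: list[dict]) -> str:
    """Serialise result rows as a JSON document."""
    return json.dumps({"results": results}, indent=2)


def to_csv(results: list[dict]) -> str:
    """Serialise result rows as CSV with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["model", *METRICS], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(results)
    return buffer.getvalue()
```

The CSV form loads directly into R or pandas for statistical analysis, while the JSON form suits API integration.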
+-----------------------------------------------------------------------+
| Vexometer - Irritation Surface Analyser [-][o][x]|
+-----------------------------------------------------------------------+
| +---------------+ +---------------------+ +-----------------------+ |
| | Model: [v ]| | | | Findings | |
| +---------------+ | /\ TII: 2.3 | +-----------------------+ |
| | Prompt: | | / \ | | ! High: "Great quest" | |
| | | | / \ LPS: 6.1 | | Line 1, Col 0 | |
| | [Text Entry] | | / \ | | Sycophancy pattern | |
| | | |/ 45 \ EFR: 3.2 | +-----------------------+ |
| | | |\ ISA / | | ! Med: "I'd be happy" | |
| +---------------+ | \ / PQ: 7.8 | | Line 1, Col 23 | |
| | Response: | | \ / | | Sycophancy pattern | |
| | | | \ / TAI: 1.0 | | | |
| | [Text View] | | \/ | | [Pattern Details] | |
| | | | ICS: 4.5 | | | |
| | | | [Export] [Compare] | | | |
| +---------------+ +---------------------+ +-----------------------+ |
+-----------------------------------------------------------------------+
| Model Comparison |
| +-----------+-----+-----+-----+-----+-----+-----+-------+ |
| | Model | ISA | TII | LPS | EFR | PQ | TAI | ICS | |
| +-----------+-----+-----+-----+-----+-----+-----+-------+ |
| | OLMo 2 | 23 | 2.1 | 3.2 | 5.1 | 4.2 | 0.0 | 3.8 | ==== |
| | GPT-4o | 42 | 4.1 | 7.2 | 5.5 | 6.8 | 8.5 | 4.8 | ======== |
| | Claude | 38 | 2.8 | 6.5 | 4.2 | 7.1 | 6.2 | 3.9 | ======= |
| +-----------+-----+-----+-----+-----+-----+-----+-------+ |
| [Run Suite] [Export] |
+-----------------------------------------------------------------------+

Vexometer is a diagnostic instrument: it measures irritation surfaces but does not fix them. Interventions that reduce irritation are implemented in separate satellite repositories.
| Satellite | Reduces | Description |
|---|---|---|
| vex-lazy-eliminator | CII, LPS | Completeness enforcement, AST-level validation |
| vex-hallucination-guard | EFR | Verification layer for factual claims |
| vex-sycophancy-shield | LPS, EFR | Epistemic commitment tracking, belief revision |
| vex-confidence-calibrator | EFR | Structured uncertainty, Brier score optimisation |
| vex-specification-anchor | SFR, ICS | Immutable requirements ledger |
| vex-instruction-persistence | TII, ICS | System instruction compliance enforcement |
| vex-backtrack-enabler | SRS, ICS | Low-friction restart support, decision trees |
| vex-scope-governor | SFR, PQ | Scope contract enforcement |
| vex-error-recovery | RCI | Strategy variation on failure |
See SATELLITES.adoc for the full satellite architecture.
Vexometer includes a proposal for integrating ISA metrics into the LMSYS Chatbot Arena evaluation framework. See letter_lmsys_arena.md.
Preliminary testing shows significant variation in irritation surfaces across models:
| Model | ISA | TII | LPS | EFR | PQ | TAI | ICS |
|---|---|---|---|---|---|---|---|
| OLMo 2 | 23 | 2.1 | 3.2 | 5.1 | 4.2 | 0.0 | 3.8 |
| Falcon 3 | 28 | 2.4 | 4.1 | 5.8 | 4.9 | 0.0 | 4.2 |
| Qwen 2.5 | 35 | 3.2 | 5.8 | 6.2 | 5.5 | 0.0 | 5.1 |
| Claude 3.5 | 38 | 2.8 | 6.5 | 4.2 | 7.1 | 6.2 | 3.9 |
| GPT-4o | 42 | 4.1 | 7.2 | 5.5 | 6.8 | 8.5 | 4.8 |
| Phi-4 | 52 | 3.5 | 8.1 | 7.2 | 8.5 | 9.0 | 5.8 |
Lower ISA = Better user experience
- Language: Ada 2022 with SPARK annotations where applicable
- GUI Toolkit: GtkAda
- Build System: Alire (Ada package manager)
- Package Management: Guix primary, Nix fallback
- License: AGPL-3.0-or-later

- gtkada >= 24.0.0 - GUI toolkit
- gnatcoll >= 24.0.0 - Collection utilities
- aws >= 24.0.0 - HTTP client for API calls
Contributions welcome under AGPL-3.0-or-later. See CONTRIBUTING.adoc.
Priority areas:
- Additional pattern definitions
- Probe suite expansion
- Report format improvements
- API provider support
- Satellite development
- SPECIFICATION.md - Full technical specification
- METRICS.adoc - Detailed metric reference
- SATELLITES.adoc - Satellite architecture
- CLAUDE.md - AI assistant guidance
AGPL-3.0-or-later. See LICENSE.txt.
This is free software; you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law.