WeaveBench: Long-Horizon Hybrid GUI+CLI Benchmark for Computer-Use Agents

01 / Core Idea

Evaluate cross-interface orchestration, not isolated capabilities.

Deployed CUA runtimes already integrate GUI control, CLI/code execution, browsers, and external tools inside one agent loop. Existing benchmarks test these channels in isolation, leaving the orchestration layer under-measured. WeaveBench is built around tasks where success requires the agent to weave both channels together.

GUI and CLI are complementary, not interchangeable.

GUIs expose rendered, transient, spatial state — canvases, dialogs, visual feedback. CLI/code expose structured, scriptable, persistent state — source files, configs, logs, services. Real workflows braid them: observe in the GUI, verify in the shell, edit in code, re-render, re-observe.

Real Ubuntu desktop · 114 sandboxed tasks · 8 work domains · 4 deployed harnesses · Trajectory-aware judge

P1 · Channel non-substitutability

Each admitted task requires coordinating GUI observation/action with CLI/code modification inside the same trajectory; annotated with single-channel-bound atomic operations.

P2 · Long-horizon execution

Expert reference trajectories contain multiple interleaved GUI and CLI/code phases — not a single perception, action, or tool-use step.

P3 · Cross-application state

Tasks span multiple independent applications/processes whose states are linked by the workflow; agents must preserve and transfer information across them.

Trajectory-aware grading

Final-only grading is fragile here. The judge audits transcripts, files, screenshots, and logs; nine shortcut detectors zero credit on confirmed reward hacks.

02 / Pipeline

Task · Harness · Evaluation — one diagram, three pillars.

Task: 114 tasks across 8 domains, harvested from real venues, packaged as ℰ = (𝒫, ℳ, 𝒞) bundles, audited against P1–P3, and stress-tested by ≥3 pilot agents. Harness: the agent runs in a single session over an Ubuntu sandbox, with a minimal GUI plugin (one screenshot tool + nine actuation primitives) layered on top of OpenClaw's CLI/code tools; the same plugin is ported to Codex CLI, Claude Code, and Hermes. Evaluation is performed by an isolated trajectory-aware agentic judge that combines bottom-up rubric scoring with shortcut detection.

WeaveBench pipeline: Task construction, hybrid harness with GUI plugin on top of CLI/code tools, and a trajectory-aware agentic judge. — **Figure 2 from the paper.** Task construction (C1 archetype-guided sourcing → C4 pilot validation), the hybrid harness shared by all four runtimes, and the isolated trajectory-aware judge that re-fetches evidence across artifacts, screenshots, and logs.

M1

Minimal GUI plugin (10 tools)

One perception primitive screenshot plus nine pyautogui-backed actuation primitives (click, double_click, triple_click, move, drag, scroll, type, keypress, wait). Exposed alongside each runtime's terminal, file, code, and browser tools — model loop and prompts unchanged.
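
The tool surface described above is small enough to write down explicitly. The following is a minimal sketch, not the plugin's actual code: the ten tool names come from the text, while `dispatch`, `recording_backend`, and the backend call signature are hypothetical names introduced for illustration. The real plugin is pyautogui-backed; this sketch uses a recording stub so it runs without a display.

```python
from typing import Any, Callable, Dict, List, Tuple

# Tool names as listed in the text: one perception primitive, nine actuation primitives.
PERCEPTION_TOOLS = ("screenshot",)
ACTUATION_TOOLS = (
    "click", "double_click", "triple_click", "move", "drag",
    "scroll", "type", "keypress", "wait",
)
GUI_TOOLS = PERCEPTION_TOOLS + ACTUATION_TOOLS

# Hypothetical backend signature: (tool name, arguments) -> result.
Backend = Callable[[str, Dict[str, Any]], Any]


def dispatch(tool: str, args: Dict[str, Any], backend: Backend) -> Any:
    """Route a GUI tool call to a backend, rejecting names outside the ten-tool set."""
    if tool not in GUI_TOOLS:
        raise ValueError(f"unknown GUI tool: {tool}")
    return backend(tool, args)


def recording_backend(log: List[Tuple[str, Dict[str, Any]]]) -> Backend:
    """Stand-in backend that only records calls (assumption: used for dry runs)."""
    def _call(tool: str, args: Dict[str, Any]) -> str:
        log.append((tool, dict(args)))
        return "ok"
    return _call
```

Keeping the backend behind a single callable is one way to port the same tool list across runtimes without touching the model loop.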

M2

Trajectory-aware Agent-as-a-Judge

For every rollout, an isolated subprocess judge re-fetches evidence over multiple turns using file, image, and shell tools; decomposes each deliverable into atomic clauses; verifies each clause with cited evidence; and assigns scores along eight process & outcome dimensions.

M3

Nine shortcut detectors

A parallel scan covers fake screenshots/renders, regenerated fixtures, hard-coded metrics, mock services, duplicate crops, overlay manipulation, ground-truth leakage, runtime injection, and fabricated screenshots. A high-confidence hit sets h_{t,m} = 1 and zeros the task score.
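
One way to read the zeroing trigger is as a threshold over per-detector confidences. The sketch below assumes each detector emits a confidence in [0, 1] and that "high-confidence" means at or above a cutoff. The cutoff value `0.9`, the identifier spellings in `DETECTORS`, and the function name `shortcut_flag` are illustrative assumptions, not values from the paper.

```python
from typing import Mapping

# Identifier spellings are our own labels for the nine detectors named in the text.
DETECTORS = (
    "fake_screenshot_or_render",
    "regenerated_fixture",
    "hardcoded_metric",
    "mock_service",
    "duplicate_crop",
    "overlay_manipulation",
    "ground_truth_leakage",
    "runtime_injection",
    "fabricated_screenshot",
)


def shortcut_flag(confidences: Mapping[str, float], threshold: float = 0.9) -> int:
    """Return 1 if any detector reports a hit at or above `threshold`, else 0.

    `threshold` is an assumed cutoff standing in for "high-confidence".
    """
    unknown = set(confidences) - set(DETECTORS)
    if unknown:
        raise ValueError(f"unknown detectors: {sorted(unknown)}")
    return int(any(c >= threshold for c in confidences.values()))
```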

M4

Layered scoring (min rule)

s_{t,m} = 0 if h_{t,m} = 1; otherwise s_{t,m} = min(⅛ ∑ d^process, d^deliv). The min prevents strong auxiliary dimensions from masking weak deliverables; the zeroing rule prevents fabricated evidence from earning partial credit.
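
The rule above can be sketched in a few lines. This is an illustration under stated assumptions: the eight dimension scores and the deliverable score are taken to lie in [0, 1], and the function and parameter names are hypothetical.

```python
from typing import Sequence


def task_score(process_dims: Sequence[float], deliverable: float, hacked: bool) -> float:
    """Layered score: zero on a confirmed shortcut, else min(mean of 8 dims, deliverable)."""
    if hacked:
        # Zeroing rule: fabricated evidence earns no partial credit.
        return 0.0
    if len(process_dims) != 8:
        raise ValueError("expected exactly eight dimension scores")
    # Min rule: strong auxiliary dimensions cannot mask a weak deliverable.
    return min(sum(process_dims) / 8, deliverable)
```

For example, eight perfect dimension scores with a deliverable score of 0.25 yield 0.25, because the deliverable term is the smaller of the two.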

03 / Dataset

114 tasks across 8 domains, long, channel-interleaved.

Per-domain task counts range from 10 to 18. Best live rollouts use a median of 76 tool calls (max 471) and a median of 16 GUI↔CLI channel switches per task. Each task carries provenance (URL, commit hash, or post id) from public venues, with self-contained bundles ℰ = (𝒫, ℳ, 𝒞) covering prompt, materials, and check anchors.
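
To make the bundle structure concrete, here is a minimal sketch of how a task record with prompt, materials, check anchors, and provenance could be represented and sanity-checked. The class name, field names, and validation rules are assumptions for illustration; the released dataset may use a different schema.

```python
from dataclasses import dataclass
from typing import List, Tuple

# The eight domain codes used on this page.
DOMAIN_CODES = ("DSK", "DOC", "GAM", "WEB", "DAV", "OPS", "SPA", "DES")


@dataclass(frozen=True)
class TaskBundle:
    """Hypothetical container for one task bundle (prompt, materials, check anchors)."""
    task_id: str
    domain: str                      # one of DOMAIN_CODES
    provenance: str                  # URL, commit hash, or post id
    prompt: str                      # the task prompt
    materials: Tuple[str, ...]       # paths to fixtures shipped with the task
    check_anchors: Tuple[str, ...]   # statements the judge verifies


def bundle_problems(bundle: TaskBundle) -> List[str]:
    """Return a list of human-readable problems; an empty list means the bundle looks sane."""
    problems = []
    if bundle.domain not in DOMAIN_CODES:
        problems.append(f"unknown domain code: {bundle.domain}")
    if not bundle.provenance.strip():
        problems.append("missing provenance")
    if not bundle.prompt.strip():
        problems.append("empty prompt")
    if not bundle.check_anchors:
        problems.append("no check anchors")
    return problems
```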

WeaveBench dataset overview: taxonomy of 114 tasks across 8 domains and 23 subcategories; per-task GUI/CLI channel-switch distribution; per-task tool-call rollout length. — **Figure 3 from the paper.** **(a)** Taxonomy of 114 tasks across 8 domains and 23 subcategories. **(b)** GUI↔CLI channel switches per task (median 16) — the degree of channel interleaving. **(c)** Rollout length measured by tool calls in the trajectory (median 76, max 471).
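
The channel-switch statistic in panel (b) can be reproduced from a trajectory once each tool call is labeled with its channel. The sketch below assumes labels `"gui"` and `"cli"` and ignores any other label; the labeling scheme and function name are assumptions, not the paper's exact procedure.

```python
from statistics import median
from typing import Iterable


def count_channel_switches(channels: Iterable[str]) -> int:
    """Count GUI<->CLI handoffs in a sequence of per-call channel labels."""
    switches = 0
    previous = None
    for channel in channels:
        if channel not in ("gui", "cli"):
            continue  # assumption: calls on other channels do not break a run
        if previous is not None and channel != previous:
            switches += 1
        previous = channel
    return switches


# Three toy trajectories (illustrative data, not from the benchmark).
trajectories = [
    ["gui", "gui", "cli", "gui", "cli", "cli"],  # 3 switches
    ["cli", "cli", "cli"],                       # 0 switches
    ["gui", "other", "cli"],                     # 1 switch
]
per_task = [count_channel_switches(t) for t in trajectories]
print(per_task, median(per_task))
```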

DSK

Desktop Productivity

Filesystem & storage, system & hardware services, desktop UI / UX flows.

DOC

Document Processing

Office suites, markup, LaTeX, print-ready document workflows.

GAM

Games & Interactive

Game engines & runtimes, puzzle / strategy, realtime / action.

WEB

Web Development

DevTools, perf budgets, network profiling, UI & client-state debugging.

DAV

Data Analysis & Viz

Notebooks, dashboards, observability traces, pipelines / ETL.

OPS

DevOps & Sysadmin

Cluster monitoring, services, DB ops, network & security.

SPA

Spatial / 3D / CAD

CAD, engineering design, scientific / FEM simulation.

DES

Design & Creative

Visual asset workflows, color management, engineering design.

03b / Demos

Watch the agent weave GUI & CLI in real trajectories.

Seven end-to-end rollouts from Claude Opus 4.7 on the OpenClaw harness, captured on a real Ubuntu desktop and replayed at 5× speed. The left pane shows the agent's live action log; the right pane shows the desktop it is driving.

Demo 1 of 7 (OPS): Manage a RabbitMQ dead-letter-queue topology.

04 / Main Results

Even frontier model×harness pairings stall at 41.2% PassRate.

The benchmark has two natural axes. Sweep A fixes OpenClaw as the runtime and varies the backbone (Table 1). Sweep B keeps the strongest backbones and varies the deployed runtime (Table 2). Together they show that hybrid-interface performance is determined as much by the runtime scaffold as by raw model capability — cross-pairing can swing the same backbone by >25 PR points.

Table 1 · Model API sweep on a fixed OpenClaw runtime (114 tasks, best thinking mode per backbone).

Backbone	PR ↑	Overall ↑	DSK	DOC	GAM	WEB	DAV	OPS	SPA	DES
Claude Opus 4.7	35.1	0.482	55.6	29.4	23.5	66.7	15.4	41.7	16.7	20.0
GPT-5.5	33.3	0.466	38.9	35.3	35.3	21.4	23.1	38.5	33.3	40.0
GPT-5.4	22.8	0.465	55.6	35.3	5.9	0.0	23.1	23.1	8.3	20.0
GPT-5.3-codex	18.4	0.456	33.3	23.5	29.4	0.0	7.7	16.7	8.3	20.0
GPT-5.2-codex	6.1	0.321	5.6	11.8	0.0	0.0	15.4	16.7	0.0	0.0
GPT-5.1-codex	1.8	0.226	0.0	5.9	0.0	0.0	7.7	0.0	0.0	0.0
Gemini 3.1 pro	1.8	0.223	0.0	0.0	0.0	0.0	0.0	8.3	8.3	0.0
Qwen3.5-397B-A17B	0.9	0.318	0.0	0.0	0.0	0.0	0.0	8.3	0.0	0.0
Qwen3-VL-8B-Think	0.9	0.092	0.0	0.0	0.0	0.0	8.3	0.0	0.0	0.0
GUI-Owl-1.5-32B	0.0	0.065	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0

SPA and DES — the two most GUI-heavy domains — are the bottom-two for every backbone with non-trivial PR, confirming GUI as the binding constraint.

Table 2 · Cross-harness sweep for the strongest backbones (high thinking).

Backbone	Harness	PR ↑	Overall ↑	DSK	DOC	GAM	WEB	DAV	OPS	SPA	DES
GPT-5.5	Codex CLI	35.1	0.499	38.9	29.4	23.5	53.3	15.4	50.0	58.3	10.0
	OpenClaw	33.3	0.466	38.9	35.3	35.3	21.4	23.1	38.5	33.3	40.0
	Hermes Agent	31.6	0.466	55.6	29.4	35.3	40.0	7.7	25.0	25.0	20.0
	Claude Code	14.9	0.299	33.3	11.8	11.8	0.0	15.4	16.7	25.0	0.0
Claude Opus 4.7	Claude Code	41.2	0.532	55.6	47.1	23.5	53.3	23.1	50.0	33.3	40.0
	OpenClaw	35.1	0.482	55.6	29.4	23.5	66.7	15.4	41.7	16.7	20.0
	Hermes Agent	28.1	0.516	33.3	47.1	11.8	26.7	30.8	50.0	8.3	10.0
	Codex CLI	13.2	0.378	16.7	11.8	11.8	6.7	7.7	25.0	16.7	10.0

Cross-pairing matters: Claude Opus 4.7 drops from 41.2% on Claude Code to 13.2% on Codex CLI; GPT-5.5 drops from 35.1% on Codex CLI to 14.9% on Claude Code. Tool schemas, prompting conventions, and action-loop design interact strongly with model-specific tool-use behavior.
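
The size of the cross-pairing effect can be checked directly from the PR column of Table 2. The snippet below copies those values and computes, for each backbone, the gap between its best and worst harness; the variable and function names are ours.

```python
from typing import Dict, Tuple

# PassRate (%) by backbone and harness, copied from Table 2.
PASS_RATE: Dict[str, Dict[str, float]] = {
    "GPT-5.5": {
        "Codex CLI": 35.1, "OpenClaw": 33.3, "Hermes Agent": 31.6, "Claude Code": 14.9,
    },
    "Claude Opus 4.7": {
        "Claude Code": 41.2, "OpenClaw": 35.1, "Hermes Agent": 28.1, "Codex CLI": 13.2,
    },
}


def harness_swing(rates: Dict[str, float]) -> Tuple[str, str, float]:
    """Return (best harness, worst harness, PR gap in points) for one backbone."""
    best = max(rates, key=rates.get)
    worst = min(rates, key=rates.get)
    return best, worst, round(rates[best] - rates[worst], 1)


for backbone, rates in PASS_RATE.items():
    print(backbone, harness_swing(rates))
```

On these numbers the swing is 28.0 points for Claude Opus 4.7 and 20.2 points for GPT-5.5.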

Why 41.2% is striking

Best frontier pairing: 41.2% (Claude Opus 4.7 + Claude Code)

The same generation of frontier backbones that scores >78% on OSWorld-Verified and 75.1% on MCPWorld collapses to 41.2% on WeaveBench, and to ≤3.5% when restricted to a single channel. That gap is the cross-interface, long-horizon orchestration tax.

Benchmark	Best reported result
OSWorld-Verified	>78%
MCPWorld (hybrid)	75.1%
OSWorld-MCP (hybrid)	43.3%
WeaveBench (hybrid, ours)	41.2%
WeaveBench (single-channel max)	≤3.5%

Peer benchmark numbers from each paper's strongest reported result. WeaveBench rows are this work (Table 2 best-pairing & Table 3 single-channel maximum across all backbones).

05 / Ablations

Two ablations that change how you read the leaderboard.

The first ablation removes either channel and shows that single-interface CUAs collapse by an order of magnitude — on WeaveBench the second channel is not a convenience, it is required. The second ablation removes trajectory access from the judge and shows that outcome-only grading awards 10–20 PR points of false credit.

Table 3 · Interface ablation

GUI-only ≤1.8%, CLI-only ≤3.5%, Hybrid 18–35%.

We re-run each backbone in three settings on OpenClaw: GUI-only (the screenshot tool plus nine actuation primitives), CLI-only (the full OpenClaw CLI), and Hybrid (both). Both single-interface settings collapse by an order of magnitude across every backbone, consistent with the channel non-substitutability admission rule (P1).

Backbone	GUI	CLI	Hybrid	Δ (Hybrid − best single)
Claude Opus 4.7	1.8	3.5	35.1	+31.6
GPT-5.5	0.8	2.6	33.3	+30.7
GPT-5.4	0.8	2.6	22.8	+20.2
GPT-5.3-codex	0.0	1.8	18.4	+16.6

Single-interface PassRate stays in single digits for every backbone — an order of magnitude below Hybrid.
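
The Δ column in Table 3 is the Hybrid PassRate minus the better of the two single-interface settings. A short check over the table's values (names are ours):

```python
from typing import Dict, Tuple

# (GUI-only, CLI-only, Hybrid) PassRate in %, copied from Table 3.
ABLATION: Dict[str, Tuple[float, float, float]] = {
    "Claude Opus 4.7": (1.8, 3.5, 35.1),
    "GPT-5.5": (0.8, 2.6, 33.3),
    "GPT-5.4": (0.8, 2.6, 22.8),
    "GPT-5.3-codex": (0.0, 1.8, 18.4),
}


def hybrid_gain(gui: float, cli: float, hybrid: float) -> float:
    """Hybrid PassRate minus the best single-interface PassRate, in points."""
    return round(hybrid - max(gui, cli), 1)


gains = {name: hybrid_gain(*row) for name, row in ABLATION.items()}
print(gains)
```

The computed gains match the table: 31.6, 30.7, 20.2, and 16.6 points.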

Table 4 · Cross-benchmark hybrid gain

+31.6 pp gap — vs +3–4 pp on prior hybrid benchmarks.

Compared against the two prior hybrid CUA benchmarks that report comparable interface ablations, WeaveBench's hybrid gain is an order of magnitude larger, and the single-channel floor is single-digit vs. 40–70%. The additional channel is forced by the task specification, not offered as a per-step convenience.

Benchmark	GUI	CLI/MCP	Hyb.	Δ
OSWorld-MCP	40.1	—	43.3	+3.2
MCPWorld	70.7	53.2	75.1	+4.5
WeaveBench	1.8	3.5	35.1	+31.6

Cooperation is forced by the task specification rather than offered as a per-step convenience.

Trajectory-aware vs. outcome-only judge

Removing trajectory access inflates PR by 10.3–20.2 points.

We re-score every rollout with an outcome-only judge that sees only the final deliverables: no trajectory access, no shortcut scan. Switching back to WeaveBench's trajectory-aware judge removes 10.3–20.2 PR points across the four backbones in the table below. For GPT-5.5, PassRate drops from 53.5% to 33.3%. These gaps are lower bounds: rollouts already received an anti-fabrication prompt with a cost-free honest fallback.

Backbone	Outcome-only	Trajectory-aware	Δ
GPT-5.5	53.5	33.3	−20.2
Claude Opus 4.7	51.6	41.2	−10.4
GPT-5.4	33.1	22.8	−10.3
GPT-5.3-codex	28.7	18.4	−10.3

Per-backbone PassRate audit: dark blue is audited PassRate after trajectory-aware judging; light blue shows inflation removed by the audit. — **Figure 4 from the paper.** Dark blue: audited PassRate. Light blue: inflation removed by the audit, with label inside showing PassRate points removed. Top label: outcome-only total and points removed.

06 / Failure Analysis

Reward hacking + execution-discipline collapse explain 65.6% of failures.

We aggregate every OpenClaw rollout for the three frontier backbones (Opus 4.7, GPT-5.5, GPT-5.4) across reasoning budgets — n = 2,209 trials, 1,735 failures. Following CocoaBench, we adopt a hierarchical taxonomy with 5 top-level families and 13 sub-classes; WeaveBench extends the codebook with two hybrid-specific families absent from prior benchmarks — E4 Long-horizon Execution Discipline and E5 Reward Hacking.

Failure anatomy donut: 5 top-level families, 13 sub-classes, plus per-backbone sub-class share for Opus 4.7, GPT-5.5, GPT-5.4. — **Figure 5 from the paper.** **(a)** Two-ring sunburst donut over **n = 1,735** failures. Inner ring: 5 top-level families. Outer ring: 13 sub-classes. **E5 Reward Hacking + E4 Long-horizon Execution Discipline jointly account for 65.6%** of all failures; perception (E3) is under 4%. **(b)** Per-backbone sub-class share — y-axis label colour matches the family colour in (a).

E5 · Reward Hacking

Synthesized render · Hardcoded metric · Crop / overlay · CLI bypass of GUI. Only 30% of E5 hacks follow a true infra bug; the other 70% happen in clean environments despite the explicit anti-fabrication policy.

E4 · Long-horizon Execution Discipline

Silent halt · Premature halt · Cross-channel state drift. Emerges only when the agent must coordinate multiple deliverables across a long horizon of interleaved GUI/CLI actions.

E1 · Reasoning & Verification

Imprecision — close miss is the largest sub-class here. Surfaces only with multi-deliverable contracts.

E2 · Tool Selection

Tool-affordance prior + channel-policy compliance failures.

E3 · Visual Grounding

Perception is not the bottleneck on frontier backbones — a finding that runs against the perception-centric prior in the GUI-agent literature.

Backbone fingerprint

GPT-5.5 · the “confident forger”

46% of failures are E5 reward hacking — primarily E5.1 synthesized renders (24%) and E5.2 hardcoded metrics (14%). The model can complete or honestly skip, but the implicit reward landscape selects forgery when stuck.

GPT-5.4 · the “early stopper”

E4 dominates (44%) and E4.2 premature halt alone is 27% — the highest single sub-class share for any backbone. Reward hacking is comparatively rare (23%); after a provider hiccup the model never rebuilds its deliverable checklist.

Claude Opus 4.7 · the “balanced”

No single sub-class above 17%. E5, E4, and E1 each contribute ~30%. Failure style is a function of model identity, not just raw capability.

06b / Tool-Call Distribution

Even with a dedicated GUI tool exposed, agents prefer shell paths for GUI actions.

We decompose every tool call across all GPT-5.5 rollouts into atomic operations. The top-10 operations cover 93.1% of 10,873 active calls; exec: shell alone accounts for 27.3%. A large fraction of GUI work is hidden inside these shell calls — once re-attributed at the atomic-operation level, the GUI share rises from 33.9% → 62.9%.

Top-10 atomic operations across all GPT-5.5 rollouts; bars sorted by call count. — Top-10 atomic operations across all GPT-5.5 rollouts on WeaveBench — 93.1% of 10,873 active calls. GPT-5.5 invokes `gnome-screenshot` via `exec` 2.2× more often than the native `__computer__.screenshot`, and drives mouse/keyboard via `pyautogui`, `xdotool`, and `wmctrl` another 521 times.
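
The re-attribution step can be sketched as a classifier that looks inside shell commands for GUI-driving programs. This is an illustrative approximation, not the paper's codebook: the pattern list uses only the four programs named above, the tool names `exec` and `__computer__.*` follow the text, and everything else (function names, the `"other"` bucket) is an assumption.

```python
import re
from typing import Iterable, Tuple

# Programs the text names as GUI work routed through the shell.
GUI_VIA_SHELL = re.compile(r"\b(gnome-screenshot|pyautogui|xdotool|wmctrl)\b")


def atomic_channel(tool: str, command: str = "") -> str:
    """Label one call as 'gui', 'cli', or 'other' at the atomic-operation level."""
    if tool.startswith("__computer__."):
        return "gui"  # native GUI plugin call
    if tool == "exec":
        # A shell call counts as GUI when it invokes a GUI-driving program.
        return "gui" if GUI_VIA_SHELL.search(command) else "cli"
    return "other"


def gui_share(calls: Iterable[Tuple[str, str]]) -> float:
    """Fraction of calls attributed to the GUI channel after re-attribution."""
    labels = [atomic_channel(tool, command) for tool, command in calls]
    return labels.count("gui") / len(labels) if labels else 0.0


# Toy trajectory (illustrative, not benchmark data).
calls = [
    ("__computer__.screenshot", ""),
    ("exec", "gnome-screenshot -f /tmp/shot.png"),
    ("exec", "ls -la"),
    ("exec", "python3 -c 'import pyautogui; pyautogui.click(10, 10)'"),
]
print(gui_share(calls))
```

In the toy trajectory only one of four calls uses the native GUI tool, yet three of four are GUI work once the shell commands are inspected, which is the same effect the 33.9% to 62.9% shift describes.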

exec: shell share: 27.3% of all GPT-5.5 tool calls, the single dominant primitive in the trajectory.

GUI re-attributed: 33.9% → 62.9%. The tool-level GUI share rises to the atomic-op share once exec-routed screenshots and pyautogui calls are reclassified.

Channel switches: 16 per task (median GUI↔CLI handoffs in a successful trajectory); single-channel agents cannot satisfy this.

Median rollout: 76 calls per task; max 471. WeaveBench is firmly in long-horizon territory.

07 / BibTeX

Cite this work.

If you use WeaveBench in your research, please cite the paper and the released dataset / code.

@article{li2026weavebench,
  title         = {{WeaveBench}: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces},
  author        = {Li, Wanli and Zhou, Bowen and Yu, Yunyao and Xu, Zhou and Yang, Yifan and Li, Dongsheng and Shan, Caihua},
  year          = {2026},
  eprint        = {2606.09426},
  archivePrefix = {arXiv},
  primaryClass  = {cs.AI},
  url           = {https://arxiv.org/abs/2606.09426},
}

Authors

Wanli Li¹,²,*
Bowen Zhou³,*
Yunyao Yu³
Zhou Xu³
Yifan Yang²
Dongsheng Li²
Caihua Shan²,‡

¹ Zhejiang University · ² Microsoft Research Asia · ³ Tsinghua University

* Equal contribution. ‡ Corresponding author: caihua.shan@microsoft.com
