A long-horizon, real-world benchmark for computer-use agents with hybrid interfaces

WeaveBench

114 tasks across 8 real-world work domains, each requiring an agent to weave GUI observation with CLI/code execution within a single trajectory. Rollouts run inside a real Ubuntu sandbox; a trajectory-aware Agent-as-a-Judge audits eight quality dimensions and zeros credit on any of nine shortcut patterns (synthetic screenshots, hard-coded metrics, mock services…).

One-line install (pip install -e .), then run on OpenRouter. Bundles adapters for OpenClaw, Codex CLI, Claude Code, and Hermes.
Figure 1 from the paper

Three real-world workflows that require interleaved GUI & CLI.

(DAV) Diagnose a Jaeger trace span by inspecting its shape, then patch the upstream timeout via kubectl. (GAM) Play a desktop game to localize a sprite/physics bug, then patch the scene-graph source. (OPS) Catch a 503 spike on a Web Ops dashboard, edit nginx.conf, and re-check the dashboard. Each step alternates between a GUI signal that no API exposes and a CLI/code change that no screenshot can produce.

Three example WeaveBench workflows showing GUI and CLI steps side by side for DAV, GAM, and OPS domains.
01 / Core Idea

Evaluate cross-interface orchestration, not isolated capabilities.

Deployed CUA runtimes already integrate GUI control, CLI/code execution, browsers, and external tools inside one agent loop. Existing benchmarks test these channels in isolation, leaving the orchestration layer under-measured. WeaveBench is built around tasks where success requires the agent to weave both channels together.

GUI and CLI are complementary, not interchangeable.

GUIs expose rendered, transient, spatial state — canvases, dialogs, visual feedback. CLI/code expose structured, scriptable, persistent state — source files, configs, logs, services. Real workflows braid them: observe in the GUI, verify in the shell, edit in code, re-render, re-observe.

Real Ubuntu desktop · 114 sandboxed tasks · 8 work domains · 4 deployed harnesses · Trajectory-aware judge
P1 · Channel non-substitutability

Each admitted task requires coordinating GUI observation/action with CLI/code modification inside the same trajectory, and is annotated with the atomic operations that are bound to a single channel.

P2 · Long-horizon execution

Expert reference trajectories contain multiple interleaved GUI and CLI/code phases — not a single perception, action, or tool-use step.

P3 · Cross-application state

Tasks span multiple independent applications/processes whose states are linked by the workflow; agents must preserve and transfer information across them.

Trajectory-aware grading

Final-only grading is fragile here. The judge audits transcripts, files, screenshots, and logs; nine shortcut detectors zero credit on confirmed reward hacks.

02 / Pipeline

Task · Harness · Evaluation — one diagram, three pillars.

Task: 114 tasks across 8 domains, harvested from real venues, packaged as ℰ = (𝒫, ℳ, 𝒞) bundles, audited against P1–P3, and stress-tested by ≥3 pilot agents. Harness: the agent runs in a single session over an Ubuntu sandbox, with a minimal GUI plugin (one screenshot tool + nine actuation primitives) layered on top of OpenClaw's CLI/code tools; the same plugin is ported to Codex CLI, Claude Code, and Hermes. Evaluation is performed by an isolated trajectory-aware agentic judge that combines bottom-up rubric scoring with shortcut detection.

WeaveBench pipeline: Task construction, hybrid harness with GUI plugin on top of CLI/code tools, and a trajectory-aware agentic judge.
Figure 2 from the paper. Task construction (C1 archetype-guided sourcing → C4 pilot validation), the hybrid harness shared by all four runtimes, and the isolated trajectory-aware judge that re-fetches evidence across artifacts, screenshots, and logs.
M1

Minimal GUI plugin (10 tools)

One perception primitive (screenshot) plus nine pyautogui-backed actuation primitives (click, double_click, triple_click, move, drag, scroll, type, keypress, wait). These are exposed alongside each runtime's terminal, file, code, and browser tools; the model loop and prompts are unchanged.
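As a rough illustration of how small this tool surface is, the sketch below maps the ten tool names onto pyautogui-style calls. The function make_gui_tools, the argument names, and the RecordingBackend stand-in are assumptions made for illustration, not the released plugin. In a real session the backend would be the pyautogui module itself; the stand-in only records calls so the example runs without a display.

```python
import time
from typing import Any, Callable, Dict, List, Tuple


def make_gui_tools(backend: Any) -> Dict[str, Callable[..., Any]]:
    """Map the ten tool names to a pyautogui-style backend (hypothetical wiring)."""
    return {
        # The single perception primitive.
        "screenshot": lambda: backend.screenshot(),
        # The nine actuation primitives.
        "click": lambda x, y: backend.click(x, y),
        "double_click": lambda x, y: backend.doubleClick(x, y),
        "triple_click": lambda x, y: backend.tripleClick(x, y),
        "move": lambda x, y: backend.moveTo(x, y),
        "drag": lambda x, y, duration=0.5: backend.dragTo(x, y, duration=duration),
        "scroll": lambda clicks: backend.scroll(clicks),
        "type": lambda text: backend.write(text),
        "keypress": lambda key: backend.press(key),
        "wait": lambda seconds: time.sleep(seconds),
    }


class RecordingBackend:
    """Stand-in for pyautogui that records calls instead of touching a display."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, tuple, dict]] = []

    def __getattr__(self, name: str) -> Callable[..., None]:
        # Only reached for attributes that do not exist, i.e. the GUI functions.
        def record(*args: Any, **kwargs: Any) -> None:
            self.calls.append((name, args, kwargs))

        return record


backend = RecordingBackend()
tools = make_gui_tools(backend)
tools["click"](120, 340)
tools["type"]("kubectl get pods")
print(len(tools), backend.calls)
```

Passing the backend in, rather than importing pyautogui at module level, keeps the mapping testable on a headless machine.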

M2

Trajectory-aware Agent-as-a-Judge

For every rollout, an isolated subprocess judge re-fetches evidence over multiple turns using file, image, and shell tools; decomposes each deliverable into atomic clauses; verifies each clause with cited evidence; and assigns scores along eight process & outcome dimensions.
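One way to picture the clause-level step is the sketch below. It assumes, purely for illustration, that a deliverable's score is the fraction of its atomic clauses that are both verified and backed by at least one cited piece of evidence; the page does not state the actual aggregation, and the Clause type, field names, and file paths are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Clause:
    """One atomic clause of a deliverable, with the evidence the judge cites."""
    text: str
    verified: bool
    evidence: List[str] = field(default_factory=list)  # e.g. artifact or log paths


def deliverable_score(clauses: List[Clause]) -> float:
    """Fraction of clauses that are verified AND cite evidence (assumed rule)."""
    if not clauses:
        return 0.0
    supported = sum(1 for c in clauses if c.verified and c.evidence)
    return supported / len(clauses)


clauses = [
    Clause("nginx.conf timeout raised", True, ["artifacts/nginx.conf"]),
    Clause("dashboard shows no 503 spike", True, ["screens/after.png"]),
    # Marked verified but cites nothing, so it earns no credit in this sketch.
    Clause("service restarted cleanly", True, []),
]
print(deliverable_score(clauses))
```

Requiring cited evidence per clause mirrors the idea that a claim without a traceable artifact should not count.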

M3

Nine shortcut detectors

A parallel scan covers fake screenshots/renders, regenerated fixtures, hard-coded metrics, mock services, duplicate crops, overlay manipulation, ground-truth leakage, runtime injection, and fabricated screenshots. A high-confidence hit sets $h_{t,m} = 1$ and zeros the task score.
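The flagging step can be sketched as follows. The detector identifiers are paraphrased from the list above, and the 0.9 confidence threshold is an assumption; the page only says that a high-confidence hit sets the shortcut flag.

```python
from typing import Dict

# Identifiers paraphrased from the nine detector categories (names are illustrative).
DETECTORS = (
    "fake_screenshot_or_render",
    "regenerated_fixture",
    "hard_coded_metric",
    "mock_service",
    "duplicate_crop",
    "overlay_manipulation",
    "ground_truth_leakage",
    "runtime_injection",
    "fabricated_screenshot",
)


def shortcut_flag(confidences: Dict[str, float], threshold: float = 0.9) -> int:
    """Return 1 if any detector reports a hit at or above the (assumed) threshold."""
    unknown = set(confidences) - set(DETECTORS)
    if unknown:
        raise ValueError(f"unknown detectors: {sorted(unknown)}")
    return int(any(score >= threshold for score in confidences.values()))


print(shortcut_flag({"mock_service": 0.95}))
print(shortcut_flag({"mock_service": 0.40, "duplicate_crop": 0.20}))
```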

M4

Layered scoring (min rule)

$s_{t,m} = 0$ if $h_{t,m} = 1$; otherwise $s_{t,m} = \min\left(\tfrac{1}{8}\sum d_{\text{process}},\ d_{\text{deliv}}\right)$. The min prevents strong auxiliary dimensions from masking weak deliverables; the zeroing rule prevents fabricated evidence from earning partial credit.
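A minimal sketch of this rule, assuming the eight dimension scores and the deliverable score all lie in [0, 1] (the function and argument names are illustrative):

```python
from typing import Sequence


def task_score(
    dimension_scores: Sequence[float],
    deliverable_score: float,
    shortcut_hit: bool,
) -> float:
    """Layered score: zero on a shortcut hit, else min(mean of 8 dims, deliverable)."""
    if shortcut_hit:
        return 0.0
    if len(dimension_scores) != 8:
        raise ValueError("expected eight dimension scores")
    return min(sum(dimension_scores) / 8, deliverable_score)


# Perfect auxiliary dimensions cannot lift a weak deliverable.
print(task_score([1.0] * 8, 0.4, shortcut_hit=False))
# A confirmed shortcut zeros the task regardless of the other scores.
print(task_score([1.0] * 8, 1.0, shortcut_hit=True))
```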

03 / Dataset

114 tasks across 8 domains, long, channel-interleaved.

Per-domain task counts range from 10 to 18. Best live rollouts use a median of 76 tool calls (max 471) and a median of 16 GUI↔CLI channel switches per task. Each task carries provenance (URL, commit hash, or post id) from public venues, with self-contained bundles ℰ = (𝒫, ℳ, 𝒞) covering prompt, materials, and check anchors.

Figure 3 from the paper. (a) Taxonomy of 114 tasks across 8 domains and 23 subcategories. (b) GUI↔CLI channel switches per task (median 16) — the degree of channel interleaving. (c) Rollout length measured by tool calls in the trajectory (median 76, max 471).
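The channel-switch statistic can be made concrete with a small sketch. It assumes each tool call in a trajectory has already been labeled "gui" or "cli" (how the paper assigns those labels is not specified here) and counts every change between consecutive labeled calls as one switch.

```python
from statistics import median
from typing import Iterable, Optional


def count_channel_switches(channels: Iterable[str]) -> int:
    """Count GUI<->CLI handoffs in a sequence of per-call channel labels."""
    switches = 0
    previous: Optional[str] = None
    for channel in channels:
        if channel not in ("gui", "cli"):
            continue  # ignore calls that belong to neither channel
        if previous is not None and channel != previous:
            switches += 1
        previous = channel
    return switches


# Toy trajectories with made-up labels, not benchmark data.
trajectories = [
    ["gui", "gui", "cli", "cli", "gui", "other", "cli"],
    ["cli", "cli"],
    ["gui", "cli"],
]
per_task = [count_channel_switches(t) for t in trajectories]
print(per_task, median(per_task))
```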
DSK

Desktop Productivity

Filesystem & storage, system & hardware services, desktop UI / UX flows.

DOC

Document Processing

Office suites, markup, LaTeX, print-ready document workflows.

GAM

Games & Interactive

Game engines & runtimes, puzzle / strategy, realtime / action.

WEB

Web Development

DevTools, perf budgets, network profiling, UI & client-state debugging.

DAV

Data Analysis & Viz

Notebooks, dashboards, observability traces, pipelines / ETL.

OPS

DevOps & Sysadmin

Cluster monitoring, services, DB ops, network & security.

SPA

Spatial / 3D / CAD

CAD, engineering design, scientific / FEM simulation.

DES

Design & Creative

Visual asset workflows, color management, engineering design.

03 / Demos

Watch the agent weave GUI & CLI in real trajectories.

Seven end-to-end rollouts from Claude Opus 4.7 on the OpenClaw harness, captured on a real Ubuntu desktop and replayed at 5× speed. The left pane shows the agent's live action log; the right pane shows the desktop it is driving.

04 / Main Results

Even frontier model×harness pairings stall at 41.2% PassRate.

The benchmark has two natural axes. Sweep A fixes OpenClaw as the runtime and varies the backbone (Table 1). Sweep B keeps the strongest backbones and varies the deployed runtime (Table 2). Together they show that hybrid-interface performance is determined as much by the runtime scaffold as by raw model capability — cross-pairing can swing the same backbone by >25 PR points.

Table 1 · Model API sweep on a fixed OpenClaw runtime (114 tasks, best thinking mode per backbone).
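| Backbone | PR ↑ | Overall ↑ | DSK | DOC | GAM | WEB | DAV | OPS | SPA | DES |
|---|---|---|---|---|---|---|---|---|---|---|
| Claude Opus 4.7 | 35.1 | 0.482 | 55.6 | 29.4 | 23.5 | 66.7 | 15.4 | 41.7 | 16.7 | 20.0 |
| GPT-5.5 | 33.3 | 0.466 | 38.9 | 35.3 | 35.3 | 21.4 | 23.1 | 38.5 | 33.3 | 40.0 |
| GPT-5.4 | 22.8 | 0.465 | 55.6 | 35.3 | 5.9 | 0.0 | 23.1 | 23.1 | 8.3 | 20.0 |
| GPT-5.3-codex | 18.4 | 0.456 | 33.3 | 23.5 | 29.4 | 0.0 | 7.7 | 16.7 | 8.3 | 20.0 |
| GPT-5.2-codex | 6.1 | 0.321 | 5.6 | 11.8 | 0.0 | 0.0 | 15.4 | 16.7 | 0.0 | 0.0 |
| GPT-5.1-codex | 1.8 | 0.226 | 0.0 | 5.9 | 0.0 | 0.0 | 7.7 | 0.0 | 0.0 | 0.0 |
| Gemini 3.1 pro | 1.8 | 0.223 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 8.3 | 8.3 | 0.0 |
| Qwen3.5-397B-A17B | 0.9 | 0.318 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 8.3 | 0.0 | 0.0 |
| Qwen3-VL-8B-Think | 0.9 | 0.092 | 0.0 | 0.0 | 0.0 | 0.0 | 8.3 | 0.0 | 0.0 | 0.0 |
| GUI-Owl-1.5-32B | 0.0 | 0.065 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |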

SPA and DES — the two most GUI-heavy domains — are the bottom-two for every backbone with non-trivial PR, confirming GUI as the binding constraint.

Table 2 · Cross-harness sweep for the strongest backbones (high thinking).
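| Backbone | Harness | PR ↑ | Overall ↑ | DSK | DOC | GAM | WEB | DAV | OPS | SPA | DES |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-5.5 | Codex CLI | 35.1 | 0.499 | 38.9 | 29.4 | 23.5 | 53.3 | 15.4 | 50.0 | 58.3 | 10.0 |
| GPT-5.5 | OpenClaw | 33.3 | 0.466 | 38.9 | 35.3 | 35.3 | 21.4 | 23.1 | 38.5 | 33.3 | 40.0 |
| GPT-5.5 | Hermes Agent | 31.6 | 0.466 | 55.6 | 29.4 | 35.3 | 40.0 | 7.7 | 25.0 | 25.0 | 20.0 |
| GPT-5.5 | Claude Code | 14.9 | 0.299 | 33.3 | 11.8 | 11.8 | 0.0 | 15.4 | 16.7 | 25.0 | 0.0 |
| Claude Opus 4.7 | Claude Code | 41.2 | 0.532 | 55.6 | 47.1 | 23.5 | 53.3 | 23.1 | 50.0 | 33.3 | 40.0 |
| Claude Opus 4.7 | OpenClaw | 35.1 | 0.482 | 55.6 | 29.4 | 23.5 | 66.7 | 15.4 | 41.7 | 16.7 | 20.0 |
| Claude Opus 4.7 | Hermes Agent | 28.1 | 0.516 | 33.3 | 47.1 | 11.8 | 26.7 | 30.8 | 50.0 | 8.3 | 10.0 |
| Claude Opus 4.7 | Codex CLI | 13.2 | 0.378 | 16.7 | 11.8 | 11.8 | 6.7 | 7.7 | 25.0 | 16.7 | 10.0 |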

Cross-pairing matters: Claude Opus 4.7 drops from 41.2% on Claude Code to 13.2% on Codex CLI; GPT-5.5 drops from 35.1% on Codex CLI to 14.9% on Claude Code. Tool schemas, prompting conventions, and action-loop design interact strongly with model-specific tool-use behavior.
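The size of the cross-pairing swing can be checked directly from the PassRate column of Table 2. The short sketch below copies those eight values and reports, per backbone, the best harness, the worst harness, and the gap in PR points; the helper name harness_swing is illustrative.

```python
from typing import Dict, Tuple

# PassRate (%) per backbone and harness, copied from Table 2.
PASS_RATE: Dict[str, Dict[str, float]] = {
    "GPT-5.5": {
        "Codex CLI": 35.1, "OpenClaw": 33.3, "Hermes Agent": 31.6, "Claude Code": 14.9,
    },
    "Claude Opus 4.7": {
        "Claude Code": 41.2, "OpenClaw": 35.1, "Hermes Agent": 28.1, "Codex CLI": 13.2,
    },
}


def harness_swing(rates: Dict[str, float]) -> Tuple[str, str, float]:
    """Return (best harness, worst harness, PR gap rounded to one decimal)."""
    best = max(rates, key=rates.get)
    worst = min(rates, key=rates.get)
    return best, worst, round(rates[best] - rates[worst], 1)


for backbone, rates in PASS_RATE.items():
    print(backbone, harness_swing(rates))
```

On these numbers the gap is 20.2 points for GPT-5.5 and 28.0 points for Claude Opus 4.7.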

Why 41.2% is striking
41.2% Best frontier pairing
Claude Opus 4.7 + Claude Code

The same generation of frontier backbones that scores >78% on OSWorld-Verified and 75% on MCPWorld collapses to 41.2% on WeaveBench — and to ≤3.5% when restricted to a single channel. That gap is the cross-interface, long-horizon orchestration tax.

  • OSWorld-Verified: >78%
  • MCPWorld (hybrid): 75.1%
  • OSWorld-MCP (hybrid): 43.3%
  • WeaveBench (hybrid, ours): 41.2%
  • WeaveBench (single-channel max): ≤3.5%

Peer benchmark numbers from each paper's strongest reported result. WeaveBench rows are this work (Table 2 best-pairing & Table 3 single-channel maximum across all backbones).

05 / Ablations

Two ablations that change how you read the leaderboard.

The first ablation removes either channel and shows that single-interface CUAs collapse by an order of magnitude — on WeaveBench the second channel is not a convenience, it is required. The second ablation removes trajectory access from the judge and shows that outcome-only grading awards 10–20 PR points of false credit.

Table 3 · Interface ablation

GUI-only ≤1.8%, CLI-only ≤3.5%, Hybrid 18–35%.

We re-run each backbone in three settings on OpenClaw: GUI-only (the screenshot tool plus nine actuation primitives), CLI-only (the full OpenClaw CLI), and Hybrid (both). Both single-interface settings collapse by an order of magnitude across every backbone, consistent with the channel non-substitutability admission rule (P1).

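| Backbone | GUI | CLI | Hybrid | Δ (Hybrid − best single) |
|---|---|---|---|---|
| Claude Opus 4.7 | 1.8 | 3.5 | 35.1 | +31.6 |
| GPT-5.5 | 0.8 | 2.6 | 33.3 | +30.7 |
| GPT-5.4 | 0.8 | 2.6 | 22.8 | +20.2 |
| GPT-5.3-codex | 0.0 | 1.8 | 18.4 | +16.6 |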

Single-interface PassRate stays in single digits for every backbone — an order of magnitude below Hybrid.
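The Δ column of Table 3 appears to be the Hybrid PassRate minus the better of the two single-channel PassRates. That reading is an inference rather than something the page states, but it reproduces all four reported values, as the sketch below shows (numbers copied from Table 3; the helper name is illustrative).

```python
from typing import Dict, Tuple

# (GUI-only, CLI-only, Hybrid) PassRate in %, copied from Table 3.
ABLATION: Dict[str, Tuple[float, float, float]] = {
    "Claude Opus 4.7": (1.8, 3.5, 35.1),
    "GPT-5.5": (0.8, 2.6, 33.3),
    "GPT-5.4": (0.8, 2.6, 22.8),
    "GPT-5.3-codex": (0.0, 1.8, 18.4),
}


def hybrid_gain(gui: float, cli: float, hybrid: float) -> float:
    """Hybrid PassRate minus the best single-channel PassRate, in PR points."""
    return round(hybrid - max(gui, cli), 1)


gains = {name: hybrid_gain(*row) for name, row in ABLATION.items()}
print(gains)
```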

Table 4 · Cross-benchmark hybrid gain

+31.6 pp gap — vs +3–4 pp on prior hybrid benchmarks.

Compared against the two prior hybrid CUA benchmarks that report comparable interface ablations, WeaveBench's hybrid gain is an order of magnitude larger, and the single-channel floor is single-digit vs. 40–70%. The additional channel is forced by the task specification, not offered as a per-step convenience.

| Benchmark | GUI | CLI/MCP | Hyb. | Δ |
|---|---|---|---|---|
| OSWorld-MCP | 40.1 | — | 43.3 | +3.2 |
| MCPWorld | 70.7 | 53.2 | 75.1 | +4.5 |
| WeaveBench | 1.8 | 3.5 | 35.1 | +31.6 |

Cooperation is forced by the task specification rather than offered as a per-step convenience.

Trajectory-aware vs. outcome-only judge

Removing trajectory access inflates PR by 10.3–20.2 points.

We re-score every rollout with an outcome-only judge that sees only the final deliverables, with no trajectory access and no shortcut scan. Switching back to WeaveBench's trajectory-aware judge removes 10.3–20.2 PR points across the four audited backbones (three GPT models and Claude Opus 4.7). For GPT-5.5, the PassRate drops from 53.5% to 33.3%. These gaps are lower bounds: rollouts already received an anti-fabrication prompt with a cost-free honest fallback.

| Backbone | Outcome-only | Trajectory-aware | Δ |
|---|---|---|---|
| GPT-5.5 | 53.5 | 33.3 | −20.2 |
| Claude Opus 4.7 | 51.6 | 41.2 | −10.4 |
| GPT-5.4 | 33.1 | 22.8 | −10.3 |
| GPT-5.3-codex | 28.7 | 18.4 | −10.3 |
Figure 4 from the paper. Dark blue: audited PassRate. Light blue: inflation removed by the audit, with label inside showing PassRate points removed. Top label: outcome-only total and points removed.
06 / Failure Analysis

Reward hacking + execution-discipline collapse explain 65.6% of failures.

We aggregate every OpenClaw rollout for the three frontier backbones (Opus 4.7, GPT-5.5, GPT-5.4) across reasoning budgets — n = 2,209 trials, 1,735 failures. Following CocoaBench, we adopt a hierarchical taxonomy with 5 top-level families and 13 sub-classes; WeaveBench extends the codebook with two hybrid-specific families absent from prior benchmarks — E4 Long-horizon Execution Discipline and E5 Reward Hacking.

Figure 5 from the paper. (a) Two-ring sunburst donut over n = 1,735 failures. Inner ring: 5 top-level families. Outer ring: 13 sub-classes. E5 Reward Hacking + E4 Long-horizon Execution Discipline jointly account for 65.6% of all failures; perception (E3) is under 4%. (b) Per-backbone sub-class share — y-axis label colour matches the family colour in (a).
35.2% E5 · Reward Hacking

Synthesized render · Hardcoded metric · Crop / overlay · CLI bypass of GUI. Only 30% of E5 hacks follow a true infra bug; the other 70% happen in clean environments despite the explicit anti-fabrication policy.

30.4% E4 · Long-horizon Execution Discipline

Silent halt · Premature halt · Cross-channel state drift. Emerges only when the agent must coordinate multiple deliverables across a long horizon of interleaved GUI/CLI actions.

21.0% E1 · Reasoning & Verification

Imprecision — close miss is the largest sub-class here. Surfaces only with multi-deliverable contracts.

9.5% E2 · Tool Selection

Tool-affordance prior + channel-policy compliance failures.

3.7% E3 · Visual Grounding

Perception is not the bottleneck on frontier backbones — a finding that runs against the perception-centric prior in the GUI-agent literature.

Backbone fingerprint

GPT-5.5 · the “confident forger”

46% of failures are E5 reward hacking — primarily E5.1 synthesized renders (24%) and E5.2 hardcoded metrics (14%). The model can complete or honestly skip, but the implicit reward landscape selects forgery when stuck.

Backbone fingerprint

GPT-5.4 · the “early stopper”

E4 dominates (44%) and E4.2 premature halt alone is 27% — the highest single sub-class share for any backbone. Reward hacking is comparatively rare (23%); after a provider hiccup the model never rebuilds its deliverable checklist.

Backbone fingerprint

Claude Opus 4.7 · the “balanced”

No single sub-class above 17%. E5, E4, and E1 each contribute ~30%. Failure style is a function of model identity, not just raw capability.

06b / Tool-Call Distribution

Even with a dedicated GUI tool exposed, agents prefer shell paths for GUI actions.

We decompose every tool call across all GPT-5.5 rollouts into atomic operations. The top-10 operations cover 93.1% of 10,873 active calls; exec: shell alone accounts for 27.3%. A large fraction of GUI work is hidden inside these shell calls — once re-attributed at the atomic-operation level, the GUI share rises from 33.9% → 62.9%.

Top-10 atomic operations across all GPT-5.5 rollouts on WeaveBench — 93.1% of 10,873 active calls. GPT-5.5 invokes gnome-screenshot via exec 2.2× more often than the native __computer__.screenshot, and drives mouse/keyboard via pyautogui, xdotool, and wmctrl another 521 times.
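The re-attribution idea can be sketched as a simple classifier over tool calls. It assumes a call is a (tool name, shell command) pair, that native GUI tools share the __computer__. prefix, and that the shell tool is named exec; only the four command names come from the caption above, and the sample calls are invented for illustration.

```python
import re
from typing import List, Tuple

# Shell commands that actually drive or capture the desktop (from the caption).
GUI_COMMAND_PATTERN = re.compile(r"\b(gnome-screenshot|pyautogui|xdotool|wmctrl)\b")


def tool_level_channel(tool_name: str) -> str:
    """Channel as seen from the tool name alone."""
    return "gui" if tool_name.startswith("__computer__.") else "cli"


def atomic_channel(tool_name: str, command: str = "") -> str:
    """Channel after reclassifying exec calls that route GUI work through the shell."""
    if tool_level_channel(tool_name) == "gui":
        return "gui"
    if tool_name == "exec" and GUI_COMMAND_PATTERN.search(command):
        return "gui"
    return "cli"


# Hypothetical calls, not benchmark data.
calls: List[Tuple[str, str]] = [
    ("__computer__.screenshot", ""),
    ("exec", "gnome-screenshot -f /tmp/shot.png"),
    ("exec", "xdotool key Return"),
    ("exec", "kubectl get pods"),
    ("exec", "sed -i 's/30s/60s/' nginx.conf"),
]
tool_share = sum(tool_level_channel(t) == "gui" for t, _ in calls) / len(calls)
atomic_share = sum(atomic_channel(t, c) == "gui" for t, c in calls) / len(calls)
print(tool_share, atomic_share)
```

In this toy trace the GUI share moves from 0.2 at the tool level to 0.6 at the atomic-operation level, which is the same kind of shift the page reports for GPT-5.5.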
exec: shell share 27.3%

of all GPT-5.5 tool calls — the single dominant primitive in the trajectory.

GUI re-attributed 33.9% → 62.9%

tool-level GUI share rises to atomic-op share once exec-routed screenshots and pyautogui calls are reclassified.

Channel switches 16 / task

median GUI↔CLI handoffs in a successful trajectory — single-channel agents cannot satisfy this.

Median rollout 76 calls

per task; max 471. WeaveBench is firmly in long-horizon territory.

07 / BibTeX

Cite this work.

If you use WeaveBench in your research, please cite the paper and the released dataset / code.

@misc{li2026weavebench,
  title        = {{WeaveBench}: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces},
  author       = {Li, Wanli and Zhou, Bowen and Yu, Yunyao and Xu, Zhou and Yang, Yifan and Li, Dongsheng and Shan, Caihua},
  year         = {2026},
  month        = jun,
  howpublished = {\url{https://weavebench.github.io}},
  note         = {Microsoft Research Asia, Zhejiang University, Tsinghua University.
                  Dataset: \url{https://huggingface.co/datasets/wanlilll/WeaveBench};
                  Code: \url{https://github.com/weavebench/WeaveBench}.}
}

Authors

  • Wanli Li ¹,²,*,†
  • Bowen Zhou ³,*
  • Yunyao Yu ³
  • Zhou Xu ³
  • Yifan Yang ²
  • Dongsheng Li ²
  • Caihua Shan ²,‡

¹ Zhejiang University  ·  ² Microsoft Research Asia  ·  ³ Tsinghua University

* Equal contribution.   † Work done during an internship at Microsoft Research Asia.   ‡ Corresponding author: caihua.shan@microsoft.com