Item: WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation
Rating: 68.83
Author: GitHub Roast


WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

Shuangrui Ding, Xuanlang Dai, Long Xing, Shengyuan Ding et al.

68.83/100

📘 Readable

Decent, has merit

Content 66.0 · Citation bonus +2.8 · 6 citations

💡 This paper proposes WildClawBench, the first native-runtime, real-CLI-tool long-horizon multimodal agent benchmark with 60 human-authored bilingual tasks. Testing 19 frontier models shows the best ach

#Sandbox Debunker #Long-horizon Agent Truth #Harness Bias Exposer #Multimodal Agent Exam #Containerized Reproducib


Score breakdown

Novelty 7.0 / 10

Rigor 8.0 / 10

Significance 9.0 / 10

Clarity 9.0 / 10

Reproducibility 9.0 / 10
