🔥 GitHub Roast
WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation
Shuangrui Ding, Xuanlang Dai, Long Xing, Shengyuan Ding et al.
68.83/100
📘 Readable
Decent, has merit
Content 66.0 · Citation bonus +2.8 · 6 citations

💡 This paper proposes WildClawBench, the first native-runtime, real-CLI-tool long-horizon multimodal agent benchmark with 60 human-authored bilingual tasks. Testing 19 frontier models shows the best ach

#Sandbox Debunker #Long-horizon Agent Truth #Harness Bias Exposer #Multimodal Agent Exam #Containerized Reproducib

Score breakdown

Novelty: 7.0 / 10
Rigor: 8.0 / 10
Significance: 9.0 / 10
Clarity: 9.0 / 10
Reproducibility: 9.0 / 10
