Item: NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents
Rating: 63.6
Author: GitHub Roast


NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents

Jingzhe Ding, Shengda Long, Changxin Pu, Huan Zhou et al.

63.60/100

🫥 Mediocre

Limited incremental contribution · Low visibility

Content score 63.6 · Citation bonus +0.0 · No citation data yet

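💡 This paper proposes NL2Repo-Bench, a benchmark that specifically evaluates the long-horizon repository generation ability of coding agents. Given only a single natural-language requirements document, the model must autonomously handle architecture design, dependency management, and multi-module implementation, and output a complete, installable Python library. Experiments find that even the strongest current models average a test pass rate below 40%, and that long-horizon reasoning is the core bottleneck.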

#coding agent #long-horizon code generation #software engineering benchmark #LLM capability gap #real-world deployment evaluation #coding agent truth serum


Dimension scores

Novelty: 7.0 / 10

Rigor: 8.0 / 10

Significance: 9.0 / 10

Clarity: 9.0 / 10

Reproducibility: 7.0 / 10
