Item: Do Not Let Low-Probability Tokens Over-Dominate in RL for LLMs
Rating: 52.8
Author: GitHub Roast

Do Not Let Low-Probability Tokens Over-Dominate in RL for LLMs

Zhihe Yang, Xufang Luo, Zilong Wang, Dongqi Han et al.

52.80/100

🫥 Mediocre

Incremental, thin

Content 52.8 · Citation bonus +0.0 · no citation data

💡 This paper identifies that low-probability tokens dominate LLM RL training updates because of their large gradient magnitudes, and proposes Advantage Reweighting and Lopti to suppress low-probability token gradient…
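
The gradient-magnitude claim in the summary can be illustrated on a toy softmax policy. For logits z and a sampled token a, the gradient of log π(a) with respect to z is one_hot(a) − π, so its norm shrinks as π(a) approaches 1 and grows as π(a) approaches 0. The sketch below is only an illustration of that general fact, not the paper's derivation and not an implementation of Advantage Reweighting or Lopti; the function names and the example logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logprob_grad_norm(logits, token):
    """L2 norm of d log pi(token) / d logits, which equals one_hot(token) - pi."""
    probs = softmax(logits)
    grad = [(1.0 if i == token else 0.0) - p for i, p in enumerate(probs)]
    return math.sqrt(sum(g * g for g in grad))


# Hypothetical 4-token vocabulary: token 0 is likely, token 3 is unlikely.
logits = [4.0, 1.0, 0.0, -1.0]
probs = softmax(logits)

high = logprob_grad_norm(logits, 0)  # gradient norm for the high-probability token
low = logprob_grad_norm(logits, 3)   # gradient norm for the low-probability token

print(f"p(high)={probs[0]:.4f}  grad norm={high:.4f}")
print(f"p(low) ={probs[3]:.4f}  grad norm={low:.4f}")
```

With equal advantages, the unlikely token contributes the larger logit gradient in this toy setting, which is the kind of imbalance the summary describes.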

#RL for LLM #gradient imbalance #GRPO optimization #low-probability token #logic reasoning

Score breakdown

Novelty: 6.0 / 10

Rigor: 5.0 / 10

Significance: 8.0 / 10

Clarity: 8.0 / 10

Reproducibility: 7.0 / 10
