Motivation
Reinforcement Learning (RL) training for Large Language Models often suffers from entropy collapse, where the model’s output distribution becomes overly deterministic early in training. This severely limits exploration and prevents the model from discovering diverse reasoning paths. Understanding and mitigating entropy collapse is crucial for successful long-form reasoning tasks where exploration of different solution strategies is essential.
Task
Your task is to implement a new GRPO strategy for language-model reinforcement learning that achieves the highest possible accuracy while preventing entropy collapse.
We provide the GRPO algorithm as background knowledge. For a specific question-answer pair $(q, a)$, the behavior policy $\pi_{\theta_{\mathrm{old}}}$ samples a group of $G$ individual responses $\{o_i\}_{i=1}^G$. The advantage of the $i$-th response is then calculated by normalizing the group-level rewards $\{R_i\}_{i=1}^G$:
$$ \hat{A}_{i,t} = \frac{R_i - \operatorname{mean}\left(\{R_j\}_{j=1}^G\right)}{\operatorname{std}\left(\{R_j\}_{j=1}^G\right)}, $$
and GRPO maximizes the clipped surrogate objective
$$ J_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{(q, a) \sim \mathcal{D},\, \{o_i\}_{i=1}^G \sim \pi_{\theta_{\mathrm{old}}}(\cdot\mid q)} \left[ \frac{1}{G} \sum_{i=1}^G \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \min\!\left( r_{i,t}(\theta)\,\hat{A}_{i,t},\ \operatorname{clip}\!\left(r_{i,t}(\theta),\, 1 - \epsilon,\, 1 + \epsilon\right)\hat{A}_{i,t} \right) \right], $$
where $ r_{i,t}(\theta) = \dfrac{\pi_\theta(o_{i,t}\mid q, o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid q, o_{i,<t})} $ is the importance sampling ratio.
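As a reference point, here is a minimal sketch of the group-normalized advantage described above, written in plain PyTorch rather than against the verl API; the function name and the `eps` constant are illustrative assumptions.

```python
import torch

def grpo_group_advantage(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize a group of G scalar rewards into GRPO advantages.

    rewards: shape [G], one outcome reward R_i per sampled response o_i.
    Returns: shape [G], A_i = (R_i - mean) / (std + eps); every token of
    response i shares the same advantage A_{i,t} = A_i.
    """
    mean = rewards.mean()
    std = rewards.std()
    return (rewards - mean) / (std + eps)
```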
Now, you need to implement a variant of GRPO to get the highest accuracy and prevent entropy collapse.
You should work under the /workspace/task and /workspace/data directories.
You must use the original GRPO rollout. Only modify the advantage/loss computation in /workspace/task/repositories/verl/verl/trainer/ppo/core_algos.py; do not modify any other file in verl.
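The exact function signatures in core_algos.py may differ across verl versions, so the sketch below is a standalone illustration of one well-known entropy-preserving modification (asymmetric "clip-higher" ratios plus an optional entropy bonus), not verl's actual code and not the only admissible strategy.

```python
import torch

def grpo_policy_loss_clip_higher(log_prob, old_log_prob, advantages, response_mask,
                                 clip_low=0.2, clip_high=0.28,
                                 entropy=None, entropy_coeff=0.0):
    """One entropy-preserving GRPO variant (illustrative; names/signature are assumptions).

    log_prob, old_log_prob, advantages, response_mask: [batch, seq_len] tensors.
    Setting clip_high > clip_low widens the upper clip range so low-probability
    tokens can still be up-weighted, which empirically slows entropy collapse.
    """
    ratio = torch.exp(log_prob - old_log_prob)                 # importance sampling ratio r_{i,t}
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - clip_low, 1.0 + clip_high) * advantages
    pg_loss = -torch.min(surr1, surr2)                         # negate: we minimize the loss
    pg_loss = (pg_loss * response_mask).sum() / response_mask.sum().clamp(min=1)
    if entropy is not None and entropy_coeff > 0:
        ent = (entropy * response_mask).sum() / response_mask.sum().clamp(min=1)
        pg_loss = pg_loss - entropy_coeff * ent                # optional exploration bonus
    return pg_loss
```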
After you implement the algorithm in /workspace/task/repositories/verl, you can use /workspace/task/scripts/train.sh to train the model with your code; the training log also reports dev-set results. (You may train multiple times and select the best run to evaluate.)
- Do not modify any parameter in /workspace/task/scripts/train.sh except test_freq.
Finally, select the best checkpoint from your training and run /workspace/task/scripts/inference.sh <model_path> to generate /workspace/data/outputs/math_results_logits.pt and /workspace/data/outputs/math_results.parquet.
Output Format Requirements:
File Location: /workspace/data/outputs/math_results.parquet
File Format: Parquet format containing model responses
Required Fields: The parquet file should contain:
- "sample_id" (int): Unique identifier for each test sample (must follow the order in the test set)
- "output" (string): The model's complete response, including reasoning and final answer
Required Output Format for each response:
<reasoning process>
Answer:\boxed{Your Answer}
Logits File: /workspace/data/outputs/math_results_logits.pt
- PyTorch tensor file containing logits for entropy calculation
- Should be a list of tensors, one for each test sample
- Each tensor should have shape [sequence_length, vocab_size] (see the schema sketch after this section), where:
  - sequence_length: number of generated tokens for that sample
  - vocab_size: size of the model's vocabulary
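To make the two artifacts above concrete, here is a minimal schema sketch. inference.py already produces both files; the variable contents and the vocabulary size below are placeholder assumptions, shown only to illustrate the expected shapes and fields.

```python
import pandas as pd
import torch

V = 151936  # assumed Qwen2.5 vocab size; check the model config rather than trusting this

# Placeholder data: one response string and one [seq_len, vocab_size] logits tensor per test sample.
responses = ["<reasoning>...\nAnswer:\\boxed{42}", "<reasoning>...\nAnswer:\\boxed{7}"]
logits = [torch.randn(128, V), torch.randn(96, V)]

df = pd.DataFrame({
    "sample_id": list(range(len(responses))),  # must follow the test-set order
    "output": responses,
})
df.to_parquet("/workspace/data/outputs/math_results.parquet", index=False)
torch.save(logits, "/workspace/data/outputs/math_results_logits.pt")  # a list of tensors
```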
IMPORTANT Notes:
- Use /workspace/task/scripts/inference.py for inference; do not modify this script.
- The reasoning process should demonstrate clear mathematical thinking.
- The final answer must be enclosed in \boxed{} format (a small answer-extraction sketch follows these notes).
- Your task is to implement a new strategy for GRPO to prevent entropy collapse and achieve the highest accuracy.
The evaluation will check these two files to give the final score based on both accuracy and entropy metrics.
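If you want to sanity-check accuracy on the dev set before spending one of your evaluation chances, a minimal sketch for pulling the final answer out of the required format is below; the regex is an assumption about how \boxed{...} appears and does not handle nested braces.

```python
import re
from typing import Optional

def extract_boxed(output: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in a response, or None if absent."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", output)
    return matches[-1].strip() if matches else None

assert extract_boxed("Answer:\\boxed{42}") == "42"
```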
Data
- Train set: stored in /workspace/data/datasets/math.8k/train.parquet (a loading sketch follows this list)
  - data_source: 'math8k'
  - prompt: the math question
  - reward_model: the ground-truth answer to the math question
- Dev set: stored in /workspace/data/datasets/math.8k/dev.parquet
  - data_source: 'GSM8k' & 'AMC23'
  - prompt: the math question
  - reward_model: the ground-truth answer to the math question
- Test set: stored in /workspace/data/datasets/math.8k/test.parquet
- Checkpoint: stored in /workspace/data/checkpoints/Qwen2.5-1.5B
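A quick way to inspect the splits before training; the exact column contents depend on the verl data format, so treat the field access below as an assumption.

```python
import pandas as pd

train = pd.read_parquet("/workspace/data/datasets/math.8k/train.parquet")
dev = pd.read_parquet("/workspace/data/datasets/math.8k/dev.parquet")

print(train.columns.tolist())        # expect: data_source, prompt, reward_model, ...
print(train.iloc[0]["data_source"])  # 'math8k'
print(dev["data_source"].unique())   # 'GSM8k', 'AMC23'
```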
Constraint
- Training time constraint: 24 hours.
- Evaluation time constraint: 10 minutes.
- You can use 8 80GB H100 GPUs.
- You only have 3 chances to evaluate your submission.
- Total time limit: 48 hours.
Evaluation
Evaluation metrics:
- Accuracy: Measure accuracy of the variant policy on the test set at the end of the training.
- Entropy Analysis: Track entropy of the variant policy along the training process.
Success criteria:
- The accuracy of the variant policy on MATH500 should be higher than that of the baseline policy. Try your best to get the highest accuracy!
- The average entropy of the variant policy should stay within a reasonable range.
- The final score will be the accuracy score * entropy score * 100 (a sketch for computing average entropy from the saved logits follows).
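For the entropy side of the score, here is a minimal sketch of how average token entropy can be computed from the saved logits file; the file path matches the required output, but the averaging scheme is an assumption about the grader, and computing in float32 avoids overflow.

```python
import torch
import torch.nn.functional as F

def mean_entropy(logits_path: str = "/workspace/data/outputs/math_results_logits.pt") -> float:
    """Average per-token entropy (in nats) over all generated tokens of all samples."""
    logits_list = torch.load(logits_path, map_location="cpu")
    total, count = 0.0, 0
    for logits in logits_list:                  # each tensor: [seq_len, vocab_size]
        logp = F.log_softmax(logits.float(), dim=-1)
        ent = -(logp.exp() * logp).sum(dim=-1)  # entropy per generated token, shape [seq_len]
        total += ent.sum().item()
        count += ent.numel()
    return total / max(count, 1)

print(mean_entropy())
```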
Environment
We have set up a conda environment for you at /workspace/conda, and it is already activated.
Scripts
You may add new scripts to the /workspace/task/scripts directory, but do not modify the scripts that are already there.
The following scripts are provided to you; do not modify them:
- /workspace/task/repositories/verl/scripts/model_merger.py: Given the path of a verl checkpoint (a directory containing multiple model_world_size_8_rank_{rank_number}.pt files), this script merges the model weights into Hugging Face format.
  - Input:
    - --local_dir: the path of the verl checkpoint.
  - Output:
    - the checkpoint in Hugging Face format.