Motivation
Reinforcement Learning (RL) training for Large Language Models often suffers from entropy collapse, where the model’s output distribution becomes overly deterministic early in training. This severely limits exploration and prevents the model from discovering diverse reasoning paths. Understanding and mitigating entropy collapse is crucial for successful long-form reasoning tasks where exploration of different solution strategies is essential.
Task
Your task is to implement a new GRPO strategy for language-model reinforcement learning that achieves the highest possible accuracy while preventing entropy collapse.
We provide the GRPO algorithm as background knowledge. For a specific question-answer pair $(q, a)$, the behavior policy $\pi_{\theta_{\mathrm{old}}}$ samples a group of $G$ individual responses $\{o_i\}_{i=1}^G$. The advantage of the $i$-th response is then calculated by normalizing the group-level rewards $\{R_i\}_{i=1}^G$:
$$ \hat{A}_{i,t} = \frac{R_i - \operatorname{mean}\left(\{R_j\}_{j=1}^G\right)}{\operatorname{std}\left(\{R_j\}_{j=1}^G\right)}, $$
and GRPO maximizes the clipped surrogate objective
$$ J_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{(q, a) \sim \mathcal{D},\, \{o_i\}_{i=1}^G \sim \pi_{\theta_{\mathrm{old}}}(\cdot\mid q)} \left[ \frac{1}{G} \sum_{i=1}^G \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \min\!\left( r_{i,t}(\theta)\,\hat{A}_{i,t},\ \operatorname{clip}\!\left(r_{i,t}(\theta),\, 1 - \epsilon,\, 1 + \epsilon\right)\hat{A}_{i,t} \right) \right], $$
where $ r_{i,t}(\theta) = \dfrac{\pi_\theta(o_{i,t}\mid q, o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid q, o_{i,<t})} $ is the importance sampling ratio.
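As a reference point, here is a minimal sketch of the group-normalized advantage described above, written in plain PyTorch rather than against the verl API; the function name and the `eps` constant are illustrative assumptions.

```python
import torch

def grpo_group_advantage(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize a group of G scalar rewards into GRPO advantages.

    rewards: shape [G], one outcome reward R_i per sampled response o_i.
    Returns: shape [G], A_i = (R_i - mean) / (std + eps); every token of
    response i shares the same advantage A_{i,t} = A_i.
    """
    mean = rewards.mean()
    std = rewards.std()
    return (rewards - mean) / (std + eps)
```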
Now, you need to implement a variant of GRPO to get the highest accuracy and prevent entropy collapse.
You should work under the /workspace/task and /workspace/data directories.
You must use the original GRPO rollout. Only modify the advantage/loss computation in /workspace/task/repositories/verl/verl/trainer/ppo/core_algos.py; do not modify any other file in verl.
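The exact function signatures in core_algos.py may differ across verl versions, so the sketch below is a standalone illustration of one well-known entropy-preserving modification (asymmetric "clip-higher" ratios plus an optional entropy bonus), not verl's actual code and not the only admissible strategy.

```python
import torch

def grpo_policy_loss_clip_higher(log_prob, old_log_prob, advantages, response_mask,
                                 clip_low=0.2, clip_high=0.28,
                                 entropy=None, entropy_coeff=0.0):
    """One entropy-preserving GRPO variant (illustrative; names/signature are assumptions).

    log_prob, old_log_prob, advantages, response_mask: [batch, seq_len] tensors.
    Setting clip_high > clip_low widens the upper clip range so low-probability
    tokens can still be up-weighted, which empirically slows entropy collapse.
    """
    ratio = torch.exp(log_prob - old_log_prob)                 # importance sampling ratio r_{i,t}
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - clip_low, 1.0 + clip_high) * advantages
    pg_loss = -torch.min(surr1, surr2)                         # negate: we minimize the loss
    pg_loss = (pg_loss * response_mask).sum() / response_mask.sum().clamp(min=1)
    if entropy is not None and entropy_coeff > 0:
        ent = (entropy * response_mask).sum() / response_mask.sum().clamp(min=1)
        pg_loss = pg_loss - entropy_coeff * ent                # optional exploration bonus
    return pg_loss
```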
After you implement the algorithm in /workspace/task/repositories/verl, you can use /workspace/task/scripts/train.sh to train the model with your code; the training log also reports dev-set results. (You may train multiple times and select the best run to evaluate.)
- Do not modify any parameter in /workspace/task/scripts/train.sh except test_freq.
Finally, select the best checkpoint from your training and run /workspace/task/scripts/inference.sh <model_path> to generate /workspace/data/outputs/math_results_logits.pt and /workspace/data/outputs/math_results.parquet.
Output Format Requirements:
File Location: /workspace/data/outputs/math_results.parquet
File Format: Parquet format containing model responses
Required Fields: The parquet file should contain:
- "sample_id" (int): Unique identifier for each test sample (must follow the order in the test set)
- "output" (string): The model's complete response, including reasoning and final answer
Required Output Format for each response:
<reasoning process>
Answer:\boxed{Your Answer}
Logits File: /workspace/data/outputs/math_results_logits.pt
- PyTorch tensor file containing logits for entropy calculation
- Should be a list of tensors, one for each test sample
- Each tensor should have shape [sequence_length, vocab_size] (see the schema sketch after this section), where:
  - sequence_length: number of generated tokens for that sample
  - vocab_size: size of the model's vocabulary
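To make the two artifacts above concrete, here is a minimal schema sketch. inference.py already produces both files; the variable contents and the vocabulary size below are placeholder assumptions, shown only to illustrate the expected shapes and fields.

```python
import pandas as pd
import torch

V = 151936  # assumed Qwen2.5 vocab size; check the model config rather than trusting this

# Placeholder data: one response string and one [seq_len, vocab_size] logits tensor per test sample.
responses = ["<reasoning>...\nAnswer:\\boxed{42}", "<reasoning>...\nAnswer:\\boxed{7}"]
logits = [torch.randn(128, V), torch.randn(96, V)]

df = pd.DataFrame({
    "sample_id": list(range(len(responses))),  # must follow the test-set order
    "output": responses,
})
df.to_parquet("/workspace/data/outputs/math_results.parquet", index=False)
torch.save(logits, "/workspace/data/outputs/math_results_logits.pt")  # a list of tensors
```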
IMPORTANT Notes:
- Use /workspace/task/scripts/inference.py for inference; do not modify this script.
- The reasoning process should demonstrate clear mathematical thinking.
- The final answer must be enclosed in \boxed{} format (a small answer-extraction sketch follows these notes).
- Your task is to implement a new strategy for GRPO to prevent entropy collapse and achieve the highest accuracy.
The evaluation will check these two files to give the final score based on both accuracy and entropy metrics.
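If you want to sanity-check accuracy on the dev set before spending one of your evaluation chances, a minimal sketch for pulling the final answer out of the required format is below; the regex is an assumption about how \boxed{...} appears and does not handle nested braces.

```python
import re
from typing import Optional

def extract_boxed(output: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in a response, or None if absent."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", output)
    return matches[-1].strip() if matches else None

assert extract_boxed("Answer:\\boxed{42}") == "42"
```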
Data
- Train set: stored in /workspace/data/datasets/math.8k/train.parquet (a loading sketch follows this list)
  - data_source: 'math8k'
  - prompt: the math question
  - reward_model: the ground-truth answer to the math question
- Dev set: stored in /workspace/data/datasets/math.8k/dev.parquet
  - data_source: 'GSM8k' & 'AMC23'
  - prompt: the math question
  - reward_model: the ground-truth answer to the math question
- Test set: stored in /workspace/data/datasets/math.8k/test.parquet
- Checkpoint: stored in /workspace/data/checkpoints/Qwen2.5-1.5B
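A quick way to inspect the splits before training; the exact column contents depend on the verl data format, so treat the field access below as an assumption.

```python
import pandas as pd

train = pd.read_parquet("/workspace/data/datasets/math.8k/train.parquet")
dev = pd.read_parquet("/workspace/data/datasets/math.8k/dev.parquet")

print(train.columns.tolist())        # expect: data_source, prompt, reward_model, ...
print(train.iloc[0]["data_source"])  # 'math8k'
print(dev["data_source"].unique())   # 'GSM8k', 'AMC23'
```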
Constraint
- Training time constraint: 24 hours.
- Evaluation time constraint: 10 minutes.
- You can use 8 80GB H100 GPUs.
- You only have 3 chances to evaluate your submission.
- Total time limit: 48 hours.
Evaluation
Evaluation metrics:
- Accuracy: Measure accuracy of the variant policy on the test set at the end of the training.
- Entropy Analysis: Track entropy of the variant policy along the training process.
Success criteria:
- The accuracy of the variant policy on MATH500 should be higher than that of the baseline policy. Try your best to get the highest accuracy!
- The average entropy of the variant policy should stay within a reasonable range.
- The final score will be the accuracy score * entropy score * 100 (a sketch for computing average entropy from the saved logits follows).
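For the entropy side of the score, here is a minimal sketch of how average token entropy can be computed from the saved logits file; the file path matches the required output, but the averaging scheme is an assumption about the grader, and computing in float32 avoids overflow.

```python
import torch
import torch.nn.functional as F

def mean_entropy(logits_path: str = "/workspace/data/outputs/math_results_logits.pt") -> float:
    """Average per-token entropy (in nats) over all generated tokens of all samples."""
    logits_list = torch.load(logits_path, map_location="cpu")
    total, count = 0.0, 0
    for logits in logits_list:                  # each tensor: [seq_len, vocab_size]
        logp = F.log_softmax(logits.float(), dim=-1)
        ent = -(logp.exp() * logp).sum(dim=-1)  # entropy per generated token, shape [seq_len]
        total += ent.sum().item()
        count += ent.numel()
    return total / max(count, 1)

print(mean_entropy())
```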
Environment
We have set up a conda environment for you at /workspace/conda, and it is already activated.
Scripts
You may add new scripts to the /workspace/task/scripts directory, but do not modify the scripts that are already there.
The following scripts are provided to you; do not modify them:
- /workspace/task/repositories/verl/scripts/model_merger.py: Given the path of a verl checkpoint (a directory containing multiple model_world_size_8_rank_{rank_number}.pt files), this script merges the model weights into Hugging Face format.
  - Input:
    - --local_dir: the path of the verl checkpoint.
  - Output:
    - the checkpoint in Hugging Face format.