Robust Preference Optimization

Loss Design Text Alignment
Created by: Yunze Wu
2025-09-03

Motivation

A critical challenge in preference optimization is handling noisy or ambiguous preference data, where the reward margin between preferred and dispreferred responses is small. Understanding how different methods perform when preference labels are corrupted can inform the development of more robust alignment techniques. This evaluation will help identify which approaches maintain performance stability under realistic data corruption scenarios.

Task

Your task is to develop a more robust preference optimization method than SimPO. The SimPO loss function is defined as:

$$ \mathcal{L}_{\mathrm{SimPO}}=\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}} \left[-\log\sigma\left(\frac{\beta}{|y_w|}\log\pi_\theta(y_w|x) - \frac{\beta}{|y_l|}\log\pi_\theta(y_l|x)-\gamma_0\right)\right], $$

where $\sigma$ is the sigmoid function, $\beta$ is a hyperparameter, $\gamma_0$ is the target reward margin, $\pi_\theta(y|x)$ is the model’s probability of response $y$ given prompt $x$, and $|y_w|$ and $|y_l|$ are the token lengths of the preferred and dispreferred responses, respectively.
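
For concreteness, here is a minimal PyTorch sketch of this objective (not the repository's implementation; argument names are illustrative), assuming the per-response log-probabilities have already been divided by their token lengths:

```python
import torch
import torch.nn.functional as F

def simpo_loss(avg_logp_chosen, avg_logp_rejected, beta=2.5, gamma_0=0.25):
    """Sketch of the SimPO objective above.

    avg_logp_chosen / avg_logp_rejected: length-normalized sequence
    log-probabilities, i.e. log pi(y|x) / |y|, one value per preference pair.
    """
    # Length-normalized implicit reward margin between preferred and dispreferred.
    margin = beta * (avg_logp_chosen - avg_logp_rejected)
    # -log sigmoid(margin - gamma_0), averaged over the batch.
    return -F.logsigmoid(margin - gamma_0).mean()

# Example usage with dummy values:
# loss = simpo_loss(torch.tensor([-1.2, -0.8]), torch.tensor([-1.5, -0.9]))
```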

Key Insight for Improvement

A promising direction is to adaptively adjust the target reward margin ($\gamma$) based on preference clarity:

Preference pairs with a larger reward margin are more likely to represent unambiguous human preferences. Thus, assigning a higher target reward margin ($\gamma$) enables the LLM to learn more effectively from such pairs. Conversely, pairs with a smaller reward margin are more likely to reflect ambiguous preferences, warranting a lower $\gamma$ to reduce their influence on the LLM’s learning process.
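
One possible instantiation of this idea (a sketch, not a prescribed formulation) is to let $\gamma$ grow with the observed, stop-gradient reward margin:

$$ \gamma(x, y_w, y_l) = \gamma_0 + \tau \cdot \sigma\!\left(\operatorname{sg}\!\left[\frac{\beta}{|y_w|}\log\pi_\theta(y_w|x) - \frac{\beta}{|y_l|}\log\pi_\theta(y_l|x)\right]\right), $$

where $\operatorname{sg}[\cdot]$ is the stop-gradient operator and $\tau \ge 0$ is an assumed hyperparameter controlling how strongly the margin adapts; this $\gamma$ would then replace $\gamma_0$ in the SimPO loss.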

As a result, your task is to design a new loss function that adaptively adjusts the target reward margin ($\gamma$) based on preference clarity. In this task, we call this method gammaPO. The task involves the following steps:

  1. Design gammaPO: Put your idea for gammaPO in /workspace/data/outputs/idea.md. You should work within the /workspace/data/ and /workspace/task directories.

  2. Code Implementation: Refer to the implementation of simpo_loss in /workspace/task/repositories/gammaPO/scripts/simpo_trainer.py, and implement your idea after elif self.loss_type == "gammapo" in /workspace/task/repositories/gammaPO/scripts/gammapo_trainer.py (a hedged sketch follows this step list).

  3. gammaPO Training:

    • Train the Qwen2.5-7B-Instruct model using your improved algorithm.
    • Run the following script for training: /workspace/task/scripts/start.sh
    • Save the final trained model to /workspace/data/models/qwen-2.5-7b-it-gammapo/.
  4. gammaPO Evaluation:

    Run the following script for inference: /workspace/task/scripts/infer.sh (this script only generates one of the two outputs; you need to modify it). Save the final outputs to /workspace/data/outputs/qwen-2.5-7b-it-gammapo/model_outputs.json. Their order and number should match /workspace/data/datasets/qwen_ultrafeedback_binarized.
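
As noted in step 2, below is a minimal, self-contained sketch of one way such an adaptive-margin loss could be written. It assumes length-normalized log-probabilities as inputs (as in simpo_loss) and introduces a hypothetical hyperparameter tau that scales the adaptation; inside the trainer, the same computation would live in the elif self.loss_type == "gammapo" branch.

```python
import torch
import torch.nn.functional as F

def gammapo_loss(chosen_logps, rejected_logps, beta=2.5, gamma_0=0.25, tau=0.5):
    """Sketch of an adaptive-margin (gammaPO-style) loss.

    chosen_logps / rejected_logps: length-normalized sequence log-probabilities,
    i.e. log pi(y|x) / |y|, one value per preference pair.
    tau is an assumed extra hyperparameter, not part of the provided repository.
    """
    # Observed, length-normalized reward margin between preferred and dispreferred.
    margin = beta * (chosen_logps - rejected_logps)
    # Adapt the target margin to preference clarity. Detach so the adjustment
    # itself does not backpropagate: clear pairs (large margin) get a larger
    # target, ambiguous pairs (small margin) a smaller one.
    gamma = gamma_0 + tau * torch.sigmoid(margin.detach())
    # Same Bradley-Terry-style objective as SimPO, with the adaptive gamma.
    return -F.logsigmoid(margin - gamma).mean()
```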

Output Format Requirements:

File Location: /workspace/data/outputs/qwen-2.5-7b-it-gammapo/model_outputs.json

File Format: JSON array containing model responses

Required Fields: Each JSON object should contain:

  • "dataset" (string): Dataset identifier (from AlpacaEval2)
  • "instruction" (string): The original instruction from AlpacaEval2
  • "output" (string): The model’s response to the instruction
  • "generator" (string): Model identifier “qwen-2.5-7b-it-gammapo”

Important Notes:

  • The order must match the AlpacaEval2 test set exactly
  • Use the provided inference script for correct formatting
  • Both LC (Length-Controlled) and WR (Raw Win Rate) will be evaluated
  5. Evaluation: Call the evaluation to evaluate your output results.

You should work under the /workspace directories.

You can submit your answer in the file above up to 3 times (with a different training loss each time). You should try your best to achieve the highest score.

Data

  • Training Data: Use the UltraFeedback Binarized dataset, located in /workspace/data/datasets/qwen_ultrafeedback_binarized.

  • Evaluation Data: Use the AlpacaEval2 dataset, which is included in the alpaca_eval package, for evaluating the model’s performance.

  • Base Model: The qwen2.5-7b-it model is provided as the base model in /workspace/data/checkpoints/Qwen2.5-7b-Instruct.

Constraint

  • Time limit: 24 hours
  • Hardware: 8 GPUs (80GB VRAM each)
  • Submission attempts: 3 max

Evaluation

Evaluation Metrics

Report the Length-Controlled Win Rate (LC) and Raw Win Rate (WR) from AlpacaEval2 for your new trained model.

Success Criteria

  1. The final evaluation metrics (LC and WR) for your method should outperform those achieved by SimPO; the SimPO results serve as the lower baseline.
  2. If the final evaluation metrics (LC and WR) for your method surpass a higher baseline, you will receive a 100% score. If your result falls between the two baselines, your score is determined by a scoring function between them.

Key Hyperparameters for Tuning

Train for exactly 1 epoch with the following hyperparameters:

  • beta: Range: [2, 10], Recommended: 2.5, Controls the sensitivity to preference differences. Higher values increase the penalty for incorrect preference predictions.
  • gamma: Range: [0.3, 1.0], Recommended: 0.25, Defines the minimum reward gap between preferred and dispreferred responses. A higher value enforces a larger margin.
  • learning_rate: Range: [1e-7, 1e-6], Recommended: 5e-7, Controls the step size during optimization. Lower values lead to slower but more stable convergence.

Environment

The environment has been pre-configured for you in /workspace/conda. You can start working directly without additional setup. You can use vllm, datatrove, etc., in this environment.