Model Realignment

Loss Design · Text · Math
Created by: Wenhong Zhu
2025-09-03

Motivation

Efficiently adjusting the alignment strength of language models without incurring the high cost of full retraining is an increasingly important challenge. This assignment focuses on a training-efficient method for realigning a base model with its aligned counterpart.

Task

You are provided with a reference policy $\pi^{\text{ref}}(y \mid x)$, represented by DeepSeek-R1-Distilled-Qwen-1.5B, and an already aligned model $\pi_{\theta(\beta)}(y \mid x)$, represented by DeepScaleR-Preview-1.5B. The aligned model is obtained by further training the reference model and demonstrates enhanced efficiency-oriented reasoning capabilities.
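
One way to make alignment strength concrete (an illustrative formulation only; the weight $\lambda$ and the policy $\pi_\lambda$ are notation introduced here, not part of the task) is as a geometric mixture of the two policies:

$$\log \pi_\lambda(y \mid x) = \log \pi^{\text{ref}}(y \mid x) + \lambda \left[ \log \pi_{\theta(\beta)}(y \mid x) - \log \pi^{\text{ref}}(y \mid x) \right] + \text{const},$$

where $\lambda = 0$ recovers the reference model, $\lambda = 1$ recovers the aligned model, and $\lambda > 1$ extrapolates further in the alignment direction. A training-efficient realignment method could, for instance, train a single model to approximate such a mixture for an adjustable $\lambda$.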

The performance is as follows:

| Models | AIME24 Avg@32 | AIME24 #Token | Token Reduction % |
|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-1.5B | 18.33 | 12415 | - |
| DeepScaleR-1.5B-Preview | 26.77 | 8533 | 31.27 |

Your task is to implement and evaluate a method for efficient realignment between these two models: design a training-efficient algorithm that adjusts the alignment strength of the model and, in doing so, achieves better efficiency-oriented reasoning than the reference model.

You need to refactor the LLaMA-Factory repository to integrate your proposed method.

Implementation Guidelines

You are required to design and implement the DualAlign algorithm for efficient model realignment. You have access to:

  1. Reference Model: The base model that needs alignment adjustment (DeepSeek-R1-Distilled-Qwen-1.5B)
  2. Aligned Model: A model that has already been aligned and shows improved performance (DeepScaleR-Preview-1.5B)
  3. Target: Train a new model that can achieve better efficiency-oriented reasoning

Implementation Requirements:

  1. Algorithm Design: Create an innovative training method that leverages both the reference and aligned models to improve alignment strength.

  2. Framework Integration: Implement your method in the LLaMA-Factory framework by creating a new training stage called dualalign (a wiring sketch follows this list).

  3. Training Configuration: Use the provided configuration in /workspace/task/scripts/train.yaml with your custom implementation.
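
The exact wiring depends on the LLaMA-Factory version checked out under /workspace/task/repositories; the module path, function name, and dispatch pattern below (run_dualalign, a stage branch in tuner.py) are assumptions modeled on how the existing dpo stage is organized, so verify them against the repository before relying on them. A minimal sketch of the new stage's entry point:

# src/llamafactory/train/dualalign/workflow.py (hypothetical new file, modeled on the dpo workflow)
from typing import List, Optional

from transformers import TrainerCallback


def run_dualalign(
    model_args,        # llamafactory.hparams.ModelArguments
    data_args,         # llamafactory.hparams.DataArguments
    training_args,     # transformers Seq2SeqTrainingArguments
    finetuning_args,   # llamafactory.hparams.FinetuningArguments, extended with a "dualalign" stage
    callbacks: Optional[List[TrainerCallback]] = None,
):
    """Load the tokenizer and dataset, the trainable model, and the two frozen
    models (reference and aligned), then hand everything to a custom trainer
    that implements the DualAlign loss."""
    raise NotImplementedError("Use the existing dpo workflow as a template.")

# Remaining wiring (names are assumptions -- check the repository):
#   * add "dualalign" to the allowed stage choices in the finetuning arguments,
#   * add an `elif finetuning_args.stage == "dualalign": run_dualalign(...)` branch
#     where the other stages are dispatched,
#   * set `stage: dualalign` in /workspace/task/scripts/train.yaml.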

Note:

You should design a novel approach to utilize information from both models during training. Consider how to effectively combine their knowledge without simply copying existing methods. The goal is to develop an algorithm that can flexibly control alignment strength and achieve superior performance.
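
As one concrete, purely illustrative way of using information from both models: the sketch below builds a per-token teacher distribution by log-linearly combining the frozen reference and aligned models with a strength knob, and distills the trainable model toward it with a KL objective. The function name dual_align_loss and the knob lam are hypothetical; this is a sketch of one possible instantiation, not the required DualAlign algorithm.

import torch
import torch.nn.functional as F


def dual_align_loss(student_logits, ref_logits, aligned_logits, labels, lam=1.5):
    """Distill the student toward a log-linear combination of the frozen
    reference and aligned models.  lam=0 targets the reference model,
    lam=1 the aligned model, lam>1 extrapolates the alignment direction.

    All logits: (batch, seq_len, vocab); labels: (batch, seq_len) with -100
    on prompt/padding positions.  Assumes logits are already shifted so that
    position t predicts labels[:, t].
    """
    with torch.no_grad():
        ref_logp = F.log_softmax(ref_logits.float(), dim=-1)
        aligned_logp = F.log_softmax(aligned_logits.float(), dim=-1)
        # Teacher: pi_ref^(1 - lam) * pi_aligned^lam, renormalized over the vocab.
        teacher_logp = F.log_softmax(ref_logp + lam * (aligned_logp - ref_logp), dim=-1)

    student_logp = F.log_softmax(student_logits.float(), dim=-1)

    # Per-token KL(teacher || student), averaged over response tokens only.
    kl = (teacher_logp.exp() * (teacher_logp - student_logp)).sum(dim=-1)
    mask = (labels != -100).float()
    return (kl * mask).sum() / mask.sum().clamp(min=1.0)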

You should work under the /workspace/task and /workspace/data directories.

The directory /workspace/task/repositories/LLaMA-Factory contains multiple README.md files. You are encouraged to read them to better understand the training framework.

You should output the following files:

  • /workspace/data/outputs/result.parquet: The inference result produced by your trained model. This file should contain:
    • output column: String responses from your trained model
    • Same order and number of rows as the test dataset
    • Proper pandas DataFrame format
    • Contain the full thinking process and the final answer in the output column (a minimal sketch follows this list)
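
A minimal sketch of producing this file (assumptions: the test parquet exposes the problem text in a column named "problem", and generate_one is a placeholder for however you actually run inference):

import pandas as pd

# Read the test set in its original row order.
test_df = pd.read_parquet("/workspace/data/datasets/aime-2024.parquet")


def generate_one(problem: str) -> str:
    # Placeholder: replace with inference against your trained model
    # (e.g. via the vLLM server started from llm.py).
    return ""


# Inspect test_df.columns for the actual prompt field; "problem" is an assumption here.
outputs = [generate_one(p) for p in test_df["problem"]]

result = pd.DataFrame({"output": outputs})   # same order and number of rows as the test set
assert len(result) == len(test_df)
result.to_parquet("/workspace/data/outputs/result.parquet", index=False)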

Data

Training Data: /workspace/data/datasets/long_cot_calibration.json

Test Data: /workspace/data/datasets/aime-2024.parquet

Constraint

  • Training: Maximum 400 training steps with batch size 16 (at most 400 × 16 = 6,400 training examples seen)
  • Context Length: Training on 4k-8k context, evaluation on up to 16k context
  • Compute: 8 × 80GB GPUs

Evaluation

We will measure both accuracy and efficiency and return a score based on these metrics.

Environment

We have set up a conda environment for you at /workspace/conds, and it is already activated. It includes the packages required to run LLaMA-Factory and vLLM.

File structure:

/workspace/
├── data/
|   ├── checkpoints/                            # your trained model  
|   |   ├──  DeepSeek-R1-Distilled-Qwen-1.5B
|   |   └──  DeepScaleR-Preview-1.5B
|   ├── datasets/
|   |   ├── aime-2024.parquet                   # do not modify this file
|   |   └── long_cot_calibration.jsonl          # do not modify this file
|   └── outputs/
└── task/
    ├── repositories/                          
    |   └── LLaMA-Factory                      # you can modify the src/llamafactory/train/dpo/trainer.py and other related files
    ├── scripts/      
    |   ├── llm.py                         # do not modify this file
    |   ├── train.yaml                     
    |   ├── train.sh                     
    |   └── eval_aime24.py                 # do not modify this file      
    └── task_description.md

Scripts

Evaluation: Execute the following command to evaluate your trained model and obtain the results.

cd /workspace/task/scripts
serve run llm:build_app model=your_model_path/model_name tensor-parallel-size=1

# open another terminal
python /workspace/task/scripts/eval_aime24.py --temperature 0.7 --top_p 0.95 --max_tokens 16384 --model model_name --test_file /workspace/data/datasets/aime-2024.parquet

/workspace/task/scripts/eval_aime24.py: This is the evaluation script. Use the --model model_name argument to specify the model for inference. Note that model_name refers to the model's name only, without including the file path.
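
For example, if your trained model were saved at /workspace/data/checkpoints/my_dualalign_model (an illustrative name), you would serve it with model=/workspace/data/checkpoints/my_dualalign_model and then evaluate with --model my_dualalign_model.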

/workspace/task/scripts/llm.py: This is the vLLM-based parallel inference engine. For example, if you have 4 GPU cards and set tensor-parallel-size=1, each GPU will hold a full replica of the model and perform inference in parallel.
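
By the same logic, on the provided 8 × 80GB node, tensor-parallel-size=1 would give eight independent replicas serving requests in parallel; since a 1.5B model fits comfortably on a single 80GB GPU, there is usually no need to increase this value.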

/workspace/task/scripts/train.yaml: This is the training configuration file. You can modify the file to fit your needs.

/workspace/task/scripts/train.sh: This is the training script. You can modify the script to fit your needs.