Search-Augmented RL Reward

Reward Design Text Deep Research
Created by: Xiangkun Hu
2025-09-03

Motivation

Reinforcement learning has shown great promise for improving the reasoning capabilities of language models. However, some problems require the model to search for external information before it can answer. This task asks you to design a reward function for training language models with search-augmented reasoning capabilities.

Task

Design the reward function

Your task is to design and implement a creative and effective reward function for training language models on search-augmented reasoning tasks. The reward function should provide a clear reward signal for RL training and guide the model to learn to search for information and answer questions correctly. At a minimum, it should handle the following cases:

  • Reward the model for giving the correct answer.
  • Give a negative reward when the format of the response is invalid.

These are only the two most basic aspects of the reward function; you are encouraged to come up with more creative ways of assigning rewards. A minimal sketch covering just these two cases is shown below.
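
The sketch below assumes the model is prompted to wrap its final answer in <answer>...</answer> tags and uses illustrative reward values (-1 for an unparsable response, 1 for an exact match, 0 otherwise); both the tag format and the values are assumptions to adapt, not requirements.

import re

def compute_reward(response: str, golden_answers: list[str]) -> float:
    """Minimal reward sketch: penalize invalid format, reward an exact-match answer."""
    # Assumption: the prompt instructs the model to emit <answer>...</answer>.
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    if match is None:
        return -1.0  # invalid format: no parsable final answer
    prediction = match.group(1).strip().lower()
    # Lower-cased exact match against any golden answer, mirroring the EM metric below.
    if any(prediction == ans.strip().lower() for ans in golden_answers):
        return 1.0
    return 0.0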

Implement the reward function in verl

verl is one of the most popular frameworks for RL training of LLMs, and we have provided the codebase for you in /workspace/task/repositories/verl. You should implement the reward function in this codebase.
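
How the function is wired in depends on the verl version in the provided checkout; recent verl releases let you point the trainer at a custom scoring function (e.g. via the custom_reward_function config), which typically expects a signature like the one below. Treat this wrapper as a hedged sketch that reuses the compute_reward sketch above, and verify the exact hook in repositories/verl before relying on it.

def compute_score(data_source, solution_str, ground_truth, extra_info=None):
    """Hypothetical adapter matching the signature verl's custom reward hook commonly uses."""
    golden_answers = ground_truth if isinstance(ground_truth, list) else [ground_truth]
    return compute_reward(solution_str, golden_answers)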

Training a Qwen-2.5-3B-Base model

Once you have finished coding, you can use the following commands to train a Qwen-2.5-3B-Base model:

cd /workspace/task
# start the search service
bash scripts/retriever_launch.sh

# train the model
cd /workspace/task/repositories
bash train_grpo.sh

Data

We provide a training set, a dev set, and a test set:

  • Training set is stored in /workspace/data/datasets/train.parquet
  • Dev set is stored in /workspace/data/datasets/dev.parquet
  • Test set is stored in /workspace/data/datasets/test_no_answer.parquet.

Each example in the training set and dev set contains the following fields:

  • id: str, the ID of the example.
  • question: str, the question to answer.
  • golden_answers: list[str], the golden answers to the question; each is a short answer.
  • prompt: list[dict], the first user message to the model, in the format [{"role": "user", "content": "<the content of the question>"}]

The test set contains no golden_answers. Once you have trained the model, you should perform inference on this dataset and submit the result for evaluation. (Your submission must include predictions for every test example.)
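
For a quick look at the data, the provided environment ships pandas, which can read the parquet splits directly (assuming a parquet engine such as pyarrow is also available):

import pandas as pd

# Load the training split and inspect one example.
train = pd.read_parquet("/workspace/data/datasets/train.parquet")
print(train.columns.tolist())   # expected: id, question, golden_answers, prompt
row = train.iloc[0]
print(row["question"])
print(row["golden_answers"])    # list of acceptable short answers
print(row["prompt"])            # [{"role": "user", "content": "..."}]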

Constraint

  • Time budget: 24 hours maximum.
  • You can only use the verl codebase that we provided to you in /workspace/task/repositories/verl.
  • You may only generate answers with the Qwen 2.5 3B model that you trained on the provided training set. A monitor checks whether you have followed this rule; violating it will disqualify you.

Evaluation

Evaluation Metric

We use Exact Match (EM) as the evaluation metric. If the model’s predicted answer matches any answer in golden_answers after lowercasing, the prediction is considered correct.
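
Read literally, the metric corresponds to the following check (a sketch consistent with the description above; the authoritative scoring is whatever the provided inference.py computes):

def exact_match(prediction: str, golden_answers: list[str]) -> bool:
    """True if the prediction equals any golden answer after lowercasing and trimming."""
    pred = prediction.strip().lower()
    return any(pred == ans.strip().lower() for ans in golden_answers)

# Dataset-level EM is simply the mean of this check over all examples.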

Serve vllm server for inference

Before evaluating on the dev/test set, you should first use vLLM to serve the trained model as an LLM service on http://localhost:8000. Replace <model_path> with the path to your trained model in /workspace/task/vllm_inference.sh, then start the service with the following command:

bash /workspace/task/vllm_inference.sh

Make sure the service is running before executing the generation script, as it makes API calls to this endpoint. Also make sure the service is not running before you start training.
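
A quick way to confirm the server is up, assuming vllm_inference.sh launches vLLM's usual OpenAI-compatible server (adjust the endpoint if the script serves a different API):

import json
import urllib.request

# Sanity check before running inference.py: list the models the server exposes.
try:
    with urllib.request.urlopen("http://localhost:8000/v1/models", timeout=5) as resp:
        print(json.loads(resp.read().decode()))
except Exception as exc:
    print(f"vLLM server not reachable: {exc}")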

Then you can run /workspace/task/scripts/inference.py on the dev/test dataset to score your workflow. If the score is lower than you expected, you can modify your workflow and run it again.

Evaluate on the dev set

Run the following command to generate the predictions and get the evaluation result on the dev set:

cd /workspace/task/scripts
python inference.py --split=dev

Evaluate on the test set

Run the following command to generate the predictions on the test set:

cd /workspace/task/scripts
python inference.py --split=test --output_dir ${base_ai_engineer_path}/evaluations/${task_name}/data/references

This will produce an output file at /workspace/data/outputs/test.json, which you can then submit for evaluation.
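
Before submitting, it is worth checking that the output covers every test example. The structure of test.json is determined by the provided inference.py; the snippet below assumes it is a list of records each carrying an id field, and should be adapted if the actual format differs:

import json
import pandas as pd

test = pd.read_parquet("/workspace/data/datasets/test_no_answer.parquet")
with open("/workspace/data/outputs/test.json") as f:
    preds = json.load(f)

# Assumption: test.json is a list of dicts with an "id" key.
pred_ids = {p["id"] for p in preds}
missing = set(test["id"]) - pred_ids
print(f"{len(preds)} predictions written, {len(missing)} test examples missing")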

Environment

Execution Environment

An inference Conda environment, /workspace/conda, has been provided for this task. It includes the following key packages:

  • Python 3.10
  • PyTorch 2.0+ with CUDA support
  • Transformers 4.47.1
  • numpy, pandas

A retriever Conda environment, /workspace/conda_retriever, has also been provided; it is used for running retriever_launch.sh.

Scripts

You may add new scripts to the /workspace/task/scripts directory, but you must not modify the scripts that are originally in that directory.

The following scripts are provided to you, do not modify them:

  • /workspace/task/scripts/retriever_launch.sh: The script for launching the retriever. You should run this script before inference.
  • /workspace_backup/task/scripts/retrieval_server.py: The file for launching the retriever.