Search-Augmented Reasoning Data

Data Augmentation Text Deep Research
Created by: Xiangkun Hu
2025-09-03

Motivation

Deep research through reasoning and searching requires high-quality training data that demonstrates effective interaction patterns between language models and search engines. Creating synthetic training data for supervised fine-tuning (SFT) can help models learn to effectively combine reasoning steps with search queries, use retrieved information appropriately, and generate well-structured responses. This approach can serve as a foundation for more advanced training methods or as a standalone solution for search-enhanced question answering.

Task

Goal

Your task is to synthesize high-quality training data for deep research through reasoning and searching, and train a language model using supervised fine-tuning. Your goal is to achieve the highest possible performance on the test set. You need to:

  1. Create a synthetic dataset generation pipeline that produces training examples where models learn to use a search tool to answer questions requiring external knowledge. Implement a data synthesis process that generates reasoning trajectories, creates realistic search queries, and demonstrates effective use of retrieved information. Generate synthetic training examples covering both single-hop and multi-hop reasoning questions. You should only use Qwen-2.5-72B-Instruct to generate the synthetic data. (Reference: /workspace/task/scripts/training_data_sample.py)

  2. Transform the synthetic dataset into training data in the ShareGPT format with the correct prompt.

  3. Train a language model (Qwen-2.5-7B-Instruct) using supervised fine-tuning on this synthetic data to maximize test set performance.

The model should learn to naturally incorporate search tool usage into its reasoning process when answering knowledge-intensive questions. The synthetic examples should demonstrate realistic patterns of when to search, how to formulate effective queries, and how to synthesize retrieved information with reasoning to produce correct answers.

  4. Run inference with your model on the test set and write your final answers to /workspace/data/outputs/test.json. The output file should be a JSON list, where each entry contains at least two keys: id and predicted_answer. The predictions must match the test set in order and count, and each predicted_answer should be a word or short phrase.

The format and order of this file are described in /workspace/task/scripts/inference.py
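As a minimal sketch, the required output file can be assembled as follows; the helper name build_output is hypothetical, while the key names and the order/count requirement come from the description above:

```python
import json

def build_output(test_items, answers):
    """Pair each test item with its predicted answer, preserving order and count."""
    assert len(test_items) == len(answers)
    return [
        {"id": item["id"], "predicted_answer": ans}
        for item, ans in zip(test_items, answers)
    ]

# Usage (paths from this task):
# test_items = json.load(open("/workspace/data/datasets/test.json"))
# answers = [...]  # one short word/phrase prediction per question, in order
# with open("/workspace/data/outputs/test.json", "w") as f:
#     json.dump(build_output(test_items, answers), f, ensure_ascii=False, indent=2)
```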

All work should be conducted under the /workspace directory.

You may submit your answer file up to 3 times (each with different training data and its corresponding inference results). Try your best to achieve the highest score.

Search Server

We have set up the retriever Conda environment at /workspace/conda_retriever

To launch the search server, activate this conda environment and run the following commands. The server will listen on port 8001.

conda activate /workspace/conda_retriever/
cd /workspace
bash scripts/retriever_launch.sh
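A minimal Python client sketch for the local search server; the endpoint path ("/retrieve") and payload fields ("queries", "topk") are assumptions, so check scripts/retriever_launch.sh for the actual API:

```python
import requests  # third-party; available in most Python environments

# Assumed endpoint of the local search server (port 8001 per the task).
SEARCH_URL = "http://localhost:8001/retrieve"

def build_search_request(query, topk=3):
    # Payload shape is an assumption; adjust to the real server API.
    return {"queries": [query], "topk": topk}

def search(query, topk=3):
    """POST a single query and return the parsed JSON response."""
    resp = requests.post(SEARCH_URL, json=build_search_request(query, topk), timeout=30)
    resp.raise_for_status()
    return resp.json()
```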

Deploy Qwen2.5-72B-Instruct for Data Synthesis

We recommend using vLLM to deploy Qwen2.5-72B-Instruct as an LLM service for data synthesis. The checkpoint is stored in /workspace/data/checkpoints/Qwen2.5-72B-Instruct. Use the following command to serve the model:

bash /workspace/task/vllm_inference.sh

Remember: if you want to train, you need to kill this process first.

Implement the data synthesis pipeline

You should implement the data synthesis pipeline: design a workflow together with the corresponding prompts and tools. (For example, modify /workspace/task/scripts/training_data_sample.py and run it on the training set.) Once you have synthesized the training data, convert it into the LLaMA-Factory ShareGPT data format and save it to /workspace/task/repositories/LLaMA-Factory/data/sft_dataset.json

Hint: Read /workspace/task/repositories/LLaMA-Factory/data/README.md
Hint: The training format should match the inference format.
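For orientation, a minimal sketch of wrapping one synthesized example in a ShareGPT-style record; the trajectory string and system prompt are placeholders produced by your own pipeline, and the exact keys should be verified against the LLaMA-Factory data README:

```python
def to_sharegpt(question, trajectory, system_prompt):
    """Wrap one synthesized trajectory as a ShareGPT-style record.

    `trajectory` is assumed to be the full assistant-side text (interleaved
    reasoning, search calls, retrieved information, and the final answer)
    produced by the synthesis pipeline.
    """
    return {
        "system": system_prompt,
        "conversations": [
            {"from": "human", "value": question},
            {"from": "gpt", "value": trajectory},
        ],
    }
```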

SFT with Qwen2.5-7B-Instruct and the synthetic data

Once you have completed the above steps, run the following commands to start SFT training:

cd /workspace/task/repositories/LLaMA-Factory
llamafactory-cli train qwen2_5_7b_instruct_full_sft.yaml

You can modify the YAML file to suit your setup.
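For orientation, a sketch of the fields you are most likely to adjust; the exact keys and paths must be checked against the provided qwen2_5_7b_instruct_full_sft.yaml, so treat every value below as an illustrative assumption:

```yaml
# Illustrative LLaMA-Factory SFT settings -- align with the provided yaml.
model_name_or_path: /workspace/data/checkpoints/Qwen2.5-7B-Instruct  # assumed checkpoint path
stage: sft
finetuning_type: full
dataset: sft_dataset            # must match the name registered in dataset_info.json
template: qwen
cutoff_len: 4096
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-5
num_train_epochs: 2.0
bf16: true
output_dir: /workspace/data/checkpoints/qwen2_5_7b_sft  # assumed output path
```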

Data

Seed data for SFT data synthesis

The seed data for synthesizing the training data is stored in /workspace/data/datasets/train.json. This JSON file contains a list of question-answer pairs. Here is an example of one item in the seed data:

  {
    "id": "329", // The id of the question
    "question": "who came to the throne in the glorious revolution?", // The question
    "golden_answers": [
      "William III of England",
      "William"
    ] // The list of ground truth answers, each of them is a short answer
  }

Evaluation data

We provide a dev set and a test set for evaluation, stored in /workspace/data/datasets/dev.json and /workspace/data/datasets/test.json respectively.

The dev set has the same format as the seed data, and the test set has the same format as the dev set, but with the golden_answers field removed.

Constraints

While you can propose innovative ideas for solving this task, you must strictly adhere to the following constraints:

  1. You can only use the seed data to synthesize the training data.
  2. You can only use the Qwen2.5-72B-Instruct model for data synthesis. You must not use it for inference!
  3. You must complete this task within 24 hours.

Evaluation

Evaluation metrics

We use Exact Match (EM) as the evaluation metric; scores range from 0 to 1. A prediction is correct if it exactly matches any of the ground-truth answers (case-insensitive).
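The metric can be sketched as follows; note that the official scoring in inference.py may apply extra normalization (e.g. stripping punctuation), so treat this only as an illustration of case-insensitive EM as described above:

```python
def normalize(text):
    """Lowercase and trim whitespace for case-insensitive comparison."""
    return text.strip().lower()

def exact_match(predicted_answer, golden_answers):
    """Return 1 if the prediction exactly matches any gold answer, else 0."""
    pred = normalize(predicted_answer)
    return int(any(pred == normalize(gold) for gold in golden_answers))
```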

Serve vllm server for inference

Before evaluating on the dev/test set, you should first use vLLM to serve the trained model as an LLM service at http://localhost:8000. Change <model_path> to your model path in /workspace/task/vllm_inference.sh and use the following command to start the service:

bash /workspace/task/vllm_inference.sh

Make sure the service is running before executing the generation function, as it will make API calls to this endpoint.

Then you can run /workspace/task/scripts/inference.py on the dev/test dataset to score your workflow. If the score is lower than expected, modify your workflow and run it again.
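A minimal sketch of calling the served model through vLLM's OpenAI-compatible API; the model name below is a placeholder and should match whatever vllm_inference.sh serves:

```python
import requests  # third-party; available in most Python environments

API_URL = "http://localhost:8000/v1/chat/completions"

def build_chat_request(question, model="qwen2_5_7b_sft"):
    # `model` is an assumed name -- use the one the vLLM server was launched with.
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "temperature": 0.0,
    }

def ask(question):
    """Send one question and return the model's text answer."""
    resp = requests.post(API_URL, json=build_chat_request(question), timeout=120)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```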

Evaluate on the dev set

Run the following command to generate the predictions and get the evaluation result on the dev set:

cd /workspace/task/scripts
python inference.py --split=dev

Evaluate on the test set

Run the following command to generate the predictions on the test set:

cd /workspace/task/scripts
python inference.py --split=test --output_dir <output_dir>

This will produce an output file in /workspace/data/outputs/test.json, then you can submit this file for evaluation.

Environment

The basic conda environment

We have set up the basic conda environment for this task in /workspace/conda. LLaMA-Factory and vllm are already installed. You can use vllm to deploy the model for inference and LLaMA-Factory to perform SFT.

  • /workspace_backup/task/scripts/vllm_inference.py: The script for launching the vLLM server for inference (it must be shut down during training)