Math Problem Curation

Data Filtering Text Math
Created by: Dayuan Fu
2025-09-03

Motivation

The quality of training data is crucial for developing effective reasoning models. Rather than using all available data, researchers need to systematically identify which problems will most effectively elicit complex reasoning capabilities from language models. This homework focuses on developing data curation strategies to select the most valuable training examples from a large pool of mathematical problems.

Task

You are given a large collection of 10,000 mathematical problems from various sources and difficulty levels. Your task is to design and implement a systematic approach to select exactly 800 high-quality problems that will be most effective for training a mathematical reasoning model.

You need to:

  1. Select 800 high-quality problems from the given 10,000 mathematical problems (one possible selection strategy is sketched after this list).
  2. Use /workspace/task/scripts/generate_data.py to generate the training dataset. You may need to modify the script to make the generation process more effective.
  3. Filter the generated data, transform the filtered data into training data, and save it to /workspace/task/repositories/LLaMA-Factory/data/training_datas.jsonl.
  4. Train a model on the training dataset. Try to figure out an efficient way to train the model. (Since you have 16 GPUs, you can use multi-machine training or assign different tasks to different machines.)
  5. Evaluate the performance of the trained model on the development set and test set.
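
Step 1 leaves the selection criterion open. One possible strategy, sketched below, is pass-rate-based difficulty filtering: sample several solutions per problem with one of the provided models, grade them with check_is_correct from grade.py (described under Scripts), and keep the problems the model solves only sometimes. Everything beyond the file paths stated in this document is an assumption: the exact signature of check_is_correct, the intermediate samples.jsonl file (a sketch for producing it appears in the Data section), and the pass-rate window.

import json
import sys
from collections import defaultdict

sys.path.append("/workspace/task/scripts")
from grade import check_is_correct  # provided; assumed signature: check_is_correct(solution, answer) -> bool

RAW = "/workspace/data/datasets/raw_math_problems.jsonl"
SAMPLES = "/workspace/data/outputs/samples.jsonl"        # hypothetical: several sampled solutions per problem
OUT = "/workspace/data/outputs/final_answer.jsonl"
TARGET_SIZE = 800

# Index the raw problem pool by id.
problems = {}
with open(RAW) as f:
    for line in f:
        rec = json.loads(line)
        problems[rec["id"]] = rec

# Count how many sampled solutions are correct for each problem.
correct, total = defaultdict(int), defaultdict(int)
with open(SAMPLES) as f:
    for line in f:
        rec = json.loads(line)                           # assumed fields: "id", "solution"
        total[rec["id"]] += 1
        if check_is_correct(rec["solution"], problems[rec["id"]]["answer"]):
            correct[rec["id"]] += 1

# Keep problems the model sometimes (but not always) solves; rank by
# closeness to a 50% pass rate as a rough proxy for useful difficulty.
# If fewer than TARGET_SIZE problems qualify, widen the pass-rate window.
scored = sorted(
    (abs(correct[pid] / n - 0.5), pid)
    for pid, n in total.items()
    if 0 < correct[pid] < n
)

with open(OUT, "w") as f:
    for _, pid in scored[:TARGET_SIZE]:
        p = problems[pid]
        f.write(json.dumps({"id": p["id"], "question": p["question"], "answer": p["answer"]}) + "\n")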

You should work under the /workspace/task and /workspace/data/outputs directories.

You should output the following files, where each line is a JSON object with the listed fields:

  • /workspace/data/outputs/final_answer.jsonl: The final dataset containing 800 problems
    • Fields: “id”, “question”, “answer”
  • /workspace/data/outputs/dev_set_result.jsonl: The result on the development set (The order and number of the problems should be the same as the development set)
    • Fields: “id”, “question”, “solution”, “answer”
  • /workspace/data/outputs/test_set_result.jsonl: The result on the test set (The order and number of the problems should be the same as the test set)
    • Fields: “id”, “question”, “solution”
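
Before running the official validation script, a quick self-check of the three output files can catch format mistakes early. This is only a convenience sketch and does not replace validate_dataset.py; it uses the datasets/ path from the Data section (the file tree later writes dataset/, so use whichever actually exists) and assumes dev.jsonl and test.jsonl carry an "id" field.

import json

def load_jsonl(path):
    with open(path) as f:
        return [json.loads(line) for line in f]

final = load_jsonl("/workspace/data/outputs/final_answer.jsonl")
assert len(final) == 800, f"expected 800 problems, got {len(final)}"
assert all({"id", "question", "answer"} <= rec.keys() for rec in final)

dev_ref = load_jsonl("/workspace/data/datasets/dev.jsonl")
dev_out = load_jsonl("/workspace/data/outputs/dev_set_result.jsonl")
assert [r["id"] for r in dev_out] == [r["id"] for r in dev_ref], "dev order/size mismatch"
assert all({"id", "question", "solution", "answer"} <= rec.keys() for rec in dev_out)

test_ref = load_jsonl("/workspace/data/datasets/test.jsonl")
test_out = load_jsonl("/workspace/data/outputs/test_set_result.jsonl")
assert [r["id"] for r in test_out] == [r["id"] for r in test_ref], "test order/size mismatch"
assert all({"id", "question", "solution"} <= rec.keys() for rec in test_out)

print("basic format checks passed")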

You can submit your answer in the files above up to 3 times (each time with different training data and its corresponding inference results). You should try your best to get the highest score.

Data

Dataset:

  • raw_math_problems.jsonl: 10,000 mathematical problems from mixed sources
    • Fields: “id”, “question”, “answer”
    • Location: /workspace/data/datasets/raw_math_problems.jsonl

Models:

  • DeepSeek-R1-Distill-Qwen-32B: /workspace/data/checkpoints/DeepSeek-R1-Distill-Qwen-32B
  • Qwen2.5-32B-Instruct: /workspace/data/checkpoints/Qwen2.5-32B-Instruct
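
For difficulty estimation during selection (not for final result generation, which must go through the provided scripts per the constraints below), the provided checkpoints can be used directly with vLLM's offline API. This is a minimal sketch: the prompt formatting, sampling parameters, samples.jsonl path, and tensor_parallel_size=8 (one 8-GPU node for a 32B model) are all assumptions.

import json
from vllm import LLM, SamplingParams

MODEL = "/workspace/data/checkpoints/DeepSeek-R1-Distill-Qwen-32B"

with open("/workspace/data/datasets/raw_math_problems.jsonl") as f:
    problems = [json.loads(line) for line in f]           # fields: "id", "question", "answer"

llm = LLM(model=MODEL, tensor_parallel_size=8)             # one 8-GPU node
params = SamplingParams(temperature=0.7, max_tokens=4096, n=4)

# Note: no chat template is applied here; a real run would format prompts
# the same way generate_data.py does.
outputs = llm.generate([p["question"] for p in problems], params)

with open("/workspace/data/outputs/samples.jsonl", "w") as f:
    for prob, out in zip(problems, outputs):               # vLLM preserves input order
        for comp in out.outputs:
            f.write(json.dumps({"id": prob["id"], "solution": comp.text}) + "\n")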

Constraints

  1. Selection size: Must be exactly 800 problems.
  2. Processing time limit: Complete the selection process within 48 hours.
  3. Output: You should output final_answer.jsonl, dev_set_result.jsonl, and test_set_result.jsonl in the right format and location.
  4. GPU Usage: You can use the 2 × 8 × 80 GB GPUs in the cluster to filter questions. The scripts provided to you only use 8 GPUs for training/inference; you need to modify the scripts yourself to use more (see the sketch after this list).
  5. Evaluation: You can only evaluate your submission on the test set 2 times before the deadline and the final submission. Remember that you can only submit your results after you have placed them in final_answer.jsonl, dev_set_result.jsonl, and test_set_result.jsonl; otherwise you will get 0 points for that evaluation.
  6. Final Result Generation: You cannot change the system prompt or other vLLM parameters in the inference scripts; you may only change the model path and adapt the scripts to the multi-GPU machines (i.e., generate_data.py, generate_solution.py, training_config_for_testing.yaml).
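
One way to use both machines during selection-time generation is to shard the problem pool, launch one 8-GPU vLLM instance per node on its own shard, and merge the shard outputs afterwards. The sharding scheme and file names below are illustrative assumptions.

import argparse
import json

parser = argparse.ArgumentParser()
parser.add_argument("--shard", type=int, choices=[0, 1], required=True,
                    help="which of the two 8-GPU machines this process runs on")
args = parser.parse_args()

with open("/workspace/data/datasets/raw_math_problems.jsonl") as f:
    problems = [json.loads(line) for line in f]

# Node i handles every second problem; concatenate the shard files afterwards.
my_problems = problems[args.shard::2]
out_path = f"/workspace/data/outputs/samples_shard{args.shard}.jsonl"
print(f"node {args.shard}: {len(my_problems)} problems -> {out_path}")
# ... generate solutions for my_problems with an 8-GPU vLLM instance
#     (as sketched in the Data section), writing them to out_path.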

Reasonably utilize GPU resources

You need to use GPU resources reasonably: utilization should not fall below 70% on each individual GPU. Try to design strategies that keep GPU usage high and reduce the time it takes to produce your answer.
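
A quick way to spot-check the 70% guideline is to poll nvidia-smi's CSV query interface on each node while your jobs run; the sketch below only reports a single snapshot.

import subprocess

out = subprocess.run(
    ["nvidia-smi", "--query-gpu=index,utilization.gpu", "--format=csv,noheader,nounits"],
    capture_output=True, text=True, check=True,
).stdout

for line in out.strip().splitlines():
    idx, util = [x.strip() for x in line.split(",")]
    flag = "" if int(util) >= 70 else "  <-- below 70%"
    print(f"GPU {idx}: {util}%{flag}")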

Evaluation

Evaluation metrics and success criteria:

  1. Right format and location: The 3 output files are in the right format and location.
  2. Dataset size validation: Exactly 800 problems selected in final_answer.jsonl.
  3. Evaluation: The final dataset should enable the trained model to achieve high accuracy on the development and test sets. (The test set answers are not provided to you.)

Evaluation scripts:

  • /workspace/task/scripts/evaluate.py: Evaluates the final dataset. You should run this script after you have generated the final dataset in the right format and location.
  • /workspace/task/scripts/validate_dataset.py: Validates the final dataset against all constraints.

DO NOT modify the evaluation scripts; we will double-check the correctness of the evaluation scripts.

Success criteria:

  • Pass all validation checks (5 points); if the validation script fails, you will get 0 points for this task.
  • Surpass the performance of the random sampling baseline on the development set (5 points)
  • Surpass the performance of the random sampling baseline on the test set (20 points)
  • Achieve as high a performance as possible on the test set (70 points)

Note: There is a data-selection implementation made by a talented student. Suppose your score is x1, the random sampling baseline score is x2, and the talented student's score is x3; the 70 points will be calculated as 70 * (x1 - x2) / (x3 - x2).

Reference: dev_set_x2 = 54.2, dev_set_x3 = 62.3
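
For illustration only: the 70-point component is computed on test-set scores, which are not provided; plugging in the published dev-set reference numbers and a hypothetical score of 60.0 shows how the scaling behaves.

x2, x3 = 54.2, 62.3                  # random-sampling baseline, talented-student reference (dev set)
x1 = 60.0                            # hypothetical score for your submission
points = 70 * (x1 - x2) / (x3 - x2)
print(round(points, 1))              # 70 * 5.8 / 8.1 ≈ 50.1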

You should try your best to get the highest score.

Environment

We have set up the conda environment for you in /workspace/conda, and the environment is already activated. In this environment, we have installed the packages needed to use LLaMA-Factory and vLLM.

Initial file structure:

/workspace/
├── conda/
├── data/
|   ├── checkpoints/
|   |   ├── DeepSeek-R1-Distill-Qwen-32B        # do not modify this directory
|   |   └── Qwen2.5-32B-Instruct                # do not modify this directory
|   ├── dataset/
|   |   ├── raw_math_problems.jsonl             # do not modify this file
|   |   ├── dev.jsonl                           # do not modify this file
|   |   └── test.jsonl                          # do not modify this file
|   └── outputs/
└── task/
    ├── repositories/                          # do not modify this directory
    ├── scripts/                               # do not modify this directory
    |   ├── evaluate.py                        # do not modify this script
    |   ├── dev_dataset.py                     # do not modify this script
    |   ├── grade.py                           # do not modify this script
    |   ├── generate_data.py                   # do not modify this script
    |   ├── training.sh                       
    |   ├── generation.py                      
    |   ├── generate_solution.py               
    |   └── ...                                # you can generate scripts in this directory, and you can modify them if you want
    └── task_description.md

Scripts

You can generate scripts in the /workspace/task/scripts directory. You should not modify scripts that are originally in the /workspace/task/scripts directory.

The following scripts are provided to you, do not modify them:

  • /workspace/task/scripts/training.sh: Trains a model on the final dataset.

    • Input: No input, but you should run this script after you have generated the final dataset in the right format and location.
    • Output: A model at /workspace/data/checkpoints/model
  • /workspace/task/scripts/grade.py: Given a solution and a reference answer, the function check_is_correct in this script grades the solution and returns whether it is correct. You can use this function to grade solutions (see the sketch at the end of this section).

  • /workspace/task/scripts/generate_solution.py: Given the problem file, generate the solution for the development set and test set.

    • Input:
      • set: “dev” or “test”
    • Output:
      • The development-set or test-set solutions generated by the model, saved at /workspace/data/outputs/dev_set_result.jsonl or /workspace/data/outputs/test_set_result.jsonl
  • /workspace/task/scripts/validate_dataset.py: Validate the final dataset against all constraints.

    • Input: No input, but you should run this script after you have generated the final dataset in the right format and location.
    • Output: The validation result, which is saved at /workspace/task/scripts/dev_dataset_result.jsonl
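
Since test-set evaluations are limited to two, it is worth self-grading the development-set output with check_is_correct first (dev_set_result.jsonl already contains the reference answer field). A minimal sketch, again assuming the signature check_is_correct(solution, reference_answer) -> bool:

import json
import sys

sys.path.append("/workspace/task/scripts")
from grade import check_is_correct

with open("/workspace/data/outputs/dev_set_result.jsonl") as f:
    results = [json.loads(line) for line in f]

n_correct = sum(check_is_correct(r["solution"], r["answer"]) for r in results)
print(f"dev accuracy: {n_correct / len(results):.3f}")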