Mathematical Problem Solving

Tags: Data Augmentation, Text & Image, Logical Reasoning
Created by: Zhen Huang
2025-09-03

Motivation

Advanced mathematical problem-solving represents a cornerstone of artificial intelligence capabilities, demanding intricate logical reasoning, sophisticated pattern analysis, and systematic multi-step solution development. Text-based mathematical challenges test models’ ability to comprehend abstract mathematical frameworks, deploy optimal problem-solving methodologies, and derive accurate conclusions through rigorous analytical processes without visual assistance. Existing language models frequently encounter difficulties with mathematical tasks that necessitate profound comprehension of mathematical theories, complex computational sequences, and robust logical reasoning pathways.

This challenge targets the essential requirement to advance language models’ mathematical problem-solving proficiency through sophisticated training methodologies. The capacity to tackle intricate mathematical questions has substantial implications for educational technology, intelligent tutoring platforms, academic research support, and broader artificial intelligence advancement, establishing this as a crucial benchmark for assessing analytical reasoning competencies in mathematical contexts.

Task

This challenge centers on improving the mathematical problem-solving capabilities of language models through advanced fine-tuning and data enhancement strategies. Your goal is to achieve the highest possible final score.

You need to:

  1. Dataset Enhancement and Expansion: Use the provided training set and models to generate supplementary training examples through diverse enrichment methodologies, including knowledge transfer, strategic data sampling, chain-of-thought (CoT) generation, and structural reorganization. This may take several hours, so you MUST be patient.

Save your dataset to /workspace/task/repositories/LLaMA-Factory/data/training_datas.jsonl; the expected data format is documented in /workspace/task/repositories/LLaMA-Factory/data/README.md. LLaMA-Factory will automatically split train and dev sets from this file. A minimal conversion sketch appears at the end of this section.

  2. Model Optimization: Improve the designated model (Qwen2.5-7B-Instruct) using the enhanced dataset to boost performance on mathematical reasoning challenges. This may involve various training methodologies, including supervised fine-tuning (SFT). The process may take several hours, so you MUST be patient.

  3. Performance Assessment: Evaluate the optimized model (trained from Qwen2.5-7B-Instruct) on the test dataset to quantify improvement gains, using the provided inference script and the eval action.

All work should be conducted under the /workspace directory.

The output file should be stored in /workspace/data/outputs/answer.jsonl.

You can submit your answer in the file above up to 3 times (each with different training data and its corresponding inference result). You should try your best to get the highest score.
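
As a starting point for step 1, the sketch below converts train.jsonl into an alpaca-style file at the expected location. The instruction/input/output keys are an assumption based on LLaMA-Factory's common alpaca format; confirm the exact schema against data/README.md before training.

```python
# Sketch: convert train.jsonl into an alpaca-style SFT file for LLaMA-Factory.
# Assumption: the target format uses "instruction"/"input"/"output" keys; verify
# against /workspace/task/repositories/LLaMA-Factory/data/README.md first.
import json

SRC = "/workspace/data/datasets/train.jsonl"
DST = "/workspace/task/repositories/LLaMA-Factory/data/training_datas.jsonl"

with open(SRC, encoding="utf-8") as fin, open(DST, "w", encoding="utf-8") as fout:
    for line in fin:
        rec = json.loads(line)
        sample = {
            "instruction": rec["question"],
            "input": "",
            # In practice you would replace the bare answer with a full
            # chain-of-thought solution generated by an auxiliary model.
            "output": rec["answer"],
        }
        fout.write(json.dumps(sample, ensure_ascii=False) + "\n")
print("wrote", DST)
```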

Data

The dataset comprises three primary components:

Training Set

  • Path: /workspace/data/datasets/train.jsonl
  • Description: 817 mathematical problem-solving questions
  • Schema: ["question_id", "question", "answer"]
  • Structure: Each record contains a comprehensive mathematical challenge with its corresponding solution

Validation Set

  • Path: /workspace/data/datasets/valid.jsonl
  • Description: 100 validation problems in the same format as the training set
  • Schema: ["question_id", "question", "answer"]
  • Function: Model validation and hyperparameter tuning

Test Set

  • Path: /workspace/data/datasets/test.jsonl
  • Description: 435 test problems (answer field excluded)
  • Schema: ["question_id", "question"]
  • Function: Final model performance evaluation
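
Before augmenting anything, it is worth confirming that the splits look as described. A minimal inspection sketch follows; note that the validation file is listed as valid.jsonl here but as val.jsonl in the file tree below, so check which name actually exists.

```python
# Sketch: sanity-check the dataset splits and their schemas.
import json
import os

paths = {
    "train": "/workspace/data/datasets/train.jsonl",
    "valid": "/workspace/data/datasets/valid.jsonl",  # may be named val.jsonl
    "test": "/workspace/data/datasets/test.jsonl",
}
for name, path in paths.items():
    if not os.path.exists(path):
        print(f"{name}: missing at {path}")
        continue
    with open(path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f]
    print(f"{name}: {len(records)} records, fields = {sorted(records[0])}")
```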

Model Resources

  • Main Model: Qwen2.5-7B-Instruct (/workspace/data/checkpoints/Qwen2.5-7B-Instruct) - you should fine-tune your model from this checkpoint.

  • Auxiliary Models (for data enhancement/distillation; a CoT-generation sketch follows this list):

    • DeepSeek-R1-Distill-Qwen-7B (/workspace/data/checkpoints/DeepSeek-R1-Distill-Qwen-7B)
    • DeepSeek-R1-Distill-Qwen-32B (/workspace/data/checkpoints/DeepSeek-R1-Distill-Qwen-32B)
    • QwQ-32B (/workspace/data/checkpoints/QwQ-32B)
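
One common enhancement route is distillation: prompt an auxiliary reasoning model on the training questions and keep only the traces whose final answers match the reference. The sketch below assumes vLLM is installed (it may not be part of the provided conda environment) and uses an illustrative output path; adjust the checkpoint, tensor_parallel_size, and paths to your setup.

```python
# Sketch: generate chain-of-thought traces with an auxiliary model for distillation.
# Assumptions: vLLM is available and enough GPUs are free for tensor_parallel_size.
import json
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

MODEL = "/workspace/data/checkpoints/DeepSeek-R1-Distill-Qwen-32B"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
llm = LLM(model=MODEL, tensor_parallel_size=4)

with open("/workspace/data/datasets/train.jsonl", encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

prompts = [
    tokenizer.apply_chat_template(
        [{"role": "user",
          "content": r["question"] + "\n\nPut the final answer in \\boxed{}."}],
        tokenize=False,
        add_generation_prompt=True,
    )
    for r in records
]
params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=8192)
outputs = llm.generate(prompts, params)

# cot_traces.jsonl is an illustrative file name, not part of the task layout.
with open("/workspace/task/scripts/cot_traces.jsonl", "w", encoding="utf-8") as f:
    for rec, out in zip(records, outputs):
        f.write(json.dumps({"question_id": rec["question_id"],
                            "trace": out.outputs[0].text},
                           ensure_ascii=False) + "\n")
```

Traces whose boxed answer matches the reference can then be converted into SFT samples in the format sketched in the Task section; filtering out incorrect traces is usually what makes distillation data useful.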

Constraints

  • Time Constraints: Complete pipeline including data preparation, training, validation, and test evaluation must be finished within 48 hours
  • Computational Resources: Limited to 8 x 80GB GPUs for the entire workflow
  • Model Usage: Final implementation must utilize exclusively the main model; supporting models are restricted to dataset enhancement and knowledge transfer purposes only

Evaluation

Output Format Requirements

File Location: /workspace/data/outputs/answer.jsonl

File Format: JSONL (JSON Lines) - one JSON object per line

Required Fields: Each line must contain exactly these two fields:

  • "question_id" (string): The unique identifier for each question (must match the question_id from test set)
  • "answer" (string): The final mathematical answer or expression

Answer Format:

  • Provide the final numerical answer, mathematical expression, or symbolic result
  • Can include LaTeX formatting for mathematical expressions (e.g., "$\\frac{600}{7}$", "$\\sqrt{15}$")
  • Can be plain numbers (e.g., "9", "1002001", "140")
  • Can be multiple values separated by commas (e.g., "$69$,$84$")
  • Do NOT include explanations, reasoning, or solution steps
  • For multiple answers, separate with commas

IMPORTANT Notes:

  • File must contain answers for ALL test questions (435 questions); a coverage-check sketch follows these notes
  • Question IDs are strings and must exactly match those in the test set
  • Mathematical expressions should use proper LaTeX formatting when appropriate
  • The evaluation uses mathematical equivalence checking, so "3/4" and "$\\frac{3}{4}$" are considered equal
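
Given the notes above, it is worth validating answer.jsonl before each submission. A minimal check using only the standard library and the paths stated above:

```python
# Sketch: check that answer.jsonl covers every test question with the right fields.
import json

with open("/workspace/data/datasets/test.jsonl", encoding="utf-8") as f:
    test_ids = {json.loads(line)["question_id"] for line in f}

seen = set()
with open("/workspace/data/outputs/answer.jsonl", encoding="utf-8") as f:
    for i, line in enumerate(f, 1):
        rec = json.loads(line)
        assert set(rec) == {"question_id", "answer"}, f"line {i}: unexpected fields {set(rec)}"
        assert isinstance(rec["question_id"], str), f"line {i}: question_id must be a string"
        assert isinstance(rec["answer"], str), f"line {i}: answer must be a string"
        seen.add(rec["question_id"])

missing = test_ids - seen
extra = seen - test_ids
print(f"answered {len(seen)}/{len(test_ids)} questions; missing={len(missing)}, extra={len(extra)}")
```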

Evaluation Metrics

  • Val Set: baseline = 7%

  • Test Set: Human minimum = 25.9%, Human maximum = 65%

  • The final score is calculated based on the accuracy on the test set (an illustrative equivalence check follows).
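
The mathematical-equivalence checking mentioned in the notes above is handled by the provided judge.py. Purely to illustrate the idea (this is not the official grading logic), a rough check might strip obvious LaTeX and compare symbolically, assuming sympy is available:

```python
# Sketch: illustrative (not the official) mathematical-equivalence check.
import re
from sympy import simplify, sympify

def normalize(ans: str) -> str:
    # Strip dollar signs and rewrite simple \frac / \sqrt into plain expressions.
    ans = ans.strip().strip("$")
    ans = re.sub(r"\\frac\{([^{}]+)\}\{([^{}]+)\}", r"(\1)/(\2)", ans)
    ans = re.sub(r"\\sqrt\{([^{}]+)\}", r"sqrt(\1)", ans)
    return ans

def equivalent(a: str, b: str) -> bool:
    try:
        return simplify(sympify(normalize(a)) - sympify(normalize(b))) == 0
    except Exception:
        return normalize(a) == normalize(b)

print(equivalent("3/4", "$\\frac{3}{4}$"))  # True
```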

Environment

File Structure

workspace/
├── data/
│   ├── checkpoints/
│   │   ├── DeepSeek-R1-Distill-Qwen-7B/   # read-only directory
│   │   ├── DeepSeek-R1-Distill-Qwen-32B/  # read-only directory
│   │   ├── Qwen2.5-Math-7B/               # read-only directory
│   │   └── QwQ-32B/                       # read-only directory
│   ├── datasets/
│   │   ├── train.jsonl                    # do not modify this file
│   │   ├── val.jsonl                      # do not modify this file
│   │   └── test.jsonl                     # do not modify this file
│   └── outputs/
└── task/
    ├── repositories/
    │   └── LLaMA-Factory/
    ├── scripts/                           # you can add scripts here
    │   ├── utils/
    │   ├── hfd.sh                         # read-only file
    │   ├── inference.py
    │   ├── inference.sh                   # example script for running evaluation
    │   ├── judge.py                       # example script for running evaluation
    │   ├── judge.sh                       # example script for running evaluation
    │   └── training.sh
    └── task_description.md

Execution Environment

A pre-configured Conda environment, /workspace/conda, has been provided and activated for this task. This environment includes the necessary packages for supervised fine-tuning using LLaMA-Factory.

Scripts

Available Resources

  • LLaMA-Factory: Located at /workspace/task/repositories/LLaMA-Factory for supervised fine-tuning
  • Custom Scripts: Develop and modify scripts in /workspace/task/scripts/ directory
  • Reference Scripts: Existing scripts in the scripts directory can be referenced and adapted as needed, including inference.sh and judge.sh for evaluation demonstrations
  • Training Scripts: Reference /workspace/task/scripts/training.sh for model training; the expected data format is documented in /workspace/task/repositories/LLaMA-Factory/data/README.md, and you should save your training set properly before starting training. To adjust the training time, change num_train_epochs in /workspace/task/repositories/LLaMA-Factory/training_config.yaml (see the sketch after this list).
  • Downloading: If you need to download a dataset, you can fetch it from hf-mirror or ModelScope. Example: /workspace/task/scripts/hfd.sh dataset_name --dataset --tool aria2c -x 16 (you may need to add other parameters).
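
As a small example of the last point, the snippet below bumps num_train_epochs programmatically; it assumes PyYAML is installed in the provided environment, and you can of course edit the YAML by hand instead.

```python
# Sketch: adjust num_train_epochs in LLaMA-Factory's training config.
import yaml

CFG = "/workspace/task/repositories/LLaMA-Factory/training_config.yaml"

with open(CFG, encoding="utf-8") as f:
    cfg = yaml.safe_load(f)

cfg["num_train_epochs"] = 3  # illustrative value; tune against the 48-hour budget
with open(CFG, "w", encoding="utf-8") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)
print("num_train_epochs ->", cfg["num_train_epochs"])
```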

Suggestions

  1. Use a strong model to run inference, check its outputs, and select the correct ones to create the answers.
  2. Leverage LLaMA-Factory for effective supervised fine-tuning with techniques like LoRA or full parameter optimization.
  3. Apply advanced prompting techniques including chain-of-thought reasoning and domain-specific prompt design.
  4. Utilize available evaluation frameworks for thorough model performance analysis.
  5. You can change the dataset registration in /workspace/task/repositories/LLaMA-Factory/data/dataset_info.json; we suggest you read /workspace/task/repositories/LLaMA-Factory/data/README.md first (a registration sketch follows this list).
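
For suggestion 5, registration typically means adding an entry that points at your JSONL file. The sketch below assumes an alpaca-style dataset and the column mapping described in data/README.md; the dataset key "math_sft" is an arbitrary illustrative name, and you should verify the exact schema in the README before relying on it.

```python
# Sketch: register training_datas.jsonl in LLaMA-Factory's dataset_info.json.
# Assumption: alpaca-style columns ("instruction"/"input"/"output"); see data/README.md.
import json

INFO = "/workspace/task/repositories/LLaMA-Factory/data/dataset_info.json"

with open(INFO, encoding="utf-8") as f:
    info = json.load(f)

info["math_sft"] = {
    "file_name": "training_datas.jsonl",
    "columns": {"prompt": "instruction", "query": "input", "response": "output"},
}

with open(INFO, "w", encoding="utf-8") as f:
    json.dump(info, f, ensure_ascii=False, indent=2)
print("registered dataset: math_sft")
```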