Scientific Reasoning Enhancement

Data Augmentation Text Science
Created by: Zhen Huang
2025-09-03

Motivation

Multidisciplinary scientific reasoning represents a fundamental pillar of artificial intelligence advancement, requiring comprehensive analytical thinking, cross-domain knowledge synthesis, and systematic problem-solving across diverse academic fields. Text-based scientific challenges evaluate models’ capacity to understand complex theoretical principles spanning physics, chemistry, biology, and other scientific domains, apply domain-specific methodologies, and generate accurate solutions through rigorous scientific reasoning without visual aids. Contemporary language models often struggle with scientific problems that demand deep understanding of interdisciplinary concepts, multi-layered analytical processes, and robust scientific reasoning frameworks.

This challenge addresses the critical need to enhance language models’ multidisciplinary scientific reasoning capabilities through advanced training approaches. The ability to solve complex scientific problems across various domains has significant applications in scientific education, research assistance platforms, automated scientific tutoring, and comprehensive artificial intelligence development, making this a vital benchmark for evaluating cross-disciplinary reasoning competencies in scientific contexts.

Task

This challenge focuses on advancing the multidisciplinary scientific reasoning capabilities of language models through sophisticated fine-tuning and data enrichment techniques. Your goal is to make the final test score as high as possible.

You need to:

  1. Dataset Enhancement and Expansion: Use the provided training set and models to generate supplementary training examples through diverse enrichment methodologies, including knowledge transfer, strategic data sampling, chain-of-thought (CoT) generation, and structural reorganization. This may take several hours, so you MUST be patient. (A hedged sketch of one such strategy appears at the end of this section.)

Save your dataset to /workspace/task/repositories/LLaMA-Factory/data/training_datas.jsonl; the expected data format is described in /workspace/task/repositories/LLaMA-Factory/data/README.md. LLaMA-Factory will automatically separate the dev/train split from this file.

  2. Model Optimization: Improve the designated model (Qwen2.5-7B-Instruct) using the enhanced dataset to boost performance on multidisciplinary scientific reasoning challenges. This may involve various training methodologies, including supervised fine-tuning (SFT). The process may take several hours, so you MUST be patient.

  3. Performance Assessment: Evaluate the optimized model trained from Qwen2.5-7B-Instruct on the test dataset to quantify the improvement, using the provided inference script and the eval action.

All work should be conducted under the /workspace directory.

The output file should be stored in /workspace/data/outputs/answer.jsonl.

You can submit your answer in the file above up to 3 times (each with different training data and its corresponding inference result). You should try your best to achieve the highest score.
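
As a hedged sketch of the first step, the snippet below distills chain-of-thought traces from one of the auxiliary models and keeps only traces whose final answer agrees with the gold label (simple rejection sampling), writing them in an alpaca-style format. The prompt wording, sampling settings, output schema ("instruction"/"input"/"output"), and the crude filter are illustrative assumptions; treat /workspace/task/repositories/LLaMA-Factory/data/README.md as the authoritative reference for the format.

# Hedged sketch: distill CoT traces from an auxiliary model for training-data enrichment.
import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

TEACHER = "/workspace/data/checkpoints/DeepSeek-R1-Distill-Qwen-7B"
TRAIN = "/workspace/data/datasets/train.jsonl"
OUT = "/workspace/task/repositories/LLaMA-Factory/data/training_datas.jsonl"

tokenizer = AutoTokenizer.from_pretrained(TEACHER)
model = AutoModelForCausalLM.from_pretrained(TEACHER, torch_dtype=torch.bfloat16, device_map="auto")

def generate_cot(question: str) -> str:
    # Ask the teacher model to reason step by step before committing to a final answer.
    messages = [{"role": "user", "content": f"{question}\n\nThink step by step, then state the final answer."}]
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=2048, do_sample=True, temperature=0.6)
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

def matches_gold(trace: str, gold: str) -> bool:
    # Crude rejection filter: accept the trace only if the gold answer appears near its end.
    return gold.strip().upper() in trace[-200:].upper()

with open(TRAIN, encoding="utf-8") as f_in, open(OUT, "w", encoding="utf-8") as f_out:
    for line in f_in:
        item = json.loads(line)
        trace = generate_cot(item["question"])
        if matches_gold(trace, item["answer"]):
            record = {"instruction": item["question"], "input": "", "output": trace}
            f_out.write(json.dumps(record, ensure_ascii=False) + "\n")

Generating long traces for all 1000 questions sequentially is slow; in practice you would batch requests or shard them across the available GPUs.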

Data

The dataset comprises three primary components:

Training Data

  • Location: /workspace/data/datasets/train.jsonl
  • Content: 1000 multidisciplinary scientific reasoning questions covering physics, chemistry, biology, and related fields
  • Fields: ["question_id", "question", "answer"]
  • Format: Each entry contains a comprehensive scientific challenge with its corresponding solution

Validation Data

  • Location: /workspace/data/datasets/val.jsonl
  • Content: 100 validation questions with identical format to training data
  • Fields: ["question_id", "question", "answer"]
  • Purpose: Self-evaluation and hyperparameter optimization

Test Set

  • Location: /workspace/data/datasets/test.jsonl
  • Content: 300 test questions (no “answer” field provided)
  • Fields: ["question_id", "question"]
  • Purpose: Final evaluation of model performance
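
A quick sanity check of the three splits before building on them (a minimal sketch using the paths and fields listed above):

# Minimal sketch: load the provided jsonl splits and confirm the expected fields.
import json

def load_jsonl(path):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

train = load_jsonl("/workspace/data/datasets/train.jsonl")
val = load_jsonl("/workspace/data/datasets/val.jsonl")
test = load_jsonl("/workspace/data/datasets/test.jsonl")

print(len(train), len(val), len(test))   # expected: 1000 100 300
print(sorted(train[0].keys()))           # ['answer', 'question', 'question_id']
print(sorted(test[0].keys()))            # ['question', 'question_id'] (no answer field)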

Available Models

  • Primary Model: Qwen2.5-7B (/workspace/data/checkpoints/Qwen2.5-7B) - Must be used for final solution

  • Auxiliary Models (for data enrichment/distillation):

    • DeepSeek-R1-Distill-Qwen-7B (/workspace/data/checkpoints/DeepSeek-R1-Distill-Qwen-7B)
    • DeepSeek-R1-Distill-Qwen-32B (/workspace/data/checkpoints/DeepSeek-R1-Distill-Qwen-32B)
    • QwQ-32B (/workspace/data/checkpoints/QwQ-32B)

Constraints

  • Time Constraints: Complete pipeline including data preparation, training, validation, and test evaluation must be finished within 48 hours
  • Computational Resources: Limited to 8 x 80GB GPUs for the entire workflow
  • Model Usage: Final implementation must utilize exclusively the main model; supporting models are restricted to dataset enhancement and knowledge transfer purposes only

Evaluation

Result Format

  • Output path: /workspace/data/outputs/answer.jsonl

  • Required format: Each entry must include exactly two fields: {"question_id": int, "answer": str}

  • Answer specification: Provide only the final answer choice A/B/C/D/…

  • Include an answer for every test question in the file (a minimal sketch of producing this file follows this list).

  • Val Set: baseline = 28%

  • Test Set: Human minimum (Baseline) = 35%, Human maximum (Reference) = 75%

  • The final score is calculated based on the accuracy of the test set.
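
A minimal sketch of producing answer.jsonl in the required two-field format is shown below. The letter-extraction regex and the placeholder predictions dict are assumptions for illustration; adapt them to whatever your inference script actually emits.

# Minimal sketch: extract the final choice letter and write answer.jsonl.
import json
import re

def extract_choice(text: str) -> str:
    # Take the last standalone letter A-H in the response; fall back to "A" if none is found.
    matches = re.findall(r"\b([A-H])\b", text.upper())
    return matches[-1] if matches else "A"

# Placeholder: question_id -> raw model output. Producing this mapping is up to your
# inference pipeline (e.g. the provided inference script).
predictions = {0: "The correct option is (B) because ...", 1: "... so the answer is C."}

with open("/workspace/data/outputs/answer.jsonl", "w", encoding="utf-8") as f:
    for qid, raw in predictions.items():
        f.write(json.dumps({"question_id": int(qid), "answer": extract_choice(raw)}) + "\n")

The same extraction can be reused on val.jsonl to measure accuracy against the 28% baseline before committing a test submission.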

Environment

File Structure

workspace/
├── data/
│   ├── checkpoints/
│   │   └── ...                                # all model checkpoints used in this assignment; do not modify this directory
│   ├── datasets/
│   │   ├── train.jsonl                        # do not modify this file
│   │   ├── val.jsonl                          # do not modify this file
│   │   └── test.jsonl                         # do not modify this file
│   └── outputs/
└── task/
    ├── repositories/
    │   └── LLaMA-Factory/
    ├── scripts/                               # you can add scripts here
    │   ├── training.sh
    │   ├── inference.sh                       # example script for running inference
    │   ├── judge.sh                           # example script for running evaluation
    │   └── ...
    └── task_description.md

Execution Environment

A pre-configured Conda environment, /workspace/conda, has been provided and activated for this task. This environment includes the necessary packages for supervised fine-tuning using LLaMA-Factory.

Scripts

Available Resources

  • LLaMA-Factory: Located at /workspace/task/repositories/LLaMA-Factory for supervised optimization
  • Custom Scripts: Develop and modify scripts in /workspace/task/scripts/ directory
  • Reference Scripts: Existing scripts in the scripts directory can be referenced and adapted as needed, including inference.sh and judge.sh for evaluation demonstrations
  • Training Scripts: Reference existing scripts, including /workspace/task/scripts/training.sh, for model training. The training data format is documented in /workspace/task/repositories/LLaMA-Factory/data/README.md; save your training set properly before starting training.
  • Downloading: If you need additional datasets, you can download them from hf-mirror or ModelScope. Example: /workspace/task/scripts/hfd.sh dataset_name --dataset --tool aria2c -x 16 (you may need to add other parameters).
  • Training Duration: Change num_train_epochs in /workspace/task/repositories/LLaMA-Factory/training_config.yaml to adjust the training time (a small sketch for editing this value follows this list).
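
For the last point, a small sketch of adjusting the epoch count programmatically, assuming PyYAML is available in the provided environment and that num_train_epochs is a top-level key in the config; editing the file by hand works just as well.

# Hedged sketch: lower num_train_epochs to keep the run inside the 48-hour budget.
import yaml

CONFIG = "/workspace/task/repositories/LLaMA-Factory/training_config.yaml"

with open(CONFIG) as f:
    cfg = yaml.safe_load(f)

cfg["num_train_epochs"] = 2  # illustrative value; tune against your time budget
with open(CONFIG, "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)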

Suggestions

  1. Use a strong auxiliary model for inference, check its outputs, and keep only the correct ones when constructing training answers.
  2. Leverage LLaMA-Factory for effective supervised fine-tuning with techniques like LoRA or full parameter optimization.
  3. Apply advanced prompting techniques including chain-of-thought reasoning and domain-specific prompt design.
  4. Utilize available evaluation frameworks for thorough model performance analysis.
  5. You can change the dataset registration in /workspace/task/repositories/LLaMA-Factory/data/dataset_info.json; we suggest reading /workspace/task/repositories/LLaMA-Factory/data/README.md first. A sketch of registering the enriched dataset follows.
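
For suggestion 5, a sketch of registering the enriched dataset with LLaMA-Factory, assuming the alpaca-style column mapping described in data/README.md; the key name "training_datas" is arbitrary and must match the dataset name referenced in your training config.

# Hedged sketch: add an entry for training_datas.jsonl to dataset_info.json.
import json

INFO = "/workspace/task/repositories/LLaMA-Factory/data/dataset_info.json"

with open(INFO, encoding="utf-8") as f:
    info = json.load(f)

info["training_datas"] = {
    "file_name": "training_datas.jsonl",
    "columns": {"prompt": "instruction", "query": "input", "response": "output"},
}

with open(INFO, "w", encoding="utf-8") as f:
    json.dump(info, f, indent=2, ensure_ascii=False)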