Motivation
Current multimodal large language models struggle with complex scientific visual reasoning tasks, particularly when connecting abstract visual elements in scientific journal covers to their corresponding textual descriptions. While existing datasets provide basic image-text pairs, they lack the sophisticated training examples needed to teach models the nuanced relationships between scientific visual metaphors, domain-specific concepts, and technical language.
Scientific journal covers often contain highly abstract and symbolic visual elements that have deep connections to the article content, but these connections often require specialized knowledge and contextual understanding to interpret. For example, covers of Nature or Science journals might use artistic representations of molecular structures, cell images, or physical phenomena to symbolically convey research findings, rather than directly showing experimental results.
The goal of this task is to systematically augment the limited original dataset by generating high-quality, diverse training examples that can improve model performance on scientific visual understanding through strategic data augmentation and curriculum learning.
Task Description
The focus of this assignment is to enhance the scientific visual understanding capabilities of multimodal large models through data augmentation, model fine-tuning, and inference-time optimization. The specific requirements are as follows:
1. Data Augmentation
Utilize the provided training data to create more training samples through various data synthesis techniques, including but not limited to:
- Chain-of-Thought Generation: Generate detailed reasoning processes for each training sample, explaining how to derive specific conclusions from images
- Prompt Engineering Techniques: Design diverse prompt templates to enhance the model’s understanding of different expressions
- Data Recombination and Transformation: Create new training examples by recombining existing data
Data augmentation should take the characteristics of the scientific domain into account, ensuring that the generated samples are scientifically accurate and educationally valuable.
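As a concrete starting point, here is a minimal sketch of template-based augmentation for the image2text training set. The template wordings, the two-variants-per-sample choice, and the output file name are illustrative assumptions; chain-of-thought rationales would additionally require querying a sufficiently capable model.

import json
import random

# Illustrative prompt-template augmentation for the image2text training set.
# The template wordings and the output file name are assumptions, not part of the task spec.
TEMPLATES = [
    "Examine the journal cover carefully, then choose the option that best matches it.",
    "Think about the scientific metaphor shown in the image before selecting an option.",
    "Relate the visual elements of the cover to each candidate description, then answer.",
]

with open("/workspace/data/datasets/image2text_train.json", "r", encoding="utf-8") as f:
    records = json.load(f)

augmented = []
for rec in records:
    for template in random.sample(TEMPLATES, k=2):  # two variants per original sample
        new_rec = dict(rec)
        # Prepend an alternative instruction while keeping the original question and options.
        new_rec["input"] = f"{template}\n{rec['input']}"
        augmented.append(new_rec)

with open("/workspace/data/datasets/image2text_train_augmented.json", "w", encoding="utf-8") as f:
    json.dump(records + augmented, f, ensure_ascii=False, indent=2)

print(f"Wrote {len(records) + len(augmented)} samples")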
2. Model Fine-tuning
Fine-tune the specified Qwen2.5-VL-7B-Instruct model using the enhanced dataset to improve its performance on scientific visual understanding tasks. The fine-tuning process should include:
- Parameter Optimization: Adjust learning rate, batch size, training epochs, and other hyperparameters
- Regularization Strategies: Implement appropriate regularization methods to prevent overfitting
- Evaluation and Validation: Regularly evaluate model performance and save the best checkpoints
- Gradient Accumulation: Use gradient accumulation to increase effective batch size when memory is limited
The fine-tuning process should record key metrics, including training loss, validation accuracy, etc., to analyze model improvements.
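As a reference for the gradient accumulation point above, the effective batch size is the product of the per-device batch size, the number of accumulation steps, and the number of GPUs; the values below are illustrative, not prescribed.

# Effective batch size when combining gradient accumulation with multi-GPU training.
per_device_train_batch_size = 2      # illustrative value
gradient_accumulation_steps = 8      # illustrative value
num_gpus = 8                         # matches the stated 8x 80G GPU allocation

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch_size)  # 128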
3. Evaluation and Output
Generate predictions for the image2text and text2image test sets using the LLaMA-Factory inference command described in the Inference Command section below.
Transform the raw predictions into final answers and save them to the following files:
- /workspace/data/outputs/image2text_answer.json
- /workspace/data/outputs/text2image_answer.json
The output format must strictly follow the JSON format below:
[
{
"id": "xxxx", // Unique identifier for the test sample
"output": "A/B/C/D" // Predicted option, must be a single letter
},
...
]
Evaluation will be based on the match between model predictions and standard answers, with accuracy as the primary evaluation metric.
You may submit answers to the files above up to 3 times (each time with different training data and its corresponding inference results). Try your best to achieve the highest score.
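A minimal conversion sketch, assuming the inference script writes a JSONL file with one object per line containing a "predict" field in the same order as the test set (check scripts/vllm_infer.py for the exact schema); the prediction path is whatever you passed to --save_name:

import json
import re

# Convert raw inference output into the required answer format.
# Assumptions: one JSON object per line with a "predict" field, aligned with the
# test file by order; adjust the field name if the actual schema differs.
TEST_FILE = "/workspace/data/datasets/image2text_test.json"
PRED_FILE = "/workspace/data/outputs/image2text_predictions.jsonl"  # value passed to --save_name (assumed)
ANSWER_FILE = "/workspace/data/outputs/image2text_answer.json"

with open(TEST_FILE, "r", encoding="utf-8") as f:
    test_records = json.load(f)

with open(PRED_FILE, "r", encoding="utf-8") as f:
    predictions = [json.loads(line) for line in f if line.strip()]

answers = []
for record, pred in zip(test_records, predictions):
    # Keep only the first A-D letter found in the generated text; fall back to "A".
    match = re.search(r"[ABCD]", pred.get("predict", ""))
    answers.append({"id": record["id"], "output": match.group(0) if match else "A"})

with open(ANSWER_FILE, "w", encoding="utf-8") as f:
    json.dump(answers, f, ensure_ascii=False, indent=2)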
Dataset Details
Dataset Overview
You will use a scientific understanding dataset based on the MAC paper. This dataset consists of scientific journal cover image-story pairs from Nature, Science, Cell, and ACS, and includes two tasks: image2text and text2image.
- Dataset Scale:
  - Training set: 1877 data points for each task
  - Dev set: 10 data points for each task
  - Test set: 400 data points for each task
- Image Location: /workspace/data/image/MAC_Bench
Data Format Details
image2text Task Format
In the image2text task, the model needs to select the most matching text description based on a given scientific image. The data format is as follows:
{
"system": "# Requirement\n You are an excellent scientific image reader. You need to analyze the provided image and choose the most appropriate option based on your understanding. ONLY based on the image and the options provided above, predict the option that you would choose.",
"input": "The input", // Contains question description and multiple options
"output": "A/B/C/D", // Contains the correct answer in the training set, empty in the test set
"images": [
"image_path" // Path to the scientific image
]
}
text2image Task Format
In the text2image task, the model needs to select the most matching image from multiple images based on a text description. The data format is as follows:
{
"system": "# Requirement\n You are an excellent scientific image reader. You need to analyze the provided image and choose the most appropriate option based on your understanding. ONLY based on the image and the options provided above, predict the option that you would choose.",
"input": "The input", // Contains question description
"output": "A/B/C/D", // Contains the correct answer in the training set, empty in the test set
"images": [
"image_path_A", // Path to the image for option A
"image_path_B", // Path to the image for option B
"image_path_C", // Path to the image for option C
"image_path_D" // Path to the image for option D
]
}
Datasets
- Training Set: Located at /workspace/data/datasets/image2text_train.json and /workspace/data/datasets/text2image_train.json
- Development Set: Located at /workspace/data/datasets/image2text_dev.json and /workspace/data/datasets/text2image_dev.json
- Test Set: After training, you should evaluate the model on the test sets /workspace/data/datasets/image2text_test.json and /workspace/data/datasets/text2image_test.json
  - Test data includes a unique "id" field to identify each question
  - The "output" field in test data is empty (None), to be predicted by the model

All data files are located in the /workspace/data/datasets/ directory.
Available Models
- Qwen2.5-VL-7B-Instruct: Located at /workspace/data/checkpoints/Qwen2.5-VL-7B-Instruct
  - A vision-language instruction-tuned model with 7 billion parameters
  - Supports multi-turn dialogue and complex visual understanding tasks
  - Can process high-resolution images
LLaMA-Factory Usage Guide
Image Token Explanation
In LLaMA-Factory, each <image> token in the input text marks where an image appears; the nth <image> token corresponds to the nth path in the "images" list. For example:
- If there is one <image> token in the input, it is replaced with the image pointed to by images[0]
- If there are multiple <image> tokens, they correspond to the images in the images list in order
- For details, read /workspace/task/repositories/LLaMA-Factory/data/README.md
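As an illustration of this correspondence, a hypothetical text2image record might look like the following (the wording and paths are placeholders, not actual dataset content):

# Hypothetical text2image record: the four <image> tokens in "input" map, in order,
# to the four paths in "images" (placeholders only, not real dataset content).
sample = {
    "system": "# Requirement\n You are an excellent scientific image reader. ...",
    "input": "Story: ...\nOption A: <image>\nOption B: <image>\nOption C: <image>\nOption D: <image>\nWhich image best matches the story?",
    "output": "B",
    "images": [
        "/workspace/data/image/MAC_Bench/example_A.png",
        "/workspace/data/image/MAC_Bench/example_B.png",
        "/workspace/data/image/MAC_Bench/example_C.png",
        "/workspace/data/image/MAC_Bench/example_D.png",
    ],
}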
Inference Command
To generate responses for image2text or text2image tasks, you can use the following LLaMA-Factory command:
cd /workspace/task/repositories/LLaMA-Factory
python scripts/vllm_infer.py \
--model_name_or_path <model_path> \
--dataset <dataset_name> \
--save_name <save_path>
(Note: You may need to add additional parameters.)
Training Command
To train the model, you can use the following command:
cd /workspace/task/repositories/LLaMA-Factory
llamafactory-cli train training_config.yaml
You can modify hyperparameters in the training_config.yaml file, including learning rate, training epochs, batch size, etc.
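A possible starting configuration can be generated programmatically, as in the sketch below. The key names follow common LLaMA-Factory SFT examples, but the template name and every hyperparameter value are assumptions to verify against the example configs shipped with the repository (PyYAML is assumed to be available).

import yaml  # assumes PyYAML is installed in the environment

# Illustrative LoRA fine-tuning configuration; key names follow common LLaMA-Factory
# SFT examples, and every value below is an assumption to tune, not a prescription.
config = {
    "model_name_or_path": "/workspace/data/checkpoints/Qwen2.5-VL-7B-Instruct",
    "stage": "sft",
    "do_train": True,
    "finetuning_type": "lora",
    "dataset": "image2text_train,text2image_train",
    "template": "qwen2_vl",          # verify the exact template name for Qwen2.5-VL
    "cutoff_len": 2048,
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 8,
    "learning_rate": 1.0e-4,
    "num_train_epochs": 2.0,
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.1,
    "bf16": True,
    "logging_steps": 10,
    "save_steps": 200,
    "output_dir": "/workspace/data/outputs/qwen25vl_lora",  # illustrative path
}

with open("/workspace/task/repositories/LLaMA-Factory/training_config.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)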
Dataset Registration
Datasets need to be registered in /workspace/task/repositories/LLaMA-Factory/data/dataset_info.json. Make sure the paths and column mappings are correctly set:
{
"image2text_train":{
"file_name": "/workspace/data/datasets/image2text_train.json",
"columns":{
"prompt": "system",
"query": "input",
"response": "output",
"images": "images"
}
},
"text2image_train":{
"file_name": "/workspace/data/datasets/text2image_train.json",
"columns":{
"prompt": "system",
"query": "input",
"response": "output",
"images": "images"
}
}
}
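If you create augmented training files, register them the same way. The sketch below appends a hypothetical entry for the image2text_train_augmented.json file assumed in the augmentation example above:

import json

INFO_PATH = "/workspace/task/repositories/LLaMA-Factory/data/dataset_info.json"

with open(INFO_PATH, "r", encoding="utf-8") as f:
    dataset_info = json.load(f)

# Register a hypothetical augmented file using the same column mapping as above.
dataset_info["image2text_train_augmented"] = {
    "file_name": "/workspace/data/datasets/image2text_train_augmented.json",
    "columns": {
        "prompt": "system",
        "query": "input",
        "response": "output",
        "images": "images",
    },
}

with open(INFO_PATH, "w", encoding="utf-8") as f:
    json.dump(dataset_info, f, ensure_ascii=False, indent=2)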
Performance Constraints
Time and Resource Limits
- Development Time Limit: The maximum time for system enhancement and implementation is 24 hours
- Evaluation Time Limit: The solution time for each question should not exceed 3 minutes
- GPU Requirement: 8x 80G GPUs
- Memory Management: Please plan memory usage reasonably to avoid OOM errors
Innovation Space
- Open Innovation: You are encouraged to explore any method that can improve scientific reasoning performance
- Freedom: You have complete freedom to enhance the inference system as needed. The codebase provides inference capabilities and an extensible optimization infrastructure for you to utilize creatively
- Goal: Achieve the highest success rate through innovative optimization methods
Evaluation Criteria
Primary Metrics
- Accuracy: Main metric, percentage of correct predictions
- Accuracy will be calculated separately for image2text and text2image tasks
- Overall accuracy is the weighted average of the two task accuracies
Evaluation Process
- The system will load the prediction results you generated at the specified paths
- Compare the prediction results with the reference answers
- Calculate accuracy and generate evaluation reports
- Determine the final score based on accuracy
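Because the dev sets contain reference answers, you can estimate accuracy locally before writing test predictions. The sketch below assumes the dev file follows the training format (filled "output" field) and that your prediction JSONL, written to a path of your choosing, is aligned with it by order:

import json
import re

# Local sanity check on the dev split before touching the test set.
DEV_FILE = "/workspace/data/datasets/image2text_dev.json"
PRED_FILE = "/workspace/data/outputs/image2text_dev_predictions.jsonl"  # assumed path

with open(DEV_FILE, "r", encoding="utf-8") as f:
    dev_records = json.load(f)
with open(PRED_FILE, "r", encoding="utf-8") as f:
    predictions = [json.loads(line) for line in f if line.strip()]

correct = 0
for record, pred in zip(dev_records, predictions):
    match = re.search(r"[ABCD]", pred.get("predict", ""))
    predicted = match.group(0) if match else ""
    correct += int(predicted == record["output"].strip())

print(f"Dev accuracy: {correct / len(dev_records):.2%}")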
Environment Configuration
The environment has been pre-configured for you in /workspace/conda. You can start working directly without additional setup. This environment includes vllm, datatrove, and other commonly used libraries. The workspace is organized as follows:
workspace/
├── data/
│ ├── checkpoints/
│ │ └── Qwen2.5-VL-7B-Instruct/ # Pre-trained model
│ ├── datasets/
│ │ ├── image2text_train.json # Image to text training data
│ │ ├── image2text_test.json # Image to text test data
│ │ ├── text2image_train.json # Text to image training data
│ │ └── text2image_test.json # Text to image test data
│ ├── image/
│ │ └── MAC_Bench/ # Image file directory
│ └── outputs/ # Output directory
│ ├── image2text_answer.json # Image to text task answers (to be generated)
│ └── text2image_answer.json # Text to image task answers (to be generated)
└── task/
├── repositories/
│ └── LLaMA-Factory/ # LLaMA-Factory codebase
│ └── training_config.yaml # Training configuration file
└── task_description.md # Task description file
Implementation Suggestions
- Data Analysis: First analyze the training data to understand the characteristics and challenges of scientific visual understanding tasks
- Baseline Evaluation: Conduct baseline evaluation using the original model to determine areas for improvement
- Data Augmentation Strategies:
- Generate detailed chain-of-thought explanations
- Create samples with increasing difficulty
- Use prompt engineering techniques to enrich training data
- Model Fine-tuning Optimization:
- Try different learning rates and training epochs
- Experiment with parameter-efficient fine-tuning methods like LoRA
- Use gradient accumulation to handle large batch data
- Inference Optimization:
- Design prompt templates specifically for scientific visual understanding
- Implement multi-step reasoning processes
- Integrate uncertainty estimation and answer verification mechanisms
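For the last point, one inexpensive verification mechanism is self-consistency: run inference several times with non-zero temperature and keep the majority letter per question. The sketch below shows only the aggregation step; the run file paths are placeholders, and each file is assumed to be a JSONL aligned with the test set order with a "predict" field per line:

import json
import re
from collections import Counter

# Majority vote over several prediction runs for the same test file.
RUN_FILES = [
    "/workspace/data/outputs/image2text_run1.jsonl",
    "/workspace/data/outputs/image2text_run2.jsonl",
    "/workspace/data/outputs/image2text_run3.jsonl",
]
TEST_FILE = "/workspace/data/datasets/image2text_test.json"

def extract_letter(text):
    match = re.search(r"[ABCD]", text)
    return match.group(0) if match else "A"

runs = []
for path in RUN_FILES:
    with open(path, "r", encoding="utf-8") as f:
        runs.append([extract_letter(json.loads(line)["predict"]) for line in f if line.strip()])

with open(TEST_FILE, "r", encoding="utf-8") as f:
    test_records = json.load(f)

answers = []
for i, record in enumerate(test_records):
    votes = Counter(run[i] for run in runs)
    answers.append({"id": record["id"], "output": votes.most_common(1)[0][0]})

with open("/workspace/data/outputs/image2text_answer.json", "w", encoding="utf-8") as f:
    json.dump(answers, f, ensure_ascii=False, indent=2)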
Submission Requirements
- Ensure final answers are saved at the specified paths: /workspace/data/outputs/image2text_answer.json and /workspace/data/outputs/text2image_answer.json
- Strictly follow the specified output format
- Complete all work before the deadline
Good luck with your scientific visual understanding enhancement task!