Motivation
Current multimodal large language models struggle with complex scientific visual reasoning tasks, particularly when connecting abstract visual elements in scientific journal covers to their corresponding textual descriptions. While existing datasets provide basic image-text pairs, they lack the sophisticated training examples needed to teach models the nuanced relationships between scientific visual metaphors, domain-specific concepts, and technical language.
Scientific journal covers often contain highly abstract and symbolic visual elements that have deep connections to the article content, but these connections often require specialized knowledge and contextual understanding to interpret. For example, covers of Nature or Science journals might use artistic representations of molecular structures, cell images, or physical phenomena to symbolically convey research findings, rather than directly showing experimental results.
The goal of this task is to systematically augment the limited original dataset by generating high-quality, diverse training examples that can improve model performance on scientific visual understanding through strategic data augmentation and curriculum learning.
Task Description
The focus of this assignment is to enhance the scientific visual understanding capabilities of multimodal large models through data augmentation, model fine-tuning, and inference-time optimization. The specific requirements are as follows:
1. Data Augmentation
Utilize the provided training data to create more training samples through various data synthesis techniques, including but not limited to:
- Chain-of-Thought Generation: Generate detailed reasoning processes for each training sample, explaining how to derive specific conclusions from images
- Prompt Engineering Techniques: Design diverse prompt templates to enhance the model’s understanding of different expressions
- Data Recombination and Transformation: Create new training examples by recombining existing data
Data augmentation should take the characteristics of the scientific domain into account, ensuring that the generated samples are scientifically accurate and educationally valuable.
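As a concrete starting point, here is a minimal sketch of template-based augmentation for the image2text training set. The template wordings, the two-variants-per-sample choice, and the output file name are illustrative assumptions; chain-of-thought rationales would additionally require querying a sufficiently capable model.

import json
import random

# Illustrative prompt-template augmentation for the image2text training set.
# The template wordings and the output file name are assumptions, not part of the task spec.
TEMPLATES = [
    "Examine the journal cover carefully, then choose the option that best matches it.",
    "Think about the scientific metaphor shown in the image before selecting an option.",
    "Relate the visual elements of the cover to each candidate description, then answer.",
]

with open("/workspace/data/datasets/image2text_train.json", "r", encoding="utf-8") as f:
    records = json.load(f)

augmented = []
for rec in records:
    for template in random.sample(TEMPLATES, k=2):  # two variants per original sample
        new_rec = dict(rec)
        # Prepend an alternative instruction while keeping the original question and options.
        new_rec["input"] = f"{template}\n{rec['input']}"
        augmented.append(new_rec)

with open("/workspace/data/datasets/image2text_train_augmented.json", "w", encoding="utf-8") as f:
    json.dump(records + augmented, f, ensure_ascii=False, indent=2)

print(f"Wrote {len(records) + len(augmented)} samples")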
2. Model Fine-tuning
Fine-tune the specified Qwen2.5-VL-7B-Instruct model using the enhanced dataset to improve its performance on scientific visual understanding tasks. The fine-tuning process should include:
- Parameter Optimization: Adjust learning rate, batch size, training epochs, and other hyperparameters
- Regularization Strategies: Implement appropriate regularization methods to prevent overfitting
- Evaluation and Validation: Regularly evaluate model performance and save the best checkpoints
- Gradient Accumulation: Use gradient accumulation to increase effective batch size when memory is limited
The fine-tuning process should record key metrics, including training loss, validation accuracy, etc., to analyze model improvements.
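As a reference for the gradient accumulation point above, the effective batch size is the product of the per-device batch size, the number of accumulation steps, and the number of GPUs; the values below are illustrative, not prescribed.

# Effective batch size when combining gradient accumulation with multi-GPU training.
per_device_train_batch_size = 2      # illustrative value
gradient_accumulation_steps = 8      # illustrative value
num_gpus = 8                         # matches the stated 8x 80G GPU allocation

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch_size)  # 128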
3. Evaluation and Output
Generate predictions for the image2text and text2image test sets using the LLaMA-Factory inference command described in the Inference Command section below.
Transform the raw predictions into final answers and save them to the following files:
- /workspace/data/outputs/image2text_answer.json
- /workspace/data/outputs/text2image_answer.json
The output format must strictly follow the JSON format below:
[
{
"id": "xxxx", // Unique identifier for the test sample
"output": "A/B/C/D" // Predicted option, must be a single letter
},
...
]
Evaluation will be based on the match between model predictions and standard answers, with accuracy as the primary evaluation metric.
You may submit answers to the files above up to 3 times (each time with different training data and its corresponding inference results). Try your best to achieve the highest score.
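A minimal conversion sketch, assuming the inference script writes a JSONL file with one object per line containing a "predict" field in the same order as the test set (check scripts/vllm_infer.py for the exact schema); the prediction path is whatever you passed to --save_name:

import json
import re

# Convert raw inference output into the required answer format.
# Assumptions: one JSON object per line with a "predict" field, aligned with the
# test file by order; adjust the field name if the actual schema differs.
TEST_FILE = "/workspace/data/datasets/image2text_test.json"
PRED_FILE = "/workspace/data/outputs/image2text_predictions.jsonl"  # value passed to --save_name (assumed)
ANSWER_FILE = "/workspace/data/outputs/image2text_answer.json"

with open(TEST_FILE, "r", encoding="utf-8") as f:
    test_records = json.load(f)

with open(PRED_FILE, "r", encoding="utf-8") as f:
    predictions = [json.loads(line) for line in f if line.strip()]

answers = []
for record, pred in zip(test_records, predictions):
    # Keep only the first A-D letter found in the generated text; fall back to "A".
    match = re.search(r"[ABCD]", pred.get("predict", ""))
    answers.append({"id": record["id"], "output": match.group(0) if match else "A"})

with open(ANSWER_FILE, "w", encoding="utf-8") as f:
    json.dump(answers, f, ensure_ascii=False, indent=2)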
Dataset Details
Dataset Overview
You will use a scientific understanding dataset based on the MAC paper. This dataset consists of scientific journal cover image-story pairs from Nature, Science, Cell, and ACS, and includes two tasks: image2text and text2image.
- Dataset Scale:
  - Training set: 1877 data points for each task
  - Dev set: 10 data points for each task
  - Test set: 400 data points for each task
- Image Location: /workspace/data/image/MAC_Bench
Data Format Details
image2text Task Format
In the image2text task, the model needs to select the most matching text description based on a given scientific image. The data format is as follows:
{
"system": "# Requirement\n You are an excellent scientific image reader. You need to analyze the provided image and choose the most appropriate option based on your understanding. ONLY based on the image and the options provided above, predict the option that you would choose.",
"input": "The input", // Contains question description and multiple options
"output": "A/B/C/D", // Contains the correct answer in the training set, empty in the test set
"images": [
"image_path" // Path to the scientific image
]
}
text2image Task Format
In the text2image task, the model needs to select the most matching image from multiple images based on a text description. The data format is as follows:
{
"system": "# Requirement\n You are an excellent scientific image reader. You need to analyze the provided image and choose the most appropriate option based on your understanding. ONLY based on the image and the options provided above, predict the option that you would choose.",
"input": "The input", // Contains question description
"output": "A/B/C/D", // Contains the correct answer in the training set, empty in the test set
"images": [
"image_path_A", // Path to the image for option A
"image_path_B", // Path to the image for option B
"image_path_C", // Path to the image for option C
"image_path_D" // Path to the image for option D
]
}
Datasets
- Training Set: Located at /workspace/data/datasets/image2text_train.json and /workspace/data/datasets/text2image_train.json
- Development Set: Located at /workspace/data/datasets/image2text_dev.json and /workspace/data/datasets/text2image_dev.json
- Test Set: After training, you should evaluate the model on the test sets /workspace/data/datasets/image2text_test.json and /workspace/data/datasets/text2image_test.json
  - Test data includes a unique "id" field to identify each question
  - The "output" field in test data is empty (None), to be predicted by the model

All data files are located in the /workspace/data/datasets/ directory.
Available Models
- Qwen2.5-VL-7B-Instruct: Located at /workspace/data/checkpoints/Qwen2.5-VL-7B-Instruct
  - A vision-language instruction-tuned model with 7 billion parameters
  - Supports multi-turn dialogue and complex visual understanding tasks
  - Can process high-resolution images
LLaMA-Factory Usage Guide
Image Token Explanation
In LLaMA-Factory, each <image> token in the input text marks where an image appears; the nth <image> token corresponds to the nth path in the "images" list. For example:
- If there is one <image> token in the input, it is replaced with the image pointed to by images[0]
- If there are multiple <image> tokens, they correspond to the images in the images list in order
- For details, read /workspace/task/repositories/LLaMA-Factory/data/README.md
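As an illustration of this correspondence, a hypothetical text2image record might look like the following (the wording and paths are placeholders, not actual dataset content):

# Hypothetical text2image record: the four <image> tokens in "input" map, in order,
# to the four paths in "images" (placeholders only, not real dataset content).
sample = {
    "system": "# Requirement\n You are an excellent scientific image reader. ...",
    "input": "Story: ...\nOption A: <image>\nOption B: <image>\nOption C: <image>\nOption D: <image>\nWhich image best matches the story?",
    "output": "B",
    "images": [
        "/workspace/data/image/MAC_Bench/example_A.png",
        "/workspace/data/image/MAC_Bench/example_B.png",
        "/workspace/data/image/MAC_Bench/example_C.png",
        "/workspace/data/image/MAC_Bench/example_D.png",
    ],
}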
Inference Command
To generate responses for image2text or text2image tasks, you can use the following LLaMA-Factory command:
cd /workspace/task/repositories/LLaMA-Factory
python scripts/vllm_infer.py \
--model_name_or_path <model_path> \
--dataset <dataset_name> \
--save_name <save_path>
(Note: You may need to add additional parameters.)
Training Command
To train the model, you can use the following command:
cd /workspace/task/repositories/LLaMA-Factory
llamafactory-cli train training_config.yaml
You can modify hyperparameters in the training_config.yaml file, including learning rate, training epochs, batch size, etc.
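A possible starting configuration can be generated programmatically, as in the sketch below. The key names follow common LLaMA-Factory SFT examples, but the template name and every hyperparameter value are assumptions to verify against the example configs shipped with the repository (PyYAML is assumed to be available).

import yaml  # assumes PyYAML is installed in the environment

# Illustrative LoRA fine-tuning configuration; key names follow common LLaMA-Factory
# SFT examples, and every value below is an assumption to tune, not a prescription.
config = {
    "model_name_or_path": "/workspace/data/checkpoints/Qwen2.5-VL-7B-Instruct",
    "stage": "sft",
    "do_train": True,
    "finetuning_type": "lora",
    "dataset": "image2text_train,text2image_train",
    "template": "qwen2_vl",          # verify the exact template name for Qwen2.5-VL
    "cutoff_len": 2048,
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 8,
    "learning_rate": 1.0e-4,
    "num_train_epochs": 2.0,
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.1,
    "bf16": True,
    "logging_steps": 10,
    "save_steps": 200,
    "output_dir": "/workspace/data/outputs/qwen25vl_lora",  # illustrative path
}

with open("/workspace/task/repositories/LLaMA-Factory/training_config.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)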
Dataset Registration
Datasets need to be registered in /workspace/task/repositories/LLaMA-Factory/data/dataset_info.json. Make sure the paths and column mappings are correctly set:
{
"image2text_train":{
"file_name": "/workspace/data/datasets/image2text_train.json",
"columns":{
"prompt": "system",
"query": "input",
"response": "output",
"images": "images"
}
},
"text2image_train":{
"file_name": "/workspace/data/datasets/text2image_train.json",
"columns":{
"prompt": "system",
"query": "input",
"response": "output",
"images": "images"
}
}
}
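If you create augmented training files, register them the same way. The sketch below appends a hypothetical entry for the image2text_train_augmented.json file assumed in the augmentation example above:

import json

INFO_PATH = "/workspace/task/repositories/LLaMA-Factory/data/dataset_info.json"

with open(INFO_PATH, "r", encoding="utf-8") as f:
    dataset_info = json.load(f)

# Register a hypothetical augmented file using the same column mapping as above.
dataset_info["image2text_train_augmented"] = {
    "file_name": "/workspace/data/datasets/image2text_train_augmented.json",
    "columns": {
        "prompt": "system",
        "query": "input",
        "response": "output",
        "images": "images",
    },
}

with open(INFO_PATH, "w", encoding="utf-8") as f:
    json.dump(dataset_info, f, ensure_ascii=False, indent=2)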
Performance Constraints
Time and Resource Limits
- Development Time Limit: The maximum time for system enhancement and implementation is 24 hours
- Evaluation Time Limit: The solution time for each question should not exceed 3 minutes
- GPU Requirement: 8x 80G GPUs
- Memory Management: Please plan memory usage reasonably to avoid OOM errors
Innovation Space
- Open Innovation: You are encouraged to explore any method that can improve scientific reasoning performance
- Freedom: You have complete freedom to enhance the inference system as needed. The codebase provides inference capabilities and an extensible optimization infrastructure for you to utilize creatively
- Goal: Achieve the highest success rate through innovative optimization methods
Evaluation Criteria
Primary Metrics
- Accuracy: Main metric, percentage of correct predictions
- Accuracy will be calculated separately for image2text and text2image tasks
- Overall accuracy is the weighted average of the two task accuracies
Evaluation Process
- The system will load the prediction results you generated at the specified paths
- Compare the prediction results with the reference answers
- Calculate accuracy and generate evaluation reports
- Determine the final score based on accuracy
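Because the dev sets contain reference answers, you can estimate accuracy locally before writing test predictions. The sketch below assumes the dev file follows the training format (filled "output" field) and that your prediction JSONL, written to a path of your choosing, is aligned with it by order:

import json
import re

# Local sanity check on the dev split before touching the test set.
DEV_FILE = "/workspace/data/datasets/image2text_dev.json"
PRED_FILE = "/workspace/data/outputs/image2text_dev_predictions.jsonl"  # assumed path

with open(DEV_FILE, "r", encoding="utf-8") as f:
    dev_records = json.load(f)
with open(PRED_FILE, "r", encoding="utf-8") as f:
    predictions = [json.loads(line) for line in f if line.strip()]

correct = 0
for record, pred in zip(dev_records, predictions):
    match = re.search(r"[ABCD]", pred.get("predict", ""))
    predicted = match.group(0) if match else ""
    correct += int(predicted == record["output"].strip())

print(f"Dev accuracy: {correct / len(dev_records):.2%}")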
Environment Configuration
The environment has been pre-configured for you in /workspace/conda. You can start working directly without additional setup. This environment includes vllm, datatrove, and other commonly used libraries. The workspace is organized as follows:
workspace/
├── data/
│ ├── checkpoints/
│ │ └── Qwen2.5-VL-7B-Instruct/ # Pre-trained model
│ ├── datasets/
│ │ ├── image2text_train.json # Image to text training data
│ │ ├── image2text_test.json # Image to text test data
│ │ ├── text2image_train.json # Text to image training data
│ │ └── text2image_test.json # Text to image test data
│ ├── image/
│ │ └── MAC_Bench/ # Image file directory
│ └── outputs/ # Output directory
│ ├── image2text_answer.json # Image to text task answers (to be generated)
│ └── text2image_answer.json # Text to image task answers (to be generated)
└── task/
├── repositories/
│ └── LLaMA-Factory/ # LLaMA-Factory codebase
│ └── training_config.yaml # Training configuration file
└── task_description.md # Task description file
Implementation Suggestions
- Data Analysis: First analyze the training data to understand the characteristics and challenges of scientific visual understanding tasks
- Baseline Evaluation: Conduct baseline evaluation using the original model to determine areas for improvement
- Data Augmentation Strategies:
- Generate detailed chain-of-thought explanations
- Create samples with increasing difficulty
- Use prompt engineering techniques to enrich training data
- Model Fine-tuning Optimization:
- Try different learning rates and training epochs
- Experiment with parameter-efficient fine-tuning methods like LoRA
- Use gradient accumulation to handle large batch data
- Inference Optimization:
- Design prompt templates specifically for scientific visual understanding
- Implement multi-step reasoning processes
- Integrate uncertainty estimation and answer verification mechanisms
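For the last point, one inexpensive verification mechanism is self-consistency: run inference several times with non-zero temperature and keep the majority letter per question. The sketch below shows only the aggregation step; the run file paths are placeholders, and each file is assumed to be a JSONL aligned with the test set order with a "predict" field per line:

import json
import re
from collections import Counter

# Majority vote over several prediction runs for the same test file.
RUN_FILES = [
    "/workspace/data/outputs/image2text_run1.jsonl",
    "/workspace/data/outputs/image2text_run2.jsonl",
    "/workspace/data/outputs/image2text_run3.jsonl",
]
TEST_FILE = "/workspace/data/datasets/image2text_test.json"

def extract_letter(text):
    match = re.search(r"[ABCD]", text)
    return match.group(0) if match else "A"

runs = []
for path in RUN_FILES:
    with open(path, "r", encoding="utf-8") as f:
        runs.append([extract_letter(json.loads(line)["predict"]) for line in f if line.strip()])

with open(TEST_FILE, "r", encoding="utf-8") as f:
    test_records = json.load(f)

answers = []
for i, record in enumerate(test_records):
    votes = Counter(run[i] for run in runs)
    answers.append({"id": record["id"], "output": votes.most_common(1)[0][0]})

with open("/workspace/data/outputs/image2text_answer.json", "w", encoding="utf-8") as f:
    json.dump(answers, f, ensure_ascii=False, indent=2)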
Submission Requirements
- Ensure final answers are saved at the specified paths: /workspace/data/outputs/image2text_answer.json and /workspace/data/outputs/text2image_answer.json
- Strictly follow the specified output format
- Complete all work before the deadline
Good luck with your scientific visual understanding enhancement task!