Motivation
Theory of Mind (ToM) - the ability to understand and reason about others’ mental states - is fundamental to human social interaction. Current evaluations of Large Language Models (LLMs) focus on static snapshots of mental states, overlooking the crucial temporal evolution that characterizes real-world social interactions. This task addresses the need to construct a comprehensive benchmark for evaluating LLMs’ ability to track and understand the dynamic nature of human mental states across interconnected scenarios.
Task
Your objective is to generate complete training cases based on the defined data structure. This involves formulating questions that correspond to social scenarios, capturing the temporal evolution of mental states—including beliefs, emotions, intentions, and actions—across multiple connected interactions. These full scenarios, together with their associated questions, will then be used to fine-tune the provided models.
Specifically, you need to:
- Write an Inference Script for Evaluation: Create an inference script that reads the test data from /workspace/data/datasets/test/, performs inference following the Qwen input template (see /workspace/task/repositories/LLaMA-Factory/src/llamafactory/data/template.py), uses vLLM as the recommended backend for efficient inference, and saves the generated answers to /workspace/data/outputs/test.json.
- Synthesize Questions: Use the benchmark data in /workspace/data/datasets/reference/ as a reference to synthesize question-answer pairs for the stories in /workspace/data/datasets/train/. Do not generate the questions case by case; instead, implement a script that systematically synthesizes questions across the dataset.
- Prepare Training Data: Prepare the training data file /workspace/task/repositories/LLaMA-Factory/data/training_data.jsonl (refer to /workspace/task/repositories/LLaMA-Factory/data/README.md, /workspace/task/repositories/LLaMA-Factory/data/dataset_info.json, and /workspace/task/scripts/config.yaml). Then save the correctly formatted dataset in /workspace/data/datasets/training_data.jsonl. The dataset must be in JSON format with at least input and output fields, where input includes the instruction; the Qwen template will be applied automatically by LLaMA-Factory.
- Fine-tune the Model: Fine-tune the Qwen2-7B-Instruct model using LLaMA-Factory and the training samples in /workspace/task/repositories/LLaMA-Factory/data/training_data.jsonl via /workspace/task/scripts/training.sh. A hyperparameter YAML file is provided at /workspace/task/scripts/training_config.yaml; you can modify it if you want.
- Evaluate: Evaluate the fine-tuned Qwen2-7B-Instruct model on the test samples in /workspace/data/datasets/test/, using the inference script created in step 1.
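The inference step above can be sketched as follows. This is a minimal outline, not the required implementation: it assumes question_new.json maps question IDs to question text and omits the story context from the prompt for brevity; adapt the prompt construction to the actual layout described in json_structure_guide.md, and verify the chat template against LLaMA-Factory's template.py.

```python
import glob
import json
import os


def build_qwen_prompt(question: str, system: str = "You are a helpful assistant.") -> str:
    # Manually rendered Qwen2 chat template; verify against
    # LLaMA-Factory's src/llamafactory/data/template.py before relying on it.
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{question}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )


def main() -> None:
    # Import vLLM lazily so the prompt helper stays usable without a GPU.
    from vllm import LLM, SamplingParams

    llm = LLM(model="/workspace/data/checkpoints/Qwen2-7B-Instruct",
              tensor_parallel_size=8)
    params = SamplingParams(temperature=0.0, max_tokens=8)

    results = {}
    for trial_dir in sorted(glob.glob("/workspace/data/datasets/test/trial*")):
        trial = os.path.basename(trial_dir)
        with open(os.path.join(trial_dir, "question_new.json")) as f:
            questions = json.load(f)  # assumed {question_id: question_text} layout
        qids = list(questions)
        prompts = [build_qwen_prompt(str(questions[qid])) for qid in qids]
        outputs = llm.generate(prompts, params)
        # Keep only the first character as the single-letter answer.
        results[trial] = {qid: out.outputs[0].text.strip().lower()[:1]
                          for qid, out in zip(qids, outputs)}

    os.makedirs("/workspace/data/outputs", exist_ok=True)
    with open("/workspace/data/outputs/test.json", "w") as f:
        json.dump(results, f, indent=2)


if __name__ == "__main__":
    main()
```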
You should work under the /workspace directory.
You may submit your answer in the file above up to 3 times (each with a different reasoning workflow and its corresponding inference result). You should try your best to achieve the highest score.
Data
The dataset is organized into four main directories:
Reference Data
- Location: /workspace/data/datasets/reference/
- Content:
  - trialN/: N is an integer; each folder contains one complete story (main character, scenario numbers, character information, sketch, and story) and its corresponding questions.
    - story.json: complete story information (main character, scenario numbers, character information, sketch, and story)
    - question.json: full questions with answers
  - json_structure_guide.md: the guide to the training data and questions. You should carefully read this file before executing the other tasks.
- Requirements: read all the content in this folder before executing the other tasks.
Training Data
- Location: /workspace/data/datasets/train/
- Content:
  - trialN/: the folder includes only the story information (main character, scenario numbers, character information, sketch, and story)
    - story.json: the file includes the story information (main character, scenario numbers, character information, sketch, and story)
Validation Data
- Location: /workspace/data/datasets/valid/
- Content:
  - trialN/: the folder includes the story information (main character, scenario numbers, character information, and story)
    - story.json: the file includes the story information (main character, scenario numbers, character information, and story), but does not include the sketch.
    - question_new.json: the file includes the questions and answers
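Because the validation set includes gold answers, it can be used to sanity-check the fine-tuned model before the final test run. A minimal scoring helper is sketched below; the question-ID and answer layout shown is hypothetical, so check json_structure_guide.md for the real structure.

```python
def score_trial(pred: dict, gold: dict) -> tuple:
    """Compare predicted single-letter answers against gold answers for one trial.

    Returns (number correct, number of gold questions)."""
    correct = sum(1 for qid, ans in gold.items() if pred.get(qid) == ans)
    return correct, len(gold)


# Hypothetical question IDs and answers, for illustration only.
gold = {"type_d_wht_20": "a", "type_b_emo_3": "c"}
pred = {"type_d_wht_20": "a", "type_b_emo_3": "d"}
print(score_trial(pred, gold))  # (1, 2)
```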
Test Set
- Location: /workspace/data/datasets/test/
- Content:
  - trialN/: the folder includes the story information (main character, scenario numbers, character information, and story)
    - story.json: the file includes the story information (main character, scenario numbers, character information, and story), but does not include the sketch.
    - question_new.json: the file includes only the questions, with no answers
Available Models
- Models:
  - Qwen2-7B-Instruct (/workspace/data/checkpoints/Qwen2-7B-Instruct): fine-tune this model to measure its performance gain.
  - Qwen2-72B-Instruct (/workspace/data/checkpoints/Qwen2-72B-Instruct): use this model to generate chain-of-thought answers for the training data.
Constraint
- Time Limit: Total time for data construction, training, validation, and test inference must not exceed 48 hours
- Hardware: Only 8 × 80GB GPUs available throughout the entire process
- Result Generation: You should only use the Qwen2-7B-Instruct model or its SFT version to generate the final answers. We will double-check your scripts to make sure you are not using other models.
Evaluation
The evaluation combines performance on both test sets with equal weighting:
Output Requirements
- Save path: /workspace/data/outputs/test.json
- Format: a dictionary structured as follows:
{
"trial1": {
"type_d_wht_20": "a",
"type_xx": "b"
},
"trial2": {
"type_xx": "c",
"type_xx": "d"
}
}
- Replace trial1, trial2, … with the actual trial numbers.
- Replace type_xx with the corresponding question ID.
- Replace "a", "b", etc. with the model's predicted answer.
- Answer format: multiple-choice responses only, restricted to single-letter options ("a", "b", "c", "d", "e", …).
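Before submitting, it may help to check that test.json matches this structure. The validate_predictions helper below is our own sketch, not part of the task tooling:

```python
import string


def validate_predictions(preds: dict) -> list:
    """Return a list of problems found in a test.json-style prediction dict."""
    problems = []
    for trial, answers in preds.items():
        if not trial.startswith("trial"):
            problems.append(f"unexpected trial key: {trial!r}")
        for qid, ans in answers.items():
            # Answers must be exactly one lowercase letter ("a", "b", ...).
            if len(ans) != 1 or ans not in string.ascii_lowercase:
                problems.append(f"{trial}/{qid}: {ans!r} is not a single lowercase letter")
    return problems


# Hypothetical predictions; trial2's "Z" violates the answer format.
preds = {"trial1": {"type_d_wht_20": "a"}, "trial2": {"type_q_1": "Z"}}
print(validate_predictions(preds))
```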
Baseline Performance
- Test: Baseline accuracy range = 42% (min) to 60% (max)
Environment
File Structure
workspace/
├── data/
│ ├── checkpoints/
│ │ └── ... all the models ckpts used in this assignment ... # do not modify this directory
│ ├── datasets/
│ │ ├── train/ # you can modify this directory
│ │ ├── valid/ # do not modify this directory
│ │ ├── test/ # do not modify this directory
│ │ └── reference/ # do not modify this directory
│ └── outputs/
│ └── test.json
└── task/
├── repositories/ # do not modify this directory
│ └── LLaMA-Factory/
├── scripts/ # you can add scripts here
│ ├── training_config.yaml
│ └── training.sh # do not modify this file
└── task_description.md
Conda Environment
- A pre-configured Conda environment, /workspace/conda, has been provided and activated for this task.
- Installed Frameworks: LLaMA-Factory, vLLM, and other packages
Scripts
Available Resources
- LLaMA-Factory: located at /workspace/task/repositories/LLaMA-Factory for training and fine-tuning
- Custom Scripts: create and modify scripts in the /workspace/task/scripts/ directory
- Reference Scripts: existing scripts in the scripts directory can be referenced and modified as needed
Recommended Approach
- Use LLaMA-Factory for efficient fine-tuning with techniques such as full-parameter tuning
- Implement data synthesis by completing the missing question files in /workspace/data/datasets/train/
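Once questions are synthesized, they must be serialized into the input/output JSONL format for LLaMA-Factory. A minimal sketch, where the qa_pairs content and prompt wording are purely illustrative; the real pairs come from your synthesis script, and the file must also be registered in dataset_info.json:

```python
import json

# Hypothetical synthesized QA pairs; replace with the output of your
# synthesis script over /workspace/data/datasets/train/.
qa_pairs = [
    {"story": "Alice hid the keys while Bob was away...",
     "question": "Where does Bob believe the keys are?\n(a) drawer (b) table",
     "answer": "a"},
]

# Write one JSON object per line; input carries the instruction, output the label.
with open("training_data.jsonl", "w") as f:
    for pair in qa_pairs:
        record = {
            "input": (f"{pair['story']}\n\n{pair['question']}\n"
                      "Answer with a single letter."),
            "output": pair["answer"],
        }
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

After writing, copy the file to /workspace/task/repositories/LLaMA-Factory/data/ and /workspace/data/datasets/ as the task requires.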