Motivation
Theory of Mind (ToM) - the ability to understand and reason about others’ mental states - is fundamental to human social interaction. Current evaluations of Large Language Models (LLMs) focus on static snapshots of mental states, overlooking the crucial temporal evolution that characterizes real-world social interactions. This task addresses the need to construct a comprehensive benchmark for evaluating LLMs’ ability to track and understand the dynamic nature of human mental states across interconnected scenarios.
Task
Your objective is to generate complete training cases based on the defined data structure. This involves formulating questions that correspond to social scenarios, capturing the temporal evolution of mental states—including beliefs, emotions, intentions, and actions—across multiple connected interactions. These full scenarios, together with their associated questions, will then be used to fine-tune the provided models.
Specifically, you need to:
- Write an Inference Script for Evaluation: Create an inference script that reads the test data from /workspace/data/datasets/test/, performs inference following the Qwen input template (see /workspace/task/repositories/LLaMA-Factory/src/llamafactory/data/template.py), uses vLLM as the recommended backend for efficient inference, and saves the generated answers to /workspace/data/outputs/test.json.
- Synthesize Questions: Use the benchmark data in /workspace/data/datasets/reference/ as a reference to synthesize question-answer pairs for the stories in /workspace/data/datasets/train/. Do not generate the questions case by case; instead, implement a script that systematically synthesizes questions across the dataset.
- Prepare Training Data: Prepare the training data file /workspace/task/repositories/LLaMA-Factory/data/training_data.jsonl (refer to /workspace/task/repositories/LLaMA-Factory/data/README.md, /workspace/task/repositories/LLaMA-Factory/data/dataset_info.json, and /workspace/task/scripts/config.yaml). Then save the correctly formatted dataset in /workspace/data/datasets/training_data.jsonl. The dataset must be in JSON format with at least input and output fields, where input includes the instruction; the Qwen template will be applied automatically by LLaMA-Factory.
- Fine-tune the Model: Fine-tune the Qwen2-7B-Instruct model using LLaMA-Factory and the training samples in /workspace/task/repositories/LLaMA-Factory/data/training_data.jsonl via /workspace/task/scripts/training.sh. A hyperparameter YAML file is provided at /workspace/task/scripts/training_config.yaml; you can modify it if you want.
- Evaluate: Evaluate the fine-tuned Qwen2-7B-Instruct model on the test samples in /workspace/data/datasets/test/, using the inference script created in step 1.
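The inference step above can be sketched as follows. This is a minimal outline, not the required implementation: it assumes question_new.json maps question IDs to question text and omits the story context from the prompt for brevity; adapt the prompt construction to the actual layout described in json_structure_guide.md, and verify the chat template against LLaMA-Factory's template.py.

```python
import glob
import json
import os


def build_qwen_prompt(question: str, system: str = "You are a helpful assistant.") -> str:
    # Manually rendered Qwen2 chat template; verify against
    # LLaMA-Factory's src/llamafactory/data/template.py before relying on it.
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{question}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )


def main() -> None:
    # Import vLLM lazily so the prompt helper stays usable without a GPU.
    from vllm import LLM, SamplingParams

    llm = LLM(model="/workspace/data/checkpoints/Qwen2-7B-Instruct",
              tensor_parallel_size=8)
    params = SamplingParams(temperature=0.0, max_tokens=8)

    results = {}
    for trial_dir in sorted(glob.glob("/workspace/data/datasets/test/trial*")):
        trial = os.path.basename(trial_dir)
        with open(os.path.join(trial_dir, "question_new.json")) as f:
            questions = json.load(f)  # assumed {question_id: question_text} layout
        qids = list(questions)
        prompts = [build_qwen_prompt(str(questions[qid])) for qid in qids]
        outputs = llm.generate(prompts, params)
        # Keep only the first character as the single-letter answer.
        results[trial] = {qid: out.outputs[0].text.strip().lower()[:1]
                          for qid, out in zip(qids, outputs)}

    os.makedirs("/workspace/data/outputs", exist_ok=True)
    with open("/workspace/data/outputs/test.json", "w") as f:
        json.dump(results, f, indent=2)


if __name__ == "__main__":
    main()
```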
You should work under the /workspace directory.
You may submit your answer in the file above up to 3 times (each with a different reasoning workflow and its corresponding inference result). You should try your best to achieve the highest score.
Data
The dataset is organized into four main directories:
Reference Data
- Location: /workspace/data/datasets/reference/
- Content:
  - trialN/: N is an integer; each folder contains one complete story (main character, scenario numbers, character information, sketch, and story) and its corresponding questions.
    - story.json: complete story information (main character, scenario numbers, character information, sketch, and story)
    - question.json: full questions with answers
  - json_structure_guide.md: the guide to the training data and questions. You should carefully read this file before executing the other tasks.
- Requirements: read all the content in this folder before executing the other tasks.
Training Data
- Location: /workspace/data/datasets/train/
- Content:
  - trialN/: the folder includes only the story information (main character, scenario numbers, character information, sketch, and story)
    - story.json: the file includes the story information (main character, scenario numbers, character information, sketch, and story)
Validation Data
- Location: /workspace/data/datasets/valid/
- Content:
  - trialN/: the folder includes the story information (main character, scenario numbers, character information, and story)
    - story.json: the file includes the story information (main character, scenario numbers, character information, and story), but does not include the sketch.
    - question_new.json: the file includes the questions and answers
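Because the validation set includes gold answers, it can be used to sanity-check the fine-tuned model before the final test run. A minimal scoring helper is sketched below; the question-ID and answer layout shown is hypothetical, so check json_structure_guide.md for the real structure.

```python
def score_trial(pred: dict, gold: dict) -> tuple:
    """Compare predicted single-letter answers against gold answers for one trial.

    Returns (number correct, number of gold questions)."""
    correct = sum(1 for qid, ans in gold.items() if pred.get(qid) == ans)
    return correct, len(gold)


# Hypothetical question IDs and answers, for illustration only.
gold = {"type_d_wht_20": "a", "type_b_emo_3": "c"}
pred = {"type_d_wht_20": "a", "type_b_emo_3": "d"}
print(score_trial(pred, gold))  # (1, 2)
```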
Test Set
- Location: /workspace/data/datasets/test/
- Content:
  - trialN/: the folder includes the story information (main character, scenario numbers, character information, and story)
    - story.json: the file includes the story information (main character, scenario numbers, character information, and story), but does not include the sketch.
    - question_new.json: the file includes only the questions, with no answers
Available Models
- Models:
  - Qwen2-7B-Instruct (/workspace/data/checkpoints/Qwen2-7B-Instruct): fine-tune this model to measure its performance gain.
  - Qwen2-72B-Instruct (/workspace/data/checkpoints/Qwen2-72B-Instruct): use this model to generate chain-of-thought answers for the training data.
Constraint
- Time Limit: Total time for data construction, training, validation, and test inference must not exceed 48 hours
- Hardware: Only 8 × 80GB GPUs available throughout the entire process
- Result Generation: You should only use the Qwen2-7B-Instruct model or its SFT version to generate the final answers. We will double-check your scripts to make sure you are not using other models.
Evaluation
The evaluation combines performance on both test sets with equal weighting:
Output Requirements
- Save path: /workspace/data/outputs/test.json
- Format: a dictionary structured as follows:
{
"trial1": {
"type_d_wht_20": "a",
"type_xx": "b"
},
"trial2": {
"type_xx": "c",
"type_xx": "d"
}
}
- Replace trial1, trial2, … with the actual trial numbers.
- Replace type_xx with the corresponding question ID.
- Replace "a", "b", etc. with the model's predicted answer.
- Answer format: multiple-choice responses only, restricted to single-letter options ("a", "b", "c", "d", "e", …).
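Before submitting, it may help to check that test.json matches this structure. The validate_predictions helper below is our own sketch, not part of the task tooling:

```python
import string


def validate_predictions(preds: dict) -> list:
    """Return a list of problems found in a test.json-style prediction dict."""
    problems = []
    for trial, answers in preds.items():
        if not trial.startswith("trial"):
            problems.append(f"unexpected trial key: {trial!r}")
        for qid, ans in answers.items():
            # Answers must be exactly one lowercase letter ("a", "b", ...).
            if len(ans) != 1 or ans not in string.ascii_lowercase:
                problems.append(f"{trial}/{qid}: {ans!r} is not a single lowercase letter")
    return problems


# Hypothetical predictions; trial2's "Z" violates the answer format.
preds = {"trial1": {"type_d_wht_20": "a"}, "trial2": {"type_q_1": "Z"}}
print(validate_predictions(preds))
```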
Baseline Performance
- Test: Baseline accuracy range = 42% (min) to 60% (max)
Environment
File Structure
workspace/
├── data/
│ ├── checkpoints/
│ │ └── ... all the models ckpts used in this assignment ... # do not modify this directory
│ ├── datasets/
│ │ ├── train/ # you can modify this directory
│ │ ├── valid/ # do not modify this directory
│ │ ├── test/ # do not modify this directory
│ │ └── reference/ # do not modify this directory
│ └── outputs/
│ └── test.json
└── task/
├── repositories/ # do not modify this directory
│ └── LLaMA-Factory/
├── scripts/ # you can add scripts here
│ ├── training_config.yaml
│ └── training.sh # do not modify this file
└── task_description.md
Conda Environment
- A pre-configured Conda environment, /workspace/conda, has been provided and activated for this task.
- Installed Frameworks: LLaMA-Factory, vLLM, and other packages
Scripts
Available Resources
- LLaMA-Factory: located at /workspace/task/repositories/LLaMA-Factory for training and fine-tuning
- Custom Scripts: create and modify scripts in the /workspace/task/scripts/ directory
- Reference Scripts: existing scripts in the scripts directory can be referenced and modified as needed
Recommended Approach
- Use LLaMA-Factory for efficient fine-tuning with techniques such as full-parameter tuning
- Implement data synthesis by completing the missing question files in /workspace/data/datasets/train/
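Once questions are synthesized, they must be serialized into the input/output JSONL format for LLaMA-Factory. A minimal sketch, where the qa_pairs content and prompt wording are purely illustrative; the real pairs come from your synthesis script, and the file must also be registered in dataset_info.json:

```python
import json

# Hypothetical synthesized QA pairs; replace with the output of your
# synthesis script over /workspace/data/datasets/train/.
qa_pairs = [
    {"story": "Alice hid the keys while Bob was away...",
     "question": "Where does Bob believe the keys are?\n(a) drawer (b) table",
     "answer": "a"},
]

# Write one JSON object per line; input carries the instruction, output the label.
with open("training_data.jsonl", "w") as f:
    for pair in qa_pairs:
        record = {
            "input": (f"{pair['story']}\n\n{pair['question']}\n"
                      "Answer with a single letter."),
            "output": pair["answer"],
        }
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

After writing, copy the file to /workspace/task/repositories/LLaMA-Factory/data/ and /workspace/data/datasets/ as the task requires.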