Medical Question Answering

Data Construction: Text, Medicine
Created by: Mohan Jiang
2025-09-03

Motivation

Datasets play a crucial role in scientific research. With the advancement of AI engineering capabilities, it becomes critical to evaluate how well AI engineers can find datasets that meet specific requirements and adapt them for model training. This task focuses on medical question answering as an example domain to assess these capabilities.

The challenge lies in both discovering existing datasets and synthesizing new data when needed, ensuring that the acquired or generated data can effectively improve model performance on downstream tasks.

Task

Your task is to work with datasets for medical question answering and fine-tune a model to improve its performance. You need to:

  1. Dataset Discovery: Search for existing public datasets that match the specified criteria
  2. Data Synthesis: Create high-quality synthetic data samples that can be used for model fine-tuning
  3. Data Processing: Format all discovered and synthesized data for model fine-tuning
  4. Model Fine-tuning: Use the curated dataset to fine-tune a Llama-3.1-8B-Instruct model with full parameter fine-tuning
  5. Performance Validation: Evaluate the fine-tuned model and generate inference results

The specific dataset requirements for this medical question answering task are:

Target Dataset Criteria:

  • Domain: USMLE-style multiple choice medical question answering
  • Input: Detailed clinical scenarios capturing real-world medical situations, each with four answer options (A, B, C, D)
  • Output: A single letter (A, B, C, or D) corresponding to the annotated correct answer
  • Source: Authentic human-generated clinical questions with expert-annotated answers (no synthetic data for existing datasets)
  • Dataset scale: approximately 5000 question/answer pairs
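
To make the criteria concrete, a single training example might look like the record below. The wording is purely illustrative, and the three-key system/input/output layout follows the training format described later in this document:

{
  "system": "You are a medical expert. Answer the USMLE-style question with a single letter.",
  "input": "Answer the following multiple choice question.\nA 56-year-old man presents with crushing substernal chest pain radiating to the left arm...\nA. Pulmonary embolism\nB. Acute myocardial infarction\nC. Aortic dissection\nD. Pericarditis",
  "output": "B"
}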

You should work under the directories /workspace/task and /workspace/data.

You need to implement dataset discovery, data synthesis, or both. After obtaining the data, convert it into a format suitable for fine-tuning (see /workspace/task/repositories/LLaMA-Factory/data/README.md, /workspace/task/repositories/LLaMA-Factory/data/dataset_info.json, and /workspace/task/scripts/config.yaml), and save the correctly formatted data to /workspace/data/datasets/training_data.json. The dataset must be in JSON format with at least input and output fields, where input includes the instruction.
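
Depending on how config.yaml references the data, you may also need to register training_data.json in dataset_info.json. A hypothetical entry is sketched below; the dataset name "medical_qa", the column mapping, and whether an absolute file_name path is accepted are assumptions to verify against the data README:

"medical_qa": {
  "file_name": "/workspace/data/datasets/training_data.json",
  "columns": {
    "prompt": "input",
    "response": "output",
    "system": "system"
  }
}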

For fine-tuning, you should use full parameter fine-tuning (not LoRA) with the Llama-3.1-8B-Instruct model.

After fine-tuning, use your trained model to generate inference results on the test set and save them to the specified output location.

Data

Model Checkpoint

The Llama-3.1-8B-Instruct model checkpoint is stored at /workspace/data/checkpoints/Meta-Llama-3.1-8B-Instruct.

Test Sets

  • Test Set: Located at /workspace/data/datasets/test.json. This contains the remaining test data without ground truth answers. You need to generate predictions for this set.

It contains a JSON list where each element is a dictionary with the following keys:

  • system: System prompt.
  • input: Instruction containing the medical question directive and a clinical scenario with multiple choice options.
  • output: Empty (no ground-truth answer is provided).

Data Format Requirements

All discovered or synthesized data must be formatted for direct use with model fine-tuning. The format should be compatible with standard instruction-following datasets.

Whether you search for or synthesize datasets, you need to organize your final dataset into JSON files. These JSON files should each contain a JSON list where each element is a dictionary with three keys: system, input, and output. The input should contain the instruction and input for Llama-3.1-8B-Instruct, and the output should contain the expected output for fine-tuning. The organized JSON files should be placed in /workspace/data/datasets/ with the names search_set.json and generation_set.json (both may be empty, but you must create them). Then combine them into /workspace/data/datasets/training_data.json.
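
A minimal Python sketch of the combining step, assuming both files already follow the system/input/output schema described above:

import json

# Paths of the two partial datasets; either file may be an empty JSON list.
parts = [
    "/workspace/data/datasets/search_set.json",
    "/workspace/data/datasets/generation_set.json",
]

combined = []
for path in parts:
    with open(path, "r", encoding="utf-8") as f:
        combined.extend(json.load(f))  # each file is a list of {system, input, output} dicts

with open("/workspace/data/datasets/training_data.json", "w", encoding="utf-8") as f:
    json.dump(combined, f, ensure_ascii=False, indent=2)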

After processing the datasets, use the following script for fine-tuning:

cd /workspace/task/repositories/LLaMA-Factory
llamafactory-cli train /workspace/task/scripts/config.yaml

(You can change the hyperparameters in /workspace/task/scripts/config.yaml if you want.)
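
For orientation, the fields most relevant to this task in a LLaMA-Factory SFT config typically look like the excerpt below. This is an illustrative sketch, not the contents of the provided config.yaml, so verify key names and values against the actual file:

model_name_or_path: /workspace/data/checkpoints/Meta-Llama-3.1-8B-Instruct
stage: sft
do_train: true
finetuning_type: full        # full parameter fine-tuning, not LoRA
dataset: medical_qa          # hypothetical name registered in dataset_info.json
template: llama3
cutoff_len: 2048
output_dir: /workspace/data/outputs/llama3.1-8b-medqa-full   # hypothetical path
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-5
num_train_epochs: 3.0
bf16: true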

Constraints

  • You must work within the directories /workspace/task and /workspace/data
  • You may implement both dataset discovery and data synthesis, and must implement at least one of them
  • For dataset search: Focus on publicly available datasets from HuggingFace
  • Use full parameter fine-tuning (not LoRA)
  • Use Llama-3.1-8B-Instruct model for fine-tuning
  • Time constraint: Complete this task efficiently within 48 hours
  • GPU Usage: You can use 8 × 80 GB GPUs for training/inference

Evaluation

Inference Requirements

After fine-tuning your model, you need to:

  1. Generate predictions for the test set (/workspace/data/datasets/test.json) using /workspace/task/scripts/inference.sh (you may need to change some hyperparameters in the script), save the inference results to /workspace/data/outputs/predictions.json, and then transform them into the final answers file /workspace/data/outputs/answers.json with a separate script.
  2. The output format should match the expected JSON structure, with a prediction for each test sample:
[
  {
    "answer": "your answer (A,B,C,D)"
  }
]

Note: /workspace/data/outputs/answers.json must contain the same number of entries, in the same order, as /workspace/data/datasets/test.json.
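
A sketch of the conversion script, assuming predictions.json holds one record per test sample with the raw generation under a "predict" key (the actual key depends on the output of inference.sh, so adjust as needed); it preserves the order and count of the test set:

import json
import re

with open("/workspace/data/outputs/predictions.json", "r", encoding="utf-8") as f:
    raw = f.read().strip()

# Handle either a single JSON list or JSON-lines output.
try:
    records = json.loads(raw)
except json.JSONDecodeError:
    records = [json.loads(line) for line in raw.splitlines() if line.strip()]

answers = []
for rec in records:
    text = rec.get("predict", "") if isinstance(rec, dict) else str(rec)
    match = re.search(r"\b([ABCD])\b", text)          # first standalone option letter
    answers.append({"answer": match.group(1) if match else "A"})  # fallback if no letter found

with open("/workspace/data/outputs/answers.json", "w", encoding="utf-8") as f:
    json.dump(answers, f, indent=2)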

You can submit your answer in the file above up to 3 times (each with a different reasoning workflow and its corresponding inference result). You should try your best to get the highest score.

Performance Requirements

The fine-tuned model must demonstrate significant improvement over the baseline (non-fine-tuned model). Specifically:

  • The final score is calculated based on the accuracy of the test set.
  • Higher improvement margins are preferred

Evaluation Metrics

You can write your own evaluation metric.
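
A minimal accuracy metric, assuming you hold out a small validation split with gold answer letters (the provided test.json has no ground truth, so this is only usable on data you set aside yourself):

import json

def accuracy(pred_path: str, gold_path: str) -> float:
    # Both files are JSON lists of {"answer": "A"|"B"|"C"|"D"} dicts in the same order.
    with open(pred_path, "r", encoding="utf-8") as f:
        preds = json.load(f)
    with open(gold_path, "r", encoding="utf-8") as f:
        golds = json.load(f)
    correct = sum(p["answer"] == g["answer"] for p, g in zip(preds, golds))
    return correct / len(golds)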

Evaluation Process

  1. Baseline Evaluation: Test the original Llama-3.1-8B-Instruct model on the test set (baseline accuracy score: 0.26)
  2. Fine-tuned Model Evaluation: Test the fine-tuned model on the same test set (fine-tuned accuracy score: 0.95)
  3. Comparison: Calculate percentage improvement in accuracy scores
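
For example, using the reference scores above, the absolute gain is 0.95 - 0.26 = 0.69, which corresponds to a relative improvement of (0.95 - 0.26) / 0.26 ≈ 265%.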

The evaluation will be conducted in the eval workspace using your generated predictions to determine task success.


Environment

We have set up and activated the conda environment /workspace/conda with all necessary dependencies.

Script

If you want to download a dataset, you can download it from hf-mirror or ModelScope. Here is an example script invocation: /workspace/task/scripts/hfd.sh dataset_name --dataset --tool aria2c -x 16 (you may need to add other parameters).
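
As an illustration only, a USMLE-style multiple choice dataset hosted on HuggingFace might be fetched with a command like the following (the dataset name is given purely as an example; confirm that whatever you download actually satisfies the criteria above):

/workspace/task/scripts/hfd.sh GBaker/MedQA-USMLE-4-options --dataset --tool aria2c -x 16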