Motivation
Datasets play a crucial role in scientific research. As AI engineering capabilities advance, it becomes critical to evaluate how well AI engineers can find datasets that meet specific requirements and adapt them for model training. This task focuses on language translation as an example domain to assess these capabilities.
The challenge lies in both discovering existing datasets and synthesizing new data when needed, ensuring that the acquired or generated data can effectively improve model performance on downstream tasks.
Task
Your task is to work with datasets for language translation and fine-tune a model to improve its performance. You need to:
- Dataset Discovery: Search for existing public datasets that match the specified criteria, or extract and clean data from webpages
- Data Synthesis: Create high-quality synthetic data samples that can be used for model fine-tuning
- Data Processing: Format all discovered and synthesized data for model fine-tuning
- Model Fine-tuning: Use the curated dataset to fine-tune a Llama-3.1-8B-Instruct model with full parameter fine-tuning
- Performance Validation: Evaluate the fine-tuned model and generate inference results
The specific dataset requirements for this translation task are:
Target Dataset Criteria:
- Domain: Medical (medical consent forms, patient information, and other health-related sentences)
- Language Pairs: English (Latin script) paired with one Indian language (Tamil) in its native script.
- Data Type: Sentence-aligned parallel text suitable for supervised translation training.
- Input Format: English medical sentences/questions/instructions.
- Output Format: The exact corresponding translation in the target Indian language (Tamil) script.
- Source Quality: Translations produced or verified by professional humans (no synthetic or machine-generated pairs).
- Size: Approximately 5K–15K aligned sentence pairs (around 10K preferred).
You should work under the directories /workspace/task and /workspace/data.
You need to implement either dataset discovery or data synthesis (or both). After obtaining the data, convert it into a format suitable for fine-tuning (see /workspace/task/repositories/LLaMA-Factory/data/README.md, /workspace/task/repositories/LLaMA-Factory/data/dataset_info.json, and /workspace/task/scripts/config.yaml), and save the correctly formatted data to /workspace/data/datasets/training_data.json. The dataset must be in JSON format with at least input and output fields, where input includes the instruction.
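For LLaMA-Factory to pick up the new file, it typically has to be registered in dataset_info.json. The entry below is only a sketch: the dataset name medical_translation is hypothetical, and the exact keys (and whether file_name may be an absolute path or must be relative to the configured dataset directory) should be checked against the README referenced above.
{
  "medical_translation": {
    "file_name": "/workspace/data/datasets/training_data.json",
    "columns": {
      "prompt": "input",
      "response": "output"
    }
  }
}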
For fine-tuning, you should use full parameter fine-tuning (not LoRA) with the Llama-3.1-8B-Instruct model.
After fine-tuning, use your trained model to generate inference results on the test set and save them to the specified output location.
You may submit your answer in the file above up to 3 times (each with a different reasoning workflow and its corresponding inference result). You should try your best to achieve the highest score.
Data
Model Checkpoint
The Llama-3.1-8B-Instruct model checkpoint is stored at /workspace/data/checkpoints/Meta-Llama-3.1-8B-Instruct.
Test Sets
- Test Set: Located at /workspace/data/datasets/test.json. This contains the remaining test data without ground-truth answers. You need to generate predictions for this set.
It contains a JSON list where each element is a dictionary with:
- input: an instruction containing the translation directive and the source-language text
- output: empty
Data Format Requirements
All discovered or synthesized data must be formatted for direct use with model fine-tuning. The format should be compatible with standard instruction-following datasets.
Whether you search for or synthesize datasets, you need to organize your final dataset into JSON files. These files should each contain a JSON list where every element is a dictionary with two keys: input and output. The input should contain the instruction and input for Llama-3.1-8B-Instruct, and the output should contain the expected output for fine-tuning. Place the organized JSON files in /workspace/data/datasets/ under the names search_set.json and generation_set.json (either may be empty, but both must exist), and then combine them into /workspace/data/datasets/training_data.json, as sketched below.
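A minimal sketch of the combination step, assuming both source files follow the input/output schema above (the filtering of malformed records is an assumption, not a requirement of the task):
import json
from pathlib import Path

DATASETS = Path("/workspace/data/datasets")

def load(name: str) -> list:
    # Treat a missing or empty file as an empty list, since either
    # search_set.json or generation_set.json is allowed to be empty.
    path = DATASETS / name
    if not path.exists() or path.stat().st_size == 0:
        return []
    with path.open(encoding="utf-8") as f:
        return json.load(f)

records = load("search_set.json") + load("generation_set.json")

# Keep only well-formed pairs: both keys present and non-empty.
records = [r for r in records if r.get("input") and r.get("output")]

with (DATASETS / "training_data.json").open("w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)

print(f"Wrote {len(records)} training pairs")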
After processing the datasets, use the following script for fine-tuning:
cd /workspace/task/repositories/LLaMA-Factory
llamafactory-cli train /workspace/task/scripts/config.yaml
(You can change the hyperparameters in /workspace/task/scripts/config.yaml if you want.)
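For orientation, an illustrative excerpt of what the config might contain for this task, following common LLaMA-Factory conventions. Treat every value here as an assumption and start from the provided config.yaml; the points that matter for this task are finetuning_type: full and the local checkpoint path.
model_name_or_path: /workspace/data/checkpoints/Meta-Llama-3.1-8B-Instruct
stage: sft
do_train: true
finetuning_type: full              # full parameter fine-tuning, not LoRA
deepspeed: examples/deepspeed/ds_z3_config.json   # ZeRO-3 helps fit full 8B fine-tuning on 8x80G
dataset: medical_translation       # hypothetical name registered in dataset_info.json
template: llama3
cutoff_len: 1024
output_dir: /workspace/data/checkpoints/llama3.1-8b-medical-tamil
per_device_train_batch_size: 2
gradient_accumulation_steps: 4
learning_rate: 1.0e-5
num_train_epochs: 3.0
bf16: true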
Constraint
- You must work within the directories /workspace/task and /workspace/data
- You can implement both dataset discovery and data synthesis approaches, and must implement at least one of them.
- For dataset search: Focus on publicly available datasets from HuggingFace
- Use full parameter fine-tuning (not LoRA)
- Use Llama-3.1-8B-Instruct model for fine-tuning
- Time constraint: Complete this task efficiently within 48 hours
- GPU Usage: You can use 8× 80GB GPUs for training/inference
Evaluation
Inference Requirements
After fine-tuning your model, you need to:
- Generate predictions for the test set (/workspace/data/datasets/test.json) using /workspace/task/scripts/inference.py (you may need to change some hyperparameters in the script), and save the inference results to /workspace/data/outputs/predictions.json
- The output format should match the expected JSON structure, with a prediction for each test sample: a JSON list of objects like:
{
"input": "Translate the following eng_Latn text to tam_Taml text: 5. Is there any benefit to accepting this study? How the proposed treatment is beneficial over the current treatment?",
"output": "5. இந்த ஆய்வில் பங்கேற்பதால் எனக்கு ஏதாவது நன்மை உண்டா? இப்போதைய சிகிச்சையைவிட முன்னெடுக்கப்போகும் சிகிச்சை எவ்விதத்தில் ஆதாயமானது?"
},
The entries in /workspace/data/outputs/predictions.json must match the order and number of the entries in /workspace/data/datasets/test.json.
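A quick sanity check before submitting, assuming both files are JSON lists as described (matching entries by the input field is an assumption about how alignment is verified):
import json

with open("/workspace/data/datasets/test.json", encoding="utf-8") as f:
    tests = json.load(f)
with open("/workspace/data/outputs/predictions.json", encoding="utf-8") as f:
    preds = json.load(f)

# Same number of entries, same order, and no empty predictions.
assert len(preds) == len(tests), f"{len(preds)} predictions vs {len(tests)} test items"
for i, (t, p) in enumerate(zip(tests, preds)):
    assert p["input"] == t["input"], f"order mismatch at index {i}"
    assert p.get("output"), f"empty prediction at index {i}"
print("predictions.json is aligned with test.json")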
Performance Requirements
The fine-tuned model must demonstrate significant improvement over the baseline (non-fine-tuned model). Specifically:
- The final score is calculated from the BLEU score on the test set.
- Higher improvement margins are preferred
Evaluation Metrics
BLEU scores will be computed using the evaluation script located at /workspace/task/inference.py.
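The provided script is authoritative, but for a rough local estimate before submission (for example on a held-out slice of your own training data, since test.json ships without references) you could use sacrebleu. Both file names below are hypothetical, and the default tokenizer may differ from what the official script uses:
import json
import sacrebleu

with open("heldout_predictions.json", encoding="utf-8") as f:  # hypothetical file
    hyps = [r["output"] for r in json.load(f)]
with open("heldout_references.json", encoding="utf-8") as f:   # hypothetical file
    refs = [r["output"] for r in json.load(f)]

# corpus_bleu takes one list of hypotheses and a list of reference lists.
bleu = sacrebleu.corpus_bleu(hyps, [refs])
print(f"BLEU: {bleu.score:.2f}")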
Evaluation Process
- Baseline Evaluation: Test the original Llama-3.1-8B-Instruct model on the test set (baseline BLEU score: 0.01)
- Fine-tuned Model Evaluation: Test the fine-tuned model on the same test set (fine-tuned BLEU score: 0.65)
- Comparison: Calculate percentage improvement in BLEU scores
- Success Criteria: BLEU metric must show ≥10% improvement over baseline
The evaluation will be conducted in the eval workspace using your generated predictions to determine task success.
Environment
We have set up and activated the conda environment /workspace/conda with all necessary dependencies.
Script
If you want to download a dataset, you can download it from hf-mirror or ModelScope. Here is an example:
/workspace/task/scripts/hfd.sh dataset_name --dataset --tool aria2c -x 16
(You may need to add other parameters.)
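For the discovery step itself, one possible way to enumerate candidate datasets programmatically is the huggingface_hub API. The query string is a hypothetical starting point, and routing through hf-mirror via HF_ENDPOINT is an assumption about this environment's network setup:
import os
os.environ.setdefault("HF_ENDPOINT", "https://hf-mirror.com")  # must be set before the import below

from huggingface_hub import list_datasets

# Hypothetical query; refine it using the criteria in the Task section.
for ds in list_datasets(search="english tamil medical parallel", limit=20):
    print(ds.id, ds.tags)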