Motivation
The quality of training data is crucial for developing effective pre-trained models, and large language model pre-training relies heavily on cleaned web corpora. This assignment focuses on data cleaning strategies that improve the quality of raw web data.
Task
You will be provided with 100,000 raw web data entries from various sources. Your task is to design and implement a systematic approach for cleaning this web data: define a clear methodology and develop your own cleaning scripts that process the documents and remove noise while maintaining data consistency and improving overall quality.
You should work in the /workspace/task and /workspace/data/outputs directories. Read the raw web text data from raw_web_data.jsonl and save the cleaned data to /workspace/data/outputs/result_web_data.jsonl.
You may submit the file above only once, so aim for the highest score you can achieve.
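A minimal I/O skeleton for the task is sketched below. The clean_text function is only a placeholder assumption for your own cleaning logic, and the fallback keeps every “text” field non-empty:

```python
import json

INPUT_PATH = "/workspace/data/dataset/raw_web_data.jsonl"
OUTPUT_PATH = "/workspace/data/outputs/result_web_data.jsonl"

def clean_text(text: str) -> str:
    # Placeholder: replace with your actual cleaning logic.
    return " ".join(text.split())

with open(INPUT_PATH, "r", encoding="utf-8") as fin, \
     open(OUTPUT_PATH, "w", encoding="utf-8") as fout:
    for line in fin:
        entry = json.loads(line)
        cleaned = clean_text(entry["text"])
        # Never emit an empty "text" field: fall back to the original text.
        if not cleaned.strip():
            cleaned = entry["text"]
        fout.write(json.dumps({"id": entry["id"], "text": cleaned},
                              ensure_ascii=False) + "\n")
```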
Data
Dataset
- raw_web_data.jsonl: 100,000 raw web data entries from various sources; each entry is noisy, web-scraped text.
- Fields: “id”, “text”
- Location:
/workspace/data/dataset/raw_web_data.jsonl
Model
- Qwen:
- Size: 1.5B, 7B, 32B
- Location: All models are located in
/workspace/data/checkpoints/Qwen
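If you plan an LLM-assisted cleaning or scoring pass, one possible (not required) way to serve a local Qwen checkpoint with vLLM is sketched below. The exact checkpoint sub-directory name is an assumption; use whichever checkpoint actually exists on disk.

```python
from vllm import LLM, SamplingParams

# NOTE: the sub-directory name below is an assumption; point MODEL_PATH at
# whichever Qwen checkpoint is actually present under the checkpoints folder.
MODEL_PATH = "/workspace/data/checkpoints/Qwen/Qwen2.5-7B-Instruct"

llm = LLM(model=MODEL_PATH, tensor_parallel_size=8)  # shard across the 8 GPUs
params = SamplingParams(temperature=0.0, max_tokens=1024)

raw_text = "Home | About | Contact\nBreaking news: example article body ..."
prompt = (
    "Clean the following web-scraped text. Remove navigation menus, ads, and "
    "HTML remnants, keep the main content unchanged, and return only the "
    "cleaned text.\n\n" + raw_text
)
outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)
```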
Constraints
- The output file should contain “id” and “text” fields, with each output “id” matching the corresponding entry in the input.
- GPU Usage: You can use 8 80GB GPUs to clean the web data.
- During data cleaning, preserve the original text content as much as possible. The “text” field must not be empty after cleaning.
- Data cleaning time limit: 5 hours on a single machine. You have only one attempt, so make sure your pipeline finishes within this limit.
Output Format
The output file result_web_data.jsonl should follow this exact format:
{"id": example_id_1, "text": "cleaned text content here"}
{"id": example_id_2, "text": "another cleaned text content"}
Requirements:
- Each line must be a valid JSON object
- Each object must contain exactly two fields: “id” and “text”
- The “id” field must match the corresponding entry in the input file (it should be an int)
- The “text” field must not be empty after cleaning
- All 100,000 entries must be cleaned; do not delete any data.
- File must be saved at:
/workspace/data/outputs/result_web_data.jsonl
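A quick self-check of these requirements might look like the following sketch (it assumes the output preserves the input order of ids):

```python
import json

INPUT_PATH = "/workspace/data/dataset/raw_web_data.jsonl"
OUTPUT_PATH = "/workspace/data/outputs/result_web_data.jsonl"

with open(INPUT_PATH, encoding="utf-8") as f:
    input_ids = [json.loads(line)["id"] for line in f]

with open(OUTPUT_PATH, encoding="utf-8") as f:
    rows = [json.loads(line) for line in f]  # every line must be valid JSON

assert len(rows) == 100_000, "all 100,000 entries must be present"
for row in rows:
    assert set(row) == {"id", "text"}, "exactly two fields per object"
    assert isinstance(row["id"], int), "id must be an int"
    assert isinstance(row["text"], str) and row["text"].strip(), "text must be non-empty"
# Assumes the output keeps the same order as the input file.
assert [r["id"] for r in rows] == input_ids, "ids must match the input file"
print("output file passed all checks")
```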
Evaluation
Evaluation Metrics:
- Correct Format and Location: The result_web_data.jsonl output file has the correct format and location.
- Dataset Size Validation: All entries in result_web_data.jsonl have non-empty text fields.
- Judge Model Evaluation: A Judge-By-the-LLM method scores the quality of a subset (1-5 points). The cleaned data should outperform several baselines under the evaluation model.
Success Criteria:
The final score is calculated using linear interpolation based on your Judge Model score compared to the baselines:
- Format Validation (5 points): The result_web_data.jsonl output file has the correct format and location, and passes dataset size validation.
- Quality Score (95 points): Linear interpolation between baseline scores:
- Score ≤ Baseline_1 (3.0): 0 points
- Baseline_1 < Score ≤ Baseline_2 (3.5): 0-25 points (linear interpolation)
- Baseline_2 < Score ≤ Baseline_3 (4.0): 25-95 points (linear interpolation)
- Score > Baseline_3: 95 points
Reference Baselines:
- Baseline_1 (Raw data): 3.0 points
- Baseline_2 (Simple cleaning): 3.5 points
- Baseline_3 (Strong cleaning): 4.0 points
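For intuition only, the interpolation described above maps an average judge score to quality points roughly as in this sketch (not the official grading script):

```python
def quality_points(score: float) -> float:
    """Map an average judge score (1-5) to quality points per the stated baselines."""
    if score <= 3.0:
        return 0.0
    if score <= 3.5:
        return 25.0 * (score - 3.0) / 0.5          # interpolate 0 -> 25 points
    if score <= 4.0:
        return 25.0 + 70.0 * (score - 3.5) / 0.5   # interpolate 25 -> 95 points
    return 95.0

# Example: a judge score of 3.8 yields 25 + 70 * 0.6 = 67 quality points.
print(quality_points(3.8))
```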
Environment
The environment has been pre-configured for you in /workspace/conda, so you can start working directly without additional setup. Libraries such as vllm and datatrove are available in this environment.
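As a purely illustrative starting point, a rule-based pass using only the standard library might strip leftover HTML tags, boilerplate lines, and extra whitespace. The specific patterns and thresholds below are assumptions, not part of the task:

```python
import re

# Assumed examples of common web boilerplate; extend based on what you see in the data.
BOILERPLATE_PATTERNS = [
    r"^(share|tweet|print|subscribe|sign in|log in|accept cookies)\b.*$",
    r"^copyright\b.*$",
    r"^all rights reserved.*$",
]
BOILERPLATE_RE = re.compile("|".join(BOILERPLATE_PATTERNS), re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")   # leftover HTML tags
WS_RE = re.compile(r"[ \t]+")     # runs of spaces/tabs

def clean_text(text: str) -> str:
    text = TAG_RE.sub(" ", text)                      # drop HTML remnants
    lines = []
    for line in text.splitlines():
        line = WS_RE.sub(" ", line).strip()
        if not line or BOILERPLATE_RE.match(line):    # drop empty/boilerplate lines
            continue
        lines.append(line)
    cleaned = "\n".join(lines)
    return cleaned if cleaned.strip() else text       # never return empty text
```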