Web Data Cleaning

Data Filtering Text
Created by: Tianze Xu
2025-09-03

Motivation

The quality of training data is crucial for developing effective pre-trained models, and large language model pre-training in particular depends on well-cleaned web corpora. This assignment focuses on data cleaning strategies that improve the quality of raw web data.

Task

You will be provided with 100,000 raw web data entries from various sources. Your task is to design and implement a systematic approach for cleaning this web data: define a clear methodology and develop your own cleaning scripts that process the documents and remove noise while maintaining data consistency and improving overall quality.

You should work in the /workspace/task and /workspace/data/outputs directories. First read the raw web text from raw_web_data.jsonl, then save the cleaned data to /workspace/data/outputs/result_web_data.jsonl, as sketched below.
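
A minimal I/O sketch (not the required cleaning logic): it reads the raw JSONL, applies a hypothetical rule-based clean_text helper, and writes the result in the required format. Only the two file paths come from the task description; everything else is an assumption to be replaced by your own pipeline.

import json
import re

IN_PATH = "/workspace/data/dataset/raw_web_data.jsonl"
OUT_PATH = "/workspace/data/outputs/result_web_data.jsonl"

def clean_text(text: str) -> str:
    # Hypothetical rule-based cleanup: drop control characters, collapse whitespace.
    text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", "", text)
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    cleaned = text.strip()
    return cleaned if cleaned else text  # never emit an empty "text" field

with open(IN_PATH, "r", encoding="utf-8") as fin, \
     open(OUT_PATH, "w", encoding="utf-8") as fout:
    for line in fin:
        entry = json.loads(line)
        record = {"id": entry["id"], "text": clean_text(entry["text"])}
        fout.write(json.dumps(record, ensure_ascii=False) + "\n")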

You may submit the answer file above only once, so try your best to achieve the highest score.

Data

Dataset

  • raw_web_data.jsonl: 100,000 raw web data entries from various sources, where each entry is noisy web-scraped text.
    • Fields: “id”, “text”
    • Location: /workspace/data/dataset/raw_web_data.jsonl

Model

  • Qwen:
    • Sizes: 1.5B, 7B, 32B
    • Location: All models are located in /workspace/data/checkpoints/Qwen (see the loading sketch below)
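
A hedged sketch of model-assisted quality scoring with vLLM follows. The checkpoint subdirectory name (Qwen2.5-7B-Instruct) is an assumption; use whatever actually exists under /workspace/data/checkpoints/Qwen. The prompt wording and the 1-5 scale are illustrative, mirroring the evaluation setup rather than reproducing it.

from vllm import LLM, SamplingParams

MODEL_PATH = "/workspace/data/checkpoints/Qwen/Qwen2.5-7B-Instruct"  # hypothetical subdirectory name

llm = LLM(model=MODEL_PATH, tensor_parallel_size=1)
params = SamplingParams(temperature=0.0, max_tokens=4)

def score_batch(texts):
    # Ask the model to rate each text 1-5 and parse the first digit it emits.
    prompts = [
        "Rate the quality of the following web text from 1 (noisy) to 5 (clean). "
        "Answer with a single digit.\n\nText:\n" + t[:2000] + "\n\nScore:"
        for t in texts
    ]
    outputs = llm.generate(prompts, params)
    scores = []
    for out in outputs:
        digits = [c for c in out.outputs[0].text if c.isdigit()]
        scores.append(int(digits[0]) if digits else 3)  # fall back to a mid score on parse failure
    return scores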

Constraints

  • The output file should contain “id” and “text” fields, with “id” fields corresponding between input and output.
  • GPU Usage: You may use 8 × 80 GB GPUs to clean the web data (see the sharding sketch after this list).
  • During data cleaning, preserve the text content as much as possible; the “text” field must not be empty after cleaning.
  • Data cleaning time limit: 5 hours on a single machine. You have only one attempt, so make sure you can complete the task within this limit.
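
Because the whole run must finish within 5 hours, a natural approach is to shard the 100,000 entries across the 8 GPUs and process them in parallel. Below is a hedged sketch: process_shard is a hypothetical per-shard worker (for example, the vLLM scoring sketch above applied to its slice), and each worker pins one GPU via CUDA_VISIBLE_DEVICES before any CUDA initialization.

import json
import multiprocessing as mp
import os

NUM_GPUS = 8
IN_PATH = "/workspace/data/dataset/raw_web_data.jsonl"

def process_shard(gpu_id: int, entries: list) -> list:
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)  # must be set before loading any model
    # ... load a model here and clean/score `entries` ...
    return [{"id": e["id"], "text": e["text"].strip() or e["text"]} for e in entries]

if __name__ == "__main__":
    with open(IN_PATH, encoding="utf-8") as f:
        data = [json.loads(line) for line in f]
    shards = [data[i::NUM_GPUS] for i in range(NUM_GPUS)]  # round-robin split
    with mp.get_context("spawn").Pool(NUM_GPUS) as pool:
        results = pool.starmap(process_shard, list(enumerate(shards)))
    cleaned = [row for shard in results for row in shard]  # sort by "id" before writing if desired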

Output Format

The output file result_web_data.jsonl should follow this exact format:

{"id": example_id_1, "text": "cleaned text content here"}
{"id": example_id_2, "text": "another cleaned text content"}

Requirements:

  • Each line must be a valid JSON object
  • Each object must contain exactly two fields: “id” and “text”
  • The “id” field must match the corresponding entry in the input file (it must be an int)
  • The “text” field must not be empty after cleaning
  • All 100,000 entries must be cleaned; do not delete any entries.
  • File must be saved at: /workspace/data/outputs/result_web_data.jsonl (see the self-check sketch below)
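
Before submitting, it is worth checking the output against the requirements above. A small self-check sketch (paths taken from the task description):

import json

IN_PATH = "/workspace/data/dataset/raw_web_data.jsonl"
OUT_PATH = "/workspace/data/outputs/result_web_data.jsonl"

with open(IN_PATH, encoding="utf-8") as f:
    input_ids = {json.loads(line)["id"] for line in f}

output_ids = set()
with open(OUT_PATH, encoding="utf-8") as f:
    for line in f:
        obj = json.loads(line)                                        # each line must be valid JSON
        assert set(obj) == {"id", "text"}                             # exactly two fields
        assert isinstance(obj["id"], int)                             # "id" is an int
        assert isinstance(obj["text"], str) and obj["text"].strip()   # non-empty text
        output_ids.add(obj["id"])

assert output_ids == input_ids and len(output_ids) == 100_000         # no entry dropped
print("result_web_data.jsonl passed all checks")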

Evaluation

Evaluation Metrics:

  1. Correct Format and Location: The result_web_data.jsonl output file has the correct format and location.
  2. Dataset Size Validation: result_web_data.jsonl contains all 100,000 entries and every “text” field is non-empty.
  3. Judge Model Evaluation: an LLM-as-judge scores the quality of a sampled subset on a 1-5 scale. The cleaned data should score higher than the baselines under this judge model.

Success Criteria:

The final score is computed by linearly interpolating your Judge Model score between the baseline scores:

  • Format Validation (5 points): result_web_data.jsonl output file has correct format and location, and passes dataset size validation
  • Quality Score (95 points): Linear interpolation between baseline scores:
    • Score ≤ Baseline_1 (3.0): 0 points
    • Baseline_1 < Score ≤ Baseline_2 (3.5): 0-25 points (linear interpolation)
    • Baseline_2 < Score ≤ Baseline_3 (4.0): 25-95 points (linear interpolation)
    • Score > Baseline_3: 95 points

Reference Baselines:

  • Baseline_1 (Raw data): 3.0 points
  • Baseline_2 (Simple cleaning): 3.5 points
  • Baseline_3 (Strong cleaning): 4.0 points
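
As a worked example of the interpolation, here is a small sketch of how a judge score might map to quality points, assuming a straight linear mapping between adjacent baselines as described above (the 5 format points are added separately when validation passes):

def quality_points(score: float) -> float:
    if score <= 3.0:
        return 0.0
    if score <= 3.5:
        return (score - 3.0) / 0.5 * 25.0          # 0-25 points between Baseline_1 and Baseline_2
    if score <= 4.0:
        return 25.0 + (score - 3.5) / 0.5 * 70.0   # 25-95 points between Baseline_2 and Baseline_3
    return 95.0

print(quality_points(3.8))  # 67.0, i.e. 25 + 0.3 / 0.5 * 70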

Environment

The environment has been pre-configured in /workspace/conda, so you can start working directly without additional setup. Libraries such as vllm and datatrove are available in this environment.