Code Dataset Decontamination

Data Filtering · Text · Code
Created by: Dayuan Fu
2025-09-03

Motivation

Code instruction tuning datasets often suffer from data contamination where training samples are too similar to benchmark test cases, leading to inflated performance metrics that don’t reflect true model capabilities. Additionally, training on overly simple programming tasks can limit model learning and performance on challenging problems. There is a critical need to systematically detect and remove contaminated data while filtering for appropriate task difficulty levels to ensure robust model training and fair evaluation.

Task

Your task is to implement a data decontamination and difficulty filtering pipeline for code instruction tuning datasets, followed by composition analysis of the cleaned data.

Specifically, you need to:

  1. Data Decontamination: Implement a contamination detection and filtering system that:

    • Marks samples within the training set that are too similar to the benchmark problems (a minimal similarity sketch is given after this list).
  2. Difficulty Assessment: Implement a difficulty scoring mechanism that evaluates how difficult each question is. Output the top 160k samples with the highest difficulty scores.

  3. Composition Analysis: Analyze the filtered dataset by:

    • Calculating contamination statistics for each training dataset before and after filtering
    • Analyzing the difficulty score distribution across the different datasets
    • Generating comprehensive reports covering contamination detection results, difficulty filtering statistics, and a detailed composition analysis with actionable insights.
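
As a starting point for step 1, the snippet below is a minimal n-gram-overlap sketch, not a prescribed method. It assumes each HumanEval record exposes its problem text under a “prompt” field, that the first user turn of a training sample is its instruction, and that a 10-gram size with a 0.1 Jaccard threshold is a reasonable default; all three are assumptions you should verify and tune.

```python
# Minimal n-gram-overlap decontamination sketch. Assumptions: HumanEval records
# carry their problem text in a "prompt" field, and the n-gram size / Jaccard
# threshold below are illustrative defaults, not values prescribed by the task.
import json
import re

def ngram_set(text, n=10):
    """Lower-cased word n-grams of a string (simple regex tokenization)."""
    toks = re.findall(r"\w+", text.lower())
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a or b) else 0.0

# HumanEval has only 164 problems, so its n-gram sets can be precomputed cheaply.
with open("/workspace/data/datasets/benchmarks/humaneval.jsonl") as f:
    bench_sets = [ngram_set(json.loads(line)["prompt"]) for line in f]

def is_contaminated(sample, threshold=0.1):
    """True if the sample's first user turn overlaps any benchmark problem."""
    instr = next((m["content"] for m in sample["messages"] if m["role"] == "user"), "")
    return any(jaccard(ngram_set(instr), b) >= threshold for b in bench_sets)
```

A lexical check like this misses paraphrased benchmark problems, so a semantic pass (e.g., embedding cosine similarity) can be layered on top of it.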

You should work in the /workspace/task and /workspace/data/outputs directories.

You should output the following files (a minimal sketch of this layout is given after the list):

  • /workspace/data/outputs/data_decontamination_result.jsonl: The filtered dataset
    • Fields: “source”, “id”, “messages”, “contaminationed” (boolean: True if the sample is contaminated, False otherwise)
  • /workspace/data/outputs/difficulty_score_result.jsonl: The difficulty score dataset containing the top 160k uncontaminated samples, sorted by difficulty score in descending order (the first sample is the most difficult)
    • Fields: “source”, “id”, “messages”, “difficulty_score”
  • /workspace/data/outputs/analysis_result.txt: The analysis result. Design it as a survey, written as a plain text file, that answers the following questions:
    • What is the distribution of the difficulty scores?
    • Which types of problems are the most/least difficult?
    • How can we obtain more difficult problems? Your response should be less than 20,000 characters.
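
The helper below is a hedged sketch of the required output layout, not a full pipeline: it assumes `samples` is a list of dicts that earlier steps have populated with “source”, “id”, “messages”, “contaminationed”, and (for uncontaminated rows) “difficulty_score”; only the field names and file paths come from the specification above.

```python
# Sketch of the required output layout. `samples` is assumed to be a list of dicts
# produced by earlier steps, carrying "source", "id", "messages", "contaminationed",
# and (for uncontaminated rows) "difficulty_score".
import json
from collections import Counter

OUT = "/workspace/data/outputs"

def write_jsonl(path, rows, keys):
    with open(path, "w") as f:
        for r in rows:
            f.write(json.dumps({k: r[k] for k in keys}) + "\n")

def write_outputs(samples):
    # 1) Every sample, with its contamination flag.
    write_jsonl(f"{OUT}/data_decontamination_result.jsonl", samples,
                ["source", "id", "messages", "contaminationed"])

    # 2) Top-160k uncontaminated samples, hardest first.
    clean = sorted((s for s in samples if not s["contaminationed"]),
                   key=lambda s: s["difficulty_score"], reverse=True)
    write_jsonl(f"{OUT}/difficulty_score_result.jsonl", clean[:160_000],
                ["source", "id", "messages", "difficulty_score"])

    # 3) Per-source contamination counts, one ingredient of analysis_result.txt.
    total = Counter(s["source"] for s in samples)
    flagged = Counter(s["source"] for s in samples if s["contaminationed"])
    with open(f"{OUT}/analysis_result.txt", "w") as f:
        for src, n in total.items():
            f.write(f"{src}: {flagged[src]}/{n} samples flagged as contaminated\n")
```

The contamination tally is only one ingredient of analysis_result.txt; the survey answers to the three questions above still need to be written as prose.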

Data

You will work with the following code instruction datasets located in /workspace/data/datasets/training/:

  • CodeFeedback (49.8k)
    • Instruction source: Real-world coding tasks (user queries) and human feedback. Data derived from multi-turn interactions between users, code models, and compilers; filtered by some methods to avoid contamination.
    • Response source: Code models and compilers (execution feedback), providing outputs and diagnostics from compilers; and human feedback, consisting of additional guidance or instructions from users.
  • StarCoder2 (50.7k)
    • Instruction source: StarCoder2-15B (self-generated). Instructions are generated in two steps: 1. Concepts extraction: for each seed function, StarCoder2-15B is prompted to produce a list of code concepts present within the function. 2. Instruction generation: StarCoder2-15B is then prompted to self-generate a coding task that incorporates the identified code concepts.
    • Response source: StarCoder2-15B (self-validated). The model is explicitly instructed to generate tests for self-validation after it produces a response.
  • Magiccoder-Evol-Instruct (50.2k)
    • Instruction source: Evolved instructions generated through the Code Evol-Instruct method from an initial instruction set (e.g., Code Alpaca), potentially involving OpenAI models (e.g., GPT-3.5). Evol-Instruct entails progressively developing complex instructions, starting with an initial instruction set and regenerating data in each step to create more complex instructions.
    • Response source: LLMs (GPT-4) generate responses after instruction evolution.
  • Codefuse-Evol-Instruct (37.7k)
    • Instruction source: Evolved instructions based on the “WizardCoder: Empowering Code Large Language Models with Evol-Instruct” method, evolved from an open-source dataset (e.g., Evol-Instruct-Code-80k-v1) using models like GPT-3.5 or GPT-4. This method enhances the fine-tuning effect of pre-trained code large models by adding complex code instructions.
    • Response source: Generated by models like GPT-3.5 or GPT-4 along with the evolved instructions; the data undergoes processing such as low-quality filtering and filtering of similar data via HumanEval evaluation to improve quality.
  • MagicCoder-OSS-Instruct (43.1k)
    • Instruction source: An LLM (specifically gpt-3.5-turbo-1106) generates coding problems by drawing inspiration from random code snippets collected from open-source repositories like GitHub, using the OSS-INSTRUCT method.
    • Response source: The LLM (specifically gpt-3.5-turbo-1106) generates solutions based on the open-source code snippets.
  • oa_leet_10k (2.4k User Query / 23k HuggingFace)
    • Instruction source: Programming problem descriptions collected from the LeetCode platform, used for code generation, problem understanding, or model evaluation.
    • Response source: Solutions from Kaggle leetcode-solutions in C++, Java, JavaScript, Python.
  • Octopack (56.0k)
    • Instruction source: Git commit messages. Data leverages the natural structure of Git commits, where code changes are paired with human instructions (commit messages). This dataset is called COMMITPACK and uses commit metadata from the GitHub action dump.
    • Response source: Code changes associated with the Git commits.
  • Code-Alpaca (20.0k)
    • Instruction source: Self-instruct method using seed tasks, generated by OpenAI’s text-davinci-003. This method aims to automate the collection of instruction-following data with only a small set of human-written instructions.
    • Response source: Generated by OpenAI’s text-davinci-003.

Each dataset is in JSONL format with the following fields (a quick schema check is sketched after this list):

  • “id”: unique sample identifier
  • “messages”: a list of messages in the conversation
  • “messages[i].role”: the role of the message
  • “messages[i].content”: the content of the message
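
For a quick sanity check of this schema, the snippet below prints the first record of one training file (the file name is taken from the file structure listed further down; nothing beyond the fields above is assumed):

```python
# Print the fields of the first record of one training dataset.
import json

with open("/workspace/data/datasets/training/CodeFeedback.jsonl") as f:
    first = json.loads(next(f))

print(first["id"])                                  # unique sample identifier
for msg in first["messages"]:                       # conversation turns
    print(msg["role"], "->", msg["content"][:80])   # role and truncated content
```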

Benchmark datasets for contamination detection: The benchmark datasets are located in /workspace/data/datasets/benchmarks:

  • HumanEval: 164 hand-written programming problems with function signatures and docstrings

Models: The complexity scoring model is located in /workspace/data/checkpoints:

  • Complexity-Scorer: the model used to score question complexity

Constraints

  1. Processing time: The complete pipeline should finish within 8 hours.
  2. GPU usage: You can use the eight 80 GB GPUs in the cluster.
  3. Evaluation: You can evaluate your submission on the test set at most 3 times before the deadline, in addition to the final submission. Your results will be compared with reference data to calculate scores. You may only submit after you have written your results to data_decontamination_result.jsonl, difficulty_score_result.jsonl, and analysis_result.txt; otherwise you will get 0 points for that evaluation.

Evaluation

Your implementation will be evaluated on:

  1. Contamination Detection Accuracy (50 points):

    • We will calculate the accuracy of the decontamination set and of the contamination detection set. Suppose these accuracies are x1 and x2, respectively; the score will be 50 * max((x1 + x2) - baseline - 1, 0) / (1 - baseline). (A worked example follows this list.)
  2. Difficulty Filtering Quality (10 points):

    • We will calculate the accuracy of the difficulty score set. Suppose this accuracy is x; the score will be max(10 * (x - baseline) / (1 - baseline), 0).
  3. Composition Analysis Completeness (40 points):

    • We will score your composition analysis result in analysis_result.txt.
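
To make the formulas concrete, here is a worked example; the baseline of 0.8 used below is a placeholder, since the actual baseline values are not stated in this document.

```python
# Worked example of the two scoring formulas. The baseline of 0.8 is a placeholder;
# the real baseline values are not given in the task description.
def contamination_points(x1, x2, baseline=0.8):
    return 50 * max((x1 + x2) - baseline - 1, 0) / (1 - baseline)

def difficulty_points(x, baseline=0.8):
    return max(10 * (x - baseline) / (1 - baseline), 0)

print(contamination_points(0.95, 0.95))  # (1.9 - 0.8 - 1) / 0.2 * 50 = 25.0
print(difficulty_points(0.9))            # (0.9 - 0.8) / 0.2 * 10 = 5.0
```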

Environment

We have set up a conda environment for you in /workspace/conda, and it is already activated. In this environment, we have installed the packages needed to use vllm and nltk.

File Structure:

/workspace/
├── conda/                                           # Do not modify this directory
├── data/
│   ├── checkpoints/                                 # Do not modify this directory
│   │   └── Complexity-Scorer
│   ├── datasets/                                    # Do not modify this directory
│   │   ├── training/
│   │   │   ├── CodeFeedback.jsonl
│   │   │   ├── StarCoder2.jsonl
│   │   │   ├── Magiccoder-Evol-Instruct.jsonl
│   │   │   ├── Codefuse-Evol-Instruct.jsonl
│   │   │   ├── MagicCoder-OSS-Instruct.jsonl
│   │   │   ├── oa_leet_10k.jsonl
│   │   │   ├── Octopack.jsonl
│   │   │   └── code_alpaca.jsonl
│   │   └── benchmarks/
│   │       └── humaneval.jsonl
│   └── outputs/
│       ├── data_decontamination_result.jsonl
│       ├── difficulty_score_result.jsonl
│       └── analysis_result.txt
└── task/                                            # Do not modify this directory
    ├── scripts/
    │   └── complexity_scorer.py                     # Do not modify this script
    └── task_description.md                          # Do not modify this file

Scripts

You can add your own scripts to the /workspace/task/scripts directory, but you must not modify the scripts that were originally in it.

The following script is provided to you; do not modify it:

  • /workspace/task/scripts/complexity_scorer.py: Contains the code to score the complexity of each question, but it is inefficient; you should write a faster implementation (a hedged batched-inference sketch is given below).
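
As an illustration of how the scoring could be sped up, the sketch below batches all prompts through vLLM with tensor parallelism across the eight GPUs. The prompt template and the score-parsing rule must be taken from the provided complexity_scorer.py; the PROMPT string and the "first number in the completion" parsing here are placeholders.

```python
# Hedged sketch of batched difficulty scoring with vLLM. The PROMPT template and the
# score parsing are placeholders -- copy the real ones from complexity_scorer.py.
import re
from vllm import LLM, SamplingParams

PROMPT = "Rate the complexity of this programming task from 1 to 10:\n{q}\nScore:"  # placeholder

llm = LLM(model="/workspace/data/checkpoints/Complexity-Scorer",
          tensor_parallel_size=8)          # assumes all eight GPUs are available
params = SamplingParams(temperature=0.0, max_tokens=8)

def score_batch(questions):
    """Score a list of instruction strings in a single batched vLLM call."""
    outputs = llm.generate([PROMPT.format(q=q) for q in questions], params)
    scores = []
    for out in outputs:
        match = re.search(r"\d+(\.\d+)?", out.outputs[0].text)
        scores.append(float(match.group()) if match else 0.0)
    return scores
```

Submitting one large batch lets vLLM schedule continuous batching itself, which is typically much faster than scoring samples one at a time.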