Research Topic Registry

Step 1: Prepare agent’s workspace

  1. Clone the InnovatorBench Repo
git clone https://github.com/GAIR-NLP/InnovatorBench.git
cd InnovatorBench
  2. Create a new branch from dev/task_registry, name it task_<task_name>, and check out your branch.
export task_name=<task_name>
git checkout dev/task_registry
git checkout -b ${task_name}
  3. Create the folders for the task.
mkdir task_registry/

mkdir gym/
mkdir gym/${task_name}
mkdir gym/${task_name}/data/checkpoints/  # store the training checkpoints
mkdir gym/${task_name}/data/datasets/  # store the training datasets
mkdir gym/${task_name}/data/outputs/  # store the results from the agent
mkdir gym/${task_name}/task/  # store the code for the agent to complete the task
mkdir gym/${task_name}/task/repositories/  # store the original code repository
mkdir gym/${task_name}/task/scripts/  # store the scripts (launch scripts, dependency installation scripts, etc.)
  4. Create the task description task_description.md under gym/${task_name}/task/.

    • The task can come from any paper, or any other source, as long as it is an innovative research topic.
    • Ensure the task is reasonably assessable: neither impossible nor trivial for the agent.
    • Make sure task_description.md includes the following sections: Motivation, Task, Data, Constraint, Evaluation, and Environment (an outline is sketched at the end of this list). Do NOT include any hint!
    • The Constraint section must specify: training-time limit, evaluation-time limit, number of GPUs, and the memory size of each GPU.
    • Under the Task section, state that the agent must work inside /workspace.
    • Ensure every requirement listed in task_description.md is verifiable in evaluations/${task_name}/task/evaluation.py.
    • To prevent cheating, choose evaluation metrics that the agent cannot access; do not let the agent compute or store the metrics itself.
    • You may prompt the agent to output its ideas to /workspace/data/outputs.
    • Refer to the format in task_17 for guidance.
    • A full training + testing run must not exceed 16 hours.
    • Whenever possible, select creative / innovative tasks. For such tasks, add an extra markdown file evaluations/${task_name}/hint.md. This hint helps the agent by detailing the paper you reproduced as the baseline. If the agent chooses to read the hint, its score will be penalized; therefore, you must also add a note in task_description.md informing the agent how many points will be deducted for viewing the hint.
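    • For reference only, a minimal task_description.md outline (placeholders in angle brackets) could look like this:
      • Motivation: <why this research topic matters>
      • Task: <what the agent must accomplish; state that all work must happen inside /workspace>
      • Data: <which datasets are provided under /workspace/data/datasets and how they are formatted>
      • Constraint: <training-time limit, evaluation-time limit, number of GPUs, memory size of each GPU>
      • Evaluation: <which metrics will be measured and where the agent must write its outputs, e.g. /workspace/data/outputs>
      • Environment: <available hardware, conda environment, and provided repositories/scripts>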
  5. If the task has an original repository, build it in gym/${task_name}/task/repositories/.

cp -r path/to/repositories/ gym/${task_name}/task/repositories/

(For reference only; the actual repository may need to be built manually.)

    • If the original repository uses submodules, remove the .git directory inside each submodule and treat it as a regular subdirectory.

  6. Copy the training data, dev data, and test data without ground-truth labels into gym/${task_name}/data/datasets/.

Step 2: Prepare evaluation data and code

  1. Create folders to store the evaluation data and code.
mkdir evaluations/
mkdir evaluations/${task_name}
mkdir evaluations/${task_name}/data/  # store the evaluation data
mkdir evaluations/${task_name}/data/references/  # store the reference evaluation data
mkdir evaluations/${task_name}/data/baselines/  # store the baseline evaluation data
mkdir evaluations/${task_name}/task/  # store the evaluation code
  2. Design the evaluation dimensions.
  • You can design the evaluation dimensions yourself; here are some examples:
    • Model inference result
    • Accuracy score
  • What types of data are NOT allowed as evaluation data?
    • Data whose evaluation requires GPUs, training, or inference.
    • Data that has no reference answer.
  3. For tasks that need to be scored based on ground truth, a rubric, or unit tests, place the corresponding reference data in evaluations/${task_name}/data/references. You may create sub-folders to store data for different evaluation dimensions.

    • Examples: specific functions that need unit tests, grading rubrics for the problem, etc.
  4. Place the expected agent outputs (i.e., what the agent is supposed to produce in its workspace) under evaluations/${task_name}/data/outputs. You can further create sub-folders to store data for different evaluation dimensions.

    • Adding the agent’s outputs (or a baseline) here is only for convenience in verifying that the evaluation script is constructed correctly. In practice, add instructions in task_description.md directing the agent to output its results to /workspace/data/outputs.
  5. Create evaluations/${task_name}/task/config.py, inherit the Config class from evaluations/base/data_classes.py, and implement your configuration class, named TaskConfig.

from evaluations.base.data_classes import Config

class TaskConfig(Config):
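    # Add task-specific configuration fields (e.g. data paths or thresholds) here; `pass` is fine if none are needed.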
    pass
  6. Create evaluations/${task_name}/task/evaluation.py and inherit the base interface from evaluations/base/base_eval.py in a class named TaskBenchmark. You only need to implement the run_1 method: choose the metrics yourself and return a dictionary keyed by metric name and valued by metric value.
from evaluations.base.base_eval import BaseBenchmark
from evaluations.base.data_classes import Config
from typing import Dict, Any, Optional

class TaskBenchmark(BaseBenchmark):
    """
    paper_name: str = "paper_name" if have paper name else ""
    task_name: str = "task_name"
    """
    def __init__(self, config: Optional[Config] = None):
        # Initialize your benchmark
        super().__init__(config)

    def run_1(self) -> Dict[str, Any]:
        # Implement your test logic here.
        # Use relative paths, e.g. evaluations/${task_name}/data/outputs/outputs.json
        # Return a dictionary as the evaluation result, keyed by metric name and valued by metric value.
        # It must contain the `score` key; the format of the sub-scores is not fixed.
        # Including the `error` and `message` keys is recommended, e.g. {"error": "error_message", "message": "correct_message"}
        try:
            score_1 = self.cal_score_1()
            score_2 = self.cal_score_2()
        except Exception as e:
            return {
                "score": 0,
                "error": str(e) + "Error calculating the score.",
                "message": None
            }

        score = score_1 + score_2
        result = {
            "score": score,
            "error": None,
            "message": f"The accuracy is {score_1}/80, and the length score is {score_2}/20."
        }
        
        return result

    def cal_score_1(self) -> int:
        # Calculate score 1, e.g. the accuracy of the answer (60% of the total score, plus up to 20% as an extra bonus)
        # If the accuracy of the model trained by the agent surpasses the reference answer, this score can reach up to 80 points; otherwise it is capped at 60 points.
        return 60

    def cal_score_2(self) -> int:
        # Calculate score 2, e.g. the length of the answer (20% of the total score)
        return 20

Notes:

  • The scores should cover every metric that you ask the agent to achieve in task_description.md. For example, if you want the answer to be both short and good, you need to design at least a quality score and a length score.
  • The scores calculated by the cal_score_1 and cal_score_2 functions must satisfy the following conditions (see the sketch after these notes):
    • The reference answer (evaluations/${task_name}/data/references) should score around 70-80 points.
    • The baseline (evaluations/${task_name}/data/baselines) should score around 0 points.
    • The extra 20 points are a bonus awarded when the model's performance exceeds the reference answer; they are added automatically.
    • If the agent's output is too close to the reference answer (less than 5% difference), you need to redesign your task.
  • Please anticipate possible ways the agent could hack the evaluation; if hacking is detected, it will be penalized in the final result.
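
For continuous metrics, one way to satisfy these conditions is to rescale the raw metric against the baseline and reference results. The helper below is only a sketch; rescale_metric and its arguments are hypothetical and not part of evaluations/base.

def rescale_metric(agent_value: float, baseline_value: float, reference_value: float) -> float:
    # Hypothetical helper: linearly rescale a raw metric so that the baseline result
    # maps to roughly 0 points and the reference result maps to roughly 75 points.
    if reference_value <= baseline_value:
        raise ValueError("the reference result must outperform the baseline result")
    score = 75.0 * (agent_value - baseline_value) / (reference_value - baseline_value)
    # Clamp to the 0-80 base range; the extra 20-point bonus is added automatically as described above.
    return max(0.0, min(score, 80.0))
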
  7. Create a unit test tests/test_${task_name}.py and implement your test cases in this file; a minimal sketch of such a test file is shown below. Ensure your evaluation passes all the local test cases, or run a single test case.

Ensure the evaluation data is correctly placed and that evaluations/${task_name}/task/evaluation.py passes your own tests.

# Run a single test case
python -m unittest tests.test_${task_name}
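
A minimal sketch of tests/test_${task_name}.py, assuming the TaskBenchmark interface above; replace your_task_name with your actual task folder name and adapt the assertions to your own metrics.

import unittest

# Replace `your_task_name` with the actual ${task_name} folder, e.g. task_17.
from evaluations.your_task_name.task.evaluation import TaskBenchmark


class TestTaskBenchmark(unittest.TestCase):
    def test_run_1_returns_score(self):
        # run_1() must return a dictionary containing the `score` key.
        result = TaskBenchmark().run_1()
        self.assertIn("score", result)
        self.assertGreaterEqual(result["score"], 0)


if __name__ == "__main__":
    unittest.main()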

Step 3: Prepare conda environment

  1. Create a symbolic link so that /workspace points to ./workspace in the repository (use an absolute target so the link resolves correctly).
mkdir -p workspace && ln -s "$(pwd)/workspace" /workspace
  2. Create and activate the base conda environment, then install the dependencies:
conda create --prefix /workspace/conda python=3.10  # You can use other python versions
conda activate /workspace/conda

# When using `pip install -e` for installation, you must specify the absolute path, otherwise it will error during actual testing
pip install -e /workspace/task/repositories/your_repo_name
... # Other installation code
  3. Pack the conda environment:
tar -cf conda.tar /workspace/conda
mkdir conda
mkdir conda/${task_name}
mv conda.tar conda/${task_name}/conda.tar

Step 4: Create a PR!

Congrats, you’ve created your first task!

At this point, you would normally create a PR to add your task to the repo.