Research Topic Registry

Step 1: Prepare agent’s workspace

  1. Clone the InnovatorBench Repo
git clone https://github.com/GAIR-NLP/InnovatorBench.git
cd InnovatorBench
  2. Create a new branch from dev/task_registry, name it task_<task_name>, and check out your branch.
export task_name=<task_name>
git checkout dev/task_registry
git checkout -b ${task_name}
  3. Create the folders for the task.
mkdir task_registry/

mkdir gym/
mkdir gym/${task_name}
mkdir gym/${task_name}/data/checkpoints/  # store the training checkpoints
mkdir gym/${task_name}/data/datasets/  # store the training datasets
mkdir gym/${task_name}/data/outputs/  # store the results from the agent
mkdir gym/${task_name}/task/  # store the code for the agent to complete the task
mkdir gym/${task_name}/task/repositories/  # store the original code repository
mkdir gym/${task_name}/task/scripts/  # store the scripts (launch scripts, dependency installation scripts, etc.)
  4. Create the task description task_description.md under gym/${task_name}/task/.

    • The task can come from any paper, or any other source, as long as it is an innovative research topic.
    • Ensure the task is reasonably assessable: neither impossible nor trivial for the agent.
    • Make sure task_description.md includes the following sections: Motivation, Task, Data, Constraint, Evaluation, and Environment (an outline is sketched at the end of this list). Do NOT include any hint!
    • The Constraint section must specify: training-time limit, evaluation-time limit, number of GPUs, and the memory size of each GPU.
    • Under the Task section, state that the agent must work inside /workspace.
    • Ensure every requirement listed in task_description.md is verifiable in evaluations/${task_name}/task/evaluation.py.
    • To prevent cheating, choose evaluation metrics that the agent cannot access; do not let the agent compute or store the metrics itself.
    • You may prompt the agent to output its ideas to /workspace/data/outputs.
    • Refer to the format in task_17 for guidance.
    • A full training + testing run must not exceed 16 hours.
    • Whenever possible, select creative / innovative tasks. For such tasks, add an extra markdown file evaluations/${task_name}/hint.md. This hint helps the agent by detailing the paper you reproduced as the baseline. If the agent chooses to read the hint, its score will be penalized; therefore, you must also add a note in task_description.md informing the agent how many points will be deducted for viewing the hint.
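    • For reference only, a minimal task_description.md outline (placeholders in angle brackets) could look like this:
      • Motivation: <why this research topic matters>
      • Task: <what the agent must accomplish; state that all work must happen inside /workspace>
      • Data: <which datasets are provided under /workspace/data/datasets and how they are formatted>
      • Constraint: <training-time limit, evaluation-time limit, number of GPUs, memory size of each GPU>
      • Evaluation: <which metrics will be measured and where the agent must write its outputs, e.g. /workspace/data/outputs>
      • Environment: <available hardware, conda environment, and provided repositories/scripts>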
  5. If the task has an original repository, build it in gym/${task_name}/task/repositories/.

cp -r path/to/repositories/ gym/${task_name}/task/repositories/

(For reference only; the actual repository may need to be built manually.)

    • If the original repository uses submodules, remove the .git directory inside each submodule and treat it as a regular subdirectory.

  6. Copy the training data, dev data, and test data without ground-truth labels into gym/${task_name}/data/datasets/.

Step 2: Prepare evaluation data and code

  1. Create folders to store the evaluation data and code.
mkdir evaluations/
mkdir evaluations/${task_name}
mkdir evaluations/${task_name}/data/  # store the evaluation data
mkdir evaluations/${task_name}/data/references/  # store the reference evaluation data
mkdir evaluations/${task_name}/data/baselines/  # store the baseline evaluation data
mkdir evaluations/${task_name}/task/  # store the evaluation code
  2. Design the evaluation dimensions.
  • You can design the evaluation dimensions yourself; here are some examples:
    • Model inference result
    • Accuracy score
  • What types of data are NOT allowed as evaluation data?
    • Data whose evaluation requires GPUs, training, or inference.
    • Data that has no reference answer.
  3. For tasks that need to be scored based on ground truth, a rubric, or unit tests, place the corresponding reference data in evaluations/${task_name}/data/references. You may create sub-folders to store data for different evaluation dimensions.

    • Examples: specific functions that need unit tests, grading rubrics for the problem, etc.
  4. Place the expected agent outputs (i.e., what the agent is supposed to produce in its workspace) under evaluations/${task_name}/data/outputs. You can further create sub-folders to store data for different evaluation dimensions.

    • Adding the agent’s outputs (or a baseline) here is only for convenience in verifying that the evaluation script is constructed correctly. In practice, add instructions in task_description.md directing the agent to output its results to /workspace/data/outputs.
  5. Create evaluations/${task_name}/task/config.py, inherit the Config class from evaluations/base/data_classes.py, and implement your configuration class, named TaskConfig.

from evaluations.base.data_classes import Config

class TaskConfig(Config):
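    # Add task-specific configuration fields (e.g. data paths or thresholds) here; `pass` is fine if none are needed.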
    pass
  6. Create evaluations/${task_name}/task/evaluation.py and inherit the base interface from evaluations/base/base_eval.py in a class named TaskBenchmark. You only need to implement the run_1 method: choose the metrics yourself and return a dictionary keyed by metric name and valued by metric value.
from evaluations.base.base_eval import BaseBenchmark
from evaluations.base.data_classes import Config
from typing import Dict, Any, Optional

class TaskBenchmark(BaseBenchmark):
    """
    paper_name: str = "paper_name" if have paper name else ""
    task_name: str = "task_name"
    """
    def __init__(self, config: Optional[Config] = None):
        # Initialize your benchmark
        super().__init__(config)

    def run_1(self) -> Dict[str, Any]:
        # Implement your test logic here.
        # Use relative paths, e.g. evaluations/${task_name}/data/outputs/outputs.json
        # Return a dictionary as the evaluation result, keyed by metric name and valued by metric value.
        # It must contain the `score` key; the format of the sub-scores is not fixed.
        # Including the `error` and `message` keys is recommended, e.g. {"error": "error_message", "message": "correct_message"}
        try:
            score_1 = self.cal_score_1()
            score_2 = self.cal_score_2()
        except Exception as e:
            return {
                "score": 0,
                "error": str(e) + "Error calculating the score.",
                "message": None
            }

        score = score_1 + score_2
        result = {
            "score": score,
            "error": None,
            "message": f"The accuracy is {score_1}/80, and the length score is {score_2}/20."
        }
        
        return result

    def cal_score_1(self) -> int:
        # Calculate score 1, e.g. the accuracy of the answer (60% of the total score, plus up to 20% as an extra bonus)
        # If the accuracy of the model trained by the agent surpasses the reference answer, this score can reach up to 80 points; otherwise it is capped at 60 points.
        return 60

    def cal_score_2(self) -> int:
        # Calculate score 2, e.g. the length of the answer (20% of the total score)
        return 20

Notes:

  • The scores should cover every metric that you ask the agent to achieve in task_description.md. For example, if you want the answer to be both short and good, you need to design at least a quality score and a length score.
  • The scores calculated by the cal_score_1 and cal_score_2 functions must satisfy the following conditions (see the sketch after these notes):
    • The reference answer (evaluations/${task_name}/data/references) should score around 70-80 points.
    • The baseline (evaluations/${task_name}/data/baselines) should score around 0 points.
    • The extra 20 points are a bonus awarded when the model's performance exceeds the reference answer; they are added automatically.
    • If the agent's output is too close to the reference answer (less than 5% difference), you need to redesign your task.
  • Please anticipate possible ways the agent could hack the evaluation; if hacking is detected, it will be penalized in the final result.
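
For continuous metrics, one way to satisfy these conditions is to rescale the raw metric against the baseline and reference results. The helper below is only a sketch; rescale_metric and its arguments are hypothetical and not part of evaluations/base.

def rescale_metric(agent_value: float, baseline_value: float, reference_value: float) -> float:
    # Hypothetical helper: linearly rescale a raw metric so that the baseline result
    # maps to roughly 0 points and the reference result maps to roughly 75 points.
    if reference_value <= baseline_value:
        raise ValueError("the reference result must outperform the baseline result")
    score = 75.0 * (agent_value - baseline_value) / (reference_value - baseline_value)
    # Clamp to the 0-80 base range; the extra 20-point bonus is added automatically as described above.
    return max(0.0, min(score, 80.0))
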
  7. Create a unit test tests/test_${task_name}.py and implement your test cases in this file; a minimal sketch of such a test file is shown below. Ensure your evaluation passes all the local test cases, or run a single test case.

Ensure the evaluation data is correctly placed and that evaluations/${task_name}/task/evaluation.py passes your own tests.

# Run a single test case
python -m unittest tests.test_${task_name}
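
A minimal sketch of tests/test_${task_name}.py, assuming the TaskBenchmark interface above; replace your_task_name with your actual task folder name and adapt the assertions to your own metrics.

import unittest

# Replace `your_task_name` with the actual ${task_name} folder, e.g. task_17.
from evaluations.your_task_name.task.evaluation import TaskBenchmark


class TestTaskBenchmark(unittest.TestCase):
    def test_run_1_returns_score(self):
        # run_1() must return a dictionary containing the `score` key.
        result = TaskBenchmark().run_1()
        self.assertIn("score", result)
        self.assertGreaterEqual(result["score"], 0)


if __name__ == "__main__":
    unittest.main()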

Step 3: Prepare conda environment

  1. Create a symbolic link so that /workspace points to ./workspace in the repository (use an absolute target so the link resolves correctly).
mkdir -p workspace && ln -s "$(pwd)/workspace" /workspace
  2. Create and activate the base conda environment, then install the dependencies:
conda create --prefix /workspace/conda python=3.10  # You can use other python versions
conda activate /workspace/conda

# When using `pip install -e` for installation, you must specify the absolute path, otherwise it will error during actual testing
pip install -e /workspace/task/repositories/your_repo_name
... # Other installation code
  3. Pack the conda environment:
tar -cf conda.tar /workspace/conda
mkdir conda
mkdir conda/${task_name}
mv conda.tar conda/${task_name}/conda.tar

Step 4: Create a PR!

Congrats, you’ve created your first task!

At this point, you would normally create a PR to add your task to the repo.