Research Topic Registry
Step 1: Prepare agent’s workspace
- Clone the InnovatorBench Repo
git clone https://github.com/GAIR-NLP/InnovatorBench.git
cd InnovatorBench
- Create a new branch from dev/task_registry, name it task_<task_name>, and check out that branch.
export task_name=<task_name>
git checkout dev/task_registry
git checkout -b ${task_name}
- Create folders to annotate the task.
mkdir task_registry/
mkdir gym/
mkdir gym/${task_name}
mkdir gym/${task_name}/data/checkpoints/ # store the training checkpoints
mkdir gym/${task_name}/data/datasets/ # store the training datasets
mkdir gym/${task_name}/data/outputs/ # store the results from the agent
mkdir gym/${task_name}/task/ # store the code for the agent to complete the task
mkdir gym/${task_name}/task/repositories/ # store the original code repository
mkdir gym/${task_name}/task/scripts/ # store the scripts (launch scripts, dependency installation scripts, etc.)
- Create the task description task_description.md under gym/${task_name}/task/.
- The task can come from any paper or any other source, as long as it is an innovative research topic.
- Ensure the task is reasonably assessable: neither impossible nor trivial for the agent.
- Make sure task_description.md includes the following sections: Motivation, Task, Data, Constraint, Evaluation, and Environment. Do NOT include any hints!
- The Constraint section must specify the training-time limit, evaluation-time limit, number of GPUs, and the memory size of each GPU.
- Under the Task section, state that the agent must work inside /workspace.
- Ensure every requirement listed in task_description.md is verifiable in evaluations/${task_name}/task/evaluation.py.
- To prevent cheating, choose evaluation metrics that the agent cannot access; do not let the agent compute or store the metrics itself.
- You may prompt the agent to output its ideas to /workspace/data/outputs.
- Refer to the format in task_17 for guidance.
- A full training + testing run must not exceed 16 hours.
- Whenever possible, select creative / innovative tasks. For such tasks, add an extra markdown file evaluations/${task_name}/hint.md. This hint helps the agent by detailing the paper you reproduced as the baseline. If the agent chooses to read the hint, its score will be penalized; therefore, you must also add a note in task_description.md informing the agent how many points will be deducted for viewing the hint.
- If the task has an original repository, build the original repository in gym/${task_name}/task/repositories/.
cp -r path/to/repositories/ gym/${task_name}/task/repositories/
(The cp command is only a reference; the actual repository may need to be built manually.)
- If the original repository uses submodules, remove the .git directory inside the submodule and treat it as a regular subdirectory.
- Copy the training data, dev data, and test data without ground-truth labels into
gym/${task_name}/data/datasets/.
Step 2: Prepare evaluation data and code
- Create folders to store the evaluation data and code.
mkdir evaluations/
mkdir evaluations/${task_name}
mkdir evaluations/${task_name}/data/ # store the evaluation data
mkdir evaluations/${task_name}/data/references/ # store the reference evaluation data
mkdir evaluations/${task_name}/data/baselines/ # store the baseline evaluation data
mkdir evaluations/${task_name}/task/ # store the evaluation code
- Design the evaluation dimensions.
- You can design the evaluation dimensions yourself; here are some examples:
  - Model inference result
  - Accuracy score
- What types of data are not allowed as evaluation data?
  - Data whose evaluation requires GPUs, training, or inference.
  - Data that has no reference answer.
- For tasks that need to be scored based on ground truth, a rubric, or unit tests, place the corresponding reference data in evaluations/${task_name}/data/references. You may create sub-folders to store data for different evaluation dimensions.
  - Examples: specific functions that need unit tests, grading rubrics for the problem, etc.
- Place the expected agent outputs (what the agent would produce in its workspace) under evaluations/${task_name}/data/outputs. You can further create sub-folders to store data for different evaluation dimensions.
  - Adding the agent's outputs (or a baseline) here is only for convenience in verifying that the evaluation script is constructed correctly. In practice, add instructions in task_description.md directing the agent to output its results to /workspace/${task_name}/data/outputs.
- Create evaluations/${task_name}/task/config.py, inherit the Config class from evaluations/base/data_classes.py, and implement your configuration, named TaskConfig.
from evaluations.base.data_classes import Config

class TaskConfig(Config):
    pass
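If your evaluation needs configurable values (paths, thresholds, etc.), they can live on TaskConfig. Below is a minimal sketch assuming nothing about the base Config class beyond the import shown above; the field names and defaults are purely illustrative, not required by the framework.
from evaluations.base.data_classes import Config

class TaskConfig(Config):
    # Illustrative fields only -- adapt (or remove) these to match what your
    # evaluation.py actually reads; the base Config class does not require them.
    outputs_dir: str = "evaluations/${task_name}/data/outputs"        # substitute your task name
    references_dir: str = "evaluations/${task_name}/data/references"
    max_answer_length: int = 512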
- Create evaluations/${task_name}/task/evaluation.py and inherit the base interface from evaluations/base/base_eval.py. You only need to implement the run_1 method: choose the metrics yourself and return a dictionary keyed by metric name and valued by metric value. Name the class TaskBenchmark.
from evaluations.base.base_eval import BaseBenchmark
from evaluations.base.data_classes import Config
from typing import Dict, Any, Optional

class TaskBenchmark(BaseBenchmark):
    """
    paper_name: str = "paper_name" if there is a paper, else ""
    task_name: str = "task_name"
    """
    def __init__(self, config: Optional[Config] = None):
        # Initialize your benchmark
        super().__init__(config)

    def run_1(self) -> Dict[str, Any]:
        # Implement your test logic here.
        # Use relative paths, e.g. evaluations/${task_name}/data/outputs/outputs.json
        # Return a dictionary as the evaluation result, keyed by metric name, valued by metric value.
        # It must contain the `score` key; the format of the sub-scores is not required.
        # It is recommended to also include `error` and `message` keys,
        # e.g. {"error": "error_message", "message": "correct_message"}
        try:
            score_1 = self.cal_score_1()
            score_2 = self.cal_score_2()
        except Exception as e:
            return {
                "score": 0,
                "error": f"Error calculating the score: {e}",
                "message": None
            }
        score = score_1 + score_2
        result = {
            "score": score,
            "error": None,
            "message": f"The accuracy is {score_1}/80, and the length score is {score_2}/20."
        }
        return result

    def cal_score_1(self) -> int:
        # Calculate score_1, e.g. the accuracy of the answer (60% of the total score, plus a 20% extra bonus).
        # If the accuracy of the model trained by the agent surpasses the reference answer,
        # it can receive up to 80 points; otherwise it receives at most 60 points.
        return 60

    def cal_score_2(self) -> int:
        # Calculate score_2, e.g. the length of the answer (20% of the total score).
        return 20
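As a concrete illustration of cal_score_1, here is a hedged sketch of an accuracy-based score meant as a drop-in replacement for the placeholder method above. The file names (outputs.json, references.json), their JSON structure, and the reference accuracy are assumptions for this example, not requirements of the base interface; the 60-point base plus 20-point bonus mirrors the comment in cal_score_1.
import json

TASK_NAME = "task_xx"        # replace with your ${task_name}
REFERENCE_ACCURACY = 0.85    # assumed accuracy of the reference solution

def cal_score_1(self) -> int:
    # Load the agent's outputs and the hidden reference answers (relative paths).
    # Assumed layout: both files map question IDs to answer strings.
    with open(f"evaluations/{TASK_NAME}/data/outputs/outputs.json") as f:
        outputs = json.load(f)
    with open(f"evaluations/{TASK_NAME}/data/references/references.json") as f:
        references = json.load(f)

    # Fraction of questions whose answer exactly matches the reference.
    correct = sum(1 for qid, answer in references.items()
                  if outputs.get(qid, "").strip() == answer.strip())
    accuracy = correct / len(references) if references else 0.0

    # Up to 60 points scaled against the reference accuracy, plus up to 20 bonus
    # points when the agent surpasses the reference.
    base = min(accuracy / REFERENCE_ACCURACY, 1.0) * 60
    bonus = max(accuracy - REFERENCE_ACCURACY, 0.0) / (1.0 - REFERENCE_ACCURACY) * 20
    return int(round(base + bonus))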
Notes:
- The scores should cover every metric that you want the agent to achieve in task_description.md. For example, if you want the answer to be both short and good, you need to design at least a quality score and a length score.
- The scores calculated by the cal_score_1 and cal_score_2 functions must satisfy the following conditions:
  - The reference answer (evaluations/${task_name}/data/references) should score around 70-80 points.
  - The baseline (evaluations/${task_name}/data/baselines) should score around 0 points.
  - The extra 20 points is a bonus for when the model's performance exceeds the reference answer; it is added automatically.
- If the agent's output is too close to the reference answer (less than 5% difference), you need to redesign your task.
- Anticipate possible ways the agent could hack the evaluation; if a hack is detected, it must be penalized in the final result.
- Create a unit test tests/test_${task_name}.py and implement your test cases in this file. Ensure your evaluation can pass all the local test cases, or run a single test case (a sketch of such a test file follows the command below).
Ensure the evaluation data is correctly placed and passes your own evaluations/${task_name}/task/evaluation.py.
# Run a single test case
python -m unittest tests.test_${task_name}_benchmark
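A minimal sketch of such a test file, assuming the TaskConfig and TaskBenchmark classes created above; the import paths use a placeholder task_xx and the asserted score band follows the notes, so adjust both to your task.
import unittest

# Replace task_xx with your ${task_name}; these imports assume the files created above.
from evaluations.task_xx.task.config import TaskConfig
from evaluations.task_xx.task.evaluation import TaskBenchmark

class TestTaskBenchmark(unittest.TestCase):
    def setUp(self):
        # Assumes TaskConfig can be constructed with defaults, as in the sketch above.
        self.benchmark = TaskBenchmark(TaskConfig())

    def test_run_1_returns_required_keys(self):
        result = self.benchmark.run_1()
        self.assertIn("score", result)     # `score` is mandatory
        self.assertIn("error", result)     # `error` and `message` are recommended
        self.assertIn("message", result)

    def test_reference_outputs_score_in_expected_band(self):
        # With the reference outputs placed under evaluations/task_xx/data/outputs,
        # the score should land in the 70-80 band described in the notes above.
        result = self.benchmark.run_1()
        self.assertGreaterEqual(result["score"], 70)
        self.assertLessEqual(result["score"], 80)

if __name__ == "__main__":
    unittest.main()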
Step 3: Prepare conda environment
- Create a symbolic link from ./workspace to /workspace.
mkdir -p workspace && ln -s "$(pwd)/workspace" /workspace
- Create, activate, and install the base environment with conda:
conda create --prefix /workspace/conda python=3.10 # You can use other python versions
conda activate /workspace/conda
# When using `pip install -e` for installation, you must specify the absolute path, otherwise it will error during actual testing
pip install -e /workspace/task/repositories/your_repo_name
... # Other installation code
- Pack the conda environment
tar -cf conda.tar /workspace/conda
mkdir conda
mkdir conda/${task_name}
mv conda.tar conda/${task_name}/conda.tar
Step 4: Create a PR!
Congrats, you’ve created your first task!
Normally, at this point you would create a PR to add your task to the repo.