Motivation
Prompt-based Deep Research systems remain the most practical approach for real-world deployment due to their interpretability, controllability, and lower computational requirements compared to end-to-end trained models. However, designing an effective prompt-based deep research agent that can handle complex multi-step reasoning, information synthesis, and adaptive search strategies is challenging. Understanding how to leverage powerful foundation models such as GPT-4.1 with carefully designed prompts can provide insights into building robust deep research agents without extensive training overhead.
Task
Task Description
Your task is to build a prompt-based deep research agent using the GPT-4.1 model as the backbone for deep research and the GPT-4.1-mini model as the backbone for web browsing. You will work with a basic toolkit that provides fundamental web search and browsing capabilities, and design prompts that orchestrate these tools to conduct systematic research.
The research agent should be capable of:
- Handling complex research questions that require multi-step reasoning
- Conducting systematic information gathering from web sources
- Synthesizing information from multiple sources
- Providing well-researched, accurate answers with proper source attribution
- Working with both English and Chinese content
Your agent will be evaluated on a benchmark dataset of complex research questions and fact-seeking questions that require the agent to perform web searches to find the answers. The goal is to create an effective research system that can handle diverse question types and provide accurate answers.
Think carefully about how to design the workflow or the agentic framework, what prompting strategies would be most effective, and how to best utilize the available tools to create a robust research agent.
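One widely used prompting strategy for this kind of agent is an explicit think/act/observe loop, in which the model plans, calls exactly one tool, and records what it learned before deciding the next step. The sketch below shows what such a system prompt might look like; the tool names (web_search, browse_page) and the output tag are illustrative assumptions, not interfaces defined by the starter repository.

```python
# Illustrative only: one possible ReAct-style system prompt for the research agent.
# The tool names and the FINAL ANSWER tag are assumptions, not part of the starter code.
RESEARCH_SYSTEM_PROMPT = """You are a deep research agent. Answer the user's question by
repeating the following steps until you are confident in the answer:

1. THINK: break the question into sub-questions and decide what to look up next.
2. ACT: call exactly one tool:
   - web_search(query): run a Google search (queries may be in English or Chinese).
   - browse_page(url, goal): read a web page and extract information relevant to the goal.
3. OBSERVE: summarize what the tool returned and update your research notes.

When you have gathered enough evidence, reply with:
FINAL ANSWER: <a concise answer, followed by the source URLs you relied on>

Never answer from memory alone; every factual claim must be backed by a search result.
"""
```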
The Starter Repository and Your Task
You can find the starter repository in repositories/deep_research_agent. This repo contains the following components:
- Research agent: repositories/deep_research_agent/agents/research_agent.py, the main component of this deep research framework. Your task is to implement this agent.
- Web search tool: repositories/deep_research_agent/toolkit/search_engine_tool.py, the tool for performing Google searches.
- Web browsing agent: repositories/deep_research_agent/agents/browsing_agent.py, the agent that scrapes web content and extracts relevant information from web pages.
Thus, your specific task is to implement the code for the research agent in repositories/deep_research_agent/agents/research_agent.py. We recommend writing the prompts of the research agent in the file repositories/deep_research_agent/agents/research_agent_prompts.py.
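As a starting point, the sketch below shows one possible shape for the research agent: a loop in which GPT-4.1 emits a JSON action, the orchestrator executes it with the search tool or the browsing agent, and the observation is fed back. The class name, the action schema, and the tool call signatures are assumptions for illustration; adapt them to the actual interfaces in search_engine_tool.py and browsing_agent.py.

```python
# Minimal sketch of research_agent.py, assuming injected tool callables.
# The JSON action protocol and all names here are illustrative assumptions.
import json
from typing import Callable

from openai import OpenAI

SYSTEM_PROMPT = (
    "You are a deep research agent. At each step reply with exactly one JSON object: "
    '{"action": "search", "query": "..."}, '
    '{"action": "browse", "url": "...", "goal": "..."}, or '
    '{"action": "answer", "answer": "..."}.'
)


class ResearchAgent:
    def __init__(self, search_tool: Callable[[str], str],
                 browse_tool: Callable[[str, str], str],
                 model: str = "gpt-4.1", max_steps: int = 15):
        self.client = OpenAI()           # reads OPENAI_API_KEY / OPENAI_BASE_URL
        self.search_tool = search_tool   # e.g. a wrapper around the Serper search tool
        self.browse_tool = browse_tool   # e.g. a wrapper around the GPT-4.1-mini browsing agent
        self.model = model
        self.max_steps = max_steps

    def run(self, question: str) -> str:
        messages = [{"role": "system", "content": SYSTEM_PROMPT},
                    {"role": "user", "content": question}]
        for _ in range(self.max_steps):
            reply = self.client.chat.completions.create(model=self.model, messages=messages)
            content = reply.choices[0].message.content
            messages.append({"role": "assistant", "content": content})
            try:
                action = json.loads(content)
            except json.JSONDecodeError:
                messages.append({"role": "user", "content": "Reply with a single JSON action."})
                continue
            if action.get("action") == "answer":
                return action.get("answer", "")
            if action.get("action") == "search":
                observation = self.search_tool(action.get("query", question))
            elif action.get("action") == "browse":
                observation = self.browse_tool(action.get("url", ""), action.get("goal", question))
            else:
                observation = "Unknown action; use search, browse, or answer."
            messages.append({"role": "user", "content": f"Observation:\n{observation}"})
        return "No confident answer found within the step budget."
```

In practice the two callables would wrap whatever interfaces search_engine_tool.py and browsing_agent.py expose, and the prompt text would live in research_agent_prompts.py.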
Once you have done the coding, you can run the following command to do deep research on the dev set:
# Make sure you are under the root directory of the task
cd repositories/deep_research_agent
# Run the script for predictions on the dev set (this will take hours to run)
python main.py --split=dev
This will create the prediction file for evaluation. See the "Evaluation on Dev set" section below for details on how to evaluate the performance of your agent.
You may submit your answer file up to 3 times (each with a different workflow). Try your best to achieve the highest score.
Data
We provide a dev set (25 examples) and a test set (100 examples). The dev set is a JSON file containing 25 question-answer pairs, and the test set is a JSON file containing 100 questions without ground-truth answers.
The dev set is stored in ../data/datasets/dev.json. The format of the dev set is as follows:
[
{
"prompt": "<The question of the first example>",
"answer": "<The answer of the first example>"
},
{
"prompt": "<The question of the second example>",
"answer": "<The answer of the second example>"
},
...
]
The test set is stored in ../data/datasets/test.json. The format of the test set is as follows:
[
{
"prompt": "<The question of the first example>"
},
...
]
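For reference, a minimal snippet for loading both splits and checking the expected fields (paths assume you run it from repositories/deep_research_agent, as in the commands above):

```python
# Minimal sketch: load the two splits and check the expected fields.
import json

with open("../data/datasets/dev.json", encoding="utf-8") as f:
    dev_set = json.load(f)      # list of {"prompt": ..., "answer": ...}
with open("../data/datasets/test.json", encoding="utf-8") as f:
    test_set = json.load(f)     # list of {"prompt": ...}

assert all("prompt" in ex and "answer" in ex for ex in dev_set)
assert all("prompt" in ex for ex in test_set)
print(len(dev_set), "dev examples,", len(test_set), "test questions")
```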
Constraints
Your solution should meet the following requirements:
- Maximum research time: 10 minutes per question (one way to enforce this is sketched after this list)
- Must handle both English and Chinese content
- Must use the search tool to obtain answers; the model cannot answer without searching.
- You may only use the GPT-4.1 model for deep research and the GPT-4.1-mini model for web browsing. You cannot use other models.
- You must not search for the raw data of the dev or test set. We have a monitor that checks the model's search history; if you search for the raw data, you will be penalized.
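To respect the 10-minute-per-question budget, it can help to wrap each question in a hard timeout. The pattern below is only an illustration (research_one_question is a hypothetical callable wrapping your agent); since a running thread cannot be killed, the agent loop should also track its own elapsed time.

```python
# Illustrative pattern for enforcing the 10-minute-per-question limit.
# `research_one_question` is a hypothetical callable wrapping your agent.
import concurrent.futures

TIME_LIMIT_SECONDS = 600  # 10 minutes per question

def answer_with_timeout(research_one_question, question: str) -> str:
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(research_one_question, question)
    try:
        return future.result(timeout=TIME_LIMIT_SECONDS)
    except concurrent.futures.TimeoutError:
        # We stop waiting here; the worker thread may still finish in the background,
        # so the agent itself should also check elapsed time and exit early.
        return "No answer found within the time limit."
    finally:
        pool.shutdown(wait=False)
```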
Evaluation
Evaluation Metric
We use LLM-as-a-Judge to verify the correctness of the predicted answers. You can find the details of the evaluation in scripts/eval_on_dev.py. For each question, ground-truth answer, and predicted answer, we use the o3 model to judge whether the predicted answer is correct.
The evaluation metric is accuracy, i.e., the ratio of the number of correct predictions to the total number of predictions.
The final score will be calculated from this accuracy.
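The authoritative judging logic lives in scripts/eval_on_dev.py. Purely to illustrate the LLM-as-a-Judge pattern, an accuracy computation of this kind typically looks like the sketch below; the judge prompt wording and the YES/NO parsing are assumptions, not the actual script.

```python
# Illustration only: the actual evaluation is performed by scripts/eval_on_dev.py.
# The judge prompt, parsing rule, and o3 call below are assumptions about how an
# LLM-as-a-Judge accuracy metric is typically computed.
from openai import OpenAI

client = OpenAI()

def judge(question: str, gold: str, predicted: str) -> bool:
    prompt = (f"Question: {question}\nGround-truth answer: {gold}\n"
              f"Predicted answer: {predicted}\n"
              "Does the predicted answer match the ground truth? Reply YES or NO.")
    reply = client.chat.completions.create(model="o3",
                                           messages=[{"role": "user", "content": prompt}])
    return reply.choices[0].message.content.strip().upper().startswith("YES")

def accuracy(examples, predictions) -> float:
    correct = sum(judge(ex["prompt"], ex["answer"], pred["answer"])
                  for ex, pred in zip(examples, predictions))
    return correct / len(examples)
```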
Evaluation on Dev set
Once you have generated the predictions on the dev set, you can run the following command to evaluate the performance of your agent on the dev set:
# Make sure you are under the root directory of the task
cd repositories/deep_research_agent
# Run the evaluation script
python /workspace/task/scripts/eval_on_dev.py
This will print the accuracy of your agent on the dev set.
Evaluation on Test set
Run the following code to do deep research on the test set:
# Make sure you are under the root directory of the task
cd repositories/deep_research_agent
# Run the script for predictions on the test set (this will take hours to run)
python main.py --split=test
Then you can call the evaluation tool to get the evaluation results on the test set. Note that the evaluation takes several minutes to run.
Output Format Requirements:
- File Location: /workspace/data/outputs/test.json
- File Format: JSON array containing research results
- Required Fields: Each JSON object should contain at least 2 keys:
  - "prompt" (string): The original question (copied from input)
  - "answer" (string): The answer.
Important Notes:
- The order must be identical to /workspace/data/datasets/test.json
- Both English and Chinese content should be handled appropriately
- The evaluation will only check the content inside /workspace/data/outputs/test.json
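A minimal sketch of writing the output file in this format while preserving the input order (write_predictions and its answers argument are hypothetical; answers must be aligned with the test questions):

```python
# Minimal sketch: write /workspace/data/outputs/test.json in the required format,
# preserving the order of /workspace/data/datasets/test.json.
import json

def write_predictions(answers: list[str]) -> None:
    # `answers` is a hypothetical list of answer strings aligned with the test questions.
    with open("/workspace/data/datasets/test.json", encoding="utf-8") as f:
        test_set = json.load(f)
    results = [{"prompt": ex["prompt"], "answer": ans} for ex, ans in zip(test_set, answers)]
    with open("/workspace/data/outputs/test.json", "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps Chinese answers human-readable in the output file
        json.dump(results, f, ensure_ascii=False, indent=2)
```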
Environment
We have provided a conda environment at /workspace/conda, and it has already been activated.
API key
We will provide you with either OpenAI API or Azure OpenAI access:
- OpenAI API: access via the environment variables OPENAI_API_KEY and OPENAI_BASE_URL
- Azure OpenAI: access via the environment variables AZURE_OPENAI_API_KEY, AZURE_OPENAI_BASE_URL, and AZURE_OPENAI_API_VERSION

We will also provide Serper API access via the environment variable SERPER_API_KEY.
You can read the environment variables to determine which API is provided and modify the relevant files accordingly.
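For example, client selection could be driven by those variables as sketched below (the fallback order is an assumption; the variable names are the ones listed above):

```python
# Sketch: pick the available backend from the environment variables listed above.
import os

from openai import AzureOpenAI, OpenAI

def make_client():
    if os.getenv("AZURE_OPENAI_API_KEY"):
        return AzureOpenAI(
            api_key=os.environ["AZURE_OPENAI_API_KEY"],
            azure_endpoint=os.environ["AZURE_OPENAI_BASE_URL"],
            api_version=os.environ["AZURE_OPENAI_API_VERSION"],
        )
    # Otherwise fall back to the standard OpenAI client, which reads
    # OPENAI_API_KEY and OPENAI_BASE_URL from the environment by itself.
    return OpenAI()

SERPER_API_KEY = os.getenv("SERPER_API_KEY")  # used by the search engine tool
```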