Motivation
Effective reward functions are crucial for training GUI grounding models via reinforcement learning. A key challenge lies in designing reward functions that can accurately assess the correctness of GUI actions across diverse platforms (e.g., mobile, desktop, and web), each with distinct action spaces. This task centers on the implementation and evaluation of a unified reward function capable of providing reliable feedback for training GUI grounding models.
Task
Your task is to implement a unified reward function for training GUI grounding models, aiming to achieve the highest possible evaluation score. This function must evaluate three key components of a GUI action: the action type (e.g., click, scroll, type), the coordinates of the click point, and any associated input text.
The reward function must be capable of handling actions from multiple platforms and outputting a single scalar reward for each predicted action.
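A minimal sketch of one possible reward design is shown below. It assumes a Verl-style compute_score function that receives the model's decoded response string together with the ground-truth fields from the dataset; the exact signature expected by this repository, the response format parsed by the regexes, and the component weights are all assumptions to verify against the code in /workspace/task/repositories/GUI.

```python
import re

def compute_score(response: str, gt_action: str, gt_bbox, gt_input_text: str) -> float:
    """Hypothetical unified reward: action type + click point + input text.

    Assumes the model emits something like:
        action: click, point: (0.31, 0.36), input_text: no input text
    Adjust the parsing to the actual prompt/response template used in training.
    """
    # --- predicted action type ---
    action_match = re.search(r"action:\s*(\w+)", response)
    pred_action = action_match.group(1).lower() if action_match else None
    action_reward = 1.0 if pred_action == gt_action else 0.0

    # --- predicted click point (normalized x, y), rewarded if inside gt_bbox ---
    point_match = re.search(r"\(([\d.]+),\s*([\d.]+)\)", response)
    coord_reward = 0.0
    if point_match:
        x, y = float(point_match.group(1)), float(point_match.group(2))
        x1, y1, x2, y2 = gt_bbox
        if x1 <= x <= x2 and y1 <= y <= y2:
            coord_reward = 1.0

    # --- predicted input text, only scored when the ground truth expects one ---
    text_reward = 1.0
    if gt_input_text != "no input text":
        text_match = re.search(r"input_text:\s*(.+)", response)
        pred_text = text_match.group(1).strip() if text_match else ""
        text_reward = 1.0 if pred_text.lower() == gt_input_text.lower() else 0.0

    # Weighted sum; the weights are a design choice, not prescribed by the task.
    return 0.2 * action_reward + 0.6 * coord_reward + 0.2 * text_reward
```

Register the function by setting worker.reward.compute_score in training.sh, as described under Scripts below.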
You should work under the /workspace directory. We provide a basic repository based on Verl for training the GUI grounding model.
You can start the training by running the following command (implement the reward function and start Ray before running this command):
bash /workspace/task/repositories/GUI/examples/training.sh
You should save the trained model under /workspace/data/checkpoints/.
After finishing the training, you should evaluate the model on the test set by running the following command:
bash /workspace/task/scripts/inference.sh
Your output should be stored in /workspace/data/outputs/screenspot_test_no_gt.jsonl and /workspace/data/outputs/screenspot_pro_test_no_gt.jsonl.
You may submit your answer in the files above up to 3 times, each using a different reward design method with its corresponding training and inference. Strive to achieve the highest possible score.
Data
The dataset is divided into training, development, and test sets.
Data Fields:
- id: A unique identifier for each data sample.
- image: A screenshot of the GUI from a mobile, desktop, or web platform.
- gt_bbox: The ground-truth bounding box for the target UI element.
- instruction: A natural language instruction describing the task to be performed.
- gt_action: The ground-truth action type (e.g., 'click', 'type').
- gt_input_text: The ground-truth text to be inputted; "no input text" if not applicable.
- history: A record of previous actions in the task; "None" if no history exists.
- task_type: The category or domain of the task (e.g., 'low', 'creative').
Datasets:
- Training Set: Located at /workspace/data/datasets/train.parquet.
- Development Set: Located at /workspace/data/datasets/dev.parquet.
- Test Set: After training, you should evaluate the model on the test sets /workspace/data/datasets/screenspot_test_no_gt.parquet and /workspace/data/datasets/screenspot_pro_test_no_gt.parquet.
Example Data Format:
{
"gt_bbox": [0.24765625, 0.28611111, 0.3859375, 0.44305556],
"instruction": "click the UI element Elizabeth Barrett Browning and Robert Browning",
"id": 2418,
"gt_action": "click",
"gt_input_text": "no input text",
"history": "None",
"task_type": "low"
}
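To inspect a split before designing the reward, the parquet files can be read with pandas; a minimal sketch, assuming pandas and pyarrow are available in the environment:

```python
import pandas as pd  # assumes pandas + pyarrow are installed

# Load the training split and inspect one sample's fields.
df = pd.read_parquet("/workspace/data/datasets/train.parquet")
print(df.columns.tolist())
print(df.iloc[0][["instruction", "gt_action", "gt_bbox", "task_type"]])
```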
Checkpoint:
- A Qwen2.5-VL-7B-Instruct model is available at
/workspace/data/checkpoints/Qwen2.5-VL-7B-Instruct.
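Outside the training pipeline, the checkpoint can be loaded with the standard Transformers classes for Qwen2.5-VL for a quick sanity check; a minimal sketch, where device_map="auto" assumes accelerate is installed:

```python
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_path = "/workspace/data/checkpoints/Qwen2.5-VL-7B-Instruct"
processor = AutoProcessor.from_pretrained(model_path)
# device_map="auto" shards the 7B weights across the available GPUs.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto"
)
```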
Constraints
- Working Time Budget: 24 hours.
- Training Time Limit: 12 hours.
- Evaluation Time Limit: 10 minutes.
- Hardware: Access to 8 x 80GB GPUs is provided.
- Submission Attempts: You are limited to 3 evaluation submissions.
Evaluation
Evaluation Metrics
- Accuracy: The primary metric is the accuracy of the trained GUI grounding model on the held-out test set, categorized by task type. An action is considered correct if the predicted x and y coordinates are within the ground-truth bounding box (gt_bbox).
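In code, the correctness check and the per-task-type aggregation look roughly like this (a sketch; the pred_x and pred_y field names are hypothetical):

```python
def is_correct(pred_x: float, pred_y: float, gt_bbox) -> bool:
    """True if the predicted point falls inside the ground-truth box."""
    x1, y1, x2, y2 = gt_bbox
    return x1 <= pred_x <= x2 and y1 <= pred_y <= y2

def accuracy_by_task_type(samples) -> dict:
    """samples: dicts with hypothetical keys pred_x, pred_y, gt_bbox, task_type."""
    hits, totals = {}, {}
    for s in samples:
        t = s["task_type"]
        totals[t] = totals.get(t, 0) + 1
        hits[t] = hits.get(t, 0) + int(is_correct(s["pred_x"], s["pred_y"], s["gt_bbox"]))
    return {t: hits[t] / totals[t] for t in totals}
```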
Success Criteria
To succeed, the accuracy of your trained agent must surpass the following baseline scores from the ScreenSpot and ScreenSpot-Pro benchmarks:
- ScreenSpot-Pro Baselines
- Dev
- Text: 50.6
- Icon: 4.8
- Creative
- Text: 37.4
- Icon: 8.4
- CAD
- Text: 23.4
- Icon: 6.2
- Scientific
- Text: 54.9
- Icon: 11.8
- Office
- Text: 57.6
- Icon: 28.3
- OS
- Text: 41.1
- Icon: 18.0
- ScreenSpot Baselines
- Web
- Text: 91.3
- Icon: 75.7
- Desktop
- Text: 93.3
- Icon: 72.9
- Mobile
- Text: 96.3
- Icon: 77.3
Environment
Execution Environment
A pre-configured Conda environment, /workspace/conda, has been provided and activated for this task. This environment includes the following key packages:
- Python 3.9
- PyTorch 2.0+ with CUDA support
- Transformers 4.49.0
- NumPy, Matplotlib
File Structure
A recommended file structure is provided below:
/workspace/
├── conda/
├── data/
│   ├── checkpoints/
│   │   └── Qwen2.5-VL-7B-Instruct/            # do not modify this directory
│   ├── datasets/
│   │   ├── screenspot_pro_test_no_gt.parquet  # do not modify this file
│   │   ├── screenspot_test_no_gt.parquet      # do not modify this file
│   │   ├── dev.parquet                        # do not modify this file
│   │   └── train.parquet                      # do not modify this file
│   └── outputs/
└── task/
    ├── repositories/
    │   └── GUI/
    ├── scripts/
    │   ├── inference/                         # do not modify this directory
    │   │   └── inference_vllm_screenspot.py   # do not modify this script
    │   ├── inference.sh                       # do not modify this script
    │   └── ...                                # you may add scripts here and modify the ones you add
    └── task_description.md
Scripts
You may add new scripts to the /workspace/task/scripts directory, but you must not modify the scripts that were originally in that directory.
The following scripts are provided to you; do not modify them:
- /workspace/task/repositories/GUI/examples/training.sh: The script for training the model. Set worker.reward.compute_score to the name of the reward function you implement.
- /workspace/task/repositories/GUI/scripts/model_merger.py: Given the path of a verl checkpoint (a directory containing multiple model_world_size_8_rank_{rank_number}.pt files), this script merges the model weights into huggingface format.
  - Input: --local_dir, the path of the verl checkpoint.
  - Output: the checkpoint in huggingface format.
- /workspace/task/scripts/inference.sh: Evaluates the model on the test set after training by running the inference_vllm_screenspot.py script.
  - Input: none, but run it only after you have trained the model.
  - Output: the evaluation results, saved in /workspace/data/outputs/.
- /workspace/task/scripts/inference/inference_vllm_screenspot.py: Given a model, a test set, and an output path, this script evaluates the model on the test set and saves the results to the output path.
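For example, a verl checkpoint saved during training can be merged into huggingface format before inference; a minimal sketch, where the checkpoint directory is a hypothetical placeholder and only the documented --local_dir flag is passed:

```python
import subprocess

# Hypothetical verl checkpoint directory produced by a training run.
ckpt_dir = "/workspace/data/checkpoints/my_run/global_step_100/actor"

# Merge the rank-sharded .pt files into a huggingface-format checkpoint.
subprocess.run(
    ["python", "/workspace/task/repositories/GUI/scripts/model_merger.py",
     "--local_dir", ckpt_dir],
    check=True,
)
```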
Notes
Always start verl from /workspace/task/repositories/GUI, since a separate verl installation exists in the environment.