Motivation
Current multimodal large language models still face challenges in complex visual reasoning tasks, particularly when dealing with spatial relationship understanding, visual perception, semantic correlation analysis, and depth reasoning. These problems often require sophisticated visual analysis capabilities, combining image understanding with logical reasoning. Improving AI performance in visual reasoning domains is crucial for advancing artificial intelligence applications in computer vision, robotics, and human-computer interaction.
Task
Your task is to develop an efficient visual reasoning system that can accurately solve various complex visual understanding and reasoning problems.
You should work under the /workspace directory.
You need to handle the following 4 different types of visual reasoning tasks:
- vstar - Visual star pattern recognition and spatial relationship reasoning
- blink_viscorr - Visual correlation analysis
- blink_jigsaw - Visual puzzle and combination reasoning
- blink_depth - Depth perception and 3D understanding
Your objectives are:
- Implement an efficient multimodal visual reasoning workflow: Build a framework capable of handling multiple types of visual reasoning problems; you may use tools such as Python. Your target is to make the final score as high as possible.
- Inference: Run inference on the dev set and test set using this workflow with the GPT-4o API.
You can submit your results up to 3 times (each with a different reasoning workflow and its corresponding inference results). You should try your best to achieve the highest score.
Important Note: All final results and reasoning processes must be saved in the /workspace/data/outputs/ directory.
Data
You will use diverse visual reasoning datasets organized into training and evaluation sets:
Training Data (develop set):
- Location: /workspace/data/datasets/develop/ (use for development, pattern analysis, and strategy formulation)
Evaluation Data (test set):
- Location: /workspace/data/datasets/test/ (used for final evaluation; answers removed)
The dataset contains 4 different visual task subsets organized as follows:
/workspace/data/datasets/
├── develop/
│ ├── vstar/ # 119 instances
│ │ ├── relative_position@sa_xxxxx/
│ │ ├── direct_attributes@sa_xxxxx/
│ │ ├── OCR@textvqa_x/
│ │ └── ...
│ ├── blink_viscorr/ # 86 instances
│ ├── blink_jigsaw/ # 75 instances
│ │ ├── val_Jigsaw_xx/
│ │ └── ...
│ └── blink_depth/ # 62 instances
├── test/
│ ├── vstar/ # 119 instances (no answers)
│ ├── blink_viscorr/ # 86 instances (no answers)
│ ├── blink_jigsaw/ # 75 instances (no answers)
│ └── blink_depth/ # 62 instances (no answers)
Total Dataset Statistics:
- vstar: 238 total (119 develop + 119 test)
- blink_viscorr: 172 total (86 develop + 86 test)
- blink_jigsaw: 150 total (75 develop + 75 test)
- blink_depth: 124 total (62 develop + 62 test)
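A minimal sketch for walking the develop split, assuming each problem directory contains a request.json as described in the per-task formats below:

import json
from pathlib import Path

DEV_ROOT = Path("/workspace/data/datasets/develop")

# Walk every task subset, then every problem directory, and load its request.json.
for task_dir in sorted(DEV_ROOT.iterdir()):
    if not task_dir.is_dir():
        continue
    for problem_dir in sorted(p for p in task_dir.iterdir() if p.is_dir()):
        request_file = problem_dir / "request.json"
        if not request_file.exists():
            continue
        request = json.loads(request_file.read_text())
        # The develop split includes the ground-truth "answer"; the test split does not.
        print(task_dir.name, problem_dir.name, request.get("answer"))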
The dataset contains problems for the following 4 visual reasoning tasks:
1. vstar (Visual Star Pattern Recognition)
- File Format: Each sample contains:
  - request.json: Contains query text, image paths, options, and ground truth answer
  - sa_xxxxx.jpg: Corresponding star pattern image
- Task Objective: Recognize star patterns and understand spatial positional relationships
- Task Types: Includes relative position judgment, direct attribute recognition, OCR text recognition, and GPT4V-hard challenges
- Data Format Example:
{ "target_object": ["bucket", "cyclist"], "bbox": [[1904, 906, 46, 54], [882, 899, 22, 62]], "question": "Is the bucket on the left or right side of the cyclist?", "options": [ "The bucket is on the left side of the cyclist.", "The bucket is on the right side of the cyclist." ], "query": "<img src='../tasks/vstar/processed/relative_position@sa_86732/sa_86732.jpg'> Is the bucket on the left or right side of the cyclist? Options: (A) The bucket is on the left side of the cyclist. (B) The bucket is on the right side of the cyclist.", "images": ["../tasks/vstar/processed/relative_position@sa_86732/sa_86732.jpg"], "answer": "(B)" } - Challenge: Requires precise spatial understanding and pattern recognition capabilities
2. blink_viscorr (Visual Correlation Analysis)
- File Format: Each sample contains:
  - request.json: Contains query about visual correlations
  - image1.jpg, image2.jpg: Two images for correspondence analysis
- Task Objective: Analyze visual correlations and find corresponding points between different camera positions or lighting conditions
- Data Format Example:
{ "query": "<img src='../tasks/blink_viscorr/processed/val_Visual_Correspondence_98/image1.jpg'> <img src='../tasks/blink_viscorr/processed/val_Visual_Correspondence_98/image2.jpg'> A point is circled on the first image, labeled with REF. We change the camera position or lighting and shoot the second image. You are given multiple red-circled points on the second image, choices of \"A, B, C, D\" are drawn beside each circle. Which point on the second image corresponds to the point in the first image? Select from the following options.\n(A) Point A\n(B) Point B\n(C) Point C\n(D) Point D", "images": [ "../tasks/blink_viscorr/processed/val_Visual_Correspondence_98/image1.jpg", "../tasks/blink_viscorr/processed/val_Visual_Correspondence_98/image2.jpg" ], "answer": "(A)" } - Challenge: Requires understanding abstract relationships between visual elements under different conditions
3. blink_jigsaw (Visual Puzzle Reasoning)
- File Format: Each sample contains:
  - request.json: Contains puzzle-related query
  - image1.jpg, image2.jpg, image3.jpg: Multiple images for jigsaw puzzle analysis
- Task Objective: Understand combination and arrangement relationships of image fragments
- Data Format Example:
{ "query": "<img src='../tasks/blink_jigsaw/processed/val_Jigsaw_99/image1.jpg'> <img src='../tasks/blink_jigsaw/processed/val_Jigsaw_99/image2.jpg'> <img src='../tasks/blink_jigsaw/processed/val_Jigsaw_99/image3.jpg'> Given the first image with the lower right corner missing, can you tell which one of the second image or the third image is the missing part? Imagine which image would be more appropriate to place in the missing spot. You can also carefully observe and compare the edges of the images.\nSelect from the following choices.\n\n(A) the second image\n(B) the third image\n", "images": [ "../tasks/blink_jigsaw/processed/val_Jigsaw_99/image1.jpg", "../tasks/blink_jigsaw/processed/val_Jigsaw_99/image2.jpg", "../tasks/blink_jigsaw/processed/val_Jigsaw_99/image3.jpg" ], "answer": "(A)" } - Challenge: Requires spatial reasoning and pattern matching capabilities
4. blink_depth (Depth Perception)
- File Format: Each sample contains:
  - request.json: Contains depth-related query
  - image.jpg: Image for depth analysis
- Task Objective: Analyze depth information and 3D spatial relationships in images
- Data Format Example:
{ "query": "<img src='../tasks/blink_depth/processed/val_Relative_Depth_99/image.jpg'> Two points are circled on the image, labeled by A and B beside each circle. Which point is closer to the camera?\nSelect from the following choices.\n(A) A is closer\n(B) B is closer", "images": ["../tasks/blink_depth/processed/val_Relative_Depth_99/image.jpg"], "answer": "(A)" } - Challenge: Requires inferring 3D spatial information from 2D images
Available Tools
Visual Expert Server Setup
We provide three visual expert servers for you to use:
1. SOM (Segment-and-Mark)
The HTTP server is running at a URL that you can get from SOM_ADDRESS.
2. GroundingDINO Server
The HTTP server is running at a URL that you can get from GROUNDING_DINO_ADDRESS.
3. Depth-Anything Server
The HTTP server is running at a URL that you can get from DEPTH_ANYTHING_ADDRESS.
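For example, assuming these addresses are exposed as environment variables (check your environment setup), they can be read like this:

import os

# Assumed: SOM_ADDRESS, GROUNDING_DINO_ADDRESS, and DEPTH_ANYTHING_ADDRESS are
# environment variables holding the HTTP endpoints of the three expert servers.
som_url = os.environ["SOM_ADDRESS"]
grounding_dino_url = os.environ["GROUNDING_DINO_ADDRESS"]
depth_anything_url = os.environ["DEPTH_ANYTHING_ADDRESS"]
print(som_url, grounding_dino_url, depth_anything_url)

In practice, the tools in /workspace/task/scripts/tools.py are expected to wrap these servers, so you will usually not need to call them directly.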
Visual Tools API
You can use the visual tools located at /workspace/task/scripts/tools.py, including:
Core Visual Tools:
- segment_and_mark
  - Function: Image segmentation and marking to help identify objects and spatial relationships
  - Usage: Object segmentation, region marking, spatial relationship analysis
  - Applicable Tasks: vstar, blink_spatial, blink_semcorr
- detection
  - Function: Object detection to identify specific objects in images
  - Usage: Object recognition, localization, attribute analysis
  - Applicable Tasks: All task types
- depth
  - Function: Depth estimation to analyze 3D information in images
  - Usage: Depth analysis, 3D spatial understanding, relative position judgment
  - Applicable Tasks: blink_depth, blink_spatial
- crop_image
  - Function: Image cropping to focus on specific regions
  - Usage: Region extraction, detail analysis
  - Applicable Tasks: All task types
- zoom_in_image_by_bbox
  - Function: Region zooming for detailed analysis of specific areas
  - Usage: Fine-grained analysis, local magnification
  - Applicable Tasks: vstar, mmvp, blink_jigsaw
- sliding_window_detection
  - Function: Sliding window detection for systematic image analysis
  - Usage: Global scanning, pattern recognition
  - Applicable Tasks: vstar, blink_jigsaw
- overlay_images
  - Function: Image overlay for comparing and analyzing multiple images
  - Usage: Image comparison, correspondence analysis
  - Applicable Tasks: blink_viscorr, blink_semcorr, blink_jigsaw
Tool Usage Examples:
We have provided an example in /workspace/task/scripts/tool_example.py showing how to use the tools. For details, please refer to /workspace/task/scripts/tools.py.
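A minimal sketch of calling the tools from Python is shown below; the function names come from the list above, but the call signatures, argument types, and image path are hypothetical, so consult tool_example.py and tools.py for the actual API:

import sys

sys.path.append("/workspace/task/scripts")
import tools  # the provided visual tools module

# Hypothetical path and signatures -- replace with a real problem image and the
# real API from tool_example.py / tools.py.
image_path = "/workspace/data/datasets/develop/blink_depth/<problem_id>/image.jpg"

depth_map = tools.depth(image_path)                 # depth estimation (blink_depth)
detections = tools.detection(image_path, "bucket")  # object detection (vstar)
segments = tools.segment_and_mark(image_path)       # segmentation and marking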
Constraints
Performance Constraints:
- Development Time Limit: Maximum 12 hours to complete system implementation and testing
- Single Problem Evaluation Time Limit: Each visual problem solving time should not exceed 3 minutes
- Model Requirement: Must use GPT-4o model for reasoning
- API Calls: Use OpenAI/Azure OpenAI API keys provided via environment variables
Technical Constraints:
- Model Interface: Use OpenAI GPT-4o or Azure OpenAI services
- Programming Language: Python
- Output Format: JSON format result files
- Resource Limitation: Efficiently utilize API calls and optimize reasoning efficiency
- Environment: Use provided visual reasoning environment and tools
Evaluation
Your system will be evaluated using a two-part scoring system with a maximum total score of 100 points:
Scoring System:
Part 1: File Validation (10 points)
- Purpose: Verifies that output files exist and contain proper answer fields
- Scoring: (Valid Files / Total Files) × 10
- Requirements:
  - Each problem must have a result file at the correct path: outputs/{task_type}/{problem_id}/result.json
  - Each result file must contain either an "answer" or "label" field
  - Each file must be valid JSON
Part 2: Task Performance (90 points)
- Purpose: Measures accuracy across all visual reasoning tasks
- Scoring: (Total Correct Answers / Total Problems) × 90 (the Task Performance Score is calculated from this score, not the original score)
- Calculation: Based on all visual problems across the 4 task types
- Note: This measures overall success rate across all visual task types combined
Total Score Calculation:
Total Score = File Validation Score + Task Performance Score
Maximum Score = 10 + 90 = 100 points
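For illustration: the test split has 119 + 86 + 75 + 62 = 342 problems. If every result file exists and is valid (10 points) and 250 answers are correct, the task performance score is 250 / 342 × 90 ≈ 65.8, giving a total of about 75.8 points.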
Performance Benchmarks:
Basic Performance Targets (Individual Task Accuracy):
- vstar: 66.0%
- blink_viscorr: 73.3%
- blink_jigsaw: 64.0%
- blink_depth: 71.8%
Advanced Performance Targets (Individual Task Accuracy):
- vstar: 80.3%
- blink_viscorr: 80.8%
- blink_jigsaw: 70.7%
- blink_depth: 83.9%
Evaluation Methods:
- Primary Evaluation: Use all visual tasks for testing your system
- Development/Training: Analyze patterns and develop strategies from the provided data
- Result Files: Check output files in the /workspace/data/outputs/ directory
- Answer Verification: Compare your system’s predictions with ground truth answers
- Automated Scoring: Evaluation system automatically calculates both file validation and task performance scores
Output Requirements:
Your system must process visual reasoning samples from both develop and test, and save results in the following directory structure:
For Evalset Results (Final Evaluation):
Save results in: /workspace/data/outputs/
/workspace/data/outputs/
├── vstar/
│ ├── direct_attributes@sa_xxxxx/result.json
│ ├── relative_position@sa_xxxxx/result.json
│ ├── OCR@textvqa_x/result.json
│ └── ... # All vstar test results (119 instances)
├── blink_viscorr/
│ ├── problem_xxx/result.json # Result for viscorr problem
│ └── ... # All viscorr test results (86 instances)
├── blink_jigsaw/
│ ├── val_Jigsaw_xxx/result.json # Result for jigsaw problem
│ └── ... # All jigsaw test results (75 instances)
├── blink_depth/
│ ├── problem_xxx/result.json # Result for depth problem
│ └── ... # All depth test results (62 instances)
└── summary.json # Overall performance summary
IMPORTANT: Each result file must be located in a subdirectory named after the problem ID and the file must be named result.json. For test results, they must be saved in /workspace/data/outputs/. For example, for blink_viscorr test problem ID val_Visual_Correspondence_2, the result should be at /workspace/data/outputs/blink_viscorr/val_Visual_Correspondence_2/result.json.
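A minimal sketch of a helper that writes a result to the required path (the fields follow the result file format described below):

import json
from pathlib import Path

OUTPUT_ROOT = Path("/workspace/data/outputs")

def save_result(task_type: str, problem_id: str, result: dict) -> None:
    # Write a single result.json to outputs/{task_type}/{problem_id}/result.json.
    out_dir = OUTPUT_ROOT / task_type / problem_id
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "result.json").write_text(json.dumps(result, indent=2))

save_result("blink_viscorr", "val_Visual_Correspondence_2", {
    "task_type": "blink_viscorr",
    "problem_id": "val_Visual_Correspondence_2",
    "answer": "(A)",
    "reasoning": "...",
    "confidence": 0.9,
})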
Result File Format:
Each individual result file should contain a JSON object with the following required format:
For all visual tasks, the file MUST contain an “answer” field:
For vstar tasks:
{
"task_type": "vstar",
"problem_id": "relative_position@sa_87051",
"answer": "top-left",
"reasoning": "Analyzed the star pattern position relative to the image boundaries and determined it is located in the top-left quadrant.",
"solution_steps": [
"Load and analyze the star pattern image",
"Identify star position relative to image boundaries",
"Apply spatial relationship rules",
"Determine relative position as top-left"
],
"confidence": 0.90,
"visual_analysis": "Used image segmentation to isolate star pattern and measured relative coordinates"
}
For blink_viscorr, blink_jigsaw, and blink_depth tasks:
{
"task_type": "blink_depth",
"problem_id": "problem_1",
"answer": "closer",
"reasoning": "Analyzed depth cues including relative size, occlusion, and perspective to determine object depth ordering.",
"solution_steps": [
"Apply depth estimation algorithms to image",
"Analyze visual depth cues (size, occlusion, perspective)",
"Compare relative depths of relevant objects",
"Determine depth relationship"
],
"confidence": 0.88,
"visual_analysis": "Utilized depth estimation tools and stereo vision principles"
}
Answer Format Requirements:
CRITICAL: The “answer” field format must exactly match the expected answer format for each task type:
- vstar: Depends on question type - could be text descriptions, coordinates, or choice letters
- blink_viscorr: Depends on question - could be yes/no, choice letters, or descriptive text
- blink_jigsaw: Depends on question - could be choice letters, numbers, or descriptive text
- blink_depth: Depends on question - could be relative terms like “closer”, “farther”, or choice letters
Summary File Format:
The summary.json file should contain:
{
"total_problems": "varies by task type",
"completed_problems": "number completed",
"task_results": {
"vstar": {"total": "varies", "completed": "varies"},
"blink_viscorr": {"total": "varies", "completed": "varies"},
"blink_jigsaw": {"total": "varies", "completed": "varies"},
"blink_depth": {"total": "varies", "completed": "varies"}
},
"execution_time": "2.5 hours",
"average_confidence": 0.82,
"tools_used": ["segment_and_mark", "detection", "depth", "crop_image"]
}
Environment
Work in the provided environment with the following setup:
API Configuration:
- OpenAI API: Access via environment variables OPENAI_API_KEY and OPENAI_BASE_URL
- Azure OpenAI: Access via environment variables AZURE_OPENAI_API_KEY, AZURE_OPENAI_ENDPOINT, and AZURE_OPENAI_API_VERSION
- Specified Model: GPT-4o
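A minimal sketch of querying GPT-4o with an image via the OpenAI Python client (the client setup here is simplified; see simple_api_example.py and simple_azure_example.py for the configuration used in this environment):

import base64
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"],
                base_url=os.environ["OPENAI_BASE_URL"])

def ask_gpt4o(image_path: str, question: str) -> str:
    # Encode the image as base64 and send it together with the question text.
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content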
Visual Tools Environment:
- Visual Tools: /workspace/task/scripts/tools.py
- Required Dependencies: PIL, OpenCV, gradio-client, matplotlib, numpy
System Components:
- Visual Reasoning Engine: Core visual analysis and reasoning module
- Image Processor: Handle image loading, preprocessing, and analysis
- Multi-modal Integrator: Combine visual and textual information
- Tool Interface: Interface with provided visual analysis tools
- Result Generator: Generate standardized output results
Provided Scripts:
- install.sh: Environment setup script for installing visual reasoning dependencies and conda environment
- tools.py: Visual analysis tools including segmentation, detection, depth estimation
- simple_api_example.py: OpenAI API usage examples for visual tasks with image support
- simple_azure_example.py: Azure OpenAI API usage examples for visual tasks with image support
- server_environment.tar.gz: Complete conda environment package for visual reasoning tools
Directory Structure:
- Working Directory: /workspace/task/
- Devset Output Directory: /workspace/data/outputs/develop/ (optional, for development)
- Evalset Output Directory: /workspace/data/outputs/ (required, for final evaluation)
- Training Dataset Directory: /workspace/data/datasets/develop/
- Evaluation Dataset Directory: /workspace/data/datasets/test/
- Tools Directory: /workspace/task/scripts/tools.py
Implementation Requirements:
Create the main implementation file /workspace/task/main.py and ensure it can be executed with the following command:
python /workspace/task/main.py
Important: Your main.py script must automatically process all test visual reasoning samples from /workspace/data/datasets/test/ and generate the required output files in /workspace/data/outputs/ following the specified format.
Your implementation should:
- Design Universal Visual Framework: Capable of handling multiple types of visual reasoning problems
- Process Complete Dataset: Automatically iterate through all visual reasoning samples
- Implement Efficient Visual Analysis: Develop effective image analysis and reasoning strategies
- Integrate Visual Tools: Effectively utilize provided visual analysis tools
- Generate Structured Output: Create properly formatted result files for each visual task type
- Ensure Visual Understanding Quality: Generate accurate answers and clear visual reasoning processes
- Optimize Performance: Strive for highest accuracy across all visual task types
- Ensure Reproducibility: Save complete visual reasoning processes and intermediate results
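A possible skeleton for main.py is sketched below (assumptions: each test problem directory contains a request.json, and solve() is only a placeholder for your GPT-4o plus visual-tools workflow):

import json
from pathlib import Path

TEST_ROOT = Path("/workspace/data/datasets/test")
OUTPUT_ROOT = Path("/workspace/data/outputs")
TASKS = ["vstar", "blink_viscorr", "blink_jigsaw", "blink_depth"]

def solve(task_type: str, request: dict, problem_dir: Path) -> dict:
    # Placeholder: plug in your reasoning workflow here, using request["query"]
    # and request["images"] together with GPT-4o and the visual tools.
    return {
        "task_type": task_type,
        "problem_id": problem_dir.name,
        "answer": "(A)",  # stub answer -- replace with the model's prediction
        "reasoning": "TODO",
        "confidence": 0.0,
    }

def main() -> None:
    summary = {"total_problems": 0, "completed_problems": 0, "task_results": {}}
    for task in TASKS:
        total = completed = 0
        for problem_dir in sorted(p for p in (TEST_ROOT / task).iterdir() if p.is_dir()):
            request_file = problem_dir / "request.json"
            if not request_file.exists():
                continue
            total += 1
            request = json.loads(request_file.read_text())
            result = solve(task, request, problem_dir)
            out_dir = OUTPUT_ROOT / task / problem_dir.name
            out_dir.mkdir(parents=True, exist_ok=True)
            (out_dir / "result.json").write_text(json.dumps(result, indent=2))
            completed += 1
        summary["task_results"][task] = {"total": total, "completed": completed}
        summary["total_problems"] += total
        summary["completed_problems"] += completed
    (OUTPUT_ROOT / "summary.json").write_text(json.dumps(summary, indent=2))

if __name__ == "__main__":
    main()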
Innovation Goals:
This is an open-ended innovation task that encourages you to explore various innovative methods and techniques to achieve excellent performance in visual reasoning tasks. You can:
- Design novel visual analysis methods
- Develop efficient multi-modal reasoning algorithms
- Explore advanced computer vision techniques
- Innovatively combine visual tools with language model capabilities
- Create sophisticated visual understanding pipelines
Note: You should generate all files described in the Output Requirements section before you evaluate or finish your task.