Motivation
Current multimodal large language models still face challenges in complex visual reasoning tasks, particularly when dealing with spatial relationship understanding, visual perception, semantic correlation analysis, and depth reasoning. These problems often require sophisticated visual analysis capabilities, combining image understanding with logical reasoning. Improving AI performance in visual reasoning domains is crucial for advancing artificial intelligence applications in computer vision, robotics, and human-computer interaction.
Task
Your task is to develop an efficient visual reasoning system that can accurately solve various complex visual understanding and reasoning problems.
You should work under the /workspace directory.
You need to handle the following 4 different types of visual reasoning tasks:
- vstar - Visual star pattern recognition and spatial relationship reasoning
- blink_viscorr - Visual correlation analysis
- blink_jigsaw - Visual puzzle and combination reasoning
- blink_depth - Depth perception and 3D understanding
Your objectives are:
- Implement an efficient multimodal visual reasoning workflow: Build a framework capable of handling multiple types of visual reasoning problems; you may use tools such as Python. Your target is to make the final score as high as possible.
- Inference: Run inference on the dev set and test set using this workflow with the GPT-4o API.
You can submit your results up to 3 times (each with a different reasoning workflow and its corresponding inference results). You should try your best to achieve the highest score.
Important Note: All final results and reasoning processes must be saved in the /workspace/data/outputs/ directory.
Data
You will use diverse visual reasoning datasets organized into training and evaluation sets:
Training Data (develop set):
- Location: /workspace/data/datasets/develop/ (use for development, pattern analysis, and strategy formulation)
Evaluation Data (test set):
- Location: /workspace/data/datasets/test/ (used for final evaluation; answers removed)
The dataset contains 4 different visual task subsets organized as follows:
/workspace/data/datasets/
├── develop/
│ ├── vstar/ # 119 instances
│ │ ├── relative_position@sa_xxxxx/
│ │ ├── direct_attributes@sa_xxxxx/
│ │ ├── OCR@textvqa_x/
│ │ └── ...
│ ├── blink_viscorr/ # 86 instances
│ ├── blink_jigsaw/ # 75 instances
│ │ ├── val_Jigsaw_xx/
│ │ └── ...
│ └── blink_depth/ # 62 instances
├── test/
│ ├── vstar/ # 119 instances (no answers)
│ ├── blink_viscorr/ # 86 instances (no answers)
│ ├── blink_jigsaw/ # 75 instances (no answers)
│ └── blink_depth/ # 62 instances (no answers)
Total Dataset Statistics:
- vstar: 238 total (119 develop + 119 test)
- blink_viscorr: 172 total (86 develop + 86 test)
- blink_jigsaw: 150 total (75 develop + 75 test)
- blink_depth: 124 total (62 develop + 62 test)
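A minimal sketch for walking the develop split, assuming each problem directory contains a request.json as described in the per-task formats below:

import json
from pathlib import Path

DEV_ROOT = Path("/workspace/data/datasets/develop")

# Walk every task subset, then every problem directory, and load its request.json.
for task_dir in sorted(DEV_ROOT.iterdir()):
    if not task_dir.is_dir():
        continue
    for problem_dir in sorted(p for p in task_dir.iterdir() if p.is_dir()):
        request_file = problem_dir / "request.json"
        if not request_file.exists():
            continue
        request = json.loads(request_file.read_text())
        # The develop split includes the ground-truth "answer"; the test split does not.
        print(task_dir.name, problem_dir.name, request.get("answer"))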
The dataset contains problems for the following 4 visual reasoning tasks:
1. vstar (Visual Star Pattern Recognition)
- File Format: Each sample contains:
  - request.json: Contains query text, image paths, options, and ground truth answer
  - sa_xxxxx.jpg: Corresponding star pattern image
- Task Objective: Recognize star patterns and understand spatial positional relationships
- Task Types: Includes relative position judgment, direct attribute recognition, OCR text recognition, and GPT4V-hard challenges
- Data Format Example:
{ "target_object": ["bucket", "cyclist"], "bbox": [[1904, 906, 46, 54], [882, 899, 22, 62]], "question": "Is the bucket on the left or right side of the cyclist?", "options": [ "The bucket is on the left side of the cyclist.", "The bucket is on the right side of the cyclist." ], "query": "<img src='../tasks/vstar/processed/relative_position@sa_86732/sa_86732.jpg'> Is the bucket on the left or right side of the cyclist? Options: (A) The bucket is on the left side of the cyclist. (B) The bucket is on the right side of the cyclist.", "images": ["../tasks/vstar/processed/relative_position@sa_86732/sa_86732.jpg"], "answer": "(B)" } - Challenge: Requires precise spatial understanding and pattern recognition capabilities
2. blink_viscorr (Visual Correlation Analysis)
- File Format: Each sample contains:
  - request.json: Contains query about visual correlations
  - image1.jpg, image2.jpg: Two images for correspondence analysis
- Task Objective: Analyze visual correlations and find corresponding points between different camera positions or lighting conditions
- Data Format Example:
{ "query": "<img src='../tasks/blink_viscorr/processed/val_Visual_Correspondence_98/image1.jpg'> <img src='../tasks/blink_viscorr/processed/val_Visual_Correspondence_98/image2.jpg'> A point is circled on the first image, labeled with REF. We change the camera position or lighting and shoot the second image. You are given multiple red-circled points on the second image, choices of \"A, B, C, D\" are drawn beside each circle. Which point on the second image corresponds to the point in the first image? Select from the following options.\n(A) Point A\n(B) Point B\n(C) Point C\n(D) Point D", "images": [ "../tasks/blink_viscorr/processed/val_Visual_Correspondence_98/image1.jpg", "../tasks/blink_viscorr/processed/val_Visual_Correspondence_98/image2.jpg" ], "answer": "(A)" } - Challenge: Requires understanding abstract relationships between visual elements under different conditions
3. blink_jigsaw (Visual Puzzle Reasoning)
- File Format: Each sample contains:
  - request.json: Contains puzzle-related query
  - image1.jpg, image2.jpg, image3.jpg: Multiple images for jigsaw puzzle analysis
- Task Objective: Understand combination and arrangement relationships of image fragments
- Data Format Example:
{ "query": "<img src='../tasks/blink_jigsaw/processed/val_Jigsaw_99/image1.jpg'> <img src='../tasks/blink_jigsaw/processed/val_Jigsaw_99/image2.jpg'> <img src='../tasks/blink_jigsaw/processed/val_Jigsaw_99/image3.jpg'> Given the first image with the lower right corner missing, can you tell which one of the second image or the third image is the missing part? Imagine which image would be more appropriate to place in the missing spot. You can also carefully observe and compare the edges of the images.\nSelect from the following choices.\n\n(A) the second image\n(B) the third image\n", "images": [ "../tasks/blink_jigsaw/processed/val_Jigsaw_99/image1.jpg", "../tasks/blink_jigsaw/processed/val_Jigsaw_99/image2.jpg", "../tasks/blink_jigsaw/processed/val_Jigsaw_99/image3.jpg" ], "answer": "(A)" } - Challenge: Requires spatial reasoning and pattern matching capabilities
4. blink_depth (Depth Perception)
- File Format: Each sample contains:
  - request.json: Contains depth-related query
  - image.jpg: Image for depth analysis
- Task Objective: Analyze depth information and 3D spatial relationships in images
- Data Format Example:
{ "query": "<img src='../tasks/blink_depth/processed/val_Relative_Depth_99/image.jpg'> Two points are circled on the image, labeled by A and B beside each circle. Which point is closer to the camera?\nSelect from the following choices.\n(A) A is closer\n(B) B is closer", "images": ["../tasks/blink_depth/processed/val_Relative_Depth_99/image.jpg"], "answer": "(A)" } - Challenge: Requires inferring 3D spatial information from 2D images
Available Tools
Visual Expert Server Setup
We provide three visual expert servers for you to use:
1. SOM (Segment-and-Mark)
The HTTP server is running at a URL that you can get from SOM_ADDRESS.
2. GroundingDINO Server
The HTTP server is running at a URL that you can get from GROUNDING_DINO_ADDRESS.
3. Depth-Anything Server
The HTTP server is running at a URL that you can get from DEPTH_ANYTHING_ADDRESS.
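For example, assuming these addresses are exposed as environment variables (check your environment setup), they can be read like this:

import os

# Assumed: SOM_ADDRESS, GROUNDING_DINO_ADDRESS, and DEPTH_ANYTHING_ADDRESS are
# environment variables holding the HTTP endpoints of the three expert servers.
som_url = os.environ["SOM_ADDRESS"]
grounding_dino_url = os.environ["GROUNDING_DINO_ADDRESS"]
depth_anything_url = os.environ["DEPTH_ANYTHING_ADDRESS"]
print(som_url, grounding_dino_url, depth_anything_url)

In practice, the tools in /workspace/task/scripts/tools.py are expected to wrap these servers, so you will usually not need to call them directly.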
Visual Tools API
You can use the visual tools located at /workspace/task/scripts/tools.py, including:
Core Visual Tools:
- segment_and_mark
  - Function: Image segmentation and marking to help identify objects and spatial relationships
  - Usage: Object segmentation, region marking, spatial relationship analysis
  - Applicable Tasks: vstar, blink_spatial, blink_semcorr
- detection
  - Function: Object detection to identify specific objects in images
  - Usage: Object recognition, localization, attribute analysis
  - Applicable Tasks: All task types
- depth
  - Function: Depth estimation to analyze 3D information in images
  - Usage: Depth analysis, 3D spatial understanding, relative position judgment
  - Applicable Tasks: blink_depth, blink_spatial
- crop_image
  - Function: Image cropping to focus on specific regions
  - Usage: Region extraction, detail analysis
  - Applicable Tasks: All task types
- zoom_in_image_by_bbox
  - Function: Region zooming for detailed analysis of specific areas
  - Usage: Fine-grained analysis, local magnification
  - Applicable Tasks: vstar, mmvp, blink_jigsaw
- sliding_window_detection
  - Function: Sliding window detection for systematic image analysis
  - Usage: Global scanning, pattern recognition
  - Applicable Tasks: vstar, blink_jigsaw
- overlay_images
  - Function: Image overlay for comparing and analyzing multiple images
  - Usage: Image comparison, correspondence analysis
  - Applicable Tasks: blink_viscorr, blink_semcorr, blink_jigsaw
Tool Usage Examples:
We have provided an example in /workspace/task/scripts/tool_example.py showing how to use the tools. For details, please refer to /workspace/task/scripts/tools.py.
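A minimal sketch of calling the tools from Python is shown below; the function names come from the list above, but the call signatures, argument types, and image path are hypothetical, so consult tool_example.py and tools.py for the actual API:

import sys

sys.path.append("/workspace/task/scripts")
import tools  # the provided visual tools module

# Hypothetical path and signatures -- replace with a real problem image and the
# real API from tool_example.py / tools.py.
image_path = "/workspace/data/datasets/develop/blink_depth/<problem_id>/image.jpg"

depth_map = tools.depth(image_path)                 # depth estimation (blink_depth)
detections = tools.detection(image_path, "bucket")  # object detection (vstar)
segments = tools.segment_and_mark(image_path)       # segmentation and marking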
Constraints
Performance Constraints:
- Development Time Limit: Maximum 12 hours to complete system implementation and testing
- Single Problem Evaluation Time Limit: Each visual problem solving time should not exceed 3 minutes
- Model Requirement: Must use GPT-4o model for reasoning
- API Calls: Use OpenAI/Azure OpenAI API keys provided via environment variables
Technical Constraints:
- Model Interface: Use OpenAI GPT-4o or Azure OpenAI services
- Programming Language: Python
- Output Format: JSON format result files
- Resource Limitation: Efficiently utilize API calls and optimize reasoning efficiency
- Environment: Use provided visual reasoning environment and tools
Evaluation
Your system will be evaluated using a two-part scoring system with a maximum total score of 100 points:
Scoring System:
Part 1: File Validation (10 points)
- Purpose: Verifies that output files exist and contain proper answer fields
- Scoring: (Valid Files / Total Files) × 10
- Requirements:
  - Each problem must have a result file at the correct path: outputs/{task_type}/{problem_id}/result.json
  - Each result file must contain either an "answer" or "label" field
  - Each file must be valid JSON
Part 2: Task Performance (90 points)
- Purpose: Measures accuracy across all visual reasoning tasks
- Scoring: (Total Correct Answers / Total Problems) × 90 (the Task Performance Score is calculated from this score, not the original score)
- Calculation: Based on all visual problems across the 4 task types
- Note: This measures overall success rate across all visual task types combined
Total Score Calculation:
Total Score = File Validation Score + Task Performance Score
Maximum Score = 10 + 90 = 100 points
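For illustration: the test split has 119 + 86 + 75 + 62 = 342 problems. If every result file exists and is valid (10 points) and 250 answers are correct, the task performance score is 250 / 342 × 90 ≈ 65.8, giving a total of about 75.8 points.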
Performance Benchmarks:
Basic Performance Targets (Individual Task Accuracy):
- vstar: 66.0%
- blink_viscorr: 73.3%
- blink_jigsaw: 64.0%
- blink_depth: 71.8%
Advanced Performance Targets (Individual Task Accuracy):
- vstar: 80.3%
- blink_viscorr: 80.8%
- blink_jigsaw: 70.7%
- blink_depth: 83.9%
Evaluation Methods:
- Primary Evaluation: Use all visual tasks for testing your system
- Development/Training: Analyze patterns and develop strategies from the provided data
- Result Files: Check output files in the /workspace/data/outputs/ directory
- Answer Verification: Compare your system’s predictions with ground truth answers
- Automated Scoring: Evaluation system automatically calculates both file validation and task performance scores
Output Requirements:
Your system must process visual reasoning samples from both develop and test, and save results in the following directory structure:
For Evalset Results (Final Evaluation):
Save results in: /workspace/data/outputs/
/workspace/data/outputs/
├── vstar/
│ ├── direct_attributes@sa_xxxxx/result.json
│ ├── relative_position@sa_xxxxx/result.json
│ ├── OCR@textvqa_x/result.json
│ └── ... # All vstar test results (119 instances)
├── blink_viscorr/
│ ├── problem_xxx/result.json # Result for viscorr problem
│ └── ... # All viscorr test results (86 instances)
├── blink_jigsaw/
│ ├── val_Jigsaw_xxx/result.json # Result for jigsaw problem
│ └── ... # All jigsaw test results (75 instances)
├── blink_depth/
│ ├── problem_xxx/result.json # Result for depth problem
│ └── ... # All depth test results (62 instances)
└── summary.json # Overall performance summary
IMPORTANT: Each result file must be located in a subdirectory named after the problem ID and the file must be named result.json. For test results, they must be saved in /workspace/data/outputs/. For example, for blink_viscorr test problem ID val_Visual_Correspondence_2, the result should be at /workspace/data/outputs/blink_viscorr/val_Visual_Correspondence_2/result.json.
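A minimal sketch of a helper that writes a result to the required path (the fields follow the result file format described below):

import json
from pathlib import Path

OUTPUT_ROOT = Path("/workspace/data/outputs")

def save_result(task_type: str, problem_id: str, result: dict) -> None:
    # Write a single result.json to outputs/{task_type}/{problem_id}/result.json.
    out_dir = OUTPUT_ROOT / task_type / problem_id
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "result.json").write_text(json.dumps(result, indent=2))

save_result("blink_viscorr", "val_Visual_Correspondence_2", {
    "task_type": "blink_viscorr",
    "problem_id": "val_Visual_Correspondence_2",
    "answer": "(A)",
    "reasoning": "...",
    "confidence": 0.9,
})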
Result File Format:
Each individual result file should contain a JSON object with the following required format:
For all visual tasks, the file MUST contain an “answer” field:
For vstar tasks:
{
"task_type": "vstar",
"problem_id": "relative_position@sa_87051",
"answer": "top-left",
"reasoning": "Analyzed the star pattern position relative to the image boundaries and determined it is located in the top-left quadrant.",
"solution_steps": [
"Load and analyze the star pattern image",
"Identify star position relative to image boundaries",
"Apply spatial relationship rules",
"Determine relative position as top-left"
],
"confidence": 0.90,
"visual_analysis": "Used image segmentation to isolate star pattern and measured relative coordinates"
}
For blink_viscorr, blink_jigsaw, and blink_depth tasks:
{
"task_type": "blink_depth",
"problem_id": "problem_1",
"answer": "closer",
"reasoning": "Analyzed depth cues including relative size, occlusion, and perspective to determine object depth ordering.",
"solution_steps": [
"Apply depth estimation algorithms to image",
"Analyze visual depth cues (size, occlusion, perspective)",
"Compare relative depths of relevant objects",
"Determine depth relationship"
],
"confidence": 0.88,
"visual_analysis": "Utilized depth estimation tools and stereo vision principles"
}
Answer Format Requirements:
CRITICAL: The “answer” field format must exactly match the expected answer format for each task type:
- vstar: Depends on question type - could be text descriptions, coordinates, or choice letters
- blink_viscorr: Depends on question - could be yes/no, choice letters, or descriptive text
- blink_jigsaw: Depends on question - could be choice letters, numbers, or descriptive text
- blink_depth: Depends on question - could be relative terms like “closer”, “farther”, or choice letters
Summary File Format:
The summary.json file should contain:
{
"total_problems": "varies by task type",
"completed_problems": "number completed",
"task_results": {
"vstar": {"total": "varies", "completed": "varies"},
"blink_viscorr": {"total": "varies", "completed": "varies"},
"blink_jigsaw": {"total": "varies", "completed": "varies"},
"blink_depth": {"total": "varies", "completed": "varies"}
},
"execution_time": "2.5 hours",
"average_confidence": 0.82,
"tools_used": ["segment_and_mark", "detection", "depth", "crop_image"]
}
Environment
Work in the provided environment with the following setup:
API Configuration:
- OpenAI API: Access via environment variables OPENAI_API_KEY and OPENAI_BASE_URL
- Azure OpenAI: Access via environment variables AZURE_OPENAI_API_KEY, AZURE_OPENAI_ENDPOINT, and AZURE_OPENAI_API_VERSION
- Specified Model: GPT-4o
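A minimal sketch of querying GPT-4o with an image via the OpenAI Python client (the client setup here is simplified; see simple_api_example.py and simple_azure_example.py for the configuration used in this environment):

import base64
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"],
                base_url=os.environ["OPENAI_BASE_URL"])

def ask_gpt4o(image_path: str, question: str) -> str:
    # Encode the image as base64 and send it together with the question text.
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content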
Visual Tools Environment:
- Visual Tools: /workspace/task/scripts/tools.py
- Required Dependencies: PIL, OpenCV, gradio-client, matplotlib, numpy
System Components:
- Visual Reasoning Engine: Core visual analysis and reasoning module
- Image Processor: Handle image loading, preprocessing, and analysis
- Multi-modal Integrator: Combine visual and textual information
- Tool Interface: Interface with provided visual analysis tools
- Result Generator: Generate standardized output results
Provided Scripts:
- install.sh: Environment setup script for installing visual reasoning dependencies and conda environment
- tools.py: Visual analysis tools including segmentation, detection, depth estimation
- simple_api_example.py: OpenAI API usage examples for visual tasks with image support
- simple_azure_example.py: Azure OpenAI API usage examples for visual tasks with image support
- server_environment.tar.gz: Complete conda environment package for visual reasoning tools
Directory Structure:
- Working Directory: /workspace/task/
- Devset Output Directory: /workspace/data/outputs/develop/ (optional, for development)
- Evalset Output Directory: /workspace/data/outputs/ (required, for final evaluation)
- Training Dataset Directory: /workspace/data/datasets/develop/
- Evaluation Dataset Directory: /workspace/data/datasets/test/
- Tools Directory: /workspace/task/scripts/tools.py
Implementation Requirements:
Create the main implementation file /workspace/task/main.py and ensure it can be executed with the following command:
python /workspace/task/main.py
Important: Your main.py script must automatically process all test visual reasoning samples from /workspace/data/datasets/test/ and generate the required output files in /workspace/data/outputs/ following the specified format.
Your implementation should:
- Design Universal Visual Framework: Capable of handling multiple types of visual reasoning problems
- Process Complete Dataset: Automatically iterate through all visual reasoning samples
- Implement Efficient Visual Analysis: Develop effective image analysis and reasoning strategies
- Integrate Visual Tools: Effectively utilize provided visual analysis tools
- Generate Structured Output: Create properly formatted result files for each visual task type
- Ensure Visual Understanding Quality: Generate accurate answers and clear visual reasoning processes
- Optimize Performance: Strive for highest accuracy across all visual task types
- Ensure Reproducibility: Save complete visual reasoning processes and intermediate results
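A possible skeleton for main.py is sketched below (assumptions: each test problem directory contains a request.json, and solve() is only a placeholder for your GPT-4o plus visual-tools workflow):

import json
from pathlib import Path

TEST_ROOT = Path("/workspace/data/datasets/test")
OUTPUT_ROOT = Path("/workspace/data/outputs")
TASKS = ["vstar", "blink_viscorr", "blink_jigsaw", "blink_depth"]

def solve(task_type: str, request: dict, problem_dir: Path) -> dict:
    # Placeholder: plug in your reasoning workflow here, using request["query"]
    # and request["images"] together with GPT-4o and the visual tools.
    return {
        "task_type": task_type,
        "problem_id": problem_dir.name,
        "answer": "(A)",  # stub answer -- replace with the model's prediction
        "reasoning": "TODO",
        "confidence": 0.0,
    }

def main() -> None:
    summary = {"total_problems": 0, "completed_problems": 0, "task_results": {}}
    for task in TASKS:
        total = completed = 0
        for problem_dir in sorted(p for p in (TEST_ROOT / task).iterdir() if p.is_dir()):
            request_file = problem_dir / "request.json"
            if not request_file.exists():
                continue
            total += 1
            request = json.loads(request_file.read_text())
            result = solve(task, request, problem_dir)
            out_dir = OUTPUT_ROOT / task / problem_dir.name
            out_dir.mkdir(parents=True, exist_ok=True)
            (out_dir / "result.json").write_text(json.dumps(result, indent=2))
            completed += 1
        summary["task_results"][task] = {"total": total, "completed": completed}
        summary["total_problems"] += total
        summary["completed_problems"] += completed
    (OUTPUT_ROOT / "summary.json").write_text(json.dumps(summary, indent=2))

if __name__ == "__main__":
    main()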
Innovation Goals:
This is an open-ended innovation task that encourages you to explore various innovative methods and techniques to achieve excellent performance in visual reasoning tasks. You can:
- Design novel visual analysis methods
- Develop efficient multi-modal reasoning algorithms
- Explore advanced computer vision techniques
- Innovatively combine visual tools with language model capabilities
- Create sophisticated visual understanding pipelines
Note: You should generate all files described in the Output Requirements section before you evaluate or finish your task.