Motivation
Current multimodal large language models still face challenges in complex mathematical reasoning tasks, particularly when dealing with geometry problems, graph theory algorithms, mathematical function analysis, and strategic games. These problems often require deep logical reasoning and problem analysis capabilities. Improving AI performance in these mathematical domains is crucial for advancing artificial intelligence applications in scientific computing and education.
Task
Your task is to develop an efficient mathematical reasoning workflow that can accurately solve various complex mathematical problems.
You should work under the /workspace directory.
You need to handle the following 4 different types of mathematical reasoning tasks:
- geometry - Geometric problem solving (angle calculation, area calculation, etc.)
- graph_connectivity - Graph connectivity determination
- graph_maxflow - Graph maximum flow problems
- math_convexity - Mathematical function convexity determination
Your objectives are:
- Implement an efficient multimodal mathematical reasoning workflow: Build a framework capable of handling multiple types of mathematical problems; you may use tools such as Python. Your target is to make the final score as high as possible.
- Inference: Run inference on the dev set and test set using the workflow with the gpt-4o API.
You can submit your answers up to 3 times (each with a different reasoning workflow and its corresponding inference results). Try your best to achieve the highest score.
Important Note: All final results and reasoning processes must be saved in the /workspace/data/outputs/ directory. The format requirement will be mentioned below.
Data
You will use diverse mathematical reasoning datasets located at:
/workspace/data/datasets/
The dataset is organized into two main subsets:
- develop: Contains complete data with answers for reference (50% of data)
- test: Contains data without answers for evaluation and testing (50% of data)
The dataset contains 4 different task subsets organized as follows:
/workspace/data/datasets/
├── develop/ # Training/reference data with answers
│ ├── geometry/
│ ├── graph_connectivity/
│ ├── graph_maxflow/
│ └── math_convexity/
├── test/ # Evaluation data without answers
│ ├── geometry/
│ ├── graph_connectivity/
│ ├── graph_maxflow/
│ └── math_convexity/
└── [original task directories remain for reference]
The dataset contains problems for the following 4 tasks:
1. geometry (Geometric Problems)
- File Format: Each sample contains:
ex.json: contains problem text, answer choices, image information, geometric logic forms, etc.
image.png: the corresponding geometric image
- Task Objective: Solve geometric problems such as angles, lengths, areas, etc.
- Data Samples: 24 samples in develop, 24 samples in test
- Data Format Example (develop with answer):
{ "problem_text": "In \\odot K, M N = 16 and m \\widehat M N = 98. Find the measure of L N.", "choices": ["6.93", "7.50", "8.94", "10.00"], "answer": "C", "problem_type_graph": ["Circle"], "problem_type_goal": ["Length"], "logic_form": { "text_logic_form": ["Circle(K)", "Equals(LengthOf(Line(M,N)),16)", ...] } }
- Note: test samples are identical but without the "answer" field
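As a sketch, a loader and prompt builder for geometry samples might look like the following. The sample path is illustrative, and the prompt wording is an assumption, not a prescribed format:

```python
import json
from pathlib import Path

# Hypothetical sample directory; real develop samples live under
# /workspace/data/datasets/develop/geometry/<id>/
SAMPLE_DIR = Path("/workspace/data/datasets/develop/geometry/0")

def load_geometry_sample(sample_dir: Path) -> dict:
    """Read ex.json and attach the path of the matching image.png."""
    with open(sample_dir / "ex.json") as f:
        sample = json.load(f)
    sample["image_path"] = str(sample_dir / "image.png")
    return sample

def build_prompt(sample: dict) -> str:
    """Turn a geometry sample into a multiple-choice prompt for the model."""
    choices = ", ".join(
        f"{letter}: {value}"
        for letter, value in zip("ABCD", sample["choices"])
    )
    return (f"{sample['problem_text']}\n"
            f"Choices: {choices}\n"
            f"Answer with a single letter A-D.")
```

The resulting prompt, together with the image path, can then be passed to a vision-capable model call such as the provided call_with_image helper.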
2. graph_connectivity (Graph Connectivity)
- File Format:
example.json contains the graph adjacency matrix and query vertices
- Task Objective: Determine whether two specified vertices in a graph are connected
- Data Samples: 64 samples in develop, 64 samples in test
- Data Format Example (develop with answer):
{ "adjacency_matrix": "[[0, 0, 1], [0, 0, 0], [1, 0, 0]]", "query_node_1": 12, "query_node_2": 10, "label": false, "id": "isobench/algorithm/connectivity_008" }
- Note: test samples are identical but without the "label" field
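The connectivity check itself is pure graph traversal and needs no model call. A minimal BFS sketch, assuming the adjacency matrix arrives as a string as in the example; treating out-of-range query nodes as disconnected is an assumption (the example above pairs a 3×3 matrix with larger node indices):

```python
import ast
from collections import deque

def is_connected(adjacency_matrix: str, u: int, v: int) -> bool:
    """BFS from u, returning True iff v is reachable."""
    matrix = ast.literal_eval(adjacency_matrix)
    n = len(matrix)
    if not (0 <= u < n and 0 <= v < n):
        # Assumption: query nodes outside the matrix count as disconnected.
        return False
    seen, queue = {u}, deque([u])
    while queue:
        node = queue.popleft()
        if node == v:
            return True
        for nxt, edge in enumerate(matrix[node]):
            if edge and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False
```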
3. graph_maxflow (Graph Maximum Flow)
- File Format:
example.json contains the weighted graph structure with source and sink nodes
- Task Objective: Calculate the maximum flow from the source to the sink
- Data Samples: 64 samples in develop, 64 samples in test
- Data Format Example (develop with answer):
{ "source_node": 0, "sink_node": 2, "adjacency_matrix": "[[0, 2, 7], [0, 0, 3], [0, 0, 0]]", "label": 9, "id": "isobench/algorithm/maxflow_105" }
- Note: test samples are identical but without the "label" field
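Maximum flow can likewise be computed exactly in code rather than asked of the model. A self-contained Edmonds-Karp sketch, assuming the string-encoded capacity matrix follows the example format above:

```python
import ast
from collections import deque

def max_flow(adjacency_matrix: str, source: int, sink: int) -> int:
    """Edmonds-Karp maximum flow on a string-encoded capacity matrix."""
    cap = [row[:] for row in ast.literal_eval(adjacency_matrix)]
    n, flow = len(cap), 0
    while True:
        # BFS for a shortest augmenting path in the residual graph.
        parent = [-1] * n
        parent[source] = source
        queue = deque([source])
        while queue and parent[sink] == -1:
            u = queue.popleft()
            for v in range(n):
                if cap[u][v] > 0 and parent[v] == -1:
                    parent[v] = u
                    queue.append(v)
        if parent[sink] == -1:
            return flow  # no augmenting path left
        # Find the bottleneck capacity along the path.
        bottleneck, v = float("inf"), sink
        while v != source:
            u = parent[v]
            bottleneck = min(bottleneck, cap[u][v])
            v = u
        # Push flow and update residual capacities.
        v = sink
        while v != source:
            u = parent[v]
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
            v = u
        flow += bottleneck
```

On the example above (source 0, sink 2), the direct edge carries 7 and the path through node 1 carries 2, matching the label of 9.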
4. math_convexity (Function Convexity)
- File Format:
example.json contains the mathematical function in LaTeX and code representations
- Task Objective: Determine whether a function is convex or concave
- Data Samples: 128 samples in develop, 128 samples in test
- Data Format Example (develop with answer):
{ "domain": "x > 0", "latex": "$$f(x) = 0.19 \\log{\\left(x \\right)} - 1.52 \\left|{x}\\right| + 3.58$$", "code": "f(x) = 0.19*log(x) - 1.52*Abs(x) + 3.58", "label": "concave", "id": "isobench/math/convexity_129" }
- Note: test samples are identical but without the "label" field
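Convexity can often be checked without a model call. One hedged approach is numeric sampling of second differences; the name mapping below assumes the "code" field uses sympy-style names (log, exp, Abs, sqrt), and the default interval [0.1, 10] is an illustrative stand-in for the sample's actual "domain" field:

```python
import math

def classify_convexity(code: str, x_min: float = 0.1, x_max: float = 10.0) -> str:
    """Classify f as convex/concave/neither by sampling numeric second
    differences over [x_min, x_max]."""
    expr = code.split("=", 1)[1]  # right-hand side of "f(x) = ..."
    names = {"log": math.log, "exp": math.exp, "Abs": abs, "sqrt": math.sqrt}
    f = lambda x: eval(expr, {"__builtins__": {}}, {**names, "x": x})
    h, tol, steps = 1e-3, 1e-6, 200
    seconds = []
    for i in range(steps + 1):
        x = x_min + (x_max - x_min) * i / steps
        # Central second difference approximates f''(x).
        seconds.append((f(x + h) - 2 * f(x) + f(x - h)) / (h * h))
    if all(s >= -tol for s in seconds):
        return "convex"
    if all(s <= tol for s in seconds):
        return "concave"
    return "neither"
```

This is a heuristic: sampling can miss sign changes between grid points, so a symbolic check (e.g. with sympy, if available) would be more rigorous.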
Constraints
Performance Constraints:
- Development Time Limit: Maximum 12 hours to complete system implementation and testing
- Single Problem Evaluation Time Limit: Each problem solving time should not exceed 3 minutes
- Model Requirement: Must use the gpt-4o model for reasoning
- API Calls: Use OpenAI/Azure OpenAI API keys provided via environment variables
Technical Constraints:
- Model Interface: Use OpenAI GPT-4o or Azure OpenAI services
- Programming Language: Python
- Output Format: JSON format result files
- Resource Limitation: Efficiently utilize API calls and optimize reasoning efficiency
Evaluation
Your system will be evaluated using a two-part scoring system with a maximum total score of 100 points:
Scoring System:
Part 1: File Validation (10 points)
- Purpose: Verifies that output files exist and contain proper answer fields
- Scoring:
(Valid Files / Total Files) × 10
- Requirements:
- Each problem must have a result file at the correct path: outputs/{task_type}/{problem_id}/result.json
- Each result file must contain an "answer" field
- Each file must be valid JSON
Part 2: Task Performance (90 points)
- Purpose: Measures accuracy across all mathematical reasoning tasks
- Scoring:
(Total Correct Answers / Total Problems) × 90 (the Task Performance Score is calculated from this score, not the original score)
- Calculation: Based on all 280 problems across the 4 task types
- Note: This measures overall success rate across all task types combined
Total Score Calculation:
Total Score = File Validation Score + Task Performance Score
Maximum Score = 10 + 90 = 100 points
Performance Benchmarks:
Basic Performance Targets (Individual Task Accuracy):
- Geometry: 62.5%
- Graph Max Flow: 25.0%
- Graph Connectivity: 96.1%
- Function Convexity: 87.2%
Advanced Performance Targets (Individual Task Accuracy):
- Geometry: 66.7%
- Graph Max Flow: 66.3%
- Graph Connectivity: 98.4%
- Function Convexity: 94.9%
Evaluation Methods:
- Primary Evaluation: Use the test set for testing your system (answers removed)
- Development/Training: Use the develop set for development and training (answers included)
- Result Files: Check output files in the /workspace/data/outputs/ directory
- Answer Verification: Compare your system's predictions on the test set with ground-truth answers
- Automated Scoring: Evaluation system automatically calculates both file validation and task performance scores
Output Requirements:
Your system must process all samples in the test and save results in the following directory structure:
/workspace/data/outputs/
├── geometry/
│ ├── 8/result.json # Result for geometry problem ID 8
│ ├── 67/result.json # Result for geometry problem ID 67
│ └── ... # More geometry problem results
├── graph_connectivity/
│ ├── 10/result.json # Result for connectivity problem ID 10
│ ├── 11/result.json # Result for connectivity problem ID 11
│ └── ... # More connectivity problem results
├── graph_maxflow/
│ ├── 10/result.json # Result for maxflow problem ID 10
│ └── ... # More maxflow problem results
├── math_convexity/
│ ├── 126/result.json # Result for convexity problem ID 126
│ └── ... # More convexity problem results
└── summary.json # Overall performance summary
IMPORTANT: Each result file must be located in a subdirectory named after the problem ID and the file must be named result.json. For example, for geometry problem ID 8, the result should be at outputs/geometry/8/result.json.
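A small helper sketch for emitting files in this layout; the root parameter is added here purely for testability, and its default matches the required path:

```python
import json
from pathlib import Path

OUTPUT_ROOT = Path("/workspace/data/outputs")

def save_result(task_type, problem_id, answer, reasoning="",
                solution_steps=None, confidence=0.0, root=OUTPUT_ROOT):
    """Write outputs/{task_type}/{problem_id}/result.json in the required
    format and return the path written."""
    result_dir = Path(root) / task_type / str(problem_id)
    result_dir.mkdir(parents=True, exist_ok=True)
    payload = {
        "task_type": task_type,
        "problem_id": str(problem_id),
        "answer": answer,
        "reasoning": reasoning,
        "solution_steps": solution_steps or [],
        "confidence": confidence,
    }
    path = result_dir / "result.json"
    path.write_text(json.dumps(payload, indent=2))
    return path
```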
Result File Format:
Each individual result file should contain a JSON object with the following required format:
For all tasks, the file MUST contain an “answer” field:
For geometry tasks:
{
"task_type": "geometry",
"problem_id": "8",
"answer": "C",
"reasoning": "Based on circle properties and chord-arc relationships, the measure of LN is 8.94, which corresponds to choice C.",
"solution_steps": [
"Identify circle properties from the diagram",
"Apply chord-arc relationship formulas",
"Calculate using given measurements: MN = 16, arc MN = 98°",
"Determine LN = 8.94"
],
"confidence": 0.85
}
For graph connectivity tasks:
{
"task_type": "graph_connectivity",
"problem_id": "10",
"answer": false,
"reasoning": "Applied DFS algorithm to determine connectivity between nodes 4 and 11. No path exists between these nodes.",
"solution_steps": [
"Parse adjacency matrix",
"Apply DFS from node 4",
"Check if node 11 is reachable",
"Result: nodes are not connected"
],
"confidence": 0.92
}
For graph maxflow tasks:
{
"task_type": "graph_maxflow",
"problem_id": "10",
"answer": 16,
"reasoning": "Applied Ford-Fulkerson algorithm to find maximum flow from source node 0 to sink node 2.",
"solution_steps": [
"Identify source node 0 and sink node 2",
"Parse capacity matrix",
"Apply Ford-Fulkerson algorithm",
"Calculate maximum flow = 16"
],
"confidence": 0.88
}
For math convexity tasks:
{
"task_type": "math_convexity",
"problem_id": "126",
"answer": "convex",
"reasoning": "Analyzed the function's second derivative to determine convexity. All terms contribute positively to the second derivative.",
"solution_steps": [
"Parse function: f(x) = e^(0.66x) + e^(1.25x) - 0.76*log(x) + 0.2*|x| + 10.13",
"Compute second derivative",
"Analyze sign of second derivative over domain x > 0",
"Conclude function is convex"
],
"confidence": 0.90
}
Answer Format Requirements:
CRITICAL: The “answer” field format must exactly match the expected answer format for each task type:
- geometry: Must be one of: “A”, “B”, “C”, “D” (uppercase letters)
- graph_connectivity: Must be boolean: true or false
- graph_maxflow: Must be an integer, e.g., 16, 25, 100
- math_convexity: Must be one of: “convex”, “concave”, “neither” (lowercase)
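A sketch of a pre-submission check for these formats. Note that in Python, bool is a subclass of int, so maxflow answers must exclude booleans explicitly:

```python
def validate_answer(task_type, answer) -> bool:
    """Return True iff the answer matches the required format for its task."""
    if task_type == "geometry":
        return answer in {"A", "B", "C", "D"}
    if task_type == "graph_connectivity":
        return isinstance(answer, bool)
    if task_type == "graph_maxflow":
        # bool is a subclass of int, so exclude it explicitly.
        return isinstance(answer, int) and not isinstance(answer, bool)
    if task_type == "math_convexity":
        return answer in {"convex", "concave", "neither"}
    return False
```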
Summary File Format:
The summary.json file should contain:
{
"total_problems": 280,
"completed_problems": 280,
"task_results": {
"geometry": {"total": 24, "completed": 24},
"graph_connectivity": {"total": 64, "completed": 64},
"graph_maxflow": {"total": 64, "completed": 64},
"math_convexity": {"total": 128, "completed": 128}
},
"execution_time": "2.5 hours",
"average_confidence": 0.78
}
Environment
Work in the provided environment with the following setup:
API Configuration:
- OpenAI API: Access via environment variables
OPENAI_API_KEY and OPENAI_BASE_URL
- Azure OpenAI: Access via environment variables AZURE_OPENAI_API_KEY, AZURE_OPENAI_ENDPOINT, and AZURE_OPENAI_API_VERSION
- Specified Model: GPT-4o
Note: only one of the two APIs will be provided; check which is actually configured before using it.
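Since only one API is provided, a small probe sketch can decide which client to construct. The variable names are exactly those listed above; returning "none" when neither is set is an assumption:

```python
import os

def detect_api_backend() -> str:
    """Return which API the environment provides: 'azure', 'openai', or 'none'."""
    if os.environ.get("AZURE_OPENAI_API_KEY") and os.environ.get("AZURE_OPENAI_ENDPOINT"):
        return "azure"
    if os.environ.get("OPENAI_API_KEY"):
        return "openai"
    return "none"
```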
System Components:
- Reasoning Engine: Core problem analysis and reasoning module
- Data Processor: Handle different input data formats
- Result Generator: Generate standardized output results
- Execution Environment: Support Python code execution and computation
Provided Scripts:
- simple_api_example.py: OpenAI API example with two functions:
call_with_image(image_path, problem_text): for image-based problems
call_without_image(problem_text): for text-based problems
- simple_azure_example.py: Azure OpenAI API example with same functions as above
Directory Structure:
- Working Directory: /workspace/task/
- Output Directory: /workspace/data/outputs/
- Dataset Directory: /workspace/data/datasets/
- Training/Reference Data: /workspace/data/datasets/develop/
- Evaluation Data: /workspace/data/datasets/test/
Implementation Requirements:
Create main implementation file /workspace/data/outputs/main.py, ensuring it can be executed with the following command:
python /workspace/data/outputs/main.py
Important: Your main.py script must automatically process all samples in the test set and generate the required output files in /workspace/data/outputs/ following the specified format.
You should run main.py to generate all files listed in Section Output Requirements before you evaluate / finish this task.
Your implementation should:
- Design Universal Framework: Capable of handling multiple types of mathematical reasoning problems
- Process the Complete Test Set: Automatically iterate through all 280 samples in the test directories
- Implement Efficient Reasoning: Develop effective problem analysis and solving strategies
- Generate Structured Output: Create properly formatted result files for each task type
- Ensure Output Quality: Generate accurate answers and clear reasoning processes
- Optimize Performance: Strive for highest accuracy across all task types
- Ensure Reproducibility: Save complete reasoning processes and intermediate results
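A skeleton for the iteration step of main.py along these lines. The directory names come from the dataset layout above; assuming one subdirectory per sample under each task directory is an inference from the output layout, not stated explicitly:

```python
from pathlib import Path

DATASET_ROOT = Path("/workspace/data/datasets/test")
TASK_TYPES = ["geometry", "graph_connectivity", "graph_maxflow", "math_convexity"]

def iter_samples(root: Path):
    """Yield (task_type, problem_id, sample_dir) for every test sample,
    assuming one subdirectory per sample under each task directory."""
    for task_type in TASK_TYPES:
        task_dir = root / task_type
        if not task_dir.is_dir():
            continue
        for sample_dir in sorted(task_dir.iterdir()):
            if sample_dir.is_dir():
                yield task_type, sample_dir.name, sample_dir
```

A main() would then dispatch each (task_type, sample_dir) pair to the appropriate solver and write its result file.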
Innovation Goals:
This is an open-ended innovation task that encourages you to explore various innovative methods and techniques to achieve excellent performance in mathematical reasoning tasks. You can:
- Design novel problem analysis methods
- Develop efficient reasoning algorithms
- Explore multi-step reasoning strategies
- Innovatively utilize language model capabilities
Note: You should generate all files listed in Section Output Requirements before you evaluate / finish your task.