Motivation
Current multimodal large language models still face challenges in complex mathematical reasoning tasks, particularly when dealing with geometry problems, graph theory algorithms, mathematical function analysis, and strategic games. These problems often require deep logical reasoning and problem analysis capabilities. Improving AI performance in these mathematical domains is crucial for advancing artificial intelligence applications in scientific computing and education.
Task
Your task is to develop an efficient mathematical reasoning workflow that can accurately solve various complex mathematical problems.
You should work under the /workspace directory.
You need to handle the following 4 different types of mathematical reasoning tasks:
- geometry - Geometric problem solving (angle calculation, area calculation, etc.)
- graph_connectivity - Graph connectivity determination
- graph_maxflow - Graph maximum flow problems
- math_convexity - Mathematical function convexity determination
Your objectives are:
- Implement an efficient multimodal mathematical reasoning workflow: Build a framework capable of handling multiple types of mathematical problems; you may use tools such as Python. Your target is to make the final score as high as possible.
- Inference: Run inference on the dev set and test set using the workflow with the gpt-4o API.
You can submit your answers up to 3 times (each with a different reasoning workflow and its corresponding inference results). Try your best to achieve the highest score.
Important Note: All final results and reasoning processes must be saved in the /workspace/data/outputs/ directory. The format requirement will be mentioned below.
Data
You will use diverse mathematical reasoning datasets located at:
/workspace/data/datasets/
The dataset is organized into two main subsets:
- develop: Contains complete data with answers for reference (50% of data)
- test: Contains data without answers for evaluation and testing (50% of data)
The dataset contains 4 different task subsets organized as follows:
/workspace/data/datasets/
├── develop/ # Training/reference data with answers
│ ├── geometry/
│ ├── graph_connectivity/
│ ├── graph_maxflow/
│ └── math_convexity/
├── test/ # Evaluation data without answers
│ ├── geometry/
│ ├── graph_connectivity/
│ ├── graph_maxflow/
│ └── math_convexity/
└── [original task directories remain for reference]
The dataset contains problems for the following 4 tasks:
1. geometry (Geometric Problems)
- File Format: Each sample contains:
ex.json: contains problem text, answer choices, image information, geometric logic forms, etc.
image.png: the corresponding geometric image
- Task Objective: Solve geometric problems such as angles, lengths, areas, etc.
- Data Samples: 24 samples in develop, 24 samples in test
- Data Format Example (develop with answer):
{ "problem_text": "In \\odot K, M N = 16 and m \\widehat M N = 98. Find the measure of L N.", "choices": ["6.93", "7.50", "8.94", "10.00"], "answer": "C", "problem_type_graph": ["Circle"], "problem_type_goal": ["Length"], "logic_form": { "text_logic_form": ["Circle(K)", "Equals(LengthOf(Line(M,N)),16)", ...] } }
- Note: test samples are identical but without the "answer" field
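As a sketch, a loader and prompt builder for geometry samples might look like the following. The sample path is illustrative, and the prompt wording is an assumption, not a prescribed format:

```python
import json
from pathlib import Path

# Hypothetical sample directory; real develop samples live under
# /workspace/data/datasets/develop/geometry/<id>/
SAMPLE_DIR = Path("/workspace/data/datasets/develop/geometry/0")

def load_geometry_sample(sample_dir: Path) -> dict:
    """Read ex.json and attach the path of the matching image.png."""
    with open(sample_dir / "ex.json") as f:
        sample = json.load(f)
    sample["image_path"] = str(sample_dir / "image.png")
    return sample

def build_prompt(sample: dict) -> str:
    """Turn a geometry sample into a multiple-choice prompt for the model."""
    choices = ", ".join(
        f"{letter}: {value}"
        for letter, value in zip("ABCD", sample["choices"])
    )
    return (f"{sample['problem_text']}\n"
            f"Choices: {choices}\n"
            f"Answer with a single letter A-D.")
```

The resulting prompt, together with the image path, can then be passed to a vision-capable model call such as the provided call_with_image helper.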
2. graph_connectivity (Graph Connectivity)
- File Format:
example.json contains the graph adjacency matrix and query vertices
- Task Objective: Determine whether two specified vertices in a graph are connected
- Data Samples: 64 samples in develop, 64 samples in test
- Data Format Example (develop with answer):
{ "adjacency_matrix": "[[0, 0, 1], [0, 0, 0], [1, 0, 0]]", "query_node_1": 12, "query_node_2": 10, "label": false, "id": "isobench/algorithm/connectivity_008" }
- Note: test samples are identical but without the "label" field
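The connectivity check itself is pure graph traversal and needs no model call. A minimal BFS sketch, assuming the adjacency matrix arrives as a string as in the example; treating out-of-range query nodes as disconnected is an assumption (the example above pairs a 3×3 matrix with larger node indices):

```python
import ast
from collections import deque

def is_connected(adjacency_matrix: str, u: int, v: int) -> bool:
    """BFS from u, returning True iff v is reachable."""
    matrix = ast.literal_eval(adjacency_matrix)
    n = len(matrix)
    if not (0 <= u < n and 0 <= v < n):
        # Assumption: query nodes outside the matrix count as disconnected.
        return False
    seen, queue = {u}, deque([u])
    while queue:
        node = queue.popleft()
        if node == v:
            return True
        for nxt, edge in enumerate(matrix[node]):
            if edge and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False
```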
3. graph_maxflow (Graph Maximum Flow)
- File Format:
example.json contains the weighted graph structure with source and sink nodes
- Task Objective: Calculate the maximum flow from the source to the sink
- Data Samples: 64 samples in develop, 64 samples in test
- Data Format Example (develop with answer):
{ "source_node": 0, "sink_node": 2, "adjacency_matrix": "[[0, 2, 7], [0, 0, 3], [0, 0, 0]]", "label": 9, "id": "isobench/algorithm/maxflow_105" }
- Note: test samples are identical but without the "label" field
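Maximum flow can likewise be computed exactly in code rather than asked of the model. A self-contained Edmonds-Karp sketch, assuming the string-encoded capacity matrix follows the example format above:

```python
import ast
from collections import deque

def max_flow(adjacency_matrix: str, source: int, sink: int) -> int:
    """Edmonds-Karp maximum flow on a string-encoded capacity matrix."""
    cap = [row[:] for row in ast.literal_eval(adjacency_matrix)]
    n, flow = len(cap), 0
    while True:
        # BFS for a shortest augmenting path in the residual graph.
        parent = [-1] * n
        parent[source] = source
        queue = deque([source])
        while queue and parent[sink] == -1:
            u = queue.popleft()
            for v in range(n):
                if cap[u][v] > 0 and parent[v] == -1:
                    parent[v] = u
                    queue.append(v)
        if parent[sink] == -1:
            return flow  # no augmenting path left
        # Find the bottleneck capacity along the path.
        bottleneck, v = float("inf"), sink
        while v != source:
            u = parent[v]
            bottleneck = min(bottleneck, cap[u][v])
            v = u
        # Push flow and update residual capacities.
        v = sink
        while v != source:
            u = parent[v]
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
            v = u
        flow += bottleneck
```

On the example above (source 0, sink 2), the direct edge carries 7 and the path through node 1 carries 2, matching the label of 9.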
4. math_convexity (Function Convexity)
- File Format:
example.json contains the mathematical function in LaTeX and code representations
- Task Objective: Determine whether a function is convex or concave
- Data Samples: 128 samples in develop, 128 samples in test
- Data Format Example (develop with answer):
{ "domain": "x > 0", "latex": "$$f(x) = 0.19 \\log{\\left(x \\right)} - 1.52 \\left|{x}\\right| + 3.58$$", "code": "f(x) = 0.19*log(x) - 1.52*Abs(x) + 3.58", "label": "concave", "id": "isobench/math/convexity_129" }
- Note: test samples are identical but without the "label" field
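Convexity can often be checked without a model call. One hedged approach is numeric sampling of second differences; the name mapping below assumes the "code" field uses sympy-style names (log, exp, Abs, sqrt), and the default interval [0.1, 10] is an illustrative stand-in for the sample's actual "domain" field:

```python
import math

def classify_convexity(code: str, x_min: float = 0.1, x_max: float = 10.0) -> str:
    """Classify f as convex/concave/neither by sampling numeric second
    differences over [x_min, x_max]."""
    expr = code.split("=", 1)[1]  # right-hand side of "f(x) = ..."
    names = {"log": math.log, "exp": math.exp, "Abs": abs, "sqrt": math.sqrt}
    f = lambda x: eval(expr, {"__builtins__": {}}, {**names, "x": x})
    h, tol, steps = 1e-3, 1e-6, 200
    seconds = []
    for i in range(steps + 1):
        x = x_min + (x_max - x_min) * i / steps
        # Central second difference approximates f''(x).
        seconds.append((f(x + h) - 2 * f(x) + f(x - h)) / (h * h))
    if all(s >= -tol for s in seconds):
        return "convex"
    if all(s <= tol for s in seconds):
        return "concave"
    return "neither"
```

This is a heuristic: sampling can miss sign changes between grid points, so a symbolic check (e.g. with sympy, if available) would be more rigorous.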
Constraints
Performance Constraints:
- Development Time Limit: Maximum 12 hours to complete system implementation and testing
- Single Problem Evaluation Time Limit: Each problem solving time should not exceed 3 minutes
- Model Requirement: Must use the gpt-4o model for reasoning
- API Calls: Use OpenAI/Azure OpenAI API keys provided via environment variables
Technical Constraints:
- Model Interface: Use OpenAI GPT-4o or Azure OpenAI services
- Programming Language: Python
- Output Format: JSON format result files
- Resource Limitation: Efficiently utilize API calls and optimize reasoning efficiency
Evaluation
Your system will be evaluated using a two-part scoring system with a maximum total score of 100 points:
Scoring System:
Part 1: File Validation (10 points)
- Purpose: Verifies that output files exist and contain proper answer fields
- Scoring:
(Valid Files / Total Files) × 10
- Requirements:
- Each problem must have a result file at the correct path: outputs/{task_type}/{problem_id}/result.json
- Each result file must contain an "answer" field
- Each file must be valid JSON
Part 2: Task Performance (90 points)
- Purpose: Measures accuracy across all mathematical reasoning tasks
- Scoring:
(Total Correct Answers / Total Problems) × 90 (the Task Performance Score is calculated from this score, not the original score)
- Calculation: Based on all 280 problems across the 4 task types
- Note: This measures overall success rate across all task types combined
Total Score Calculation:
Total Score = File Validation Score + Task Performance Score
Maximum Score = 10 + 90 = 100 points
Performance Benchmarks:
Basic Performance Targets (Individual Task Accuracy):
- Geometry: 62.5%
- Graph Max Flow: 25.0%
- Graph Connectivity: 96.1%
- Function Convexity: 87.2%
Advanced Performance Targets (Individual Task Accuracy):
- Geometry: 66.7%
- Graph Max Flow: 66.3%
- Graph Connectivity: 98.4%
- Function Convexity: 94.9%
Evaluation Methods:
- Primary Evaluation: Use the test set for testing your system (answers removed)
- Development/Training: Use the develop set for development and training (answers included)
- Result Files: Check output files in the /workspace/data/outputs/ directory
- Answer Verification: Compare your system's predictions on the test set with ground-truth answers
- Automated Scoring: Evaluation system automatically calculates both file validation and task performance scores
Output Requirements:
Your system must process all samples in the test and save results in the following directory structure:
/workspace/data/outputs/
├── geometry/
│ ├── 8/result.json # Result for geometry problem ID 8
│ ├── 67/result.json # Result for geometry problem ID 67
│ └── ... # More geometry problem results
├── graph_connectivity/
│ ├── 10/result.json # Result for connectivity problem ID 10
│ ├── 11/result.json # Result for connectivity problem ID 11
│ └── ... # More connectivity problem results
├── graph_maxflow/
│ ├── 10/result.json # Result for maxflow problem ID 10
│ └── ... # More maxflow problem results
├── math_convexity/
│ ├── 126/result.json # Result for convexity problem ID 126
│ └── ... # More convexity problem results
└── summary.json # Overall performance summary
IMPORTANT: Each result file must be located in a subdirectory named after the problem ID and the file must be named result.json. For example, for geometry problem ID 8, the result should be at outputs/geometry/8/result.json.
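A small helper sketch for emitting files in this layout; the root parameter is added here purely for testability, and its default matches the required path:

```python
import json
from pathlib import Path

OUTPUT_ROOT = Path("/workspace/data/outputs")

def save_result(task_type, problem_id, answer, reasoning="",
                solution_steps=None, confidence=0.0, root=OUTPUT_ROOT):
    """Write outputs/{task_type}/{problem_id}/result.json in the required
    format and return the path written."""
    result_dir = Path(root) / task_type / str(problem_id)
    result_dir.mkdir(parents=True, exist_ok=True)
    payload = {
        "task_type": task_type,
        "problem_id": str(problem_id),
        "answer": answer,
        "reasoning": reasoning,
        "solution_steps": solution_steps or [],
        "confidence": confidence,
    }
    path = result_dir / "result.json"
    path.write_text(json.dumps(payload, indent=2))
    return path
```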
Result File Format:
Each individual result file should contain a JSON object with the following required format:
For all tasks, the file MUST contain an “answer” field:
For geometry tasks:
{
"task_type": "geometry",
"problem_id": "8",
"answer": "C",
"reasoning": "Based on circle properties and chord-arc relationships, the measure of LN is 8.94, which corresponds to choice C.",
"solution_steps": [
"Identify circle properties from the diagram",
"Apply chord-arc relationship formulas",
"Calculate using given measurements: MN = 16, arc MN = 98°",
"Determine LN = 8.94"
],
"confidence": 0.85
}
For graph connectivity tasks:
{
"task_type": "graph_connectivity",
"problem_id": "10",
"answer": false,
"reasoning": "Applied DFS algorithm to determine connectivity between nodes 4 and 11. No path exists between these nodes.",
"solution_steps": [
"Parse adjacency matrix",
"Apply DFS from node 4",
"Check if node 11 is reachable",
"Result: nodes are not connected"
],
"confidence": 0.92
}
For graph maxflow tasks:
{
"task_type": "graph_maxflow",
"problem_id": "10",
"answer": 16,
"reasoning": "Applied Ford-Fulkerson algorithm to find maximum flow from source node 0 to sink node 2.",
"solution_steps": [
"Identify source node 0 and sink node 2",
"Parse capacity matrix",
"Apply Ford-Fulkerson algorithm",
"Calculate maximum flow = 16"
],
"confidence": 0.88
}
For math convexity tasks:
{
"task_type": "math_convexity",
"problem_id": "126",
"answer": "convex",
"reasoning": "Analyzed the function's second derivative to determine convexity. All terms contribute positively to the second derivative.",
"solution_steps": [
"Parse function: f(x) = e^(0.66x) + e^(1.25x) - 0.76*log(x) + 0.2*|x| + 10.13",
"Compute second derivative",
"Analyze sign of second derivative over domain x > 0",
"Conclude function is convex"
],
"confidence": 0.90
}
Answer Format Requirements:
CRITICAL: The “answer” field format must exactly match the expected answer format for each task type:
- geometry: Must be one of: “A”, “B”, “C”, “D” (uppercase letters)
- graph_connectivity: Must be boolean: true or false
- graph_maxflow: Must be an integer, e.g., 16, 25, 100
- math_convexity: Must be one of: “convex”, “concave”, “neither” (lowercase)
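A sketch of a pre-submission check for these formats. Note that in Python, bool is a subclass of int, so maxflow answers must exclude booleans explicitly:

```python
def validate_answer(task_type, answer) -> bool:
    """Return True iff the answer matches the required format for its task."""
    if task_type == "geometry":
        return answer in {"A", "B", "C", "D"}
    if task_type == "graph_connectivity":
        return isinstance(answer, bool)
    if task_type == "graph_maxflow":
        # bool is a subclass of int, so exclude it explicitly.
        return isinstance(answer, int) and not isinstance(answer, bool)
    if task_type == "math_convexity":
        return answer in {"convex", "concave", "neither"}
    return False
```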
Summary File Format:
The summary.json file should contain:
{
"total_problems": 280,
"completed_problems": 280,
"task_results": {
"geometry": {"total": 24, "completed": 24},
"graph_connectivity": {"total": 64, "completed": 64},
"graph_maxflow": {"total": 64, "completed": 64},
"math_convexity": {"total": 128, "completed": 128}
},
"execution_time": "2.5 hours",
"average_confidence": 0.78
}
Environment
Work in the provided environment with the following setup:
API Configuration:
- OpenAI API: Access via environment variables
OPENAI_API_KEY and OPENAI_BASE_URL
- Azure OpenAI: Access via environment variables AZURE_OPENAI_API_KEY, AZURE_OPENAI_ENDPOINT, and AZURE_OPENAI_API_VERSION
- Specified Model: GPT-4o
Note: only one of the two APIs will be provided; check which is actually configured before using it.
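Since only one API is provided, a small probe sketch can decide which client to construct. The variable names are exactly those listed above; returning "none" when neither is set is an assumption:

```python
import os

def detect_api_backend() -> str:
    """Return which API the environment provides: 'azure', 'openai', or 'none'."""
    if os.environ.get("AZURE_OPENAI_API_KEY") and os.environ.get("AZURE_OPENAI_ENDPOINT"):
        return "azure"
    if os.environ.get("OPENAI_API_KEY"):
        return "openai"
    return "none"
```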
System Components:
- Reasoning Engine: Core problem analysis and reasoning module
- Data Processor: Handle different input data formats
- Result Generator: Generate standardized output results
- Execution Environment: Support Python code execution and computation
Provided Scripts:
- simple_api_example.py: OpenAI API example with two functions:
call_with_image(image_path, problem_text): for image-based problems
call_without_image(problem_text): for text-based problems
- simple_azure_example.py: Azure OpenAI API example with same functions as above
Directory Structure:
- Working Directory: /workspace/task/
- Output Directory: /workspace/data/outputs/
- Dataset Directory: /workspace/data/datasets/
- Training/Reference Data: /workspace/data/datasets/develop/
- Evaluation Data: /workspace/data/datasets/test/
Implementation Requirements:
Create main implementation file /workspace/data/outputs/main.py, ensuring it can be executed with the following command:
python /workspace/data/outputs/main.py
Important: Your main.py script must automatically process all samples in the test set and generate the required output files in /workspace/data/outputs/ following the specified format.
You should run main.py to generate all files listed in Section Output Requirements before you evaluate / finish this task.
Your implementation should:
- Design Universal Framework: Capable of handling multiple types of mathematical reasoning problems
- Process the Complete Test Set: Automatically iterate through all 280 samples in the test directories
- Implement Efficient Reasoning: Develop effective problem analysis and solving strategies
- Generate Structured Output: Create properly formatted result files for each task type
- Ensure Output Quality: Generate accurate answers and clear reasoning processes
- Optimize Performance: Strive for highest accuracy across all task types
- Ensure Reproducibility: Save complete reasoning processes and intermediate results
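A skeleton for the iteration step of main.py along these lines. The directory names come from the dataset layout above; assuming one subdirectory per sample under each task directory is an inference from the output layout, not stated explicitly:

```python
from pathlib import Path

DATASET_ROOT = Path("/workspace/data/datasets/test")
TASK_TYPES = ["geometry", "graph_connectivity", "graph_maxflow", "math_convexity"]

def iter_samples(root: Path):
    """Yield (task_type, problem_id, sample_dir) for every test sample,
    assuming one subdirectory per sample under each task directory."""
    for task_type in TASK_TYPES:
        task_dir = root / task_type
        if not task_dir.is_dir():
            continue
        for sample_dir in sorted(task_dir.iterdir()):
            if sample_dir.is_dir():
                yield task_type, sample_dir.name, sample_dir
```

A main() would then dispatch each (task_type, sample_dir) pair to the appropriate solver and write its result file.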
Innovation Goals:
This is an open-ended innovation task that encourages you to explore various innovative methods and techniques to achieve excellent performance in mathematical reasoning tasks. You can:
- Design novel problem analysis methods
- Develop efficient reasoning algorithms
- Explore multi-step reasoning strategies
- Innovatively utilize language model capabilities
Note: You should generate all files listed in Section Output Requirements before you evaluate / finish your task.