Qwen2.5-VL-72B-Instruct

2025-01-28
Chat, Vision
By Qwen

Input: ￥6.00 / M tokens Output: ￥12.00 / M tokens
Features: Image Input, Streaming, Text Input, Text Output
Context Window: 128K

Input: ￥6.00 / M tokens Output: ￥12.00 / M tokens
Features: Image Input, Streaming, Text Input, Text Output
Context Window: 128K

Model Description

Qwen2.5-VL-72B-Instruct represents a significant upgrade in the Qwen family of vision-language models. Building on feedback from developers, it excels in visual recognition (objects, text, charts, layouts), acts as a visual agent for tool-based reasoning, and processes long videos (1+ hours) with precise event localization. It supports object detection via bounding boxes/points and generates structured outputs (e.g., invoices, tables) for finance/commerce. Architectural improvements include dynamic FPS training for video understanding, optimized ViT with window attention/SwiGLU, and temporal mRoPE enhancements. Available in 3B/7B/72B variants, this 72B instruction-tuned model balances speed and performance.

Recommend Models

DeepSeek-R1

Chat, Reasoning
DeepSeek

Performance on par with OpenAI-o1, Fully open-source model & technical report, Code and models are released under the MIT License: Distill & commercialize freely.

2025-01-20

DeepGemini-2.5-pro

Chat, Reasoning
JuheAI

DeepSeek-R1 + gemini-2.5-pro-preview-03-25，The Deep series is composed of the DeepSeek-R1 (671b) model combined with the chain-of-thought reasoning of other models, fully utilizing the powerful capabilities of the DeepSeek chain-of-thought. It employs a strategy of leveraging other more powerful models for supplementation, thereby enhancing the overall model's capabilities.

2025-03-25

DeepSeek-R1-all

Chat, Reasoning
DeepSeek

Performance on par with OpenAI-o1, Fully open-source model & technical report, Code and models are released under the MIT License: Distill & commercialize freely.

2025-01-20