Overview
- Microsoft made the model available today on Microsoft Foundry and Hugging Face, alongside sample notebooks, inference code, a technical paper, and a model card.
- It fuses high-resolution visual perception with selective, task-aware reasoning to interpret images, diagrams, documents, and UI screens.
- Highlighted applications include computer-use agents for GUI automation, diagram-based math and scientific reasoning, and document, chart, and table understanding.
- Reasoning behavior can be set to hybrid, think, or nothink via prompt tokens, letting teams trade speed for deeper analysis on a per-prompt basis.
- Microsoft published internal benchmark comparisons across multimodal, math, OCR, and computer-use tasks and outlined safety training signals and deployment guidance aligned to its Responsible AI principles.
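
The prompt-token switch between hybrid, think, and nothink can be pictured with a minimal Python sketch. This is an illustration under assumptions, not the model's documented interface: the token strings `<think>` and `<nothink>`, the choice to represent hybrid as "no token", and the helper `build_prompt` are all hypothetical. The model card is the place to find the actual tokens and where they belong in the chat template.

```python
from enum import Enum


class ReasoningMode(Enum):
    HYBRID = "hybrid"    # model decides whether to reason (assumed default)
    THINK = "think"      # always produce explicit reasoning
    NOTHINK = "nothink"  # answer directly, favoring latency


# Hypothetical control tokens, assumed for illustration only.
# The real strings and their placement are defined by the model card.
MODE_TOKENS = {
    ReasoningMode.HYBRID: "",
    ReasoningMode.THINK: "<think>",
    ReasoningMode.NOTHINK: "<nothink>",
}


def build_prompt(user_text: str, mode: ReasoningMode = ReasoningMode.HYBRID) -> str:
    """Prefix the user text with the control token for the chosen mode."""
    token = MODE_TOKENS[mode]
    # Hybrid adds nothing, so the prompt is passed through unchanged.
    return f"{token}\n{user_text}" if token else user_text


if __name__ == "__main__":
    question = "What is the total in the last row of this table?"
    for mode in ReasoningMode:
        print(f"[{mode.value}] {build_prompt(question, mode)!r}")
```

Because the mode is chosen per prompt in this sketch, a caller could send quick UI-grounding requests with nothink and reserve think for diagram-based math, without reloading or reconfiguring the model.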