Overview
- The open-source release is live on Hugging Face with model weights, code, documentation, and a dockerized vllm-server example for GPU inference.
- The architecture integrates a dynamic-resolution NaViT-style visual encoder with ERNIE-4.5-0.3B to deliver a compact VLM optimized for low-resource deployment.
- Supported use cases include page and element parsing across 109 languages covering text, tables, formulas, charts, handwritten content, and historical documents.
- The project cites state-of-the-art results on OmniDocBench with additional metrics from MinerU and internal evaluations, including claims of outperforming some 72B multimodal models on chart recognition.
- Practical usage is provided through CLI and Python APIs with instructions to launch an inference server, while the reported performance advantages await independent verification.
