Particle.news

PaddlePaddle Releases PaddleOCR-VL, a 0.9B Vision-Language Model for Multilingual Document Parsing

The model pairs a NaViT-style visual encoder with the ERNIE-4.5-0.3B language model to parse documents in 109 languages; the authors report state-of-the-art results.

Overview

  • The open-source release is live on Hugging Face with model weights, code, documentation, and a dockerized vllm-server example for GPU inference.
  • The architecture integrates a dynamic-resolution NaViT-style visual encoder with ERNIE-4.5-0.3B to deliver a compact VLM optimized for low-resource deployment.
  • Supported use cases include page and element parsing across 109 languages covering text, tables, formulas, charts, handwritten content, and historical documents.
  • The project reports state-of-the-art results on OmniDocBench, supplemented by metrics from MinerU and internal evaluations, and claims to outperform some 72B multimodal models on chart recognition.
  • The release documents a CLI and a Python API, along with instructions for launching an inference server. The reported performance advantages have not yet been independently verified.
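
To illustrate what "dynamic-resolution, NaViT-style" means in practice, here is a minimal sketch of how an encoder of this kind can turn images of different sizes into patch sequences at their native aspect ratio instead of resizing everything to one fixed square. This is not PaddleOCR-VL's code: the function names are hypothetical and the patch size of 14 pixels is an assumption chosen only for the example.

```python
import math


def patch_grid(width, height, patch=14):
    """Return (cols, rows) of patches covering an image at its native size.

    `patch=14` is an assumed patch size, not a value taken from the release.
    Edges that do not divide evenly are rounded up, as if padded.
    """
    cols = math.ceil(width / patch)
    rows = math.ceil(height / patch)
    return cols, rows


def pack_sequences(images, patch=14):
    """Return per-image token counts and the total packed sequence length.

    `images` is a list of (width, height) pairs. Each image keeps its own
    aspect ratio, so token counts differ from image to image.
    """
    lengths = []
    for width, height in images:
        cols, rows = patch_grid(width, height, patch)
        lengths.append(cols * rows)
    return lengths, sum(lengths)


if __name__ == "__main__":
    # A square crop and a wide table strip produce different token counts.
    print(pack_sequences([(224, 224), (448, 140)]))
```

The point of the sketch is that a wide table row or a tall receipt yields a patch grid matching its shape, which is one reason this style of encoder suits document pages with varied layouts.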