PaddleOCR-VL-1.5: Comprehensive Analysis of the 0.9B SOTA Document Parsing Model

Introduction: A New Milestone in Document Parsing

In today's rapidly evolving AI landscape, document parsing technology has become a critical bridge connecting the physical and digital worlds. From academic papers to business contracts, from invoices to historical archives, massive volumes of document information urgently need to be accurately and efficiently digitized and structured. This goes beyond simple OCR (Optical Character Recognition) upgrades—it represents a comprehensive understanding of documents' deep semantics and structure.

On January 29, 2026, Baidu's PaddlePaddle team released PaddleOCR-VL-1.5, a multi-task Vision-Language Model (VLM) with only 0.9B (900 million) parameters that achieved 94.5% accuracy on the OmniDocBench v1.5 benchmark, setting a new State-of-the-Art (SOTA) record. Even more remarkably, this lightweight model outperformed massive general-purpose VLMs like Qwen3-VL-235B and Gemini-3 Pro in real-world robustness testing.

PaddleOCR-VL-1.5 represents more than just a performance improvement—it marks a paradigm shift in document parsing technology: from single text recognition to unified parsing of tables, formulas, charts, and seals; from recognition in ideal conditions to handling real-world challenges like scanning, skewing, warping, and screen photography. This marks the official entry of document parsing technology into a new era of "practicality" and "intelligence."

Core Highlights

1. Ultra-Lightweight Architecture with SOTA Performance

The most striking feature of PaddleOCR-VL-1.5 is its extreme parameter efficiency. With just 0.9B parameters, it achieves 94.5% accuracy on OmniDocBench v1.5, surpassing not only its predecessor PaddleOCR-VL-1.0 but also demonstrating remarkable advantages in comparisons with large general-purpose VLMs:

vs. Qwen3-VL-235B: Only 1/260 of the parameter size, yet superior performance on document parsing tasks

vs. Gemini-3 Pro: More stable performance in real-world scenario testing

vs. Specialized Models: Significant improvements in table, formula, and text recognition

This parameter efficiency stems from PaddlePaddle team's deep understanding and careful design of document parsing tasks. The model employs a NaViT-style dynamic resolution visual encoder paired with the lightweight ERNIE-4.5-0.3B language model, maintaining high accuracy while dramatically reducing computational costs and deployment barriers.

2. Six Core Capabilities in a Unified Model

PaddleOCR-VL-1.5 is a true multi-task model, supporting six core capabilities within a single architecture:

OCR (Text Recognition): Supports 100+ languages, with new support for Tibetan and Bengali scripts, optimized for rare characters, ancient texts, and text decorations (underlines, emphasis marks)

Table Recognition: Supports complex table structures, including automatic cross-page table merging, multilingual tables, and wireless tables

Formula Recognition: Supports LaTeX format output, specially optimized for physical distortions like scanning, warping, and screen photography

Chart Recognition: Understands and extracts data and trends from charts

Seal Recognition (New): Recognizes official seals and stamps, handling challenges like curved text, blurred images, and background interference

Text Spotting (New): Supports precise text line localization and recognition, using 4-point quadrilateral representation to adapt to rotated and skewed layouts

This unified multi-task architecture not only simplifies deployment but, more importantly, enables knowledge sharing and collaborative optimization across different tasks, achieving better performance on each task.

3. Real-World Robustness: Real5-OmniDocBench

To evaluate the model's performance in real-world scenarios, the PaddlePaddle team constructed the Real5-OmniDocBench benchmark, covering five common physical distortion scenarios:

Scanning: Noise and moiré patterns from scanners

Skew: Documents photographed at incorrect angles

Warping: Non-planar deformations from paper folding and bending

Screen Photography: Capturing content displayed on screens

Illumination: Uneven lighting and shadows

On this more practical test set, PaddleOCR-VL-1.5 achieved an overall accuracy of 92.05%, setting a new SOTA record. This means the model maintains stable high performance whether processing contract photos taken with a smartphone or historical documents processed by scanners.

Technical Architecture Deep Dive

PaddleOCR-VL-1.5 adopts an innovative two-stage architecture design that organically combines layout analysis and element recognition for end-to-end document parsing capabilities.

PP-DocLayoutV3: Unified Layout Analysis Engine

PP-DocLayoutV3 is the first stage of PaddleOCR-VL-1.5, responsible for document layout analysis. Unlike traditional rectangular detection boxes, PP-DocLayoutV3 introduces instance segmentation technology to predict precise pixel-level masks, crucial for handling skewed and warped documents.

Core Innovations:

Multi-Point Localization: Supports quadrilateral or even polygonal bounding box prediction instead of traditional two-point rectangles, enabling accurate framing of skewed and rotated document elements.

Unified Reading Order Prediction: Integrates reading order prediction directly into the Transformer decoder through a Global Pointer Mechanism to compute precedence relationships between elements, eliminating cascading errors in traditional methods.

Instance Segmentation Capability: Based on the RT-DETR object detector, PP-DocLayoutV3 uses mask-based detection heads to predict precise pixel-level masks, effectively isolating document components in non-ideal scenarios.

PaddleOCR-VL-1.5-0.9B: Element-Level Recognition Model

The second stage, PaddleOCR-VL-1.5-0.9B, performs fine-grained recognition on elements obtained from layout analysis. The model inherits the lightweight architecture of PaddleOCR-VL-0.9B but with significantly expanded capabilities.

Architecture Components:

Visual Encoder: Uses a NaViT-style dynamic resolution encoder, supporting maximum resolutions of 1280×28×28 (document parsing) and 2048×28×28 (text spotting)

Adaptive MLP Connector: Maps visual features to the language model's input space for effective vision-language alignment

Language Model: Uses the lightweight ERNIE-4.5-0.3B as the language backbone, a large-scale pre-trained Chinese language model with strong semantic understanding

Training Strategy: Progressive training paradigm with three stages - pre-training (46M image-text pairs), post-training (5.6M instruction data), and reinforcement learning (GRPO optimization).

Performance Evaluation & Comparison

OmniDocBench v1.5: Comprehensive Leadership

OmniDocBench v1.5 is one of the most authoritative document parsing benchmarks, covering multiple element types including text, tables, formulas, and charts. PaddleOCR-VL-1.5 achieved 94.5% overall accuracy, setting new SOTA records across multiple sub-tasks:

Overall Accuracy: 94.5% (surpassing all open-source and closed-source models)

Table Recognition: Significant improvements, especially on complex and cross-page tables

Formula Recognition: Substantially improved LaTeX format output quality

Text Recognition: Excellent performance on rare characters, ancient texts, and text decorations

Reading Order: End-to-end prediction accuracy reaches new heights

Comparison with Competitors:

Real5-OmniDocBench: Real-World Performance

On Real5-OmniDocBench testing, PaddleOCR-VL-1.5 demonstrated exceptional robustness with an overall accuracy of 92.05%, surpassing all tested models including general-purpose VLMs with far larger parameter sizes.

Hardware Requirements & Deployment

Hardware Requirements

Recommended Configuration:

GPU: NVIDIA A100, AMD Instinct MI series

VRAM: 8GB+ (16GB+ recommended for larger batches)

CPU: 8+ cores

RAM: 16GB+

Minimum Configuration:

GPU: NVIDIA RTX 3060 or equivalent

VRAM: 6GB+

CPU: 4+ cores

RAM: 8GB+

Supported Platforms: CUDA (NVIDIA GPU), ROCm (AMD GPU, Day 0 support), CPU inference

Deployment Options

1. Docker Deployment (Recommended)


docker run --rm --gpus all --network host \
    paddlepaddle/paddleocr-genai-vllm-server:latest-nvidia-gpu \
    paddleocr genai_server --model_name PaddleOCR-VL-1.5-0.9B

2. vLLM Accelerated Deployment


vllm serve PaddlePaddle/PaddleOCR-VL-1.5-0.9B \
    --host 0.0.0.0 --port 8080

3. Native PaddlePaddle Deployment


from paddleocr import PaddleOCRVL
pipeline = PaddleOCRVL()
output = pipeline.predict("document.pdf")

Use Cases & Best Practices

Typical Application Scenarios

Document Digitization: Convert paper documents and scans to editable digital formats

RAG System Preprocessing: Provide high-quality structured document data for LLMs

Invoice/Contract Recognition: Automatically extract key information from invoices and contracts

Academic Paper Parsing: Extract text, formulas, tables, and charts from papers

Multilingual Document Processing: Support 100+ languages

Seal Recognition: Recognize seals on official documents

Scene Text Recognition: Recognize text in billboards, signs, and posters

Best Practice Recommendations

Choose Appropriate Deployment: Use vLLM for production, Docker for development

Optimize Input Resolution: Adjust based on document type

Batch Processing: Use batching for large volumes to improve throughput

Result Post-processing: Utilize structured output for cross-page table merging

Error Handling: Adjust preprocessing parameters for failed recognitions

Conclusion

PaddleOCR-VL-1.5 represents a significant breakthrough in document parsing technology. With only 0.9B parameters, it achieves 94.5% SOTA accuracy and outperforms much larger general-purpose models in real-world robustness testing.

Core Advantages:

Parameter Efficiency: Minimal parameters, low deployment cost

Multi-Task Unified: Six core capabilities in one model

Real-World Robust: Handles scanning, skewing, warping scenarios

Open Source: Apache 2.0 license, fully open-source

Related Links:

Official Website: https://www.paddleocr.com

GitHub Repository: https://github.com/PaddlePaddle/PaddleOCR

HuggingFace Model: https://huggingface.co/PaddlePaddle/PaddleOCR-VL-1.5

Technical Paper: https://arxiv.org/abs/2601.21957

Link

Z-Image: Free AI Image Generator

Z-Image-Turbo: Free AI Image Generator

Free Sora Watermark Remover

Zimage.run Google Site

Zhi Hu

Twitter

LTX-2

Model	Parameters	OmniDocBench v1.5	Features
PaddleOCR-VL-1.5	0.9B	94.5%	Lightweight, SOTA
Qwen3-VL-235B	235B	93.8%	General-purpose LLM
Gemini-3 Pro	Undisclosed	92.1%	Closed-source commercial
DeepSeek-OCR	Undisclosed	91.5%	Optical 2D mapping
MonkeyOCR v1.5	Undisclosed	90.2%	Three-stage framework