
FireRed-OCR-2B: An Open-Source SOTA Document Parsing Model That Outperforms 397B-Parameter Models

2026-03-06 · 10 min read

Xiaohongshu (Little Red Book) has open-sourced FireRed-OCR, a 2B-parameter model that achieves an impressive 92.94% on OmniDocBench v1.5. To put this in perspective, it outperforms Qwen3.5-397B (90.80%) and Gemini-3.0 Pro (90.33%). Released under the Apache 2.0 license, both the code and the model weights are available for commercial use.

1. The "Structural Hallucination" Pain Point in Document Parsing

General-purpose large models face a common challenge when reading PDFs: they recognize text accurately but struggle with structural understanding.

Structural Hallucination Example

Typical issues include:

  • Table rows and columns get scrambled, with data misattributed
  • Mathematical formulas get "creative", with extra symbols appearing out of nowhere
  • Multi-column documents have a chaotic reading order, with text jumping between columns

This isn't an occasional bug. General VLMs are trained to generate semantically coherent text but lack precise constraints on the pixel-level spatial structure of documents.

FireRed-OCR takes a direct approach: transforming general VLMs into "structural engineers" through a systematic training framework that enforces strict constraints on format and syntax.

2. Technical Architecture: Three-Stage Training + Format-Constrained GRPO

FireRed-OCR isn't just a simple fine-tune; it's a complete training pipeline.

Model Architecture

Stage 1: Multi-Task Pre-Alignment

At the visual perception level, the model learns object detection, region identification, and the mapping from page layout to Markdown, building a solid foundation for spatial localization.
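
Purely as an illustration of what a multi-task pre-alignment sample could look like (the field names and structure below are assumptions for this post, not the team's actual data schema):

# Hypothetical pre-alignment sample: one page image paired with several task targets
# (detection boxes, a region/layout label, and a layout-to-Markdown transcription).
sample = {
    "image": "page_001.png",
    "tasks": {
        "detect": [  # bounding boxes as [x0, y0, x1, y1] in pixels
            {"bbox": [40, 60, 560, 120], "label": "title"},
            {"bbox": [40, 140, 560, 700], "label": "table"},
        ],
        "region_type": "two_column",
        "layout_to_markdown": "# Quarterly Report\n\n| Item | Q1 | Q2 |\n| --- | --- | --- |\n...",
    },
}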

Stage 2: Specialized Supervised Fine-Tuning (SFT)

The model is then fine-tuned on high-quality, standardized Markdown datasets, ensuring its outputs are logically consistent and preserve the document's hierarchy.

Stage 3: Format-Constrained Reinforcement Learning (GRPO)

The core innovation lies here. GRPO (Group Relative Policy Optimization) is a reinforcement learning method. FireRed-OCR introduces specialized format reward signals on this foundation, covering four dimensions:

Dimension                       Reward Signal
Formula Syntax Correctness      Is the LaTeX valid?
Table Structure Integrity       Are tags properly closed?
Hierarchical Tag Closure        Is the Markdown nesting correct?
Text Accuracy                   Character-level recognition precision

Every time the model outputs results, the system scores it across these four dimensions and provides feedback for self-correction.
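
To make the idea concrete, here is a minimal, hypothetical sketch of what such a format-aware reward could look like. The weights, heuristics, and function name below are assumptions for illustration only, not FireRed-OCR's actual reward implementation.

import re

def format_reward(pred: str, reference: str) -> float:
    """Toy reward combining the four dimensions above (all weights are illustrative)."""
    score = 0.0

    # 1. Formula syntax: every $...$ / $$...$$ span should have balanced braces.
    formulas = re.findall(r"\$\$?(.+?)\$\$?", pred, flags=re.S)
    if all(f.count("{") == f.count("}") for f in formulas):
        score += 0.25

    # 2. Table structure: HTML-style table tags must be properly closed.
    for tag in ("table", "tr", "td", "th"):
        if pred.count(f"<{tag}") != pred.count(f"</{tag}>"):
            break
    else:
        score += 0.25

    # 3. Hierarchical tags: Markdown emphasis and code markers should pair up.
    if pred.count("**") % 2 == 0 and pred.count("`") % 2 == 0:
        score += 0.25

    # 4. Text accuracy: crude character-level overlap with the reference transcript.
    matches = sum(a == b for a, b in zip(pred, reference))
    score += 0.25 * (matches / max(len(reference), 1))

    return score

In GRPO, a score like this would be computed for each sampled output in a group, and the relative scores drive the policy update.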

3. Performance Comparison: What Does 92.94% Mean?

FireRed-OCR-2B's performance on OmniDocBench v1.5:

Model               Overall Score   Parameters   Type
FireRed-OCR-2B      92.94%          2B           End-to-End
Qwen3.5-397B        90.80%          397B         End-to-End
Gemini-3.0 Pro      90.33%          -            End-to-End
DeepSeek-OCR 2      91.09%          -            End-to-End
GLM-OCR             94.60%          -            Pipeline
PaddleOCR-VL-1.5    94.50%          1.5B         Pipeline

FireRed-OCR is the optimal end-to-end single-model solution. GLM-OCR and PaddleOCR-VL-1.5 use pipeline approaches (multiple specialized models in series), achieving higher scores but requiring more complex deployment.

For text recognition alone (OCRBench TextRec), FireRed-OCR-2B scores 93.5, ranking first among all models, surpassing GPT-5.2 (93.0) and Gemini-3.0 Pro (91.9).

In FireRedBench (the team's proprietary "stress test" benchmark featuring non-standard document layouts), FireRed-OCR-2B scores 74.62, ranking first among end-to-end solutions, surpassing the pipeline GLM-OCR (74.33) and slightly trailing PaddleOCR-VL-1.5 (76.47).

For reference, the baseline Qwen3-VL-2B-Instruct scores only 65.58 on the same benchmark, underscoring how much the specialized training adds.

4. Deployment: A Few Lines of Code

With 2B parameters, the model uses approximately 4-5GB VRAM at bfloat16 precision. A single RTX 3090 / A10 GPU is sufficient for smooth inference.
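
As a rough back-of-envelope check (weights only, bfloat16, ignoring activations and the KV cache):

# 2B parameters x 2 bytes each (bfloat16) for the weights alone
weights_gb = 2e9 * 2 / 1024**3
print(f"weights alone: {weights_gb:.1f} GB")  # ~3.7 GB; activations + KV cache push real usage to ~4-5 GB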

Installation:

pip install torch transformers accelerate qwen-vl-utils
git clone https://github.com/FireRedTeam/FireRed-OCR.git
cd FireRed-OCR

Inference Example:

import torch
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from conv_for_infer import generate_conv  # prompt builder shipped in the FireRed-OCR repo

# Load the model in bfloat16 and let accelerate place it on the available GPU(s).
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "FireRedTeam/FireRed-OCR",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("FireRedTeam/FireRed-OCR")

image_path = "./examples/complex_table.png"
messages = generate_conv(image_path)

inputs = processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_dict=True, return_tensors="pt"
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=8192)
# Drop the prompt tokens so only the newly generated Markdown is decoded.
generated_ids_trimmed = [
    out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True)
print(output_text)  # Standard Markdown format

Performance Optimization Tips:

  • Enable flash_attention_2 to significantly reduce peak VRAM and improve throughput (see the snippet after this list)
  • max_new_tokens defaults to 8192; keep this value or higher for dense academic papers
  • Image quality matters significantly; use images ≥150 DPI for best results
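
For the first tip, a minimal sketch of requesting Flash Attention 2 at load time, using the standard transformers attn_implementation argument (it assumes the flash-attn package is installed and your GPU supports it; otherwise loading raises an error):

# Same load call as above, but with Flash Attention 2 enabled.
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "FireRedTeam/FireRed-OCR",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # requires the flash-attn package and a supported GPU
    device_map="auto",
)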

5. Ideal Use Cases

FireRed-OCR excels at document parsing requiring structural integrity:

  • Academic papers (with formulas)
  • Financial reports and tables
  • Technical documentation
  • Multi-column book scans

For downstream tasks where "tables must not break" and "formulas must be correct" are critical requirements, it's currently the most reliable end-to-end solution.

Not Suitable For:

  • You need the absolute highest accuracy and have the engineering resources to maintain a multi-model system → choose PaddleOCR-VL-1.5 or GLM-OCR instead
  • Poor-quality scans (<100 DPI) → performance degrades significantly

6. Conclusion

FireRed-OCR demonstrates the power of "specialized optimization": instead of relying on parameter scale, it uses a carefully designed training framework to enable a 2B model to outperform 397B general models on specialized tasks.

For vertical tasks, specialized training is more efficient than simply scaling up models.

Resources