Xiaohongshu (Little Red Book) has open-sourced FireRed-OCR, a remarkable 2B parameter model that achieves an impressive 92.94% on OmniDocBench v1.5. To put this in perspective, it outperforms Qwen3.5-397B (90.80%) and Gemini-3.0 Pro (90.33%). Released under the Apache 2.0 license, both the code and model weights are available for commercial use.
1. The "Structural Hallucination" Pain Point in Document Parsing
General-purpose large models face a common challenge when reading PDFs: they recognize text accurately but struggle with structural understanding.
Typical issues include:
- Table rows and columns get scrambled, with data misattributed
- Mathematical formulas are "creative" with extra symbols appearing out of nowhere
- Multi-column documents have a scrambled reading order, with text crossing between columns
This isn't an occasional bug. General VLMs are trained to generate semantically coherent text but lack precise constraints on the pixel-level spatial structure of documents.
FireRed-OCR takes a direct approach: transforming general VLMs into "structural engineers" through a systematic training framework that enforces strict constraints on format and syntax.
2. Technical Architecture: Three-Stage Training + Format-Constrained GRPO
FireRed-OCR isn't just a simple fine-tune; it's a complete training pipeline.
Stage 1: Multi-Task Pre-Alignment
At the visual perception level, the model learns object detection, region identification, and layout-to-Markdown mapping, building a solid foundation for spatial localization.
Stage 2: Specialized Supervised Fine-Tuning (SFT)
The model is precisely tuned on high-quality, standardized Markdown datasets, ensuring logical consistency and hierarchical expression capabilities in its outputs.
Stage 3: Format-Constrained Reinforcement Learning (GRPO)
The core innovation lies here. GRPO (Group Relative Policy Optimization) is a reinforcement learning method. FireRed-OCR introduces specialized format reward signals on this foundation, covering four dimensions:
| Dimension | Reward Signal |
|---|---|
| Formula Syntax Correctness | Is LaTeX valid? |
| Table Structure Integrity | Are tags properly closed? |
| Hierarchical Tag Closure | Is Markdown nesting correct? |
| Text Accuracy | Character-level recognition precision |
Each model output is scored across these four dimensions, and the score is fed back as a reward signal that drives self-correction.
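As a minimal sketch (not the actual FireRed-OCR reward implementation), two of these checks — balanced LaTeX braces and properly closed table tags — could be encoded like this, with each satisfied constraint contributing equally to a scalar reward:

```python
import re

def latex_braces_balanced(formula: str) -> bool:
    # Formula-syntax check: every "{" must have a matching "}"
    depth = 0
    for ch in formula:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0

def table_tags_closed(markup: str) -> bool:
    # Table-structure check: each opening tag count must match its closing count
    for tag in ("table", "tr", "td"):
        opens = len(re.findall(rf"<{tag}\b", markup))
        closes = len(re.findall(rf"</{tag}>", markup))
        if opens != closes:
            return False
    return True

def format_reward(output: str) -> float:
    # Average of the satisfied constraints; a full reward
    # implementation would also score hierarchy and text accuracy
    checks = [latex_braces_balanced(output), table_tags_closed(output)]
    return sum(checks) / len(checks)

print(format_reward(r"\frac{a}{b} <table><tr><td>x</td></tr></table>"))  # 1.0
```

A real GRPO setup would compute this reward for each sampled completion in a group and optimize the policy toward the higher-scoring ones.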
3. Performance Comparison: What Does 92.94% Mean?
FireRed-OCR-2B's performance on OmniDocBench v1.5:
| Model | Overall Score | Parameters | Type |
|---|---|---|---|
| FireRed-OCR-2B | 92.94% | 2B | End-to-End |
| Qwen3.5-397B | 90.80% | 397B | End-to-End |
| Gemini-3.0 Pro | 90.33% | - | End-to-End |
| DeepSeek-OCR 2 | 91.09% | - | End-to-End |
| GLM-OCR | 94.60% | - | Pipeline |
| PaddleOCR-VL-1.5 | 94.50% | 1.5B | Pipeline |
Among end-to-end single-model solutions, FireRed-OCR scores highest. GLM-OCR and PaddleOCR-VL-1.5 use pipeline approaches (multiple specialized models in series), achieving higher overall scores but requiring more complex deployment.
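The deployment difference can be illustrated with toy stand-ins (all function names here are hypothetical, not APIs from any of these projects): a pipeline chains separate stages whose errors propagate forward, while an end-to-end model maps pixels straight to Markdown.

```python
def detect_layout(image):
    # Stand-in for a layout-detection model: region boxes in reading order
    return [{"kind": "title"}, {"kind": "table"}]

def recognize_text(region):
    # Stand-in for a per-region OCR model
    return "Quarterly Revenue" if region["kind"] == "title" else "| Q1 | Q2 |"

def assemble_markdown(regions, texts):
    # Stand-in for reading-order / formatting logic
    lines = []
    for region, text in zip(regions, texts):
        lines.append("# " + text if region["kind"] == "title" else text)
    return "\n".join(lines)

def pipeline_parse(image):
    # Pipeline: three separate stages, each deployed and maintained on its own
    regions = detect_layout(image)
    texts = [recognize_text(r) for r in regions]
    return assemble_markdown(regions, texts)

print(pipeline_parse(None))
```

An end-to-end model collapses all three stages into a single generate call, which is why it is simpler to serve but historically harder to keep structurally faithful.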
For text recognition alone (OCRBench TextRec), FireRed-OCR-2B scores 93.5, ranking first among all models, surpassing GPT-5.2 (93.0) and Gemini-3.0 Pro (91.9).
In FireRedBench (the team's proprietary "stress test" benchmark featuring non-standard document layouts), FireRed-OCR-2B scores 74.62, ranking first among end-to-end solutions, surpassing the pipeline GLM-OCR (74.33) and slightly trailing PaddleOCR-VL-1.5 (76.47).
For reference, the baseline Qwen3-VL-2B-Instruct scores only 65.58 on the same benchmark, so the training framework accounts for a gain of roughly nine points.
4. Deployment: A Few Lines of Code
With 2B parameters, the model uses approximately 4-5GB VRAM at bfloat16 precision. A single RTX 3090 / A10 GPU is sufficient for smooth inference.
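The weight footprint follows from simple arithmetic (activations and the KV cache add more on top, which is where the 4-5GB figure comes from):

```python
# Back-of-the-envelope VRAM estimate for the weights alone
params = 2e9           # 2B parameters
bytes_per_param = 2    # bfloat16 stores each parameter in 2 bytes
weight_gb = params * bytes_per_param / 1024**3
print(f"{weight_gb:.1f} GB")  # 3.7 GB
```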
Installation:

```bash
pip install transformers qwen-vl-utils
git clone https://github.com/FireRedTeam/FireRed-OCR.git
cd FireRed-OCR
```
Inference Example:

```python
import torch
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from conv_for_infer import generate_conv  # helper script from the FireRed-OCR repo

model = Qwen3VLForConditionalGeneration.from_pretrained(
    "FireRedTeam/FireRed-OCR",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("FireRedTeam/FireRed-OCR")

image_path = "./examples/complex_table.png"
messages = generate_conv(image_path)
inputs = processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_dict=True, return_tensors="pt"
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=8192)
# Strip the prompt tokens so only the newly generated text is decoded
generated_ids_trimmed = [
    out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True)
print(output_text)  # standard Markdown output
```
Performance Optimization Tips:
- Enable `flash_attention_2` to significantly reduce peak VRAM and improve throughput
- `max_new_tokens` defaults to 8192; keep this value or higher for dense academic papers
- Image quality matters significantly; use images of at least 150 DPI for best results
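Enabling FlashAttention is a loading-time option; a sketch, assuming the `flash-attn` package is installed and reusing the model class from the example above:

```python
import torch
from transformers import Qwen3VLForConditionalGeneration

# attn_implementation="flash_attention_2" requires a compatible GPU
# and the flash-attn package; reduces peak VRAM and improves throughput
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "FireRedTeam/FireRed-OCR",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
```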
5. Ideal Use Cases
FireRed-OCR excels at document parsing requiring structural integrity:
- Academic papers (with formulas)
- Financial reports and tables
- Technical documentation
- Multi-column book scans
For downstream tasks where "tables must not break" and "formulas must be correct" are critical requirements, it's currently the most reliable end-to-end solution.
Not Suitable For:
- Workloads with extreme accuracy requirements and the engineering resources to maintain a multi-model system → choose PaddleOCR-VL-1.5 or GLM-OCR
- Poor quality scans (<100 DPI) → Performance degrades significantly
6. Conclusion
FireRed-OCR demonstrates the power of specialized optimization: instead of relying on parameter scale, it uses a carefully designed training framework to enable a 2B model to outperform general models with hundreds of billions of parameters on specialized tasks.
For vertical tasks, specialized training is more efficient than simply scaling up models.