DocReward: A Document Reward Model for Structuring and Stylizing


TL;DR

This paper proposes a document reward model called DocReward, which is trained on a large-scale dataset of 117K document pairs and is specifically designed to evaluate the structural and stylistic professionalism of documents. It significantly outperforms powerful baseline models such as GPT-5 on this task.

Background and Motivation

Automated generation of professional documents within agentic workflows is an increasingly important research direction. Existing work, however, focuses mainly on improving the quality of textual content while overlooking the visual structure and style that are crucial for readability and professionalism.

The key bottleneck is the lack of a suitable reward model to guide agents toward documents with more professional structure and style. Aesthetic evaluation models exist for graphic design, UI screenshots, or single images, but they do not transfer to multi-page documents; traditional document-AI models (such as LayoutLM) focus on extracting information from documents rather than judging their layout quality.

The core question this paper addresses is therefore: how can the structural and stylistic professionalism of a document be evaluated quantitatively, and how can that evaluation be turned into a reward model that effectively guides document-generation agents?

Method

This paper proposes DocReward, a reward model focused on evaluating the structural and stylistic professionalism of documents. Its core lies in constructing a high-quality preference dataset, DocStruct-117K, and training the model on it for scoring.

Dataset Construction (DocStruct-117K)

To enable the model to learn professionalism evaluation that is independent of text content, the paper designs an elaborate dataset construction pipeline:

(Figure: overview of the dataset-construction pipeline.)

  1. Collect high-quality source documents: First, a large number of human-created, high-quality professional documents were collected from sources such as GovDocs1, NapierOne, and CommonCrawl (e.g., government reports, business proposals, academic papers). After filtering, examples with strong structure and style were retained.

(Figures: top-10 document domain distribution, out of 32 domains in total; top-30 document type distribution.)

  2. Generate diverse counterpart documents: The plain-text content of each source document was extracted, and agents driven by several large language models (e.g., GPT-4o and GPT-5) were used to regenerate DOCX documents from scratch. This simulates the real-world scenario of producing professional documents from plain text, yielding many versions with different structures and styles but identical content. In addition, an "improvement agent" was designed to refine generated documents by consulting the original document.

  3. Annotate preference pairs: Documents with the same content were paired and labeled with winner/loser relationships. The annotation rules are as follows:

    • Human vs. generated: If one document is the original professional document created by a human and the other is generated by an agent, the original document is always labeled as the “winner.”
    • Generated vs. generated: If both documents are generated by agents, GPT-5 is used as a proxy annotator. By providing GPT-5 with the original professional document as a reference, it judges which of the two generated documents is closer to the reference standard.
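The two labeling rules above can be sketched in a few lines of Python. The `Document` type and the `gpt5_prefers` proxy-judge callback are hypothetical names for illustration; the paper does not specify its internal data structures.

```python
# Sketch of DocStruct-117K's pair-labeling rules (names are assumptions).
from dataclasses import dataclass


@dataclass
class Document:
    doc_id: str
    is_human_original: bool  # True for a collected human-created source document


def label_pair(a: Document, b: Document, gpt5_prefers=None):
    """Return (winner, loser) for a content-matched document pair."""
    # Rule 1: a human-created original always beats an agent-generated version.
    if a.is_human_original and not b.is_human_original:
        return a, b
    if b.is_human_original and not a.is_human_original:
        return b, a
    # Rule 2: for two generated documents, defer to a GPT-5 proxy judge
    # that compares both against the human-created reference document.
    return (a, b) if gpt5_prefers(a, b) else (b, a)
```

Note that rule 1 needs no model call at all, which is what makes the 36,664 human-vs-generated pairs cheap to label.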

Through this process, the DocStruct-117K dataset was ultimately built, containing 117,108 document pairs.


| Domains | Document Types | Documents | Avg. Pages | Total Document Pairs | Human vs. Generated | Generated vs. Generated |
|---|---|---|---|---|---|---|
| 32 | 267 | 69,137 | 3.2 | 117,108 | 36,664 | 80,444 |


Model Architecture and Optimization

DocReward is trained with a standard pairwise preference (Bradley-Terry-style) objective over the page images of the winner and loser documents, \(D_{\mathrm{img}}^{w}\) and \(D_{\mathrm{img}}^{l}\):

\[\min_{\theta}-\log\sigma\big(\mathcal{R}_{\theta}(D_{\mathrm{img}}^{w})-\mathcal{R}_{\theta}(D_{\mathrm{img}}^{l})\big)\]

where $\mathcal{R}_{\theta}$ is the reward model and $\sigma$ is the sigmoid function. This loss penalizes the model when it assigns a higher score to the loser than to the winner.
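The objective above reduces to a one-line function of the score margin. Here is a minimal numeric sketch; in practice the two scores would come from the trained multimodal reward model rather than being passed in directly.

```python
import math


def bt_loss(score_winner: float, score_loser: float) -> float:
    """Pairwise preference loss: -log sigmoid(R(winner) - R(loser)).

    Written as log(1 + exp(-margin)) for numerical stability, which is
    algebraically identical to -log(sigmoid(margin)).
    """
    margin = score_winner - score_loser
    return math.log1p(math.exp(-margin))
```

When the winner outscores the loser by a wide margin the loss approaches zero; a reversed ranking (loser scored higher) is penalized increasingly steeply, which is exactly the gradient signal that teaches the model to rank the professional document higher.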


Experimental Conclusions

This paper comprehensively validates the effectiveness of DocReward through both internal and external evaluations.

Internal Evaluation: Accuracy Surpasses Strong Baselines

On a test set annotated by human experts, DocReward performs far better than all baseline models, including GPT-5.


| Model Type | Model | Real vs. Synth (Acc. %) | Synth vs. Synth (Acc. %) | Overall (Acc. %) |
|---|---|---|---|---|
| Pairwise | GPT-4o | 58.91 | 66.43 | 63.22 |
| Pairwise | Claude Sonnet 4 | 57.86 | 69.02 | 64.26 |
| Pairwise | GPT-5 | 64.78 | 72.32 | 69.10 |
| Pointwise | GPT-4o | 50.99 | 64.21 | 58.56 |
| Pointwise | Claude Sonnet 4 | 48.02 | 66.79 | 58.77 |
| Pointwise | GPT-5 | 64.85 | 73.43 | 69.77 |
| Pointwise | DocReward-3B (this paper) | 72.77 | 97.42 | 86.89 |
| Pointwise | DocReward-7B (this paper) | 78.22 | 97.42 | 89.22 |
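For the pointwise rows in the table, accuracy is simply the fraction of annotated pairs where the model's scalar score ranks the winner above the loser. A small hypothetical helper showing that computation (the function and argument names are illustrative, not from the paper):

```python
def pairwise_accuracy(scores, pairs):
    """Percentage of (winner, loser) pairs where the pointwise model
    scores the winner strictly higher.

    `scores` maps document id -> scalar reward;
    `pairs` is a list of (winner_id, loser_id) tuples.
    """
    correct = sum(scores[w] > scores[l] for w, l in pairs)
    return 100.0 * correct / len(pairs)
```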



The pairwise baselines also exhibit position bias: GPT-4o and Claude Sonnet 4 noticeably favor whichever document is presented second, rather than judging the pair symmetrically.

| Reward Model | Times Preferring Position 1 | Times Preferring Position 2 |
|---|---|---|
| GPT-4o | 202 | 271 |
| Claude Sonnet 4 | 189 | 284 |
| GPT-5 | 240 | 233 |


External Evaluation: Effectively Guiding Document Generation

To verify DocReward's practical value, the paper conducted an external evaluation: a document-generation agent produced multiple candidate documents, and Random selection, GPT-5, and DocReward were each used as the reward model to select the best version. The selected documents were then judged by human evaluators.
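This selection step is best-of-N reranking: score every candidate and keep the top one. A minimal sketch, where `reward_fn` stands in for DocReward's pointwise scorer (an assumption; the paper's actual scoring interface is not specified here):

```python
def select_best(candidates, reward_fn):
    """Best-of-N reranking: return the candidate document that the
    reward model scores highest."""
    return max(candidates, key=reward_fn)
```

Because the reward model only reranks finished candidates, it can improve output quality without any change to the generation agent itself; a better reward signal directly translates into a better pick, which is what the win-rate table below measures.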


| Reward Model | Win Rate (%) | Loss Rate (%) | Tie Rate (%) |
|---|---|---|---|
| Random | 24.6 | 66.2 | 9.2 |
| GPT-5 | 37.7 | 40.0 | 22.3 |
| DocReward (this paper) | 60.8 | 16.9 | 22.3 |


Documents selected with DocReward achieved a 60.8% win rate, far above GPT-5's 37.7%. This shows that DocReward's reward signal is closely aligned with human preferences on structure and style, and that it can effectively steer the generation agent toward documents humans favor.

Interpretability Analysis

Case studies and attention-map visualizations indicate that DocReward does focus on the right signals when scoring documents.

(Figure: attention-map visualizations.)

Summary

The experimental results strongly demonstrate that DocReward outperforms existing general-purpose large models in evaluating the structural and stylistic professionalism of documents, and can serve as an effective reward model to substantially improve the final quality of automated document generation.