A Survey on Efficient Large Language Model Training: From Data-centric Perspectives


Introduction

Post-training of Large Language Models (LLMs) has become a critical stage for unlocking their domain adaptation capabilities and task generalization potential. This stage effectively enhances the model’s abilities in long-context reasoning, human alignment, instruction following, and domain-specific adaptation.

Figure illustration

However, in the post-training stage of LLMs, data, the core driving force behind model evolution, faces a severe “data challenge”: the cost of manually annotating high-quality data is rising rapidly, while the marginal gains from simply scaling up data are diminishing. In addition, static datasets inherently limit the model’s ability to adapt to ever-changing real-world knowledge. The near-linear dependence of model performance on data volume stems fundamentally from the inefficient use of data in traditional post-training paradigms.

Researchers have explored various methods to fully exploit the potential of data in LLM post-training, but the field still lacks a comprehensive survey. This article presents the first systematic survey of data-efficient LLM post-training from a data-centric perspective. Specifically, it proposes the concept of a “data flywheel” (as shown in the figure), which consists of five key components: data selection, data quality enhancement, synthetic data generation, data distillation and compression, and a self-evolving data ecosystem. Based on this framework, this article categorizes existing work, summarizes the key components, and points out future research directions.

Differences from prior surveys: Although existing surveys have examined certain aspects of LLM post-training, such as data selection, synthetic data generation, model self-feedback, self-evolution, trustworthiness, and time efficiency, these studies mainly focus on individual components rather than an overall perspective. This survey systematically examines these methods through the lens of data efficiency, filling the gap and providing key insights for maximizing data value extraction.

Taxonomy

This section divides data-efficient LLM post-training methods into five core categories: data selection, data quality enhancement, synthetic data generation, data distillation and compression, and self-evolving data ecosystems.

The table below compares the performance of these five categories across key dimensions, where more “+” signs indicate higher requirements or better performance.

| Method Category                 | Data Efficiency | Computational Demand | Model Capability Dependence | Quality Requirement | Domain Adaptability |
|---------------------------------|-----------------|----------------------|-----------------------------|---------------------|---------------------|
| Data Selection                  | +++             | ++                   | +                           | +++                 | ++                  |
| Data Quality Enhancement        | ++              | ++                   | ++                          | ++                  | ++                  |
| Synthetic Data Generation       | ++              | +++                  | +++                         | +                   | +++                 |
| Data Distillation and Compression | +++           | ++                   | +++                         | ++                  | ++                  |
| Self-Evolving Data Ecosystems   | ++              | +++                  | +++                         | +                   | +++                 |

These five categories complement one another: selection filters high-quality data, enhancement improves data utility, generation expands data coverage, distillation condenses knowledge, and self-evolution enables continuous improvement. Together, they pursue the goal of maximizing model performance with minimal data requirements.

Data Selection

Data selection is crucial for improving the efficiency of LLM post-training by identifying high-value data subsets. As shown in the figure below, this article divides existing methods into four dimensions: (1) static filtering based on intrinsic data properties, (2) dynamic selection that adapts during training, (3) agent-based strategies using collaborative mechanisms, and (4) labeling efficiency achieved through human-machine collaboration.

Figure illustration

Static Filtering

Static filtering identifies samples with high information density and representativeness by evaluating intrinsic data properties offline.
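As a minimal illustration of this idea (not any specific method from the literature), the sketch below applies offline heuristics, here hypothetical word-count bounds and a type-token-ratio proxy for information density, to keep or drop samples without ever consulting the model:

```python
def lexical_diversity(text: str) -> float:
    """Type-token ratio: distinct words / total words (a crude proxy for information density)."""
    words = text.lower().split()
    return len(set(words)) / len(words) if words else 0.0

def static_filter(samples, min_words=5, max_words=512, min_diversity=0.5):
    """Keep samples that pass offline heuristics; no model is in the loop."""
    kept = []
    for s in samples:
        n = len(s.split())
        if min_words <= n <= max_words and lexical_diversity(s) >= min_diversity:
            kept.append(s)
    return kept

corpus = [
    "ok ok ok ok ok ok",                                               # repetitive, filtered out
    "Explain how gradient checkpointing trades compute for memory.",   # kept
    "hi",                                                              # too short, filtered out
]
print(static_filter(corpus))
```

The thresholds here are illustrative; real pipelines typically combine many such signals (perplexity, deduplication hashes, quality classifiers).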

Dynamic Selection

Dynamic methods adaptively adjust data weights by evaluating each sample’s importance to the model.
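One common instantiation of this idea, sketched below with hypothetical loss values rather than a real training loop, turns per-sample losses measured at the current step into sampling weights, so that harder examples receive a larger share of the batch budget:

```python
import math

def softmax_weights(losses, temperature=1.0):
    """Turn per-sample losses into sampling weights: higher loss -> higher weight."""
    exps = [math.exp(l / temperature) for l in losses]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical per-sample losses measured at the current training step.
losses = [0.2, 1.5, 0.9]
weights = softmax_weights(losses)
# The hardest sample (loss 1.5) receives the largest weight.
print([round(w, 3) for w in weights])
```

Because the losses change as the model trains, the weights are recomputed each step, which is exactly what distinguishes dynamic selection from static filtering.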

Agent-Based Strategies

Agent-based methods leverage collaborative mechanisms for reliable data selection.

Labeling Efficiency

These methods effectively optimize the labeling process through iterative human-machine collaboration.
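A classic form of such collaboration is uncertainty-based active learning: the model ranks unlabeled samples by how unsure it is, and only the most ambiguous ones are sent to human annotators. The sketch below uses a mock predictor standing in for a real model:

```python
def uncertainty(prob):
    """Margin-based uncertainty for a binary classifier: closer to 0.5 = more uncertain."""
    return 1.0 - abs(prob - 0.5) * 2

def select_for_labeling(pool, predict, budget=2):
    """Pick the `budget` most uncertain unlabeled samples for human annotation."""
    ranked = sorted(pool, key=lambda x: uncertainty(predict(x)), reverse=True)
    return ranked[:budget]

# Mock model: pretend confidence grows with sample length.
predict = lambda text: min(0.5 + 0.05 * len(text.split()), 0.99)

pool = ["short", "a somewhat longer ambiguous sentence", "mid length text here"]
print(select_for_labeling(pool, predict))
```

Iterating this loop (label, retrain, re-rank) concentrates the annotation budget where each label is most informative.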

Discussion

Current data selection methods face challenges in aligning static metrics with dynamic model needs, managing the computational complexity of the optimization process, and achieving cross-domain generalization. Future research directions point to meta-learning-based selection frameworks, causal inference for sample analysis, and efficiency-aware optimization that takes hardware constraints into account.

Data Quality Enhancement

As shown in the figure below, improving data quality is crucial for maximizing the effectiveness of LLM post-training. Through semantic refinement, toxicity control, and distribution stabilization, researchers aim to improve the informativeness, safety, and robustness of training data. This article categorizes existing methods into three directions.

Figure illustration

Semantic Rewriting

Semantic rewriting focuses on increasing data diversity through controlled transformations while preserving the original meaning.
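A typical pipeline prompts an LLM to paraphrase under explicit constraints and then applies a cheap post-hoc check that the rewrite preserved protected content. The sketch below is a hypothetical template and entity check, not a specific published method; the LLM call itself is omitted:

```python
REWRITE_PROMPT = (
    "Paraphrase the following instruction without changing its meaning, "
    "entities, or required output format.\n\nInstruction: {text}\n\nParaphrase:"
)

def build_rewrite_prompt(text: str) -> str:
    """Construct the controlled-rewriting prompt for a given sample."""
    return REWRITE_PROMPT.format(text=text)

def preserves_entities(rewrite: str, entities) -> bool:
    """Cheap post-hoc filter: every protected entity must survive the rewrite."""
    return all(e in rewrite for e in entities)

original = "Summarize the 2017 Transformer paper in three sentences."
prompt = build_rewrite_prompt(original)
candidate = "In three sentences, give a summary of the 2017 Transformer paper."
print(preserves_entities(candidate, ["Transformer", "2017", "three"]))
```

Substring checks are a weak filter; production systems typically add semantic-similarity or NLI-based verification on top.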

Toxicity Control

Mitigating harmful content is a key part of data quality enhancement.

Distribution Stabilization

Stabilizing data distributions ensures that models generalize well across different tasks and domains.

Discussion

Semantic rewriting, toxicity control, and distribution stabilization are key strategies for improving the quality of LLM post-training data. Future work should integrate these methods into a unified framework to maximize data diversity and model performance while reducing costs.

Synthetic Data Generation

Generating synthetic training data is a powerful strategy for overcoming data scarcity and enhancing the robustness of LLM post-training. As shown in the figure below, synthetic data generation methods can be divided into three categories: instruction-driven generation, knowledge-guided generation, and adversarial generation.

Figure illustration

Instruction-Driven Synthetic Data Generation

Instruction-driven methods leverage LLMs’ ability to directly generate new samples from task prompts. For example, SynPO generates preference pairs for alignment tasks, Magpie enables template-free instruction generation, while other work synthesizes proof steps, significantly improving GPT-4’s proof capabilities.
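The common skeleton behind such methods is a self-instruct-style loop: prompt the model with the current instruction pool, and admit a generated candidate only if it is sufficiently novel. The sketch below uses a mock generator in place of an LLM call and simple word overlap as the novelty filter:

```python
def word_overlap(a: str, b: str) -> float:
    """Jaccard similarity over word sets, used to reject near-duplicates."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def grow_instruction_pool(seeds, generate, rounds=3, max_sim=0.7):
    """Self-instruct-style loop: generate from the pool, keep only novel instructions."""
    pool = list(seeds)
    for _ in range(rounds):
        candidate = generate(pool)
        if all(word_overlap(candidate, p) < max_sim for p in pool):
            pool.append(candidate)
    return pool

# Mock generator standing in for an LLM call.
canned = iter([
    "List three uses of binary search.",
    "List three uses of binary search.",      # duplicate, will be rejected
    "Write a haiku about garbage collection.",
])
pool = grow_instruction_pool(["Explain quicksort."], lambda p: next(canned))
print(pool)
```

Real systems replace the overlap filter with embedding similarity (e.g. ROUGE or cosine thresholds) and add quality filtering before admitting candidates.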

Knowledge-Guided Synthetic Data Generation

Knowledge-guided methods integrate external knowledge to guide data generation.

Adversarial Generation

Adversarial generation methods systematically probe model vulnerabilities to enhance robustness. For example, one work uses agent-based simulation to generate edge cases, reducing error rates on dialectal variations by 19%; another introduces contrastive unlearning to address data defects; ToxiCraft generates subtle harmful content, revealing significant gaps in commercial safety filters.
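The underlying recipe, stripped to its simplest form, is: perturb an input, re-run the target system, and keep any variant whose output flips. The sketch below probes a toy keyword-based safety filter with exhaustive adjacent-character swaps; this is an illustration of the pattern, not any method cited above:

```python
def adjacent_swaps(text: str):
    """Yield every variant of `text` with one pair of adjacent characters swapped."""
    for i in range(len(text) - 1):
        chars = list(text)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
        yield "".join(chars)

def find_failures(text: str, classify):
    """Keep perturbed variants whose label flips relative to the clean input."""
    base = classify(text)
    return [adv for adv in adjacent_swaps(text) if classify(adv) != base]

# Mock safety filter: flags text containing the exact substring "attack".
classify = lambda t: "flagged" if "attack" in t else "clean"
failures = find_failures("plan an attack now", classify)
print(len(failures))  # typo-style variants that evade the brittle mock filter
```

Every discovered failure becomes a new training example, which is how adversarial generation converts model weaknesses into data.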

Discussion

Each method has its trade-offs: instruction-driven methods scale quickly but risk semantic drift; knowledge-guided methods preserve fidelity through structured constraints; adversarial generation enhances robustness by exposing vulnerabilities. Future work should combine the strengths of these methods while continuing to focus on optimizing generation costs and developing theoretical foundations.

Data Distillation and Compression

Data distillation and compression techniques improve the efficiency of LLM post-training by reducing data complexity while preserving performance. As shown in the figure below, this includes three complementary approaches: model distillation for knowledge transfer, data distillation for dataset compression, and joint compression for unified optimization.

Figure illustration

Model Distillation

Model distillation transfers knowledge from large teacher models to smaller student models while preserving performance. Recent advances include Impossible Distillation, which can create high-quality student models from low-quality teacher models; and Cross-Tokenizer Distillation, which enables knowledge transfer across different architectures through universal logit distillation. For edge deployment, XAI-driven distillation and BitDistiller produce interpretable models and achieve sub-4-bit precision, respectively.
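The classic objective shared by these approaches is a KL divergence between temperature-softened teacher and student distributions, scaled by T^2 (following standard knowledge distillation). A minimal pure-Python sketch over a single logit vector:

```python
import math

def softmax(logits, T=1.0):
    """Temperature-softened softmax over a list of logits."""
    exps = [math.exp(l / T) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def distill_loss(teacher_logits, student_logits, T=2.0):
    """KL(teacher || student) on temperature-softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return T * T * kl

teacher = [2.0, 0.5, -1.0]
mismatched = [0.0, 0.0, 0.0]
print(distill_loss(teacher, teacher))       # a perfectly matching student incurs zero loss
print(distill_loss(teacher, mismatched) > 0)
```

In practice this soft-target term is combined with the ordinary cross-entropy on ground-truth labels, and cross-tokenizer variants must first align the two models' vocabularies.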

Data Distillation

Data distillation focuses on selecting high-information-density samples to create compact yet representative datasets. Studies show that LLM-generated labels can effectively train classifiers comparable to human annotations. LLMLingua-2 achieves prompt compression through token-level distillation. Domain-specific applications include Self-Data Distillation for model fine-tuning and multi-teacher distillation for medical data integration.
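At its core, this reduces to ranking samples by an information score and keeping only a small top fraction. The sketch below uses a deliberately simple hypothetical score (vocabulary diversity) in place of the learned scorers used by real systems:

```python
def distill_dataset(samples, score, keep_ratio=0.3):
    """Keep the top `keep_ratio` fraction of samples by an information score."""
    k = max(1, int(len(samples) * keep_ratio))
    return sorted(samples, key=score, reverse=True)[:k]

# Hypothetical score: reward varied vocabulary (fewer repeated words).
score = lambda s: len(set(s.split())) / len(s.split())

data = [
    "the cat sat on the mat",
    "quantum error correction stabilizes logical qubits",
    "yes yes yes yes",
]
print(distill_dataset(data, score, keep_ratio=0.34))
```

The interesting design question is the scoring function itself: gradient influence, loss trajectories, or LLM-assigned quality labels all slot into the same `score` hook.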

Joint Compression

Joint compression combines model compression with data selection to optimize overall efficiency. Some work jointly optimizes structured pruning and label smoothing, compressing LLaMA-7B to 2.8B parameters with minimal performance loss. Efficient Edge Distillation achieves adaptive width scaling on edge devices through supernet training. In recommender systems, Prompt Distillation aims to align ID-based and text-based representations to reduce inference time.

Discussion

These three methods offer complementary advantages for improving LLM efficiency: model distillation optimizes architecture, data distillation selects high-impact samples, and joint compression unifies model and data optimization. Future research should focus on integrating these methods, especially for edge AI and low-resource application scenarios.

Self-Evolving Data Ecosystems

Self-evolving data ecosystems strategically optimize LLM post-training through autonomous data generation, real-time feedback, and continuous learning. As shown in the figure below, this ecosystem forms a closed loop of generation, evaluation, and adaptive training. This article discusses its three key components: self-iterative optimization, dynamic evaluation feedback, and LLM-as-a-Judge.

Figure illustration

Self-Iterative Optimization

Self-iterative optimization enables LLMs to use their own outputs to generate new training data, thereby autonomously improving their capabilities.
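The loop structure shared by these methods can be sketched abstractly: generate a candidate from the current pool, verify it, and absorb it into the training data if it passes. Below, a mock generator and verifier stand in for the LLM and its quality filter:

```python
from itertools import count

def self_improve(generate, verify, pool, rounds=3):
    """One flywheel turn per round: generate -> verify -> absorb into the pool."""
    for _ in range(rounds):
        candidate = generate(pool)
        if verify(candidate):
            pool.append(candidate)
    return pool

# Mock generator/verifier standing in for an LLM and its quality filter.
ids = count()
generate = lambda pool: f"derived-sample-{next(ids)}"
verify = lambda c: c != "derived-sample-1"    # filter rejects one candidate

pool = self_improve(generate, verify, ["seed"], rounds=3)
print(pool)
```

In a real system, each round would also retrain (or fine-tune) the model on the enlarged pool, closing the loop between data generation and capability growth.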

Dynamic Evaluation Feedback

The dynamic evaluation feedback system allows the model to make real-time adjustments based on its performance, dynamically optimizing its outputs.

LLM-as-a-Judge

The “LLM-as-a-Judge” system represents a paradigm shift from external evaluation to self-evaluation, in which the model evaluates its own or other models’ outputs.
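A common robustness trick in such systems is to aggregate several judge calls by majority vote rather than trusting a single verdict. The sketch below uses mock judges (simple length heuristics) in place of LLM calls:

```python
def judge_vote(judges, answer_a, answer_b):
    """Aggregate several pairwise judge verdicts ('A' or 'B') by majority vote."""
    votes = [j(answer_a, answer_b) for j in judges]
    return "A" if votes.count("A") > votes.count("B") else "B"

# Mock judges standing in for LLM calls: two prefer the longer answer,
# one contrarian prefers the shorter.
prefers_longer = lambda a, b: "A" if len(a) >= len(b) else "B"
prefers_shorter = lambda a, b: "A" if len(a) < len(b) else "B"

verdict = judge_vote(
    [prefers_longer, prefers_longer, prefers_shorter],
    "detailed thorough answer",
    "brief",
)
print(verdict)
```

Production judges also randomize answer order across calls to counteract the position bias LLM judges are known to exhibit.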

Discussion

The combination of self-iterative optimization, dynamic evaluation feedback, and LLM-as-a-Judge creates a powerful framework for autonomous LLM improvement. Although these methods show great potential in reducing human intervention, future work should focus on unifying them into a scalable framework and enabling them to generalize across different tasks.

Challenges and Future Directions