WizardCoder: Empowering Code Large Language Models with Evol-Instruct


TL;DR

This paper proposes Code Evol-Instruct, a method that automatically evolves programming instructions toward greater complexity and uses them to fine-tune code large language models, producing the WizardCoder model family, which achieves outstanding performance on multiple benchmarks.

Key Definitions

The core of this paper revolves around the new method Code Evol-Instruct, which gives rise to the WizardCoder models.

Method

The core contribution of this paper is the Code Evol-Instruct method, which improves the quality of code instructions through iterative evolution and uses the resulting data to train the WizardCoder models.

Figure (method overview): Illustration of the Code Evol-Instruct method.

Method Pipeline

The entire pipeline consists of two steps:

  1. Instruction Evolution: A basic code instruction dataset (Code Alpaca in this paper) is used as the seed; Code Evol-Instruct is then applied to iteratively evolve these instructions.
  2. Model Fine-Tuning: The evolved high-complexity instruction dataset is used to fine-tune pretrained open-source code large language models (such as StarCoder and CodeLlama), ultimately producing the WizardCoder models.

Innovation: The Design of Code Evol-Instruct

The innovation of Code Evol-Instruct lies in its evolution strategy, which is specifically designed for code tasks. It uses a specific prompt template to drive a large language model (such as GPT-3.5) to increase instruction difficulty.

Evolution prompt template:

```
Please increase the difficulty of the given programming test question a bit. You can increase the difficulty using, but not limited to, the following methods:
{method}

{question}
```

Here, {question} is the original instruction to be evolved, and {method} is one of the following five specially designed code evolution heuristics, selected at random:

  1. Add constraints: Add new constraints and requirements to the original problem (about 10 more words).
  2. Replace requirements: Replace a commonly used requirement in a programming task with a less common and more specific one.
  3. Deepen reasoning: If the original problem can be solved with only a few logical steps, add more reasoning steps.
  4. Introduce misleading information: Provide a piece of erroneous code as a reference to increase misdirection (an adversarial-example idea).
  5. Increase complexity requirements: Propose higher time or space complexity requirements (but do not use this frequently).
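As a sketch, randomly selecting a heuristic and assembling the evolution prompt might look like the following. The template text follows the paper's prompt; the heuristic wordings here are condensed paraphrases of the five methods above, and the function name `build_evol_prompt` is an illustrative placeholder, not the paper's actual code.

```python
import random

# Condensed paraphrases of the five evolution heuristics (illustrative wording).
EVOL_METHODS = [
    "Add new constraints and requirements to the original problem, adding about 10 words.",
    "Replace a commonly used requirement with a less common and more specific one.",
    "If the problem needs only a few logical steps, add more reasoning steps.",
    "Provide a piece of erroneous code as a reference to increase misdirection.",
    "Propose higher time or space complexity requirements (use sparingly).",
]

TEMPLATE = (
    "Please increase the difficulty of the given programming test question a bit. "
    "You can increase the difficulty using, but not limited to, the following methods:\n"
    "{method}\n\n"
    "{question}"
)

def build_evol_prompt(question: str) -> str:
    """Pick one heuristic at random and fill in the evolution template."""
    return TEMPLATE.format(method=random.choice(EVOL_METHODS), question=question)
```

The completed prompt is then sent to the evolution LLM (e.g. GPT-3.5), whose response becomes a new, harder instruction.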

Training Process

The training dataset is built starting from the Code Alpaca dataset. Through multiple rounds of iterative evolution with \(Code Evol-Instruct\), the data generated in each round is merged with all previous rounds of data and the original data for model fine-tuning. During training, an external development set is used to determine when to stop evolution (Evol Stop) to prevent performance degradation.
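The round-by-round merging and the Evol Stop criterion described above can be sketched as the loop below. This is a minimal sketch, not the paper's code: `evolve`, `fine_tune`, and `evaluate` are hypothetical stand-ins for the evolution LLM call, supervised fine-tuning, and evaluation on the external development set.

```python
def evol_instruct_training(seed_data, evolve, fine_tune, evaluate):
    """Iteratively evolve data, fine-tune, and stop once the dev score drops."""
    pool = list(seed_data)        # all data so far: original + every evolved round
    current = list(seed_data)     # instructions to evolve in the next round
    best_score, best_model = float("-inf"), None
    while True:
        model = fine_tune(pool)               # fine-tune on the merged pool
        score = evaluate(model)               # external development set
        if score <= best_score:               # Evol Stop: performance degraded
            return best_model
        best_score, best_model = score, model
        current = [evolve(inst) for inst in current]  # next evolution round
        pool.extend(current)                          # merge with prior data
```

The key design choice is that each round trains on the union of all rounds rather than only the newest data, while the dev set guards against over-evolving the instructions.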

Fine-tuning prompt format:

```
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

Instruction:
{instruction}

Response:
```
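Rendering a training sample into this prompt format can be sketched as below; the template string follows the format shown above, while the helper name `format_sample` is hypothetical.

```python
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "Instruction:\n{instruction}\n\n"
    "Response:\n"
)

def format_sample(instruction: str, response: str) -> str:
    """Build one supervised fine-tuning example: prompt followed by the target."""
    return PROMPT_TEMPLATE.format(instruction=instruction) + response
```

During fine-tuning, the loss is typically computed only on the response tokens after "Response:".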

Figure (multilingual performance comparison): WizardCoder-34B shows a significant advantage over the then state-of-the-art open-source models (the CodeLlama-34B series) across multiple programming languages.

Experimental Conclusions

This paper conducts a comprehensive evaluation on five major code generation benchmarks: HumanEval, HumanEval+, MBPP, DS-1000, and MultiPL-E. The experimental results fully validate the outstanding performance of WizardCoder.

Key Experimental Results

Figure (EvalPlus leaderboard): On the EvalPlus leaderboard, WizardCoder-34B outperforms GPT-3.5 on HumanEval+, second only to GPT-4.

| Model | Parameters | HumanEval | MBPP |
|---|---|---|---|
| **Closed-source models** | | | |
| GPT-3.5 (ChatGPT) | Unknown | 48.1 | 52.2 |
| GPT-4 | Unknown | 67.0 | - |
| **Open-source models** | | | |
| StarCoder-15B | 15B | 33.6 | 43.6* |
| CodeLlama-Python-34B | 34B | 53.7 | 56.2 |
| WizardCoder (this paper) | 15B | 57.3 | 51.8 |
| WizardCoder (this paper) | 34B | 71.5 | 61.2 |

Table note: Comparison of pass@1 (%) results on the HumanEval and MBPP benchmarks.
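The pass@1 metric in the table follows the standard unbiased pass@k estimator used in code-generation evaluation (this implementation is a generic sketch, not code from this paper):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    from n generations (c of which are correct) passes the unit tests."""
    if n - c < k:          # every size-k draw must contain a correct sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With k = 1 this reduces to the fraction of correct samples, c / n.
```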

In-depth Analysis and Conclusions