OpenCodeInterpreter

Integrating Code Generation with Execution and Refinement

Tianyu Zheng*1, Ge Zhang*1,2, Tianhao Shen*1, Xueling Liu*1,
Bill Yuchen Lin3, Jie Fu1,4, Wenhu Chen1,2, Xiang Yue*†1,5

1Multimodal Art Projection Research Community, 2University of Waterloo, 3Allen Institute for Artificial Intelligence, 4HKUST, 5IN.AI Research

*Core Contributors
†Corresponding to: xiangyue.work@gmail.com, zhengtianyu0428@gmail.com, ge.zhang@uwaterloo.ca,

Overview of OpenCodeInterpreter and its pass@1 accuracy on HumanEval. With appropriate feedback, OpenCodeInterpreter-DS-33B achieves performance comparable to that of the GPT-4 Code Interpreter.
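
For context, pass@1 here follows the standard HumanEval metric: the estimated probability that a single sampled completion passes all of a problem's unit tests. Below is a minimal sketch of the usual unbiased pass@k estimator (not this project's evaluation code); for k = 1 it reduces to the plain pass rate.

      import numpy as np

      def pass_at_k(n: int, c: int, k: int) -> float:
          # Unbiased pass@k estimator: n samples per problem, c of which pass all tests.
          if n - c < k:
              return 1.0
          return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

      # pass@1 reduces to the empirical pass rate c / n:
      print(pass_at_k(n=20, c=13, k=1))  # 0.65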

🔔News

🏆[2024-03-13]: Our 33B model has claimed the top spot on the BigCode leaderboard!

💡[2024-03-01]: We have open-sourced the OpenCodeInterpreter-SC2 series of models (based on the StarCoder2 base)!

🛠️[2024-02-29]: Our official online demo is deployed on HuggingFace Spaces! Take a look at the Demo Page!

🛠️[2024-02-28]: We have open-sourced the Demo Local Deployment Code with a Setup Guide.

✨[2024-02-26]: We have open-sourced the OpenCodeInterpreter-DS-1.3b Model.

📘[2024-02-26]: We have open-sourced the CodeFeedback-Filtered-Instruction Dataset.

🚀[2024-02-23]: We have open-sourced Code-Feedback, the dataset used in our project.

🔥[2024-02-19]: We have open-sourced all models in the OpenCodeInterpreter series! We welcome everyone to try out our models and look forward to your participation! 😆

Introduction

The introduction of large language models has significantly advanced code generation. However, open-source models often lack the execution capabilities and iterative refinement of advanced systems like the GPT-4 Code Interpreter. To address this, we introduce OpenCodeInterpreter, a family of open-source code systems designed for generating, executing, and iteratively refining code. Supported by Code-Feedback, a dataset featuring 68K multi-turn interactions, OpenCodeInterpreter integrates execution and human feedback for dynamic code refinement. Our comprehensive evaluation of OpenCodeInterpreter across key benchmarks such as HumanEval, MBPP, and their enhanced versions from EvalPlus reveals its exceptional performance. Notably, OpenCodeInterpreter-33B achieves an average accuracy of 83.2 (76.4 on the plus versions) across HumanEval and MBPP, closely rivaling GPT-4's 84.2 (76.2), and rises further to 91.6 (84.6) with synthesized human feedback from GPT-4. OpenCodeInterpreter bridges the gap between open-source code generation models and proprietary systems like the GPT-4 Code Interpreter.


Overview

Code generation has long been a cornerstone challenge in computer science, evolving significantly over the years from its initial reliance on symbolic methods to the recent revolutionary impact of large language models (LLMs). These LLMs, pre-trained on vast code corpora, have dramatically advanced the field by generating code that closely aligns with user intents, thereby offering substantial support for software development. As these models continue to evolve, they have become integral tools in automating and enhancing the coding process, exemplified by innovations such as GitHub Copilot.

To further enhance the capabilities of pre-trained code models, instruction-tuning methods have been introduced. Among these, OpenCodeInterpreter stands out as a pioneering approach, leveraging a unique dataset named Code-Feedback. This dataset comprises 68K multi-turn interactions that include both user instructions and compiler feedback, enabling the model to not only generate but also refine code based on execution outputs and human guidance. Such advancements allow OpenCodeInterpreter to produce solutions that are not only technically sound but also closely aligned with user expectations, setting a new standard in code generation.
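
To make the data format concrete, a single Code-Feedback-style training sample interleaves user turns, assistant code, and execution output. The sketch below is illustrative only; the field names and roles are assumptions, not the actual schema of the released dataset.

      # Illustrative multi-turn sample; keys and roles are assumptions,
      # not the actual schema of the released Code-Feedback dataset.
      sample = {
          "messages": [
              {"role": "user",
               "content": "Write a function that returns the n-th Fibonacci number."},
              {"role": "assistant",
               "content": "def fib(n):\n    return fib(n - 1) + fib(n - 2)"},
              {"role": "execution",
               "content": "RecursionError: maximum recursion depth exceeded"},
              {"role": "user",
               "content": "Add base cases and use an iterative loop instead."},
              {"role": "assistant",
               "content": "def fib(n):\n    a, b = 0, 1\n    for _ in range(n):\n        a, b = b, a + b\n    return a"},
          ]
      }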

OpenCodeInterpreter's innovative integration of execution and human feedback marks a significant leap forward in the domain. By harnessing compiler diagnostics to correct errors and incorporating human insights for code refinement, OpenCodeInterpreter achieves an unparalleled balance of accuracy and user alignment. Its performance on widely recognized benchmarks, including HumanEval and MBPP, demonstrates its superior ability to iteratively refine code, achieving results that narrow the performance gap with proprietary systems like GPT-4's Code Interpreter. OpenCodeInterpreter thus heralds a new era in code generation, offering open-source systems that rival the sophistication and efficacy of their proprietary counterparts.
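
As a rough illustration of this execute-then-refine loop (a sketch under our own assumptions, not the repository's inference code; the generate callable stands in for any chat-completion call to an OpenCodeInterpreter checkpoint):

      import subprocess
      import sys
      import tempfile

      def run_python(code: str, timeout: int = 10) -> str:
          # Execute candidate code in a subprocess and return stdout + stderr as feedback.
          with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
              f.write(code)
              path = f.name
          try:
              proc = subprocess.run([sys.executable, path], capture_output=True,
                                    text=True, timeout=timeout)
              return proc.stdout + proc.stderr
          except subprocess.TimeoutExpired:
              return "Execution timed out."

      def refine_with_execution_feedback(generate, task: str, max_rounds: int = 2) -> str:
          # generate(messages) -> code string; placeholder for any model call.
          messages = [{"role": "user", "content": task}]
          code = generate(messages)
          for _ in range(max_rounds):
              feedback = run_python(code)
              if not any(marker in feedback for marker in ("Traceback", "Error", "timed out")):
                  break  # code ran cleanly; stop refining
              messages += [
                  {"role": "assistant", "content": code},
                  {"role": "user",
                   "content": f"Execution feedback:\n{feedback}\nPlease fix the code."},
              ]
              code = generate(messages)
          return code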

Code-Feedback

We meticulously crafted our dataset, Code-Feedback, to train OpenCodeInterpreter, focusing on criteria that ensure its effectiveness and relevance to real-world coding challenges. Code-Feedback is distinguished by its inclusion of diverse and challenging queries derived from actual coding tasks. This diversity is crucial, as it ensures that the dataset covers a broad spectrum of problems, providing both variety and complexity. Moreover, the dataset adopts a multi-turn dialogue structure, enhancing its utility by incorporating execution feedback, such as outputs and diagnostics from compilers, alongside human feedback, which includes additional guidance or instructions from users. This structure is pivotal in simulating real-world coding scenarios where iterative feedback and adjustments are common.

The creation of Code-Feedback involved a comprehensive approach, employing five distinct methods to gather and curate data. This multifaceted approach was designed to fulfill the dataset's three key criteria: the incorporation of diverse real-world queries, a structured multi-turn dialogue format, and the interleaving of text and code responses to offer a comprehensive solution to coding queries. The sources of our queries were twofold, comprising a variety of open-source datasets and coding challenges from platforms like LeetCode. This combination ensures a rich and varied dataset that accurately reflects the nature of coding tasks encountered by developers. In subsequent sections, we delve into the specific methods employed in constructing the dataset, illustrating our commitment to creating a robust and effective tool for coding instruction and feedback.


Summary of the construction of our proposed dataset, Code-Feedback, and comparison with existing code instruction-tuning datasets. M.T.: Multi-Turn; E.F.: Execution Feedback; H.F.: Human Feedback.

Main Results

This section outlines the experimental framework for evaluating OpenCodeInterpreter and comparing it with leading models in both single-turn and multi-turn code generation settings. The study leverages data from the EvalPlus leaderboard, examining OpenCodeInterpreter's performance against baselines such as GPT-3.5/4-Turbo, CodeLlama-Python, WizardCoder, Deepseek-Coder, and CodeT5+ across various scales on the HumanEval and MBPP benchmarks and their advanced versions. For multi-turn code generation, the focus shifts to assessing OpenCodeInterpreter's capability for iterative refinement within a two-round limit, under both execution-feedback and synthesized human-feedback scenarios. The experimental setup highlights OpenCodeInterpreter's adaptability and proficiency in code generation through iterative feedback and refinement.
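
In the human-feedback setting, the feedback is synthesized by GPT-4 acting as a reviewer of the failed attempt (in the oracle variant, the reviewer may additionally consult the ground-truth solution). The sketch below is a hedged illustration of that second round; the helper names and prompts are placeholders, not the paper's exact setup.

      def synthesized_human_feedback_round(code_model, reviewer_model, task: str,
                                           failed_code: str, error_log: str) -> str:
          # Placeholder helpers: code_model / reviewer_model stand in for calls to an
          # OpenCodeInterpreter checkpoint and to GPT-4; prompts are illustrative only.
          review_prompt = (
              f"Task: {task}\nCandidate solution:\n{failed_code}\n"
              f"Execution result:\n{error_log}\n"
              "Explain, in natural language only, what is wrong and how to fix it."
          )
          guidance = reviewer_model(review_prompt)  # simulated human feedback
          revise_prompt = (
              f"Task: {task}\nYour previous attempt:\n{failed_code}\n"
              f"A reviewer suggested:\n{guidance}\nPlease return a corrected solution."
          )
          return code_model(revise_prompt)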

Pass rate of different code models on HumanEval (+), MBPP (+) and their average (+).

'CL': based on CodeLlama; 'DS': based on DeepseekCoder. In the Open-source columns, ● denotes open-source and ○ denotes closed-source. Baseline results are copied from the EvalPlus Leaderboard or reproduced by running the official checkpoints.

Model  Size  Type  Open-source (Model / Data)  HumanEval (+)  MBPP (+)  Average (+)
GPT-4 Turbo  -  -  ○ ○  85.4 (81.7)  83.0 (70.7)  84.2 (76.2)
      + Execution Feedback  88.0 (84.2)  92.0 (78.2)  90.0 (81.2)
GPT-3.5 Turbo  -  -  ○ ○  72.6 (65.9)  81.7 (69.4)  77.2 (67.7)
      + Execution Feedback  76.8 (70.7)  87.0 (73.9)  81.9 (72.3)
Gemini Pro  -  -  ○ ○  63.4 (55.5)  72.9 (57.9)  68.2 (56.7)
~7B Scale
StarCoder  7B  Base  ● ●  24.4 (20.7)  33.1 (28.8)  28.8 (24.8)
CodeT5+  6B  Base  ● ●  29.3 (23.8)  51.9 (40.9)  40.6 (32.4)
CodeGen-Mono  6B  Base  ● ●  29.3 (25.6)  49.9 (42.1)  39.6 (33.9)
Mistral  7B  Base  ● ○  28.7 (23.2)  50.1 (40.9)  39.4 (32.1)
OpenChat  7B  Instruct  ● ●  72.0 (67.1)  62.7 (52.9)  67.4 (60.0)
CodeLlama-Python  7B  Base  ● ○  37.8 (34.1)  57.6 (45.4)  47.7 (39.8)
      WizardCoder-CL  7B  Instruct  ○ ○  48.2 (40.9)  56.6 (47.1)  52.4 (44.0)
      Magicoder-CL  7B  Instruct  ● ●  60.4 (55.5)  64.2 (52.6)  62.3 (54.1)
      Magicoder-S-CL  7B  Instruct  ● ●  70.7 (66.5)  68.4 (56.6)  69.6 (61.6)
      OpenCodeInterpreter-CL  7B  Instruct  ● ●  72.6 (67.7)  66.4 (55.4)  69.5 (61.6)
      + Execution Feedback  75.6 (70.1)  69.9 (60.7)  72.8 (65.4)
DeepseekCoder  6.7B  Base  ● ○  47.6 (39.6)  70.2 (56.6)  58.9 (48.1)
      DeepseekCoder-Instruct  6.7B  Instruct  ● ○  73.8 (70.1)  73.2 (63.4)  73.5 (66.8)
      + Execution Feedback  80.5 (75.6)  79.9 (70.4)  80.2 (73.0)
      Magicoder-DS  6.7B  Instruct  ● ●  66.5 (60.4)  75.4 (61.9)  71.0 (61.2)
      Magicoder-S-DS  6.7B  Instruct  ● ●  76.8 (70.7)  75.7 (64.4)  76.3 (67.6)
      + Execution Feedback  77.4 (72.0)  73.2 (62.4)  75.3 (67.2)
      OpenCodeInterpreter-DS  6.7B  Instruct  ● ●  76.2 (72.0)  73.9 (63.7)  75.1 (67.9)
      + Execution Feedback  81.1 (78.7)  82.7 (72.4)  81.9 (75.6)
      + Synth. Human Feedback  87.2 (86.6)  86.2 (74.2)  86.7 (80.4)
      + Synth. Human Feedback (Oracle)  89.7 (86.6)  87.2 (75.2)  88.5 (80.9)
~13B Scale
CodeGen-Mono  16B  Base  ● ●  32.9 (27.4)  52.6 (43.6)  42.8 (35.5)
StarCoder  15B  Base  ● ○  34.1 (29.3)  55.1 (46.1)  44.6 (37.7)
CodeT5+  16B  Base  ● ○  31.7 (26.2)  54.6 (44.4)  43.2 (35.3)
CodeLlama-Python  13B  Base  ● ○  42.7 (36.6)  61.2 (50.9)  52.0 (43.8)
      OpenCodeInterpreter-CL  13B  Instruct  ● ●  77.4 (73.8)  70.7 (59.2)  74.1 (66.5)
      + Execution Feedback  81.1 (76.8)  78.2 (67.2)  79.7 (72.0)
~34B Scale
CodeLlama-Python  34B  Base  ● ○  51.8 (43.9)  67.2 (52.9)  59.5 (48.4)
      Speechless-CL-v2.0  34B  Instruct  ● ●  77.4 (71.3)  72.4 (59.1)  74.9 (65.2)
      XwinCoder-CL  34B  Instruct  ● ●  75.6 (67.7)  76.2 (62.4)  75.9 (65.1)
      Phind-CL-v2  34B  Instruct  ● ○  71.3 (67.1)  -  -
      WizardCoder-CL  34B  Instruct  ● ○  73.2 (64.6)  73.2 (59.9)  73.2 (62.3)
      OpenCodeInterpreter-CL  34B  Instruct  ● ●  78.0 (72.6)  73.4 (61.4)  75.7 (67.0)
      + Execution Feedback  81.7 (78.7)  80.2 (67.9)  81.0 (73.3)
DeepSeekCoder  33B  Base  ● ○  51.2 (44.5)
      DeepSeekCoder-Instruct  33B  Instruct  ● ○  81.1 (75.0)  78.7 (66.7)  79.9 (70.9)
      + Execution Feedback  81.1 (76.2)  82.7 (73.4)  81.9 (74.8)
      WizardCoder-V1.1  33B  Instruct  ● ○  79.9 (73.2)  78.9 (66.9)  79.4 (70.1)
      + Execution Feedback  74.4 (69.5)  79.9 (68.2)  77.2 (68.9)
      OpenCodeInterpreter-DS  33B  Instruct  ● ●  79.3 (74.3)  78.7 (66.4)  79.0 (70.4)
      + Execution Feedback  82.9 (80.5)  83.5 (72.2)  83.2 (76.4)
      + Synth. Human Feedback  88.4 (86.0)  87.5 (75.9)  88.0 (81.0)
      + Synth. Human Feedback (Oracle)  92.7 (89.7)  90.5 (79.5)  91.6 (84.6)
~70B Scale
CodeLlama-Python  70B  Base  ● ○  55.5 (50.0)  65.4 (53.4)  60.5 (51.7)
      CodeLlama-Instruct  70B  Instruct  ● ○  72.0 (65.2)  75.4 (61.7)  73.7 (63.5)
      OpenCodeInterpreter-CL  70B  Instruct  ● ●  76.2 (70.7)  73.0 (61.9)  74.6 (66.3)
      + Execution Feedback  79.9 (77.4)  81.5 (69.9)  80.7 (73.7)

Example

BibTeX


      @article{opencodeinterpreter,
        title={OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement},
        author={Tianyu Zheng and Ge Zhang and Tianhao Shen and Xueling Liu and Bill Yuchen Lin and Jie Fu and Wenhu Chen and Xiang Yue},
        journal={arXiv preprint arXiv:2402.14658},
        year={2024},
      }