Understood. I will look for relevant research from 2024 and 2025 focused on the topic of Executable Code Actions Elicit Better LLM Agents, and check whether open-source Python implementations exist, especially projects with GitHub repositories. Once the search results are in, I will provide an overview of the related work, links to the open-source code, and their applications and technical details. Please wait a moment and I will reply as soon as possible!
Research and Implementations Related to Executable Code Actions in LLM Agents (2024–2025)
Below we summarize several recent works (from 2024 and 2025) that enable large language model (LLM) agents to generate and execute Python code as part of their reasoning. For each, we list the title, authors, a brief overview, the open-source implementation (GitHub) if available, technical details, and main application scenarios.
Executable Code Actions Elicit Better LLM Agents (Wang et al., 2024)
- Authors: Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, Heng Ji.
- Summary: This work proposes CodeAct, a paradigm where an LLM agent outputs actions in the form of executable Python code instead of text or structured schemas. The code is run in a Python interpreter, and the agent can use the execution results to inform subsequent steps. The authors show that using code as the "unified action space" makes agent reasoning more concise and effective than text or JSON-based action formats. In evaluations on benchmarks like API-Bank and a new M³ToolEval suite, CodeAct achieved up to 20% higher success rates compared to traditional output formats. They also created an instruction-tuning dataset (CodeActInstruct, ~7k multi-turn interactions) to fine-tune an agent (based on Mistral-7B) for code-based actions.
- GitHub: The official implementation is open-sourced at xingyaoww/code-act. This includes the CodeAct agent, data, and a chat UI.
- Tech Details: The CodeAct agent is built using Python and integrates with a real Python interpreter to execute generated code step-by-step. The framework supports multi-turn interactions: the LLM writes code (calls to tools or APIs) which is executed, and the resulting output or errors are fed back to the LLM for iterative refinement. The authors fine-tuned a Llama/Mistral-based model on CodeActInstruct to better follow the code-generation style. They also provide deployment tools (e.g. Kubernetes and llama.cpp support) for running the agent. The agent leverages standard Python libraries and the model's code-writing ability; no specialized model architecture is needed beyond the integration of an execution environment. (A minimal sketch of this execute-and-observe loop is shown after the Application bullet below.)
- Application: CodeAct is intended for LLM agents solving complex, open-ended tasks by calling tools or performing computations via code. Example use cases include question answering with tool use (web search, math, API calls), data analysis, or interacting with external APIs in a loop. By using code, the agent can perform sequences of actions (e.g. data fetching, calculation, formatting) in one go, making it efficient for scenarios that require tool use and reasoning. The approach is general and can be applied to any situation where an LLM needs to plan and execute actions in an environment. (Notably, the authors reported that code-based actions required ~30% fewer steps than a JSON-based approach for the same tasks, suggesting faster and cheaper agent executions in practice.)
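To make the loop concrete, here is a minimal, self-contained sketch of the code-as-action pattern. This is not the authors' released implementation: the `llm` callable, the shared `namespace`, and the "FINAL ANSWER" stop signal are illustrative assumptions, and a real deployment would also sandbox execution.

```python
import contextlib
import io
import traceback

def execute_code(code: str, namespace: dict) -> str:
    """Run one code snippet in a shared namespace, capturing stdout or the traceback."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)  # real deployments sandbox this (container, jail, etc.)
        return buffer.getvalue() or "(no output)"
    except Exception:
        return traceback.format_exc()

def codeact_loop(llm, task: str, max_turns: int = 5) -> str:
    """Multi-turn loop: the LLM emits Python, we execute it, and feed the result back."""
    namespace: dict = {}              # persists variables across turns, like a notebook kernel
    history = f"Task: {task}\n"
    for _ in range(max_turns):
        code = llm(history)           # `llm`: any callable that returns a Python snippet (assumption)
        observation = execute_code(code, namespace)
        history += f"\nAction:\n{code}\nObservation:\n{observation}\n"
        if "FINAL ANSWER" in observation:   # illustrative stop signal, not from the paper
            break
    return history
```

Because the interpreter state persists between turns, the model can build on earlier variables instead of repeating work, which is one reason code-based actions need fewer steps than JSON-based ones.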
Hugging Face SmolAgents (Roucher et al., 2024)
- Authors: Aymeric Roucher, Merve Noyan, Thomas Wolf (Hugging Face team).
- Summary: SmolAgents is an open-source library (released by Hugging Face in late 2024) for building LLM-powered agents with a focus on simplicity and code-based actions. It provides a lightweight framework where an agent (specifically a CodeAgent class) writes its actions as Python code to use various tools. Instead of complex JSON schemas or proprietary formats, SmolAgents treats tool invocations as Python function calls generated by the LLM. The library was designed to be "barebones" yet powerful: it allows an LLM to orchestrate tools and even other agents through code. Early benchmarks showed that this "code-as-actions" approach improves agent performance, aligning with findings by Wang et al. (2024) that code-based strategies outperform JSON-based ones.
- GitHub: Repository is available at huggingface/smolagents (Apache-2.0 license, ~8k stars). The repo includes documentation and examples for building custom agents.
- Tech Details: SmolAgents is implemented in Python. It provides an easy API to define tools (each tool is basically a Python function/class that the agent can call) and to initialize agents with a chosen LLM backend. For example, one can use Hugging Face's HfApiModel (to call an open LLM via API or a local model) and tools like DuckDuckGoSearchTool to give the agent web search capability. The core agent (CodeAgent) works by prompting the LLM to output a snippet of Python code that uses the available tools; SmolAgents then executes this code in a sandboxed environment (they support sandboxing via E2B to ensure safety). After execution, the result is fed back to the LLM in the next prompt, enabling iterative reasoning. The design is inspired by frameworks like ReAct, but with first-class support for code as the medium of action. This means developers can leverage the vast Python ecosystem (e.g. call APIs, run computations, manipulate data) directly through the agent's code, rather than implementing custom parser logic. (A minimal usage sketch follows after the Application bullet below.)
- Application: SmolAgents is intended for developers to easily create custom AI agents that can perform multi-step tool-using tasks. Scenarios include: research assistants that search the web and summarize results, data analysis bots that query data sources and compute answers, or assistants that control a browser or file system. Essentially, any workflow where an LLM needs to call external tools/APIs or handle data can be implemented. Hugging Face demonstrated the library by evaluating open-source models on complex agent tasks (e.g., the GAIA challenge for multi-step reasoning) and even integrating vision-language models. In practice, SmolAgents lowers the barrier to experimenting with agentic behavior: you can plug in an open LLM (like Code Llama, or GPT-4 via API) and get an agent that writes and runs code to solve problems, useful for both research and building practical assistants.
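For illustration, a minimal usage sketch in the spirit of the launch blog post; class names such as HfApiModel and DuckDuckGoSearchTool follow that post and may differ in newer library versions.

```python
# pip install smolagents
from smolagents import CodeAgent, DuckDuckGoSearchTool, HfApiModel

model = HfApiModel()  # calls a hosted open LLM via the Hugging Face Inference API
agent = CodeAgent(tools=[DuckDuckGoSearchTool()], model=model)

# The agent writes Python snippets that call the search tool, executes them in a
# sandbox, and reasons over the observations across multiple steps.
print(agent.run("Find and summarize recent benchmarks comparing code-based and JSON-based agent actions."))
```

Because actions are plain Python, adding a new capability is mostly a matter of exposing another function as a tool rather than writing custom parsing logic.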
CodeAgent: Tool-Integrated Agents for Repo-Level Code Generation (Zhang et al., 2024)
- Authors: Kechi Zhang, Jia Li, Ge Li, Xianjie Shi, Zhi Jin (Peking University).
- Summary: CodeAgent (ACL 2024) is a framework that enhances LLM-based code generation by equipping an agent with external programming tools to handle tasks at the scale of an entire code repository. Traditional code generation with LLMs usually works at the function or file level, but real software projects involve navigating multiple files, understanding dependencies, and testing. CodeAgent introduces five tools to assist the LLM: for example, information retrieval from documentation or repo files, code symbol navigation (finding where functions or classes are defined), and code testing/execution to verify outputs. The agent uses these tools in a loop, guided by one of four possible agent strategies (including ReAct prompting, a tool-planning approach, the OpenAI function-calling format, and a rule-based strategy). The authors curated a new benchmark called CodeAgentBench with 101 tasks across five popular Python projects to evaluate repo-level coding. Results show that CodeAgent significantly boosts performance, improving pass rates by 18% to 250% over baseline LLMs that don't use tools. It also outperformed GitHub Copilot in these scenarios, demonstrating more accurate and efficient code solutions for complex, multi-file tasks.
- GitHub: The project’s code and data are open-sourced at zkcpku/CodeAgent, which includes the CodeAgentBench benchmark and implementation of the agent and tools.
- Tech Details: CodeAgent is implemented as an agent wrapper around various LLMs – the paper reports experiments with 9 different models (13B to 175B, both open-source and proprietary) – showing the framework is model-agnostic. The key is the set of integrated tools: they include a code parser/indexer to navigate the repository (e.g. find where a function is defined), a documentation/Q&A tool to answer questions about usage, a file editor for writing or modifying code, and a unit-test executor to run tests or sample inputs on the generated code. The agent employs a strategy (like ReAct) to decide when to call a tool: the LLM is prompted so that it can output a tool action, e.g., “SearchDoc(‘…’)”. A controller executes that action and returns the result to the LLM, which continues the generation. They explored different prompting schemes; for instance, an OpenAI function-calling approach where the tools are defined as functions in the prompt, or a rule-based policy that forces certain tool uses. By combining tool feedback with the LLM’s code generation, CodeAgent can iteratively refine code: e.g., if a test fails, the agent can read the error and fix the code. The implementation likely uses Python libraries for searching code (perhaps grep or language server protocols) and running tests (e.g. pytest or a sandboxed execution). (A hypothetical sketch of such a tool-dispatch loop follows after the Application bullet below.)
- Application: This work targets AI-assisted software development, especially for tasks that require understanding a codebase context (like adding a feature or fixing a bug across multiple files). CodeAgent could be used in IDEs or coding assistants to automate writing code that spans modules – for example, updating API implementations and all their references, guided by documentation. It is essentially an automated programmer that not only writes code but also reads the existing project and tests its output. Outside of coding, the idea of an LLM with specialized tools could extend to any complex workflow with structured data, but CodeAgent’s main scenario is repo-level code generation and code review. In fact, a related concept by others uses “Code Agents” for code review automation, where multiple agents discuss code changes. Overall, CodeAgent shows how giving an LLM tool-assisted agency (searching code, running it) can tackle significantly more complex programming tasks than standalone code completion.
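The paper's exact tool interfaces are not reproduced here; the following is only a hypothetical sketch of a controller that parses an LLM-emitted action string and dispatches it to a tool. The tool names (SearchDoc, FindSymbol, RunTests) and the parsing scheme are illustrative placeholders, with only the use of pytest for test execution taken from the description above.

```python
import re
import subprocess

def search_doc(query: str) -> str:
    # Placeholder: a real implementation would query project documentation or repo files.
    return f"(documentation snippets for {query!r})"

def find_symbol(name: str) -> str:
    # Placeholder: a real implementation would use a code index or language server.
    return f"(definition site of {name!r})"

def run_tests(test_path: str) -> str:
    # Run the repository's tests so the agent can read failures and fix its code.
    proc = subprocess.run(["pytest", test_path, "-q"], capture_output=True, text=True)
    return proc.stdout + proc.stderr

TOOLS = {"SearchDoc": search_doc, "FindSymbol": find_symbol, "RunTests": run_tests}
ACTION_RE = re.compile(r"^(\w+)\((.*)\)$")  # matches actions like SearchDoc('requests.get')

def dispatch(action: str) -> str:
    """Parse an LLM-emitted action string and return the tool's observation."""
    match = ACTION_RE.match(action.strip())
    if not match or match.group(1) not in TOOLS:
        return f"Unknown action: {action}"
    tool, raw_arg = TOOLS[match.group(1)], match.group(2).strip("'\" ")
    return tool(raw_arg)
```

The observation returned by `dispatch` would be appended to the prompt so the LLM can decide its next action, mirroring the ReAct-style loop described above.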
OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement (Zheng et al., 2024)
- Authors: Tianyu Zheng, Ge Zhang, Tianhao Shen, Xueling Liu, Bill Yuchen Lin, Jie Fu, Wenhu Chen, Xiang Yue.
- Summary: OpenCodeInterpreter is a project that delivers an open-source alternative to ChatGPT’s famed “Code Interpreter” (a closed AI system that can write and execute code). The authors introduce a family of models trained not only to generate code, but also to execute it and iteratively refine their solutions based on execution feedback. To achieve this, they created a dataset called Code-Feedback with 68k multi-turn interactions, in which an LLM generates code, sees the result or error, and then receives feedback/hints to improve the code. By training on this, their models learn to handle the full loop of coding: propose a solution, run it, and fix it if needed (possibly incorporating user or simulated feedback). The results are impressive: their 33B-parameter model (OpenCodeInterpreter-33B) achieves 83.2% accuracy on the HumanEval coding benchmark (and 76.4% on HumanEval+ from EvalPlus), which is nearly on par with GPT-4’s Code Interpreter (84.2% on HumanEval). With additional synthetic feedback (using GPT-4 to generate higher-quality hints during inference), accuracy jumps to 91.6%, actually surpassing GPT-4 on those benchmarks. In essence, this research closes the gap between open models and the proprietary GPT-4 Code Interpreter, demonstrating that iterative execution-enhanced code generation is feasible with open-source LLMs.
- GitHub: The team has released their code and models. The main code repository is OpenCodeInterpreter/OpenCodeInterpreter, and they have provided model checkpoints and the dataset on Hugging Face (links on their project page). A live demo is also hosted on Hugging Face Spaces.
- Tech Details: OpenCodeInterpreter involves training large language models with a new methodology. They likely started with a strong code-centric base model (for example, a 33B model derived from StarCoder or a Llama 2 code variant; news also mentions a StarCoder2-based series). The training data (Code-Feedback) contains multi-turn interactions where the model’s previous code output and an execution result (or an error message) are part of the input for the next turn, and the model must produce a refined code solution. This teaches the model to handle the dialogue of coding. At runtime, the system works similarly to ChatGPT’s Code Interpreter: the model generates some code, that code is executed in a sandbox (e.g., a Python environment managed by the tool), and the output (or traceback) is fed back into the model’s context to prompt the next step. The process continues until the task is solved. The implementation uses Python for execution and likely automates the loop of run-check-refine. (A generic sketch of this loop is shown after the Application bullet below.) They also incorporate human feedback signals: their best results involve “synthesized human feedback from GPT-4”, essentially using GPT-4 to provide hints or verification, which the OpenCodeInterpreter model can use to improve itself further (a form of knowledge distillation or self-refinement). All models and data are released, including smaller variants (e.g., a 1.3B model was open-sourced as a demo).
- Application: The primary application is automated code generation and problem solving. OpenCodeInterpreter can be used to solve programming challenges (like those in HumanEval or competitive programming) by having the model figure out the solution through trial and error. More broadly, it serves as an AI coding assistant that can autonomously debug its code. For instance, given a data analysis task in natural language, the model can write a Python script, run it, see if it produced the desired output (or any errors), and adjust accordingly – all without human intervention. This could be applied in data science (ask the model to generate and run code to answer a question using a dataset), in educational tools (where a student gets step-by-step code help), or as a component in agent systems that require code execution. Essentially, OpenCodeInterpreter provides a blueprint for making LLMs active problem-solvers via code, not just static code generators, which is valuable for any scenario where executing code is necessary to verify or obtain an answer.
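As a rough illustration of the generate-execute-refine loop these models are trained for, here is a generic sketch, not the project's released code; the `model` callable and the way tests are attached to the candidate solution are assumptions.

```python
import subprocess
import tempfile

def run_candidate(code: str, test_code: str) -> tuple[bool, str]:
    """Execute candidate code plus its checks in a subprocess; return (passed, output)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n\n" + test_code)
        path = f.name
    proc = subprocess.run(["python", path], capture_output=True, text=True, timeout=30)
    return proc.returncode == 0, proc.stdout + proc.stderr

def refine(model, problem: str, test_code: str, max_rounds: int = 3) -> str:
    """Generate-execute-refine: feed execution feedback back to the model each round."""
    code = model(problem)                  # `model`: any callable returning Python source (assumption)
    for _ in range(max_rounds):
        passed, feedback = run_candidate(code, test_code)
        if passed:
            return code
        code = model(                      # ask for a fix, conditioned on the traceback/output
            f"{problem}\n\nPrevious attempt:\n{code}\n\nExecution feedback:\n{feedback}\nFix the code."
        )
    return code
```

In the paper's setup the feedback can also include hints from a stronger model (GPT-4), which slots into the same loop as an extra string appended to the refinement prompt.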
Open Interpreter (OpenInterpreter Project, 2023–2024)
- Authors: Developed by the OpenInterpreter community (initially by Killian Lucas and contributors).
- Summary: Open Interpreter is an open-source tool that lets you chat with an LLM and have it execute code on your local machine to perform tasks. It was inspired by ChatGPT’s Code Interpreter plugin, aiming to provide similar capabilities using open models or API models. With Open Interpreter, a user can ask in natural language, and the system will generate and run code (Python, JavaScript, shell commands, etc.) to carry out the request. For example, if you ask it to analyze a CSV file, the agent will write a Python script to load the file and produce the analysis, execute it, and return the results. It essentially turns your computer into an “AI agent” that can use any installed software or libraries via code. This approach provides a natural language interface to the computer’s capabilities. Open Interpreter can handle a wide range of actions: creating or editing images and PDFs, controlling a web browser for research, plotting graphs and analyzing datasets, managing files, and more. It gained popularity for bringing code-execution powers to local LLM setups, all while being open source and extensible.
- GitHub: The project is available at OpenInterpreter/open-interpreter (AGPL-3.0). It has a command-line tool (interpreter) for easy use after installation. Development is active, with version 1.0 in progress and a community on Discord.
- Tech Details: Open Interpreter is essentially a runtime wrapper around LLMs. It can work with various models – by default it may use OpenAI’s GPT-4 or GPT-3.5 via API, but it also supports local models (through libraries like LlamaCpp or Hugging Face Transformers for models such as Llama-2, Code Llama, etc.). The system continuously listens for the LLM to output code in its responses. The user’s query is given to the model with instructions that it can respond with code blocks when needed. When the model returns code (delimited in Markdown syntax), Open Interpreter executes that code on the machine. To mitigate risks, it asks the user for approval before running potentially risky code, and recent updates run code in an isolated Docker container by default for safety. The architecture supports multi-language execution: primarily Python for most tasks, but also shell commands or JavaScript depending on the prompt and needs. The “tools” available to the agent are essentially anything the machine can do via code – it is not restricted to a fixed set of APIs. For example, if asked to open a webpage, the model might generate Python code using webbrowser or selenium; to edit an image it might use Pillow; to search the web it could call an API or control a browser. Open Interpreter manages the session state, so the model can remember previous results (you can even save and resume sessions) and build on them. It is built in Python and uses asyncio to handle the interactive loop of model I/O and code execution.
- Application: Open Interpreter provides a general-purpose AI assistant for PCs. Its applications are very broad, essentially covering any task you could solve with programming. Common use cases include: data analysis and visualization (ask a question about your dataset, and it will write and run code to answer), automating web interactions (filling forms, scraping information), file operations (organizing files, parsing documents), and even creative tasks like generating an image via an API and then editing it. Because it can install or use any library, it is like having a junior developer at your command line. This is especially useful for non-programmer users – they can just describe what they want in plain language, and the system does it via code. It is also a great experimentation platform for researchers to test LLMs’ capabilities at multi-step tool use in a real environment. Overall, Open Interpreter turns natural language instructions into live actions on a computer, demonstrating both the power and the risk of LLMs with direct execution. (Users should supervise and confirm each action, as the system can execute arbitrary code.) It is an important step towards more interactive and agentive AI that can not only converse but also act by running code in the real world.

Sources: The information above is drawn from the cited references, including academic papers and official documentation/blogs for each project. Key sources include the arXiv paper for CodeAct, the Hugging Face blog for SmolAgents, the ACL paper for CodeAgent, the OpenCodeInterpreter arXiv preprint, and the Open Interpreter README, among others. Each project’s GitHub repository (linked above) provides further implementation details and open-source code.