Self-invoking code benchmarks will help you decide which LLMs to use for your programming tasks




As large language models (LLMs) continue to improve at coding, the benchmarks used to evaluate their performance are becoming less useful.

That's because, even though many LLMs achieve similarly high scores on these benchmarks, it can be difficult to know which ones to use for specific software development projects and initiatives.

A new paper by researchers at Yale University and Tsinghua University presents a new way to test models' ability to handle “self-invoking code generation” problems, which require reasoning, generating code, and reusing existing code to solve problems.

Self-invoking code generation is much closer to realistic programming scenarios and provides a better picture of current LLMs' ability to solve real-world coding problems.

Self-invoking code generation

Two popular benchmarks are used to evaluate the coding abilities of LLMs: HumanEval and MBPP (Mostly Basic Python Problems). These are datasets of handcrafted problems that require the model to write code for simple tasks.

However, these benchmarks only cover a subset of the challenges that software developers face in the real world. In practical situations, software developers don't just write new code – they also have to understand and reuse existing code and create reusable components to solve complex problems.

“The ability to understand and use self-generated code, i.e. self-invoking code generation, plays an important role for LLMs to leverage their reasoning capabilities for code generation that current benchmarks fail to capture,” the researchers write.

To test the ability of LLMs in self-invoking code generation, the researchers created two new benchmarks, HumanEval Pro and MBPP Pro, which extend the existing datasets. Each problem in HumanEval Pro and MBPP Pro builds on an existing example in the original dataset and adds elements that require the model to use the base problem and its solution to solve a more complex problem.

Self-invoking code generation (source: arXiv)

For example, the original problem may be something simple, like writing a function that replaces each occurrence of a given character in a string with a new character.

The extended problem asks for a function that replaces the occurrences of multiple characters in a string with their given replacements. This requires the model to write a new function that invokes the function it previously created for the simple problem.
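To make the task concrete, here is a minimal sketch in Python of what such a base/self-invoking pair might look like; the function names and signatures are illustrative, not drawn from HumanEval Pro itself.

```python
# Illustrative sketch of a base problem and its self-invoking extension.
# Names and signatures are hypothetical, not taken from HumanEval Pro.

def replace_char(text: str, old: str, new: str) -> str:
    """Base problem: replace every occurrence of one character with another."""
    return text.replace(old, new)

def replace_chars(text: str, mapping: dict) -> str:
    """Extended problem: replace several characters by reusing the base function."""
    for old, new in mapping.items():
        text = replace_char(text, old, new)
    return text

# Example: solving the harder problem requires invoking the simpler solution.
print(replace_chars("hello world", {"l": "1", "o": "0"}))  # he110 w0r1d
```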

“This evaluation of self-invoking code generation provides a deeper insight into the programming capabilities of LLMs, extending beyond the realm of single-problem code generation,” the researchers write.

LLMs perform poorly at self-invoking code generation

The researchers tested HumanEval Pro and MBPP Pro on more than 20 open and private models, including GPT-4o, OpenAI o1-mini and Claude 3.5 Sonnet, as well as the Qwen, DeepSeek and Codestral series.

Their findings show a significant gap between traditional coding benchmarks and self-invoking code generation tasks. “While frontier LLMs excel at generating individual code snippets, they often struggle to effectively use their own generated code to solve more complex problems,” the researchers write.

For example, with a single generation (pass@1), o1-mini scores 96.2% on HumanEval but only 76.2% on HumanEval Pro.

Another interesting finding is that, while instruction fine-tuning provides significant improvements on simple coding tasks, it shows diminishing returns on self-invoking code generation. The researchers note that “current instruction-based fine-tuning approaches are insufficiently effective for more complex self-invoking code generation tasks,” suggesting that we need to rethink how we train base models for coding and reasoning tasks.

To help advance research on self-invoking code generation, the researchers propose a technique to automatically repurpose existing coding benchmarks for self-invoking code generation. The approach uses frontier LLMs to generate self-invoking problems based on the original problems. They then generate candidate solutions and verify their correctness by executing the code and running test cases on them. The pipeline minimizes the need for manual code review, which helps generate more examples with less effort.
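As a rough illustration of how such a pipeline could be wired up, the sketch below assumes `llm` is any callable that sends a prompt to a frontier model and returns its text response; the prompts, function names, and filtering logic are hypothetical and not the paper's actual implementation.

```python
# Hypothetical sketch of an automated self-invoking benchmark pipeline.
import subprocess
import sys
import tempfile
from typing import Callable, Optional

def build_self_invoking_example(
    base_problem: str,
    base_solution: str,
    llm: Callable[[str], str],
) -> Optional[dict]:
    # 1. Ask the model for a harder problem whose solution must call the base function.
    new_problem = llm(
        "Extend this problem so that solving it requires calling the original "
        f"function:\n{base_problem}"
    )
    # 2. Ask for a candidate solution and assert-based test cases.
    candidate = llm(
        f"Solve this problem, reusing the base solution below.\n\nProblem:\n"
        f"{new_problem}\n\nBase solution:\n{base_solution}"
    )
    tests = llm(f"Write assert-based tests for this problem:\n{new_problem}")

    # 3. Execute the solution and tests; keep the example only if everything passes.
    script = "\n\n".join([base_solution, candidate, tests])
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(script)
        path = f.name
    result = subprocess.run([sys.executable, path], capture_output=True, timeout=60)
    if result.returncode == 0:
        return {"problem": new_problem, "solution": candidate, "tests": tests}
    return None  # failed examples are discarded or routed to manual review
```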

Automatically generating self-invoking code generation problems (source: arXiv)

A complex landscape

This new family of benchmarks comes at a time when older coding benchmarks are rapidly being conquered by frontier models. Current frontier models such as GPT-4o, o1 and Claude 3.5 Sonnet already achieve very high scores on HumanEval and MBPP, as well as their more advanced versions, HumanEval+ and MBPP+.

At the same time, there are more complex benchmarks such as SWE-Bench, which evaluates models' capabilities in end-to-end software engineering tasks that require a wide range of skills, such as using external libraries and files and managing DevOps tools. SWE-Bench is a very difficult benchmark, and even the most advanced models show only modest performance. For example, OpenAI o1 is inconsistent on SWE-Bench Verified.

https://twitter.com/alex_cuadron/status/1876017241042587964?s=46

Self-invoking code generation sits somewhere between the simple benchmarks and SWE-Bench. It helps evaluate a specific type of reasoning ability: using existing code within a module to tackle complex problems. Self-invoking code benchmarks can prove to be a very practical proxy for the usefulness of LLMs in real-world settings, where human programmers remain in control and AI copilots help them accomplish specific coding tasks in the software development process.

“HumanEval Pro and MBPP Pro are positioned to serve as valuable benchmarks for code-related evaluations and to stimulate future LLM development by shedding light on current model shortcomings and encouraging innovation in training methods,” the researchers write.


