The FM Agent


TL;DR

This paper proposes a general-purpose multi-agent framework called FM Agent, which combines the reasoning capabilities of large language models (LLMs) with large-scale evolutionary search to automatically solve complex real-world challenges across multiple domains, including operations research, machine learning, GPU kernel optimization, and mathematical problems, achieving state-of-the-art (SOTA) results.

Key Definitions

At present, autonomous AI research agents driven by large language models (LLMs) are developing rapidly, and one mainstream direction is to use multiple LLM agents, together with evolutionary or reinforcement-learning-style search loops, to solve complex open-ended problems. In industry, however, high-value domains such as combinatorial optimization, machine learning, and high-performance computing kernel tuning still largely rely on experts with deep domain knowledge performing manual, project-by-project iterative optimization to find efficient solutions. This process is not only costly but also difficult to fully automate. Some existing automation methods, such as AI compilers, depend on predefined rules and therefore generalize poorly to new tasks.

The core problem this paper aims to solve is: how to build a general-purpose, scalable AI system that can autonomously solve complex cross-domain problems, thereby reducing dependence on human experts and accelerating scientific discovery and engineering innovation.

Method

The FM Agent framework is designed as a two-stage autonomous discovery and optimization process. It first generates a diverse pool of initial solutions in the “Cold Start Stage,” then enters the “Evolve Stage” for large-scale iterative search and optimization. The entire framework runs on a high-performance distributed infrastructure to support large-scale parallel computation.
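The two-stage pipeline can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the mock "LLM" just proposes random coefficient vectors, `fitness` is a toy objective, and the function names (`cold_start`, `evolve`) are assumptions for illustration.

```python
import random

def cold_start(n_candidates=4, dim=3):
    """Stage 1 (sketch): build a diverse pool of initial solutions.
    A mock 'LLM' proposes random coefficient vectors; the real system
    would prompt an LLM with varied strategies and temperatures."""
    rng = random.Random(0)
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_candidates)]

def fitness(candidate):
    """Toy objective: the closer to the all-ones vector, the better."""
    return -sum((x - 1.0) ** 2 for x in candidate)

def evolve(population, generations=50):
    """Stage 2 (sketch): iterative search, reduced to a simple
    mutate-the-best, replace-the-worst loop over the population."""
    rng = random.Random(1)
    for _ in range(generations):
        parent = max(population, key=fitness)            # exploit the current best
        child = [x + rng.gauss(0, 0.1) for x in parent]  # stand-in for an LLM mutation
        worst = min(population, key=fitness)
        if fitness(child) > fitness(worst):              # keep the population elite
            population[population.index(worst)] = child
    return max(population, key=fitness)

best = evolve(cold_start())
```

With the seeds fixed the loop is deterministic; in FM Agent each "mutation" would instead be an LLM-generated code edit evaluated on the real task, and the best fitness in the population never decreases because only the worst member is ever replaced.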

FM Agent framework overview

Innovations

The core innovation of FM Agent lies in its architectural design, which seamlessly integrates the reasoning capabilities of LLMs, the exploratory power of evolutionary computation, and a scalable distributed system.

Cold Start Stage

The goal of this stage is to build a highly diverse, high-quality initial population of solution candidates for the subsequent evolutionary search, thereby expanding the global search space and effectively preventing premature convergence.
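One way to make "highly diverse" concrete is a novelty filter over the initial proposals. The sketch below is an assumption about how such a stage could work, not the paper's method: the strategy labels, the `propose` stand-in for an LLM call, and the distance threshold are all hypothetical.

```python
import random

# Hypothetical strategy labels: distinct prompting strategies would steer
# an LLM toward different families of solutions.
STRATEGIES = ["greedy heuristic", "randomized restart",
              "dynamic programming", "local search"]

def propose(strategy, dim=3):
    """Stand-in for one LLM call: a deterministic pseudo-random
    proposal seeded by the strategy name."""
    rng = random.Random(sum(ord(c) for c in strategy))
    return [rng.uniform(-1, 1) for _ in range(dim)]

def novelty(candidate, pool):
    """Minimum Euclidean distance to the accepted pool (higher = more novel)."""
    if not pool:
        return float("inf")
    return min(sum((a - b) ** 2 for a, b in zip(candidate, other)) ** 0.5
               for other in pool)

def build_initial_population(min_novelty=0.1):
    """Accept a proposal only if it is sufficiently far from everything
    already accepted, so the search starts from a spread-out population."""
    pool = []
    for strategy in STRATEGIES:
        cand = propose(strategy)
        if novelty(cand, pool) >= min_novelty:
            pool.append(cand)
    return pool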
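One way to make "highly diverse" concrete is a novelty filter over the initial proposals. The sketch below is an assumption about how such a stage could work, not the paper's method: the strategy labels, the `propose` stand-in for an LLM call, and the distance threshold are all hypothetical.

```python
import random

# Hypothetical strategy labels: distinct prompting strategies would steer
# an LLM toward different families of solutions.
STRATEGIES = ["greedy heuristic", "randomized restart",
              "dynamic programming", "local search"]

def propose(strategy, dim=3):
    """Stand-in for one LLM call: a deterministic pseudo-random
    proposal seeded by the strategy name."""
    rng = random.Random(sum(ord(c) for c in strategy))
    return [rng.uniform(-1, 1) for _ in range(dim)]

def novelty(candidate, pool):
    """Minimum Euclidean distance to the accepted pool (higher = more novel)."""
    if not pool:
        return float("inf")
    return min(sum((a - b) ** 2 for a, b in zip(candidate, other)) ** 0.5
               for other in pool)

def build_initial_population(min_novelty=0.1):
    """Accept a proposal only if it is sufficiently far from everything
    already accepted, so the search starts from a spread-out population."""
    pool = []
    for strategy in STRATEGIES:
        cand = propose(strategy)
        if novelty(cand, pool) >= min_novelty:
            pool.append(cand)
    return pool
```

By construction every pair of accepted candidates is at least `min_novelty` apart, which is exactly the property that widens the search space and guards against premature convergence.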

Evolve Stage

The evolution module is the core of FM Agent: it improves the initial solution candidates through large-scale, population-based search, driven by an efficient evolutionary strategy.
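The paper does not spell out the selection rule here, so as a hypothetical stand-in, the sketch below uses softmax (Boltzmann) parent selection: higher-scoring candidates are drawn more often, while a temperature parameter keeps weaker lineages in play so the population does not collapse prematurely.

```python
import math
import random
from collections import Counter

def select_parents(population, scores, k, temperature=0.5, rng=None):
    """Softmax (Boltzmann) parent selection. Higher scores get
    exponentially more weight; a larger temperature flattens the
    distribution and preserves exploration."""
    rng = rng or random.Random(0)
    mx = max(scores)  # subtract the max for numerical stability
    weights = [math.exp((s - mx) / temperature) for s in scores]
    return rng.choices(population, weights=weights, k=k)

# Over many draws the best-scoring candidate dominates but is not the
# exclusive choice, which is the balance an evolutionary loop needs.
top = Counter(select_parents(["a", "b", "c"], [0.1, 0.2, 1.0], k=1000)).most_common(1)[0][0]
```

Raising `temperature` would spread the draws more evenly across the population, trading exploitation for exploration.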

Figure illustration

Figure illustration

Distributed Infrastructure

The underlying layer of FM Agent is a scalable distributed infrastructure built for high-throughput evolutionary computation.
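The role of that infrastructure, evaluating many candidates concurrently, can be illustrated with a local worker pool. This is only a single-machine sketch under assumed names (`evaluate`, `evaluate_population`); the real system would dispatch jobs across a cluster and sandbox each run.

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate(candidate):
    """Stand-in for an expensive fitness evaluation, e.g. compiling and
    timing a GPU kernel or training a small model on the real task."""
    return -sum((x - 1.0) ** 2 for x in candidate)

def evaluate_population(population, max_workers=8):
    """Fan evaluations out to a worker pool and collect scores in
    submission order, mirroring a high-throughput evaluation tier."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(evaluate, population))
```

Because `Executor.map` preserves input order, scores can be zipped straight back onto the candidates regardless of which worker finished first.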

Human-in-the-loop feedback module

This is an optional module designed to flexibly incorporate domain experts’ knowledge into the autonomous evolution process. It provides a visual interface through which experts can monitor evolutionary metrics in real time, such as fitness changes and population diversity, and guide the direction of evolution through natural-language instructions or code-level interventions. The module also supports building an expert knowledge base and uses retrieval-augmented generation (RAG) to automatically fetch relevant knowledge when optimization hits a bottleneck, informing subsequent mutation and crossover operations and keeping the search well grounded.
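The retrieval step can be reduced to a toy example. The knowledge-base entries below are invented, and a real deployment would rank by embedding similarity rather than token overlap, but the control flow (stall detected, retrieve expert advice, feed it into the next mutation prompt) is the same.

```python
def tokenize(text):
    return set(text.lower().split())

# Hypothetical expert knowledge base; real entries would be curated
# notes contributed by domain experts.
KNOWLEDGE_BASE = [
    "Tile the matrix multiply and stage operands in shared memory",
    "Use simulated annealing when local search stalls in a plateau",
    "Inject random immigrants to restore population diversity",
]

def retrieve(query, kb=KNOWLEDGE_BASE, top_k=1):
    """Toy RAG step: rank entries by token overlap with a description
    of the current bottleneck and return the best matches."""
    ranked = sorted(kb, key=lambda doc: len(tokenize(doc) & tokenize(query)),
                    reverse=True)
    return ranked[:top_k]
```

A query describing the stall pulls back the most relevant hint, which would then be injected into the prompt for the next round of mutation and crossover.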

Experimental Results

This paper validates the effectiveness and generalization ability of FM Agent on authoritative benchmarks in three different domains: machine learning, combinatorial optimization, and GPU kernel generation. All experiments were completed autonomously by the LLM-driven agent, without human intervention.

Machine Learning (MLE-Bench)

MLE-Bench is a complex real-world machine learning task benchmark based on Kaggle competitions.

| Metric | InternAgent | Auto-Agent | ML-Master | Human | FM Agent (this paper) |
| --- | --- | --- | --- | --- | --- |
| Valid submission rate | 98.67% | 93.33% | 85.33% | - | 98.67% |
| Above median human | 48.44% | 40.00% | 44.90% | 50.00% | 65.33% |
| Any medal | 20.31% | 22.86% | 23.44% | 22.00% | 29.33% |
| Gold medal | 4.69% | 2.86% | 6.25% | 4.00% | 8.00% |

Figure illustration

Combinatorial Optimization (ALE-Bench)

ALE-Bench is a goal-driven algorithm benchmark composed of computationally hard algorithmic competition problems.

| Method | Average Score | ≥400 | ≥1600 | ≥2000 (Yellow) |
| --- | --- | --- | --- | --- |
| Self-Refine (baseline) | 1201.3 | 100.0% | 30.0% | 10.0% |
| ALE-Agent (SOTA) | 1879.3 | 100.0% | 70.0% | 30.0% |
| FM Agent (this work) | 1976.8 | 100.0% | 80.0% | 40.0% |

Figure illustration

GPU Kernel Generation (KernelBench)

KernelBench is designed to evaluate an LLM’s ability to generate efficient GPU kernels. Experiments were conducted at the most difficult Level 3, with stricter numerical precision requirements.

Figure illustration

Compared with previous SOTA methods such as the agent-based AI CUDA Engineer and the reinforcement-learning-based CUDA-L1, FM Agent achieved speedups of 2x to 9x over the cuBLAS baseline across multiple kernels while maintaining high numerical precision ($10^{-5}$), consistently surpassing the previous best results.

Final conclusion: The experimental results strongly demonstrate that FM Agent is a robust and general-purpose problem-solving framework. It can autonomously discover state-of-the-art solutions across multiple complex domains, including machine learning, combinatorial optimization, and systems optimization, validating the superiority of its architecture that combines LLM reasoning with large-scale evolutionary search.