MLGym (Meta)

external · status: active · focus: end-to-end · discipline: computer-science · started: 2025

Project page: https://github.com/facebookresearch/MLGym

Source: projects/landscape/mlgym.yml

Positioning

A gym-style framework and benchmark (MLGym-Bench, arXiv:2502.14499) for advancing AI research agents on 13 diverse ML research tasks (CV, NLP, RL, game theory). Like Aviary, MLGym is evaluation infrastructure for RISE-style systems rather than a RISE pipeline itself.

Distinctive contribution

First gym environment built specifically for ML research tasks rather than general QA or knowledge work. It covers idea generation, data processing, method implementation, training, and result analysis as a unified RL training surface. Unlike Aviary, it targets the training of research agents via RL, not only their evaluation.
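To make "gym-style" concrete, the following is a minimal sketch of the reset/step interaction loop that such environments generally expose, together with the trajectory record it produces. Everything here is an assumption made for illustration: `ToyResearchEnv`, `run_episode`, and `halving_policy` are hypothetical names, the task (guessing a learning rate) is a toy, and none of it is taken from the MLGym codebase or its actual API.

```python
from dataclasses import dataclass, field


@dataclass
class ToyResearchEnv:
    """Hypothetical stand-in for a gym-style research task (NOT the MLGym API)."""

    target: float = 0.1   # hidden "best" learning rate for this toy task
    max_steps: int = 5    # episode budget
    steps_taken: int = field(default=0, init=False)

    def reset(self) -> str:
        """Start a new episode and return the task description as the first observation."""
        self.steps_taken = 0
        return "Task: propose a learning rate that minimises validation loss."

    def step(self, action: float) -> tuple[str, float, bool]:
        """Apply one agent action; return (observation, reward, done)."""
        self.steps_taken += 1
        loss = abs(action - self.target)   # toy proxy for a validation loss
        reward = -loss                     # lower loss means higher reward
        done = loss < 1e-3 or self.steps_taken >= self.max_steps
        return f"validation loss = {loss:.4f}", reward, done


def halving_policy(observation: str, trajectory: list) -> float:
    """Hypothetical agent: start at 1.0 and halve the proposal on every step."""
    return 1.0 / (2 ** len(trajectory))


def run_episode(env, policy) -> list:
    """Run one episode and return the trajectory as a list of step records."""
    trajectory = []
    observation = env.reset()
    done = False
    while not done:
        action = policy(observation, trajectory)
        observation, reward, done = env.step(action)
        trajectory.append(
            {"action": action, "observation": observation, "reward": reward}
        )
    return trajectory


if __name__ == "__main__":
    for record in run_episode(ToyResearchEnv(), halving_policy):
        print(record)
```

The same loop can serve both purposes named above: the per-step rewards can score an agent for evaluation, and the recorded trajectory can be used as training data for RL.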

Evaluation scores

Dimension	Score (0–3)	Note
Lifecycle coverage	0	Cross-cutting evaluation/training infrastructure; not a pipeline producing scholarship.
Autonomy level	2	Supervised: user defines tasks; agents act within them; RL training is a separate pipeline.
Architectural transparency	3	Open source; arXiv:2502.14499; benchmark task definitions public.
Inputs supported	2	13 task definitions across CV, NLP, RL, game theory; extensible to user-defined tasks.
Outputs / reproducibility	3	Reproducible by design — gym-style task specs + trajectory artifacts.
Internal evaluation	2	Used to benchmark AI research agents in the arXiv paper; framework described as experimental.
Openness	2	CC BY-NC 4.0 license — open for research and non-commercial use; not permissive for commercial deployment.
Maturity / traction	2	599 stars; Meta institutional backing; last push 2025-08; flagged as 'experimental, under heavy development'.
Cross-family policy	0	Published baselines are single-family; the framework is model-agnostic but defines no cross-family policy.
Runtime assurance	1	Task-level trajectory + leaderboard scoring; runtime gating minimal.
Cross-platform portability	1	Open framework, multiple agents pluggable; not multi-IDE.

Scored on 2026-05-18. See the evaluation rubric.

Tags

Pipeline stages: hypothesis-generation research-design data-analysis code-generation

Architectural features: tool-use artifact-versioning iterative-loop

Inputs: task-specification

Outputs: agent-trajectories evaluation-metrics training-data

Data sources: benchmark-datasets

Knowledge sources: task-descriptions

Limitations

Non-commercial license restricts deployment options.
Evaluation infrastructure — value depends on downstream agent systems being benchmarked.
ML-research orientation; not designed for empirical-social-science or biomedical tasks.
Author flag: 'Please expect major changes to the design.'

Papers describing this project

MLGym: A New Framework and Benchmark for Advancing AI Research Agents — Nathani, D., Madaan, L., Roberts, N., Bashlykov, N., Menon, A., Moens, V., et al. (2025). arXiv. arXiv:2502.14499

Wu, J. et al. (2025). Agentic Reasoning: A Streamlined Framework for Enhancing LLM Reasoning with Agentic Tools.