MLGym (Meta)

external · status: active · focus: end-to-end · discipline: computer-science · started: 2025

Project page: https://github.com/facebookresearch/MLGym

Source: projects/landscape/mlgym.yml

Positioning

A gym-style framework and benchmark (MLGym-Bench, arXiv:2502.14499) for developing and evaluating AI research agents on 13 ML research tasks spanning CV, NLP, RL, and game theory. Like Aviary, MLGym is evaluation infrastructure for RISE-style systems rather than a RISE pipeline itself.

Distinctive contribution

Described by its authors as the first gym environment built specifically for ML research tasks, as opposed to general QA or knowledge work. It exposes idea generation, data processing, method implementation, training, and result analysis as a single RL training surface. It differs from Aviary in that it targets training research agents via RL, not only evaluating them.
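To make "gym-style" concrete, the sketch below shows the generic reset/step interaction loop that such environments follow. It is a toy illustration only: `ToyResearchEnv`, `run_episode`, the reward rule, and the numeric "score" action are all hypothetical and are not MLGym's actual API or task format. In a real research environment the action would be something like a shell command or code edit, and the score would come from running the agent's code.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ToyResearchEnv:
    """Hypothetical gym-style environment (NOT MLGym's API).

    The agent's action is a float standing in for the validation score
    of one submitted attempt.
    """
    target: float = 0.8      # assumed score at which the task counts as solved
    max_steps: int = 5       # assumed step budget per episode
    steps: int = 0
    best: float = 0.0
    trajectory: list = field(default_factory=list)

    def _observation(self) -> dict:
        return {"task": "improve validation score", "best": self.best}

    def reset(self) -> dict:
        """Start a new episode and return the initial observation."""
        self.steps = 0
        self.best = 0.0
        self.trajectory = []
        return self._observation()

    def step(self, action: float) -> tuple[dict, float, bool]:
        """Apply one action; return (observation, reward, done)."""
        self.steps += 1
        reward = max(0.0, action - self.best)  # reward only improvement
        self.best = max(self.best, action)
        done = self.best >= self.target or self.steps >= self.max_steps
        # Record the step so the episode can be inspected or replayed later.
        self.trajectory.append(
            {"step": self.steps, "action": action, "reward": reward}
        )
        return self._observation(), reward, done


def run_episode(env: ToyResearchEnv, policy: Callable[[dict], float]) -> float:
    """Run one episode with the given policy and return the total reward."""
    obs = env.reset()
    total, done = 0.0, False
    while not done:
        obs, reward, done = env.step(policy(obs))
        total += reward
    return total


if __name__ == "__main__":
    env = ToyResearchEnv()
    # Hypothetical policy: each attempt improves the best score by 0.25.
    total = run_episode(env, lambda obs: min(1.0, obs["best"] + 0.25))
    print(total, len(env.trajectory))
```

The same loop serves both uses named above: the recorded trajectory and final score support benchmarking, while the per-step reward is the signal an RL trainer would consume.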

Evaluation scores

Dimension | Score (0–3) | Note
Lifecycle coverage | 0 | Cross-cutting evaluation/training infrastructure; not a pipeline producing scholarship.
Autonomy level | 2 | Supervised: user defines tasks; agents act within them; RL training is a separate pipeline.
Architectural transparency | 3 | Open source; arXiv:2502.14499; benchmark task definitions public.
Inputs supported | 2 | 13 task definitions across CV, NLP, RL, game theory; extensible to user-defined tasks.
Outputs / reproducibility | 3 | Reproducible by design: gym-style task specs plus trajectory artifacts.
Internal evaluation | 2 | Used to benchmark AI research agents in the arXiv paper; framework described as experimental.
Openness | 2 | CC BY-NC 4.0 license: open for research and non-commercial use; not permissive for commercial deployment.
Maturity / traction | 2 | 599 stars; Meta institutional backing; last push 2025-08; flagged as 'experimental, under heavy development'.
Cross-family policy | 0 | Single-family in published baselines; framework agnostic but no policy.
Runtime assurance | 1 | Task-level trajectory + leaderboard scoring; runtime gating minimal.
Cross-platform portability | 1 | Open framework, multiple agents pluggable; not multi-IDE.

Scored on 2026-05-18. See the evaluation rubric.

Tags

Pipeline stages: hypothesis-generation research-design data-analysis code-generation

Architectural features: tool-use artifact-versioning iterative-loop

Inputs: task-specification

Outputs: agent-trajectories evaluation-metrics training-data

Data sources: benchmark-datasets

Knowledge sources: task-descriptions

Limitations

  • Non-commercial license restricts deployment options.
  • Evaluation infrastructure: its value depends on the downstream agent systems being benchmarked.
  • ML-research orientation; not designed for empirical-social-science or biomedical tasks.
  • Author flag: 'Please expect major changes to the design.'

Papers describing this project

  • MLGym: A New Framework and Benchmark for Advancing AI Research Agents. Nathani, D., Madaan, L., Roberts, N., Bashlykov, N., Menon, A., Moens, V., et al. (2025). arXiv:2502.14499