Aviary (FutureHouse)

external · status: active · focus: end-to-end · discipline: general · started: 2024

Project page: https://github.com/Future-House/aviary

Source: projects/landscape/aviary.yml

Positioning

A gymnasium for defining custom language-agent environments (arXiv:2412.21154), with pre-built environments for math, general knowledge, biological sequences, scientific literature search, and protein stability. Aviary is evaluation infrastructure for RISE systems, not a RISE pipeline itself — but it is directly relevant because it defines tasks against which RISE pipelines can be measured.

Distinctive contribution

Provides a standardized environment-and-agent abstraction (paired with the LDP language-decision-process library) for benchmarking agentic systems on scientific tasks. Lowers the cost of producing comparable evaluations across pipelines.
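
To make the environment-and-agent split concrete, here is a minimal sketch of the general gymnasium-style pattern: an environment exposes reset and step, and an agent policy is rolled out against it while the environment records the trajectory and assigns the score. All names here (CountingEnv, rollout, add_one_policy) are hypothetical illustrations written in plain Python. They are not Aviary's or LDP's actual classes, and the real library's signatures may differ.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class CountingEnv:
    """Hypothetical toy environment: reach `target` exactly by calling add(n)."""

    target: int = 3
    total: int = 0
    trajectory: list = field(default_factory=list)

    async def reset(self):
        # Start a fresh episode and return the first observation plus tool names.
        self.total = 0
        self.trajectory.clear()
        return f"Reach {self.target} by calling add(n).", ["add"]

    async def step(self, action):
        # An action is a (tool_name, argument) pair chosen by the agent.
        tool, arg = action
        if tool != "add":
            raise ValueError(f"unknown tool: {tool}")
        self.total += arg
        done = self.total >= self.target
        reward = 1.0 if self.total == self.target else 0.0
        observation = f"total={self.total}"
        # Environment-level logging of every transition.
        self.trajectory.append((action, observation, reward, done))
        return observation, reward, done


async def rollout(env, policy, max_steps=10):
    # Run one episode: the policy sees only observations and tool names.
    observation, tools = await env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(observation, tools)
        observation, reward, done = await env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward


def add_one_policy(observation, tools):
    # Stand-in for a language agent: always calls add(1).
    return ("add", 1)


env = CountingEnv(target=3)
score = asyncio.run(rollout(env, add_one_policy))
print(score, len(env.trajectory))
```

Because the policy only sees observations and tool names, the same rollout loop can score any agent against any environment with this interface, which is the property that makes evaluations comparable across pipelines.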

Evaluation scores

Dimension | Score (0–3) | Note
Lifecycle coverage | 0 | Cross-cutting evaluation infrastructure; does not produce scholarly artifacts itself.
Autonomy level | 2 | Supervised: user defines environments; agents act within them.
Architectural transparency | 3 | Open under Apache-2.0; arXiv paper; tutorials; full documentation site.
Inputs supported | 2 | Multiple environment definitions ship with the library; user-defined environments supported.
Outputs / reproducibility | 3 | Pip-installable; deterministic given fixed model + environment; designed for benchmark reproducibility.
Internal evaluation | 2 | Used by FutureHouse to evaluate their own agents (Robin, PaperQA); benchmarks published in the arXiv paper.
Openness | 3 | Apache-2.0; PyPI as fhaviary; sister library LDP also open.
Maturity / traction | 2 | 261 stars; active development; embedded in FutureHouse evaluation stack.
Cross-family policy | 1 | Environment-agnostic; cross-family possible by user setup.
Runtime assurance | 1 | Trajectory logging + environment-level scoring; not a runtime claim-audit harness.
Cross-platform portability | 2 | Pip-installable; pairs with LDP; multi-environment by design.

Scored on 2026-05-18. See the evaluation rubric.

Tags

Pipeline stages: data-analysis literature-discovery

Architectural features: tool-use artifact-versioning

Inputs: task-specification

Outputs: agent-trajectories evaluation-metrics

Data sources: benchmark-datasets scientific-literature

Knowledge sources: paper-qa

Limitations

  • Evaluation infrastructure, not a RISE pipeline — included here for completeness, scored conservatively on lifecycle coverage.
  • Benchmarks reflect only the environments that ship with the library; coverage of social-science tasks is limited.

Papers describing this project

  • Aviary: training language agents on challenging scientific tasks — Narayanan, S., Braza, J. D., Griffiths, R., Ponnapati, M., Bou, A., Laurent, J., et al. (2024). arXiv. arXiv:2412.21154