Aviary (FutureHouse)¶
external · status: active · focus: end-to-end · discipline: general · started: 2024
Project page: https://github.com/Future-House/aviary
Source: projects/landscape/aviary.yml
Positioning¶
A gymnasium for defining custom language-agent environments (arXiv:2412.21154), with pre-built environments for math, general knowledge, biological sequences, scientific literature search, and protein stability. Aviary is evaluation infrastructure for RISE systems rather than a RISE pipeline itself, but it is directly relevant because it defines tasks against which RISE pipelines can be measured.
Distinctive contribution¶
Provides a standardized environment-and-agent abstraction (paired with the LDP language-decision-process library) for benchmarking agentic systems on scientific tasks. Lowers the cost of producing comparable evaluations across pipelines.
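To make the environment-and-agent abstraction concrete, here is a minimal sketch in plain Python. It does not use Aviary's or LDP's actual API: `CountdownEnv`, `StepResult`, `greedy_agent`, `always_two_agent`, `rollout`, and `score` are all hypothetical names invented for illustration. It only shows the general pattern of a reset/step environment, an agent that maps observations to actions, a logged trajectory, and an environment-level score.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class StepResult:
    """What the (hypothetical) environment returns after each action."""
    observation: str
    reward: float
    done: bool


class CountdownEnv:
    """Toy task: reach exactly zero from `start` by subtracting 1 or 2 per step."""

    def __init__(self, start: int = 5) -> None:
        self.start = start
        self.value = start

    def reset(self) -> str:
        # Restore the initial state and return the first observation.
        self.value = self.start
        return f"value={self.value}"

    def step(self, action: int) -> StepResult:
        if action not in (1, 2):
            raise ValueError("action must be 1 or 2")
        self.value -= action
        done = self.value <= 0
        # Reward only an exact landing on zero; overshooting earns nothing.
        reward = 1.0 if self.value == 0 else 0.0
        return StepResult(f"value={self.value}", reward, done)


# An agent is any callable from observation to action (an assumption of this sketch).
Agent = Callable[[str], int]
Trajectory = List[Tuple[str, int, float]]


def greedy_agent(observation: str) -> int:
    # Subtract 2 while possible, then 1, so the episode ends exactly at zero.
    value = int(observation.split("=")[1])
    return 2 if value >= 2 else 1


def always_two_agent(observation: str) -> int:
    # A weaker baseline that ignores the observation.
    return 2


def rollout(env: CountdownEnv, agent: Agent, max_steps: int = 20) -> Trajectory:
    """Run one episode and log (observation, action, reward) for every step."""
    trajectory: Trajectory = []
    observation = env.reset()
    for _ in range(max_steps):
        action = agent(observation)
        result = env.step(action)
        trajectory.append((observation, action, result.reward))
        observation = result.observation
        if result.done:
            break
    return trajectory


def score(trajectory: Trajectory) -> float:
    """Environment-level score: total reward over the episode."""
    return sum(reward for _, _, reward in trajectory)


if __name__ == "__main__":
    for name, agent in [("greedy", greedy_agent), ("always_two", always_two_agent)]:
        print(name, score(rollout(CountdownEnv(5), agent)))
```

Because both agents run against the same environment and scoring function, their results are directly comparable, which is the property the abstraction is meant to provide across pipelines.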
Evaluation scores¶
| Dimension | Score (0–3) | Note |
|---|---|---|
| Lifecycle coverage | 0 | Cross-cutting evaluation infrastructure; does not produce scholarly artifacts itself. |
| Autonomy level | 2 | Supervised: user defines environments; agents act within them. |
| Architectural transparency | 3 | Open under Apache-2.0; arXiv paper; tutorials; full documentation site. |
| Inputs supported | 2 | Multiple environment definitions ship with the library; user-defined environments supported. |
| Outputs / reproducibility | 3 | Pip-installable; deterministic given fixed model + environment; designed for benchmark reproducibility. |
| Internal evaluation | 2 | Used by FutureHouse to evaluate their own agents (Robin, PaperQA); benchmarks published in the arXiv paper. |
| Openness | 3 | Apache-2.0; PyPI as fhaviary; sister library LDP also open. |
| Maturity / traction | 2 | 261 GitHub stars; active development; embedded in FutureHouse evaluation stack. |
| Cross-family policy | 1 | Environment-agnostic; cross-family possible by user setup. |
| Runtime assurance | 1 | Trajectory logging + environment-level scoring; not a runtime claim-audit harness. |
| Cross-platform portability | 2 | Pip-installable; pairs with LDP; multi-environment by design. |
Scored on 2026-05-18. See the evaluation rubric.
Tags¶
Pipeline stages: data-analysis literature-discovery
Architectural features: tool-use artifact-versioning
Inputs: task-specification
Outputs: agent-trajectories evaluation-metrics
Data sources: benchmark-datasets scientific-literature
Knowledge sources: paper-qa
Limitations¶
- Evaluation infrastructure, not a RISE pipeline; included here for completeness and scored conservatively on lifecycle coverage.
- Benchmarks reflect only the environments included; coverage of social-science tasks is limited.
Papers describing this project¶
- Aviary: training language agents on challenging scientific tasks — Narayanan, S., Braza, J. D., Griffiths, R., Ponnapati, M., Bou, A., Laurent, J., et al. (2024). arXiv. arXiv:2412.21154
Related references (literature catalog)¶
- Wu, J. et al. (2025). Agentic Reasoning: A Streamlined Framework for Enhancing LLM Reasoning with Agentic Tools