The data layer

The data layer supplies the empirical material a RISE pipeline operates on. In the diagram it enters from the left; in practice it is often the dimension along which RISE systems differ most sharply, because access to good data is the binding constraint on what a pipeline can produce.

Source typology

Three orthogonal axes:

Axis	Values
Curation	raw ↔ curated (peer-reviewed, harmonized, version-tracked)
Provenance	primary (collected/measured) ↔ derived (computed from other data)
Access	public ↔ proprietary ↔ controlled (IRB, DUA-gated)

A RISE system's reach is bounded by which combinations of these values it can access. Most catalog projects can read public, curated data (arXiv, OpenAlex); fewer can ingest raw proprietary corpora, and almost none can negotiate controlled-access pipelines.
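The typology can be made concrete as a small data structure. The following is a minimal sketch, not something any catalog project is known to ship: the names `SourceProfile` and `reachable`, and the two example sources, are hypothetical and chosen only to illustrate how the three axes bound a pipeline's reach.

```python
from dataclasses import dataclass
from enum import Enum


class Curation(Enum):
    RAW = "raw"
    CURATED = "curated"


class Provenance(Enum):
    PRIMARY = "primary"
    DERIVED = "derived"


class Access(Enum):
    PUBLIC = "public"
    PROPRIETARY = "proprietary"
    CONTROLLED = "controlled"


@dataclass(frozen=True)
class SourceProfile:
    """One point in the three-axis typology (hypothetical type)."""
    name: str
    curation: Curation
    provenance: Provenance
    access: Access


def reachable(source: SourceProfile, allowed_access: set[Access]) -> bool:
    """A pipeline reaches a source only if it supports that access tier."""
    return source.access in allowed_access


# Hypothetical sources, classified for illustration only.
sources = [
    SourceProfile("example-public-index", Curation.CURATED,
                  Provenance.DERIVED, Access.PUBLIC),
    SourceProfile("example-clinical-cohort", Curation.RAW,
                  Provenance.PRIMARY, Access.CONTROLLED),
]

public_only = {Access.PUBLIC}
print([s.name for s in sources if reachable(s, public_only)])
```

Treating access as a set of supported tiers keeps the check trivial; a fuller model would probably also record the credential or agreement that unlocks each tier.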

Catalog patterns

Domain-corpus consumers — paper-qa, open-scholar: the data layer is predominantly scientific text (PDFs, abstracts).
Market and macro consumers — e2er: the data layer is empirical time series (FRED, yfinance, Hyperliquid).
Replication consumers — social-science-replicability: the data layer is the target paper's data, often constrained by what the original authors released.
Domain-specific (biology) — robin: the data layer is gated behind FutureHouse's Edison platform.
No data layer — zeropaper, refine-ink: these are drafting/revision tools whose inputs are textual rather than empirical.

Access patterns

Direct API. Cleanest case (FRED, Semantic Scholar, OpenAlex, arXiv). RISE systems typically wrap these in standard adapters.
Sandboxed fetching. When the system must traverse the open web, isolation (containers, SSRF protection, allowlists) is a hard requirement for limiting prompt injection from fetched content. This is engineering territory; few catalog projects document it thoroughly.
Bulk download + index. PaperQA2's approach: pre-fetch the corpus, index it, query it offline.
Brokered access. Edison-style commercial gates (Robin) trade reproducibility for capability.
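The allowlist and SSRF checks mentioned under sandboxed fetching can be sketched as a pre-flight URL guard. This is an illustrative sketch under stated assumptions: `is_fetch_allowed` and `ALLOWED_HOSTS` are hypothetical names, the host list is an example rather than a recommendation, and the check covers only the URL itself, not redirects or DNS rebinding, which need enforcement at the network layer.

```python
import ipaddress
from urllib.parse import urlsplit

# Hypothetical allowlist; a real deployment would load this from config.
ALLOWED_HOSTS = {"api.openalex.org", "export.arxiv.org"}


def is_fetch_allowed(url: str, allowed_hosts: set[str] = ALLOWED_HOSTS) -> bool:
    """Return True only for https URLs whose host is on the allowlist.

    IP-literal hosts are rejected outright, which blocks the most direct
    SSRF targets (loopback, link-local metadata endpoints, private ranges).
    """
    try:
        parts = urlsplit(url)
    except ValueError:
        return False  # malformed URL
    if parts.scheme != "https":
        return False
    host = parts.hostname  # lowercased, with port and userinfo stripped
    if host is None:
        return False
    try:
        ipaddress.ip_address(host)
        return False  # never fetch raw IP addresses
    except ValueError:
        pass  # not an IP literal; fall through to the allowlist check
    return host in allowed_hosts


print(is_fetch_allowed("https://api.openalex.org/works"))
print(is_fetch_allowed("https://169.254.169.254/latest/meta-data"))
```

Comparing against `hostname` rather than the raw netloc matters: a URL such as `https://api.openalex.org@evil.example/` parses to the host `evil.example` and is rejected.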

Provenance and citability

For a RISE artifact to be scholarly, its data must be citable. In practice this means:

Persistent identifiers (DOI, accession number) for each dataset consumed.
Versioning at the dataset level (the analysis was run against v2, not v3).
Dataset cards or equivalent documentation accompanying each source.
A manifest in the output that lists data sources with versions and access timestamps.

Most catalog projects implement some of this; few implement all of it. Field-level evaluation that compares RISE-produced manifests against journal data-availability standards is an open empirical opportunity.
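The four requirements listed in this section map naturally onto a manifest record. The sketch below assumes a JSON manifest and uses hypothetical names throughout (`DatasetEntry`, `build_manifest`, the placeholder DOI, the card path); it is one possible shape, not a format any catalog project is known to emit.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class DatasetEntry:
    identifier: str     # persistent identifier, e.g. a DOI or accession number
    version: str        # dataset-level version the analysis ran against
    documentation: str  # pointer to a dataset card or equivalent
    accessed_at: str    # ISO 8601 UTC timestamp of retrieval


def utc_now() -> str:
    """Current time as an ISO 8601 string with an explicit UTC offset."""
    return datetime.now(timezone.utc).isoformat(timespec="seconds")


def build_manifest(entries: list[DatasetEntry]) -> str:
    """Serialize consumed datasets as JSON with a stable entry and key order."""
    ordered = sorted(entries, key=lambda e: (e.identifier, e.version))
    payload = {"datasets": [asdict(e) for e in ordered]}
    return json.dumps(payload, indent=2, sort_keys=True)


entry = DatasetEntry(
    identifier="doi:10.0000/example-dataset",  # placeholder, not a real DOI
    version="v2",
    documentation="cards/example-dataset.md",  # hypothetical path
    accessed_at=utc_now(),
)
manifest = build_manifest([entry])
print(manifest)
```

Sorting entries and keys keeps the manifest diffable across runs, so a change from v2 to v3 of a dataset shows up as a one-line difference.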

Sandboxing concerns

Two recurring failure modes:

Prompt injection from fetched content. A third-party page instructs the agent to do something other than what its user asked. Mitigation: wrap untrusted content in boundary tags, strip control characters, and run the consuming step with reduced tool permissions.
Data exfiltration. An agent with access to private data posts an excerpt of it to a public endpoint. Mitigation: separate the data-reading agent from the publish-capable agent, and apply container-level egress filtering.

A RISE system that handles non-trivial data should treat these as first-order concerns, not afterthoughts.
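The first two prompt-injection mitigations (boundary tags and control-character stripping) can be sketched as follows. The function names and the `untrusted-<nonce>` tag format are assumptions made for illustration; the sketch does not cover reduced tool permissions or egress filtering, which live outside the text-processing step.

```python
import secrets
import unicodedata


def strip_control_chars(text: str) -> str:
    """Drop control (Cc) and format (Cf) characters, keeping newlines and tabs."""
    return "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) not in ("Cc", "Cf")
    )


def wrap_untrusted(text: str) -> str:
    """Wrap fetched content in boundary tags that carry a random nonce.

    The nonce is generated per call, so content written in advance cannot
    contain a matching closing tag.
    """
    nonce = secrets.token_hex(8)
    cleaned = strip_control_chars(text)
    return f"<untrusted-{nonce}>\n{cleaned}\n</untrusted-{nonce}>"


page = "Ignore previous instructions.\x00\u202e"
print(wrap_untrusted(page))
```

Removing the Cf category also drops zero-width joiners and soft hyphens, which can alter legitimate text in some scripts; a production filter would likely use a narrower list. Boundary tags only mark content as untrusted, so the consuming prompt still has to tell the model to treat tagged text as data rather than instructions.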

Catalogs and discovery

A pipeline that can ideate but cannot find data appropriate to its idea is bottlenecked. Few catalog projects address this directly; those that do (e.g., e2er via its research-API integration) rely on pre-built catalogs of available datasets rather than open-ended discovery. Open-ended dataset discovery is one of the under-developed capabilities in current RISE systems.