May 6, 2026
A Case for Resilient Research Data Infrastructure
A blog summary of the SciOS Resilient Data Futures whitepaper. The full paper is at rdf.scios.tech/narratives, and the underlying discourse graph and contribution model are at github.com/jring-o/rdf.
The Architecture Problem in Research Data
73 to 93 percent of papers fail to deliver their underlying data on request. This lost data, endemic across all research institutions as a byproduct of architectural decisions baked into daily operations, represents a $1.1 billion liability for an average R-1 institution. The architectural solution that hedges this liability is simple to implement, has been proven over four decades, and provides the exact properties AI-ready data policies are asking for.
And as for the 7 to 27 percent of data that is retrievable? Don't worry. It's often just one organizational decision (an acquisition, a defunding, a jurisdictional change) away from oblivion.
Training, data management plans, researcher discipline, better-staffed libraries: these address real problems and produce real but bounded improvements. None of them addresses the property that makes such volumes of loss possible in the first place: research data typically exists in a single copy, held by a single organization, funded by a single grant, maintained by a single person who is expected to leave in several years.
This flawed architecture operationalizes loss in ways familiar to any institution. A graduate student leaves and the operational knowledge of where the data lives leaves with them. A laptop is stolen. A grant ends and the server it funded ends with it. A repository shuts down (191 of them have since 2012, at a median operating age of twelve years). A platform changes its access terms (the Twitter API in 2023, GitHub blocking developers in Iran in 2019, GISAID's account suspensions, CNKI's foreign-access cutoff). A backup script goes sideways and erases 77 terabytes of supercomputer data in an afternoon.
These look like different problems, but they're not. They are manifestations of the same core problem eating away at our ability to maintain scientific and technological advancement: research data infrastructure emerged to serve a reality that no longer exists.
Four kinds of preservation, only one of which scales
Research data storage can be sorted into four tiers.
Tier 0 is local storage — one copy on one system. Most research data lives here, and the default outcome is the 17-percent annual availability decay the literature has been documenting for over a decade.
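To get a feel for what that decay rate implies, here is a minimal sketch. It assumes, purely for illustration, that the 17 percent figure applies as a constant yearly reduction in the share of datasets still available. The function name and the compounding model are assumptions made for this example, not the whitepaper's method.

```python
def share_available(years: int, annual_decay: float = 0.17) -> float:
    """Share of datasets still available after `years`, under an assumed
    constant compounding decay (illustrative model only)."""
    if years < 0:
        raise ValueError("years must be non-negative")
    return (1.0 - annual_decay) ** years


if __name__ == "__main__":
    for years in (1, 5, 10, 20):
        print(f"after {years:2d} years: {share_available(years):.1%} available")
```

Under that assumption, roughly 39 percent of datasets remain reachable after five years, which is why a single-copy default tends to look fine in year one and poor by the end of a typical grant cycle.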
Tier 1 is hosted storage with one provider — an institutional repository, a domain repository, a cloud bucket. This is the architecture most data management plans describe and most mandates produce. It absorbs local failure. It does not absorb provider failure, which arrives through hardware failure, bankruptcy, acquisition, defunding, jurisdictional change, or terms-of-service shift on a timeline the institution does not control.
Tier 2 is coordinated preservation across institutional agreements. INSDC mirrors GenBank across three continents; the Worldwide Protein Data Bank synchronizes four sites weekly; CLOCKSS preserves 63 million articles across 12 nodes. These are the most sophisticated preservation systems ever built, and they work until the coordination doesn't.
Tier 3 is protocol-level distribution. Redundancy is a byproduct of use rather than the output of an organization's continued commitment. The architecture is visible in the longest-running information systems on the internet — DNS for 43 years, email for 44, BitTorrent for 25, Git for 21. These systems have outlived the companies, the operating systems, and in some cases the institutional structures that designed them. None of them lives or dies with any single organization.
Most research data sits at Tier 0 or Tier 1. Tier 2 is reserved for data whose value justifies the financial and operational costs. Tier 3 is the only architecture in the field that produces scalable, accessible preservation, verification, and audit evidence as structural byproducts of operation.
Verification falls out of the architecture
The funder regime is shifting from "did you write a plan?" to "did you do it, and can you prove it?" The NIH's May 2026 standardized DMSP format, the Gates Foundation's transition to programmatic compliance checking, and the False Claims Act's implied-certification doctrine are converging on a single question: can the institution surface, on inspection, evidence that the data exists where the plan said it would, intact, with the access controls it claimed?
At Tiers 0 and 1, the institution cannot answer the question by inspection. Every property is an assertion the provider made and the institution cannot independently verify. At Tier 2, the consortium often runs verification on members' behalf, but it remains an assertion the consortium makes about its own protocols rather than one the institution can independently re-run — and MetaArchive's sunset audit showed those internal protocols can silently fail with no external party positioned to catch it. At Tier 3, all of those properties fall out of a single cryptographic query against the distributed network. Audit becomes inspection rather than forensic reconstruction.
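The idea of verification by inspection can be sketched in a few lines. This is a simplified illustration, not any particular protocol's implementation: it uses plain SHA-256 as the content address, whereas real systems use their own hash encodings. The function names `content_address` and `verify` and the sample bytes are hypothetical.

```python
import hashlib


def content_address(data: bytes) -> str:
    """Return a SHA-256 hex digest that serves as the content's address."""
    return hashlib.sha256(data).hexdigest()


def verify(data: bytes, expected_address: str) -> bool:
    """Re-derive the address from the bytes and compare it with the recorded one."""
    return content_address(data) == expected_address


if __name__ == "__main__":
    deposited = b"sample,value\nA,1\nB,2\n"  # stand-in for a deposited dataset
    address = content_address(deposited)     # recorded once, at deposit time

    print(verify(deposited, address))             # an unchanged copy
    print(verify(deposited + b"C,3\n", address))  # an altered copy
```

The point of the design is that the check depends only on the bytes and the recorded address, so any party holding a copy can re-run it without trusting whoever stored the data.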
The $1.1 billion liability is a carrying cost, not a realized loss: a four-term formula (sunk grant value, irreplaceable-dataset replacement, foregone reuse, and False Claims Act exposure on data the institution can't independently verify) applied to a representative R-1. The full math is in the paper. The probability of that latent exposure converting to realized cost is rising, and the architecture that produces verification by inspection is the same architecture that hedges the liability, on the same deployment.
AI runs on the same architecture
Provenance, reproducibility, federation, and verification (the data properties any defensible AI program needs in order to train, document, and deploy a model) are the same architectural properties Tier 3 produces by default. Content addressing produces provenance. Reproducibility requires the corpus to persist across the lifetime of any model trained on it, which is exactly what Tier 3 delivers. Permissioned distribution networks produce federation without consolidating sensitive data into a single trust domain. A single cryptographic query produces verification any third party can independently re-run.
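One way to picture how content addressing yields provenance for a training corpus is a manifest whose root digest changes whenever any file changes. The sketch below is an assumption-laden illustration: `corpus_manifest`, the JSON canonicalization, and the sample file names are hypothetical and not taken from the whitepaper or from any specific protocol.

```python
import hashlib
import json


def corpus_manifest(files: dict[str, bytes]) -> dict:
    """Build a manifest of per-file SHA-256 digests plus one root digest.

    The root is the hash of a canonical JSON encoding of the per-file
    digests, so it identifies the exact corpus regardless of input order.
    """
    entries = {name: hashlib.sha256(data).hexdigest()
               for name, data in files.items()}
    canonical = json.dumps(entries, sort_keys=True,
                           separators=(",", ":")).encode("utf-8")
    return {"files": entries, "root": hashlib.sha256(canonical).hexdigest()}


if __name__ == "__main__":
    corpus = {
        "train/a.csv": b"x,y\n1,2\n",  # placeholder contents
        "train/b.csv": b"x,y\n3,4\n",
    }
    manifest = corpus_manifest(corpus)
    print(manifest["root"])  # a value a model card could cite
```

A model card that cites the root digest lets a third party confirm they hold the same corpus the model was trained on, simply by rebuilding the manifest from their own copy.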
The infrastructure that hedges an institution's data-loss exposure is the infrastructure that produces its AI-ready substrate. The same investment holds both positions on the same deployment.
The infrastructure already exists
A standalone Tier 3 node — a BitTorrent seeder, a Forgejo instance, an IPFS pinning node, a Matrix homeserver, an AT Protocol PDS, a Tor relay, an Academic Torrents seeder — runs $42 to $360 a year on commodity hosting. Meanwhile, universities run servers at 37 to 41 percent utilization, networks at 26 percent, on bandwidth contracted at flat rates regardless of traffic. The marginal cost of adding a protocol node to existing institutional infrastructure approaches zero. TU Dortmund, TU Dresden, MIT, and the forty-five-plus universities running Tor relays already do it.
What you need to do
- An architectural audit of every research dataset against the four tiers. Record the number of independent copies, the failure domains they occupy, and the verification capability available.
- At least one Tier 3 node on existing institutional infrastructure within twelve months. The reference deployments exist. The operational overhead fits inside a student-volunteer team.
- Compliance evidence generated at the point of deposit. Produce content addresses, hashes, and signed attestations as the data lands, instead of reconstructing them under audit.
- Verifiable evidence of preservation required of grantees, not self-reported plans. Replace the human-readable plan with the inspectable artifact.
- F&A funding for preservation, not project funding. Three-year grants cannot underwrite multi-decade obligations.
- Local clones of everything that matters, as standard practice at the lab level. A single local clone is the difference between a Tier 1 access restriction producing permanent loss and producing temporary inconvenience.
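The audit in the first item above can start as a very small data structure. The following is a minimal sketch under stated assumptions: the `DatasetAudit` fields, the `classify_tier` rules, and the sample rows are all hypothetical simplifications of the four-tier framework, intended only to show what recording copies, failure domains, and verification capability might look like.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetAudit:
    """One row of a hypothetical audit sheet (field names are illustrative)."""
    name: str
    independent_copies: int         # copies that do not share hardware
    failure_domains: int            # distinct organizations holding a copy
    hosted_by_provider: bool        # deposited in a repository or cloud bucket
    coordinated_agreement: bool     # mirrored under an inter-institutional agreement
    protocol_distributed: bool      # replicated through an open protocol
    independently_verifiable: bool  # integrity can be re-checked by a third party


def classify_tier(audit: DatasetAudit) -> int:
    """Map an audit row to Tier 0-3 using simplified, assumed rules."""
    if audit.protocol_distributed:
        return 3
    if audit.coordinated_agreement and audit.failure_domains >= 2:
        return 2
    if audit.hosted_by_provider:
        return 1
    return 0


if __name__ == "__main__":
    # Made-up example rows, not real datasets.
    rows = [
        DatasetAudit("lab-laptop-scans", 1, 1, False, False, False, False),
        DatasetAudit("repo-deposit", 2, 1, True, False, False, False),
        DatasetAudit("torrent-seeded-corpus", 5, 4, True, False, True, True),
    ]
    for row in rows:
        print(f"{row.name}: Tier {classify_tier(row)}, "
              f"{row.independent_copies} copies, "
              f"{row.failure_domains} failure domain(s), "
              f"verifiable={row.independently_verifiable}")
```

A real audit would need finer rules than these, but even a coarse pass like this makes it visible how much of a portfolio sits in a single failure domain.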
How
The infrastructure exists, and it's fairly straightforward to deploy. The economics favor deployment by more than an order of magnitude, while the compliance regime appears to be closing the window in which the decision is voluntary. And that same deployment positions the institution for the emerging AI-data-readiness funding cycle.
For institutions: SciOS's Resilient Data Futures Lab has already worked with research labs to implement this infrastructure. We are further coordinating reference deployments, audit templates, and cost models across institutions via the RDF working group out of the lab. If you'd like our help directly implementing a Tier 3 solution for your data, we're ready to embed with your team, identify the solutions that best fit your processes, and get them running. Reach out at contact@scios.tech to scope the engagement. If you'd like to move forward on your own, ask your faculty and students. Someone at your institution almost certainly wants to implement Tier 3 infrastructure (or already has).
For faculty/students/contributors: If research data, distributed systems, and AI-ready data are in your native vocabulary, the next decade of scientific infrastructure is yours to architect. The RDF working group out of the Resilient Data Futures Lab meets monthly (next call on May 7th). Join the Resilient Data Futures Lab for an invitation to the calls, where we discuss various solutions and our progress implementing them.
For builders: If the need to implement solutions is in your blood, whether at your own institution or on a laptop in your basement, SciOS builds daily. Reach out to join us: contact@scios.tech.
"Publishing" Method — 43 questions, 53 claims, 122 pieces of evidence, 6 methods, and 135 sources
The full paper is at rdf.scios.tech/narratives.
Beneath the paper, we structured our work as a discourse graph, a communication method designed to make every claim, evidence item, question, source, and method individually addressable, individually contributable, and stored on distributed infrastructure rather than a single hosted server. A separate post on what this publishing form changes about scientific communication is on the way, or you can read more about Discourse Graphs now.
To engage with this paper — provide evidence that counters a claim, pose a new question, discuss the details of a claim or node, and so on — the discourse graph and contribution model live at github.com/jring-o/rdf, and the rendered narrative, narrative generator, node browser, and node creator are at rdf.scios.tech.
As always, feel free to reach out for questions, comments, engagements, or just to say hello.