Open Data for Storage Research: Building a Shared Benchmark Repository for PLC and TLC SSDs
Design a FAIR community repository and metadata schema to compare PLC and TLC SSDs with reproducible benchmarks and cost/endurance data.
Why the storage research community must stop reinventing SSD benchmarks
For students, instructors and researchers trying to compare PLC and TLC SSDs, the problem isn’t a lack of data — it’s fragmentation. Benchmarks are scattered across vendor white papers, lab notebooks and paywalled articles; test conditions and metadata are inconsistent or missing; cost and endurance trade-offs are recorded in incomparable units. The result: wasted time, ambiguous conclusions and difficulty reproducing results. In 2026, as PLC (5-bit-per-cell) prototypes begin appearing more often and AI workloads continue driving demand for high-capacity, low-cost flash, a community-facing, open dataset and metadata schema for SSD benchmarks is no longer a nice-to-have — it’s essential.
Executive summary: What this article delivers
This article presents a practical blueprint for building a shared benchmark repository that hosts performance, endurance and cost data for NAND technologies (PLC, TLC and others). You’ll get:
- A minimal, interoperable metadata schema tailored for SSD benchmarks
- Concrete test definitions and reproducible benchmarking practices
- Data hosting, licensing and provenance recommendations aligned with FAIR principles
- Governance, quality control and incentives to grow a community repository
- Examples for meta-analysis and reproducible comparisons across devices
Context and 2026 trends
By early 2026, several trends shape the urgency and design of a shared SSD dataset:
- Vendors such as SK Hynix have publicized novel manufacturing approaches that make PLC plausible at scale, intensifying interest in cost-per-bit vs. endurance trade-offs (late 2025 reports).
- AI workloads and generative workloads have increased demand for high-capacity SSDs, placing endurance and power consumption at the center of evaluation criteria.
- Open science momentum and funder expectations for shareable, reproducible data have grown; researchers are increasingly required to publish datasets alongside papers.
- Community tooling (containerized benchmarks, CI pipelines and data registries) has matured enough to support automated ingestion and validation workflows.
Design goals: What a community repository must achieve
To be useful and durable, a community SSD benchmark repository must meet these goals:
- Interoperability — machine-readable metadata (JSON-LD/JSON Schema) and standard units so datasets combine cleanly.
- Reproducibility — full procedure records: firmware, board, power state, ambient conditions and exact benchmark commands.
- FAIRness — Findable, Accessible, Interoperable and Reusable: DOIs, persistent identifiers and standard licenses.
- Extensibility — schema must accommodate new tests (e.g., PLC-specific stress modes) and new metrics (energy per I/O).
- Governance — community curation, review and versioning to maintain quality.
Minimal dataset and metadata schema (practical template)
Below is a practical, minimal schema you can adopt immediately. It captures device descriptors, test setup, raw results and provenance. The schema is presented as human-readable fields and a short JSON snippet to jump-start implementation.
Core entity groups
- Device: manufacturer, model, NAND type (PLC/TLC/QLC), die process, capacity, over-provisioning percent, controller, firmware version, vendor part number.
- Test environment: test rig (host CPU, kernel, SATA/NVMe controller), power supply, ambient temperature, humidity, power-loss injection method, enclosure.
- Test protocol: benchmark tool and exact command line, IO pattern (sequential/random), block size, concurrency, queue depth, duration, preconditioning routine.
- Metrics: throughput (MB/s), IOPS, latency percentiles (P50/P95/P99), endurance cycles to failure (program/erase cycles), data retention measurements, energy per I/O (Joules), cost-per-GB at measurement date.
- Provenance: author, institution, timestamp, dataset DOI, raw logs location, hash of raw files (SHA256), license.
Example metadata template in JSON (start here)
{
  "device": {
    "manufacturer": "string",
    "model": "string",
    "nand_type": "PLC|TLC|QLC",
    "capacity_bytes": "integer",
    "over_provisioning_pct": "number",
    "controller": "string",
    "firmware_version": "string"
  },
  "environment": {
    "host_os": "string",
    "kernel_version": "string",
    "nvme_controller": "string",
    "ambient_celsius": "number",
    "power_supply_spec": "string"
  },
  "test_protocol": {
    "tool": "fio|vdbench|custom",
    "command": "string",
    "pattern": "seq|rand",
    "block_size_bytes": "integer",
    "queue_depth": "integer",
    "duration_seconds": "integer",
    "preconditioning": "string"
  },
  "metrics": {
    "throughput_mb_per_s": "number",
    "iops": "number",
    "latency_ms": {"p50": "number", "p95": "number", "p99": "number"},
    "endurance_cycles": "integer",
    "energy_per_io_joules": "number",
    "cost_usd_per_gb": "number"
  },
  "provenance": {
    "author": "string",
    "institution": "string",
    "timestamp": "ISO8601 string",
    "doi": "string",
    "raw_data_hash": "sha256",
    "license": "SPDX identifier string"
  }
}
Note: This fragment is intentionally minimal. Use JSON Schema or schema.org extensions and add controlled vocabularies for fields like nand_type, benchmark tool and pattern to enforce consistency.
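As a concrete starting point, the controlled-vocabulary idea can be enforced with the Python jsonschema package. The sketch below covers only the device block; the field names mirror the template above, while the enum values and example record are illustrative.

# validate_device.py: minimal sketch using the jsonschema package (pip install jsonschema)
from jsonschema import ValidationError, validate

DEVICE_SCHEMA = {
    "type": "object",
    "required": ["manufacturer", "model", "nand_type", "capacity_bytes"],
    "properties": {
        "manufacturer": {"type": "string"},
        "model": {"type": "string"},
        # Controlled vocabulary: the enum rejects free-text variants such as "penta-level cell".
        "nand_type": {"type": "string", "enum": ["SLC", "MLC", "TLC", "QLC", "PLC"]},
        "capacity_bytes": {"type": "integer", "minimum": 1},
        "over_provisioning_pct": {"type": "number", "minimum": 0, "maximum": 100},
        "controller": {"type": "string"},
        "firmware_version": {"type": "string"},
    },
}

record = {"manufacturer": "ExampleCorp", "model": "EC-P5", "nand_type": "PLC", "capacity_bytes": 2_000_000_000_000}

try:
    validate(instance=record, schema=DEVICE_SCHEMA)
    print("device record is valid")
except ValidationError as err:
    print(f"rejected: {err.message}")

The same enum pattern applies to the test_protocol tool and pattern fields.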
Defining reproducible benchmarks: protocols and preconditioning
Reproducibility is where many SSD studies fail. To compare PLC and TLC meaningfully, tests must be standardized and fully documented. Adopt the following practices:
- Preconditioning: define an explicit routine (e.g., fill to 100% capacity, then perform 10 full-drive writes) and publish the exact commands and timestamps.
- Repetition: run each test at least three times and report central tendency and dispersion (mean, SD, confidence intervals).
- Firmware and thermal control: record firmware and power/thermal management states; report steady-state temperature and use external sensors where possible.
- Power-loss scenarios: if evaluating durability, specify the power-loss injection method (gradual ramp-down vs. immediate cut) and whether power-loss-protection capacitors were present.
- Energy measurement: measure at the device level when possible; report method and sensor calibration.
- Raw logs: archive raw fio logs, SMART dumps, NVMe logs and provide cryptographic hashes to guarantee integrity.
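The command and raw-log requirements can be scripted end to end. The sketch below runs one fio job, then records the exact command line and a SHA-256 of the JSON output for the provenance block; the fio options are standard, but the target device, file names and workload parameters are illustrative and should match your published protocol.

# run_and_capture.py: sketch that runs one fio job and keeps the exact command plus a SHA-256 of the raw log
import hashlib, json, subprocess
from datetime import datetime, timezone

cmd = [
    "fio", "--name=randread_4k", "--filename=/dev/nvme0n1",   # illustrative target device
    "--rw=randread", "--bs=4k", "--iodepth=32", "--direct=1",
    "--runtime=600", "--time_based", "--output-format=json",
    "--output=randread_4k.json",
]
subprocess.run(cmd, check=True)

with open("randread_4k.json", "rb") as f:
    raw = f.read()

provenance = {
    "command": " ".join(cmd),                              # goes into test_protocol.command
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "raw_data_hash": hashlib.sha256(raw).hexdigest(),      # goes into provenance.raw_data_hash
}
print(json.dumps(provenance, indent=2))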
Cost and marketplace data: capturing temporal context
Cost-per-GB is time-sensitive. Capture these fields to make cost data useful for meta-analysis:
- Retail price (USD) with timestamp and seller
- Manufacturer suggested price if available
- Bulk/enterprise pricing notes
- Currency and exchange rates used to normalize prices
In meta-analyses, treat cost as a covariate dependent on date; use inflation-adjusted dollars or categorize prices into bins to reduce heteroscedasticity.
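As a small illustration of that normalization step, the sketch below converts an observed price to constant base-year dollars using a deflator table and assigns it to a coarse bin; the deflator values and bin edges are placeholders, not real index data.

# normalize_cost.py: sketch converting observed cost-per-GB to constant (base-year) dollars and binning it
# The deflator values below are placeholders, not real CPI figures.
DEFLATOR_TO_2026_USD = {2024: 1.06, 2025: 1.03, 2026: 1.00}

def constant_dollars(cost_usd_per_gb: float, observation_year: int) -> float:
    return cost_usd_per_gb * DEFLATOR_TO_2026_USD[observation_year]

def price_bin(cost_usd_per_gb: float) -> str:
    # Coarse bins reduce heteroscedasticity when price enters a model as a covariate.
    if cost_usd_per_gb < 0.03:
        return "low"
    if cost_usd_per_gb < 0.08:
        return "mid"
    return "high"

adjusted = constant_dollars(0.045, 2024)
print(adjusted, price_bin(adjusted))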
Data hosting, persistent identifiers and licensing
Select infrastructure that supports FAIR principles:
- Repository: use Zenodo, Dataverse, OSF or institutional data repositories for DOIs. For active curation, combine GitHub/GitLab (metadata and code) with Zenodo snapshots for DOI minting.
- Large files: use object storage (S3/MinIO) with signed URLs; store hashes in the metadata repository.
- Licensing: prefer CC0 or CC-BY for data, and MIT/Apache for code. Include SPDX identifiers in metadata.
- Provenance: capture dataset creation pipelines using PROV or RO-Crate to make transformation steps transparent.
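For RO-Crate, a minimal descriptor needs nothing beyond the standard library, because an RO-Crate is just a JSON-LD file named ro-crate-metadata.json at the root of the dataset. The sketch below bundles one raw fio log and one metadata record; the dataset name, file names and license are placeholders, and a real crate would add richer PROV-style detail.

# make_ro_crate.py: sketch emitting a minimal ro-crate-metadata.json for one benchmark run
import json

crate = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {"@id": "ro-crate-metadata.json", "@type": "CreativeWork",
         "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"}, "about": {"@id": "./"}},
        {"@id": "./", "@type": "Dataset",
         "name": "PLC vs TLC fio run (example)",
         "license": "https://creativecommons.org/publicdomain/zero/1.0/",
         "hasPart": [{"@id": "randread_4k.json"}, {"@id": "metadata.json"}]},
        {"@id": "randread_4k.json", "@type": "File", "encodingFormat": "application/json"},
        {"@id": "metadata.json", "@type": "File", "encodingFormat": "application/json"},
    ],
}

with open("ro-crate-metadata.json", "w") as f:
    json.dump(crate, f, indent=2)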
Validation, QA and community governance
To scale beyond a few contributors, institute lightweight quality control:
- Automated validators: JSON Schema validation, unit checks (e.g., throughput > 0) and cross-field constraints (e.g., latency p99 ≥ p95 ≥ p50); a validator sketch follows this list.
- Review process: community reviewers sign off on submissions before DOI minting.
- Versioning policy: semantic versioning for the schema; immutable DOIs for published datasets, with links from superseded versions to their successors. Embed versioning and change logs in the governance model.
- Curation metadata: add a "curation_status" field (draft, reviewed, published) and list reviewers.
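The sketch below shows the kind of unit and cross-field rules a CI validator can apply after JSON Schema validation passes; it assumes records shaped like the template above, and the steady-state threshold is illustrative.

# cross_field_checks.py: lightweight sanity rules a CI job can run after JSON Schema validation
import json

def check_record(rec: dict) -> list[str]:
    problems = []
    metrics = rec["metrics"]
    if metrics["throughput_mb_per_s"] <= 0 or metrics["iops"] <= 0:
        problems.append("throughput and IOPS must be positive")
    lat = metrics["latency_ms"]
    if not (lat["p50"] <= lat["p95"] <= lat["p99"]):
        problems.append("latency percentiles must be non-decreasing (p50 <= p95 <= p99)")
    if rec["test_protocol"]["duration_seconds"] < 60:  # illustrative steady-state threshold
        problems.append("runs shorter than 60 s are unlikely to reach steady state")
    return problems

with open("metadata.json") as f:
    print(check_record(json.load(f)))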
Meta-analysis methods for PLC vs TLC comparisons
When you have consolidated datasets, treat device comparisons as observational studies and control for confounders:
- Use mixed-effects models with random intercepts for study/lab and fixed effects for NAND type, firmware generation, ambient temperature and preconditioning protocol (see the sketch after this list).
- Normalize endurance measurements to program/erase cycle definitions or report both raw cycles and normalized cycles based on vendor-specified cell endurance.
- Apply propensity-score matching if comparing devices from different eras or market segments to reduce selection bias.
- Report heterogeneity (I^2) and perform sensitivity analyses excluding vendor-provided data or unpublished datasets.
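A hedged sketch of that mixed-effects approach using statsmodels is shown below; the CSV file name and column names assume a flattened export of the repository (one row per run) rather than being part of the schema, and the formula is deliberately simple.

# plc_vs_tlc_model.py: random intercept per lab, fixed effects for NAND type and test conditions
import pandas as pd
import statsmodels.formula.api as smf

# Assumed flattened export of the repository: one row per benchmark run with columns
# lab, nand_type, firmware_gen, ambient_celsius, preconditioning, endurance_cycles.
runs = pd.read_csv("consolidated_runs.csv")

model = smf.mixedlm(
    "endurance_cycles ~ C(nand_type) + C(firmware_gen) + ambient_celsius + C(preconditioning)",
    data=runs,
    groups=runs["lab"],   # random intercept for each contributing lab/study
)
result = model.fit()
print(result.summary())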
Tools and reproducible workflows
Practical tooling choices to operationalize the repository:
- Benchmarking: fio (scripted), Vdbench, or vendor SDKs wrapped in containers (Docker) for portability.
- Orchestration: GitHub Actions or GitLab CI to run validators, publish snapshots to Zenodo and deploy metadata APIs.
- Provenance: RO-Crate for bundling data, code and metadata; DataCite for DOI metadata mapping.
- Search and API: Elasticsearch-backed API exposing faceted search over nand_type, metric ranges, date and license.
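Because the article sketches an Elasticsearch-backed API, here is how a faceted query could look against the plain REST _search endpoint; the host, index name and field mappings are assumptions about how the metadata would be indexed.

# faceted_search.py: sketch filtering by nand_type and an IOPS range, faceting on license
# Assumes keyword mappings exist for the term/terms fields in an index named ssd-benchmarks.
import json
import requests

query = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"device.nand_type": "PLC"}},
                {"range": {"metrics.iops": {"gte": 100000}}},
            ]
        }
    },
    "aggs": {"by_license": {"terms": {"field": "provenance.license"}}},
    "size": 20,
}

resp = requests.post("http://localhost:9200/ssd-benchmarks/_search", json=query, timeout=10)
print(json.dumps(resp.json()["aggregations"]["by_license"], indent=2))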
Community building: incentives and sustainability
People contribute when they benefit. Design incentives:
- DOIs and citation guidance: datasets should be citable and count in impact metrics.
- Contributor badges and authorship records on dataset landing pages.
- Small grants or sponsorships for maintaining the registry; partnerships with academic labs and consortia.
- Annual benchmark challenges and leaderboards that encourage contributions while emphasizing reproducible methods over raw rank.
Quality case study (fictional, practical example)
Imagine Lab A and Lab B both publish endurance data for a PLC prototype and a TLC control. Lab A measured at 25 °C with firmware v1.2 and a 4-hour preconditioning routine; Lab B used 35 °C and a 24-hour preconditioning. Without metadata, an analyst might conclude PLC is less durable. With the schema above, you can include temperature and preconditioning as covariates — the mixed-effects model shows that after adjustment, PLC performs comparably on endurance but shows higher variance under elevated temperature. That nuance is only possible with rich metadata.
"Aggregating SSD benchmarks without metadata is like comparing apples to oranges and calling it a fruit salad." — Community Best Practices
Advanced strategies and future-proofing (2026+)
Looking ahead, plan for:
- Firmware transparency: advocating for vendor-provided debug logs or 'research firmware' to enable fair tests.
- Energy-centric metrics: as AI workloads scale, energy-per-IO and energy-per-GB transferred will matter more than raw throughput.
- Cross-layer benchmarks: incorporate host software stacks and compression/FTL behaviors as first-class metadata so meta-analyses reflect system-level performance.
- Machine-readable provenance: instrumented datasets containing reproducible notebooks and container hashes, enabling one-click re-runs in cloud CI environments.
Actionable checklist to launch a repository in 8 weeks
- Week 1: Form a steering group (3–5 labs) and agree on minimum fields from this article.
- Week 2: Publish a JSON Schema and README on GitHub; register a community namespace.
- Week 3: Create a validation CI pipeline and sample dataset (one PLC, one TLC run) with RO-Crate packaging.
- Week 4–5: Integrate Zenodo for DOI minting and pick a license (CC0 for data recommended).
- Week 6: Open call for contributions; host a virtual workshop demonstrating upload and validation.
- Week 7–8: Triage submissions, publish the first curated dataset and announce via mailing lists and social channels.
Final thoughts: research impact and reproducibility
In 2026, with PLC entering the conversation more loudly, the academic and practitioner communities need a shared language to evaluate storage trade-offs. A community-facing SSD benchmark repository — built around an explicit metadata schema, FAIR practices and reproducible benchmarks — transforms isolated experiments into cumulative science. It enables robust meta-analyses, reduces duplication of effort, and makes evidence-based claims about cost, endurance and performance defensible.
Call to action
Ready to make SSD benchmarking reproducible? Start small: clone the JSON Schema template in this article, run a single fio workload with full metadata, and publish it with a DOI. Join peers to form a steering group, or propose this schema to your lab’s data management team. If you’re an instructor, adopt the repository as a class project to teach FAIR data and reproducible research practices. Contribute one dataset — your single submission will help shape standards that make PLC vs TLC comparisons fair, transparent and useful for everyone.