Creating a Public-Use Dataset to Study Financial Inclusion: ABLE Accounts Expansion
Step-by-step guide to build anonymized, IRB‑compliant public files from administrative sources to study ABLE account expansion and financial inclusion.
Why researchers struggle to measure the financial impact of ABLE expansion—and how to fix it
Access to administrative sources promises definitive answers about how the 2025–26 expansion of ABLE accounts (eligibility now extended to age 46) affects savings, benefit receipt, and financial inclusion. Yet many teams hit three persistent barriers: (1) paywalled or siloed administrative data, (2) privacy constraints that block sharing, and (3) lack of reproducible, IRB‑approved pipelines to make a safe public‑use file. This guide gives a step‑by‑step, practical workflow for turning administrative records into an anonymized, IRB‑compliant public‑use dataset to evaluate ABLE eligibility expansion and its downstream effects on financial inclusion.
Executive summary
Start with stakeholder alignment and an IRB‑ready protocol. Use secure deterministic linkage inside a protected environment to join sources, then apply layered anonymization (removing direct identifiers, generalization, perturbation, and synthesis). Produce a high‑utility public‑use file (PUF) plus a tightly governed restricted tier for validated researchers. Validate residual disclosure risk using both empirical attacks and formal metrics (e.g., k‑anonymity, l‑diversity, differential privacy). Release rich metadata, code, and synthetic exemplars to maximize reuse while preserving privacy. Below are the practical steps, templates, and governance practices I’ve used successfully in studies of social benefits, adapted for ABLE research in 2026.
Step 1 — Project scoping and stakeholder alignment
- Define the research questions. Example: Did ABLE eligibility expansion to age 46 increase net asset accumulation among SSI recipients aged 26–46 within 24 months of eligibility?
- Map required inputs. Typical administrative sources: state ABLE account registries, bank records (account openings and balances), SSA/SSI enrollment files, Medicaid enrollment, unemployment insurance (UI) wage records, and tax records (where accessible).
- Engage stewards early. Meet custodians with a clear diagram of data flows, intended linkages, and proposed outputs. Establish a memorandum of understanding (MOU) and discuss DUAs for both restricted and public files.
- Decide on final output tiers. Recommend a two‑tier model: (A) a heavily anonymized public‑use file (PUF) and (B) a restricted analysis environment (secure data enclave) with richer variables available under vetted proposals.
Step 2 — IRB and legal foundations (what to prepare)
For administrative data projects in 2026, IRBs expect detailed technical plans. Prepare the following sections for your IRB application and DUA negotiations:
- Purpose and significance—specific aims about ABLE expansion and financial inclusion outcomes.
- Data inventory and provenance—source, fields requested, refresh cadence, and custodians.
- Data flow diagram—who has access at each step and where linkages occur (include secure enclave nodes).
- Re-identification risk assessment—baseline risk, planned mitigations, re‑identification testing, and thresholds for release.
- Privacy-preserving techniques—deterministic linkage via HMAC, generalization rules, suppression thresholds, differential privacy parameters (if applicable), and synthetic-data strategy.
- Data security—encryption at rest, TLS in transit, key management, access controls, logging, and incident response.
- Consent and waiver justification—for administrative data you will usually request a waiver of consent; justify minimal risk and the impracticability of obtaining consent.
Example IRB language (brief)
This project uses existing administrative records to examine financial outcomes following ABLE eligibility expansion. All direct identifiers will be removed prior to public distribution; linkage will be performed within a secure enclave under HMAC-based pseudonymization. The expected risk of re-identification for the public file is minimal due to aggregation, perturbation, and formal disclosure control testing.
Step 3 — Secure linkage and creating a non-reversible pseudo‑ID
Linkage frequently requires name, DOB, SSN, and other identifiers. Never export raw identifiers outside a secure environment. Recommended practice:
- Ingest raw identifiers only inside a FISMA‑/FedRAMP‑equivalent secure enclave or institutional secure server.
- Create a deterministic pseudonym using a keyed HMAC (e.g., HMAC‑SHA256) with an institutional secret key. Truncate the output to 128 bits to create a stable, non‑reversible ID for linkage.
- Store the key in a hardware security module (HSM) or institutional key manager; auditors may verify key custody, but general analysts must never have access to the key itself.
- Record linkage rules and match quality metrics (precision, recall) in the data dictionary.
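The keyed-pseudonym step above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the field names, the normalization rules, and the example key are assumptions, and in practice the key would be fetched from the HSM or key manager named above, never hard-coded.

```python
import hashlib
import hmac

def normalize(name: str, dob: str, ssn: str) -> str:
    """Canonicalize identifiers so formatting noise (case, whitespace,
    SSN dashes) cannot break deterministic matching."""
    return "|".join([name.strip().upper(), dob.strip(), ssn.replace("-", "")])

def pseudonymize(record: str, key: bytes) -> str:
    """Keyed HMAC-SHA256 pseudonym, truncated to 128 bits (32 hex chars).
    Stable for linkage, non-reversible without the institutional key."""
    digest = hmac.new(key, record.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:32]

key = b"example-only-key"  # illustration only; use the HSM-managed key in production
record = normalize("Jane Doe", "1985-02-14", "123-45-6789")
pid = pseudonymize(record, key)
```

Because the pseudonym is deterministic for a given key, the same person matches across source tables inside the enclave; because the key never leaves the key manager, the pseudonym cannot be reversed outside it.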
Step 4 — Variable selection and minimization
Privacy and utility are in tension. Use data minimization—keep only variables necessary to answer the pre‑specified questions. For ABLE effects, prioritize:
- Stable demographics: sex, broad race/ethnicity categories, age bins (not exact DOB)
- Program indicators: ABLE account open date (binned or day-shifted), SSI/SSDI enrollment flags, Medicaid enrollment
- Financial outcomes: annual account balances (top-coded), number of deposits/withdrawals, employment earnings (binned), participation in public benefits
- Event markers: disability onset year (if reliable), ABLE enrollment cohort indicator
Step 5 — De-identification recipe for the public‑use file (PUF)
Implement layered controls. The suggested pipeline:
- Remove direct identifiers: name, SSN, street address, email.
- Generalize quasi-identifiers: convert DOB to 5‑year age bands; convert ZIP to 3‑digit ZIP or county; collapse race into broader groups only when cell sizes would be small.
- Date shifts: apply a consistent random offset per subject (drawn from a uniform window, e.g., ±180 days) to event dates to preserve relative timing while breaking exact dates. Record the window in metadata but not the seed.
- Top-coding and binning: cap balances at a reasonable threshold (e.g., $50k) and export top-coded bins; publish distributional summaries separately.
- Threshold suppression: apply cell suppression rules (e.g., suppress cells with fewer than 10 individuals) for tabulations.
- Perturbation or DP: where high utility is needed, add calibrated noise using differential privacy for counts and aggregates; choose epsilon conservatively (2026 best practice: report epsilon and provide sensitivity analysis). Use mature libraries (OpenDP, Google's DP libraries, or SmartNoise derivatives) that by 2026 include production features for administrative data.
- Synthetic data alternative: publish a fully synthetic PUF trained on the original but generated under DP guarantees. Keep an audited mapping of which variables are synthetic vs. real in the metadata.
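The generalization, date-shift, and top-coding rules above can be sketched as plain functions. This is a simplified illustration of the recipe, not a full pipeline; the specific dates are made up, and the band width, window, and cap simply echo the parameters suggested in the list.

```python
import random
from datetime import date, timedelta

AGE_BAND = 5          # 5-year age bands, per the generalization rule above
DATE_WINDOW = 180     # +/-180-day window for per-subject date shifting
TOP_CODE = 50_000     # balance cap, in dollars

def age_band(age: int) -> str:
    """Generalize an exact age into a 5-year band, e.g. 43 -> '40-44'."""
    lo = (age // AGE_BAND) * AGE_BAND
    return f"{lo}-{lo + AGE_BAND - 1}"

def top_code(balance: float) -> float:
    """Cap balances at the published top-coding threshold."""
    return min(balance, TOP_CODE)

# One random offset per subject: relative timing between that subject's
# events survives, exact calendar dates do not. The seed is never published.
rng = random.Random()
offset = timedelta(days=rng.randint(-DATE_WINDOW, DATE_WINDOW))
shifted_open = date(2026, 3, 15) + offset     # example ABLE account open date
shifted_deposit = date(2026, 6, 1) + offset   # example first deposit date
```

Note that because both events share one offset, the gap between account opening and first deposit is preserved exactly, which is what keeps event-study analyses usable on the PUF.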
Practical checklist for transformations
- All direct IDs removed: yes/no
- HMAC pseudonym used: yes/no
- Age banding applied: bands specified
- Geographic generalization: specified level
- Top‑coding thresholds: listed
- DP parameters: epsilon/delta published
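For intuition on the DP parameters in the checklist, here is a toy Laplace mechanism for counting queries. It is illustrative only: production releases should use a vetted library such as OpenDP rather than hand-rolled noise, and the epsilon value shown is an arbitrary example, not a recommendation.

```python
import math
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) noise via the inverse-CDF transform."""
    u = rng.random() - 0.5
    sign = 1.0 if u >= 0 else -1.0
    return -scale * sign * math.log(1.0 - 2.0 * abs(u))

def dp_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Epsilon-DP count: counting queries have sensitivity 1, so the
    noise scale is 1/epsilon. Smaller epsilon means more noise."""
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random()  # fresh randomness in production; never publish the seed
noisy = dp_count(true_count=412, epsilon=1.0, rng=rng)
```

Publishing epsilon (and delta, if applicable) alongside each table, as the checklist requires, is what lets reviewers reason about cumulative privacy loss across releases.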
Step 6 — Formal disclosure risk testing
Before release, run both empirical and formal risk assessments:
- Re‑identification simulation: attempt linkage to public data (voter files, commercial lists) using retained quasi‑identifiers to estimate match rates.
- Metrics: compute k‑anonymity, l‑diversity, and t‑closeness on key quasi‑identifier sets.
- DP risk budgeting: if you used DP for aggregates, document the cumulative privacy loss across outputs.
- Independent audit: in 2026 it’s increasingly expected that an external privacy audit will verify anonymization claims—budget for a short audit or an institutional privacy officer review.
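The k-anonymity and l-diversity metrics above reduce to counting equivalence classes over the quasi-identifier set. A minimal sketch, with made-up field names and a three-record toy table purely for illustration:

```python
from collections import defaultdict

def k_anonymity(records, quasi_ids):
    """Size of the smallest equivalence class over the quasi-identifiers.
    k = 1 means at least one record is unique on those fields."""
    classes = defaultdict(int)
    for r in records:
        classes[tuple(r[q] for q in quasi_ids)] += 1
    return min(classes.values())

def l_diversity(records, quasi_ids, sensitive):
    """Minimum number of distinct sensitive values within any class."""
    classes = defaultdict(set)
    for r in records:
        classes[tuple(r[q] for q in quasi_ids)].add(r[sensitive])
    return min(len(vals) for vals in classes.values())

sample = [
    {"age_band": "40-44", "zip3": "191", "benefit": "SSI"},
    {"age_band": "40-44", "zip3": "191", "benefit": "SSDI"},
    {"age_band": "45-49", "zip3": "191", "benefit": "SSI"},
]
```

Running these over candidate quasi-identifier sets before release shows exactly which field combinations drive k below your threshold and therefore need further generalization or suppression.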
Step 7 — Documentation, reproducibility, and metadata
High‑quality PUFs are only reusable if they’re well documented. Provide:
- Data dictionary with field definitions, coding schemes, and missing‑value rules.
- Provenance notes describing original sources, refresh cadence, and linkage quality metrics.
- De‑identification log describing each transformation (including DP parameters, bins, and top codes).
- Processing code (R/Python scripts) published with a DOI—release code that reproduces the transformations from sanitized intermediate tables (not raw identifiers), with CI/CD and review gates so the pipeline is reproducible end to end.
- Example analysis notebooks that show how to estimate intent‑to‑treat (ITT) or difference‑in‑differences effects using the PUF and where to seek restricted data for pre‑registration checks.
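As a flavor of what those example notebooks might contain, here is a bare-bones 2x2 difference-in-differences estimate on PUF-style rows. The column names and the toy numbers are invented for illustration; a real notebook would use a regression framework with covariates and clustered standard errors.

```python
def did_estimate(rows):
    """2x2 difference-in-differences on group means.
    rows: dicts with 'treated' (bool), 'post' (bool), 'y' (float outcome)."""
    def mean(treated, post):
        ys = [r["y"] for r in rows if r["treated"] == treated and r["post"] == post]
        return sum(ys) / len(ys)
    return (mean(True, True) - mean(True, False)) - (mean(False, True) - mean(False, False))

# Toy data: both groups share a +200 time trend; the treated (newly
# eligible) group gains an extra +500 in balances after expansion.
toy = (
    [{"treated": True, "post": False, "y": 1000.0}] * 50
    + [{"treated": True, "post": True, "y": 1700.0}] * 50
    + [{"treated": False, "post": False, "y": 900.0}] * 50
    + [{"treated": False, "post": True, "y": 1100.0}] * 50
)
```

Differencing out the shared trend recovers the +500 treatment effect, which is the logic a replication notebook can demonstrate on the PUF or its synthetic twin.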
Step 8 — Release model and governance
Adopt a responsible release model:
- Publish the PUF under a permissive license (CC0 or similar) if legal constraints allow.
- Host on a stable repository (ICPSR, Zenodo, institutional repository) and assign a DOI.
- Provide a restricted tier: approved researchers may access richer variables in a vetted secure environment under a DUA, with pre-registration and output checks before data export.
- Establish a vulnerability disclosure process: a contact point for reporting re‑identification concerns and a timeline for remediation.
Step 9 — Post-release monitoring and updates
Data privacy is not a one‑time action. Commit to:
- Periodic re‑assessment of disclosure risk (at least annually or when new external data emerges).
- Versioning: increment the dataset version and preserve prior versions with clear change logs.
- Community feedback: track citations and user requests to inform future enrichments (e.g., additional aggregated crosswalks).
Advanced strategies (2026 trends and practical options)
By 2026 several pragmatic approaches have become mainstream for administrative-public releases:
- DP‑trained synthetic data: synthetic microdata generated with DP guarantees that preserve multivariate relationships for many econometric analyses; useful as PUF when raw‑like records are desirable but privacy must be protected.
- Federated analytics: analyses run across custodians without centralized pooling. Especially helpful when state ABLE registries cannot share raw tables but can run standardized code and return DP aggregates.
- Hybrid release: publish a PUF plus an API for safe queries (aggregates only, DP‑protected) to support reproducible replications of results.
- Benchmarking datasets: create and publish a fully documented synthetic benchmark that mirrors sample sizes and distributions so methodologists can test alternative estimators for ABLE effects without accessing sensitive data.
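To make the DP-synthetic idea concrete, here is a deliberately oversimplified sketch that noises a single categorical marginal and samples synthetic records from it. Real DP synthesizers model joint, not just marginal, structure (and should come from vetted tooling); everything here, including the category names, is an assumption for illustration.

```python
import math
import random

def dp_marginal(values, categories, epsilon, rng):
    """Laplace-noised category counts (sensitivity 1), clipped at zero
    and normalized into sampling probabilities."""
    noisy = {}
    for c in categories:
        n = sum(1 for v in values if v == c)
        u = rng.random() - 0.5
        sign = 1.0 if u >= 0 else -1.0
        noise = -(1.0 / epsilon) * sign * math.log(1.0 - 2.0 * abs(u))
        noisy[c] = max(n + noise, 0.0)
    total = sum(noisy.values()) or 1.0
    return {c: x / total for c, x in noisy.items()}

def synthesize(probs, size, rng):
    """Draw synthetic records from the noisy marginal."""
    cats = list(probs)
    return rng.choices(cats, weights=[probs[c] for c in cats], k=size)
```

Because only the noised counts touch the output, the synthetic records carry the DP guarantee of the marginal they were drawn from; preserving multivariate relationships, as the bullet above notes, requires far richer models.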
Illustrative example: from admin tables to PUF for ABLE impact
Imagine a pilot combining two state ABLE registries with SSA records and state Medicaid files. Steps we used in a recent (2025–26) pilot:
- Link inside a secure enclave using HMAC pseudonyms.
- Generate analysis cohorts: previously eligible individuals (disability onset before age 26) vs. those newly eligible under the expansion (onset at ages 26–45), with enrollment cohorts flagged.
- Aggregate monthly balances into annual bins and top‑code at $50k; produce both continuous and binned versions for restricted access.
- Apply DP noise to counts and publish a DP synthetic microdata PUF for public analysis; restricted users got access to the original linked, de‑identified microdata in a secure enclave under DUA.
- Release code and a reproducible notebook that replicates published tables using only the PUF and synthetic data to demonstrate expected effect sizes under plausible scenarios.
Risk tradeoffs and decision heuristics
Some practical heuristics I recommend:
- If cell counts <20 for key cross‑tabs, opt to aggregate or mask.
- Prefer cohort indicators and temporal aggregates rather than exact event dates for PUFs.
- When in doubt, create both a PUF and a restricted enclave option—many stakeholders accept tighter researcher vetting if it enables high‑value analyses.
- Publish your anonymization decision log—it builds trust and helps reviewers evaluate reproducibility claims.
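The first heuristic above is mechanical enough to automate. A minimal sketch of threshold-based cell suppression for cross-tabs, with invented field names and a toy table; the threshold of 20 mirrors the heuristic, though your release rules may set a different cutoff:

```python
from collections import Counter

SUPPRESSION_THRESHOLD = 20  # per the heuristic above for key cross-tabs

def suppressed_crosstab(rows, keys):
    """Tabulate rows by the given keys, masking cells below the threshold."""
    counts = Counter(tuple(r[k] for k in keys) for r in rows)
    return {
        cell: (n if n >= SUPPRESSION_THRESHOLD else None)  # None = suppressed
        for cell, n in counts.items()
    }

demo = (
    [{"age_band": "40-44", "enrolled": "yes"}] * 25
    + [{"age_band": "45-49", "enrolled": "yes"}] * 3
)
table = suppressed_crosstab(demo, ["age_band", "enrolled"])
```

In practice you would also apply complementary suppression (masking additional cells so suppressed values cannot be recovered from row and column totals), which this sketch omits.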
Actionable takeaway — a short checklist to start today
- Draft your IRB sections: data inventory, risk assessment, and data flow diagram.
- Negotiate DUAs and an MOU with custodians; clarify tiers of access.
- Set up a secure enclave or identify an institutional partner that provides one.
- Implement HMAC pseudonymization inside the enclave for deterministic linking.
- Design the PUF transformations (age bins, geo generalization, top‑coding, DP parameters) and test them with production-ready DP libraries and tooling.
- Run re‑identification simulations and commission an external audit if feasible.
- Publish the PUF, code, and metadata with a DOI; provide a path to restricted access.
Conclusion & call to action
Research on financial inclusion and the ABLE expansion is time sensitive: the 2025–26 policy change expands the eligible population by roughly 14 million additional Americans and creates a rare natural experiment. Turning administrative records into a reusable, privacy‑preserving public‑use dataset requires careful planning, institutional buy‑in, and modern privacy engineering. Follow the step‑by‑step approach above to produce rigorous, IRB‑approved datasets that advance evidence while protecting participant privacy.
Next steps: prepare your IRB package now. If you want a ready‑to‑use template, download the PUF checklist, IRB language snippets, and example HMAC implementation (Python/R) from our reproducibility kit and join a community of researchers working on ABLE evaluations in 2026. Share your anonymization logs and help set community best practices—publish, then iterate.
Related Reading
- Why Banks Are Underestimating Identity Risk: A Technical Breakdown
- From Micro-App to Production: CI/CD and Governance for Reproducible Pipelines
- Observability in 2026: Subscription Health, ETL, and Real‑Time SLOs for Cloud Teams
- Building Resilient Architectures: Design Patterns to Survive Multi-Provider Failures