Building an Open Dataset from Automotive World’s Downloadable Forecast Tables
Turn Automotive World Excel forecasts into FAIR datasets for classroom and research — with ready-made R/Python scripts and publishing steps.
If you've ever downloaded an Excel forecast table from an industry report (like Automotive World) only to spend hours deciphering merged headers, hidden rows, and inconsistent units, you're not alone. Researchers, educators, and students in 2026 face three persistent pain points: meeting FAIR data requirements (Findable, Accessible, Interoperable, Reusable), time-consuming cleaning workflows, and uncertainty about how to publish datasets so they're actually reusable. This guide gives a practical, step-by-step pipeline and ready-to-run R and Python scripts that take you from downloaded .xlsx to a validated, licensed, DOI-minted dataset fit for classroom assignments and academic citation.
Why build a FAIR automotive dataset from forecast tables in 2026?
Three trends that matter right now:
- Funder and journal mandates for FAIR data (Findable, Accessible, Interoperable, Reusable) are routine in 2025–2026; datasets without clear metadata and licenses increasingly block publication.
- Publishers are offering more downloadable tables but still ship them as messy Excel workbooks — meaning automated pipelines are both necessary and high-impact.
- AI-assisted cleaning (LLM-assisted header mapping, pattern-based unit normalization) has matured — but human-curated metadata and provenance remain essential for trust.
Legal & ethical checklist before extracting
Before you extract tables from any paywalled or publisher-controlled resource, run this checklist:
- Check licensing: Does the Excel file include a license or terms of use? If not, assume reuse restrictions. Contact the publisher for permission if you plan to republish verbatim content.
- Prefer derived data: When in doubt, publish a derived dataset (e.g., aggregated, anonymized, or normalized values) and link back to the original source with attribution.
- Attribute clearly: Record the original file URL, access date, and any subscription requirements in your dataset metadata.
- Respect copyright: Avoid republishing proprietary text or brand-sensitive data without consent. Publishing transformed/aggregated numerical datasets with attribution is commonly acceptable but verify for each source.
Overview: Minimal pipeline to go from .xlsx to FAIR dataset
- Ingest: programmatically read all sheets/versions from the Excel workbook.
- Parse: detect header rows, unmerge values, standardize column names.
- Clean & Normalize: handle units, missing values, numeric parsing, and date formats.
- Structure: pivot to a tidy (long) format; add explicit variable columns (unit, measure).
- Validate: run schema checks (Frictionless/Data Package validation).
- Document: prepare DataCite metadata, README, and provenance notes.
- Publish: push to Zenodo/Figshare/OSF or GitHub + Zenodo, assign DOI, choose license.
Step 1 — Ingest: robust reading of Automotive World Excel workbooks
Automotive World often provides multiple downloadable tables (brand production forecasts, model plans, key statistics). Use code to read every sheet and avoid manual copy/paste. Below are two ready-to-run scripts, a notebook-friendly Python script and an R script, that implement this pipeline.
Python (pandas + frictionless) — essentials
This script reads all sheets, auto-detects a header row, flattens multi-index headers, tidies the table, validates it using frictionless, and writes CSV + datapackage.json.
#!/usr/bin/env python3
import re
from datetime import datetime, timezone

import pandas as pd
from frictionless import Package, Resource

XLSX_PATH = 'toyota_production_forecast_2026.xlsx'  # replace with your download
OUT_CSV = 'toyota_production_forecast_2026.csv'
DATAPACKAGE = 'datapackage.json'

# Helper: find the header row by scanning the preview rows for a year or known label
def detect_header_row(df_preview):
    for i in range(len(df_preview)):
        row = df_preview.iloc[i].astype(str).str.lower().str.cat(sep=' ')
        if re.search(r'20\d{2}|production|forecast|brand', row):
            return i
    return 0

# Read every sheet without assuming a header, then re-read with the detected header row
sheets = pd.read_excel(XLSX_PATH, sheet_name=None, header=None, engine='openpyxl')
frames = []
for sheet_name, raw in sheets.items():
    header_row = detect_header_row(raw.head(10))
    df = pd.read_excel(XLSX_PATH, sheet_name=sheet_name, header=header_row, engine='openpyxl')
    df['sheet_source'] = sheet_name
    df = df.dropna(axis=1, how='all')  # drop fully-empty columns
    frames.append(df)
combined = pd.concat(frames, ignore_index=True, sort=False)

# Clean column names: lowercase and collapse non-word runs into underscores
combined.columns = [re.sub(r'\W+', '_', str(c).strip()).lower().strip('_') for c in combined.columns]

# Example normalization: pivot year columns to long form if years are columns
year_cols = [c for c in combined.columns if re.fullmatch(r'20\d{2}', c)]
if year_cols:
    id_cols = [c for c in combined.columns if c not in year_cols]
    tidy = combined.melt(id_vars=id_cols, value_vars=year_cols, var_name='year', value_name='value')
    tidy['year'] = tidy['year'].astype(int)  # store years as integers
else:
    tidy = combined.copy()

# Parse numeric values: strip commas, spaces, and non-numeric suffixes
def parse_number(x):
    if pd.isna(x):
        return None
    try:
        return float(re.sub(r'[^0-9.\-]', '', str(x)))
    except ValueError:
        return None

if 'value' in tidy.columns:
    tidy['value'] = tidy['value'].apply(parse_number)

# Add provenance
tidy['extracted_from'] = XLSX_PATH
tidy['extraction_date'] = datetime.now(timezone.utc).isoformat()

# Write CSV
tidy.to_csv(OUT_CSV, index=False)

# Create a frictionless package, validate it, and write datapackage.json
package = Package(resources=[Resource(path=OUT_CSV)])
report = package.validate()
package.to_json(DATAPACKAGE)
print('Wrote', OUT_CSV, 'and', DATAPACKAGE, '| valid:', report.valid)
R (readxl + janitor + zen4R) — essentials
This R script follows the same flow and shows how to upload to Zenodo with zen4R. Replace placeholders with your Zenodo token to publish.
# R script: clean_xlsx_to_csv.R
library(readxl)
library(dplyr)
library(tidyr)
library(janitor)
library(zen4R)  # for Zenodo upload

xlsx_path <- 'toyota_production_forecast_2026.xlsx'
out_csv <- 'toyota_production_forecast_2026.csv'

sheets <- excel_sheets(xlsx_path)
all <- list()
for (s in sheets) {
  raw <- read_excel(xlsx_path, sheet = s, col_names = FALSE)
  # Detect header row: first of the leading rows that contains a year
  header_row <- which(apply(raw[1:6, ], 1, function(x) any(grepl('20\\d{2}', as.character(x)))))
  header_row <- if (length(header_row)) header_row[1] else 1
  df <- read_excel(xlsx_path, sheet = s, skip = header_row - 1)
  df <- df %>% remove_empty('cols') %>% clean_names()
  df$sheet_source <- s
  all[[s]] <- df
}
combined <- bind_rows(all)

# Pivot to long form if year columns are present
year_cols <- names(combined)[grepl('^20\\d{2}$', names(combined))]
if (length(year_cols)) {
  tidy <- combined %>% pivot_longer(cols = all_of(year_cols), names_to = 'year', values_to = 'value')
} else {
  tidy <- combined
}

# Numeric parsing plus provenance columns
tidy <- tidy %>% mutate(value = as.numeric(gsub('[^0-9.-]', '', as.character(value))),
                        extracted_from = xlsx_path,
                        extraction_date = Sys.time())
write.csv(tidy, out_csv, row.names = FALSE)

# Optional: publish to Zenodo with zen4R (a sketch; method names vary across
# zen4R versions, so check the package documentation before running)
# z <- ZenodoManager$new(token = Sys.getenv('ZENODO_TOKEN'))
# rec <- ZenodoRecord$new()
# rec$setTitle('Toyota production forecast dataset (derived)')
# rec$setUploadType('dataset')
# rec$setDescription('Derived from Automotive World Excel tables. See provenance notes.')
# rec$setLicense('cc-by-4.0')
# rec <- z$depositRecord(rec)
# z$uploadFile(out_csv, record = rec)
Step 2 — Cleaning pitfalls and robust transforms
Common problems and fixes (a short Python sketch follows the list):
- Merged headers: Detect multi-row headers and flatten by concatenating with a separator (e.g., brand_model -> brand__model). Keep the raw header rows in provenance files.
- Hidden rows or notes: Use read_excel(..., col_names=FALSE) and inspect the first 15 rows programmatically to find the header index. Save the skipped rows as a source_notes.txt.
- Units in header or cells: Create a dedicated unit column (e.g., 'units: 000s', 'units: vehicles') and normalize to standard SI-like units where possible.
- Thousands separators & non-breaking spaces: Strip non-digit characters before coercing to numeric.
- Inconsistent year formats: Normalize year columns to four-digit integers and ensure they're stored as integers or ISO dates.
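A minimal sketch of two of these fixes in pandas, assuming the workbook's first two rows form a merged header (row 0 spans brands, row 1 holds sub-labels); the file name and NA codes are illustrative:

import re
import pandas as pd

RAW = 'toyota_production_forecast_2026.xlsx'   # assumed input from the ingest step
NA_CODES = {'n/a', '-', '—', 'tbc', ''}        # non-standard missing-value strings

# Read with no header so the multi-row header is visible as data
raw = pd.read_excel(RAW, sheet_name=0, header=None, engine='openpyxl')

# Flatten a two-row merged header: forward-fill the top row across merged cells,
# then join the two rows with a double underscore (brand__model style)
top = raw.iloc[0].ffill().astype(str)
sub = raw.iloc[1].fillna('').astype(str)
cols = [re.sub(r'\W+', '_', f'{t}__{s}'.strip('_ ')).lower() for t, s in zip(top, sub)]

body = raw.iloc[2:].copy()
body.columns = cols

# Map non-standard missing codes to real NA, and document the mapping in the README
body = body.apply(lambda col: col.map(
    lambda v: pd.NA if str(v).strip().lower() in NA_CODES else v))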
Step 3 — Metadata, schema, and FAIR compliance
To be truly FAIR, do more than publish a CSV. Provide:
- DataCite metadata: title, authors (with ORCID), description, keywords (e.g., automotive dataset, production forecast), publication year, funder, related identifier (original Automotive World article), and license.
- Field-level schema: A simple JSON Table Schema or datapackage.json describing column names, types, units, and controlled vocabularies for categorical fields.
- Provenance file: original_file_url, access_date, extraction_method, script version (commit hash), and who ran the extraction; a minimal writer sketch follows this list.
- README: usage notes, known issues, and recommended citation (include DOI once minted).
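A minimal provenance writer in Python; the field names follow the list above, the git call assumes the script runs inside the extraction repo, and the URL and ORCID values are placeholders:

import json
import subprocess
from datetime import datetime, timezone

provenance = {
    'original_file_url': 'https://www.automotiveworld.com/...',  # keep the real download URL
    'access_date': datetime.now(timezone.utc).isoformat(),
    'extraction_method': 'scripts/clean_xlsx_to_csv.py',
    'script_version': subprocess.check_output(
        ['git', 'rev-parse', 'HEAD'], text=True).strip(),  # commit hash of the extraction code
    'operator': 'https://orcid.org/0000-0000-0000-0000',   # who ran the extraction
}

with open('metadata/provenance.json', 'w') as fh:
    json.dump(provenance, fh, indent=2)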
Example datapackage.json snippet
{
  "name": "toyota-production-forecast-2026",
  "title": "Toyota production forecast (derived) 2020-2030",
  "licenses": [
    {"name": "CC-BY-4.0", "path": "https://creativecommons.org/licenses/by/4.0/"}
  ],
  "resources": [{
    "path": "toyota_production_forecast_2026.csv",
    "schema": {
      "fields": [
        {"name": "brand", "type": "string"},
        {"name": "model", "type": "string"},
        {"name": "year", "type": "integer"},
        {"name": "value", "type": "number", "unit": "vehicles"}
      ]
    }
  }]
}
Step 4 — Validate and version
Run automated validation before publishing:
- Use the Frictionless Python library (or QA tools) to detect schema mismatches and missing values (a minimal example follows this list).
- Run unit tests in CI that re-run the extraction on a sanitized sample file to detect upstream format changes.
- Use data versioning (Git LFS, DVC, or Zenodo releases) so educators can reference stable snapshots with DOIs for course syllabi.
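A minimal pre-publish check with the frictionless library (v5-style API), suitable for running locally or as a CI step:

from frictionless import Package

report = Package('datapackage.json').validate()
if not report.valid:
    for task in report.tasks:
        for error in task.errors[:5]:   # show the first few problems per resource
            print(error.message)
    raise SystemExit(1)                 # fail the CI job on invalid data
print('datapackage.json and its resources are valid')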
Step 5 — Publish: which repository and how
Repository choices and tradeoffs:
- Zenodo: Best for academic citation — automatic DOI, GitHub integration for DOI upon release, supports communities and versioning.
- Figshare: Good UI and citation metadata; pay attention to license options.
- OSF (Open Science Framework): Great for projects combining code, preregistrations, and data; supports versioning and access controls.
- GitHub + GitHub Pages: Use for development and hosting code; pair with Zenodo to mint DOIs for releases.
Recommended license: CC-BY 4.0 for derived datasets you intend others to reuse, with a clear “derived from” statement linking to the original Automotive World page.
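If you script the upload rather than using zen4R, here is a minimal sketch against Zenodo's documented deposit REST API (token read from the ZENODO_TOKEN environment variable; point BASE at https://sandbox.zenodo.org/api while testing):

import os
import requests

BASE = 'https://zenodo.org/api'
params = {'access_token': os.environ['ZENODO_TOKEN']}

# 1. Create an empty deposition
r = requests.post(f'{BASE}/deposit/depositions', params=params, json={})
r.raise_for_status()
deposition = r.json()

# 2. Upload the CSV to the deposition's file bucket
bucket = deposition['links']['bucket']
with open('toyota_production_forecast_2026.csv', 'rb') as fh:
    requests.put(f'{bucket}/toyota_production_forecast_2026.csv',
                 data=fh, params=params).raise_for_status()

# 3. Attach metadata; publish from the web UI (or POST .../actions/publish) once reviewed
metadata = {'metadata': {
    'title': 'Toyota production forecast dataset (derived)',
    'upload_type': 'dataset',
    'description': 'Derived from Automotive World Excel tables. See provenance notes.',
    'license': 'cc-by-4.0',
}}
requests.put(f"{BASE}/deposit/depositions/{deposition['id']}",
             params=params, json=metadata).raise_for_status()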
Classroom & research reuse patterns
How instructors and students can use the dataset responsibly:
- Create a lab assignment: students reproduce the clean table from the raw Excel workbook using the provided scripts, submit a short methods note, and compare forecasts across brands.
- Use as a case study in data curation: students validate and extend the datapackage (add new fields, unit conversions, or merge with macroeconomic indicators).
- Enable reproducible analysis: accompany the dataset with a Jupyter Book or RMarkdown tutorial and a Binder/Colab snapshot for zero-install access.
Advanced strategies & 2026-forward practices
To future-proof your dataset and workflows:
- Automated extraction pipelines: Use GitHub Actions to run extraction & validation when a new workbook is added to the repo. On success, create a release and let Zenodo mint a DOI. Consider threat modeling your CI for supply-chain attack scenarios.
- AI-assisted mapping: Use an LLM (locally or via an enterprise service) to map ambiguous headers to canonical variable names, but always log and human-review the mappings; a lightweight review-log sketch follows this list.
- Semantic annotations: Add schema.org Dataset markup and link variables to controlled vocabularies (e.g., Eurostat/NACE codes for industry sectors) to boost discoverability.
- Data lineage: Adopt W3C PROV or the emerging RDA recommendations (2025–2026 updates) to store provenance in machine-readable form.
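A lightweight sketch of the review-log pattern for AI-assisted header mapping; suggest_mapping is a placeholder for whatever model or service you call, and the canonical vocabulary mirrors the datapackage fields above:

import csv
from datetime import datetime, timezone

CANONICAL = {'brand', 'model', 'year', 'value', 'unit'}  # controlled vocabulary

def suggest_mapping(header: str) -> str:
    """Placeholder for an LLM call; here, a trivial rule-based fallback."""
    return header.strip().lower().replace(' ', '_')

def map_headers(headers, log_path='metadata/header_mappings.csv'):
    rows = []
    for h in headers:
        suggestion = suggest_mapping(h)
        status = 'auto' if suggestion in CANONICAL else 'needs_review'
        rows.append({'raw_header': h, 'suggestion': suggestion, 'status': status,
                     'reviewed_by': '',  # filled in by the human reviewer
                     'timestamp': datetime.now(timezone.utc).isoformat()})
    with open(log_path, 'w', newline='') as fh:
        writer = csv.DictWriter(fh, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)
    return rows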
Troubleshooting: short recipes
- Mixed numeric/text in column: coerce after regex cleanup; flag rows that fail conversion for manual review.
- Duplicate rows across sheets: de-duplicate using a composite key (brand + model + year + sheet_source) and keep the most complete row, as in the sketch after this list.
- Non-standard missing codes: map strings like 'n/a', '-', '—', 'TBC' to explicit NA and document them.
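A minimal sketch of the conversion-flagging and de-duplication recipes, assuming the tidy CSV from the ingest script with brand and model columns present:

import pandas as pd

tidy = pd.read_csv('toyota_production_forecast_2026.csv')

# Flag rows whose raw value failed numeric conversion for manual review
tidy[tidy['value'].isna()].to_csv('metadata/failed_conversions.csv', index=False)

# De-duplicate on a composite key, keeping the row with the fewest missing fields
key = ['brand', 'model', 'year', 'sheet_source']
tidy['completeness'] = tidy.notna().sum(axis=1)
tidy = (tidy.sort_values('completeness', ascending=False)
            .drop_duplicates(subset=key, keep='first')
            .drop(columns='completeness'))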
Actionable takeaways
- Always capture provenance: original file, access date, extraction script version, and operator name (or ORCID); store these in a provenance.json or README.
- Validate with Frictionless/Data Package; treat validation failures as signals of upstream format changes.
- Publish a derived dataset with a clear license and attribution instead of redistributing paywalled files verbatim.
- Automate: CI for extraction + validation, and GitHub + Zenodo for versioned DOI releases.
Ready-to-use resources
What to include in your repository for others to reproduce:
- raw_xlsx/ (original downloads, if allowed)
- scripts/ (python and R extraction scripts with requirements.txt / renv.lock)
- data/ (clean CSV + datapackage.json)
- metadata/ (DataCite JSON, README, provenance.json)
- notebooks/ (Jupyter and RMarkdown for classroom exercises)
Final notes on ethics and attribution
When the source is a paywalled article like an Automotive World profile, be conservative: link to the original, include the original citation, and seek explicit permission if you plan to redistribute exact tables. Derived data with aggregation and clear attribution is the safe and reproducible path for classroom reuse and scholarly outputs.
Conclusion — Start a reusable pipeline today
Transforming downloadable Excel forecast tables into FAIR datasets is a high-leverage activity in 2026: it saves time across courses and research projects, unlocks reproducible analysis, and meets modern data-sharing expectations. Use the Python and R scripts above as templates, add robust metadata, validate with Frictionless, and publish to Zenodo or OSF with a CC-BY 4.0 license. If you follow the pipeline in this guide, you'll turn messy industry tables into trusted teaching datasets and citable research assets.
Take action now: Clone your extraction repo, run the scripts on one Automotive World Excel table (or a permitted sample), create a datapackage.json, and publish a versioned release to Zenodo. Then share the DOI with your class or collaborators so they can reproduce your results.
Want the full example repository with CI, notebook demos, and a ready-to-run GitHub Action that mints a Zenodo DOI on release? Email our data-curation team or search for the "automotive-forecast-fair-dataset" template on our site to get started.