From Bracketology to Research Methods: Teaching Statistical Inference Using College Basketball Upsets
Teach inference and model validation through 2025–26 college basketball upsets (Vanderbilt, Seton Hall, Nebraska, George Mason).
Hook: Turn a classroom problem into a teachable moment
Students and instructors in sports analytics courses face familiar pain points: paywalled data, messy box-score exports, and opaque model pipelines that break when you hand them to a grader. What if you could use the most engaging classroom stimulus — real, high‑stakes upsets — to teach statistical inference, effect sizes, and model validation while demonstrating reproducible workflows and citation best practices? The 2025–26 surprise teams — Vanderbilt, Seton Hall, Nebraska and George Mason — provide compact, accessible case studies that make hypothesis testing intuitive and directly applicable to sports analytics courses in 2026.
Executive summary (most important first)
This article gives a ready-to-run teaching module that uses upset analysis in college basketball to teach:
- Hypothesis testing (parametric and nonparametric)
- Effect-size estimation and interpretation
- Model building and validation for upset prediction
- Reproducible analysis pipelines and citation/data-sharing workflows
It includes classroom dataset design, sample code patterns (Python/R), lab assignments, grading rubrics, and recommended tools (Zotero, GitHub, Binder, Zenodo). Designed for 2026 curricula where journals and departments increasingly require open code and data, this module is hands‑on, scalable, and grounded in the real world of college basketball analytics.
Context: Why 2026 is an ideal time to teach with sports upsets
By late 2025 and into 2026 the academic and sports-analytics ecosystems moved decisively toward reproducibility and open tools: more conference proceedings and journals expect data & code availability; community-curated college basketball datasets and model reproducibility templates are now plentiful; and classroom-friendly hosting options (Binder, CodeOcean, GitHub Actions) make running notebooks without local installs routine. Simultaneously, the 2025–26 season produced several early surprise teams (Vanderbilt, Seton Hall, Nebraska, George Mason) that are perfect pedagogical examples because they produce observable deviations from preseason expectations within a single season — a clean setting for hypothesis testing and model evaluation.
Why use upsets for teaching statistical inference?
- Engagement: Students already follow college basketball; upsets are emotionally salient.
- Concrete null hypotheses: e.g., “This team’s offensive efficiency equals the conference baseline.”
- Clear effect measures: margins, efficiency differences, odds ratios, Cohen’s d translate easily to sports terms.
- Accessible data: box scores, play-by-play, betting lines, and public team metrics provide multiple measurements per game for robust inference.
Module overview: learning objectives and schedule
Learning objectives (single module or 4-week unit):
- Formulate and pre-register sports analytics hypotheses about upsets.
- Choose and compute appropriate test statistics and effect sizes.
- Build predictive models for upset probability and validate them using time-aware CV.
- Package results reproducibly and publish data/code with accurate citations.
Suggested 4-week schedule:
- Week 1 — Data acquisition, cleaning, exploratory analysis (introduce classroom dataset).
- Week 2 — Hypothesis testing (t-tests, permutation, bootstrap CI).
- Week 3 — Effect-size estimation and interpretation.
- Week 4 — Predictive modeling & model validation (logistic regression, calibration, CV). Deliver reproducible notebook and short write-up.
Classroom dataset: design and practical pipeline
Build a compact classroom dataset that combines box scores, team metrics, and expected win probabilities. Use public sources (Sports-Reference, NCAA box scores, team stats pages) and community datasets created in 2025–26. Keep the CSV small enough for students to download quickly but rich enough to model upsets; a short loading sketch follows the schema below.
Recommended schema (CSV)
- date (YYYY-MM-DD)
- season
- team
- opponent
- home (1/0)
- points_for
- points_against
- possessions
- efg_pct
- turnover_rate
- off_reb_pct
- free_throw_rate
- preseason_predicted_margin (optional from a baseline model)
- predicted_win_prob (from an Elo/logistic baseline)
- is_upset (1 if team won and predicted_win_prob < 0.5)
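As a minimal loading sketch under the schema above (the file name cbb_upsets_2025_26.csv is a placeholder), students can derive a win indicator and the is_upset flag directly:
# Python: load the classroom CSV and derive the upset flag
import pandas as pd

data = pd.read_csv('cbb_upsets_2025_26.csv', parse_dates=['date'])   # placeholder file name
data['won'] = (data['points_for'] > data['points_against']).astype(int)
data['is_upset'] = ((data['won'] == 1) & (data['predicted_win_prob'] < 0.5)).astype(int)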
Reproducible pipeline
- Raw scraping scripts in /data-raw (Python/R).
- Cleaning and feature engineering stored as Jupyter/RMarkdown notebooks that output cleaned CSV in /data.
- Analysis notebooks in /notebooks referencing /data.
- Use Git + GitHub for version control. Add GitHub Actions to run lightweight tests (linting, unit tests; see the sketch after this list) and a CI job to rebuild the dataset on merge.
- Publish release to Zenodo to mint a DOI for the dataset and citation.
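To seed the "lightweight tests" step, here is a pytest-style sanity check a CI job could run on the cleaned CSV; the data/games.csv path is an assumption standing in for whatever the cleaning notebook writes to /data:
# Python: pytest-style sanity checks for the cleaned classroom CSV
import pandas as pd

def test_cleaned_csv_schema():
    df = pd.read_csv('data/games.csv')          # assumed output path of the cleaning notebook
    required = {'date', 'team', 'opponent', 'points_for', 'points_against',
                'predicted_win_prob', 'is_upset'}
    assert required.issubset(df.columns)
    assert df['predicted_win_prob'].between(0, 1).all()
    assert set(df['is_upset'].unique()) <= {0, 1}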
Hypothesis testing: classroom exercises using upsets
Use two complementary hypothesis-testing strategies: parametric (when assumptions roughly hold) and nonparametric (permutation/bootstrap) for robust inference.
Example 1 — Team vs conference baseline (parametric)
Hypothesis: "Vanderbilt's offensive efficiency (points per 100 possessions) is greater than the conference mean this season."
Set up:
- Null H0: mu_Vandy = mu_conf
- Alternative H1: mu_Vandy > mu_conf
Test: independent-samples t-test, or Welch's t-test if variances differ. Check normality informally with a Q-Q plot; if samples are small or the data are skewed, switch to a bootstrap CI or a permutation test.
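A minimal SciPy sketch of this test, assuming the classroom DataFrame from the loading sketch above and treating all non-Vanderbilt rows as a stand-in for the SEC peer group:
# Python: one-sided Welch's t-test for team vs. conference offensive efficiency
from scipy import stats

vandy = data[data['team'] == 'Vanderbilt']
peers = data[data['team'] != 'Vanderbilt']          # stand-in for SEC peer games
vandy_eff = 100 * vandy['points_for'] / vandy['possessions']
peer_eff = 100 * peers['points_for'] / peers['possessions']

# H1: Vanderbilt's mean offensive efficiency exceeds the peer mean
t_stat, p_value = stats.ttest_ind(vandy_eff, peer_eff, equal_var=False, alternative='greater')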
Example 2 — Upset frequency (permutation)
Hypothesis: "The observed upset frequency for Seton Hall is higher than expected from the baseline model (predicted_win_prob)."
Permutation test outline:
- Compute observed upset count (number of games where is_upset==1).
- Hold the predicted win probabilities fixed and permute the game outcomes (wins and losses) across games to break the outcome-prediction association, then recompute the upset count.
- Repeat 10,000 times to form null distribution; p-value = proportion of permuted counts ≥ observed.
# Python: permutation test for upset frequency against the baseline model
import numpy as np

rng = np.random.default_rng(0)
n_perms = 10000
won = (data['points_for'] > data['points_against']).to_numpy()
underdog = (data['predicted_win_prob'] < 0.5).to_numpy()
obs = (won & underdog).sum()                      # observed upset count (equals data['is_upset'].sum())

perm_counts = np.empty(n_perms, dtype=int)
for i in range(n_perms):
    perm_won = rng.permutation(won)               # shuffle outcomes, keep predictions fixed
    perm_counts[i] = (perm_won & underdog).sum()

p_value = np.mean(perm_counts >= obs)
Effect sizes: magnitude matters more than p-values
In classroom assessment and reporting, emphasize that statistical significance is not the same as practical importance. Teach students to report both p-values and effect sizes with confidence intervals.
Common sports effect-size metrics
- Cohen's d for mean differences (e.g., offensive efficiency difference).
- Odds ratio for binary outcomes (e.g., likelihood of upset).
- Standardized regression coefficients for model predictors.
- Brier score and calibration slope for probabilistic predictions (a short computation sketch follows this list).
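To make two of these metrics concrete, here is a minimal sketch computing an odds ratio (home vs. away upsets, an illustrative grouping) and the Brier score of the baseline probabilities; column names follow the classroom schema above:
# Python: odds ratio for home vs. away upsets and Brier score of the baseline model
from sklearn.metrics import brier_score_loss

a = ((data['is_upset'] == 1) & (data['home'] == 1)).sum()
b = ((data['is_upset'] == 0) & (data['home'] == 1)).sum()
c = ((data['is_upset'] == 1) & (data['home'] == 0)).sum()
d = ((data['is_upset'] == 0) & (data['home'] == 0)).sum()
odds_ratio = (a / b) / (c / d)                    # requires all four cells to be nonzero

won = (data['points_for'] > data['points_against']).astype(int)
brier = brier_score_loss(won, data['predicted_win_prob'])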
Computing Cohen's d (quick formula)
Cohen's d = (mean_group1 - mean_group2) / pooled_sd. In words: the difference expressed in standard-deviation units. Thresholds are context-dependent; in sports an effect size of 0.3 could be meaningful for a team-level season metric.
# R: Cohen's d for team vs. conference offensive efficiency
n1 <- nrow(team); n2 <- nrow(conf)
sd1 <- sd(team$off_eff); sd2 <- sd(conf$off_eff)
mean1 <- mean(team$off_eff)
mean2 <- mean(conf$off_eff)
pooled_sd <- sqrt(((n1 - 1) * sd1^2 + (n2 - 1) * sd2^2) / (n1 + n2 - 2))
d <- (mean1 - mean2) / pooled_sd
Modeling upsets: features, algorithms, and validation
Predictive modeling exercises teach both inferential and predictive skills. Typical target: is_upset (binary). Typical baseline algorithm: logistic regression; extensions: regularized logistic regression and gradient-boosted trees. A minimal fitting sketch follows the feature list below.
Features to include
- Pre-game predicted win probability (baseline model or betting line)
- Elo difference
- Rest days differential
- Home-court indicator
- Recent form (last 5 games efficiency difference)
- Key box-score rates (efg, turnover_rate, rebound rates)
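Putting the target and a few schema features together, here is a minimal scikit-learn fitting sketch; the feature list is illustrative (in practice, box-score rates should be pre-game rolling averages rather than same-game values to avoid leakage, and engineered features such as Elo difference or rest-day differential would be added once computed):
# Python: baseline logistic regression for upset probability
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative feature set from the schema; swap in pre-game versions to avoid leakage
features = ['predicted_win_prob', 'home', 'efg_pct', 'turnover_rate', 'off_reb_pct']
X = data[features]
y = data['is_upset']

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y)
upset_prob = model.predict_proba(X)[:, 1]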
Validation strategies (teach these explicitly)
- Time-aware cross-validation: do not shuffle across dates; use rolling-window or forward-chaining CV (see the sketch after this list).
- Calibration checks: Reliability curves, Brier score, calibration slope.
- Discrimination: ROC AUC, PR AUC for imbalanced upset targets.
- Backtesting: Evaluate model on holdout weeks or a different season (best practice in sports analytics).
- Explainability: feature importance and SHAP values to show what drove upset probability; expect AI-assisted explainability tools to become more common in automated grading pipelines.
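As one hedged implementation of the time-aware and calibration checks above, reusing features and model from the previous sketch (scikit-learn's TimeSeriesSplit approximates forward-chaining CV when games are sorted by date):
# Python: forward-chaining cross-validation with Brier score and AUC per fold
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import brier_score_loss, roc_auc_score

data_sorted = data.sort_values('date')
X = data_sorted[features].to_numpy()
y = data_sorted['is_upset'].to_numpy()

briers, aucs = [], []
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    model.fit(X[train_idx], y[train_idx])
    p = model.predict_proba(X[test_idx])[:, 1]
    briers.append(brier_score_loss(y[test_idx], p))
    aucs.append(roc_auc_score(y[test_idx], p))    # each fold needs both classes for AUC

print(f"Mean Brier: {np.mean(briers):.3f}  Mean AUC: {np.mean(aucs):.3f}")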
Example model-validation checklist for student deliverables
- Describe data splits and rationale (dates, seasons).
- Show calibration plot and Brier score for final model.
- Report AUC with a bootstrap CI (sketch after this list) and a confusion matrix at a chosen threshold.
- Provide an error analysis of false positives/negatives (systematic patterns?).
- Share reproducible code and environment specification (renv/poetry/Docker).
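For the AUC-with-CI item, a minimal bootstrap sketch, assuming y_test and p_test are NumPy arrays of holdout labels and predicted probabilities (placeholder names, not part of the schema):
# Python: bootstrap confidence interval for holdout AUC
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
boot_aucs = []
for _ in range(2000):
    idx = rng.integers(0, len(y_test), len(y_test))
    if np.unique(y_test[idx]).size < 2:
        continue                                   # AUC is undefined if a resample has one class
    boot_aucs.append(roc_auc_score(y_test[idx], p_test[idx]))
auc_ci = np.percentile(boot_aucs, [2.5, 97.5])     # 95% bootstrap CI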
Case studies: specific classroom hypotheses for the 2025–26 surprises
Below are focused, testable hypotheses and modeling tasks for each surprise team. These are intentionally framed as classroom exercises — you can adapt sample sizes and features based on the dataset you provide students.
Vanderbilt
- Hypothesis: Vanderbilt's defensive efficiency improved significantly compared with its 2024–25 baseline and vs SEC peers this season.
- Test: paired or two-sample test on team defensive efficiency per game; bootstrap CI for median difference.
- Model task: predict Vanderbilt upsets using opponent turnover rate and Vanderbilt defensive rebound percentage.
Seton Hall
- Hypothesis: Seton Hall’s upset frequency is higher than predicted by preseason Elo because of improvements in turnover margin.
- Test: permutation test on upset counts; regression to estimate effect size of turnover margin on upset odds.
- Model task: include interaction of turnover margin and rest days in a logistic model; interpret standardized coefficients.
Nebraska
- Hypothesis: Nebraska’s rebounding differential explains a meaningful portion of their unexpected wins.
- Test: mediation-style analysis or regression decomposition showing portion of variance in margin explained by rebounding.
- Model task: compare regularized logistic vs tree-based models and use SHAP to interpret predictors.
George Mason
- Hypothesis: George Mason’s three-point rate change accounts for a nontrivial effect size in their margin distribution.
- Test: compare pre/post three-point rate using bootstrap confidence intervals for the difference in means; compute Cohen's d (see the sketch after this list).
- Model task: predict upset probability incorporating lineup-level three-point tendencies if play-by-play is available.
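For the George Mason exercise, a hedged bootstrap sketch for the pre/post difference in mean three-point rate; the gmu DataFrame, the three_pt_rate column, and the season labels are assumptions, since three-point rate is not in the base schema:
# Python: bootstrap CI and Cohen's d for the pre/post change in three-point rate
import numpy as np

rng = np.random.default_rng(2)
# 'gmu', 'three_pt_rate', and the season labels are placeholders for a multi-season dataset
pre = gmu.loc[gmu['season'] == '2024-25', 'three_pt_rate'].to_numpy()
post = gmu.loc[gmu['season'] == '2025-26', 'three_pt_rate'].to_numpy()

diffs = [rng.choice(post, len(post)).mean() - rng.choice(pre, len(pre)).mean()
         for _ in range(10000)]
ci = np.percentile(diffs, [2.5, 97.5])            # 95% bootstrap CI for the difference in means

pooled_sd = np.sqrt(((len(pre) - 1) * pre.std(ddof=1) ** 2 + (len(post) - 1) * post.std(ddof=1) ** 2)
                    / (len(pre) + len(post) - 2))
cohens_d = (post.mean() - pre.mean()) / pooled_sd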
Tools, software, and citation workflows (tutorial-level guidance)
Teaching reproducible sports analytics requires a short set of tools your students can master quickly. Recommended stack in 2026:
- Data & analysis: Python (pandas, statsmodels, scikit-learn, shap) or R (tidyverse, infer, broom, glmnet, targets).
- Notebooks: JupyterLab or RStudio with Quarto for HTML/PDF outputs.
- Environment management: poetry/pipenv (Python) or renv/packrat (R); Docker for final reproducible image.
- CI and reproducibility: GitHub Actions to run tests and build artifacts; Binder/CodeOcean to allow graders to run notebooks online without local installs.
- Data publishing & citation: Zenodo or OSF to mint a DOI; include a CITATION.cff and machine-readable metadata in the repo.
- Citation manager: Zotero with Better BibTeX for export of .bib to include in reproducible reports.
Step-by-step: publish dataset and notebook with a DOI
- Push a release tag to GitHub for your module repository.
- Connect GitHub repo to Zenodo and create a release; Zenodo mints a DOI.
- Include a README with dataset schema, license (e.g., CC BY), and instructions for citation (include Zotero-compatible metadata).
- Provide a Binder badge and instructions for running notebooks online without install.
Assessment: grading rubrics and reproducibility checks
Use a rubric that weights (1) statistical reasoning and null-hypothesis clarity, (2) correctness of tests and effect-size reporting, (3) model validation and calibration, (4) reproducibility of code and environment, and (5) communication quality (figures, interpretation, limitations). Include an automated reproducibility check: can the notebook run from the repository in a clean Binder environment and produce the main figures?
Advanced strategies and future predictions (2026+)
Expect these trends to shape how you teach this module in coming years:
- More granular tracking data will become accessible to classroom researchers via university partnerships and curated releases — incorporate sequence- and event-level inference modules.
- Journals and departments will continue to mandate code and data availability; teaching reproducible packaging (Docker, renv) will be essential.
- Emphasis on causal inference: students will move from predictive correlational models to quasi‑experimental designs (instrumental variables, difference-in-differences) to study coaching/lineup changes.
- AI-assisted model explainability tools will be integrated into grading (automatic SHAP plots, calibration diagnostics generated by CI pipelines).
Actionable takeaways for instructors
- Build a small, well-documented classroom dataset centered on the 2025–26 surprises and publish it with a DOI.
- Use both parametric and nonparametric hypothesis tests; teach when each is appropriate.
- Require effect-size reporting (Cohen's d, odds ratios) alongside p-values.
- Make model validation an explicit deliverable: time-aware CV, calibration, and error analysis.
- Automate reproducibility checks with GitHub Actions and provide Binder/CodeOcean links for graders.
Pedagogical point: Upset analysis converts abstract inferential concepts into measurable, emotionally engaging problems — students learn both the math and the craft of reproducible research.
Quick-start resources checklist
- Starter repository with data-raw/, data/, notebooks/, and CITATION.cff.
- One-page lab handouts: Hypothesis testing lab, Effect-size lab, Model validation lab.
- Grading rubric and reproducibility checklist (Binder runs notebooks and produces key figures).
- Template Zotero library with key sports-analytics and reproducibility citations (exportable .bib for student reports).
Final notes and call to action
Using the 2025–26 surprise teams (Vanderbilt, Seton Hall, Nebraska, George Mason) turns season intrigue into rigorous training in statistical inference, effect-size thinking, and model validation — all wrapped in modern, reproducible workflows demanded by 2026 academia. If you're an instructor: assemble the classroom dataset, adapt the lab handouts above, and integrate the reproducibility checklist into your grading. If you're a student: fork the starter repo, run the Binder notebook, and try the permutation test and logistic-model exercise for your favorite upset.
Next step: Download or fork a starter repository (create a GitHub repo with the schema above), publish a release to Zenodo for a DOI, and run one full lab in class next week. Want a ready-made package of labs and a sample dataset tailored to your course length? Request the materials from your department library or subscribe to module updates and reproducible templates offered by researchers.site.