Cross-Country Season Prediction Script Documentation

This document provides detailed explanations of each section in the 2026 cross-country season prediction R script, explaining the purpose, implementation, and reasoning behind each component.

Section: {r statistical-odds} - Statistical Odds Calculation

Purpose

This section trains dedicated logistic regression models for each categorical outcome (Win, TopThree, Top5, Top10, Top30) using the optimal feature sets identified in the previous feature selection phase. It generates probabilistic predictions for 2026 season performance and converts these into betting odds formats.

Implementation Details

1. Feature Extraction and Validation

# Extract optimal features for each outcome category
win_features_men <- best_features_odds_men[["Win"]]$features
topthree_features_men <- best_features_odds_men[["TopThree"]]$features
top5_features_men <- best_features_odds_men[["Top5"]]$features
top10_features_men <- best_features_odds_men[["Top10"]]$features
top30_features_men <- best_features_odds_men[["Top30"]]$features

Purpose: Extracts the optimal feature sets determined by the exhaustive search algorithm for each target category. Each outcome may have different optimal features based on its specific predictive requirements.

Validation: Ensures all categories have valid feature sets and reports feature counts for verification.

2. Model Formula Creation

win_formula_men <- as.formula(paste("Win ~", paste(win_features_men, collapse = " + ")))
topthree_formula_men <- as.formula(paste("TopThree ~", paste(topthree_features_men, collapse = " + ")))

Purpose: Creates logistic regression formulas dynamically based on the optimal feature sets. This approach allows for different features to be used for different outcomes, maximizing prediction accuracy for each category.

3. Model Training with Convergence Validation

win_model <- glm(win_formula, family = binomial, data = df_place)

# Validate model convergence
if (!win_model$converged) {
  warning("Win model did not converge")
}

# Check for model fitting issues
if (any(is.na(coef(win_model)))) {
  warning("Win model has NA coefficients - possible multicollinearity")
}

Purpose: Trains separate logistic regression models for each outcome category. Each model is optimized for its specific target, allowing for more nuanced predictions than a single multi-class model.

Validation: Checks for convergence issues and multicollinearity problems that could indicate model instability.

4. Model Predictions with Quality Control

win_probs <- predict(win_model, pred_data, type = "response")

# Validate predictions
if (any(is.na(win_probs))) {
  warning("NA predictions detected for Win")
}
if (any(win_probs < 0 | win_probs > 1, na.rm = TRUE)) {
  warning("Invalid probability values for Win (outside 0-1 range)")
}

Purpose: Generates probability predictions for each outcome category and validates the results are within expected ranges.

5. Probability Normalization

# Win probabilities should sum to 100% (1.0)
total_win_prob <- sum(win_probs, na.rm = TRUE)
if (total_win_prob > 0) {
  win_probs_normalized <- (win_probs / total_win_prob) * 1.0
} else {
  win_probs_normalized <- win_probs
}

# Top3 probabilities should sum to 300% (3.0) since 3 positions
total_top3_prob <- sum(top3_probs, na.rm = TRUE)
if (total_top3_prob > 0) {
  top3_probs_normalized <- (top3_probs / total_top3_prob) * 3.0
} else {
  top3_probs_normalized <- top3_probs
}

Purpose: Normalizes probabilities to ensure they sum to the expected totals for each category:

Win: 100% (only one winner per race)
Top3: 300% (three positions available)
Top5: 500% (five positions available)
Top10: 1000% (ten positions available)
Top30: 3000% (thirty positions available)

This normalization ensures the probabilities are properly calibrated for betting market expectations.

6. Results Integration and Odds Calculation

results <- data.frame(
  Skier = pred_data$Skier,
  Nation = pred_data$Nation,
  Win_Prob = win_probs_normalized,
  Top3_Prob = top3_probs_normalized,
  Top5_Prob = top5_probs_normalized,
  Top10_Prob = top10_probs_normalized,
  Top30_Prob = top30_probs_normalized,
  Outside_Prob = 1 - (top30_probs_normalized / 30.0)
)

# Calculate decimal and American odds
Win_Decimal_Odds = ifelse(Win_Prob > 0, 1 / Win_Prob, Inf)
Win_American_Odds = ifelse(Win_Prob >= 0.5,
                          -Win_Prob/(1-Win_Prob) * 100,
                          (1-Win_Prob)/Win_Prob * 100)

Purpose: Creates a comprehensive results dataframe with probabilities and converts them into both decimal and American odds formats for practical betting applications.

7. Debugging and Model Verification

# Debugging: Show prediction values for Gus Schumacher
gus_indices <- which(pred_data$Skier == "Gus Schumacher")
if (length(gus_indices) > 0) {
  cat("\n--- DEBUGGING: Gus Schumacher Win Model Values ---\n")
  gus_data <- pred_data[gus_indices[1], , drop = FALSE]
  
  # Show the features used in the Win model for Gus
  cat("Win model features for Gus Schumacher:\n")
  for (feature in win_features_men) {
    if (feature %in% names(gus_data)) {
      value <- gus_data[[feature]]
      cat(sprintf("  %s: %s\n", feature, 
                 ifelse(is.na(value), "NA", 
                       ifelse(is.numeric(value), round(value, 4), value))))
    } else {
      cat(sprintf("  %s: MISSING\n", feature))
    }
  }
  
  # Show Win prediction for Gus
  gus_win_prob <- win_probs[gus_indices[1]]
  cat(sprintf("Win probability for Gus Schumacher: %.4f (%.2f%%)\n", 
             gus_win_prob, gus_win_prob * 100))
}

Purpose: Provides detailed debugging output for Gus Schumacher to verify that the correct feature values are being used in the Win model. This helps validate model inputs and troubleshoot any prediction issues.

Debug Output Includes:

All Win model features and their actual values for Gus Schumacher
Identification of missing features
Final Win probability prediction in both decimal and percentage format
Proper handling of NA values and missing data

8. Data Quality Validation

The section includes extensive validation:

Target variable presence: Ensures all required categorical outcomes exist in training data
Feature availability: Verifies all optimal features are available in prediction data
Probability consistency: Checks that smaller categories have lower probabilities than larger ones
Model convergence: Validates that all models converged successfully
Prediction ranges: Ensures all probabilities are between 0 and 1

Statistical Justification

Model Architecture

Using separate models for each outcome category rather than a single multi-class model provides several advantages:

Feature Optimization: Each outcome can use its own optimal feature set
Model Specialization: Models can learn outcome-specific patterns
Interpretability: Easier to understand which factors drive each specific outcome
Calibration: Individual models can be better calibrated for their specific tasks

Normalization Approach

The probability normalization ensures market consistency by forcing the total implied probabilities to match the number of available positions. This prevents arbitrage opportunities and ensures the odds reflect realistic market conditions.

Outcome Categories

The five categories (Win, Top3, Top5, Top10, Top30) provide comprehensive coverage of racing outcomes:

Win: Championship-level predictions for race victories
Top3: Podium predictions (most common betting market)
Top5/Top10: Broader success metrics for consistent performers
Top30: Points-scoring predictions (World Cup points typically awarded to top 30)

Error Handling and Robustness

Convergence monitoring: Tracks model fitting issues
Missing data handling: Manages NA values in predictions
Infinite odds protection: Caps extremely high odds at 999.99
Probability validation: Ensures all outputs are mathematically valid
Feature validation: Confirms all required features are available

Section: {r load-data} - Data Loading & Validation

Purpose

This section is responsible for loading the core chronometer data files that contain race results for cross-country skiing and performing comprehensive validation to ensure data quality and integrity before any analysis begins.

Data Sources

The section loads two primary CSV files:

Men’s data: /Users/syverjohansen/ski/elo/python/ski/polars/excel365/men_chrono.csv
Ladies data: /Users/syverjohansen/ski/elo/python/ski/polars/excel365/ladies_chrono.csv

These files contain chronological race data generated by the ELO rating system pipeline, with each row representing a single race result for a skier.

Implementation Details

1. File Existence Validation

if (!file.exists(men_file)) stop("Men's data file not found: ", men_file)
if (!file.exists(ladies_file)) stop("Ladies data file not found: ", ladies_file)

Purpose: Ensures the data files exist before attempting to load them. This prevents cryptic errors later and provides clear feedback if the data pipeline hasn’t run or files have been moved.

2. Safe Data Loading with Error Handling

tryCatch({
  M_chrono <- read_csv(men_file, show_col_types = FALSE)
  cat("✓ Men's data loaded:", nrow(M_chrono), "rows\n")
}, error = function(e) {
  stop("Failed to load men's data: ", e$message)
})

Purpose: Uses tryCatch() to handle potential file corruption, permission issues, or formatting problems. The show_col_types = FALSE suppresses column type messages for cleaner output. Immediately reports the number of rows loaded for verification.

3. Required Column Validation

required_cols <- c("Skier", "Date", "Season", "Event", "City", "Distance", "Place", "Race", "ID")
missing_men <- setdiff(required_cols, names(M_chrono))

Purpose: Validates that all essential columns are present in the loaded data. These columns are critical for:

Skier: Athlete identification
Date: Race timing and chronological ordering
Season: Season grouping for analysis
Event: Race type classification (World Cup, Tour de Ski, etc.)
City: Race location for context
Distance: Race distance for categorization
Place: Finishing position for points calculation
Race: Race identifier within the broader dataset
ID: Unique skier identifier for linking with other datasets

4. Data Quality Validation

Place Column Validation

invalid_places_m <- sum(is.na(M_chrono$Place) | M_chrono$Place < 0 | !is.finite(M_chrono$Place))
if (invalid_places_m > nrow(M_chrono) * 0.1) {
  warning("More than 10% of men's Place values are invalid")
}

Purpose: Validates that finishing places are logical (positive integers). The 10% threshold allows for some data imperfections (DNS, DSQ, etc.) while flagging systematic data quality issues.

Skier Name Validation

missing_skiers_m <- sum(is.na(M_chrono$Skier) | M_chrono$Skier == "")
if (missing_skiers_m > 0) warning("Men's data has missing skier names")

Purpose: Ensures all race results can be attributed to specific athletes. Missing skier names would break the analysis pipeline.

Date Validation

date_errors_m <- sum(is.na(M_chrono$Date))

Purpose: Validates that race dates are present, which is essential for chronological analysis and season grouping.

5. Athlete Exclusion System

excluded_men <- c("Jonas Baumann", "Pål Golberg", "Mikael Gunnulfsen", "Renaud Jay", "Roman Schaad")
excluded_ladies <- c("Therese Johaug", "Victoria Carl")

Purpose: Removes specific athletes from analysis for various reasons:

Retirement: Athletes who have retired and won’t compete in 2026
Data Quality: Athletes with inconsistent or problematic data
Special Circumstances: Athletes with unique situations that would skew predictions

The exclusion is tracked and validated:

excluded_count_m <- sum(M_chrono$Skier %in% excluded_men)
actual_excluded_m <- M_chrono_original_rows - nrow(M_chrono)
if (actual_excluded_m != excluded_count_m) {
  warning("Mismatch in men's exclusion: expected ", excluded_count_m, ", actual ", actual_excluded_m)
}

This ensures the exclusion worked correctly and no unexpected data loss occurred.

6. World Cup Points System Setup

The section defines three different points systems used in cross-country skiing:

Standard World Cup Points (50 positions)

wc_points <- c(100,95,90,85,80,75,72,69,66,63,60,58,56,54,52,50,48,46,44,42,40,38,36,34,32,30,28,26,24,22,20,19,18,17,16,15,14,13,12,11,10,9,8,7,6,5,4,3,2,1)

Purpose: Standard FIS World Cup points awarded for top 50 finishes in regular World Cup races.

Stage Race Points (30 positions)

stage_points <- c(50,47,44,41,38,35,32,30,28,26,24,22,20,18,16,15,14,13,12,11,10,9,8,7,6,5,4,3,2,1)

Purpose: Reduced points for stage race components (like individual Tour de Ski stages).

Tour de Ski Points (49 positions)

tds_points <- c(300,285,270,255,240,216,207,198,189,180,174,168,162,156,150,144,138,132,126,120,114,108,102,96,90,84,78,72,66,60,57,54,51,48,45,42,39,36,33,30,27,24,21,18,15,12,9,6,3)

Purpose: Enhanced points for the prestigious Tour de Ski overall classification.

7. Points System Validation

if (length(wc_points) != 50) stop("World Cup points array should have 50 values")
if (!all(diff(wc_points) <= 0)) stop("World Cup points not in descending order")

Purpose: Validates that points arrays have the correct length and are in descending order (higher places get more points).

8. Safe Points Retrieval Function

get_points <- function(place, points_list) {
  if (is.na(place) || !is.finite(place)) {
    return(0)
  }
  if (place >= 1 && place <= length(points_list)) {
    return(points_list[place])
  }
  return(0)
}

Purpose: Creates a robust function for retrieving points that handles edge cases:

Invalid places (NA, infinite values): Returns 0 points
Out-of-range places (beyond points positions): Returns 0 points
Valid places: Returns correct points from the appropriate array

9. Function Testing

test_cases <- c(1, 10, 50, 51, -1, NA, Inf)
for (test_place in test_cases) {
  result <- get_points(test_place, wc_points)
  cat(sprintf("  Place %s -> %s points\n", 
              ifelse(is.na(test_place), "NA", as.character(test_place)), result))
}

Purpose: Tests the get_points() function with edge cases to ensure it handles all scenarios correctly before it’s used throughout the analysis.

10. Final Data Summary

cat(sprintf("Final dataset sizes: Men %d rows, Ladies %d rows\n", nrow(M_chrono), nrow(L_chrono)))
cat(sprintf("Men's unique skiers: %d\n", length(unique(M_chrono$Skier))))
cat(sprintf("Ladies unique skiers: %d\n", length(unique(L_chrono$Skier))))
cat(sprintf("Men's season range: %s - %s\n", min(M_chrono$Season, na.rm = TRUE), max(M_chrono$Season, na.rm = TRUE)))

Purpose: Provides a comprehensive summary of the loaded and cleaned data, including:

Total number of race results
Number of unique athletes in each dataset
Season range covered by the data

Key Design Principles

Fail Fast: Immediate validation prevents errors from propagating through the analysis
Comprehensive Logging: Detailed output helps with debugging and verification
Defensive Programming: Handles edge cases and unexpected data conditions
Data Integrity: Multiple validation layers ensure data quality
Transparency: Clear reporting of what data was excluded and why

Output Variables

M_chrono: Cleaned men’s chronometer data ready for processing
L_chrono: Cleaned ladies’ chronometer data ready for processing
wc_points, stage_points, tds_points: Validated points arrays for different race types
get_points(): Function for safe points retrieval throughout the analysis
excluded_men, excluded_ladies: Lists of excluded athletes for reference

This section establishes the foundation for all subsequent analysis by ensuring clean, validated, and properly formatted data.

Section: {r process-data} - Data Processing & Feature Engineering

Purpose

This section transforms the raw chronometer data into a modeling-ready format by adding World Cup points, filtering for relevant races, calculating cumulative statistics, and creating performance percentage metrics. The goal is to convert raw race results into meaningful features that can be used for predictive modeling.

Core Design Pattern

The section uses a single comprehensive function process_chrono_data() that can handle both men’s and ladies’ data consistently, ensuring identical processing logic and reducing code duplication.

Implementation Details

1. Function Structure and Input Validation

process_chrono_data <- function(chrono_df, data_name = "Unknown") {
  # Input validation
  if (nrow(chrono_df) == 0) {
    stop(sprintf("%s dataset is empty", data_name))
  }
  
  required_cols <- c("Event", "City", "Place", "Distance", "Date", "Race", "ID", "Season")
  missing_cols <- setdiff(required_cols, names(chrono_df))
  if (length(missing_cols) > 0) {
    stop(sprintf("Missing required columns in %s data: %s", data_name, paste(missing_cols, collapse = ", ")))
  }
}

Purpose: The function begins with robust input validation to catch data structure issues early. The data_name parameter allows for clear error messages distinguishing between men’s and ladies’ processing failures.

Design Rationale: Using a function ensures consistent processing between men’s and ladies’ data while making the code more maintainable and testable.

2. World Cup Points Assignment

df <- chrono_df %>%
  mutate(Points = case_when(
    Event == "Tour de Ski" & City == "Tour de Ski" ~ map_int(Place, ~ get_points(.x, tds_points)),
    Event == "Tour de Ski" ~ map_int(Place, ~ get_points(.x, stage_points)),
    TRUE ~ map_int(Place, ~ get_points(.x, wc_points))
  ))

Purpose: Assigns appropriate World Cup points based on race type using the points arrays established in the load-data section. The logic handles the Tour de Ski scoring system with two conditions:

Tour de Ski Overall (Event == "Tour de Ski" & City == "Tour de Ski"): Uses enhanced TdS points (300 for 1st place)
Tour de Ski Stages (Event == "Tour de Ski" without the City check): Uses stage points (50 for 1st place)
All Other Races: Uses standard WC points (100 for 1st place)

Technical Details: Uses map_int() to safely apply the get_points() function to each Place value, ensuring proper error handling for invalid places.

3. Points Assignment Validation

points_summary <- df %>%
  summarise(
    total_points = sum(Points, na.rm = TRUE),
    zero_points = sum(Points == 0, na.rm = TRUE),
    max_points = max(Points, na.rm = TRUE),
    races_with_points = sum(Points > 0, na.rm = TRUE)
  )

Purpose: Validates that points were assigned correctly by checking:

Total points: Ensures points were actually assigned
Zero points count: Identifies races where athletes finished outside points positions
Maximum points: Verifies the highest points value makes sense (should be 300 for TdS, 100 for WC, 50 for stages)
Scoring races: Counts how many race results earned points

This validation catches issues like incorrect points array usage or data formatting problems.

4. Race Type Filtering (Inclusion-Based)

relevant_events <- c("Offseason", "World Cup", "Nordic Opening", "Tour de Ski", "World Cup Final", "Ski Tour Canada")
df <- df %>%
  filter(Event %in% relevant_events)

Purpose: Uses an inclusion-based filter to keep only events that are relevant for World Cup season prediction:

World Cup: Standard World Cup races throughout the season
Tour de Ski: The prestigious multi-stage tour
World Cup Final: Season-ending finals
Nordic Opening: Season-opening races
Ski Tour Canada: Similar tour format to Tour de Ski
Offseason: Summer/fall preparation races that indicate form

Rationale: This inclusion approach is safer than exclusion because it explicitly defines what should be included for World Cup prediction, preventing accidentally including unknown event types that might be added to the data.

5. Team Event Removal

df <- df %>%
  filter(!Distance %in% c("Ts", "Rel"))

Purpose: Removes team-based competitions using distance codes:

“Ts”: Team Sprint events
“Rel”: Relay events

Rationale: Team events are excluded because:

Different skill requirements: Team events require different tactics and strategies than individual races
Team selection effects: Performance depends on team composition and relay strategy, not just individual ability
Individual vs. team performance: A skier’s individual World Cup potential may not correlate with team event results
Prediction target mismatch: We’re predicting individual World Cup season performance

6. Cumulative Statistics Calculation

df <- df %>%
  arrange(Skier, Date) %>%
  group_by(Skier, Season) %>%
  mutate(
    Cumulative_Points = cumsum(Points),
    Cumulative_Races = row_number(),
    Season_Race_Count = n()
  ) %>%
  ungroup()

Purpose: Creates running totals within each season for each skier:

Cumulative_Points: Running total of World Cup points earned through the season
Cumulative_Races: Number of races completed so far in the season
Season_Race_Count: Total number of races the skier participated in during that season

Technical Details:

arrange(Skier, Date): Ensures chronological order for accurate cumulative calculations
group_by(Skier, Season): Calculates cumulative stats separately for each skier’s season
row_number(): Provides race sequence number within the season
ungroup(): Removes grouping to prevent issues in subsequent operations

7. Maximum Possible Points Calculation

max_points_per_season <- df %>%
  group_by(Season) %>%
  summarise(
    Races_in_Season = length(unique(paste(Date, Event, City, Distance))),
    Max_Possible_Points = Races_in_Season * 100,
    .groups = 'drop'
  )

Purpose: Calculates the theoretical maximum points possible in each season by:

Counting unique races: Uses concatenated race identifiers to count distinct races
Assuming perfect performance: Multiplies by 100 (1st place in every regular WC race)
Season-specific calculation: Accounts for varying numbers of races across seasons

Rationale: This enables calculation of performance percentages that are comparable across seasons with different numbers of races.

8. Performance Percentage Calculation

df <- df %>%
  left_join(max_points_per_season, by = "Season") %>%
  mutate(
    Pct_of_Max_Points = ifelse(Max_Possible_Points > 0, 
                              Cumulative_Points / Max_Possible_Points, 
                              0)
  )

Purpose: Creates a normalized performance metric (Pct_of_Max_Points) that represents what percentage of the theoretical maximum points a skier achieved. This metric is crucial because:

Cross-season comparability: Different seasons have different numbers of races
Performance benchmarking: Provides a standard 0-100% scale for performance
Model input: Serves as a key predictor variable for future performance

Technical Details: Uses ifelse() to handle edge case where Max_Possible_Points might be zero.

9. Percentage Validation

pct_validation <- df %>%
  summarise(
    min_pct = min(Pct_of_Max_Points, na.rm = TRUE),
    max_pct = max(Pct_of_Max_Points, na.rm = TRUE),
    above_100_pct = sum(Pct_of_Max_Points > 1, na.rm = TRUE),
    negative_pct = sum(Pct_of_Max_Points < 0, na.rm = TRUE)
  )

if (pct_validation$above_100_pct > 0) {
  warning(sprintf("%d records have >100%% of max points - check calculation logic", 
                 pct_validation$above_100_pct))
}

Purpose: Validates the percentage calculations by checking for impossible values:

Above 100%: Would indicate calculation errors (can’t exceed theoretical maximum)
Negative percentages: Would indicate data or calculation problems
Range validation: Ensures percentages fall within expected 0-100% range

Error Handling: Issues warnings for impossible values that would indicate bugs in the calculation logic.

10. Final Data Quality Report

final_summary <- df %>%
  summarise(
    unique_skiers = length(unique(Skier)),
    unique_seasons = length(unique(Season)),
    total_races = n(),
    date_range = paste(min(Date, na.rm = TRUE), "to", max(Date, na.rm = TRUE))
  )

Purpose: Provides a comprehensive summary of the processed data for verification and documentation purposes.

Key Design Principles

Consistency: Single function ensures identical processing for men’s and ladies’ data
Validation at Every Step: Each transformation includes validation and summary statistics
Transparent Filtering: Clear logging of what data is removed and why
Robust Calculations: Handles edge cases like zero denominators and missing values
Auditable Process: Detailed logging enables verification of data transformations

Critical Features Created

Points: World Cup points assigned based on race type and finishing position
Cumulative_Points: Running total of points within each season
Cumulative_Races: Number of races completed in the season
Season_Race_Count: Total races participated in during the season
Pct_of_Max_Points: Normalized performance metric (0-100% of theoretical maximum)
Max_Possible_Points: Theoretical maximum points available in each season

Output Variables

M_processed: Men’s data with World Cup points, cumulative statistics, and performance percentages
L_processed: Ladies’ data with identical feature engineering applied

This section transforms raw race results into meaningful features that form the foundation for predictive modeling, with comprehensive validation ensuring data quality throughout the process.

Section: {r elo-prep} - ELO Data Preparation

Purpose

This section prepares ELO rating data for predictive modeling by creating lag features, handling missing values, and merging ELO ratings with the processed chronometer data. The goal is to incorporate historical performance ratings as predictive features while ensuring data quality and temporal consistency.

Core Design Philosophy

The section uses a systematic approach to handle missing ELO data by implementing quartile-based replacement strategies that preserve the distributional characteristics of the ratings within each season, preventing artificial bias in the predictive models.

Implementation Details

1. Missing Value Imputation Function

replace_na_with_quartile <- function(df, column_name, season_col = "Season") {
  df %>%
    group_by(!!sym(season_col)) %>%
    mutate(
      !!sym(column_name) := ifelse(
        is.na(!!sym(column_name)),
        quantile(!!sym(column_name), 0.25, na.rm = TRUE),
        !!sym(column_name)
      )
    ) %>%
    ungroup()
}

Purpose: Creates a robust function for handling missing ELO values by replacing them with the 25th percentile (first quartile) of the same season.

Design Rationale:

Conservative approach: Using Q1 instead of median assumes missing ELO ratings typically represent newer or less-established skiers
Season-specific: Accounts for rating inflation/deflation over time by calculating quartiles within each season
Preserves distribution: Maintains the overall shape of the ELO distribution rather than introducing artificial central tendency
Generic implementation: Uses !!sym() for dynamic column name handling, making the function reusable

Statistical Justification: The first quartile represents a “below-average but not terrible” rating, which is a reasonable assumption for skiers with missing ELO data who are likely newer to the circuit.

2. ELO Data Processing Function

prepare_elo_data <- function(processed_data, data_name = "Unknown") {
  cat(sprintf("\n--- Preparing ELO Data for %s ---\n", data_name))
  
  # Input validation
  if (nrow(processed_data) == 0) {
    stop(sprintf("%s processed data is empty", data_name))
  }
  
  original_rows <- nrow(processed_data)
  cat(sprintf("Input: %d rows\n", original_rows))
}

Purpose: Establishes a systematic function for ELO data preparation with comprehensive validation and logging.

Design Rationale: Following the same pattern as process_chrono_data(), this ensures consistent error handling and makes the processing pipeline maintainable.

3. Temporal Filtering for Modeling Relevance

# Filter for recent seasons (2016 onwards for modeling)
df <- processed_data %>%
  filter(Season >= 2016)

filtered_rows <- nrow(df)
cat(sprintf("After temporal filtering (Season >= 2016): %d rows (removed %d older rows)\n", 
            filtered_rows, original_rows - filtered_rows))

Purpose: Removes older data that may not be relevant for current performance prediction.

Rationale:

Equipment evolution: Ski technology and training methods have evolved significantly
Competition format changes: Rule changes and race formats have modified competitive dynamics
Data quality: More recent data tends to have better ELO rating coverage and accuracy
Model relevance: Training models on too-old data can reduce predictive accuracy for current conditions

4. Comprehensive Missing Value Analysis

# Analyze missing values before imputation
elo_columns <- c("Pelo", "Distance_Pelo", "Sprint_Pelo", "Prev_Pelo", "Prev_Distance", "Prev_Sprint")
missing_analysis <- df %>%
  summarise(across(all_of(elo_columns), ~ sum(is.na(.x)), .names = "missing_{.col}")) %>%
  pivot_longer(everything(), names_to = "column", values_to = "missing_count") %>%
  mutate(
    column = str_remove(column, "missing_"),
    missing_pct = round(missing_count / nrow(df) * 100, 2)
  )

cat("Missing value analysis:\n")
print(missing_analysis)

Purpose: Provides a comprehensive assessment of missing ELO data before imputation to understand the scope and patterns of missingness.

Analysis Components:

Column-wise counts: Shows which ELO ratings have the most missing values
Percentage calculation: Provides context for the severity of missingness
Pre-imputation baseline: Establishes what the data quality issues are before fixing them

This analysis helps validate that the imputation strategy is appropriate and identifies potential data quality issues.

5. Systematic Missing Value Replacement

# Apply quartile replacement to all ELO columns
elo_columns_to_fix <- c("Pelo", "Distance_Pelo", "Sprint_Pelo", "Prev_Pelo", "Prev_Distance", "Prev_Sprint")

for (col in elo_columns_to_fix) {
  if (col %in% names(df)) {
    before_imputation <- sum(is.na(df[[col]]))
    cat(sprintf("Fixing missing values in %s (%d missing)...\n", col, before_imputation))
    
    df <- replace_na_with_quartile(df, col)
    
    after_imputation <- sum(is.na(df[[col]]))
    cat(sprintf("  After imputation: %d missing (fixed %d values)\n", 
                after_imputation, before_imputation - after_imputation))
  } else {
    warning(sprintf("Column %s not found in %s data", col, data_name))
  }
}

Purpose: Systematically applies the quartile replacement strategy to all ELO-related columns with detailed logging of the imputation process.

Implementation Details:

Iterative processing: Handles each ELO column separately to track imputation success
Existence checking: Validates that columns exist before attempting imputation
Progress tracking: Reports before/after missing value counts for transparency
Error handling: Warns if expected columns are missing from the data

Column Types Handled:

Pelo: Current overall ELO rating
Distance_Pelo: Distance-specific ELO rating
Sprint_Pelo: Sprint-specific ELO rating
Prev_Pelo: Previous season’s overall ELO rating
Prev_Distance: Previous season’s distance ELO rating
Prev_Sprint: Previous season’s sprint ELO rating

6. Lag Feature Creation and Validation

# Create and validate lag features for temporal modeling
df <- df %>%
  arrange(Skier, Date) %>%
  group_by(Skier) %>%
  mutate(
    Lag_Pct_of_Max_Points = lag(Pct_of_Max_Points, 1),
    Lag_Cumulative_Points = lag(Cumulative_Points, 1),
    Lag_Season = lag(Season, 1)
  ) %>%
  ungroup()

# Validate lag feature creation
lag_analysis <- df %>%
  summarise(
    lag_pct_missing = sum(is.na(Lag_Pct_of_Max_Points)),
    lag_points_missing = sum(is.na(Lag_Cumulative_Points)),
    lag_season_missing = sum(is.na(Lag_Season))
  )

cat("Lag feature analysis:\n")
cat(sprintf("  Lag_Pct_of_Max_Points missing: %d\n", lag_analysis$lag_pct_missing))
cat(sprintf("  Lag_Cumulative_Points missing: %d\n", lag_analysis$lag_points_missing))
cat(sprintf("  Lag_Season missing: %d\n", lag_analysis$lag_season_missing))

Purpose: Creates lagged performance features that capture historical performance trends and validates the lag feature creation process.

Lag Features Created:

Lag_Pct_of_Max_Points: Previous race’s performance percentage
Lag_Cumulative_Points: Previous race’s cumulative season points
Lag_Season: Previous race’s season (for cross-season boundary detection)

Technical Implementation:

arrange(Skier, Date): Ensures chronological order for accurate lag calculation
group_by(Skier): Calculates lags separately for each skier to prevent cross-skier contamination
lag(..., 1): Creates one-period lag (previous race result)

7. Post-Imputation Validation

# Final validation of ELO data quality
final_missing_analysis <- df %>%
  summarise(across(all_of(elo_columns_to_fix), ~ sum(is.na(.x)), .names = "final_missing_{.col}"))

cat("\nFinal missing value analysis after imputation:\n")
print(final_missing_analysis)

# Check for any remaining missing values
total_remaining_missing <- sum(unlist(final_missing_analysis))
if (total_remaining_missing > 0) {
  warning(sprintf("%s: %d missing values remain after imputation", data_name, total_remaining_missing))
} else {
  cat("✓ All ELO missing values successfully imputed\n")
}

Purpose: Validates that the imputation process was successful and no missing values remain in critical ELO features.

Validation Checks:

Complete imputation: Ensures all missing ELO values were replaced
Data integrity: Confirms the imputation process didn’t introduce errors
Quality assurance: Provides final confirmation that data is ready for modeling

8. Final Data Summary and Quality Report

# Generate comprehensive data summary
final_summary <- df %>%
  summarise(
    unique_skiers = length(unique(Skier)),
    unique_seasons = length(unique(Season)),
    total_races = n(),
    date_range = paste(min(Date, na.rm = TRUE), "to", max(Date, na.rm = TRUE)),
    elo_range_pelo = paste(round(min(Pelo, na.rm = TRUE)), "to", round(max(Pelo, na.rm = TRUE))),
    avg_pct_max_points = round(mean(Pct_of_Max_Points, na.rm = TRUE), 3)
  )

cat(sprintf("\n=== %s ELO DATA PREPARATION COMPLETE ===\n", data_name))
cat(sprintf("Unique skiers: %d\n", final_summary$unique_skiers))
cat(sprintf("Seasons covered: %d\n", final_summary$unique_seasons))
cat(sprintf("Total race records: %d\n", final_summary$total_races))
cat(sprintf("Date range: %s\n", final_summary$date_range))
cat(sprintf("ELO rating range: %s\n", final_summary$elo_range_pelo))
cat(sprintf("Average performance: %s%% of max points\n", final_summary$avg_pct_max_points * 100))

Purpose: Provides comprehensive reporting on the final prepared dataset for verification and documentation.

Key Design Principles

Conservative Missing Value Handling: Uses Q1 replacement to avoid overestimating missing skiers’ abilities
Temporal Awareness: Recognizes that model relevance decreases with data age
Season-Specific Imputation: Accounts for rating inflation/changes over time
Comprehensive Validation: Validates every step of the data preparation process
Transparent Processing: Detailed logging enables verification and troubleshooting
Robust Error Handling: Handles edge cases like missing columns or empty datasets

Critical Features Created

Imputed ELO Ratings: All ELO columns with missing values replaced using season-specific quartiles
Lag Features: Previous race performance metrics for temporal modeling
Temporal Consistency: Data filtered to relevant time periods for model training
Quality Assurance: Comprehensive validation of all transformations

Output Variables

M_elo: Men’s data with imputed ELO ratings, lag features, and temporal filtering applied
L_elo: Ladies’ data with identical ELO preparation processing

Missing Value Strategy Justification

The choice of first quartile (Q1) replacement for missing ELO values is based on several assumptions:

New skiers: Missing ELO typically indicates newer athletes without established ratings
Conservative estimation: Q1 provides a below-average but not unrealistic starting point
Distribution preservation: Maintains the shape of the ELO distribution within each season
Model stability: Prevents artificial clustering around mean values that could skew predictions

This section ensures that ELO ratings are ready for use as predictive features while maintaining data integrity and providing comprehensive validation throughout the preparation process.

Section: {r comprehensive-feature-selection} - Multi-Method Feature Selection for Season Performance Prediction

Purpose

This section implements a rigorous multi-method approach to feature selection for predicting cross-country skiing season performance. The goal is to identify the most predictive variables for season success using four complementary statistical methods: correlation analysis, LASSO regularization, Boruta feature importance, and exhaustive AIC-based search. This comprehensive approach ensures robust feature selection that captures both linear and non-linear relationships while avoiding overfitting.

Core Design Philosophy

The section follows a systematic validation-first approach where every step includes comprehensive error checking and data quality assessment. Rather than assuming clean data, it validates inputs, checks for edge cases, and provides detailed logging throughout the feature selection process. The multi-method consensus approach reduces the risk of method-specific bias and increases confidence in the selected features.

Implementation Details

1. Input Validation and Dataset Preparation

# Input validation for ELO datasets
if (nrow(M_elo) == 0) {
  stop("Men's ELO dataset is empty")
}
if (nrow(L_elo) == 0) {
  stop("Ladies ELO dataset is empty")
}

cat(sprintf("Input datasets: Men %d rows, Ladies %d rows\n", nrow(M_elo), nrow(L_elo)))

Purpose: Validates that the ELO datasets exist and contain data before proceeding with feature selection.

Design Rationale: Failing fast with clear error messages prevents mysterious downstream failures and makes debugging easier.

2. Training Data Temporal Filtering

# Prepare training data - include more historical seasons to capture early breakthroughs
# Use data from 2016+ to include breakthrough seasons like Klæbo 2018
cat("Filtering training data (2016-2025, non-NA Pct_of_Max_Points)...\n")

# Check available seasons before filtering
men_seasons_available <- sort(unique(M_elo$Season))
ladies_seasons_available <- sort(unique(L_elo$Season))

cat(sprintf("Men's available seasons: %s\n", paste(range(men_seasons_available), collapse = " - ")))
cat(sprintf("Ladies available seasons: %s\n", paste(range(ladies_seasons_available), collapse = " - ")))

# Apply training filters with validation
train_men <- M_elo %>% 
  filter(Season <= 2025, Season >= 2016) %>% 
  filter(!is.na(Pct_of_Max_Points))

train_ladies <- L_elo %>% 
  filter(Season <= 2025, Season >= 2016) %>% 
  filter(!is.na(Pct_of_Max_Points))

Purpose: Creates training datasets by filtering for relevant seasons and excluding the prediction target year (2026).

Temporal Strategy:

2016+ cutoff: Includes modern era data while capturing breakthrough seasons (e.g., Klæbo’s 2018 breakthrough)
≤2025 filter: Includes 2025 season data since the “Prev_” features represent previous season performance that would predict 2026 outcomes
Target variable validation: Ensures all training observations have valid performance metrics

Rationale: Using 2016-2025 data balances having sufficient training data while ensuring relevance to modern cross-country skiing. Since features are lagged (“Prev_” variables), 2025 season observations contain the previous season data needed to predict 2026 performance.

3. Season Coverage Validation

# Check season coverage in training data
train_men_seasons <- sort(unique(train_men$Season))
train_ladies_seasons <- sort(unique(train_ladies$Season))

cat(sprintf("Men's training seasons: %s (%d seasons)\n", 
            paste(train_men_seasons, collapse = ", "), length(train_men_seasons)))
cat(sprintf("Ladies training seasons: %s (%d seasons)\n", 
            paste(train_ladies_seasons, collapse = ", "), length(train_ladies_seasons)))

if (length(train_men_seasons) < 3) {
  warning("Men's training data has fewer than 3 seasons - may affect model robustness")
}
if (length(train_ladies_seasons) < 3) {
  warning("Ladies training data has fewer than 3 seasons - may affect model robustness")
}

Purpose: Validates that sufficient seasons are available for robust model training and warns about potential issues.

Validation Logic: Requires at least 3 seasons of training data to ensure the model can capture temporal patterns and variability across different competitive seasons.

4. Feature Universe Definition and Availability Check

# Define and validate potential features
all_features <- c("Prev_Pelo", "Prev_Distance", "Prev_Distance_C", "Prev_Distance_F", 
                  "Prev_Sprint", "Prev_Sprint_C", "Prev_Sprint_F", "Prev_C", "Prev_F", 
                  "Prev_Pct_of_Max_Points", "Age")

# Check feature availability in training datasets
men_available_features <- intersect(all_features, names(train_men))
ladies_available_features <- intersect(all_features, names(train_ladies))

cat(sprintf("Men's available features: %d/%d\n", length(men_available_features), length(all_features)))
cat(sprintf("Ladies available features: %d/%d\n", length(ladies_available_features), length(all_features)))

Purpose: Defines the complete universe of potential predictive features and validates their availability in the training data.

Feature Categories:

Overall Performance: Prev_Pelo, Prev_Pct_of_Max_Points
Discipline-Specific: Prev_Distance, Prev_Sprint
Technique-Specific: Prev_C, Prev_F
Combined Factors: Prev_Distance_C, Prev_Distance_F, Prev_Sprint_C, Prev_Sprint_F
Demographic: Age

Validation Strategy: Rather than assuming features exist, the code explicitly checks availability and adapts to what’s actually present in the data.

5. Comprehensive Data Quality Assessment

# Check for missing values in features
men_feature_na_counts <- sapply(train_men[all_features_men], function(x) sum(is.na(x)))
ladies_feature_na_counts <- sapply(train_ladies[all_features_ladies], function(x) sum(is.na(x)))

if (any(men_feature_na_counts > 0)) {
  cat("Men's features with NAs:\n")
  print(men_feature_na_counts[men_feature_na_counts > 0])
  warning("Men's training data contains missing values in features")
}

# Check for infinite values
men_feature_inf_counts <- sapply(train_men[all_features_men], function(x) sum(!is.finite(x)))
ladies_feature_inf_counts <- sapply(train_ladies[all_features_ladies], function(x) sum(!is.finite(x)))

if (any(men_feature_inf_counts > 0)) {
  cat("Men's features with infinite values:\n")
  print(men_feature_inf_counts[men_feature_inf_counts > 0])
  warning("Men's training data contains infinite values")
}

Purpose: Performs comprehensive data quality checks to identify missing values, infinite values, and other data integrity issues.

Quality Checks:

Missing values: Counts NAs in each feature to identify incomplete data
Infinite values: Detects infinite or non-finite values that could break modeling
Target variable validation: Ensures the dependent variable has sufficient valid observations

Error Handling: Uses warnings rather than errors to allow the process to continue while flagging potential issues.

6. Method 1: Correlation Analysis with Validation

# 1. CORRELATION ANALYSIS with validation
tryCatch({
  if (length(all_features_men) < 2) {
    cat("Insufficient features for correlation analysis\n")
    cor_matrix_men <- NULL
    high_cor_men <- data.frame()
  } else {
    cor_matrix_men <- cor(train_men[all_features_men], use = "complete.obs")
    
    # Validate correlation matrix
    if (any(is.na(cor_matrix_men))) {
      warning("Correlation matrix contains NA values")
    }
    
    high_cor_men <- which(abs(cor_matrix_men) > 0.7 & upper.tri(cor_matrix_men), arr.ind = TRUE)
    if(nrow(high_cor_men) > 0) {
      cat("High correlations (|r| > 0.7):\n")
      for(i in 1:nrow(high_cor_men)) {
        row_name <- rownames(cor_matrix_men)[high_cor_men[i,1]]
        col_name <- colnames(cor_matrix_men)[high_cor_men[i,2]]
        cor_val <- cor_matrix_men[high_cor_men[i,1], high_cor_men[i,2]]
        cat(sprintf("  %s - %s: %.3f\n", row_name, col_name, cor_val))
      }
    } else {
      cat("✓ No high correlations found\n")
    }
  }
}, error = function(e) {
  cat("Error in correlation analysis:", e$message, "\n")
  cor_matrix_men <- NULL
  high_cor_men <- data.frame()
})

Purpose: Conducts correlation analysis to identify multicollinearity issues and understand linear relationships between features.

Technical Details:

Complete observations: Uses use = "complete.obs" to handle missing values appropriately
Multicollinearity detection: Identifies feature pairs with |r| > 0.7 that might cause modeling issues
Error handling: Uses tryCatch() to gracefully handle computational failures
Upper triangular: Avoids duplicate correlation reporting by using upper.tri()

Diagnostic Value: High correlations indicate potential redundancy and help inform feature selection decisions.

7. Method 2: LASSO Regularization with Cross-Validation

# 2. LASSO REGULARIZATION with validation
tryCatch({
  set.seed(42)
  
  # Prepare data for LASSO
  x_men <- as.matrix(train_men[all_features_men])
  y_men <- train_men$Pct_of_Max_Points
  
  # Validate data for LASSO
  if (any(!is.finite(x_men))) {
    warning("Non-finite values in feature matrix for LASSO")
  }
  if (any(!is.finite(y_men))) {
    warning("Non-finite values in target variable for LASSO")
  }
  
  cv_lasso_men <- cv.glmnet(x_men, y_men, alpha = 1, nfolds = 5)
  best_lambda_men <- cv_lasso_men$lambda.min
  lasso_coef_men <- coef(cv_lasso_men, s = best_lambda_men)
  
  lasso_selected_men <- rownames(lasso_coef_men)[lasso_coef_men[,1] != 0]
  lasso_selected_men <- lasso_selected_men[lasso_selected_men != "(Intercept)"]  # Remove intercept
  
  cat("LASSO selected features:\n")
  if (length(lasso_selected_men) > 0) {
    cat(paste("  ", lasso_selected_men, collapse = "\n"), "\n")
    cat("Coefficients:\n")
    significant_coefs <- lasso_coef_men[lasso_coef_men[,1] != 0, , drop = FALSE]
    print(round(significant_coefs, 4))
  } else {
    cat("  No features selected by LASSO\n")
  }
  
  cat(sprintf("Best lambda: %.6f\n", best_lambda_men))
  
}, error = function(e) {
  cat("Error in LASSO regularization:", e$message, "\n")
  lasso_selected_men <- character(0)
})

Purpose: Uses L1 regularization to perform automatic feature selection by shrinking coefficients of less important features to zero.

Technical Implementation:

Cross-validation: Uses 5-fold CV to select optimal regularization parameter (lambda)
Lambda selection: Uses lambda.min for optimal predictive performance
Coefficient extraction: Identifies features with non-zero coefficients as selected
Data validation: Checks for finite values before model fitting

Advantages: LASSO automatically handles multicollinearity and provides sparse solutions while maintaining predictive performance.

8. Method 3: Boruta Algorithm with Robust Error Handling

# 3. BORUTA ALGORITHM with validation
tryCatch({
  set.seed(42)
  
  # Prepare data for Boruta
  boruta_data_men <- train_men[c(all_features_men, "Pct_of_Max_Points")]
  
  # Check for complete cases
  complete_cases <- complete.cases(boruta_data_men)
  n_complete <- sum(complete_cases)
  
  if (n_complete < nrow(boruta_data_men)) {
    cat(sprintf("Warning: Using %d complete cases out of %d total\n", n_complete, nrow(boruta_data_men)))
    boruta_data_men <- boruta_data_men[complete_cases, ]
  }
  
  if (nrow(boruta_data_men) < 10) {
    cat("Insufficient complete cases for Boruta analysis\n")
  } else {
    boruta_men <- Boruta(Pct_of_Max_Points ~ ., data = boruta_data_men, 
                         doTrace = 0, maxRuns = 100)
    
    boruta_confirmed_men <- names(boruta_men$finalDecision[boruta_men$finalDecision == "Confirmed"])
    boruta_tentative_men <- names(boruta_men$finalDecision[boruta_men$finalDecision == "Tentative"])
    
    cat("Boruta confirmed features:\n")
    if (length(boruta_confirmed_men) > 0) {
      cat(paste("  ", boruta_confirmed_men, collapse = "\n"), "\n")
    } else {
      cat("  None\n")
    }
    
    cat("Boruta tentative features:\n")
    if (length(boruta_tentative_men) > 0) {
      cat(paste("  ", boruta_tentative_men, collapse = "\n"), "\n")
    } else {
      cat("  None\n")
    }
  }
  
}, error = function(e) {
  cat("Error in Boruta algorithm:", e$message, "\n")
  boruta_confirmed_men <- character(0)
  boruta_tentative_men <- character(0)
})

Purpose: Uses random forest-based feature importance with statistical significance testing to identify features that consistently outperform random variables.

Methodology:

Shadow features: Creates randomized copies of features as importance baseline
Statistical testing: Compares feature importance against shadow features using Wilcoxon tests
Iterative process: Runs up to 100 iterations to reach statistical decisions
Complete cases: Handles missing data by using only complete observations

Robust Implementation:

Sample size validation: Requires at least 10 complete cases for reliable results
Decision categories: Separates “Confirmed” (definitely important) from “Tentative” (unclear) features
Error recovery: Gracefully handles computational failures

9. Method 4: Exhaustive AIC-Based Search with Computational Limits

# 4. EXHAUSTIVE SEARCH WITH AIC with validation
tryCatch({
  
  max_features <- min(5, length(all_features_men))
  min_features <- max(2, min(length(all_features_men), 2))
  
  if (max_features < min_features) {
    cat("Insufficient features for exhaustive search\n")
  } else {
    cat(sprintf("Testing %d to %d feature combinations...\n", min_features, max_features))
    
    total_combinations <- 0
    successful_models <- 0
    
    for(k in min_features:max_features) {
      if (k > length(all_features_men)) break
      
      feature_combinations <- combn(all_features_men, k, simplify = FALSE)

Purpose: Performs exhaustive search across feature combinations to find the optimal subset based on AIC (Akaike Information Criterion).

Computational Strategy:

Limited scope: Restricts search to 2-5 features to maintain computational feasibility
AIC-based selection: Uses AIC to balance model fit against complexity
Exhaustive enumeration: Tests all possible combinations within size constraints
Progress tracking: Monitors successful model fits vs. total attempts

Design Rationale: While computationally intensive, exhaustive search guarantees finding the optimal subset within the specified constraints, providing a gold standard for comparison with other methods.

Key Design Principles

Validation-First Approach: Every operation includes comprehensive input validation and error checking
Graceful Degradation: Methods fail gracefully with informative messages rather than crashing the entire analysis
Comprehensive Logging: Detailed output enables troubleshooting and verification of results
Multi-Method Consensus: Combines multiple approaches to reduce method-specific bias
Computational Efficiency: Adapts complexity based on available data and computational constraints
Reproducibility: Uses fixed random seeds for consistent results across runs

Statistical Justification

Each method captures different aspects of feature importance:

Correlation: Linear relationships and multicollinearity assessment
LASSO: Automatic selection with regularization for high-dimensional data
Boruta: Non-linear relationships and statistical significance testing
Exhaustive AIC: Optimal subset selection within computational constraints

The combination provides a robust foundation for identifying the most predictive features while avoiding overfitting and method-specific artifacts.

Output Variables

lasso_selected_men/ladies: Features selected by LASSO regularization
boruta_confirmed_men/ladies: Features confirmed as important by Boruta algorithm
boruta_tentative_men/ladies: Features with tentative importance from Boruta
best_features_men/ladies: Optimal feature subset from exhaustive AIC search
cor_matrix_men/ladies: Correlation matrices for multicollinearity assessment

This section establishes a comprehensive feature selection foundation that informs subsequent modeling decisions through multiple validated approaches.

Section: {r gam-model} - GAM Model Construction for Season Performance Prediction

Purpose

This section builds Generalized Additive Models (GAM) for predicting cross-country skiing season performance using the consensus features identified in the previous section. GAMs are chosen for their ability to capture non-linear relationships while maintaining interpretability through smooth functions. The section implements robust model construction with comprehensive validation, error handling, and fallback strategies to ensure reliable model building even with imperfect data.

Core Design Philosophy

The section follows a defensive programming approach with multiple layers of validation and graceful degradation. Rather than assuming perfect inputs, it validates every step and provides meaningful fallback options when primary model construction fails. The GAM approach balances predictive power with interpretability, using smooth functions to capture complex non-linear relationships in skiing performance data.

Implementation Details

1. Input Validation and Prerequisites Check

# Validate inputs for GAM model building
if (!exists("final_features_men") || !exists("final_features_ladies")) {
  stop("Final features not defined - ensure feature selection completed successfully")
}

if (!exists("train_men") || !exists("train_ladies")) {
  stop("Training data not available - ensure data preparation completed successfully")
}

cat(sprintf("Input validation: Men %d features, Ladies %d features\n", 
            length(final_features_men), length(final_features_ladies)))

cat(sprintf("Training data: Men %d rows, Ladies %d rows\n", 
            nrow(train_men), nrow(train_ladies)))

Purpose: Validates that all necessary components from previous sections are available before attempting model construction.

Validation Checks:

Feature availability: Ensures feature selection process completed successfully
Data availability: Confirms training datasets exist and are populated
Dependency verification: Checks that all prerequisites are met before proceeding

Design Rationale: Failing fast with clear error messages prevents cryptic downstream failures and makes debugging easier when running the analysis pipeline.

2. Feature Validation and GAM Formula Construction

# Build GAM formula for Men using validated features
tryCatch({
  if(length(final_features_men) > 0) {
    # Validate features exist in training data
    missing_features_men <- setdiff(final_features_men, names(train_men))
    if (length(missing_features_men) > 0) {
      cat("Warning: Missing features in men's training data:", paste(missing_features_men, collapse = ", "), "\n")
      final_features_men <- intersect(final_features_men, names(train_men))
    }
    
    if (length(final_features_men) > 0) {
      smooth_terms_men <- paste("s(", final_features_men, ")", collapse = " + ")
      gam_formula_men <- as.formula(paste("Pct_of_Max_Points ~", smooth_terms_men))
      cat("Men's GAM Formula (Validated Features):\n")
      print(gam_formula_men)

Purpose: Constructs GAM formulas with smooth terms while validating that all specified features actually exist in the training data.

Technical Implementation:

Feature existence check: Validates features from selection process exist in training data
Dynamic formula construction: Builds GAM formula programmatically using selected features
Smooth term specification: Wraps each feature in s() for non-parametric smooth functions
Formula validation: Displays constructed formula for verification

Robust Design: Automatically removes missing features rather than failing, ensuring model construction can proceed with available features.

3. Sample Size Validation for GAM Stability

# Check for sufficient data points per feature
min_obs_per_feature <- 10
required_obs <- length(final_features_men) * min_obs_per_feature
if (nrow(train_men) < required_obs) {
  warning(sprintf("Limited observations for men's GAM (%d obs, %d features, recommend %d+ obs)", 
                 nrow(train_men), length(final_features_men), required_obs))
}

Purpose: Validates that sufficient training data exists for stable GAM estimation with the selected number of features.

Statistical Rationale:

Rule of thumb: Requires at least 10 observations per feature for stable smooth function estimation
Overfitting prevention: Warns when sample size may be insufficient for reliable parameter estimation
Model complexity awareness: Accounts for the fact that GAMs with smooth terms require more data than linear models

Design Principle: Provides warnings rather than stopping execution, allowing users to make informed decisions about model reliability.

4. GAM Model Construction with Error Handling

# Build GAM with error handling
men_gam_model <- gam(gam_formula_men, data = train_men)
cat(sprintf("✓ Men's GAM model built successfully with %d features\n", length(final_features_men)))

Purpose: Constructs the actual GAM model using the validated formula and training data.

Technical Details:

mgcv package: Uses the gam() function from mgcv for automatic smoothness selection
Default smoothing: Allows GAM to automatically select optimal smoothing parameters via GCV/REML
Non-parametric terms: Each feature gets its own smooth function to capture non-linear relationships

Model Specification: The GAM automatically determines:

Smoothing parameters for each term
Effective degrees of freedom for smooth functions
Optimal balance between fit and smoothness

5. Comprehensive Fallback Strategy

}, error = function(e) {
  cat("Error building men's GAM model:", e$message, "\n")
  cat("Attempting fallback to core features...\n")
  
  # Fallback to proven core features
  core_features_men <- intersect(c("Prev_Pct_of_Max_Points", "Prev_Distance", "Prev_Sprint"), names(train_men))
  if (length(core_features_men) >= 2) {
    tryCatch({
      core_smooth_terms_men <- paste("s(", core_features_men, ")", collapse = " + ")
      core_gam_formula_men <- as.formula(paste("Pct_of_Max_Points ~", core_smooth_terms_men))
      men_gam_model <- gam(core_gam_formula_men, data = train_men)
      cat(sprintf("✓ Men's fallback GAM model built with %d core features\n", length(core_features_men)))
    }, error = function(e2) {
      cat("Core features GAM also failed:", e2$message, "\n")
      cat("Attempting simple linear fallback...\n")
      
      # Ultimate fallback to linear model
      if (length(core_features_men) >= 1) {
        simple_formula_men <- as.formula(paste("Pct_of_Max_Points ~", paste(core_features_men, collapse = " + ")))
        men_gam_model <- lm(simple_formula_men, data = train_men)
        cat("✓ Men's linear fallback model built\n")
      } else {
        cat("✗ No suitable features available for men's model\n")
        men_gam_model <- NULL
      }
    })
  } else {
    cat("✗ Insufficient core features for men's fallback model\n")
    men_gam_model <- NULL
  }
})

Purpose: Provides multiple levels of fallback when primary GAM construction fails, ensuring some model is available for prediction.

Fallback Hierarchy:

Primary GAM: Full smooth model with consensus-selected features
Core features GAM: Reduced GAM with proven important features (Prev_Pct_of_Max_Points, Prev_Distance, Prev_Sprint)
Linear fallback: Simple linear model if GAM fitting fails entirely
Graceful failure: Returns NULL if no viable model can be constructed

Robust Design Principles:

Graceful degradation: Each failure triggers attempt at simpler model
Informative logging: Clear messages about which fallback is being used
Preserves functionality: Ensures analysis can continue even with model fitting issues

6. Model Summary and Validation Output

# Print model summary if successfully built
if (!is.null(men_gam_model)) {
  cat("\n--- Men's GAM Model Summary ---\n")
  
  # Basic model information
  if (inherits(men_gam_model, "gam")) {
    cat("Model type: GAM (Generalized Additive Model)\n")
    cat(sprintf("Number of smooth terms: %d\n", length(men_gam_model$smooth)))
    cat(sprintf("Deviance explained: %.1f%%\n", men_gam_model$dev.expl * 100))
    cat(sprintf("Adjusted R-squared: %.3f\n", summary(men_gam_model)$r.sq))
  } else if (inherits(men_gam_model, "lm")) {
    cat("Model type: Linear Model (GAM fallback)\n")
    cat(sprintf("Adjusted R-squared: %.3f\n", summary(men_gam_model)$adj.r.squared))
  }
  
  cat(sprintf("Residual degrees of freedom: %d\n", men_gam_model$df.residual))
  cat(sprintf("AIC: %.2f\n", AIC(men_gam_model)))
  
} else {
  cat("✗ Men's model construction failed completely\n")
}

Purpose: Provides comprehensive model diagnostics and performance metrics for the constructed GAM.

Diagnostic Information:

Model type identification: Clearly indicates whether GAM or linear fallback was used
Smooth term count: Shows complexity of non-linear relationships captured
Deviance explained: GAM-specific measure of variance explained
R-squared values: Standard measure of model fit quality
Information criteria: AIC for model comparison and selection

Adaptive Reporting: Adjusts output based on whether GAM or linear fallback was successfully constructed.

7. Parallel Construction for Ladies’ Model

The section implements identical logic for ladies’ data with the same validation, construction, and fallback strategies:

# Ladies GAM Model - identical structure to men's with appropriate variable names
ladies_gam_model <- NULL
# [Similar comprehensive validation and construction process]

Consistency Principle: Uses identical methodology for both men’s and ladies’ models to ensure comparable results and reduce code duplication.

Key Design Principles

Defensive Programming: Validates inputs at every step and handles errors gracefully
Graceful Degradation: Multiple fallback levels ensure some model is always available
Transparency: Comprehensive logging of all decisions and fallback triggers
Consistency: Identical methodology for men’s and ladies’ models
Adaptability: Adjusts complexity based on available data and successful model fitting
Interpretability: GAM approach balances predictive power with model interpretability

GAM Model Advantages

Non-linear Relationships: Captures complex performance curves without pre-specifying functional forms
Automatic Smoothing: Optimal smoothness selection prevents overfitting
Interpretability: Smooth functions can be plotted and interpreted easily
Flexibility: Handles different types of predictors (continuous ELO ratings, age, etc.)
Robust Fitting: mgcv package provides stable estimation algorithms

Output Variables

men_gam_model: Fitted GAM (or fallback linear model) for men’s season performance prediction
ladies_gam_model: Fitted GAM (or fallback linear model) for ladies’ season performance prediction
gam_formula_men/ladies: Model formulas used for construction
Model diagnostics: R-squared, deviance explained, AIC values for model assessment

Statistical Justification

GAMs are particularly well-suited for this application because:

Non-linear Performance Curves: Athletic performance often follows non-linear patterns with age, experience, and past performance
ELO Rating Relationships: The relationship between ELO ratings and season success may have diminishing returns or threshold effects
Automatic Complexity Control: GAM smoothness selection prevents overfitting while capturing important patterns
Interpretable Non-linearity: Smooth functions provide insight into how predictors influence performance

This section ensures robust model construction that can handle real-world data quality issues while maintaining statistical rigor and interpretability.

Section: {r odds-setup} - Categorical Outcome Preparation for Odds Modeling

Purpose

This section prepares the training data for odds-based modeling by creating categorical outcomes from season performance rankings. Rather than predicting continuous performance percentages, this prepares binary classifications for different achievement levels (Top 3, Top 5, Top 10, Top 30) that can be used to calculate probabilities and betting odds. The section transforms continuous performance metrics into season rankings and then into categorical success indicators.

Core Design Philosophy

The section follows a systematic validation approach where each transformation is verified through multiple checks. Rather than assuming data transformations work correctly, it validates ranking calculations, categorical variable creation, and outcome distributions at each step. This ensures that the categorical variables used for odds modeling accurately reflect the intended performance thresholds.

Implementation Details

1. Training Data Validation for Odds Calculations

# Validate training data availability for odds calculations
if (!exists("train_men") || !exists("train_ladies")) {
  stop("Training data not available - ensure previous sections completed successfully")
}

if (nrow(train_men) == 0) {
  stop("Men's training data is empty")
}
if (nrow(train_ladies) == 0) {
  stop("Ladies training data is empty") 
}

cat(sprintf("Training data for odds: Men %d rows, Ladies %d rows\n", nrow(train_men), nrow(train_ladies)))

Purpose: Validates that training datasets from previous sections are available and populated before proceeding with odds preparation.

Validation Checks:

Existence verification: Confirms training data objects exist
Non-empty validation: Ensures datasets contain observations
Size reporting: Documents dataset sizes for transparency

Design Rationale: Failing fast with clear error messages prevents downstream issues and makes debugging easier when running the analysis pipeline.

2. Required Column Validation

# Validate required columns exist
required_odds_cols <- c("Pct_of_Max_Points", "Season")
missing_men_cols <- setdiff(required_odds_cols, names(train_men))
missing_ladies_cols <- setdiff(required_odds_cols, names(train_ladies))

if (length(missing_men_cols) > 0) {
  stop(sprintf("Men's training data missing required columns for odds: %s", paste(missing_men_cols, collapse = ", ")))
}
if (length(missing_ladies_cols) > 0) {
  stop(sprintf("Ladies training data missing required columns for odds: %s", paste(missing_ladies_cols, collapse = ", ")))
}

Purpose: Ensures that essential columns needed for ranking and categorization are present in the training data.

Required Columns:

Pct_of_Max_Points: Season performance percentage used for ranking athletes
Season: Year identifier needed to rank within seasons rather than across all time

Error Handling: Uses descriptive error messages to identify exactly which columns are missing from which dataset.

3. Season-Based Ranking Calculation

# Add Place column based on rankings within each season with validation
tryCatch({
  df_place <- train_men %>%
    group_by(Season) %>%
    mutate(Place = rank(-Pct_of_Max_Points, ties.method = "min")) %>%
    ungroup()
  
  cat(sprintf("✓ Men's place rankings calculated: %d rows\n", nrow(df_place)))
}, error = function(e) {
  stop("Failed to calculate men's place rankings: ", e$message)
})

Purpose: Creates season-specific rankings based on performance percentage, establishing the foundation for categorical outcome variables.

Technical Implementation:

Grouped ranking: Uses group_by(Season) to rank within each season separately
Descending order: Ranks by -Pct_of_Max_Points so higher performance gets lower (better) rank numbers
Tie handling: Uses ties.method = "min" so tied performers get the best possible rank
Error wrapping: Uses tryCatch() for graceful error handling

Statistical Rationale: Season-specific ranking accounts for varying competition levels and conditions across different years, ensuring fair comparison within each competitive season.

4. Ranking Validation and Quality Checks

# Validate ranking ranges
men_place_range <- range(df_place$Place, na.rm = TRUE)
ladies_place_range <- range(df_place_ladies$Place, na.rm = TRUE)

cat(sprintf("Men's Place range: %d - %d\n", men_place_range[1], men_place_range[2]))
cat(sprintf("Ladies Place range: %d - %d\n", ladies_place_range[1], ladies_place_range[2]))

# Validate no negative or extreme values
if (any(train_ladies$Pct_of_Max_Points < 0, na.rm = TRUE)) {
  warning("Ladies data contains negative Pct_of_Max_Points values")
}
if (any(train_ladies$Pct_of_Max_Points > 2, na.rm = TRUE)) {
  warning("Ladies data contains very high Pct_of_Max_Points values (>200%)")
}

Purpose: Validates that ranking calculations produced sensible results and identifies potential data quality issues.

Validation Components:

Range checking: Ensures ranks start from 1 and are reasonable
Boundary validation: Flags negative or extremely high performance percentages
Distribution analysis: Examines the spread of rankings and performance values

Quality Assurance: Helps identify outliers or data errors that could affect odds calculations.

5. Season Distribution Analysis

# Check for balanced season representation
if (any(ladies_season_table < 5)) {
  seasons_low_n <- names(ladies_season_table[ladies_season_table < 5])
  warning(sprintf("Ladies seasons with <5 observations: %s", paste(seasons_low_n, collapse = ", ")))
}

# Sample rankings validation
cat("Sample ladies season rankings (first 15):\n")
sample_rankings <- df_place_ladies %>% 
  arrange(Season, Place) %>% 
  dplyr::select(Season, Skier, Pct_of_Max_Points, Place) %>%
  head(15)
print(sample_rankings)

Purpose: Ensures adequate sample sizes within each season for reliable odds calculation and provides transparency into ranking results.

Sample Size Validation:

Minimum threshold: Warns about seasons with <5 observations
Distribution checking: Shows how observations are spread across seasons
Sample display: Provides concrete examples of ranking calculations

Statistical Importance: Seasons with very few observations can lead to unreliable odds estimates for categorical outcomes.

6. Categorical Outcome Variable Creation

# Create categorical outcomes for different cutoffs with validation
tryCatch({
  df_place <- df_place %>%
    mutate(
      Win = factor(ifelse(Place <= 1, 1, 0)),        # Binary: 1=Win, 0=Not Win
      TopThree = factor(ifelse(Place <= 3, 1, 0)),  # Binary: 1=Top3, 0=Not Top3
      Top5 = factor(ifelse(Place <= 5, 1, 0)),
      Top10 = factor(ifelse(Place <= 10, 1, 0)),
      Top30 = factor(ifelse(Place <= 30, 1, 0))
    )
  
  cat("✓ Men's categorical outcomes created\n")
}, error = function(e) {
  stop("Failed to create men's categorical outcomes: ", e$message)
})

Purpose: Transforms continuous rankings into binary categorical variables representing different levels of season success.

Outcome Categories:

Win: Championship performance (season victory)
TopThree: Elite performance (podium finish equivalent)
Top5: Very strong performance (medal contention level)
Top10: Strong performance (consistently competitive)
Top30: Solid performance (scoring points level)

Technical Details:

Binary encoding: Uses 1 for success, 0 for non-achievement
Factor conversion: Ensures proper data type for modeling
Threshold-based: Clear cutoff points for each achievement level

7. Categorical Variable Validation

# Validate categorical outcome creation
cat("Ladies Place vs TopThree validation:\n")
topthree_crosstab <- table(df_place_ladies$Place, df_place_ladies$TopThree, useNA = "always")
print(topthree_crosstab[1:min(10, nrow(topthree_crosstab)), ])

# Validate factor levels
expected_levels <- c("0", "1")
targets <- c("Win", "TopThree", "Top5", "Top10", "Top30")

for (target in targets) {
  men_levels <- levels(df_place[[target]])
  ladies_levels <- levels(df_place_ladies[[target]])
  
  if (!all(expected_levels %in% men_levels)) {
    warning(sprintf("Men's %s missing expected levels: %s", target, paste(setdiff(expected_levels, men_levels), collapse = ", ")))
  }
}

Purpose: Validates that categorical variables were created correctly by checking the relationship between rankings and binary outcomes.

Validation Methods:

Cross-tabulation: Shows the relationship between Place rankings and TopThree classification
Factor level checking: Ensures all categorical variables have the expected “0” and “1” levels
Consistency verification: Confirms the cutoff logic worked correctly

Quality Assurance: Catches errors in threshold application or factor level creation that could affect odds modeling.

Key Design Principles

Season-Specific Analysis: Rankings calculated within seasons to account for varying competition levels
Multiple Achievement Levels: Creates outcomes for different performance thresholds to enable varied odds calculations
Comprehensive Validation: Every transformation is validated through multiple checking mechanisms
Transparent Processing: Detailed logging and sample displays enable verification of calculations
Robust Error Handling: Graceful handling of edge cases and data quality issues
Statistical Soundness: Uses appropriate ranking methods and threshold definitions

Categorical Outcome Rationale

The choice of Win, Top 3, 5, 10, and 30 cutoffs reflects different meaningful achievement levels in cross-country skiing:

Win: Season championship - ultimate performance level
Top 3: Podium finishes - elite performance level
Top 5: Medal contention - very strong competitive level
Top 10: Consistent competitiveness - strong international level
Top 30: Points scoring - solid World Cup level performance

These thresholds enable odds calculations for different betting markets and achievement predictions.

Output Variables

df_place: Men’s training data with Place rankings and categorical outcome variables
df_place_ladies: Ladies’ training data with Place rankings and categorical outcome variables
Categorical Variables: Win, TopThree, Top5, Top10, Top30 binary factors for each dataset
Validation Results: Cross-tabulations and summary statistics for verification

Statistical Justification

The season-based ranking approach is statistically sound because:

Temporal Consistency: Accounts for changes in competition level and field strength over time
Fair Comparison: Athletes compete within their era rather than against historical competitors
Sample Size: Provides adequate observations within each season for reliable odds estimation
Threshold Validity: Achievement levels reflect meaningful performance distinctions in the sport

This section establishes the foundation for probability and odds modeling by creating well-validated categorical outcomes that accurately represent different levels of season success.

Section: {r non-ml-feat} - Non-ML Feature Selection for Odds Models

Purpose

This section performs targeted feature selection specifically for odds modeling using traditional statistical methods rather than machine learning approaches. The goal is to identify the optimal feature sets for predicting binary outcomes (Win, Top 3, Top 5, Top 10, Top 30) using logistic regression models. This differs from the previous GAM-focused feature selection by emphasizing interpretability and statistical significance over complex non-linear relationships.

Core Design Philosophy

The section employs an exhaustive search approach with rigorous validation at every step. Rather than relying on automated feature selection algorithms, it systematically evaluates all possible feature combinations using AIC (Akaike Information Criterion) to find the optimal balance between model fit and complexity. This ensures that the selected features provide the best statistical foundation for odds calculations.

Implementation Details

1. Library Loading and Validation

# Load required libraries with validation
tryCatch({
  library(leaps)
  cat("✓ leaps library loaded\n")
}, error = function(e) {
  stop("Failed to load leaps library: ", e$message)
})

tryCatch({
  library(caret)
  cat("✓ caret library loaded\n")
}, error = function(e) {
  stop("Failed to load caret library: ", e$message)
})

Purpose: Ensures that essential statistical modeling libraries are available for feature selection and model evaluation.

Required Libraries:

leaps: Provides functions for subset selection in regression
caret: Classification and Regression Training package for model evaluation

Error Handling: Uses tryCatch() to provide clear error messages if required packages are not available.

2. Input Data Validation

# Validate input data availability
if (!exists("df_place") || !exists("df_place_ladies")) {
  stop("Training data with places not available - ensure odds-setup section completed successfully")
}

if (nrow(df_place) == 0) {
  stop("Men's training data with places is empty")
}
if (nrow(df_place_ladies) == 0) {
  stop("Ladies training data with places is empty")
}

cat(sprintf("Training data with outcomes: Men %d rows, Ladies %d rows\n", nrow(df_place), nrow(df_place_ladies)))

Purpose: Validates that the training data with categorical outcomes from the previous section is available and populated.

Validation Checks:

Existence verification: Confirms that processed training datasets exist
Non-empty validation: Ensures datasets contain observations for modeling
Size reporting: Documents dataset sizes for transparency

Dependency Management: Ensures the pipeline has run successfully through the odds-setup section.

3. Feature Universe Definition and Availability Check

# Define and validate features for odds models
features <- c("Prev_Pelo", "Prev_Distance", "Prev_Distance_C", "Prev_Distance_F", 
              "Prev_Sprint", "Prev_Sprint_C", "Prev_Sprint_F", "Prev_F", "Prev_C", "Prev_Pct_of_Max_Points")

# Check feature availability in training data
men_available_features <- intersect(features, names(df_place))
ladies_available_features <- intersect(features, names(df_place_ladies))

cat(sprintf("Men's available features: %d/%d\n", length(men_available_features), length(features)))
cat(sprintf("Ladies available features: %d/%d\n", length(ladies_available_features), length(features)))

Purpose: Defines the complete universe of potential predictive features and validates their availability in the training datasets.

Feature Categories:

Overall Performance: Prev_Pelo, Prev_Pct_of_Max_Points
Discipline-Specific: Prev_Distance, Prev_Sprint
Technique-Specific: Prev_C, Prev_F
Combined Factors: Prev_Distance_C, Prev_Distance_F, Prev_Sprint_C, Prev_Sprint_F

Adaptive Approach: Adjusts feature lists based on what’s actually available in the data rather than assuming all features exist.

4. Logistic Model Evaluation Function

# Function to evaluate binary logistic model with validation
evaluate_glm <- function(feature_set, data, target, gender_label = "Unknown") {
  tryCatch({
    # Validate inputs
    if (length(feature_set) == 0) {
      return(Inf)
    }
    
    # Check if features exist in data
    missing_features <- setdiff(feature_set, names(data))
    if (length(missing_features) > 0) {
      return(Inf)
    }
    
    # Check if target exists and has variation
    if (!target %in% names(data)) {
      return(Inf)
    }
    
    target_table <- table(data[[target]])
    if (length(target_table) < 2 || any(target_table < 5)) {
      return(Inf)  # Skip if not enough levels or insufficient observations
    }
    
    # Build and evaluate model
    formula_str <- as.formula(paste(target, "~", paste(feature_set, collapse = " + ")))
    model <- glm(formula_str, family = binomial, data = data)
    
    # Validate model convergence
    if (!model$converged) {
      return(Inf)
    }
    
    aic_value <- AIC(model)
    
    # Validate AIC value
    if (!is.finite(aic_value)) {
      return(Inf)
    }
    
    return(aic_value)
  }, error = function(e) {
    return(Inf)
  })
}

Purpose: Creates a robust function for evaluating logistic regression models with comprehensive validation and error handling.

Validation Components:

Input validation: Checks for valid feature sets and data structure
Feature existence: Verifies all features are present in the dataset
Target validation: Ensures the outcome variable exists and has sufficient variation
Sample size checking: Requires at least 5 observations per class for reliable modeling
Convergence validation: Ensures the logistic regression model converged successfully
AIC validation: Confirms the model selection criterion is finite and valid

Statistical Safeguards: Returns infinite AIC for invalid models, ensuring they won’t be selected in the optimization process.

5. Exhaustive Feature Search Implementation

# Exhaustive feature search function with validation
exhaustive_feature_search <- function(target, data_df, gender_label, available_features) {
  cat(sprintf("Searching %s features for %s...\n", gender_label, target))
  
  # Validate inputs
  if (!target %in% names(data_df)) {
    cat(sprintf("Target %s not found in %s data\n", target, gender_label))
    return(list(features = character(0), aic = Inf))
  }
  
  if (length(available_features) < 2) {
    cat(sprintf("Insufficient features for %s %s search\n", gender_label, target))
    return(list(features = character(0), aic = Inf))
  }
  
  best_aic <- Inf
  best_features <- NULL
  total_combinations <- 0
  successful_models <- 0
  
  # Search through feature combinations (2-5 features)
  max_features <- min(5, length(available_features))
  
  for(i in 2:max_features) {
    if (i > length(available_features)) break
    
    combinations <- combn(available_features, i, simplify = FALSE)
    total_combinations <- total_combinations + length(combinations)
    
    for(feature_set in combinations) {
      aic <- evaluate_glm(feature_set, data_df, target, gender_label)
      if(is.finite(aic)) {
        successful_models <- successful_models + 1
        if(aic < best_aic) {
          best_aic <- aic
          best_features <- feature_set
        }
      }
    }
  }
  
  cat(sprintf("  Tested %d combinations, %d successful models\n", total_combinations, successful_models))
  
  return(list(features = best_features, aic = best_aic))
}

Purpose: Implements comprehensive exhaustive search across all possible feature combinations to find the optimal set for each outcome.

Search Strategy:

Combinatorial approach: Tests all possible combinations of 2-5 features
AIC-based selection: Uses Akaike Information Criterion to balance fit and complexity
Computational limits: Restricts to maximum 5 features to maintain feasibility
Success tracking: Monitors how many models converged successfully

Methodological Advantages:

Guarantees optimality: Within the search space, finds the globally optimal feature set
Comprehensive evaluation: Tests every possible combination rather than using heuristics
Reproducible results: Deterministic search produces consistent results

6. Target Variable Distribution Validation

# Check target variable distributions
targets <- c("Win", "TopThree", "Top5", "Top10", "Top30")
for (target in targets) {
  if (target %in% names(df_place)) {
    men_table <- table(df_place[[target]])
    cat(sprintf("Men's %s distribution: %s\n", target, paste(names(men_table), men_table, sep="=", collapse=", ")))
  }
  
  if (target %in% names(df_place_ladies)) {
    ladies_table <- table(df_place_ladies[[target]])
    cat(sprintf("Ladies %s distribution: %s\n", target, paste(names(ladies_table), ladies_table, sep="=", collapse=", ")))
  }
}

# Validate sufficient data for modeling
min_obs_per_class <- 10
for (target in targets) {
  if (target %in% names(df_place)) {
    men_min_class <- min(table(df_place[[target]]))
    if (men_min_class < min_obs_per_class) {
      warning(sprintf("Men's %s has insufficient minority class observations (%d < %d)", 
                     target, men_min_class, min_obs_per_class))
    }
  }
}

Purpose: Validates that categorical outcome variables have sufficient observations in each class for reliable logistic regression modeling.

Distribution Analysis:

Class balance checking: Examines the distribution of binary outcomes
Sample size validation: Ensures at least 10 observations per class for statistical reliability
Warning system: Alerts about potential issues with minority class representation

Statistical Importance: Logistic regression requires adequate representation in both success and failure categories for stable parameter estimation.

7. Systematic Feature Search Execution

# Perform exhaustive feature search with validation
# Initialize result storage
best_features_odds_men <- list()
best_features_odds_ladies <- list()

# Men's feature search
for(target in targets) {
  if (target %in% names(df_place)) {
    result <- exhaustive_feature_search(target, df_place, "Men's", features_men)
    best_features_odds_men[[target]] <- result
  } else {
    cat(sprintf("Skipping men's %s - target not available\n", target))
    best_features_odds_men[[target]] <- list(features = character(0), aic = Inf)
  }
}

# Ladies feature search  
for(target in targets) {
  if (target %in% names(df_place_ladies)) {
    result <- exhaustive_feature_search(target, df_place_ladies, "Ladies", features_ladies)
    best_features_odds_ladies[[target]] <- result
  } else {
    cat(sprintf("Skipping ladies %s - target not available\n", target))
    best_features_odds_ladies[[target]] <- list(features = character(0), aic = Inf)
  }
}

Purpose: Systematically performs feature selection for each outcome variable and both gender categories.

Execution Strategy:

Parallel processing: Handles men’s and ladies’ data separately but with identical methodology
Comprehensive coverage: Searches features for all outcome categories (Win, Top 3, 5, 10, 30)
Graceful handling: Continues processing even if some targets are unavailable
Structured storage: Organizes results in nested lists for easy access

8. Search Results Validation and Reporting

# Validate search results
for(target in targets) {
  men_result <- best_features_odds_men[[target]]
  ladies_result <- best_features_odds_ladies[[target]]
  
  cat(sprintf("%s results:\n", target))
  
  if (length(men_result$features) > 0) {
    cat(sprintf("  Men: %s (AIC: %.2f)\n", paste(men_result$features, collapse = ", "), men_result$aic))
  } else {
    cat("  Men: No successful model found\n")
  }
  
  if (length(ladies_result$features) > 0) {
    cat(sprintf("  Ladies: %s (AIC: %.2f)\n", paste(ladies_result$features, collapse = ", "), ladies_result$aic))
  } else {
    cat("  Ladies: No successful model found\n")
  }
}

# Check for any successful models
successful_men_targets <- sum(sapply(best_features_odds_men, function(x) length(x$features) > 0))
successful_ladies_targets <- sum(sapply(best_features_odds_ladies, function(x) length(x$features) > 0))

cat(sprintf("Successful models: Men %d/%d targets, Ladies %d/%d targets\n", 
            successful_men_targets, length(targets), successful_ladies_targets, length(targets)))

Purpose: Provides comprehensive reporting of feature selection results and validates the success rate across all outcome categories.

Reporting Components:

Feature set documentation: Lists the optimal features selected for each outcome
AIC reporting: Shows the model selection criterion value for comparison
Success rate tracking: Monitors how many targets have viable models
Quality assurance: Identifies potential issues with feature selection process

Key Design Principles

Exhaustive Search: Guarantees finding the optimal feature combination within computational constraints
Statistical Rigor: Uses AIC for principled model selection balancing fit and complexity
Robust Validation: Comprehensive checking at every step to ensure reliable results
Computational Efficiency: Limits search space to maintain feasibility while ensuring thorough exploration
Transparent Process: Detailed logging enables verification and troubleshooting
Gender-Specific Modeling: Recognizes that optimal features may differ between men’s and ladies’ competitions

Methodological Advantages

Compared to Automated Feature Selection:

Reproducibility: Deterministic search produces consistent results
Interpretability: Clear statistical rationale for feature selection
Comprehensiveness: Evaluates all possible combinations rather than using heuristics
Validation: Built-in checks ensure model quality and reliability

Compared to Manual Feature Selection:

Objectivity: Removes human bias from feature selection process
Optimality: Finds the statistically best combination according to AIC
Efficiency: Systematically explores the full feature space
Documentation: Provides clear record of selection process and results

Output Variables

best_features_odds_men: Nested list containing optimal feature sets for each men’s outcome category
best_features_odds_ladies: Nested list containing optimal feature sets for each ladies’ outcome category
AIC values: Model selection criteria for comparing feature set quality
Success metrics: Counts of successful models for each gender and outcome

Statistical Justification

The exhaustive search approach is statistically sound because:

AIC Optimization: Uses established information criterion that balances goodness of fit with model complexity
Convergence Validation: Ensures all selected models have stable parameter estimates
Sample Size Requirements: Enforces minimum sample sizes for reliable logistic regression
Comprehensive Evaluation: Tests all feasible combinations to guarantee optimality

This section provides the statistical foundation for odds calculations by identifying the most predictive feature sets for each achievement level while maintaining model interpretability and statistical validity.

Section: {r ladies-odds} - Ladies-Specific Statistical Odds Calculation

Purpose

This section implements the complete statistical odds calculation pipeline specifically for ladies cross-country skiing, using the optimal feature sets identified in the previous feature selection phases. It trains dedicated logistic regression models for each categorical outcome (Win, TopThree, Top5, Top10, Top30) and generates probabilistic predictions for the 2026 ladies season, converting these into both decimal and American betting odds formats.

Implementation Details

1. Ladies Formula Validation

# Validate ladies-specific formulas exist
if (!exists("win_formula_ladies") || !inherits(win_formula_ladies, "formula")) {
  stop("win_formula_ladies not found or invalid")
}
if (!exists("topthree_formula_ladies") || !inherits(topthree_formula_ladies, "formula")) {
  stop("topthree_formula_ladies not found or invalid")
}

Purpose: Ensures all ladies-specific model formulas have been properly created from the optimal feature selection process. Each formula represents the best predictive model for its respective outcome category based on exhaustive search results.

Validation: Checks both existence and proper formula class inheritance to prevent runtime errors during model training.

2. Ladies Training Data Validation

# Check for required target variables in ladies data
required_targets <- c("Win", "TopThree", "Top5", "Top10", "Top30")
missing_targets <- setdiff(required_targets, names(df_place_ladies))
if (length(missing_targets) > 0) {
  stop("Missing target variables in df_place_ladies: ", paste(missing_targets, collapse = ", "))
}

# Validate ladies target variable distributions
for (target in required_targets) {
  target_dist <- table(df_place_ladies[[target]], useNA = "always")
  cat("Ladies", target, "distribution:\n")
  print(target_dist)
  
  # Check for extreme class imbalance
  if (any(target_dist < 5, na.rm = TRUE)) {
    warning("Very few observations for ladies ", target, " - model may be unstable")
  }
}

Purpose: Validates the ladies-specific training dataset (df_place_ladies) contains all required target variables and assesses class balance for each outcome category.

Class Balance Assessment: Warns when categories have fewer than 5 observations, which could lead to unstable logistic regression models.

3. Ladies Model Training with Convergence Monitoring

win_model_ladies <- glm(win_formula_ladies, family = binomial, data = df_place_ladies)

# Validate model convergence
if (!win_model_ladies$converged) {
  warning("Ladies Win model did not converge")
}

# Check for model fitting issues
if (any(is.na(coef(win_model_ladies)))) {
  warning("Ladies Win model has NA coefficients - possible multicollinearity")
}

Purpose: Trains separate logistic regression models for each ladies outcome category using the optimal feature sets. Each model is specifically calibrated for ladies skiing performance patterns.

Convergence Validation: Monitors model fitting quality and identifies potential issues such as:

Non-convergence: Indicates optimization difficulties
NA coefficients: Suggests multicollinearity or insufficient data

4. Ladies Predictions with Debugging

win_probs_ladies <- predict(win_model_ladies, pred_data_ladies, type = "response")

# Debugging: Show prediction values for Jessie Diggins
jessie_indices <- which(pred_data_ladies$Skier == "Jessie Diggins")
if (length(jessie_indices) > 0) {
  cat("\n--- DEBUGGING: Jessie Diggins Win Model Values ---\n")
  jessie_data <- pred_data_ladies[jessie_indices[1], , drop = FALSE]
  
  # Show the features used in the Win model for Jessie
  cat("Win model features for Jessie Diggins:\n")
  for (feature in win_features_ladies) {
    if (feature %in% names(jessie_data)) {
      value <- jessie_data[[feature]]
      cat(sprintf("  %s: %s\n", feature, 
                 ifelse(is.na(value), "NA", 
                       ifelse(is.numeric(value), round(value, 4), value))))
    } else {
      cat(sprintf("  %s: MISSING\n", feature))
    }
  }
  
  # Show Win prediction for Jessie
  jessie_win_prob <- win_probs_ladies[jessie_indices[1]]
  cat(sprintf("Win probability for Jessie Diggins: %.4f (%.2f%%)\n", 
             jessie_win_prob, jessie_win_prob * 100))
}

Purpose: Generates probability predictions for all ladies outcome categories and provides detailed debugging for Jessie Diggins to verify model inputs and outputs.

Debug Functionality:

Displays all Win model features and their values for Jessie Diggins
Shows her final Win probability prediction
Identifies any missing features in the prediction data
Handles NA values gracefully

5. Ladies Probability Normalization

# Win probabilities should sum to 100% (1.0)
total_win_prob_ladies <- sum(win_probs_ladies, na.rm = TRUE)
if (total_win_prob_ladies > 0) {
  win_probs_normalized_ladies <- (win_probs_ladies / total_win_prob_ladies) * 1.0
} else {
  win_probs_normalized_ladies <- win_probs_ladies
}

# Top3 probabilities should sum to 300% (3.0)
total_top3_prob_ladies <- sum(top3_probs_ladies, na.rm = TRUE)
if (total_top3_prob_ladies > 0) {
  top3_probs_normalized_ladies <- (top3_probs_ladies / total_top3_prob_ladies) * 3.0
} else {
  top3_probs_normalized_ladies <- top3_probs_ladies
}

Purpose: Normalizes ladies probabilities to ensure they sum to expected totals for each category, ensuring market consistency:

Win: 100% (one winner per race)
Top3: 300% (three podium positions)
Top5: 500% (five positions)
Top10: 1000% (ten positions)
Top30: 3000% (thirty point-scoring positions)

Statistical Rationale: Prevents arbitrage opportunities and ensures odds reflect realistic market expectations for ladies competitions.

6. Ladies Results Integration and Odds Conversion

results_ladies <- data.frame(
  Skier = pred_data_ladies$Skier,
  Nation = pred_data_ladies$Nation,
  Win_Prob = win_probs_normalized_ladies,
  Top3_Prob = top3_probs_normalized_ladies,
  Top5_Prob = top5_probs_normalized_ladies,
  Top10_Prob = top10_probs_normalized_ladies,
  Top30_Prob = top30_probs_normalized_ladies,
  Outside_Prob = 1 - (top30_probs_normalized_ladies / 30.0)
)

# Calculate decimal and American odds
Win_Decimal_Odds = ifelse(Win_Prob > 0, 1 / Win_Prob, Inf)
Win_American_Odds = ifelse(Win_Prob >= 0.5,
                          -Win_Prob/(1-Win_Prob) * 100,
                          (1-Win_Prob)/Win_Prob * 100)

Purpose: Creates comprehensive results dataframe with normalized probabilities and converts them into standard betting formats for ladies competitions.

Odds Formats:

Decimal Odds: European format (probability = 1/odds)
American Odds: US format (positive for underdogs, negative for favorites)

7. Ladies Excel Export with Market Segmentation

# Create separate workbooks for each outcome category
wb_win_ladies <- createWorkbook()
addWorksheet(wb_win_ladies, "Ladies_Win_Odds")
writeData(wb_win_ladies, "Ladies_Win_Odds", win_odds_ladies)

saveWorkbook(wb_win_ladies, "Ladies_Win_Odds_2026.xlsx", overwrite = TRUE)

Purpose: Exports ladies odds to separate Excel files for each betting market, facilitating easy distribution to different betting platforms or analysis tools.

Market Segmentation: Creates dedicated files for Win, Top3, Top5, Top10, and Top30 markets, allowing specialized analysis of each outcome category.

Statistical Justification

Gender-Specific Modeling

The ladies-specific implementation recognizes that men’s and women’s cross-country skiing have different:

Performance Distributions: Different competitive depth and spread
Feature Importance: Variables may have different predictive power
Class Balance: Different outcome frequency patterns
Seasonal Patterns: Varying competition schedules and intensities

Model Architecture Benefits

Using separate models for ladies competitions provides:

Optimized Predictions: Models trained specifically on ladies performance data
Feature Specificity: Optimal feature sets may differ between genders
Calibration Accuracy: Probabilities calibrated to ladies competition patterns
Market Precision: Odds that reflect actual ladies racing dynamics

Debugging and Validation

The Jessie Diggins debugging functionality serves multiple purposes:

Model Verification: Confirms correct feature values are being used
Quality Assurance: Validates predictions are reasonable for top athletes
Troubleshooting: Helps identify data pipeline issues
Transparency: Provides insight into model decision-making process

Error Handling and Robustness

Formula validation: Ensures all required models exist before training
Convergence monitoring: Tracks model fitting quality
Prediction validation: Checks probability ranges and distributions
Missing data handling: Graceful management of NA values
Export validation: Confirms successful file creation and data integrity

This section provides complete statistical odds calculation specifically calibrated for ladies cross-country skiing, ensuring accurate predictions and reliable betting odds for the 2026 season.

Section: {r breakout-identifier} - Breakthrough Athlete Identification System

Purpose

This section implements a comprehensive breakthrough analysis system designed to identify potential breakout cross-country skiing athletes for the 2026 season. It analyzes historical performance data to understand patterns of athlete breakthroughs, providing insights into which skiers might achieve unexpected success in the upcoming season.

Implementation Details

1. Data Validation Framework

# Validate training datasets
if (!exists("train_men") || !is.data.frame(train_men)) {
  stop("train_men dataset not found or invalid")
}
if (!exists("train_ladies") || !is.data.frame(train_ladies)) {
  stop("train_ladies dataset not found or invalid")
}

# Check for required columns
required_cols <- c("Skier", "Nation", "Season", "Pct_of_Max_Points", "Age")
missing_men <- setdiff(required_cols, names(train_men))
missing_ladies <- setdiff(required_cols, names(train_ladies))

if (length(missing_men) > 0) {
  stop("Missing required columns in train_men: ", paste(missing_men, collapse = ", "))
}
if (length(missing_ladies) > 0) {
  stop("Missing required columns in train_ladies: ", paste(missing_ladies, collapse = ", "))
}

Purpose: Ensures both men’s and women’s training datasets contain all necessary columns for breakthrough analysis. The validation prevents runtime errors and provides clear feedback about data structure issues.

Required Columns:

Skier: Athlete identification for tracking individual progressions
Nation: Country representation for geographic analysis
Season: Temporal tracking for breakthrough timing analysis
Pct_of_Max_Points: Performance metric as percentage of maximum possible points
Age: Athlete age for breakthrough age pattern analysis

2. Data Quality Validation

# Validate Pct_of_Max_Points ranges
invalid_pct_men <- sum(is.na(train_men$Pct_of_Max_Points) | 
                      train_men$Pct_of_Max_Points < 0 | 
                      train_men$Pct_of_Max_Points > 1 | 
                      !is.finite(train_men$Pct_of_Max_Points))

if (invalid_pct_men > nrow(train_men) * 0.1) {
  warning("More than 10% of men's Pct_of_Max_Points values are invalid")
}

Purpose: Validates the core performance metric used for breakthrough identification. The Pct_of_Max_Points should be a proportion between 0 and 1, representing what percentage of the maximum possible points an athlete achieved.

Quality Thresholds: Warns when more than 10% of values are invalid, which could indicate systematic data quality issues that might affect breakthrough identification accuracy.

3. Breakthrough Identification Logic

# Define breakthrough criteria and filter data
breakthrough_men <- train_men %>%
  filter(Pct_of_Max_Points > 0.5,  # Performance threshold for breakthrough
         !is.na(Skier),             # Must have athlete identification
         !is.na(Nation),            # Must have country data
         !is.na(Season),            # Must have temporal data
         !is.na(Age))               # Must have age data for analysis

# Count unique breakthrough athletes
unique_breakthrough_men <- length(unique(breakthrough_men$Skier))

Purpose: Identifies breakthrough performances using a 50% performance threshold. Athletes who achieve more than 50% of maximum possible points in a season are considered to have had breakthrough performances.

Breakthrough Criteria:

Performance Threshold: >50% of maximum points (Pct_of_Max_Points > 0.5)
Data Completeness: All required fields must be present
Uniqueness Tracking: Counts distinct athletes who achieved breakthrough status

Statistical Rationale: The 50% threshold represents a significant achievement level that separates consistently competitive athletes from those having exceptional seasons.

4. Historical Breakthrough Analysis

# Show recent breakthrough examples
recent_breakthrough_men <- breakthrough_men %>%
  arrange(desc(Season), desc(Pct_of_Max_Points)) %>%
  head(15)

cat("Recent men's breakthrough examples (Top 15):\n")
for (i in 1:min(15, nrow(recent_breakthrough_men))) {
  athlete <- recent_breakthrough_men[i, ]
  cat(sprintf("  %s (%s) - Season %d: %.1f%% of max points (Age %d)\n",
             athlete$Skier, athlete$Nation, athlete$Season,
             athlete$Pct_of_Max_Points * 100, athlete$Age))
}

Purpose: Provides concrete examples of historical breakthrough performances, helping to validate the identification criteria and understand breakthrough patterns.

Analysis Features:

Temporal Sorting: Shows most recent breakthroughs first
Performance Ranking: Within seasons, ranks by performance level
Comprehensive Details: Includes athlete, nation, season, performance percentage, and age
Sample Limitation: Shows top 15 to maintain readability while providing sufficient examples

5. Age Distribution Analysis

# Analyze age patterns
if (nrow(breakthrough_men) > 0) {
  age_range_men <- range(breakthrough_men$Age, na.rm = TRUE)
  mean_age_men <- mean(breakthrough_men$Age, na.rm = TRUE)
  
  cat("Men's breakthrough age analysis:\n")
  cat(sprintf("  Age range: %d - %d years\n", age_range_men[1], age_range_men[2]))
  cat(sprintf("  Mean age: %.1f years\n", mean_age_men))
} else {
  cat("No men's breakthrough data available for age analysis\n")
}

Purpose: Analyzes age patterns in breakthrough performances to understand typical breakthrough career stages and identify potential age-related breakthrough indicators.

Metrics Calculated:

Age Range: Minimum and maximum ages of breakthrough athletes
Mean Age: Average age when breakthroughs occur
Sample Size Protection: Handles cases with no breakthrough data gracefully

6. Gender-Specific Processing

The section implements identical logic for both men’s and women’s data, recognizing that breakthrough patterns may differ between genders:

# Separate processing for ladies
breakthrough_ladies <- train_ladies %>%
  filter(Pct_of_Max_Points > 0.5,
         !is.na(Skier), !is.na(Nation), !is.na(Season), !is.na(Age))

unique_breakthrough_ladies <- length(unique(breakthrough_ladies$Skier))

Purpose: Ensures gender-specific breakthrough analysis, acknowledging that men’s and women’s skiing may have different competitive dynamics and breakthrough patterns.

7. Comprehensive Error Handling

tryCatch({
  # Breakthrough identification logic
  cat("✓ Men's breakthrough analysis completed\n")
}, error = function(e) {
  stop("Error in men's breakthrough analysis: ", e$message)
})

Purpose: Provides robust error handling that isolates issues to specific components and provides clear debugging information.

Statistical and Analytical Justification

Breakthrough Definition

The 50% performance threshold is statistically meaningful because:

Competitive Significance: Represents sustained high-level performance rather than single-race success
Historical Validation: Captures athletes who made meaningful competitive improvements
Predictive Value: Identifies athletes with demonstrated capacity for exceptional performance
Cross-Season Consistency: Helps distinguish breakthrough seasons from career-best individual results

Age Analysis Importance

Understanding breakthrough age patterns provides:

Career Stage Insights: Identifies typical breakthrough timing in athlete development
Potential Identification: Helps focus on athletes in typical breakthrough age ranges
Development Patterns: Reveals sport-specific maturation timelines
Strategic Planning: Informs talent development and coaching strategies

Gender-Specific Analysis

Separate analysis by gender accounts for:

Biological Differences: Different physical development patterns
Competitive Structures: Varying competition depth and opportunity
Career Patterns: Different typical career lengths and peak periods
Sport Evolution: Gender-specific changes in training and competition

Error Handling and Robustness

Data Existence Validation: Confirms required datasets are available
Column Validation: Ensures all necessary variables are present
Data Quality Monitoring: Warns about potential data integrity issues
Missing Value Handling: Filters incomplete records that could skew analysis
Safe Execution: Uses tryCatch blocks to isolate and report errors clearly

Applications and Use Cases

This breakthrough identification system serves multiple purposes:

Talent Identification: Helps identify athletes with breakthrough potential
Predictive Modeling: Provides features for season performance predictions
Historical Analysis: Documents patterns of exceptional performance
Strategic Insights: Informs coaching and development decisions
Market Analysis: Supports betting and fantasy sports applications

The system provides a data-driven foundation for understanding athletic breakthroughs in cross-country skiing, combining statistical rigor with practical applications for the 2026 season predictions.

Section: {r feat-sel-break} - Feature Selection for Breakthrough Prediction

Purpose

This section implements a sophisticated feature selection system specifically designed for predicting breakthrough athletes in cross-country skiing. It uses machine learning techniques to identify the most predictive features for determining which athletes are likely to achieve breakthrough performance (≥50% of maximum points) in future seasons, providing crucial insights for talent identification and development.

Implementation Details

1. Core Function: `evaluate_breakthrough_predictors()`

The section centers around a comprehensive function that handles all aspects of breakthrough prediction feature selection:

evaluate_breakthrough_predictors <- function(data, target_name, age_col = "Age", 
                                           breakthrough_threshold = 0.5, 
                                           validation_split = 0.7, cv_folds = 5) {
  
  # Input validation
  if (!is.data.frame(data)) stop("Data must be a data.frame")
  if (nrow(data) == 0) stop("Data cannot be empty")
  
  # Define breakthrough based on performance threshold
  data$Will_Breakthrough = ifelse(is.na(data[[target_name]]), NA, 
                                 data[[target_name]] >= breakthrough_threshold)
  
  # Adaptive age filtering based on data availability
  breakthrough_count <- sum(data$Will_Breakthrough == TRUE, na.rm = TRUE)
  
  if (breakthrough_count < 10) {
    age_filter <- Inf  # Use all ages
    cat("Low breakthrough count - using all ages\n")
  } else if (breakthrough_count < 30) {
    age_filter <- 35   # Focus on athletes ≤35
    cat("Moderate breakthrough count - filtering to age ≤35\n")
  } else {
    age_filter <- 30   # Focus on younger athletes ≤30
    cat("Sufficient breakthrough count - filtering to age ≤30\n")
  }
}

Purpose: Provides a robust, adaptive framework for breakthrough prediction that automatically adjusts parameters based on data availability and distribution characteristics.

Key Features:

Adaptive Thresholding: Automatically adjusts breakthrough threshold from 50% to 40% if insufficient cases
Dynamic Age Filtering: Uses different age cutoffs based on breakthrough frequency
Robust Validation: Comprehensive input validation and error handling
Flexible Configuration: Parameterized for different sports or analysis requirements

2. Breakthrough Definition and Adaptive Logic

# Define breakthrough based on performance threshold
data$Will_Breakthrough = ifelse(is.na(data[[target_name]]), NA, 
                               data[[target_name]] >= breakthrough_threshold)

# Check class distribution and adjust if necessary
breakthrough_count <- sum(data$Will_Breakthrough == TRUE, na.rm = TRUE)
total_valid <- sum(!is.na(data$Will_Breakthrough))

if (breakthrough_count < 10 && breakthrough_threshold > 0.4) {
  cat("Insufficient breakthrough cases, lowering threshold to 40%\n")
  breakthrough_threshold <- 0.4
  data$Will_Breakthrough = ifelse(is.na(data[[target_name]]), NA, 
                                 data[[target_name]] >= breakthrough_threshold)
}

Purpose: Creates binary breakthrough targets while adapting to data constraints that are common in sports analytics where exceptional performances are rare.

Adaptive Logic:

Primary Threshold: 50% of maximum points (high achievement level)
Fallback Threshold: 40% of maximum points (if insufficient cases at 50%)
Minimum Sample Size: Requires at least 10 breakthrough cases for stable modeling

Statistical Rationale: Breakthrough prediction requires sufficient positive cases for machine learning algorithms to identify meaningful patterns. The adaptive approach ensures model stability while maintaining analytical rigor.

3. Age-Based Filtering Strategy

# Adaptive age filtering based on breakthrough frequency
if (breakthrough_count < 10) {
  age_filter <- Inf  # Use all ages
} else if (breakthrough_count < 30) {
  age_filter <- 35   # Focus on athletes ≤35
} else {
  age_filter <- 30   # Focus on younger athletes ≤30
}

filtered_data <- data[data[[age_col]] <= age_filter, ]

Purpose: Focuses analysis on athletes most likely to experience breakthroughs while maintaining sufficient sample sizes for reliable modeling.

Age Filter Logic:

All Ages: When breakthrough cases are very rare (<10 total)
Age ≤35: When breakthrough cases are limited (10-30 total)
Age ≤30: When sufficient breakthrough cases exist (>30 total)

Sports Science Rationale: Breakthrough performances typically occur earlier in athletic careers. Focusing on younger athletes improves prediction accuracy by reducing noise from established veterans unlikely to experience dramatic performance improvements.

4. Dual Model Feature Selection Approach

# Logistic Regression Feature Importance
log_model <- train(Will_Breakthrough ~ ., data = model_data, method = "glm", 
                   family = "binomial", trControl = train_control, 
                   preProcess = c("center", "scale"))

log_importance <- varImp(log_model)$importance

# Random Forest Feature Importance
rf_model <- train(Will_Breakthrough ~ ., data = model_data, method = "rf", 
                  trControl = train_control, tuneLength = 3)

rf_importance <- varImp(rf_model)$importance

# Combine importance scores
combined_importance <- data.frame(
  Feature = rownames(log_importance),
  Logistic_Importance = log_importance[, 1],
  RF_Importance = rf_importance[, 1]
) %>%
  mutate(
    Scaled_Log = as.numeric(scale(Logistic_Importance)),
    Scaled_RF = as.numeric(scale(RF_Importance)), 
    Avg_Importance = (Scaled_Log + Scaled_RF) / 2
  ) %>%
  arrange(desc(Avg_Importance))

Purpose: Combines insights from multiple machine learning approaches to create more robust feature importance rankings than either method alone.

Model Selection Rationale:

Logistic Regression: Provides interpretable linear relationships and handles small sample sizes well
Random Forest: Captures non-linear patterns and feature interactions
Combined Scoring: Averages scaled importance scores to leverage strengths of both approaches

Preprocessing:

Standardization: Centers and scales features for logistic regression
Cross-Validation: 5-fold CV with ROC metric for model evaluation
Hyperparameter Tuning: Automated tuning for random forest complexity

5. Feature Extraction and Validation

# Extract predictor variables (previous season features and age)
predictors <- c(names(data)[grepl("^Prev_", names(data))], age_col)
missing_predictors <- setdiff(predictors, names(data))

if (length(missing_predictors) > 0) {
  warning(paste("Missing predictors:", paste(missing_predictors, collapse = ", ")))
  predictors <- intersect(predictors, names(data))
}

cat("Number of predictors available:", length(predictors), "\n")
cat("Predictors being used:\n")
for (i in seq_along(predictors)) {
  cat(sprintf("  %d. %s\n", i, predictors[i]))
}

Purpose: Systematically identifies and validates features based on naming conventions while providing comprehensive feedback about data availability.

Feature Selection Logic:

Previous Season Features: Variables with “Prev_” prefix containing historical performance data
Age: Athlete age as a key demographic predictor
Validation: Checks feature availability and warns about missing data
Transparency: Lists all features being used for model training

6. Class Balance Monitoring and Validation

# Check class distribution after filtering
breakthrough_final <- sum(filtered_data$Will_Breakthrough == TRUE, na.rm = TRUE)
non_breakthrough_final <- sum(filtered_data$Will_Breakthrough == FALSE, na.rm = TRUE)
total_final <- breakthrough_final + non_breakthrough_final

cat("Final class distribution after filtering:\n")
cat(sprintf("  Breakthrough: %d (%.1f%%)\n", breakthrough_final, 
           (breakthrough_final/total_final)*100))
cat(sprintf("  Non-breakthrough: %d (%.1f%%)\n", non_breakthrough_final, 
           (non_breakthrough_final/total_final)*100))

# Warn about severe class imbalance
if (breakthrough_final / total_final < 0.05) {
  warning("Severe class imbalance detected - results may be unreliable")
}

Purpose: Monitors class distribution throughout the analysis pipeline to ensure model reliability and flag potential issues with imbalanced datasets.

Validation Metrics:

Class Proportions: Calculates percentage breakdown of breakthrough vs. non-breakthrough athletes
Imbalance Detection: Warns when breakthrough cases represent <5% of data
Sample Size Reporting: Provides transparency about final analysis sample sizes

7. Comprehensive Error Handling and Robustness

tryCatch({
  # Model training and evaluation
  result <- list(
    importance_ranking = combined_importance,
    logistic_model = log_model,
    rf_model = rf_model,
    breakthrough_threshold = breakthrough_threshold,
    age_filter = age_filter,
    final_sample_size = total_final,
    breakthrough_count = breakthrough_final
  )
}, error = function(e) {
  stop("Error in model training: ", e$message)
})

Purpose: Provides robust error handling that isolates failures and provides meaningful debugging information.

Robustness Features:

Nested Error Handling: Separate tryCatch blocks for each major operation
Progress Reporting: Detailed logging of each analysis step
Result Validation: Checks output validity before returning results
Graceful Degradation: Continues analysis when possible despite minor issues

Statistical and Methodological Justification

Breakthrough Prediction Challenges

Predicting athletic breakthroughs presents unique statistical challenges:

Rare Events: Breakthrough performances are inherently uncommon
Class Imbalance: Far more non-breakthrough than breakthrough athletes
Feature Complexity: Many potential predictors with complex interactions
Sample Size Constraints: Limited historical data for training

Adaptive Methodology Benefits

The adaptive approach addresses these challenges through:

Threshold Flexibility: Maintains statistical power by adjusting breakthrough definitions
Age Stratification: Focuses on athletes most likely to experience breakthroughs
Dual Model Approach: Leverages complementary strengths of different algorithms
Robust Validation: Ensures reliable results despite data constraints

Feature Importance Combination

The dual model approach provides several advantages:

Complementary Insights: Linear (logistic) and non-linear (random forest) perspectives
Reduced Overfitting: Averaging reduces sensitivity to individual model quirks
Stability: More robust rankings across different data samples
Interpretability: Maintains explainable results through logistic regression component

Applications and Use Cases

This feature selection system serves multiple analytical purposes:

Talent Identification: Identifies key predictors for breakthrough potential
Development Focus: Informs coaching priorities and training emphasis
Scouting Intelligence: Provides data-driven insights for talent acquisition
Performance Analysis: Reveals factors that distinguish breakthrough athletes
Predictive Modeling: Supplies features for comprehensive breakthrough prediction models

Error Handling and Quality Assurance

Input Validation: Comprehensive checks for data structure and content
Adaptive Parameters: Automatic adjustment based on data characteristics
Class Balance Monitoring: Continuous tracking of sample distribution
Model Validation: Cross-validation and performance metric monitoring
Result Verification: Output validation and quality checks

The feat-sel-break section provides a statistically sound and practically robust framework for identifying the most important features for predicting breakthrough athletic performance in cross-country skiing, with applications extending to talent development and strategic planning for the 2026 season.

Section: {r big-break} - Breakthrough Athlete Prediction for 2026

Purpose

This section implements a comprehensive breakthrough athlete prediction system for the 2026 cross-country skiing season. Building upon the feature selection work from {r feat-sel-break}, it applies trained machine learning models to identify which athletes are most likely to achieve breakthrough performance (≥50% of maximum possible points) in 2026, providing actionable insights for talent identification, coaching strategies, and performance planning.

Implementation Details

1. Core Prediction Function: `predict_2026_breakthroughs()`

The section centers around a robust prediction function that handles all aspects of breakthrough forecasting:

predict_2026_breakthroughs <- function(current_data, breakthrough_model, top_predictors, 
                                     age_col = "Age", min_age = 16, max_age = 35,
                                     min_participation = 1, breakthrough_threshold = 0.5) {
  
  # Input validation
  if (!is.data.frame(current_data)) stop("current_data must be a data.frame")
  if (nrow(current_data) == 0) stop("current_data cannot be empty")
  if (is.null(breakthrough_model)) stop("breakthrough_model cannot be NULL")
  
  # Validate required columns
  required_cols <- c("Skier", "Nation", age_col, "Pct_of_Max_Points", "Season")
  missing_cols <- setdiff(required_cols, names(current_data))
  if (length(missing_cols) > 0) {
    stop("Missing required columns: ", paste(missing_cols, collapse = ", "))
  }
}

Purpose: Provides a comprehensive framework for breakthrough prediction that validates inputs, processes data, and generates reliable forecasts for the 2026 season.

Key Parameters:

current_data: Most recent athlete performance data
breakthrough_model: Trained machine learning model from feature selection
top_predictors: Selected features identified as most predictive
age_col: Age column specification for demographic filtering
breakthrough_threshold: Performance threshold for breakthrough definition (default 50%)

2. Career History Analysis and Candidate Identification

# Identify athletes who have NEVER achieved breakthrough performance
career_breakthrough_check <- current_data %>%
  group_by(Skier) %>%
  summarise(
    ever_breakthrough = any(Pct_of_Max_Points >= breakthrough_threshold, na.rm = TRUE),
    max_career_performance = max(Pct_of_Max_Points, na.rm = TRUE),
    career_seasons = n_distinct(Season, na.rm = TRUE),
    current_age = first(Age, na.rm = TRUE),
    nation = first(Nation, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  filter(
    !ever_breakthrough,                    # Never achieved breakthrough
    current_age >= min_age,                # Within age range
    current_age <= max_age,
    career_seasons >= min_participation,   # Minimum participation
    !is.na(max_career_performance)        # Valid performance data
  )

cat("Breakthrough candidate identification:\n")
cat(sprintf("  Total athletes analyzed: %d\n", n_distinct(current_data$Skier)))
cat(sprintf("  Athletes who never achieved breakthrough: %d\n", nrow(career_breakthrough_check)))
cat(sprintf("  Age range: %d - %d years\n", min_age, max_age))

Purpose: Identifies genuine breakthrough candidates by excluding athletes who have already achieved breakthrough performance, focusing on those with untapped potential.

Candidate Criteria:

Never Breakthrough: Athletes who have never achieved ≥50% of maximum points
Age Range: Typically 16-35 years (focusing on developmental window)
Participation Threshold: Minimum racing experience requirement
Data Quality: Valid performance and demographic data

Strategic Rationale: True breakthrough prediction focuses on athletes who haven’t yet reached their potential, rather than predicting repeat performances from already-successful athletes.

3. Feature Engineering and Predictor Mapping

# Map current performance metrics to model-expected predictor variables
feature_mapping <- data.frame(
  current_name = c("Current_Elo", "Season_Races", "Best_Place", "Avg_Points"),
  model_name = c("Prev_Elo", "Prev_Races", "Prev_Best_Place", "Prev_Avg_Points")
)

# Apply feature mapping and validate predictor availability
model_data <- breakthrough_candidates
for (i in 1:nrow(feature_mapping)) {
  current_col <- feature_mapping$current_name[i]
  model_col <- feature_mapping$model_name[i]
  
  if (current_col %in% names(model_data)) {
    model_data[[model_col]] <- model_data[[current_col]]
    cat(sprintf("✓ Mapped %s -> %s\n", current_col, model_col))
  } else {
    warning(sprintf("Missing predictor: %s\n", current_col))
  }
}

# Validate all required predictors are available
missing_predictors <- setdiff(top_predictors, names(model_data))
if (length(missing_predictors) > 0) {
  warning("Missing model predictors: ", paste(missing_predictors, collapse = ", "))
  top_predictors <- intersect(top_predictors, names(model_data))
}

Purpose: Bridges the gap between current athlete data and historical model features, ensuring compatibility between prediction inputs and trained models.

Feature Mapping Logic:

Current → Previous: Maps current season metrics to “Prev_” format expected by models
Validation: Ensures all required predictors are available before prediction
Flexibility: Adapts to varying data availability across different athletes

Example Mappings:

Current_Elo → Prev_Elo (current rating becomes previous season rating)
Season_Races → Prev_Races (current race count becomes previous season experience)
Best_Place → Prev_Best_Place (current best result becomes previous season achievement)

4. Missing Data Imputation Strategy

# Robust imputation for missing predictor values
for (predictor in top_predictors) {
  if (predictor %in% names(model_data)) {
    missing_count <- sum(is.na(model_data[[predictor]]))
    if (missing_count > 0) {
      if (is.numeric(model_data[[predictor]])) {
        # Use median imputation for numeric variables
        impute_value <- median(model_data[[predictor]], na.rm = TRUE)
        model_data[[predictor]][is.na(model_data[[predictor]])] <- impute_value
        cat(sprintf("Imputed %d missing values for %s with median: %.3f\n", 
                   missing_count, predictor, impute_value))
      } else {
        # Use mode imputation for categorical variables
        mode_value <- names(sort(table(model_data[[predictor]]), decreasing = TRUE))[1]
        model_data[[predictor]][is.na(model_data[[predictor]])] <- mode_value
        cat(sprintf("Imputed %d missing values for %s with mode: %s\n", 
                   missing_count, predictor, mode_value))
      }
    }
  }
}

Purpose: Handles missing data systematically to ensure all breakthrough candidates can receive predictions, while maintaining statistical validity.

Imputation Strategy:

Numeric Variables: Median imputation (robust to outliers)
Categorical Variables: Mode imputation (most frequent category)
Transparency: Reports all imputation actions for data quality tracking

Statistical Justification: Median and mode imputation provide conservative estimates that don’t bias predictions toward extreme values while maintaining the ability to generate predictions for all candidates.

5. Machine Learning Model Application

# Generate breakthrough probability predictions
tryCatch({
  # Extract feature matrix for prediction
  prediction_matrix <- model_data[, top_predictors, drop = FALSE]
  
  # Validate prediction matrix
  if (nrow(prediction_matrix) == 0) {
    stop("No data available for prediction after preprocessing")
  }
  if (ncol(prediction_matrix) == 0) {
    stop("No valid predictors available for prediction")
  }
  
  # Generate predictions using trained model
  breakthrough_probabilities <- predict(breakthrough_model, 
                                      newdata = prediction_matrix, 
                                      type = "prob")
  
  # Extract breakthrough probability (positive class)
  if ("Yes" %in% names(breakthrough_probabilities)) {
    breakthrough_scores <- breakthrough_probabilities$Yes
  } else if (ncol(breakthrough_probabilities) == 2) {
    breakthrough_scores <- breakthrough_probabilities[, 2]
  } else {
    breakthrough_scores <- breakthrough_probabilities
  }
  
  cat(sprintf("✓ Generated predictions for %d athletes\n", length(breakthrough_scores)))
  cat(sprintf("Breakthrough probability range: %.3f - %.3f\n", 
             min(breakthrough_scores), max(breakthrough_scores)))
  
}, error = function(e) {
  stop("Error in breakthrough prediction: ", e$message)
})

Purpose: Applies the trained machine learning model to generate breakthrough probabilities for each candidate athlete.

Prediction Process:

Feature Matrix Construction: Creates input matrix from validated predictors
Model Application: Uses trained model to generate probability predictions
Probability Extraction: Handles different model output formats consistently
Quality Validation: Checks prediction ranges and data integrity

Output: Numerical breakthrough probabilities (0-1 scale) representing likelihood of achieving ≥50% performance in 2026.

6. Results Classification and Analysis

# Classify breakthrough likelihood levels
breakthrough_results <- model_data %>%
  mutate(
    breakthrough_probability = breakthrough_scores,
    likelihood_category = case_when(
      breakthrough_probability >= 0.7 ~ "Very High",
      breakthrough_probability >= 0.5 ~ "High", 
      breakthrough_probability >= 0.3 ~ "Medium",
      breakthrough_probability >= 0.1 ~ "Low",
      TRUE ~ "Very Low"
    ),
    points_needed_for_breakthrough = (breakthrough_threshold - max_career_performance) * 100,
    seasons_until_expected = case_when(
      breakthrough_probability >= 0.7 ~ "2026 (Very Likely)",
      breakthrough_probability >= 0.5 ~ "2026-2027 (Likely)",
      breakthrough_probability >= 0.3 ~ "2026-2028 (Possible)",
      TRUE ~ "Uncertain"
    )
  ) %>%
  arrange(desc(breakthrough_probability))

# Generate summary statistics
cat("\nBreakthrough Prediction Summary:\n")
summary_stats <- breakthrough_results %>%
  group_by(likelihood_category) %>%
  summarise(
    count = n(),
    avg_probability = mean(breakthrough_probability),
    avg_age = mean(current_age),
    avg_career_best = mean(max_career_performance * 100),
    .groups = 'drop'
  )

print(summary_stats)

Purpose: Transforms raw probability scores into actionable categories and provides comprehensive analysis of breakthrough potential.

Classification System:

Very High (≥70%): Almost certain breakthrough candidates
High (50-69%): Strong breakthrough potential
Medium (30-49%): Moderate breakthrough possibility
Low (10-29%): Limited breakthrough likelihood
Very Low (<10%): Minimal breakthrough probability

Analysis Metrics:

Points Gap: How many additional percentage points needed for breakthrough
Timeline Projection: Expected timeframe for breakthrough achievement
Demographic Patterns: Age and performance distributions by likelihood category

7. Specialized Analysis for Young Athletes

# Focus analysis on athletes under 25 (high development potential)
young_athletes <- breakthrough_results %>%
  filter(current_age < 25) %>%
  arrange(desc(breakthrough_probability))

cat(sprintf("\nYoung Athletes Analysis (Under 25):\n"))
cat(sprintf("  Total young candidates: %d\n", nrow(young_athletes)))

if (nrow(young_athletes) > 0) {
  cat("Top young breakthrough prospects:\n")
  top_young <- head(young_athletes, 10)
  for (i in 1:nrow(top_young)) {
    athlete <- top_young[i, ]
    cat(sprintf("  %d. %s (%s) - Age %d: %.1f%% probability (%s)\n",
               i, athlete$Skier, athlete$nation, athlete$current_age,
               athlete$breakthrough_probability * 100, athlete$likelihood_category))
  }
}

Purpose: Provides specialized analysis for young athletes who typically have the highest breakthrough potential due to natural development and career progression.

Youth Focus Rationale:

Development Window: Athletes under 25 are often still improving rapidly
Career Stage: Early career athletes have more room for breakthrough improvement
Investment Priority: Young prospects represent longer-term development opportunities

Statistical and Analytical Justification

Breakthrough Prediction Methodology

The prediction approach addresses key challenges in sports analytics:

Historical Context: Uses career-long performance data to identify true breakthrough candidates
Feature Engineering: Maps current metrics to historical model features appropriately
Missing Data: Handles incomplete data through robust imputation strategies
Model Validation: Ensures prediction reliability through comprehensive validation

Classification System Design

The five-tier likelihood system provides:

Actionable Categories: Clear decision-making boundaries for coaching and development
Risk Stratification: Different confidence levels for different investment decisions
Timeline Guidance: Expected breakthrough timeframes for planning purposes
Resource Allocation: Priority ranking for limited development resources

Age Stratification Benefits

Separate analysis for young athletes recognizes:

Development Patterns: Different breakthrough trajectories by age
Investment Horizons: Longer-term value for younger prospects
Risk Profiles: Different uncertainty levels across age groups
Strategic Planning: Age-appropriate development strategies

Applications and Use Cases

This breakthrough prediction system serves multiple stakeholders:

Talent Development: Identifies athletes for intensive development programs
Coaching Strategy: Informs individualized training and competition planning
Team Selection: Supports national team selection and development decisions
Investment Decisions: Guides sponsorship and resource allocation choices
Performance Planning: Sets realistic but ambitious performance targets

Error Handling and Quality Assurance

Input Validation: Comprehensive validation of all inputs and parameters
Data Quality Monitoring: Continuous tracking of data completeness and validity
Prediction Validation: Ensures all predictions are within expected ranges
Missing Data Management: Systematic approach to incomplete athlete records
Result Verification: Cross-validation of predictions against historical patterns

The big-break section provides a comprehensive, data-driven approach to identifying breakthrough athletes for 2026, combining machine learning predictions with domain expertise to support strategic decision-making in cross-country skiing talent development and performance planning.