Ski Jumping Weekly Predictions Model Documentation

Overview

This document explains how the ski jumping weekly predictions model works, based on the weekly-picks2.R script. The model predicts position probabilities (win, podium, top 5, top 10, top 30) for individual athletes and teams across multiple race types.

Competition Types

Individual Competitions

Men’s Individual: Individual male athletes competing
Ladies’ Individual: Individual female athletes competing

Team Competitions

Men’s Team: Teams of male athletes (nation-based)
Ladies’ Team: Teams of female athletes (nation-based)
Mixed Team: Teams with both male and female athletes (nation-based)

Data Sources

Individual Data

Men: men_chrono_elevation.csv - Historical individual men’s results with elevation data
Ladies: ladies_chrono_elevation.csv - Historical individual ladies’ results with elevation data
Startlists: startlist_weekend_men.csv, startlist_weekend_ladies.csv

Team Data

Men’s Teams: men_team_chrono.csv - Historical men’s team results
Ladies’ Teams: ladies_team_chrono.csv - Historical ladies’ team results
Mixed Teams: mixed_team_chrono.csv - Historical mixed team results
Team Startlists: startlist_team_weekend_men.csv, startlist_team_weekend_ladies.csv, startlist_mixed_team_weekend_teams.csv

Points System

All ski jumping events use the same points system:

Top 30 get points: 100, 80, 60, 50, 45, 40, 36, 32, 29, 26, 24, 22, 20, 18, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1

ELO Features

Individual Athletes

Hill-specific ELO ratings based on different hill sizes:

Normal_Elo_Pct - Performance on normal hills (K90-K99)
Large_Elo_Pct - Performance on large hills (K120-K130)
Flying_Elo_Pct - Performance on flying hills (K185+)
Elo_Pct - Overall ELO rating
Prev_Points_Weighted - Weighted average of recent points (last 5 races)

Team Competitions

Average ELO ratings of team members:

Avg_Normal_Elo_Pct - Team average normal hill ELO
Avg_Large_Elo_Pct - Team average large hill ELO
Avg_Flying_Elo_Pct - Team average flying hill ELO
Avg_Elo_Pct - Team average overall ELO

Race Filtering

Weekend Race Selection

# Individual races
men_races <- next_weekend_races %>%
  filter(Sex == "M", !grepl("Team", RaceType, ignore.case = TRUE))

ladies_races <- next_weekend_races %>%
  filter(Sex == "L", !grepl("Team", RaceType, ignore.case = TRUE))

# Team races  
men_teams <- next_weekend_races %>%
  filter(grepl("Team", RaceType, ignore.case = TRUE) & Sex == "M")

ladies_teams <- next_weekend_races %>%
  filter(grepl("Team", RaceType, ignore.case = TRUE) & Sex == "L")

# Mixed team races
mixed_teams <- next_weekend_races %>%
  filter(Sex == "Mixed")

Race Participation Probability

Individual Athletes

Calculates probability based on historical participation in the specific race type over the last 5 years:

get_race_probability <- function(chronos, participant, racetype, is_team = FALSE) {
  # Use 5 years of history or participant's first race date, whichever is later
  five_years_ago <- Sys.Date() - (5 * 365)
  start_date <- max(five_years_ago, first_race_date)
  
  # Count total races vs participant's races for this race type
  total_races <- count_distinct_races(chronos, racetype, start_date)
  participant_races <- count_participant_races(chronos, participant, racetype, start_date)
  
  probability <- min(1, participant_races / total_races)
  return(probability)
}

Team Competitions

Teams typically have probability = 1.0 (assumed to participate if entered)

Prediction Models

Points Prediction Model

Uses Generalized Additive Models (GAM) with feature selection:

Individual Model

explanatory_vars <- c("Prev_Points_Weighted", "Normal_Elo_Pct", "Large_Elo_Pct", 
                      "Flying_Elo_Pct", "Elo_Pct")

# Feature selection using regsubsets with BIC criterion
exhaustive_selection <- regsubsets(Points ~ ., data = race_df_75, method = "exhaustive")
best_bic_vars <- names(coef(exhaustive_selection, which.min(summary$bic)))

# GAM model with smooth terms
gam_formula <- as.formula(paste("Points ~", paste("s(", best_bic_vars[-1], ")", collapse=" + ")))
model <- gam(gam_formula, data = race_df_75)

Team Model

explanatory_vars <- c("Avg_Normal_Elo_Pct", "Avg_Large_Elo_Pct", 
                      "Avg_Flying_Elo_Pct", "Avg_Elo_Pct")

Position Probability Models

Creates separate GAM models for each position threshold (1, 3, 5, 10, 30):

for(threshold in position_thresholds) {
  # Binary response: did athlete finish in top N?
  race_df$position_achieved <- as.numeric(race_df$Place <= threshold)
  
  # Feature selection for this threshold
  pos_selection <- regsubsets(position_achieved ~ ., data = race_df_75, 
                             family = binomial, method = "exhaustive")
  pos_best_vars <- names(coef(pos_selection, which.min(summary$bic)))
  
  # GAM model with binomial family
  pos_gam_formula <- as.formula(paste("position_achieved ~", 
                                     paste("s(", pos_best_vars[-1], ")", collapse=" + ")))
  position_model <- gam(pos_gam_formula, data = race_df_75, family = binomial)
  
  # Store model for prediction
  position_models[[paste0("threshold_", threshold)]] <- position_model
}

Data Preprocessing

ELO Percentage Calculation

Converts absolute ELO ratings to percentages by dividing by historical maximum:

# Individual athletes
for(col in c("Normal_Elo", "Large_Elo", "Flying_Elo", "Elo")) {
  max_val <- max(race_df[[col]], na.rm = TRUE)
  result_df[[paste0(col, "_Pct")]] <- result_df[[col]] / max_val
}

# Team averages  
for(col in c("Avg_Normal_Elo", "Avg_Large_Elo", "Avg_Flying_Elo", "Avg_Elo")) {
  max_val <- max(race_df[[col]], na.rm = TRUE)
  result_df[[paste0(col, "_Pct")]] <- result_df[[col]] / max_val
}

Missing Value Handling

# Replace NAs with first quartile values
result_df <- result_df %>%
  mutate(across(ends_with("_Pct"), ~replace_na_with_quartile(.x)),
         Prev_Points_Weighted = replace_na(Prev_Points_Weighted, 0))

Probability Normalization

Multi-Step Process

Race Participation Adjustment: Multiply position probabilities by race participation probabilities
Target Sum Normalization: Scale probabilities to target sums (100% for win, 300% for podium, etc.)
Individual Capping: Cap individual probabilities at 100%
Excess Redistribution: Redistribute excess probability proportionally to other athletes

normalize_position_probabilities <- function(predictions, race_prob_col, position_thresholds) {
  for(threshold in position_thresholds) {
    prob_col <- paste0("prob_top", threshold)
    
    # Step 1: Apply race participation probability
    normalized[[prob_col]] <- normalized[[prob_col]] * normalized[[race_prob_col]]
    
    # Step 2: Normalize to target sum
    current_sum <- sum(normalized[[prob_col]], na.rm = TRUE)
    target_sum <- 100 * threshold
    scaling_factor <- target_sum / current_sum
    normalized[[prob_col]] <- normalized[[prob_col]] * scaling_factor
    
    # Step 3: Cap at 100%
    over_hundred <- which(normalized[[prob_col]] > 100)
    if(length(over_hundred) > 0) {
      excess <- sum(normalized[[prob_col]][over_hundred] - 100)
      normalized[[prob_col]][over_hundred] <- 100
      
      # Step 4: Redistribute excess
      under_hundred <- which(normalized[[prob_col]] < 100)
      if(length(under_hundred) > 0 && excess > 0) {
        under_sum <- sum(normalized[[prob_col]][under_hundred])
        redistrib_factor <- (under_sum + excess) / under_sum
        normalized[[prob_col]][under_hundred] <- normalized[[prob_col]][under_hundred] * redistrib_factor
      }
    }
  }
}

Hill Size Classification

Normal Hills: K90-K99 (most common World Cup venues)
Large Hills: K120-K130 (major championships, Four Hills Tournament)
Flying Hills: K185+ (ski flying competitions)

Output Format

Individual Results

Participant: Skier name
ID: FIS athlete ID
Nation: Country code
Race: Race number within weekend
Participation: Probability of racing (0-1)
Win: Win probability percentage
Podium: Top 3 probability percentage
Top5: Top 5 probability percentage
Top10: Top 10 probability percentage
Top30: Top 30 probability percentage

Team Results

Nation: Country/team name
Race: Race number within weekend
Participation: Probability of racing (typically 1.0)
Win: Win probability percentage
Podium: Top 3 probability percentage
Top5: Top 5 probability percentage
Top10: Top 10 probability percentage
Top30: Top 30 probability percentage

Key Differences from Alpine Skiing

Hill-specific ELO: Uses hill size categories (Normal/Large/Flying) instead of discipline-specific ELO
Team competitions: Includes sophisticated team prediction models using averaged athlete ELO ratings
Mixed competitions: Handles mixed-gender team events
Unified points system: Same points system across all event types
Race type filtering: Uses RaceType and HillSize instead of Distance for race categorization

Model Validation

Uses Brier score for model evaluation
Applies period adjustments for trend changes
Includes fallback models for insufficient data scenarios
Logs detailed probability sum checks to ensure mathematical consistency

Ski Jumping Weekly Predictions Model Documentation

Overview

Competition Types

Individual Competitions

Team Competitions

Data Sources

Individual Data

Team Data

Points System

ELO Features

Individual Athletes

Team Competitions

Race Filtering

Weekend Race Selection

Race Participation Probability

Individual Athletes

Team Competitions

Prediction Models

Points Prediction Model

Individual Model

Team Model

Position Probability Models

Data Preprocessing

ELO Percentage Calculation

Missing Value Handling

Probability Normalization

Multi-Step Process

Hill Size Classification

Output Format

Individual Results

Team Results

Key Differences from Alpine Skiing

Model Validation

CATALOG

FRIENDS