Ski Jumping Weekly Predictions Model Documentation
Overview
This document explains how the ski jumping weekly predictions model works, based on the weekly-picks2.R script. The model predicts position probabilities (win, podium, top 5, top 10, top 30) for individual athletes and teams across multiple race types.
Competition Types
Individual Competitions
- Men’s Individual: Individual male athletes competing
- Ladies’ Individual: Individual female athletes competing
Team Competitions
- Men’s Team: Teams of male athletes (nation-based)
- Ladies’ Team: Teams of female athletes (nation-based)
- Mixed Team: Teams with both male and female athletes (nation-based)
Data Sources
Individual Data
- Men:
men_chrono_elevation.csv- Historical individual men’s results with elevation data - Ladies:
ladies_chrono_elevation.csv- Historical individual ladies’ results with elevation data - Startlists:
startlist_weekend_men.csv,startlist_weekend_ladies.csv
Team Data
- Men’s Teams:
men_team_chrono.csv- Historical men’s team results - Ladies’ Teams:
ladies_team_chrono.csv- Historical ladies’ team results - Mixed Teams:
mixed_team_chrono.csv- Historical mixed team results - Team Startlists:
startlist_team_weekend_men.csv,startlist_team_weekend_ladies.csv,startlist_mixed_team_weekend_teams.csv
Points System
All ski jumping events use the same points system:
- Top 30 get points: 100, 80, 60, 50, 45, 40, 36, 32, 29, 26, 24, 22, 20, 18, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1
ELO Features
Individual Athletes
Hill-specific ELO ratings based on different hill sizes:
Normal_Elo_Pct- Performance on normal hills (K90-K99)Large_Elo_Pct- Performance on large hills (K120-K130)Flying_Elo_Pct- Performance on flying hills (K185+)Elo_Pct- Overall ELO ratingPrev_Points_Weighted- Weighted average of recent points (last 5 races)
Team Competitions
Average ELO ratings of team members:
Avg_Normal_Elo_Pct- Team average normal hill ELOAvg_Large_Elo_Pct- Team average large hill ELOAvg_Flying_Elo_Pct- Team average flying hill ELOAvg_Elo_Pct- Team average overall ELO
Race Filtering
Weekend Race Selection
# Individual races
men_races <- next_weekend_races %>%
filter(Sex == "M", !grepl("Team", RaceType, ignore.case = TRUE))
ladies_races <- next_weekend_races %>%
filter(Sex == "L", !grepl("Team", RaceType, ignore.case = TRUE))
# Team races
men_teams <- next_weekend_races %>%
filter(grepl("Team", RaceType, ignore.case = TRUE) & Sex == "M")
ladies_teams <- next_weekend_races %>%
filter(grepl("Team", RaceType, ignore.case = TRUE) & Sex == "L")
# Mixed team races
mixed_teams <- next_weekend_races %>%
filter(Sex == "Mixed")
Race Participation Probability
Individual Athletes
Calculates probability based on historical participation in the specific race type over the last 5 years:
get_race_probability <- function(chronos, participant, racetype, is_team = FALSE) {
# Use 5 years of history or participant's first race date, whichever is later
five_years_ago <- Sys.Date() - (5 * 365)
start_date <- max(five_years_ago, first_race_date)
# Count total races vs participant's races for this race type
total_races <- count_distinct_races(chronos, racetype, start_date)
participant_races <- count_participant_races(chronos, participant, racetype, start_date)
probability <- min(1, participant_races / total_races)
return(probability)
}
Team Competitions
Teams typically have probability = 1.0 (assumed to participate if entered)
Prediction Models
Points Prediction Model
Uses Generalized Additive Models (GAM) with feature selection:
Individual Model
explanatory_vars <- c("Prev_Points_Weighted", "Normal_Elo_Pct", "Large_Elo_Pct",
"Flying_Elo_Pct", "Elo_Pct")
# Feature selection using regsubsets with BIC criterion
exhaustive_selection <- regsubsets(Points ~ ., data = race_df_75, method = "exhaustive")
best_bic_vars <- names(coef(exhaustive_selection, which.min(summary$bic)))
# GAM model with smooth terms
gam_formula <- as.formula(paste("Points ~", paste("s(", best_bic_vars[-1], ")", collapse=" + ")))
model <- gam(gam_formula, data = race_df_75)
Team Model
explanatory_vars <- c("Avg_Normal_Elo_Pct", "Avg_Large_Elo_Pct",
"Avg_Flying_Elo_Pct", "Avg_Elo_Pct")
Position Probability Models
Creates separate GAM models for each position threshold (1, 3, 5, 10, 30):
for(threshold in position_thresholds) {
# Binary response: did athlete finish in top N?
race_df$position_achieved <- as.numeric(race_df$Place <= threshold)
# Feature selection for this threshold
pos_selection <- regsubsets(position_achieved ~ ., data = race_df_75,
family = binomial, method = "exhaustive")
pos_best_vars <- names(coef(pos_selection, which.min(summary$bic)))
# GAM model with binomial family
pos_gam_formula <- as.formula(paste("position_achieved ~",
paste("s(", pos_best_vars[-1], ")", collapse=" + ")))
position_model <- gam(pos_gam_formula, data = race_df_75, family = binomial)
# Store model for prediction
position_models[[paste0("threshold_", threshold)]] <- position_model
}
Data Preprocessing
ELO Percentage Calculation
Converts absolute ELO ratings to percentages by dividing by historical maximum:
# Individual athletes
for(col in c("Normal_Elo", "Large_Elo", "Flying_Elo", "Elo")) {
max_val <- max(race_df[[col]], na.rm = TRUE)
result_df[[paste0(col, "_Pct")]] <- result_df[[col]] / max_val
}
# Team averages
for(col in c("Avg_Normal_Elo", "Avg_Large_Elo", "Avg_Flying_Elo", "Avg_Elo")) {
max_val <- max(race_df[[col]], na.rm = TRUE)
result_df[[paste0(col, "_Pct")]] <- result_df[[col]] / max_val
}
Missing Value Handling
# Replace NAs with first quartile values
result_df <- result_df %>%
mutate(across(ends_with("_Pct"), ~replace_na_with_quartile(.x)),
Prev_Points_Weighted = replace_na(Prev_Points_Weighted, 0))
Probability Normalization
Multi-Step Process
- Race Participation Adjustment: Multiply position probabilities by race participation probabilities
- Target Sum Normalization: Scale probabilities to target sums (100% for win, 300% for podium, etc.)
- Individual Capping: Cap individual probabilities at 100%
- Excess Redistribution: Redistribute excess probability proportionally to other athletes
normalize_position_probabilities <- function(predictions, race_prob_col, position_thresholds) {
for(threshold in position_thresholds) {
prob_col <- paste0("prob_top", threshold)
# Step 1: Apply race participation probability
normalized[[prob_col]] <- normalized[[prob_col]] * normalized[[race_prob_col]]
# Step 2: Normalize to target sum
current_sum <- sum(normalized[[prob_col]], na.rm = TRUE)
target_sum <- 100 * threshold
scaling_factor <- target_sum / current_sum
normalized[[prob_col]] <- normalized[[prob_col]] * scaling_factor
# Step 3: Cap at 100%
over_hundred <- which(normalized[[prob_col]] > 100)
if(length(over_hundred) > 0) {
excess <- sum(normalized[[prob_col]][over_hundred] - 100)
normalized[[prob_col]][over_hundred] <- 100
# Step 4: Redistribute excess
under_hundred <- which(normalized[[prob_col]] < 100)
if(length(under_hundred) > 0 && excess > 0) {
under_sum <- sum(normalized[[prob_col]][under_hundred])
redistrib_factor <- (under_sum + excess) / under_sum
normalized[[prob_col]][under_hundred] <- normalized[[prob_col]][under_hundred] * redistrib_factor
}
}
}
}
Hill Size Classification
- Normal Hills: K90-K99 (most common World Cup venues)
- Large Hills: K120-K130 (major championships, Four Hills Tournament)
- Flying Hills: K185+ (ski flying competitions)
Output Format
Individual Results
- Participant: Skier name
- ID: FIS athlete ID
- Nation: Country code
- Race: Race number within weekend
- Participation: Probability of racing (0-1)
- Win: Win probability percentage
- Podium: Top 3 probability percentage
- Top5: Top 5 probability percentage
- Top10: Top 10 probability percentage
- Top30: Top 30 probability percentage
Team Results
- Nation: Country/team name
- Race: Race number within weekend
- Participation: Probability of racing (typically 1.0)
- Win: Win probability percentage
- Podium: Top 3 probability percentage
- Top5: Top 5 probability percentage
- Top10: Top 10 probability percentage
- Top30: Top 30 probability percentage
Key Differences from Alpine Skiing
- Hill-specific ELO: Uses hill size categories (Normal/Large/Flying) instead of discipline-specific ELO
- Team competitions: Includes sophisticated team prediction models using averaged athlete ELO ratings
- Mixed competitions: Handles mixed-gender team events
- Unified points system: Same points system across all event types
- Race type filtering: Uses RaceType and HillSize instead of Distance for race categorization
Model Validation
- Uses Brier score for model evaluation
- Applies period adjustments for trend changes
- Includes fallback models for insufficient data scenarios
- Logs detailed probability sum checks to ensure mathematical consistency