The Bradley-Terry Method: How Does It Work?

The Basic Method

I discovered the Bradley-Terry method through the KRACH system used in college hockey. A more thorough explanation can be found at the prior link. To summarize, the ratings are calculated to meet two principles:

When two teams play, their odds of winning are proportional to their ratings. If Team A’s rating is twice that of Team B, then if those teams play three times, Team A would be expected to win two and Team B one. Thus, expected winning percentage against any opponent can be computed as `W_(A,B)=R_A/(R_A+R_B)`.
Each team’s predicted record based on their rating and the ratings of their opponents is exactly equal to their actual record.

This system has a problem with teams who are either winless or undefeated; in order to match such a record, their ratings would have to be zero or infinite. To avoid this, each team is credited with three tie games (1/2 win and 1/2 loss each) against a hypothetical “average” team. (In KRACH, a rating of 100 is average, but this is arbitrary; scaling all the ratings by the same factor gives identical results. I use a rating of 1 as average instead.)

Prior to the 2011 football season, I used only one such tie game, but this tended to predict unrealistically high win percentages for unbeaten teams. The change to three fictional games tends to push down teams who have mediocre records against great schedules, as compressing the rating scale in this manner weakens more of their opponents' ratings than it improves. This mitigates one of the main criticisms of KRACH (which no longer uses the fictional game at all): a team can get to a fairly high rank purely by being in a tough conference, even if they do very poorly. This is exacerbated by the high percentage of games in conference in hockey (about 80%, compared to 65% for football or basketball).

An example calculation

Team A has played three games: a win over Team B (rating R_B = 2), a win over Team C (R_C = 3), and a loss to Team D (R_D = 4). Team A’s rating will affect the ratings of Teams B, C, and D, so all of these must be calculated iteratively, but for the moment we’ll assume R_B, R_C, and R_D are constants in calculating R_A.

Let’s start with a guess of R_A = 3. Team A’s expected wins:

Against B: 3 / (3 + 2) = 0.6
Against C: 3 / (3 + 3) = 0.5
Against D: 3 / (3 + 4) = 0.4286
Against fictional team: 3 * (3 / (3 + 1)) = 3 * 0.75
Total: 3.7786
Actual wins: 3.5 (2 + three fictional ties)

This is too high, so R_A must be less than 3. Let’s try 2.5:

Against B: 2.5 / (2.5 + 2) = 0.5556
Against C: 2.5 / (2.5 + 3) = 0.4545
Against D: 2.5 / (2.5 + 4) = 0.3846
Against fictional team: 3 * (2.5 / (2.5 + 1)) = 3 * 0.7143
Total: 3.5376

Still too high, but we’re pretty close.
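
The trial-and-error search above can be automated. The following is a minimal sketch, not the script behind these ratings: the names `expected_wins` and `solve_rating` are made up for illustration, the opponents' ratings are held fixed as in the example, and bisection is simply one convenient way to find the rating at which expected wins equal actual wins.

```python
FICTIONAL_TIES = 3      # tie games against the hypothetical average team
AVERAGE_RATING = 1.0    # rating of that average team


def expected_wins(rating, opponent_ratings):
    """Expected wins against the real opponents plus the fictional ties."""
    real = sum(rating / (rating + r_o) for r_o in opponent_ratings)
    fictional = FICTIONAL_TIES * rating / (rating + AVERAGE_RATING)
    return real + fictional


def solve_rating(actual_wins, opponent_ratings, lo=1e-6, hi=1e6, steps=200):
    """Bisect for the rating whose expected wins match actual_wins.

    expected_wins increases with rating, so bisection is safe here.
    """
    for _ in range(steps):
        mid = (lo * hi) ** 0.5          # geometric midpoint: ratings are ratios
        if expected_wins(mid, opponent_ratings) > actual_wins:
            hi = mid
        else:
            lo = mid
    return (lo * hi) ** 0.5


opponents = [2.0, 3.0, 4.0]             # Teams B, C, D
actual = 2 + FICTIONAL_TIES * 0.5       # two real wins plus three half-wins

print(round(expected_wins(3.0, opponents), 4))   # 3.7786
print(round(expected_wins(2.5, opponents), 4))   # 3.5376
r_a = solve_rating(actual, opponents)
print(round(r_a, 3), round(expected_wins(r_a, opponents), 4))
```

The last line prints the rating that makes Team A's expected wins equal 3.5; it falls between 2.4 and 2.5, consistent with the two guesses above.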

Trial and error like the above would be prohibitively slow when every team’s rating depends on every other team’s. Another way to interpret the rating is that a team’s rating is equal to their win ratio (wins divided by losses, including the fictional games) times their strength of schedule. Strength of schedule is defined as the rating of a single hypothetical opponent such that, if the team’s entire N-game schedule (including the fictional games) were replaced by N games against that opponent, their projected record would be exactly the same.

Strength of schedule can be calculated from the following formula:

`SOS = (sum R_o/(R_s+R_o))/(sum 1/(R_s+R_o))`

where R_s is your rating and R_o is your opponent’s, summed across all opponents (with the fictional average team counted once per fictional game). Because this is a nonlinear equation, an iterative process is used to calculate the ratings. Each team is assigned an initial guess of 1, then the strength of schedule and new rating for each team are calculated using that guess. With the new ratings, each team’s strength of schedule is recomputed and a new set of ratings is generated. The process is repeated until the ratings converge.
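
The iteration described above might look roughly like the sketch below. It is an illustration under stated assumptions rather than the actual script: the function name `bradley_terry`, the `(winner, loser)` input format, the tolerance, and the iteration cap are all hypothetical choices, and the fictional ties are folded into both the win ratio and the strength of schedule.

```python
from collections import defaultdict

FICTIONAL_TIES = 3      # tie games against the hypothetical average team
AVERAGE_RATING = 1.0    # rating of that average team


def bradley_terry(games, tol=1e-12, max_iter=10_000):
    """games: list of (winner, loser) pairs. Returns {team: rating}."""
    wins = defaultdict(float)
    losses = defaultdict(float)
    opponents = defaultdict(list)
    for winner, loser in games:
        wins[winner] += 1
        losses[loser] += 1
        opponents[winner].append(loser)
        opponents[loser].append(winner)

    teams = list(opponents)
    ratings = {t: 1.0 for t in teams}           # initial guess of 1 for everyone
    for _ in range(max_iter):
        new = {}
        for t in teams:
            r_s = ratings[t]
            # Start both SOS sums with the fictional games against the average team.
            num = FICTIONAL_TIES * AVERAGE_RATING / (r_s + AVERAGE_RATING)
            den = FICTIONAL_TIES / (r_s + AVERAGE_RATING)
            for o in opponents[t]:
                num += ratings[o] / (r_s + ratings[o])
                den += 1.0 / (r_s + ratings[o])
            sos = num / den
            half = FICTIONAL_TIES / 2           # each tie is half a win, half a loss
            win_ratio = (wins[t] + half) / (losses[t] + half)
            new[t] = win_ratio * sos
        change = max(abs(new[t] - ratings[t]) for t in teams)
        ratings = new                           # all teams update together
        if change < tol:
            break
    return ratings


games = [("A", "B"), ("B", "C"), ("A", "C")]
for team, rating in sorted(bradley_terry(games).items()):
    print(team, round(rating, 3))
```

Once the loop has converged, each team's expected wins (computed as in the example calculation) match its actual wins plus 1.5, which is the second principle stated at the top.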

The Margin-Aware Method

The basic method has a significant flaw: A single-point 3OT squeaker and a 50-point blowout count exactly the same. There are methods which look at margin of victory, but many of these ignore record entirely and look only at pure points. An example of this is Ken Pomeroy’s basketball ratings, which adjust for tempo and strength of schedule to compute offensive and defensive ratings based on points scored and allowed per possession against an average team. While this is highly useful, it’s possible for the system to rate a team very highly even if that team has a terrible record, simply because that team loses a lot of close games and wins a few massive blowouts. (Illinois in 2008 is perhaps the most notable example: with a losing record, they ranked 40th out of 341 teams.)

One way to avoid overvaluing blowouts is to reduce the value of each additional point of scoring margin as you move away from zero. The way I do this is to assign a number of “victory points” (between 0 and 1) to each team based on the margin of the game:

`V_A = 1 / (1 + e^(-M_A/alpha))`

where M_A is Team A’s net scoring margin for the game (negative for a loss) and α is a factor that adjusts the balance between a pure points rating and a pure W-L rating. As α approaches 0, the rating system closely approximates the basic, pure W-L rating, with only extremely close games getting less than full credit for a win. As α grows, a team’s victory point “record” becomes nearly linearly related to total scoring margin. After calculating each team’s victory point total, the basic rating system as above is applied using victory points instead of the true number of wins.
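
As a quick illustration of the formula, the sketch below evaluates it for a few margins. The function name `victory_points` is chosen here for illustration; α=5 is the basketball value quoted later in this section.

```python
import math


def victory_points(margin, alpha):
    """Share of a win (between 0 and 1) credited for a game decided by `margin`."""
    return 1.0 / (1.0 + math.exp(-margin / alpha))


for margin in (1, 5, 15, 30):
    print(margin, round(victory_points(margin, alpha=5.0), 4))
# 1 0.5498
# 5 0.7311
# 15 0.9526
# 30 0.9975
```

The two teams' victory points for a game always sum to 1, since the loser's margin is the negative of the winner's.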

Prediction of win probability and margin of victory is based on the following two equations:

`W_A/W_B = (R_A/R_B)^(K_W)`
`M_(A,B) = K_M * ln(R_A/R_B)`

The value of α is chosen to have a game within α points be considered “close” and a margin greater than 5α or 6α be a major blowout. For now, I am using α=5 for basketball and 6.5 for football. K_M and K_W are then approximated based on the available data.

Home Field Adjustment

The above methods completely ignore home field for computation. However, a team that managed the same results against the same teams while playing all of its games on the road would most likely be better than one that hosted all of its games. Home-field advantage is accounted for by adjusting each opponent’s rating while computing strength of schedule. If a game was played on the road, that opponent’s rating is multiplied by H for the purposes of computing strength of schedule; if the game was a home game, the opponent’s rating is divided by H instead. Most neutral site games receive no adjustment, but some are considered “semi-home” games (for instance, Georgia facing Boise State in Atlanta) and get a reduced adjustment factor of `sqrt(H)` instead.
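
The location adjustment can be sketched as a multiplier on the opponent's rating. This is an illustration only: the function names and the value of H are placeholders, and treating a "semi-road" game as the mirror image of a semi-home game (multiplying by `sqrt(H)`) is an assumption based on the location list on the team pages.

```python
import math


def location_factor(location, h):
    """Multiplier for the opponent's rating, keyed by where the rated team played."""
    factors = {
        "road": h,
        "semi-road": math.sqrt(h),      # assumed mirror of the semi-home case
        "neutral": 1.0,
        "semi-home": 1.0 / math.sqrt(h),
        "home": 1.0 / h,
    }
    return factors[location]


def adjusted_opponent_rating(opponent_rating, location, h):
    """Opponent rating as used in the strength-of-schedule sums."""
    return opponent_rating * location_factor(location, h)


H = 1.44    # placeholder value, not a fitted one
for loc in ("home", "semi-home", "neutral", "semi-road", "road"):
    print(loc, round(adjusted_opponent_rating(2.0, loc, H), 3))
```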

For predictive purposes, the home team’s rating is multiplied by H (or `sqrt(H)` for semi-home games) before calculating win probability and expected margin.

The value of H is determined empirically by computing the ratings with different H values and comparing the expected number of home wins in true home games against the true number. Early in the season, the value from the previous season is used; I generally switch over to calculating based on the current season around the midpoint of conference play.

Preseason Adjustment

Starting with the 2014 football season, I’ve included an adjustment in the early season that biases teams toward their rating from the previous year. This is to prevent cases where two strong teams play each other in week 1 and the loser ends up severely punished for it until the data set is large enough for the winner’s true strength to be recognized.

The adjustment is handled in much the same manner as the fictional games used for normalization: additional fake tie games are added, but this time against an opponent with rating equal to the team’s previous-year rating instead of an average team. However, these games are gradually removed from the ratings as real data comes in. Initially, the preseason adjustment is given a weight of five games. For football, that weight is decreased by 2/3 of a game for each game the team has played this season until it is completely removed after eight games. For basketball, the decay rate is 1/2 game instead of 2/3, so it does not disappear until after the team’s 10th game.
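
The decay schedule above reduces to a one-line formula. A minimal sketch, with `preseason_weight` and the constant names chosen here for illustration:

```python
INITIAL_WEIGHT = 5.0                                    # fake tie games at the start
DECAY_PER_GAME = {"football": 2 / 3, "basketball": 1 / 2}


def preseason_weight(games_played, sport):
    """Number of fake tie games against last year's rating still in effect."""
    return max(0.0, INITIAL_WEIGHT - DECAY_PER_GAME[sport] * games_played)


for n in (0, 4, 7, 8):
    print("football", n, round(preseason_weight(n, "football"), 3))
for n in (0, 5, 9, 10):
    print("basketball", n, round(preseason_weight(n, "basketball"), 3))
```

Each remaining game of weight contributes half a win and half a loss against an opponent rated at the team's previous-year rating.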

SOS values shown on the ratings page do not include the games put in for preseason adjustment.

Team Pages

Each team page has a list of their games played so far, with the following information:

Opponent
Location (home, road, semi-home, semi-road, or neutral)
Location-adjusted margin-aware rating of opponent
Score
Victory points for and against (in margin-aware ratings)
A single-game performance metric which is defined as follows:

`P = 10 log_5(R_1) + 5`

where R_1 is the rating that would be given to the team if this game were the only data available about them. In theory, P can range from -5 to +15, but in practice results outside the 0-10 range are rare. 5 is average.
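
One way to compute this metric is sketched below. It rests on an assumption: "the only data available" is read as the single game's victory points plus the three fictional ties against an average team, with no preseason games. That reading reproduces the stated -5 to +15 range, which suggests, but does not confirm, that it is the intended one. The function names are hypothetical.

```python
import math

FICTIONAL_TIES = 3      # tie games against the hypothetical average team
AVERAGE_RATING = 1.0    # rating of that average team


def single_game_rating(victory_pts, opp_rating):
    """Rating implied by one game plus the fictional ties, found by bisection."""
    target = victory_pts + FICTIONAL_TIES / 2
    lo, hi = 0.2, 5.0                   # under this reading the answer lies in [1/5, 5]
    for _ in range(100):
        mid = math.sqrt(lo * hi)
        expected = (mid / (mid + opp_rating)
                    + FICTIONAL_TIES * mid / (mid + AVERAGE_RATING))
        if expected > target:
            hi = mid
        else:
            lo = mid
    return math.sqrt(lo * hi)


def performance(victory_pts, opp_rating):
    """Single-game performance: 10 * log base 5 of the implied rating, plus 5."""
    return 10 * math.log(single_game_rating(victory_pts, opp_rating), 5) + 5


print(round(performance(0.5, 1.0), 2))   # 5.0: a dead-even game against an average team
print(round(performance(0.9, 2.0), 2))
```

Here `opp_rating` is the location-adjusted rating of the opponent and `victory_pts` is the team's victory points for that game.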

Victory point totals are listed including the normalization and preseason adjustment games, followed by the SOS and victory point ratio, which combine to produce the margin-aware rating. The page currently does not include the basic-method rating.

Frequently Asked Questions

How are 1-AA teams handled in football?

With so few games and so many teams playing against an FCS/1-AA team, all 1-AA teams are lumped into a single entity. This is not especially fair to teams who play a strong 1-AA instead of a weak one, but the effect is much larger on lower-ranked teams than it is for highly-ranked teams (who are expected to beat either by a lot anyway). Due to limited crossover between 1-A and 1-AA, including all 1-AA teams in the same rating is not ideal; an excellent record in 1-AA could easily put a team above the 1-A average, which is unlikely to be accurate except for perhaps the very best 1-AA teams.

A better way might start with two different fictional teams, one representing an average 1-A team and another representing an average 1-AA, but figuring out how to set the ratings of these teams is non-trivial.

Games against teams below 1-AA are not included in the ratings at all. These are very rare and are usually limited to teams making the transition from 1-AA.

How are non-D1 teams handled in basketball?

Since non-D1 games are much rarer than 1-AA games in football, they are ignored entirely.

What does the basic method do better than the RPI?

The RPI has two serious flaws: strength of schedule is affected far too much by playing a team much stronger or much weaker than you, and defeating a very weak team can lower your rating (or, conversely, losing to a very strong team can improve your rating). The “RPI anchor” phenomenon is a well-known effect in college basketball, with serious impact on the selection committee’s decisions: play non-conference games against teams around #200 instead of #300, and your RPI will be much higher with only a minimal impact on your chances of losing a game.

Bradley-Terry avoids both of these flaws. The difference between facing a team you should beat 95% of the time and one you should beat 99% of the time is only 0.04 expected wins (before recalculating your rating), even though those teams' ratings are separated by a factor of about 5. Since winning adds a full win to your actual record and less than that to your expected record, your rating must go up after a win (ignoring the effects of other teams' results changing your strength of schedule). You may not gain much by beating a team that is 250 spots below you in the rankings, but you will not lose anything. (The latter is not true of the margin-aware method; a close win against a team you were expected to blow out can bring you down since you are earning fewer victory points than expected.)
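
The 95% versus 99% comparison can be checked directly from the win-probability formula. This is just a worked check; the helper name `rating_ratio` is made up for illustration.

```python
def rating_ratio(win_prob):
    """Rating ratio R_A / R_B implied by Team A's win probability."""
    return win_prob / (1.0 - win_prob)


weak, weaker = 0.95, 0.99
print(round(weaker - weak, 2))                                  # 0.04 expected wins
print(round(rating_ratio(weak), 2))                             # 19.0
print(round(rating_ratio(weaker), 2))                           # 99.0
print(round(rating_ratio(weaker) / rating_ratio(weak), 2))      # 5.21
```

So the weaker opponent's rating is lower by a factor of 99/19, a little over 5, while the expected-win total moves by only 0.04.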

What does the margin-aware method do better than other points-based systems?

The “victory point” formula prevents one massive blowout from masking several losses. A conventional pure-points system sees a 30-point win and two 15-point losses as equivalent to 3 ties; here, with α=5, you get 1.0924 victory points, equivalent to about a 0.364 winning percentage. There are more complex systems with other advantages (Ken Pomeroy’s method allows you to compute separate offensive and defensive ratings and adjust for tempo); it’s best to look at many different points-based systems. Kenneth Massey maintains a ratings comparison page for both basketball and football which includes many different systems, both points-based and record-only. (The margin-aware ratings from here are listed under “Baker Bradley-Terry”.)
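
The three-game example can be reproduced in a few lines. A small sketch using the victory point formula with α=5; the function name is chosen here for illustration.

```python
import math


def victory_points(margin, alpha=5.0):
    """Victory points for one game, per the margin-aware formula."""
    return 1.0 / (1.0 + math.exp(-margin / alpha))


margins = [30, -15, -15]                # one blowout win, two clear losses
print(sum(margins))                     # 0: a pure-points system sees three ties
total = sum(victory_points(m) for m in margins)
print(round(total, 4))                  # 1.0924
print(round(total / len(margins), 3))   # 0.364
```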

How are the ratings calculated?

I have a Python script that reads a text file with all the game results. For football I enter the results myself; for basketball, the file is generated from Ken Pomeroy’s database. The script iterates over all the teams repeatedly, updating ratings and strength of schedule until the data converges. Once the game results file is built, calculating the ratings only takes a few seconds.

How can I calculate the probability of Team A defeating Team B, and the average margin of victory, from the ratings?

For the basic ratings, first apply home field advantage by multiplying the appropriate team’s rating by H (or `sqrt(H)` for semi-home). Then `W_(A,B)=R_A/(R_A+R_B)`.

For margin-aware, again, apply home field advantage first. Then `W_(A,B)=R_A^(K_W)/(R_A^(K_W)+R_B^(K_W))` and `M_(A,B) = K_M*ln(R_A/R_B)`. For margin, on a neutral court you can simply subtract the two teams' margins against average teams; add K_M ln H for the home team or half that for a semi-home game.
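
Put together, a prediction could be sketched as follows. The function name `predict`, the location labels, and the numeric values of H, K_W, and K_M are all placeholders for illustration; the fitted values are not given here. Setting K_W to 1 gives the basic-rating win probability.

```python
import math

# Power of H applied to Team A's rating, by where Team A is playing.
HOME_EXPONENT = {"home": 1.0, "semi-home": 0.5, "neutral": 0.0,
                 "semi-road": -0.5, "road": -1.0}


def predict(r_a, r_b, location, h, k_w, k_m):
    """Return (win probability, expected margin) for Team A against Team B."""
    # Dividing A's rating by H is the same as multiplying B's rating by H.
    ratio = (r_a / r_b) * h ** HOME_EXPONENT[location]
    win_prob = ratio ** k_w / (ratio ** k_w + 1.0)
    margin = k_m * math.log(ratio)
    return win_prob, margin


# Placeholder parameters, not fitted values.
H, K_W, K_M = 1.3, 0.9, 10.0
for loc in ("home", "neutral", "road"):
    p, m = predict(2.0, 1.5, loc, H, K_W, K_M)
    print(loc, round(p, 3), round(m, 2))
```

The home and neutral margins differ by exactly K_M ln H, matching the shortcut described above.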

Is there any pace adjustment?

No. Endgame strategy is largely dictated by time and score and is independent of the tempo of the rest of the game; a 41-40 lead at the end of a basketball game is not really any safer than a 101-100 lead. The faster the pace of a game, the more likely a team will achieve a larger lead but the harder it is to protect that lead.