Is it time to fix the Strictly Come Dancing scoring system?

Is there a difference between judges’ scores of 7-7-7-8 compared with 6-7-7-9 or 8-7-7-7? According to the scoring system used by Strictly Come Dancing, the answer is no; all add up to 29 points out of 40. As far as the judges are concerned, those three pairs of dancers performed equally well. However, that interpretation requires the subjective scoring system used by each judge to be identical. Yet, anyone who has watched even a single show will quickly identify that Craig Revel Horwood generally scores the dancers more critically than Anton Du Beke, and Craig uses a wider range of scores. So, is 7-7-7-8 really just as good as 6-7-7-9?

Before we move on, let’s acknowledge that Strictly (or Dancing with the Stars) is an entertainment show and that none of what follows actually matters at all. The existing system “works” – people watch and engage with the process. But read on anyway.

Is there actually a problem?


To answer this question, we need to decide on the role of the judges. Are they there as pantomime villain/hero light relief, or are they intended to offer professional insight that helps the audience distinguish between good and bad performances, particularly from a technical standpoint? Given their extensive experience and expertise in the world of dance, one would hope the latter.

The judges are entrusted with providing valuable feedback and guidance, and their comments and scores serve as educational tools for the audience. Thanks to those comments, we know to look for “gapping”, “spatula hands” and “swivel”.

The judges’ remarks do offer constructive criticism that highlights the strengths of a performance and pinpoints areas for improvement. However, when the scoring system produces tied scores among multiple couples, the judges’ ability to guide the audience is compromised. That lack of resolution diminishes the educational side of their role, leaving viewers with less clarity on the nuances of each performance: how big is that technical error, really?

In the example above, the judges’ collective view, after years in the profession, was that Ellie, Krishnan, Zara and Angela’s dances were all equally good and equally bad (the top score is 40). If they can’t separate those four, why can the audience? Obviously this is fine – it is a popularity contest as well as a dance contest. But still.

In the Strictly points system, the leaderboard order is converted directly into points, which are then combined with the public vote to determine who ends up in the bottom two and faces elimination. The top couple each week by judges’ score earns points equal to the number of couples competing. When there is a tie, the tied couples get the same number of points, as you would expect. However, unlike other scoring systems, the couple below the tie gets only one point fewer, rather than the points their actual position would warrant.

In the example above, from a series past, you would expect Daisy and Lesley to get 8 points each, Claudia only 6, and Naga and Ed to get 2 and 1 respectively, rather than the 5 and 4 they actually received.

This points system also discards the magnitude of the differences identified by the judges. Ed scored only 16 points for that dance compared with Naga’s 24 – an 8-point gap in judges’ score that converts into a difference of just 1 leaderboard point.
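To make that compression concrete, here is a minimal Python sketch of the scores-to-points rule as described above (tied couples share points, and the couple below a tie drops by only one point). The couple names and totals are invented for illustration.

```python
def strictly_points(totals):
    """Convert judges' totals to leaderboard points using the current rule:
    the top couple gets points equal to the number of couples, tied couples
    share points, and each lower *distinct* total gets just one point fewer."""
    n = len(totals)
    distinct = sorted(set(totals.values()), reverse=True)  # best total first
    points_for_total = {t: n - i for i, t in enumerate(distinct)}
    return {couple: points_for_total[t] for couple, t in totals.items()}

# Hypothetical eight-couple leaderboard (judges' totals out of 40)
totals = {"A": 36, "B": 36, "C": 33, "D": 30, "E": 29, "F": 27, "G": 24, "H": 16}
print(strictly_points(totals))
# {'A': 8, 'B': 8, 'C': 7, 'D': 6, 'E': 5, 'F': 4, 'G': 3, 'H': 2}
# G and H are 8 judges' points apart (24 vs 16) but only 1 leaderboard point apart.
```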

The narrow points range used by Strictly amplifies the impact of the public vote. This heightened influence puts at risk couples who would otherwise be considered safe. Consequently, the competition loses some of its predictability, and the legitimacy of the results can be (and often is) called into question. Then again, controversy is a good thing: if people are talking about your show, it will encourage more people to watch.

Fixing the problem

The easiest part to fix is the scores-to-points conversion. Zero system change required: just award points based on rank order using the standard competition approach, so that last place always gets 1 point. This would help ensure that the best dancers, as viewed by the judges, are likely to avoid dance-offs in the earlier weeks.
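For comparison, here is the same made-up leaderboard from the earlier sketch converted with standard competition ranking, where tied couples share points but the couple below a tie drops by the size of the tie, so last place always ends up on 1 point.

```python
def standard_points(totals):
    """Standard competition ranking: a couple's points are the number of
    couples minus the number of couples strictly above them, so tied couples
    share points and last place always gets 1 (unless tied)."""
    n = len(totals)
    values = list(totals.values())
    return {couple: n - sum(v > t for v in values) for couple, t in totals.items()}

# Same hypothetical leaderboard as the sketch above
totals = {"A": 36, "B": 36, "C": 33, "D": 30, "E": 29, "F": 27, "G": 24, "H": 16}
print(standard_points(totals))
# {'A': 8, 'B': 8, 'C': 6, 'D': 5, 'E': 4, 'F': 3, 'G': 2, 'H': 1}
# The couple below the tie now drops to 6, and last place gets 1 point.
```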

Breaking ties, though, might need more thought. Let’s face it, breaking ties is probably not worth doing, but let’s suggest some options anyway.

Option 1: Tiebreaker. Easiest to enact. The head judge (Shirley in the UK) breaks ties by putting any tied couples into rank order. It would add maybe 1-2 minutes of drama (read that in Craig’s voice for maximum effect) to the end of the show and might add some extra column inches in the form of people complaining about that ordering! Easy to do, and not particularly statistically robust, but at least it is justified by an acknowledged hierarchy of skill.

Option 2: Ranked list. Scrap the whole points system and charge the judges with producing a ranked list of their favourites. Then combine these into a single rank order and, again, break any remaining ties via the head judge. Obviously this is loaded with problems, mostly how to fill the dead air time while the judges go through their deliberations. However, it is essentially what the viewing audience are expected to do (albeit by voting for their favourite(s)). In some ways, this system might reduce the clear order effect of scoring – going second is much worse than going first or last, and going after a good dancer will likely yield different scores than going after a less good one.
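How the four ranked lists would be combined isn’t specified here; one simple, assumed approach is to sum each couple’s rank across the judges (a Borda-style count) and let the head judge resolve anything still tied. A rough Python sketch, with the judges’ rankings invented for illustration:

```python
# Each judge submits a ranked list (first = favourite); rankings here are made up.
craig   = ["Ellie", "Krishnan", "Zara", "Angela"]
motsi   = ["Krishnan", "Ellie", "Angela", "Zara"]
shirley = ["Krishnan", "Ellie", "Zara", "Angela"]
anton   = ["Ellie", "Krishnan", "Angela", "Zara"]

# Sum each couple's rank position across the four judges
rank_sums = {}
for ranking in (craig, motsi, shirley, anton):
    for position, couple in enumerate(ranking, start=1):
        rank_sums[couple] = rank_sums.get(couple, 0) + position

# Lower rank sum = better combined placing
for couple, total in sorted(rank_sums.items(), key=lambda kv: kv[1]):
    print(couple, total)
# Ellie 6, Krishnan 6, Zara 14, Angela 14 -- still tied, over to the head judge
```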

Option 2.5: Divergence from dance 1. A variant of the above… In later episodes of the series, the scores quickly become horribly skewed toward 10s. Once you have said that one dance is worth the maximum score, it isn’t possible to say “actually that one is better”. Using a wider range of scores could help with that; some might even argue that no non-professional dancer should receive a 10. So, rather than waiting until the end to establish a per-judge rank order, score each dance as a divergence from the first. Make the first dance of every show receive 5-5-5-5 and have the judges score everything else as better or worse than dance 1 of that week. This would retain the value of different judges’ subjective opinions but yield fewer ties because it adds granularity.

Option 3: Use a wider scoring range. Adding resolution by having judges score out of 20 or even 100 is possible (but feels wrong). As anyone who has ever marked a student essay and tried to give a percentage score will attest, differentiating between a 68 and a 69 is basically impossible, so adding too much potential resolution would simply end up with judges rounding their scores anyway. Of course, part of the appeal of Strictly is the legacy: how could you comment on the “best ever male rumba” score if we suddenly changed to marks out of 100? Oh, and how many “paddles” would the judges need! But it is a dancing competition and the actual numbers are essentially meaningless, so there is no reason not to have a wider scale.

Option 4: Z scores. All the previous options put equal weight on each judge’s score. However, anyone who watches this show will appreciate that a “10” from Craig is harder to achieve than a 10 from Motsi or Anton.

Z scores, or standard scores, are a statistical measure that quantifies a data point’s relationship to the mean (average) of a group of scores. In the context of the dance competition, Z scores could be applied to the judges’ subjective scoring metrics to provide a more nuanced and fair evaluation.

How Z scores work: you calculate the average and also the spread around that average, the “standard deviation” (SD). A judge (e.g. Motsi) whose scores are all 8s, 9s and 10s will have a smaller standard deviation than one (e.g. Craig) who gives 5s, 6s, 7s, 8s, 9s and occasionally a 10. Using a judge’s own distribution (calculated from historical data), the standardised score for a mark from that judge is (score – mean)/SD. A Z score of +1 is one SD above that judge’s mean; -1 is one SD below.
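As a minimal sketch of that calculation (the score histories below are invented, not the judges’ real marks):

```python
from statistics import mean, pstdev

def z_score(score, history):
    """Standardise a single score against a judge's own scoring history."""
    mu = mean(history)
    sd = pstdev(history)  # population SD; statistics.stdev (sample SD) also works
    return (score - mu) / sd

# Invented score histories: Craig spreads his marks, Motsi clusters near the top
craig_history = [5, 6, 6, 7, 7, 7, 8, 8, 9, 3]
motsi_history = [8, 8, 9, 9, 9, 10, 8, 9, 10, 8]

print(round(z_score(8, craig_history), 2))  # ~0.86: well above Craig's average
print(round(z_score(8, motsi_history), 2))  # ~-1.07: below Motsi's average
```

The same raw 8 is a strong mark from one judge and a weak one from the other, which is exactly the difference the standardisation is meant to expose.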

The advantage of using Z scores lies in their ability to account for variations in the judges’ scoring tendencies. Each judge may have a different scale, and Z scores normalize these differences, ensuring that the final scores are more reflective of a couple’s performance relative to their peers. This approach provides a fairer representation of the technical and artistic merits of each dance. This would effectively address the issue of a narrow spread of scores by introducing a higher level of granularity.

Let’s look at a specific example. In the current series, Craig’s mean is 6.6 (SD 1.9) whereas Anton’s is 7.6 (SD 1.7). In week four, one couple got a 6 from Craig and an 8 from Anton, whereas another got a 7 from Craig and a 7 from Anton. These two couples both had 28 points. However, the first couple had a Z score of -0.32 from Craig [(6 – 6.6)/1.9] and +0.24 from Anton [(8 – 7.6)/1.7], for a total of -0.08, whereas the 7-7 couple had +0.21 from Craig and -0.35 from Anton for a total Z of -0.14. Slightly worse… tie broken.
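The arithmetic is easy to check; a quick sketch using the means and SDs quoted above:

```python
def z(score, mu, sd):
    return (score - mu) / sd

# Craig: mean 6.6, SD 1.9; Anton: mean 7.6, SD 1.7 (figures quoted above)
couple_a = z(6, 6.6, 1.9) + z(8, 7.6, 1.7)  # -0.32 + 0.24
couple_b = z(7, 6.6, 1.9) + z(7, 7.6, 1.7)  # +0.21 - 0.35

print(round(couple_a, 2), round(couple_b, 2))  # -0.08 -0.14 -- tie broken
```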

There is an easily predictable response to this option: “that’s stupidly complicated!” And you are, of course, correct. There is zero chance that Z scores will be brought into a light entertainment show, however much fun it would be to hear Claudia trying to explain the maths!

Strictly stats lessons?

All this, and the conclusion is that the best scoring system is probably too complicated, and that the current one is probably the best trade-off between good TV and simplicity. Which is sad but inevitable.

Using Strictly to teach stats has potential. For educators out there, I think there is scope to talk about skew and kurtosis, to compare means with modes and medians, and to lead into Z scores and inferential statistics. It serves as a good illustration of how subjective scoring leads to challenges, and as a starting point for ways to deal with those types of data. It’s a nice example of stats in action and a way to get your students to go home and talk about the scoring-system problems of Strictly with their family. You could also encourage your students to try the Mean MCQ quiz here.
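As a possible starting point for that kind of lesson, here is a short sketch computing those summary statistics for an invented set of one judge’s scores (using SciPy for skew and kurtosis):

```python
from statistics import mean, median, mode
from scipy.stats import skew, kurtosis

# Invented scores from one judge across a series, bunched toward the top end
scores = [5, 6, 7, 7, 8, 8, 8, 9, 9, 10, 10, 10, 10]

print("mean:  ", round(mean(scores), 2))   # 8.23
print("median:", median(scores))           # 8
print("mode:  ", mode(scores))             # 10
print("skew:  ", round(skew(scores), 2))   # negative: a tail of lower scores
print("excess kurtosis:", round(kurtosis(scores), 2))
```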

FWIW I’ve no intention of touching Eurovision.
