Probabilistic Card Trick Solution

Backwards reasoning fallacy

Suppose it is known that, on average, 50% of the students who start a course pass it. Is it correct to conclude both of the following?

a) A course that starts with 100 students will end up, on average, with 50 passes.

b) A course that ends with 50 passes will, on average, have started with 100 students.

In fact a) is normally correct but b) is normally not. This has everything to do with the way we reason with so-called prior assumptions (this reasoning lies at the heart of the so-called Bayesian approach to probability).

To understand this fallacy we have to work with an example. The crucial prior knowledge in this case is the so-called ‘distribution’ of the number of students who start courses. Let’s suppose that these are courses in a particular college where the ‘average’ number of students per course is 180. We know that some courses will have more than 180 and some fewer; the distribution of student numbers looks like this:

This is a bell curve (a so-called Normal distribution) whose average (mean) is 180.

Distributions like this are characterised not just by the mean, but also by the so-called ‘variance’, which measures how ‘spread out’ the distribution is. The lower the variance, the closer most of the data are to the mean. In this example the variance is 1000, so the standard deviation (the square root of the variance) is about 32. This means that about 95% of the data lie within roughly plus or minus 60 of the mean (that is, within ‘two standard deviations’).
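The relationship between the variance, the standard deviation and the ‘95%’ range can be checked with a few lines of Python. This is only a sketch using the standard library's `statistics.NormalDist`; the variable names are illustrative and not part of the original model.

```python
from math import sqrt
from statistics import NormalDist

# Figures taken from the text: mean 180, variance 1000.
MEAN_STARTERS = 180
VAR_STARTERS = 1000

sd = sqrt(VAR_STARTERS)  # standard deviation = square root of the variance
starters = NormalDist(mu=MEAN_STARTERS, sigma=sd)

# Interval of two standard deviations either side of the mean.
low, high = MEAN_STARTERS - 2 * sd, MEAN_STARTERS + 2 * sd
coverage = starters.cdf(high) - starters.cdf(low)

print(f"standard deviation: {sd:.1f}")
print(f"two-sd interval: {low:.0f} to {high:.0f}")
print(f"probability inside interval: {coverage:.3f}")
```

The standard deviation comes out at about 31.6, so two standard deviations is closer to 63 than 60, and the interval from roughly 117 to 243 contains about 95% of courses.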

Because the number of students who pass is influenced by the number who start, we represent this relationship as follows.

As in any model like this (a so-called Bayesian net or risk map), we need to define not only the distribution for the node representing ‘number of students starting course’ (which we have said is a Normal distribution with mean 180 and variance 1000) but also the distribution for the node representing ‘number of students who pass’. Since the latter number depends on the former, what we actually need to define is the so-called conditional distribution. We know that, on average, the number of students who pass is 0.5 times the number who start, but we cannot say that this relationship is certain. What we can reasonably say is that the mean of the distribution of the number who pass is 0.5 times the number who start. So it again seems reasonable to use a Normal distribution, this time with a mean of 0.5 times the number who start. We assume that the variance of this distribution is 500.

Thus, if we know that 100 students start the course then the (predicted) distribution for the number who will pass looks like this:

As you would expect the mean of the predicted distribution is about 50.
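The forward prediction can be sketched directly from the conditional distribution described above. The function name `predicted_passes` is hypothetical, and the code assumes an exact continuous Normal distribution, which may differ slightly from what a Bayesian net tool computes.

```python
from math import sqrt
from statistics import NormalDist

PASS_RATE = 0.5    # mean pass rate, from the text
VAR_PASSES = 500   # variance of the conditional distribution, from the text


def predicted_passes(n_start: float) -> NormalDist:
    """Conditional distribution of passes given the number who start."""
    return NormalDist(mu=PASS_RATE * n_start, sigma=sqrt(VAR_PASSES))


forecast = predicted_passes(100)
low = forecast.inv_cdf(0.025)   # lower end of the central 95% range
high = forecast.inv_cdf(0.975)  # upper end of the central 95% range

print(f"mean passes: {forecast.mean:.0f}")
print(f"standard deviation: {forecast.stdev:.1f}")
print(f"central 95% range: {low:.0f} to {high:.0f}")
```

The mean is 50, as expected, but the spread is wide: with a variance of 500 the central 95% range runs from about 6 to about 94 passes.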

However, suppose we do not know the number who start but we know that 50 passed a particular course. In this case we use the model to reason backwards and it gives the following result:

Curiously the mean of the predicted distribution for the number who started this course is not 100 but is much higher – it is about 153. This seems to be wrong. But in fact, the fallacy is to assume that the mean should have been 100.
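The figure of about 153 can be reproduced by hand. Assuming the model is exactly as stated (a Normal prior on the number who start, and a Normal distribution for passes whose mean is 0.5 times the number who start), the backwards step is the standard Normal-Normal update sketched below. The function name and its arguments are illustrative; this is not necessarily how the modelling tool performs the calculation internally.

```python
def posterior_starters(passes, prior_mean=180.0, prior_var=1000.0,
                       rate=0.5, noise_var=500.0):
    """Posterior mean and variance of the number of starters.

    Assumes starters ~ Normal(prior_mean, prior_var) and
    passes ~ Normal(rate * starters, noise_var).
    """
    prior_precision = 1.0 / prior_var
    data_precision = rate ** 2 / noise_var
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (prior_mean * prior_precision
                            + rate * passes / noise_var)
    return post_mean, post_var


mean, var = posterior_starters(50)
print(f"posterior mean: {mean:.1f}")
print(f"posterior standard deviation: {var ** 0.5:.1f}")
```

With 50 observed passes the posterior mean is about 153.3. The prior carries twice the weight of the observation here (precision 0.001 against 0.0005), which is why the answer sits closer to 180 than to 100.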

What is happening here is that the model is reasoning about our uncertainty in such a way that our prior assumptions are taken into consideration. Not only do we know that the average number of people who start a course is 180, but we also know that it is very unlikely that fewer than 120 people start a course (there are some such courses but they are rare). On the other hand, although a course with, say, 150 starters will on average result in 75 passes, roughly 13% of the time the number of passes will be 50 or lower (with a variance of 500, a shortfall of 25 is only about 1.1 standard deviations).
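The two probabilities being weighed against each other can be computed from the stated distributions. This sketch again assumes exact continuous Normal distributions; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Prior on the number who start: mean 180, variance 1000 (from the text).
starters = NormalDist(mu=180, sigma=sqrt(1000))

# Passes for a course with 150 starters: mean 75, variance 500 (from the text).
passes_given_150 = NormalDist(mu=0.5 * 150, sigma=sqrt(500))

p_fewer_than_120 = starters.cdf(120)
p_50_or_fewer = passes_given_150.cdf(50)

print(f"P(fewer than 120 start): {p_fewer_than_120:.3f}")
print(f"P(50 or fewer pass | 150 start): {p_50_or_fewer:.3f}")
```

Under these assumptions a course with fewer than 120 starters has a probability of about 3%, while 50 or fewer passes from 150 starters has a probability of about 13%. A poor pass rate on a typical-sized course is therefore the less surprising event.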

Hence, if we know that a course had a very low number of passes (50), there are two possible ‘explanations’:

Far fewer students might have started the course than the 180 we expected.

The pass rate for the course might have been far lower than the 50% we expected.

What the model does is put more weight on the latter than the former, so it says “I am prepared to believe that fewer students started this course than expected, but the stronger explanation for the low number of passes is that the pass rate was lower than expected”.
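A rough way to see how the model splits the blame between the two explanations is to compare how far each quantity moves from its expected value. The pass rate is not an explicit variable in the model, so the ‘implied pass rate’ below (observed passes divided by the posterior mean number of starters) is only a back-of-envelope reading, and all names are illustrative.

```python
# Figures from the text.
PRIOR_MEAN, PRIOR_VAR = 180.0, 1000.0
RATE, NOISE_VAR = 0.5, 500.0
observed_passes = 50

# Posterior mean number of starters (standard Normal-Normal update).
post_precision = 1 / PRIOR_VAR + RATE ** 2 / NOISE_VAR
post_mean = (PRIOR_MEAN / PRIOR_VAR
             + RATE * observed_passes / NOISE_VAR) / post_precision

# Relative drop in the estimated number of starters.
starters_shift = 1 - post_mean / PRIOR_MEAN

# Rough implied pass rate and its relative drop from the expected 50%.
implied_rate = observed_passes / post_mean
rate_shift = 1 - implied_rate / RATE

print(f"starters: {PRIOR_MEAN:.0f} -> {post_mean:.0f} "
      f"({starters_shift:.0%} lower)")
print(f"pass rate: {RATE:.0%} -> {implied_rate:.0%} "
      f"({rate_shift:.0%} lower)")
```

On this reading the estimated number of starters falls by about 15%, while the implied pass rate falls by about 35%, from 50% to roughly 33%.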

This is the Bayesian approach to reasoning. Is it rational? Imagine that, give or take a handful of students, there were always 180 students starting a course. If you discovered that only 50 students had passed a course, you would have to conclude that a much lower than expected pass rate was to blame. In this extreme case your prior belief about the number of starting students is very strong and cannot be shifted by other observations. If you had no such strong belief about the number of starting students, then everything is different. For example, starting with a so-called ‘ignorant’ prior distribution, if we discover that 50 students passed a course then the model does indeed conclude that the course is most likely to have started with about 100 students.
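The effect of prior strength can be sketched by repeating the backwards calculation with different prior variances. The variances 25 (for ‘give or take a handful’) and 1e9 (standing in for an ‘ignorant’ prior) are assumptions chosen for illustration, and the function name is hypothetical.

```python
def posterior_mean_starters(passes, prior_mean, prior_var,
                            rate=0.5, noise_var=500.0):
    """Posterior mean of starters under a Normal prior and Normal noise."""
    precision = 1.0 / prior_var + rate ** 2 / noise_var
    return (prior_mean / prior_var + rate * passes / noise_var) / precision


# Prior variances: only the middle value comes from the text.
priors = {
    "strong (variance 25)": 25.0,
    "as in the text (variance 1000)": 1000.0,
    "nearly flat (variance 1e9)": 1e9,
}

results = {label: posterior_mean_starters(50, 180.0, var)
           for label, var in priors.items()}

for label, mean in results.items():
    print(f"{label}: posterior mean about {mean:.0f}")
```

With the strong prior the estimate barely moves from 180 (about 179), with the prior from the text it is about 153, and with the nearly flat prior it is about 100, which is the ‘intuitive’ answer.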

This type of problem is extremely common. For example, in a real-life project I was involved in, we were tackling the problem of attrition rates for classes of military vehicles in combat. On the one hand, we needed to know the likely number of vehicles left operational at the end of combat given certain combat scenarios. On the other hand, given a requirement for a minimum number of vehicles to be operational at the end of combat, we needed to calculate the minimum number of vehicles to start with. Although the model involved many variables, you can think of it in exactly the same terms as the above model, where ‘vehicles at start of combat’ replaces ‘students at start of course’ and ‘operational vehicles at end of combat’ replaces ‘students who pass the course’. As in the above example, users of the model could not understand why it predicted 50 vehicles at the end of combat given 100 vehicles at the start, yet predicted over 150 vehicles at the start given 50 at the end.

Since the prior distribution for vehicles had been provided by the users themselves, the model was working correctly, even if it did not produce the results that they felt were ‘sensible’. In this case it was the strength of the prior distribution that the users had to review for correctness.