## Simpson's Paradox Visualized

Posted on Monday, July 10, 2017 under Mathematics

A while ago I wrote an article about an interesting statistical phenomenon known as Simpson's Paradox. Under Simpson's Paradox, a company can appear to have discriminatory hiring policies even though each of its individual departments is completely fair. A new medical treatment can work better than existing methods for both the young and the old, and yet give worse results when you don't know the age of the patient. In short, a single data set can produce opposite and contradictory results.

In that original article, I included a very simple numerical example of Simpson's Paradox. Although it provided a good demonstration of this effect, it was still not entirely obvious to all readers why Simpson's Paradox works or what is causing these seemingly bizarre contradictions.

So in today's article I will instead present a purely visual example of Simpson's Paradox, which will hopefully make it a little clearer.

Let me begin by introducing a completely silly little game. I start by putting some red and yellow dots on a table - they could be checkers, or bingo markers, or different coins. All that matters is that there are two distinct kinds, and they are distributed on a flat surface.

I allow you to select either the red or the yellow markers, and I am given the other set. Suppose that you select the red and I am left with the yellow. Through some random method, one of your red markers gets selected and one of my yellow markers gets selected. The object of the game is to have your random marker be higher up than mine.

Obviously with this layout, you will nearly always win if you select red, so red is the best choice.

Let us try the game again with another distribution of markers. This time there are more red markers than yellow, but it really makes no difference to the game.

As before, you are allowed to select either the red or the yellow markers. You select the red markers, and I receive the yellow markers. Once again, one of your markers is randomly chosen, one of mine is randomly chosen, and whoever's marker is higher wins the game. And again, it is clear that choosing red allows you to win nearly every time.

Now let us combine these two games into a single mega-game. I put both sets of markers together on the table in the following way:

The game progresses as before, with you selecting the red or yellow markers. Once again a random marker of each colour is chosen, and whoever's marker is higher wins the game.

Among the top set (i.e., the set from the first game) red is the better choice. Among the bottom set (the set from the second game) red is also the better choice. And so it would be reasonable to assume that since red is the better choice for each subset, it will be the better choice for the combined set. However, by simply looking at the combined set, it becomes clear that the yellow markers will nearly always be the winning set! This is Simpson's Paradox.
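For readers who would like to check this by counting, here is a minimal Python sketch of the game. The marker heights are invented for illustration and are not taken from the original pictures; I have simply assumed a top set with one red marker above nine yellow ones, and a bottom set with nine red markers above one yellow one. The helper `red_win_probability` is likewise a hypothetical name. It counts every possible red/yellow pairing and reports how often red is higher.

```python
from itertools import product


def red_win_probability(red, yellow):
    """Fraction of (red, yellow) pairings in which the red marker is higher."""
    pairs = list(product(red, yellow))
    return sum(r > y for r, y in pairs) / len(pairs)


# Hypothetical heights (assumed for illustration, not from the original figures).
top_red = [20]                     # first game: one red marker at the very top
top_yellow = list(range(11, 20))   # first game: nine yellow markers, heights 11..19
bottom_red = list(range(2, 11))    # second game: nine red markers, heights 2..10
bottom_yellow = [1]                # second game: one yellow marker at the bottom

p_top = red_win_probability(top_red, top_yellow)
p_bottom = red_win_probability(bottom_red, bottom_yellow)
p_combined = red_win_probability(top_red + bottom_red, top_yellow + bottom_yellow)

print(f"Red wins the first game with probability  {p_top:.2f}")
print(f"Red wins the second game with probability {p_bottom:.2f}")
print(f"Red wins the combined game with probability {p_combined:.2f}")
```

With these assumed heights, red wins every pairing inside each set, yet wins only 19 of the 100 pairings in the combined game. The reason is that a randomly chosen red marker most likely comes from the low set, while a randomly chosen yellow marker most likely comes from the high set.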

Obviously this is a very silly little game, and is completely unimportant. But consider a related real-world scenario. Suppose that each dot represents an employee at a major corporation. The vertical axis represents the employee's salary, with higher markers corresponding to employees with higher salaries. Red markers are for male employees, and yellow markers signify female employees. The first set might be all the managers, and the second set might be all of the support staff.

Now this trivial game has important ramifications for determining if the company treats its employees fairly. The first set shows that male managers usually earn more than female managers. The second set shows that male support staff usually earn more than female support staff. And yet overall, the company pays women more, because every manager earns more than every support worker, and most of the women are managers while most of the men are support staff. So now we have a single set of data, and yet we can use it to show a company that is blatantly biased towards men or towards women, depending solely on how we present the results. Which result is the truth? They both are. That is the nature of statistics and of Simpson's Paradox.
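The same reversal shows up if you compare average salaries instead of random pairs. The short Python sketch below uses made-up figures (in thousands) purely as an assumption for illustration; none of the numbers come from a real company, and the `average` helper is a hypothetical name.

```python
from statistics import mean

# Hypothetical salaries in thousands, keyed by (role, gender).
salaries = {
    ("manager", "male"): [110, 110],
    ("manager", "female"): [100] * 8,
    ("support", "male"): [50] * 8,
    ("support", "female"): [40, 40],
}


def average(gender, role=None):
    """Mean salary for one gender, optionally restricted to a single role."""
    values = [
        salary
        for (r, g), group in salaries.items()
        if g == gender and (role is None or r == role)
        for salary in group
    ]
    return mean(values)


for role in ("manager", "support", None):
    label = role or "overall"
    print(f"{label:8s} men: {average('male', role):6.1f}   women: {average('female', role):6.1f}")
```

In this invented data set men earn more within each role (110 against 100 for managers, 50 against 40 for support staff), but the overall averages are 62 for men and 88 for women, because most of the women are in the higher-paid role.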

So hopefully that makes the effect a little clearer, especially for those people who are scared of numbers. In an age when there are so many unreliable sources of information, and so many groups trying to skew public opinion to fit their own personal biases, it is crucial that we as members of society are aware of how data can be manipulated in perfectly legitimate ways.
